Nokia Siemens Networks LTE
Radio Access, Rel. RL10,
Operating Documentation,
Issue 02
LTE iOMS Alarms and Troubleshooting
DN0962937
Issue 01A
Approval Date 2010-10-29
Confidential
The information in this document is subject to change without notice and describes only the
product defined in the introduction of this documentation. This documentation is intended for the
use of Nokia Siemens Networks customers only for the purposes of the agreement under which
the document is submitted, and no part of it may be used, reproduced, modified or transmitted
in any form or means without the prior written permission of Nokia Siemens Networks. The
documentation has been prepared to be used by professional and properly trained personnel,
and the customer assumes full responsibility when using it. Nokia Siemens Networks welcomes
customer comments as part of the process of continuous development and improvement of the
documentation.
The information or statements given in this documentation concerning the suitability, capacity,
or performance of the mentioned hardware or software products are given "as is" and all liability
arising in connection with such hardware or software products shall be defined conclusively and
finally in a separate agreement between Nokia Siemens Networks and the customer. However,
Nokia Siemens Networks has made all reasonable efforts to ensure that the instructions
contained in the document are adequate and free of material errors and omissions. Nokia
Siemens Networks will, if deemed necessary by Nokia Siemens Networks, explain issues which
may not be covered by the document.
Nokia Siemens Networks will correct errors in this documentation as soon as possible. IN NO
EVENT WILL Nokia Siemens Networks BE LIABLE FOR ERRORS IN THIS DOCUMENTATION OR FOR
ANY DAMAGES, INCLUDING BUT NOT LIMITED TO SPECIAL, DIRECT, INDIRECT, INCIDENTAL OR
CONSEQUENTIAL OR ANY LOSSES, SUCH AS BUT NOT LIMITED TO LOSS OF PROFIT, REVENUE,
BUSINESS INTERRUPTION, BUSINESS OPPORTUNITY OR DATA, THAT MAY ARISE FROM THE USE OF
THIS DOCUMENT OR THE INFORMATION IN IT.
This documentation and the product it describes are considered protected by copyrights and
other intellectual property rights according to the applicable laws.
The wave logo is a trademark of Nokia Siemens Networks Oy. Nokia is a registered trademark
of Nokia Corporation. Siemens is a registered trademark of Siemens AG.
Other product names mentioned in this document may be trademarks of their respective
owners, and they are mentioned for identification purposes only.
Copyright © Nokia Siemens Networks 2010. All rights reserved
Important Notice on Product Safety
Elevated voltages are inevitably present at specific points in this electrical equipment.
Some of the parts may also have elevated operating temperatures.
Non-observance of these conditions and the safety instructions can result in personal
injury or in property damage.
Therefore, only trained and qualified personnel may install and maintain the system.
The system complies with the standard EN 60950 / IEC 60950. All equipment connected
has to comply with the applicable safety standards.
Table of contents
This document has 183 pages.
Summary of changes . . . 8
1 Overview of iOMS Troubleshooting . . . 9
1.1 Troubleshooting Recommendations . . . 9
1.2 Information Sources in Fault Situations . . . 11
1.3 Problem Types . . . 12
1.4 Generic Troubleshooting Procedure . . . 13
1.5 Problem Reporting . . . 16
1.5.1 Introduction to Problem Reporting . . . 16
1.5.2 Producing OMS Log Files . . . 19
2 Operating system troubleshooting in iOMS . . . 20
2.1 Operating system start-up fails . . . 20
2.2 Kernel in LTE iOMS fails . . . 22
3 Troubleshooting in iOMS installation . . . 25
3.1 Troubleshooting in LTE iOMS installation . . . 25
4 Troubleshooting in Increment Installation . . . 26
4.1 Troubleshooting in increment installation . . . 26
4.2 Not enough space on boot partition . . . 27
4.3 Commands su/su - . . . 28
4.4 Remove Increment . . . 29
4.5 Master syslog . . . 31
4.6 Tracelogs . . . 32
4.7 Core dumps . . . 33
4.8 Account locked . . . 34
4.9 RPM Database Gets Locked . . . 35
5 Licence troubleshooting . . . 37
5.1 Connection to eNB Is Not Established . . . 37
6 Software Management Troubleshooting in iOMS . . . 38
6.1 SW package download fails with "File not selected!Select TargetBD.xml from SW package and try again", "Select Target Network Elements!", or "Select TargetBD XML file!" . . . 38
6.2 SW package download fails with feedback table error status "Mediator connection error" . . . 39
6.3 SW package download fails with feedback table error status "NE: busy" . . . 40
6.4 No Feedback for Network Element . . . 41
6.5 SW package download fails with feedback table error status "Mediator timeout" . . . 42
6.6 SW package download fails with feedback table error status "Operation failed" . . . 43
7 File transfer from LTE iOMS fails . . . 44
8 Troubleshooting WebUI and LTE iOMS main page . . . 45
9 IP Connection Troubleshooting in iOMS . . . 47
9.1 IPSec is not working properly . . . 47
10 Naming service and synchronisation troubleshooting . . . 50
10.1 Naming service is not functioning properly . . . 50
10.2 LTE iOMS clock shows wrong time . . . 52
10.3 Network Element cannot get correct time from LTE iOMS . . . 54
11 Backup and restore troubleshooting in iOMS . . . 55
11.1 Backup of LTE iOMS fails . . . 55
11.2 Restoring of database in LTE iOMS fails . . . 57
11.3 Restoring of LDAP directory of LTE iOMS fails . . . 59
11.4 Restoring of system image, single file, or directory in LTE iOMS fails . . . 61
12 Log management troubleshooting . . . 63
12.1 LTE iOMS syslog is not working properly . . . 63
12.2 Gathering LTE iOMS trace logs . . . 66
13 Troubleshooting Element Manager in iOMS . . . 68
13.1 Starting the Element Manager fails . . . 68
13.2 Starting the Element Manager application fails . . . 70
13.3 Element Manager fails to connect to iOMS . . . 72
13.4 System response is delayed after Element Manager user actions . . . 73
13.5 Application Launcher troubleshooting . . . 74
13.5.1 Solutions . . . 75
13.6 Printouts and error codes in Element Manager . . . 76
13.6.1 Application is already running . . . 76
13.6.2 Connection to the Network Element could not be established. Network cable may be broken . . . 77
13.6.3 Given Network Element was unknown to the system . . . 78
13.6.4 Setting NE account to iOMS unsuccessful . . . 79
14 Troubleshooting iOMS Fault Management application . . . 80
14.1 iOMS alarm system is not responding . . . 80
14.2 Fault Management GUI is not updating the list of alarms . . . 82
15 Checking for problems in iOMS processes . . . 83
16 Gathering error information . . . 85
17 Using EnvCam script . . . 86
17.1 EnvCam script overview . . . 86
17.2 Using the EnvCam tool . . . 87
18 Replacing a faulty disk in HP BladeSystem iOMS hardware . . . 88
19 LTE iOMS alarms . . . 92
19.1 70001 CONFIGURATION OF SNMP MEDIATOR IS OUT OF ORDER . . . 92
19.2 70002 INVALID SNMP TRAP COMMUNITY STRING . . . 94
19.3 70003 NO REPLY TO SNMP REQUEST . . . 96
19.4 70004 UNKNOWN SNMP TRAP . . . 98
19.5 70005 INCORRECT ALARM DATA . . . 100
19.6 70007 AUTHENTICATION FAILURE IN ETHERNET DEVICE . . . 102
19.7 70011 NODE NOT RESPONDING . . . 104
19.8 70025 POSSIBLE SECURITY THREAT IN NETWORK ELEMENT . . . 107
19.9 70030 DISK DATABASE IS GETTING FULL . . . 108
19.10 70064 BACKUP ERROR . . . 110
19.11 70110 CONFIGURATION OF NWI3 ADAPTER IS OUT OF ORDER . . . 111
19.12 70111 FAILED TO CREATE NETACT CONNECTION . . . 114
19.13 70156 DISK DATABASE WATCHDOG START-UP FAILED . . . 117
19.14 70157 CPU USAGE OVER LIMIT . . . 119
19.15 70158 FILE SYSTEM USAGE OVER LIMIT . . . 120
19.16 70159 MANAGED OBJECT FAILED . . . 122
19.17 70160 MEMORY USAGE OVER LIMIT . . . 127
19.18 70161 OPERATING SYSTEM MONITORING FAILURE . . . 128
19.19 70162 RAID ARRAY HAS BEEN DEGRADED . . . 129
19.20 70163 ETHERNET INTERFACE USAGE OVER LIMIT . . . 130
19.21 70164 ETHERNET LINK FAILURE . . . 131
19.22 70166 MANAGED OBJECT LOCKED . . . 132
19.23 70168 CLUSTER STARTED (RESTARTED) . . . 133
19.24 70173 BACKEND DATABASE REQUIRED BY CORBA NAMING SERVICE IS UNAVAILABLE . . . 134
19.25 70186 CLUSTER OPERATION INITIATED BY OPERATOR . . . 137
19.26 70188 MANAGED OBJECT SHUTDOWN BY OPERATOR . . . 138
19.27 70189 MANAGED OBJECT UNLOCKED BY OPERATOR . . . 139
19.28 70236 LDAP DATABASE CORRUPTED . . . 140
19.29 70237 CORRUPTED LDAP DATABASE RECOVERED . . . 143
19.30 70242 ALARM LOG FILE INACCESSIBLE . . . 145
19.31 70243 ALARM PROCESSOR CONFIGURATION IS OUT OF ORDER . . . 147
19.32 70244 CORRUPTED ALARM DATA . . . 149
19.33 70245 ILLEGAL INTERNAL USAGE OF EXTERNAL ALARM NOTIFICATION FORMAT . . . 150
19.34 70246 ALARM SYSTEM HEARTBEAT . . . 152
19.35 70247 ALARM SYSTEM HEARTBEATING SWITCHED OFF . . . 154
19.36 70256 RESOURCE ALLOCATION OR DE-ALLOCATION FAILURE . . . 156
19.37 70265 RECOVERY ACTIONS BANNED FOR MANAGED OBJECT . . . 159
19.38 70267 EXTERNAL USER ACCOUNT VALIDATION FAILED . . . 161
19.39 70268 EXTERNAL LDAP FAILURE . . . 164
19.40 70269 INVALID ACTIVE SESSIONS . . . 167
19.41 70280 UNKNOWN SPECIFIC PROBLEM . . . 170
19.42 71000 PM FTP CONNECTION FAILED . . . 173
19.43 71001 MEASUREMENT DATA NOT TRANSFERRED . . . 174
19.44 71002 MEASUREMENT DATA ERROR . . . 175
19.45 71003 OMS MEASUREMENT DATA PROCESSING OVERLOAD . . . 176
19.46 71052 OMS FTP CONNECTION COULD NOT BE OPENED . . . 177
19.47 71054 O&M MEDIATION FAILURE . . . 178
19.48 71057 NWI3 NOTIFICATION MISSING . . . 179
19.49 71058 NE O&M CONNECTION FAILURE . . . 180
19.50 71101 OMS ALARM UPLOAD FROM NE FAILED . . . 181
19.51 71103 ID CONFLICT IN BTS O&M CONNECTION . . . 182
Related information . . . 183
List of tables
Table 1 Nokia Siemens Networks problem classification . . . 16
Table 2 Application is already running . . . 76
Table 3 Connection to the Network Element could not be established. Network cable may be broken . . . 77
Table 4 Given Network Element was unknown to the system . . . 78
Table 5 Setting NE Account to iOMS unsuccessful . . . 79
Table 6 Valid and default attribute values of the NWI3 adapter configuration file . . . 111
Summary of changes
Changes between document issues are cumulative. Therefore, the latest document
issue contains all changes made to previous issues.
Changes between issues 01 (2010-05-21, RL10) and 01A (2010-10-15, RL10)
New alarms
• 71002 MEASUREMENT DATA ERROR
Modified alarms
• 71058 NE O&M CONNECTION FAILURE
1 Overview of iOMS Troubleshooting
1.1 Troubleshooting Recommendations
If you have a contract with Nokia Siemens Networks for the operation and maintenance
of the network (or some other agreement), the actions you need to take in a fault situation may be different from the ones suggested in the troubleshooting instructions. If the
general principles are in conflict with the operation and maintenance contract or any
other contract, carry out actions as agreed in the contract.
The operation and maintenance personnel that carry out troubleshooting should be
familiar with the hardware and software of Nokia Siemens Networks network elements.
Electrostatic precautions
When handling plug-in units, it is important to use Electrostatic Precautions (ESP). This
means that you must be earthed to equipment racks using an approved wrist strap and
connecting lead. Approved ESP equipment makes a resistive connection to ensure the
safety of the personnel and to prevent a sudden static discharge during connection to
the earthing point.
Security procedures
Establish security procedures at your site to ensure that the personnel have appropriate staff and terminal access.
Disaster recovery plan
Establish a disaster recovery plan to help the personnel to deal with emergency situations. Remember that emergency situations can be best avoided by detecting abnormal
conditions early. A disaster recovery plan should cover various disaster scenarios and
disaster recovery procedures for personnel. The operation and maintenance personnel
should also be able to contact the persons who are capable of dealing with the problem
in question. Therefore, each site should have an escalation plan available with appropriate contact information.
Escalation plan
An escalation plan offers contact lists of internal and external support personnel and
services available to tackle problems. It should contain information on who to contact
and in what kind of situations, for example, air conditioning, power back-up system and
Nokia Siemens Networks Emergency/Help Desk numbers.
Preventive maintenance
Perform preventive maintenance routines on a regular basis. For example, carry out
regular alarm and unit state surveillance.
Fallback procedure
The fallback procedure in iOMS is automated. For more information, see Backup and
Restore in Administering and Security in LTE OMS.
Performance monitoring
The purpose of performance monitoring is to measure the overall quality of the system.
Performance monitoring can help you to detect very low-rate or intermittent problems
and possible degradation of some part of the system.
For more information, see the performance management documentation.
Documentation
Establish a procedure for keeping the documentation up-to-date and make sure that the
operation and maintenance personnel have access to all relevant external and internal
documents.
Network element diary
It is recommended that you maintain a network element diary. The diary should be
specific to one network element, but you can store it in the Operation and Maintenance Centre
if the network element is not usually manned.
Start filling in the network element diary as soon as the network element is being set
up and installed.
You are recommended to record the following events in the network element diary:
• Hardware changes
• Software and hardware updates (for example, change notes and correction deliveries)
• Essential modifications to the configuration or routing in the network element
• Safecopying
• Operational failures
• Any other relevant information
A network element diary can provide useful information on the system's performance in
the past and hints on what might cause the current problems.
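The diary events listed above can also be kept in electronic form. The following is a minimal sketch only, not a Nokia Siemens Networks tool: the names `DiaryEvent`, `DiaryEntry` and `entries_before`, the 30-day default and the sample entries are all hypothetical. It shows one way to list the entries recorded shortly before a fault, which is where hints about the cause are most likely to be found.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from enum import Enum


class DiaryEvent(Enum):
    """Event categories taken from the list of events to record in the diary."""
    HARDWARE_CHANGE = "Hardware change"
    UPDATE = "Software or hardware update"
    CONFIGURATION_CHANGE = "Configuration or routing modification"
    SAFECOPYING = "Safecopying"
    OPERATIONAL_FAILURE = "Operational failure"
    OTHER = "Other relevant information"


@dataclass
class DiaryEntry:
    day: date
    event: DiaryEvent
    description: str


def entries_before(diary, fault_day, days_back=30):
    """Return the entries from the days_back days up to and including fault_day, newest first."""
    earliest = fault_day - timedelta(days=days_back)
    recent = [entry for entry in diary if earliest <= entry.day <= fault_day]
    return sorted(recent, key=lambda entry: entry.day, reverse=True)


if __name__ == "__main__":
    # Hypothetical sample entries, for illustration only.
    diary = [
        DiaryEntry(date(2010, 9, 1), DiaryEvent.HARDWARE_CHANGE, "Disk replaced"),
        DiaryEntry(date(2010, 10, 12), DiaryEvent.UPDATE, "Correction delivery installed"),
        DiaryEntry(date(2010, 10, 14), DiaryEvent.SAFECOPYING, "Safecopy taken"),
    ]
    for entry in entries_before(diary, date(2010, 10, 15), days_back=7):
        print(entry.day, entry.event.value, "-", entry.description)
```

Sorting the newest entries first puts the most recent changes, which are the most likely suspects, at the top of the list.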
1.2 Information Sources in Fault Situations
Use at least the following information sources when carrying out troubleshooting.
Alarms
Alarms are the primary source of information in most situations where troubleshooting
is needed. Alarms are printed out on the alarm printer and/or other devices that you have
specified for the network element.
Recovery history
The recovery history contains records of the system recovery actions taken for faults in a
unit or in the entire system.
Error messages
General error messages of the system tell why the system cannot carry out a task. They
can appear in the supplementary information fields of alarms, in the printouts of the
starting phases monitored through a service terminal, and in MML and Element
Manager command outputs. You can use service terminal extension MRSTRE (in DMX
units) to open the error message name and possible instructions.
Statistical reports
Different statistical reports contain useful data, for example, on traffic on speech and signalling circuits, use of services and load and availability measurements. Monitor and
assess statistical data regularly as this data can indicate forthcoming problems before
they affect the traffic. For more information, see performance management documentation.
Logs and other relevant statistical information
Different logs (for example, computer and operating system logs and MML session
reports) contain useful data that can be attached to the fault report when you need the
help of Nokia Siemens Networks to solve a problem. Always collect the OMU logs, and also
the DSP logs if the object of the alarm is a DSP.
For more information on different logs and statistical information, see:
• Operating System Troubleshooting
• Problem Reporting
You should also check the unit state.
1.3 Problem Types
Here are some problem types you may encounter.
Reproducible problems
You can reproduce the symptoms using a set of actions. Reproducible problems can be
solved by narrowing down the possible causes of the problem to a single cause or to a
number of causes and applying corrective actions. This requires knowledge of how the
system works and tests to eliminate wrong conclusions.
Intermittent problems
You cannot reproduce the symptoms consistently using any set of actions. However, an
intermittent problem can reproduce itself randomly. In such a case, some kind of tracing
or monitoring of the system may lead you to the origin of the trouble.
If an intermittent problem occurs very seldom and it has no serious consequences, it
may be best to just ignore the problem. You can also perform general maintenance and
see if the problem disappears. If the problem occurs occasionally, try to conclude which
factors seem to affect or contribute to the appearance of the problem.
Several related or isolated troubles active at the same time
Study whether the symptoms relate to each other or not and try to isolate the problems
if possible.
1.4 Generic Troubleshooting Procedure
Description
Depending on whether you have a contract with Nokia Siemens Networks on the operation and maintenance of the network (or some other agreement), the actions that you
need to take in a fault situation may be different from the ones presented here.
When you suspect that a Nokia Siemens Networks network element is not performing
as it should, carry out at least the following checks.
Symptoms
A Nokia Siemens Networks network element is not performing as it should.
Recovery procedures
Troubleshooting process before calling Nokia Siemens Networks help desk
Steps
1 Evaluate how serious the consequences of the trouble are.
If the problem has very serious consequences, you may have to call for expert help
or apply an emergency plan immediately.
2 Analyze the situation where the problem or failure first appeared.
Consider the following before you carry out any corrective actions.
• What is the problem?
• Where is the problem?
• When did the problem occur?
• What were the circumstances that led to the problem?
• What is the impact of the problem (for example, to what extent does the fault
affect the end customers)?
• Who is responsible for taking care of the problem?
3 Try to eliminate the possibility of human error.
• For example, recent changes in the software or hardware configuration of the
system (for example, equipping changed) are possible sources of problems. The
changes may have been carried out incorrectly.
• Human error is a very common cause of problems; therefore, check and
double-check every possible problem source. For example, check the MML
commands that have been entered recently, using the IGO command:
ZIGO:<start date>,<start time>,<end date>,<end time>;
where the start date should correspond to the day preceding the problem occurrence,
and the end date should be the day when the problem occurred. For
example, if the problem situation emerged on 2008-04-28 at 00:10 and ceased
on 2008-04-28 at 03:20, enter the command:
ZIGO:2008-04-28,00-10,2008-04-28,03-20;
• A failure can also occur spontaneously (for example, the remote end system
may have problems or a service breakdown, or a plug-in unit may fail due to
ageing). Check the alarms, clear codes, unit and link states and logs as
described below.
4 Make an accurate description of the symptoms.
You may not be able to solve the problem yourself. A detailed description of the situation
where the symptoms occurred can help an expert solve the problem. Also gather
data on the failure event. For instructions on storing data in failure situations,
see Service terminal troubleshooting.
A symptom description should contain all the basic facts, such as:
• Date, name of the person who detected the trouble, phone number and e-mail
address
• Details of the system; for example, what equipment and software is in use
• Description of the symptoms (alarms, error messages, clear codes, faulty states
of the units and links and so on)
• Any other relevant information, for example, log and message monitoring files.
All data may be valuable even if they seem irrelevant at the time. Store this
information preferably in electronic format.
5 Check and analyze the alarm situation.
Check the alarms that are currently on (command AAP). You are recommended to
also study the alarm history (command AHP). Display the alarm history so that it
shows all alarm events from the time period which starts one hour before the occurrence
of the problem situation, and ends one hour after the problem situation was
over. You should display the alarm history with the MML command:
ZAHP:::<start date>,<start time>,<end date>,<end time>;
For example, if the problem situation emerged on 2008-04-27 at 00:10 and ceased
on 2008-04-27 at 03:20, enter the command:
ZAHP:::2008-04-27,00-10-00,2008-04-28,03-20-00;
The system may set an alarm and cancel it immediately. You can find these alarms
in the alarm history. Alarms behaving in this way may indicate that some part of the
system is about to break down or its functionality has been reduced.
6 If the fault can be located based on the alarm situation
Then
Carry out the appropriate maintenance actions.
• If you have ended up with more than one probable and possible cause for the
trouble, change only one thing at a time; otherwise you cannot be sure of which
change corrected the failure or problem.
• Remember that random actions can make problems worse. Generally, you
should not take any radical corrective actions if you are not sure what the
problem is and what the consequences of the corrective actions are. Losing
traffic because of incorrect actions is not what you want.
7 If you cannot locate the fault based solely on the alarm situation
Then
Try to narrow down the possible problem source.
• Analyze and categorize symptoms and list possible causes for the symptoms.
Sometimes there can be several related or isolated troubles active at the same
time. Study whether the symptoms relate to each other or not. Prioritize
symptoms and collect further facts if needed.
• Based on tests and your knowledge of the system, eliminate symptoms that are
not relevant to the trouble you are trying to solve. This way you can focus on
symptoms and causes that are more likely to produce a solution to the problem.
• Examine what works and what does not.
However, even though you may not be able to analyze what the cause of the
trouble is, you can carry out general maintenance to eliminate some trivial
causes of troubles, such as loose cables and bad connections.
8 Use measurements to trace any abnormal trends.
For more information, see performance management documentation.
9 Check the unit state and links.
10 Fill in a problem report if needed.
Describe the problem in detail in the problem report. Include all relevant information
that you have available from the problem situation and describe also the corrective
measures that you have carried out after the problem occurred.
For more information, see Introduction to Problem Reporting.
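The alarm history query in step 5 can be prepared with a small helper. The following is a minimal sketch, not part of the product: the function name `build_zahp_command` is hypothetical, and the date and time field formats (YYYY-MM-DD and HH-MM-SS) are inferred from the ZAHP example in step 5 rather than from an MML reference. The sketch applies the one-hour margin before and after the problem period that step 5 recommends.

```python
from datetime import datetime, timedelta


def build_zahp_command(problem_start, problem_end, margin=timedelta(hours=1)):
    """Return a ZAHP alarm history command covering the problem period.

    The window starts `margin` before the problem began and ends `margin`
    after it was over. The <date>,<time> layout is an assumption based on
    the example command in this document.
    """
    if problem_end < problem_start:
        raise ValueError("problem_end must not be earlier than problem_start")
    window_start = problem_start - margin
    window_end = problem_end + margin
    fmt = "%Y-%m-%d,%H-%M-%S"  # date and time separated by a comma
    return "ZAHP:::{},{};".format(
        window_start.strftime(fmt), window_end.strftime(fmt)
    )


if __name__ == "__main__":
    # Problem emerged on 2008-04-27 at 00:10 and ceased on 2008-04-27 at 03:20.
    print(build_zahp_command(datetime(2008, 4, 27, 0, 10),
                             datetime(2008, 4, 27, 3, 20)))
```

Because the margin is applied with date arithmetic, a problem that starts shortly after midnight correctly yields a start date on the previous day.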
1.5 Problem Reporting
1.5.1 Introduction to Problem Reporting
Problem reports are used to communicate problems and failures to service personnel.
Report only one fault in one problem report.
To make the investigation of a problem faster, include the following information in the
problem report:
• A title that gives a brief description of the problem
• A clear and exact description of the problem itself. At least the following information
should be provided in addition to the actual problem description:
  • Situation in the beginning, for example, the first symptoms of the failure
  • Operations you made which possibly caused the failure
  • Situation after the failure
  • Recovery actions that you made
  • Name and version of the new software modules you possibly installed
• In a multi-vendor environment, include detailed information of the other products in
the description field of the problem report.
• The release number of the network element and the version number of the software.
For software, identify the software build in use and, if possible, the versions of the
program blocks/processes that you suspect to be faulty.
• Severity of the problem as defined in Table Nokia Siemens Networks problem classification
• Log files, monitoring files, and alarm history from the units where the problem
occurred
When you send out a problem report, make sure that all the possible attachments are
included in the problem report, to avoid unnecessary information requests.
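Before sending a problem report, it can help to check it against the list above. The following is a minimal sketch under stated assumptions: the `ProblemReport` structure, its field names and the `missing_items` function are hypothetical and do not reflect any Nokia Siemens Networks reporting tool. Only the three severity classes are taken from Table 1.

```python
from dataclasses import dataclass, field
from typing import List

# Problem classes as named in Table 1 of this document.
SEVERITY_CLASSES = ("A-CRITICAL", "B-MAJOR", "C-MINOR")


@dataclass
class ProblemReport:
    """Hypothetical container for the items a problem report should include."""
    title: str = ""
    description: str = ""
    release_and_software_version: str = ""
    severity: str = ""
    # Log files, monitoring files and alarm history files
    attachments: List[str] = field(default_factory=list)


def missing_items(report: ProblemReport) -> List[str]:
    """Return the names of the items that are still missing from the report."""
    missing = []
    if not report.title.strip():
        missing.append("title")
    if not report.description.strip():
        missing.append("description")
    if not report.release_and_software_version.strip():
        missing.append("release and software version")
    if report.severity not in SEVERITY_CLASSES:
        missing.append("severity")
    if not report.attachments:
        missing.append("attachments")
    return missing


if __name__ == "__main__":
    draft = ProblemReport(title="File transfer from LTE iOMS fails", severity="B-MAJOR")
    for item in missing_items(draft):
        print("Missing:", item)
```

An empty result means that every item on the list is present; it does not judge whether the description itself is clear enough.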
Table 1 Nokia Siemens Networks problem classification
Columns: Nokia Siemens Networks problem class; definition of problem report severity as
defined in "TL 9000 Quality Management System, Measurement Handbook, Release 3"; examples.

A-CRITICAL
Definition: Critical (Emergency duty contacted) problems severely affect service,
capacity/traffic, billing, and maintenance capabilities and require immediate corrective
action, regardless of time of day or day of the week as viewed by a customer upon
discussion with the collaborator.
Only total or major outages that are not avoidable with a workaround solution.
Examples:
• System restart, all links down
• More than 50 per cent of traffic handling capacity out of use
• Subscriber related network element functionality is not working
• Network element cannot be accessed or monitored from NetAct or OMS
Nokia Siemens Networks
problem class
Definition of problem report
severity as defined in "TL 9000
Quality Management System,
Measurement Handbook,
Release 3"
Examples
B-MAJOR
Major problems cause conditions
that seriously affect system operation, maintenance, and administration and require immediate
attention as viewed by the
customer upon discussion with
the collaborator. The urgency is
less than in critical situations
because of a lesser immediate or
impending effect on system performance, customers, and the
customer’s operation and
revenue.
•
•
•
The fault affects traffic
randomly or problem leads only
to degradation of network performance or the fault makes it
difficult for the customer to
operate the network element.
•
•
•
•
•
•
•
•
C-MINOR
Minor fault not affecting operation or service quality
Other problems that a customer
does not view as critical or major
are considered minor.
Minor problems do not significantly impair the functioning of
the system and do not significantly affect service to customers.
These problems are tolerable
during system use.
•
•
•
•
Single restart of computer units
Problems with back-up
Configuration changes (network,
HW, and SW) are not working
Problems seriously affecting end
user service, but avoidable with a
workaround solution
Capacity/quality related functionality is not working
Performance measurement or
alarm management is not working
Activation of a new feature fails
Subscriber related functions are
not working completely
Alarm management of objects
(BTS, functional units) is not
working completely
Major errors in documentation, for
example, an alarm or description
is missing from documentation
Vital documents are missing from
the documentation library
Failures not seriously affecting
traffic
Errors in MML syntax
Cosmetic errors in MML/Statistic
output
Minor errors in documentation
Engineering complaints are classified as minor unless otherwise
negotiated between the customer
and the supplier.
Table 1
Nokia Siemens Networks problem classification (Cont.)
Hardware failure
When you suspect that a failure or problem is caused by the hardware of the Adapter,
but you cannot locate the fault, you can report the fault using the problem reporting practice. You can fill in the hardware failure report, attach it to the hardware and send it to
your local technical support.
Set the case type of the problem report to value Hardware.
System failure or problem
The most common use of the problem report is reporting a system defect or problems
related to the software or data configuration of the Adapter.
Set the case type of the failure report to value Software.
Documentation failure
If you notice deficiencies in the documentation of the network element, raise a problem
report. Identify where in documentation the failure occurs and which documentation set
or library the document belongs to, and, of course, the actual failure.
Set the case type of the failure report to value Documentation.
Improvement proposal
When there is no actual fault but you want to suggest an improvement to the way the
network element functions, you can do this using problem reporting. New feature proposals should be sent directly to Product Marketing.
Set the case type of the problem report to value General.
Change in source data
If the source data (for example some system files) has been changed without using an
official SW Change Note (for example due to an urgent problem), send a problem report
on the changes to Nokia Siemens Networks.
Nokia Siemens Networks needs the information on changed system files when creating
the next software build. If this information is not available, some functionalities may fail
in the next software upgrade at the site.
Set the case type of the problem report to value Source Data.
Other
If you have a failure or problem that cannot be categorized to any of the above, set the
case type of the failure report to value Other and describe the problem in detail.
1.5.2 Producing OMS Log Files
Purpose
When you send a problem report to the service personnel, you should always attach all
the related log files to the report in order to make the process of investigating the
problem faster.
For more information about creating iOMS log files, see Tracelogs.
2 Operating system troubleshooting in iOMS
2.1 Operating system start-up fails
Description
Failures in the OS start-up should be rare. Usually failures in this phase are caused by
configuration errors or a hardware fault. For more information on the OS start-up, see
Operating system start-up and shutdown in LTE iOMS.
Symptoms
The possible symptoms for an OS start-up failure are the following:
• single node restart problems
• excessive restarts in the system
• alarm 70011 NODE NOT RESPONDING
For instructions on how to recover from a situation when the alarm 70011 NODE
NOT RESPONDING occurs, or when a node is in a reset loop but there are no
alarms, see alarm description 70011 NODE NOT RESPONDING.
Recovery procedures
Recovering from an OS start-up failure
Check that:
You have an external monitor and keyboard connected to the node that is in the reset loop, so that you can investigate the phase printouts.
Steps
1 If the instructions in 70011 NODE NOT RESPONDING do not help, or if there are symptoms other than an alarm,
Then
Continue with the following instructions.
2 Collect a verbatim copy of the printouts on the external monitor about the fault situation.
3 Check in which sub-phase the failure occurred.
If it is not clear in which start-up sub-phase the failure occurred, check the sub-phase from the phase printouts on the monitor. You should have an external monitor and keyboard connected to the node that is in the reset loop to be able to investigate the printouts. The start-up sub-phases are the following:
• BIOS
  • Node is restarted before the version of GRUB is printed on the monitor
• Boot loader (GRUB)
  • OS is not started
  • OS is started, but login prompt never appears on the monitor.
4 Check the version information.
Check the versions of the BIOS (BIOS identification string), boot loader, and OS of the node in a restart loop.
a) Check the version of the BIOS from the printouts on the external monitor.
b) Check the version of the boot loader from the printouts on the external monitor.
c) Check the version of the OS with the following command:
uname -a
5 Contact your Nokia Siemens Networks representative.
Provide the Nokia Siemens Networks representative with all the information gathered in the previous steps (version information, kernel printouts, and printouts on the external monitor related to the failure in question).
When sending files to your Nokia Siemens Networks representative, ensure that any large files are compressed before sending them.
2.2 Kernel in LTE iOMS fails
Description
If there are fatal errors inside the kernel, a kernel panic occurs. It is an unrecoverable
Linux kernel crash and if it occurs, the kernel fails. The system stops running and must
be restarted.
In a lockup situation, the kernel is not processing any data. The reason for a lockup can
be a hardware lockup or a software lockup with interrupts enabled or interrupts disabled.
In a software lockup with interrupts set to enabled, the kernel is still partly working and
you can get some information from it. Normally, the HW watchdog resets the node in a
lockup situation.
Symptoms
When a kernel crashes, it normally creates an Oops message which is a kernel information dump that contains details of the system failure, such as the contents of CPU registers and the location of page descriptor tables. The Oops message is triggered by an
exception in the system and is a dump of the CPU state and kernel stack at that instant.
It is written to a syslog file in /var/log/syslog and it is also sent to the system
console at the time of the crash. For example:
Oct 4 15:02:51 SLOT1 kernel: Unable to handle kernel paging request at virtual address 76656433
Oct 4 15:02:51 SLOT1 kernel: printing eip:
Oct 4 15:02:51 SLOT1 kernel: c024a0c8
Oct 4 15:02:51 SLOT1 kernel: *pde = 00000000
Oct 4 15:02:51 SLOT1 kernel: Oops: 0000
Oct 4 15:02:51 SLOT1 kernel: CPU: 1
Oct 4 15:02:51 SLOT1 kernel: EIP: 0010:[md_update_sb+728/896] Not tainted
Oct 4 15:02:51 SLOT1 kernel: EFLAGS: 00010292
Oct 4 15:02:51 SLOT1 kernel: eax: f6523a94 ebx: f5a04000 ecx: 00000006 edx: f648e000
Oct 4 15:02:51 SLOT1 kernel: esi: 7665642f edi: 00000006 ebp: f6523a80 esp: f648ff64
Oct 4 15:02:51 SLOT1 kernel: ds: 0018 es: 0018 ss: 0018
Oct 4 15:02:51 SLOT1 kernel: Process raid1d (pid: 47, stackpage=f648f000)
Oct 4 15:02:51 SLOT1 kernel: Stack: f648e000 00000001 f6644700 ffffe000 f6523a94 00000064 00000001 c023f2f1
Oct 4 15:02:51 SLOT1 kernel: f6523a80 f648e000 00000001 f6644700 ffffe000 00000000 f687cb80 f648e000
Oct 4 15:02:51 SLOT1 kernel: 0000002f 00000246 f6523a80 c024db95 f649e800 00000100 f67ade3c f6644700
Oct 4 15:02:51 SLOT1 kernel: Call Trace: [raid1d+29/936] [md_thread+361/476] [kernel_thread+43/96] [md_thread+0/476]
Oct 4 15:02:51 SLOT1 kernel:
Oct 4 15:02:51 SLOT1 kernel: Code: 39 46 04 0f 85 5f fe ff ff 83 7c 24 18 00 0f 84 89 00 00 00
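As a minimal sketch, the key kernel entries of such a message can be pulled out of the syslog file with standard tools. The function name kernel_oops_lines is hypothetical and not part of iOMS; the default path is the one named above, and the match patterns are taken from the example message, so other kernel versions may word these lines differently.

```shell
# Hypothetical helper: print the syslog lines that mark a kernel crash.
# The patterns come from the example message above and are an assumption
# about how the running kernel words its Oops output.
kernel_oops_lines() {
    logfile="${1:-/var/log/syslog}"
    grep -E 'kernel: (Oops:|Unable to handle kernel|Call Trace:)' "$logfile"
}
```

For example, kernel_oops_lines /var/log/syslog prints only the matching lines, which gives the time stamps to look around when collecting the full entries.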
When a kernel panic has occurred, the kernel tries to create an lcore file after the kernel
restart. Use the command crash to examine the content of the lcore file in the
/var/crash directory.
The lcore files show where the error is, for example:
crash boot/cpi1/System.map boot/cpi1/vmlinux lcore.cr.0
crash 3.3-20
<..>
WARNING: cannot access vmalloc'd module memory
SYSTEM MAP: boot/cpi1/System.map
DEBUG KERNEL: boot/cpi1/vmlinux
DUMPFILE: lcore.cr.0
CPUS: 4
DATE: Sat Jan 9 03:28:43 2006
UPTIME: 00:02:22
LOAD AVERAGE: 0.08, 0.06, 0.01
TASKS: 70
NODENAME: TA-0
RELEASE: 2.6.9-22.ELsmp
VERSION: #1 SMP Mon Sep 19 18:32:14 EDT 2005
MACHINE: i686 (1599 Mhz)
MEMORY: 2 GB
PANIC: "WDMana reset - reason: no input from HAS Starter"
PID: 218
COMMAND: "WDMana"
TASK: f6220000
CPU: 3
STATE: TASK_RUNNING (PANIC)
crash> bt
PID: 218 TASK: f6220000 CPU: 3 COMMAND: "WDMana"
The PANIC field shows the reason for the kernel panic.
Recovery procedures
Recovering from a kernel panic situation
Steps
1 Collect all the available information about the error.
Collect the following information:
• syslog entries around the time of the crash
• possible /var/crash/lcore.* files
• console outputs, if available
2 Contact your Nokia Siemens Networks representative.
Provide your Nokia Siemens Networks representative with all the information you
have collected in Step 1.
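The collection in Step 1 can be sketched as a small shell function that packs the syslog file and any lcore files into one compressed archive. The function name collect_crash_info and the default output path are hypothetical; the sketch includes the whole syslog file rather than only the entries around the crash, and it assumes tar with gzip support is available on the node.

```shell
# Hypothetical helper: bundle crash evidence into one compressed archive.
# $1 = syslog file, $2 = directory holding lcore.* files, $3 = output archive.
collect_crash_info() {
    syslog="${1:-/var/log/syslog}"
    crashdir="${2:-/var/crash}"
    out="${3:-/tmp/crash-info.tar.gz}"
    # Start the file list with the syslog, then add lcore.* files if any exist.
    set -- "$syslog"
    for f in "$crashdir"/lcore.*; do
        [ -e "$f" ] && set -- "$@" "$f"
    done
    tar -czf "$out" "$@"
}
```

Console outputs are not covered by the sketch and still have to be copied by hand.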
Recovering from a failure where the kernel is not processing data
Steps
1
If the reset does not happen automatically,
Then
Try to reset the node manually.
2
If the problem still exists,
Then
Contact your Nokia Siemens Networks representative.
3 Troubleshooting in iOMS installation
3.1 Troubleshooting in LTE iOMS installation
Description
The iOMS installation starts by running the following script: InstallGOMS_USB.sh.
If the partition table configuration changes between installations (current versus upcoming), the
installation is likely to fail in partition creation and cause the following error message: No
such file /var/mnt/local/backup/SS_Backup...
If this happens, run the InstallGOMS_USB.sh script again.
Installation log
The installation log can be found on the USB stick in /logs/Install-<timestamp>.log
For more details see Installing iOMS software from USB stick in Installing and Commissioning LTE iOMS Using USB Stick.
4 Troubleshooting in Increment Installation
4.1 Troubleshooting in increment installation
Note: If you have problems with increment installation, you can find the sw-management
log file in the following folder:
/var/mnt/local/sysimg/opt/Nokia/var/swmgmt/
☞ Remember to use the tab key after commands to avoid misspellings and to get alternatives.
Example:
# [root@CLA-0(OMS-1) /root]
# current (press the tab key) -> gives the alternatives currentdelivery and currentset
4.2 Not enough space on boot partition
Description
The number of sets is limited by the size of the boot partition. The set size can vary, but
the system can usually handle 6-10 sets. Use the listsets command to view the number of sets.
Symptoms
When the set limit is reached, creating a new set fails with the following error message:
ERROR Not enough space on boot partition /dev/cciss/c0d0p1 (free: 16288)
Note: If makeset fails, the software set is not created.
Recovery procedures
When you encounter this error, do not try to clear /dev/sda1 manually. Instead, remove
unneeded sets by executing the fsswcli --set --remove command. If the error was generated by a single correction installation, uninstall the correction. If it is not removed due to
rollback, reinstall it and then proceed normally. If the error was generated by installing
multiple corrections, a new set can be generated normally right after removing the older
sets.
Example:
# mkdir /root/mount
# mount /dev/cciss/c0d0p1 /root/mount
# df -h /dev/cciss/c0d0p1
# umount /root/mount
# rmdir /root/mount
Example of the output:
# df -h /dev/cciss/c0d0p1
Filesystem         Size  Used Avail Use% Mounted on
/dev/cciss/c0d0p1   99M   46M   49M  49% /var/mnt/local/localimg/root/mount
Creating a new set requires at least 25 MB of free space.
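The free-space requirement can be checked before creating a set with a short shell function. This is only a sketch: the function name has_room_for_set is hypothetical, and it assumes the boot partition is already mounted at the path passed in, as in the example above.

```shell
# Hypothetical helper: succeed if the file system holding $1 has at least
# $2 MB free (default 25, the minimum stated above for creating a new set).
has_room_for_set() {
    mountpoint="$1"
    min_mb="${2:-25}"
    # df -P prints one POSIX-format line per file system; -k reports 1024-byte blocks.
    avail_kb=$(df -Pk "$mountpoint" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $((min_mb * 1024)) ]
}
```

For example, has_room_for_set /root/mount || echo "remove unneeded sets first" prints the reminder only when less than 25 MB is available.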
4.3 Commands su/su -
Description
Root rights have to be used while installing packages or increments. Command su - changes rights and user interface to root.
Command su gives root rights, but does not change the interface. If you are logged in to
LTE iOMS as Nemuadmin and then execute the command su, you
get the root user rights but the interface is still Nemuadmin.
The difference between the commands might matter in correction installation.
Symptoms
The correction installation fails.
Recovery procedures
If the correction installation fails, you can repair the situation as follows:
Steps
1 Log in to LTE iOMS.
2 Change user rights and interface to root with the command su - and give the root password.
3 Uninstall the corrupted correction installation with the command uninstallincrement.
4 Install the correction again.
4.4 Remove Increment
Purpose
Removal is possible if you have installed increments after commissioning. Increments
that were included in the commissioning are integrated into the software set and
therefore cannot be removed. To remove increments, use the following commands:
Steps
1 Check the current set with the command fsswcli --set --current, using the root interface (su -).
Example:
fsswcli --set --current
CLA-0
R_GOMS5_1.25.release_oms.corr10
2 Activate the previous set with the command fsswcli --set --activate <previous set delivery name>.
☞ You can check the previous set name by using the fsswcli --set --list command.
Example:
fsswcli --set --activate R_GOMS5_1.25.debug_oms.corr10
Note: The fsswcli --set --activate command reboots OMS.
3 Check that the current set is the previous set.
This check is necessary to verify that the correct set is active, because you cannot uninstall
an active delivery or set.
4 Remove the old correction and set.
a) The latest incremental delivery can be uninstalled with the fsswcli --delivery
--uninstall <delivery name> command.
Example:
fsswcli --delivery --uninstall R_GOMS5_1.25.debug_oms.corr10
b) The latest incremental set can be uninstalled with the fsswcli --set --remove
<delivery name> command.
Example:
fsswcli --set --remove R_GOMS5_1.25.debug_oms.corr10
Further information
Note: Several incremental deliveries can be uninstalled at once with the fsswcli --delivery --uninstall -N <count> command.
Example:
There are 10 incremental deliveries (R_GOMS5_1.25.debug_oms.corr1... R_GOMS5_1.25.debug_oms.corr10) installed into one software set
R_GOMS4_1.24.1.5.debug_oms.corr10.
To downgrade four levels to level 6, enter the following commands:
fsswcli --delivery --uninstall -N 4
fsswcli --set --make R_GOMS5_1.25.debug_oms.corr6
fsswcli --ldap --upgrade R_GOMS5_1.25.debug_oms.corr6
fsswcli --set --activate R_GOMS5_1.25.debug_oms.corr6
fsswcli --set --remove R_GOMS5_1.25.debug_oms.corr10
4.5 Master syslog
Master-syslog is a log file that holds the most important data about system
events, such as alarms, errors, and user operations.
To monitor the syslog, use the following command: tail -f /var/log/mastersyslog. You can also view the full log with the less command or filter the output with the grep
command.
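As an illustration of filtering, the sketch below prints only the entries of one severity level. The function name syslog_by_level is hypothetical, and the field layout is an assumption inferred from the log samples printed elsewhere in this document (for example "Mar 24 09:13:06 info CLA-0 SS_IPM[<number>]: ..."), where the level is the fourth whitespace-separated field.

```shell
# Hypothetical helper: print master syslog entries of one severity level.
# Assumes the entry layout seen in this document's log samples:
#   <month> <day> <time> <level> <node> <process>[<pid>]: <message>
syslog_by_level() {
    level="$1"
    logfile="${2:-/var/log/mastersyslog}"
    awk -v lvl="$level" '$4 == lvl' "$logfile"
}
```

For example, syslog_by_level warning prints the warning entries of the default log file; a plain grep on a process name or alarm number works the same way when the level is not the interesting part.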
4.6 Tracelogs
Tracelogs are debug logs which are produced by server processes running on LTE
iOMS. They are useful for developers when troubleshooting LTE iOMS issues.
Note: Tracelogs are off by default.
To enable or disable LTE iOMS logging, use the following procedure:
1. Open Parameter Tool from Application Launcher.
2. Change the value of dn: omsParameterId=dwFlags, omsFragmentId=Any,
omsFragmentId=TraceConfig, omsFragmentId=System,
fsFragmentId=OMS, fsClusterId=ClusterRoot from 0 to 3 (to disable, change it
from 3 to 0).
☞ You can find the LTE iOMS trace logs in the folder /var/log/oms.
4.7 Core dumps
When a program terminates abnormally, it makes a core dump. A core dump is a memory
image of the process. It shows what the process was doing before the crash.
Core dumps are located in the /var/crash/ folder.
Because the core dump file (*.core) is large (approximately 500 MB), the system generates
a *.core.tar.gz file every full hour. The archive includes the backtrace logs, the syslog of the core,
and all iOMS trace files, and it is much more compact (approximately 1 MB). At the
same time, the *.core file is deleted.
You can send the core dump files to the designers. Without the backtrace, it
is almost impossible to trace the error from the core dump.
Example: Core dump file.
Core dump archive meahandlerserve-20827.core.tar.gz
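Before sending the archives, it can be useful to see which ones exist and what each contains. The sketch below does this with tar; the function name list_core_archives is hypothetical, and the default directory is the one named above.

```shell
# Hypothetical helper: list the compressed core dump archives in a directory
# (default /var/crash) and the files packed into each one.
list_core_archives() {
    crashdir="${1:-/var/crash}"
    for archive in "$crashdir"/*.core.tar.gz; do
        # Skip the unexpanded pattern when no archive exists.
        [ -e "$archive" ] || continue
        echo "== $archive"
        tar -tzf "$archive"
    done
}
```

Calling list_core_archives without arguments walks /var/crash; an empty result means that no archive has been generated yet.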
4.8 Account locked
Description
An account can be locked if you try to log in with a wrong password several times.
Note: The EM interface gives exactly the same message whether the password is
wrong or the account is locked:
Invalid username or password.
A locked account unlocks after an hour of inactivity, or it can be unlocked instantly.
Symptoms
If you try to log in with a locked account, you get the following error:
Your account is locked. Maximum amount of failed attempts was reached.
Example:
login as: Nemuadmin
Using keyboard-interactive authentication.
Password:
Your account is locked. Maximum amount of failed attempts was reached.
Access denied
Recovery procedures
If your account has been locked and you need to unlock it instantly:
Steps
1 Log in by using another account (such as the root, Nemuadmin, or _nokfsoperator account).
2 Change user rights to root by executing: su -
3 Execute: pam_tally2 -f /var/log/tallylog -u <locked account> -r
Example:
pam_tally2 -f /var/log/tallylog -u _nokfsoperator -r
4.9 RPM Database Gets Locked
Description
In some situations, the RPM database gets blocked. This problem occurs when one
or more processes working on the RPM database are killed or crash. A killed process leaves
lock state information behind, which blocks access to the database for other
processes.
Symptoms
There are three known problems that are caused by the RPM database lock state:
• The RPM query operation gets stuck.
  Command rpm -qa gets stuck.
• The command for listing the current delivery gets stuck.
  Command currentdelivery returns the following error:
  ERROR: RPM database corrupted
• Command installincrement fails.
  Example:
  installincrement -N R_OMS1_3.38.release_oms.corr25-inc-1.rpm
  rpmdb: Lock table is out of available locker entries
  error: db4 error(22) from db-close: Invalid argument
  error: cannot open Packages index using db3 - Cannot allocate memory (12)
  error: cannot open Packages database in /var/mnt/local/sysimg/flexiserver/var/lib/rpm
  /opt/Nokia_BP/sbin/installincrement: Cannot read the installed deliveries from the RPM database
Recovery procedures
Steps
1 Check the RPM database files.
Execute the following command:
ll /var/lib/rpm/__*
Example of the output:
-rw-r--r-- 1 root root   16384 Apr 3 09:59 /var/lib/rpm/__db.001
-rw-r--r-- 1 root root 1318912 Apr 3 09:59 /var/lib/rpm/__db.002
-rw-r--r-- 1 root root  450560 Apr 3 09:59 /var/lib/rpm/__db.003
2 Remove the RPM database files listed in the previous step (__db.*).
Execute the following command:
rm /var/lib/rpm/__db.*
RPM automatically creates new __db.* files the next time the database is accessed.
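The two steps can be sketched as one shell function that first lists the __db.* files and then removes them. The function name clear_rpm_db_files is hypothetical. The sketch assumes that no rpm or installincrement process is still working on the database when it is run, and it uses ls -l in place of the ll alias.

```shell
# Hypothetical helper: list and then remove the __db.* files of an RPM database
# directory (default /var/lib/rpm). Other database files are left untouched.
clear_rpm_db_files() {
    rpmdir="${1:-/var/lib/rpm}"
    # Nothing to do when no __db.* file exists.
    ls -l "$rpmdir"/__db.* 2>/dev/null || { echo "no __db.* files in $rpmdir"; return 0; }
    rm -f "$rpmdir"/__db.*
}
```

After the function has run, a query such as rpm -qa can be used to confirm that the database responds again.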
5 Licence troubleshooting
5.1 Connection to eNB Is Not Established
Description
In order to establish connection to the Network Elements, the LTE technology licence
installation is required. If the licence was not installed from USB during installation
phase, you can install it manually using the following procedure. For more information
on installing the licence during the commissioning, see Copying the iOMS Software to
USB Stick.
Symptoms
• There is no connection to evolved node B.
Recovery procedures
Installing the LTE technology licence to LTE iOMS.
Steps
1 Transfer the licence file to iOMS folder:
/home/Nemuadmin/
2 Log in to the iOMS as root user.
3 Import the licence using the following command:
# lmcli importLicence /home/Nemuadmin/F5200012.XML
4 Activate the licence executing command:
# lmcli activateFeature 1815
5 Check if activation was successful using the following command:
# lmcli listAllLicences
Expected outcome:
Total number of licences: 1
LICENCE FILENAME  LICENCE CODE  STATUS
F5200012.XML      TESTRL000097  VALID
Note: If the activation was not successful, try to delete the licence by executing:
# lmcli deleteLicence /home/Nemuadmin/F5200012.XML
Then execute the above-mentioned steps once again.
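The check in Step 5 can be sketched as a filter over the command output. The function name licence_is_valid is hypothetical, and the parsing assumes the three-column layout shown in the expected outcome above (file name first, status last); the real lmcli output may differ.

```shell
# Hypothetical helper: read "lmcli listAllLicences" output on stdin and succeed
# if the line for licence file $1 ends with the status VALID.
# Assumes the column layout of the expected outcome shown above.
licence_is_valid() {
    awk -v name="$1" '$1 == name && $NF == "VALID" { found = 1 } END { exit !found }'
}
```

For example, lmcli listAllLicences | licence_is_valid F5200012.XML returns success only when the licence is listed as VALID.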
6 Software Management Troubleshooting in iOMS
6.1 SW package download fails with "File not selected!Select TargetBD.xml from SW package and try again", "Select Target Network Elements!", or "Select TargetBD XML file!"
Description
SW download operation fails and SW management GUI shows the statuses mentioned
above.
Symptoms
The SW download operation is aborted.
Recovery procedures
Read the operation manuals and follow the guidance.
6.2 SW package download fails with feedback table error status "Mediator connection error"
Description
Operation for one or more network elements fails. There is no connection to a network
element. LTE iOMS tries to send a request to the network element for 3 minutes but is
still not able to reach it.
Symptoms
The SW download for network element fails.
Recovery procedures
Steps
1 Check if alarm 71058 NE O&M CONNECTION FAILURE is active in LTE iOMS
or check the connection status on the topology UI.
Alarm 71058 indicates that there are some problems with the NE O&M connection
between OMS and the Network Element.
2 Check the connection between iOMS and SGSN.
6.3 SW package download fails with feedback table error status "NE: busy"
Description
Operation for one or more network elements fails. An SW management
operation is already ongoing on the network element. LTE iOMS requests the network element
for 3 minutes to start the SW download, but the network element is still handling the other SW management
operation.
Symptoms
SW download for network element fails.
Recovery procedures
Steps
1 Check the SW management operation status on the network element.
2 Wait until the network element has completed the operation, and then try the operation again.
6.4 No Feedback for Network Element
Description
Feedback for the SW download operations seems to be missing.
Symptoms
No feedback from a network element.
Recovery procedures
Note: The SW packages can be quite large and the transfer of the SW package files might
take time.
Steps
1 Check the size of the SW package.
2 Check the throughput of TCP/IP between LTE iOMS and the network element.
3 If the throughput is low,
Then
Wait until the network element has transferred the files.
File transfer can last tens of minutes.
4 Check the NE response to SW download.
Sometimes the NE fails to handle the SW download. Retrying may take so long that
the feedback seems to be missing.
6.5 SW package download fails with feedback table error status "Mediator timeout"
Description
Operation for one or more network elements fails. The network element has not sent
an acknowledgement message to LTE iOMS before a time-out.
Symptoms
SW download for network element fails.
Recovery procedures
Steps
1 Check if alarm 71058 NE O&M CONNECTION FAILURE is active in LTE iOMS
or check the connection status on the topology UI.
Alarm 71058 indicates that there are some problems with the NE O&M connection
between iOMS and the network element.
2 Check the status on the network element.
6.6 SW package download fails with feedback table error status "Operation failed"
Description
Operation for one or more network elements fails. The network element has failed to make
the SW download.
Symptoms
The SW download for network element fails.
Recovery procedures
Steps
1 Check the original failure reason from the network element.
2 Check SW package files in the LTE iOMS.
If the original failure reason was a file transfer problem,
Then
Check whether there are SW package files on the LTE iOMS disk in the path
/var/opt/OMSftproot/SWPackages and follow the instructions described
in File transfer from LTE iOMS fails.
7 File transfer from LTE iOMS fails
Description
File transfer from LTE iOMS fails (LTE iOMS acts as a server).
Symptoms
Some operation fails because of the file transfer error.
Recovery procedures
Steps
1 Check FTP configurations.
Check in Parameter Tool of Application Launcher that the IP address, FTP user name,
and FTP password are correctly configured in the following parameters:
ClusterRoot/OMS/System/Network/CUAccess/OMS/szIPAddress
ClusterRoot/OMS/System/Network/CUAccess/OMS/szFTPUserName
ClusterRoot/OMS/System/Network/CUAccess/OMS/szFTPPassword
Correct the settings by using the zmodifyOMSSettings command.
Check that the FTP username works by connecting to the LTE iOMS FTP server
from some external client.
☞ The password may be encrypted in the LTE iOMS LDAP directory. If the
password is encrypted, it has the prefix “crypt:”. If you suspect that the password
is not correct, you can replace it with a cleartext password (without the prefix
“crypt:”, for example, SYSTEM) and check whether the SW download file
transfer starts to work. After correcting the password, encrypt it.
2 Check the status of the file transfer in the /var/log/auth.log file.
Example:
If you find an entry such as the following, the username or password might be incorrect:
Nov 3 14:46:24 warning CLA-0 vsftpd[25906]: Mon Nov 3 14:46:24 2008 [pid 25906] [omsFtpUser] FAIL LOGIN: Client ...
Example:
If you find an entry such as the following, the file to be downloaded does not exist or the permission is not sufficient to read the file:
Nov 3 14:55:24 info CLA-0 vsftpd[32066]: Mon Nov 3 14:55:24 2008 [pid 32066] [omsFtpUser] FAIL DOWNLOAD: Client "...", "/test.txt", 0.00Kbyte/sec
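The search for such entries can be sketched with grep. The function name ftp_failures is hypothetical; the pattern covers only the two failure types quoted in the examples above, and the default path is the log file named in this step.

```shell
# Hypothetical helper: print vsftpd FAIL LOGIN and FAIL DOWNLOAD entries
# from an authentication log (default /var/log/auth.log).
ftp_failures() {
    logfile="${1:-/var/log/auth.log}"
    grep -E 'vsftpd\[[0-9]+\]:.*FAIL (LOGIN|DOWNLOAD)' "$logfile"
}
```

Running ftp_failures right after a failed SW download shows whether the failure was a login problem or a missing or unreadable file.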
3 If you are familiar with the file transfer protocols, you can take a capture of
the file transfer and analyze the capture to discover the reason for the failure.
8 Troubleshooting WebUI and LTE iOMS main page
Description
One of the following Web actions fails:
• loading LTE iOMS Main page
• WebUI login
• loading WebUI page
This is caused by the Java runtime running out of the heap memory reserved for the Tomcat
web server.
If WebUI does not respond or responds very slowly and you are using Firefox 2.0 then
Firefox might have consumed huge amounts of memory (especially when WebUI
session has been open overnight) and your client computer is running out of memory.
Symptoms
• HTTP error 503: Service Temporarily Unavailable is displayed
• HTTP Status 500 is displayed with root cause: java.lang.OutOfMemoryError: PermGen space
• WebUI page does not open or just empty page is displayed
• WebUI does not respond or it is very slow
Recovery procedures
Restarting web browser
Steps
1 Log out from WebUI.
If WebUI does not respond, move on to the next step.
2 Close the web browser.
If the web browser does not respond, use an operating system specific way to end the web
browser process.
3 Start the web browser.
You should now be able to load the LTE iOMS main page, log in to WebUI, and load
WebUI pages. If there are still problems, try Restarting TomcatPlat.
Restarting TomcatPlat
Steps
1 Use an SSH client to log in to LTE iOMS.
2 Change user rights to root with the command su - and give the root password.
3 Restart TomcatPlat with the following command:
fshascli -r /TomcatPlat
Note: Restarting ends all open WebUI sessions, and the LTE iOMS Main page is
unavailable during the restart.
4 Confirm the restart.
TomcatPlat is restarted. Wait a minute or two for the TomcatPlat restart. Web services are
then available again.
IP Connection Troubleshooting in iOMS
9 IP Connection Troubleshooting in iOMS
9.1 IPSec is not working properly
Description
If there are problems in the IPSec configuration, the ipm process will not start and all
traffic to and from the node stops. The problem is usually in the IPSec service or in the
policy file:
• The IPSec service has not started.
• The syntax in the IPSec policy file is incorrect. In this case, the system does not load
the new policy.
• The IPSec configuration is incorrect. In this case, the system loads the new policy,
but does not pass any traffic through.
Memory reservation for the IPSec service can also cause problems. If a large amount
of memory is reserved for IPSec, it is possible that the amount of addressable virtual
memory in the kernel is not sufficient.
Symptoms
When the IPSec service is not running, no external incoming or outgoing traffic is
passed.
If the system does not load the new IPSec policy, the system log (/var/log/master-syslog) contains the following types of entries:
Mar 24 09:13:06 info CLA-0 SS_IPM[<number>]: Reloading policy \
from /etc/ipsec-policy.xml
Mar 24 09:13:06 info CLA-0 SS_IPM[<number>]: \
/var/mnt/local/sysimg/flexiserver/sets/<build>/opt/Nokia/etc/IPSec/ipsec-policy.xml:34: \
Reference to an unknown ID "VPX-1"
Mar 24 09:13:06 info CLA-0 SS_IPM[<number>]: Policy rules \
loading failed
If the system loads the new IPSec policy files, but does not pass traffic through, the
IPSec configuration is incorrect.
If the system log contains an error message instead of an IPSec memory reservation
success message, there is a problem with memory reservation.
Recovery procedures
Checking IPSec service
Steps
1
Log in to the LTE iOMS as the root user, using an external monitor and keyboard.
Note that this must be done locally.
2
Check if the IPSec recovery group is active.
Enter:
fshascli -s /IPSec
Expected outcome
The status of the IPSec recovery group is active. Continue with checking the
IPSec configuration.
Unexpected outcome
The status of the IPSec recovery group is not active. For more information, check
the system log (/var/log/master-syslog) for entries with the SS_IPM keyword.
Checking IPSec configuration
For instructions on how to check the current IPSec configuration, see Checking IPSec
service in Administering and Security in LTE iOMS.
Correcting IPSec policy file
Steps
1
Open the IPSec policy file in a text editor.
Open the following file for editing:
/var/mnt/local/sysimg/flexiserver/sets/<build>/opt/Nokia/etc/IPSec/ipsec-policy.xml
☞ You can check the correct IPSec policy files and their location by entering the
following command:
fshascli -v /CLA-0/FSIPSec*
2
Locate the error and correct it.
The error message in the system log (/var/log/master-syslog) contains the
line number and a description for identifying and correcting the problem.
3
Save the modified IPSec policy file.
4
Reload the new IPSec policy.
Enter the following command:
ipmop -reload
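The step of locating the error from the system log can be sketched in Python as follows. This is only an illustration, not a tool delivered with LTE iOMS: the function name find_policy_errors and the regular expression are assumptions, and the sketch assumes that each SS_IPM entry occupies a single line in the actual log file (the printed example wraps long entries).

```python
import re

# Assumed entry shape, based on the printed example:
#   ... SS_IPM[<number>]: <policy file path>:<line number>: <description>
POLICY_ERROR = re.compile(
    r"SS_IPM\[\d+\]:\s*(?P<path>/\S+?):(?P<line>\d+):\s*(?P<message>.+)"
)


def find_policy_errors(log_lines):
    """Return (policy file, line number, description) for each policy error entry."""
    errors = []
    for entry in log_lines:
        match = POLICY_ERROR.search(entry)
        if match:
            errors.append(
                (match["path"], int(match["line"]), match["message"].strip())
            )
    return errors
```

The returned line number points to the place in the policy file to inspect in step 2.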
Handling memory reservation problems
There is a maximum limit for the amount of physical memory that can be reserved for
IPSec. This limit is determined by the amount of addressable virtual memory that is
available in the kernel.
The amount of addressable virtual memory should be at least 118M more than the size of the
physical memory reserved for IPSec. For example, if the system has 2048M of physical
memory of which 512M is reserved for IPSec, then at least 630M of virtual memory is
needed.
Currently the maximum amount of available virtual memory, specified by the kernel
parameter VMALLOC_RESERVE, is 704M. Therefore the maximum amount of physical
memory that can be reserved for IPSec is 586M (704 - 118).
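The arithmetic above can be written out as a short Python sketch. The constants are the values stated in the text (118M and 704M); the function names are hypothetical and serve only to make the calculation explicit.

```python
# Values taken from the text above; all sizes are in megabytes.
EXTRA_VIRTUAL_MB = 118     # virtual memory needed on top of the IPSec reservation
VMALLOC_RESERVE_MB = 704   # current maximum of addressable kernel virtual memory


def required_virtual_memory(ipsec_reserved_mb):
    """Virtual memory needed for a given IPSec physical memory reservation."""
    return ipsec_reserved_mb + EXTRA_VIRTUAL_MB


def max_ipsec_reservation(vmalloc_reserve_mb=VMALLOC_RESERVE_MB):
    """Largest IPSec reservation that still fits into the virtual memory limit."""
    return vmalloc_reserve_mb - EXTRA_VIRTUAL_MB


def reservation_fits(ipsec_reserved_mb, vmalloc_reserve_mb=VMALLOC_RESERVE_MB):
    """True if the reservation can be addressed within the virtual memory limit."""
    return required_virtual_memory(ipsec_reserved_mb) <= vmalloc_reserve_mb
```

With these values, a 512M reservation needs 630M of virtual memory and fits, whereas any reservation above 586M does not.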
This limitation results from the way the Linux kernel divides the virtual address space
between the user space and the kernel. There are methods (patches to the kernel) for
overcoming this limitation. They will be addressed in the later releases of the platform.
10 Naming service and synchronisation troubleshooting
10.1 Naming service is not functioning properly
Description
Many CORBA applications are directly dependent on the naming service functionality.
If the naming service is not available the applications do not start up correctly.
The naming service uses the MySQL database for object reference persistency.
Problems in the MySQL database may therefore prevent the naming service from functioning properly.
Symptoms
The alarm 70173 BACKEND DATABASE REQUIRED BY CORBA NAMING SERVICE
IS UNAVAILABLE is raised.
If no alarm is raised but a service fails to start and the service log or the syslog contains
errors or warnings indicating that the service is not able to bind or resolve object references in the naming service, the naming service is not available or the service uses the
wrong naming service address.
Recovery procedures
Using ns_listall tool to check the naming service availability
Purpose
Use the ns_listall tool to check the contents and state of the naming service.
Steps
1
Log in as Nemuadmin.
Enter the su command to take super user privileges.
2
Invoke the ns_listall tool.
To check the content of the naming service, enter the ns_listall command.
For more ns_listall options, enter the ns_listall -? command.
Expected outcome
The ns_listall tool prints out the whole naming graph of the node, consisting of the
private as well as the public naming graph.
If the word empty appears in the printout, the naming service recovery group
is running but the naming service is empty.
Example:
Typical output for the ns_listall command is:
Unexpected outcome
If the ns_listall tool shows an error message, freezes, or fails to print
anything, then either the naming service or the backend database (MySQL database) is
not running. In either case, the HAS normally restarts the failed component
automatically.
Example:
Recovery normally proceeds in one of the following ways:
1. The database fails and is automatically restarted by the HAS. In this case, the
naming service detects the problem automatically and re-establishes the database
connection within 10 seconds.
2. The naming service itself fails and is automatically restarted by the HAS. This should
take only a few seconds.
If the recovery actions described above do not occur, check the syslog for any indications of why the naming service did not automatically recover its database connection. Typically this happens if the MySQL database does not restart after the fault.
In that case, contact your Nokia Siemens Networks representative with the outputs from
the ns_listall tool for solving the problem.
10.2 LTE iOMS clock shows wrong time
Description
The LTE iOMS clock shows the wrong time. Possible reasons for the failure are:
• The time zone in LTE iOMS is wrong
• The NTP service on LTE iOMS is not running or it has a configuration problem
• The NTP server is not running, the NTP server gives the wrong time, or the network
connection to the NTP server is not available
Symptoms
LTE iOMS clock shows wrong time.
Recovery procedures
Checking why LTE iOMS clock shows wrong time
Steps
1
Check date and time settings.
Enter the following command: date.
a) Check that the printout matches your local date and time.
b) Modify time settings.
If the time settings are not correct, modify them with the following command:
date --set=STRING
c) Give the new time in the following format: “YYYYMMDD hh:mm” (for example
“20100421 01:22”)
2
Check the status of ClusterNTP with the fshascli tool.
Example printout for a running NTP service:
fshascli -s /ClusterNTP
/ClusterNTP:
administrative(UNLOCKED)
operational(ENABLED)
usage(ACTIVE)
procedural()
availability()
unknown(FALSE)
alarm()
Verify with the ntpq command that the LTE iOMS NTP service can connect to the NTP
server:
# ntpq
ntpq> pe
     remote           refid      st t when poll reach   delay   offset  jitter
*10.8.122.67     LOCAL(0)         6 u  747 1024  377    0.326    0.075   0.028
 LOCAL(0)        .INIT.           0 l   46   64  377    0.000    0.000   0.002
ntpq> quit
Use the pe command to query the NTP peers and the quit command to exit the ntpq tool.
If ntpq cannot connect to the NTP server, check that the NTP server is up and running and that the connection to the NTP server is working (UDP port 123). Also verify the LTE iOMS NTP settings.
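When reading the peer printout, note that the reach column is an octal value of an 8-bit shift register that records the results of the last eight polls, so 377 means that all eight were answered. The following Python sketch shows one way to interpret a peer line. The function names are hypothetical, and only the tally characters *, +, - and # are handled in this simplified example.

```python
def successful_polls(reach):
    """Count how many of the last eight polls were answered (reach is octal)."""
    register = int(reach, 8) & 0xFF
    return bin(register).count("1")


def parse_peer_line(line):
    """Split one peer line of the ntpq 'pe' printout into the fields of interest."""
    # The first character is a tally code; '*' marks the peer selected for sync.
    tally = line[0] if line[0] in "*+-#" else ""
    fields = line.lstrip("*+-# ").split()
    # Columns: remote refid st t when poll reach delay offset jitter
    return {
        "selected": tally == "*",
        "remote": fields[0],
        "polls_answered": successful_polls(fields[6]),
        "offset_ms": float(fields[8]),
    }
```

A peer that is marked as selected and has answered all of its recent polls indicates that the connection to the NTP server is working.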
10.3 Network Element cannot get correct time from LTE iOMS
Description
The network element has the wrong time after a time correction from LTE iOMS. Possible
reasons for the failure are:
• The time in LTE iOMS is wrong
• The network connection from LTE iOMS to the network element is not available
• The time settings in the network element are not correct
Symptoms
Network element cannot get correct time from LTE iOMS.
Recovery procedures
Checking why network element cannot get correct time from LTE iOMS
Steps
1
Check that LTE iOMS has correct time and ClusterNTP time service is
running.
See LTE iOMS clock shows wrong time.
2
Check that the network connection from network element to LTE iOMS is
working normally.
3
Check date and time settings from network element.
Enter the following command: date.
a) Check that the printout matches your local date and time.
b) Modify time settings.
If the time settings are not correct, modify them with the following command:
date --set=STRING
c) Give the new time in the following format: “YYYYMMDD hh:mm” (for example
“20100421 01:22”)
11 Backup and restore troubleshooting in iOMS
11.1 Backup of LTE iOMS fails
Description
The backup process may fail if, for example:
• the user does not have the required privileges to execute the backup.
• the system is not fully functional.
• there is not enough free disk space available for the backup archive.
The database backup process may also fail if, for example:
• there is an internal database system error.
• there is a system timeout while making a backup of a database.
Symptoms
When the backup process fails,
• the system creates the alarm 70064 BACKUP ERROR.
• the backup process is either completed or interrupted. Even though the backup of
one backup item fails, the system continues making backups of the other backup
items and creates a backup archive in the subdirectory
/var/mnt/local/backup/SS_Backup
• the system displays an error message on the local output device and creates an
entry in the syslog file.
Recovery procedures
Refer to the error message shown on the local output device and check the syslog file
in /var/log to determine the cause for the error situation. Check the following points
to solve the problem.
Recovering from a backup failure
Steps
1
Check the alarm and the log files.
a) Check the alarm with the appropriate alarm management tool.
The alarm includes the name of the backup log.
b) Search for backup-related entries in the syslog.
Enter the following command:
grep -i backup /var/log/syslog
c) Search the backup log for strings ERROR or WARNING and check the status
from the end of the log.
2
In case of permission denied error.
If the error message on the local output device is permission denied
Then
check that you have the required root privileges.
3
Check that there is enough free disk space for the backup.
• Check the amount of available disk space.
Enter:
df -h /var/mnt/local/backup
• If necessary, free disk space by Transferring the Backup Archive Files from
iOMS to an External Storage Server and deleting unnecessary backup files.
4
In case of internal database system error.
If there is an internal database system error
Then
proceed as follows:
• check that the databases are up and running
• refer to the database-specific documentation.
5
In case of DBBackup: timeout error.
If the error message in the syslog is DBBackup: timeout while making a
backup for database: <name of database>
Then
contact your local Nokia Siemens Networks representative.
6
In case the above instructions solved the problem.
If the above instructions solved the problem,
Then
refer to the instructions in Making a Full Software Backup in iOMS, Making a
Partial Software Backup in iOMS or Making a Custom Software Backup in
iOMS.
Else
contact your local Nokia Siemens Networks representative.
11.2 Restoring of database in LTE iOMS fails
Description
Restoring of database may fail if, for example:
• the user does not have the required privileges to execute the restore command.
• there is not enough free disk space available for the backup files.
Note that the restore operation takes more disk space than a backup archive file,
because the backup archives are unzipped during the restore operation.
• the backup archive to be restored is faulty.
• there is an internal database error.
Symptoms
When restoring a database fails,
• the restore process is interrupted
• the system displays an error message on the output device and creates an entry in
the syslog file.
Recovery procedures
Refer to the error message shown on the local output device and check the syslog file
in /var/log directory to determine the cause for the error situation. Check the following
points to solve the problem.
Recovering from a failure to restore database
Steps
1
Check the log files.
a) Search for restore-related entries in the syslog.
Enter the following command:
grep -i restore /var/log/syslog
b) Search the backup log for text strings ERROR or WARNING and check the
status from the end of the log.
2
Execute the fsrestore command again with the options --debug and --verbose and examine the output.
The options --debug and --verbose print to the screen the information that is
also written to the log.
3
In case of permission denied error.
If the error message on the local output device is permission denied
Then
check that you have root privileges.
4
In case of DBRestore: No such database error.
If the error message in the syslog is DBRestore: No such database
Then
check that you have entered the restore command correctly.
5
Check that there is enough free disk space available.
You can check the amount of available disk space using the command
df -h /var/mnt/local/backup
If necessary, free disk space with the instructions in Transferring Backup Archive
Files from iOMS to an External Storage Server and delete unnecessary backup files.
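Because the backup archives are unzipped during the restore operation, the free space must exceed the size of the archive itself. The following Python sketch shows one way to make that comparison. The function names and the default expansion_factor of 3.0 are assumptions made for illustration only; the actual ratio depends on the archive contents and is not stated in this document.

```python
import shutil


def free_bytes_at(path):
    """Free space, in bytes, on the file system that holds the given path."""
    return shutil.disk_usage(path).free


def restore_fits(free_bytes, archive_bytes, expansion_factor=3.0):
    """True if the free space covers the archive after unzipping.

    expansion_factor is a hypothetical safety margin, not a documented value.
    """
    return free_bytes >= archive_bytes * expansion_factor
```

For example, restore_fits(free_bytes_at("/var/mnt/local/backup"), archive_size) compares the backup partition against an archive whose size in bytes is held in archive_size.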
6
In case the above instructions solved the problem.
If the above instructions solved the problem,
Then
refer to the instructions in Restoring Databases in iOMS
Else
contact your local Nokia Siemens Networks representative.
11.3 Restoring of LDAP directory of LTE iOMS fails
Description
Restoring of LDAP directory may fail if, for example:
• the user does not have the required privileges to execute the restore command.
• there is not enough free disk space available for the backup files.
Note that the restore operation takes more disk space than a backup archive file,
because the backup archives are unzipped during the restore operation.
• the backup archive to be restored is faulty.
• the system is unable to write into destination files or replace existing files.
Symptoms
When restoring the LDAP directory fails,
• the restore process is interrupted
• the system displays an error message on the output device and creates an entry in
the syslog file.
Recovery procedures
Refer to the error message shown on the local output device and check the syslog file
in /var/log to determine the cause for the error situation. Check the following points
to solve the problem.
Recovering from a failure to restore LDAP directory
Steps
1
Check the log files.
a) Search for restore-related entries in the syslog.
Enter the following command:
grep -i restore /var/log/syslog
b) Search the backup log for text strings ERROR or WARNING and check the
status from the end of the log.
2
Execute the fsrestore command again with the options --debug and --verbose and examine the output.
The options --debug and --verbose display on the screen the information that is
also written to the log.
3
Check that you have root privileges.
4
In case of invalid directory error.
If the error message is invalid directory given as parameter
Then
check that the source and destination directories exist.
5
Check that there is enough free disk space available.
You can check the amount of available disk space using the command
df -h /var/mnt/local/backup
If necessary, free disk space by Transferring Backup Archive Files from iOMS to an
External Storage Server and deleting unnecessary backup files.
6
Check that the LDAP server is up and running.
Because the LTE iOMS LDAP server runs under the recovery group (RG) /Directory, check
that this recovery group is enabled and active. Enter the following command:
fshascli -s /Directory
An example printout of the fshascli -s /Directory command:
Directory:
administrative(UNLOCKED)
operational(ENABLED)
usage(ACTIVE)
procedural()
availability()
unknown(FALSE)
alarm()
Note that when you are restoring the whole system, you can skip this step as the
LDAP server is not up and running.
Note that LDAP is a critical process of LTE iOMS and if it is not up and running LTE
iOMS will reboot.
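If the status check is scripted, the printout format shown above can be parsed as in the following Python sketch. The function names are hypothetical, and treating UNLOCKED, ENABLED and ACTIVE as the healthy combination is an assumption based on the example printout in this document.

```python
import re

# Matches printout lines such as "operational(ENABLED)" or "alarm()".
STATE_LINE = re.compile(r"^\s*(\w+)\((.*)\)\s*$")


def parse_status(printout):
    """Turn the 'fshascli -s' printout into a dictionary of state values."""
    states = {}
    for line in printout.splitlines():
        match = STATE_LINE.match(line)
        if match:
            states[match.group(1)] = match.group(2)
    return states


def is_enabled_and_active(states):
    """Assumed healthy combination, taken from the example printout."""
    return (
        states.get("administrative") == "UNLOCKED"
        and states.get("operational") == "ENABLED"
        and states.get("usage") == "ACTIVE"
    )
```

The header line with the recovery group name does not match the pattern and is skipped.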
7
In case the above instructions solved the problem.
If the above instructions solved the problem
Then
refer to the instructions in Restoring LDAP Directory in iOMS.
Else
contact your local Nokia Siemens Networks representative.
11.4 Restoring of system image, single file, or directory in LTE iOMS fails
Description
Restoring of the system image, a single file, or a directory may fail if, for example:
• the user does not have the required privileges to execute the restore command.
• there is not enough free disk space available for restoring the backup.
Note that the restore operation takes more disk space than a backup archive file,
because the backup archives are unzipped during the restore operation.
• the backup archive to be restored is faulty.
Symptoms
When restoring the system image, a single file, or a directory fails,
• the restore process is interrupted
• the system displays an error message on the output device and creates an entry in
the syslog file.
Recovery procedures
Refer to the error message shown on the local output device and check the syslog file
in /var/log to determine the cause for the error situation. Check the following points
to solve the problem.
Recovering from a failure to restore system image, a file, or a directory
Steps
1
Check the log files.
a) Search for restore-related entries in the syslog.
Enter the following command:
grep -i restore /var/log/syslog
b) Search the backup log for text strings ERROR or WARNING and check the
status from the end of the log.
2
Execute the fsrestore command again with the options --debug and --verbose and examine the output.
The options --debug and --verbose print to the screen the information that is
also written to the log.
3
In case of no such file or directory error.
If the error message states that there is no such file or directory
Then
check that the backup archive file is available in the directory
/var/mnt/local/backup/SS_Backup.
4
In case of permission denied error.
If the error message is permission denied
Then
check that you have root privileges.
5
In case of disk full error.
If the error message states that the disk is full
Then
check that there is enough disk space available.
You can check the amount of available disk space using the commands
df -h /var/mnt/local/backup
df -h /var/mnt/local/sysimg
If necessary, free disk space by Transferring Backup Archive Files from iOMS to an
External Storage Server and deleting unnecessary backup files.
6
In case the above instructions solved the problem.
If the above instructions solved the problem,
Then
refer to the instructions in Restoring a Single File or Directory in iOMS.
Else
contact your local Nokia Siemens Networks representative.
12 Log management troubleshooting
12.1 LTE iOMS syslog is not working properly
Description
If the syslog in LTE iOMS is not working properly, it may be caused by one of the following reasons:
• There is not enough disk space in the LTE iOMS file system.
• The logging processes are not running.
• The syslog is not listening to the correct TCP port.
• The syslog configuration is incorrect.
Symptoms
• No entries are recorded in the LTE iOMS syslog.
Recovery procedures
Restarting syslog-ng
Steps
1
Send a SIGHUP to syslog-ng.
As a first measure, try restarting the syslog-ng daemon using SIGHUP. Enter the
following command:
killall -HUP syslog-ng
Checking the disk space
Purpose
Check that there is enough disk space in the LTE iOMS file systems where the syslog
files are saved.
Steps
1
Log in to the LTE iOMS remotely as the _nokfsoperator user with the password
assigned during the installation phase, and then change to the root user
by entering the su - command.
2
Display free disk space.
To display the free disk space in the partitions, enter the following command:
df -h
3
Check the disk space on the relevant partitions.
a) To check the disk space for the proxy syslog file, find the free disk space for the
local image partition:
/var/mnt/local/localimg
b) To check the disk space for the master syslog file, find the free disk space for
the log partition:
/var/mnt/local/log
If the partition cannot be found, find the free disk space for the system image
partition:
/var/mnt/local/sysimg
4
Delete unnecessary files and restart syslog-ng daemons.
If the disk space is insufficient
Then
Delete unnecessary files and restart syslog-ng daemons.
a) Delete all unnecessary files from the partitions that do not have disk space left.
b) To restart syslog-ng daemons, enter the following command:
killall -HUP syslog-ng
Checking that the processes are running
Steps
1
Display the running syslog processes.
Enter the following command:
ps ax | grep syslog
In the LTE iOMS node, there should be two active syslog processes: syslog
master and syslog proxy.
Example:
The ps command printout for LTE iOMS node with two active syslog processes:
1613 ?  Ss  4:51 /opt/Nokia_BP/SS_BPUtils/bin/syslog-ng -p \
/var/run/syslog-ng.pid -f /etc/syslog-ng.conf
2079 ?  Ss  4:49 /opt/Nokia_BP/SS_BPUtils/bin/syslog-ng -F -p \
/var/run/master-syslog-ng.pid -f \
/var/mnt/local/sysimg/flexiserver/opt/Nokia_BP/etc/syslog-ng.conf
2
Restart the process.
If the syslog master process is not active
Then
Restart the process.
Enter the following command:
fshascli -r /<nodename>/FSDirectoryServer/MasterSyslogDaemon
where <nodename> is the name of the LTE iOMS node.
3
Restart the syslog proxy.
If the syslog proxy is missing
Then
Restart the syslog proxy.
To restart the syslog-ng proxy, enter the following command:
service syslog-ng restart
If the syslog-ng process does not restart and it is safe to restart the LTE iOMS
node, restart the node by entering the following command:
fshascli -r <nodename>
Checking that syslog is listening to the right TCP port
Steps
1
Check that the syslog-ng processes are listening to port 601.
Enter the following command:
netstat -ltpn |grep syslog-ng
The printout should contain a line stating that the syslog-ng is listening to port 601.
Example:
tcp  0  172.16.25.252:601  0.0.0.0:*  LISTEN  syslog-ng
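The check for port 601 can also be scripted. The following Python sketch scans netstat output for a listening syslog-ng socket on a given port. The function name is hypothetical, and the sketch only assumes that each socket is printed on one line containing the local address, the LISTEN state and the program name.

```python
def listens_on_port(netstat_output, port=601, program="syslog-ng"):
    """True if some line shows the program listening on the given TCP port."""
    suffix = ":{}".format(port)
    for line in netstat_output.splitlines():
        fields = line.split()
        if "LISTEN" not in fields:
            continue
        if not any(program in field for field in fields):
            continue
        # The local address field ends with ":<port>", for example 172.16.25.252:601.
        if any(field.endswith(suffix) for field in fields):
            return True
    return False
```

The foreign address 0.0.0.0:* ends with an asterisk, so it is never mistaken for the listening port.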
Checking that the configurations are correct
Steps
1
View the syslog configuration file under /var/log/ and check that the configuration is correct.
Check that the syslog configuration is correct, and make the corrections if required.
For reference, see the backup copy of your configuration file or other known good
example of a configuration file.
12.2 Gathering LTE iOMS trace logs
Description
If something is not working as it should in LTE iOMS, it usually leaves entries in
the syslog and in the LTE iOMS trace log.
Symptoms
Some feature or function is not working as it should. Many unexpected error entries
appear in the event log.
Recovery procedures
Retrieving sufficient debug data from error case
Steps
1
Try to document the operations that caused the problem.
2
Retrieve the event log (and the trace files if tracing was already turned on).
Enable the trace log.
To enable or disable LTE iOMS logging:
1. Open Parameter Tool from Application Launcher.
2. Change the value of dn: omsParameterId=dwFlags, omsFragmentId=Any,
omsFragmentId=TraceConfig, omsFragmentId=System,
fsFragmentId=OMS, fsClusterId=ClusterRoot from 0 to 3 (to disable, change it
from 3 to 0).
3
Cause the problem again.
4
Turn off the trace logging with Parameter Tool.
Change value dn: omsParameterId=dwFlags, omsFragmentId=Any,
omsFragmentId=TraceConfig, omsFragmentId=System,
fsFragmentId=OMS, fsClusterId=ClusterRoot from 3 to 0.
5
Send the saved data for analysis with a proper description of the error case.
The syslog is located in /var/log and the LTE iOMS traces in /var/log/oms. Keep in
mind that the more information you give, the less time is needed to start analyzing
and correcting the problem.
Note that you can also enable and disable OMS logging with the ztracecli command.
This command may be used when Parameter Tool is not available.
1
Use the following command to check the logging status.
ztracecli -p
Expected outcome:
# ztracecli -p
trace: off (0)
max trace log filesize (kB): 10000
[root@CLA-0(GOMS-1000) /root]
2
Turn on logging with the following command:
ztracecli --on
Expected outcome:
Logging has been turned on
3
Turn off logging with the following command:
ztracecli --off
Expected outcome:
Logging has been turned off.
13 Troubleshooting Element Manager in iOMS
13.1 Starting the Element Manager fails
Description
The failure to start Element Manager (EM) may be caused by one of the following
reasons:
• Some parts of Application Launcher (AL) are not properly loaded from the network
element (NE) or they are corrupted.
• A connection to the NE cannot be established.
• You have the wrong version of AL.
Symptoms
AL does not start for a certain NE. An error message such as the following may be displayed:
• Cannot start AL.
• Unexpected error occurred.
• Connection to the Network Element could not be established.
• Given Network Element was unknown to the system.
• Login (authentication) failure due to password expiration.
Recovery procedures
Checking the connection to the NE
If you receive an error message indicating that the connection to the NE could not be
established, or that the system does not know the given NE, check that:
• you have not mistyped the IP address of the NE; the valid address is the dedicated
IP address (IPv4) of the /HTTPDPlat recovery group.
• the NE that you are trying to connect to exists in the network
• the NE is connected to the network
• the network where the NE is located is accessible from the network where your
workstation is located
• the network cable between the workstation and the NE is not broken.
If the NE address is correct, ask your network administrator if you have access to the
network where the NE is located.
Removing the cached parts of AL
If you receive an unspecified error message, the reason why starting EM fails may be
that some parts of AL are not properly loaded from the NE, or they are corrupted. To
correct the situation, you can reload the cached parts of AL from the NE. Reloading is
done automatically if you remove the cached parts of AL. If that does not help, you can
uninstall AL and then install it again.
For instructions, see Removing cached EM applications.
Uninstalling and installing AL
If removing and reloading the cached parts of AL does not help, you can uninstall AL
and then install it again.
13.2 Starting the Element Manager application fails
Description
An EM application does not start from AL. This may be caused by one of the following
reasons:
• The application files have been corrupted.
• Loading the application has failed, for example, because the network connection
has been broken while the application was being loaded from the NE.
• The application is already running or closing itself.
Symptoms
• EM application does not start when started for the first time.
• EM application does not start although it has been started previously when connected to the same NE.
• An error message Application is already running is displayed.
Recovery procedures
Checking if the application is already running
If you receive the error message Application is already running, you may
have started an application that is already running. Most of the applications allow you to
have only one instance running at a time.
End the previous session and try to start the application again. If you are sure that you
have closed the application, but still receive the error message, the application is still
closing itself. If the application has no windows visible and you still get the message after
several minutes of waiting, close and restart AL, then start the EM application again.
Removing cached EM applications
Purpose
Delete the EM applications from the cache and let AL download the application again
during the next startup.
If the connection to the NE is fast enough (over 2 Mbit/s), removing cached EM applications does not cause a problem, because downloading the EM applications does not
take long.
Steps
1
Locate the cached EM applications.
In Windows XP, the cached EM applications are located in the directory
C:\Documents and Settings\<USER>\Local Settings\Application Data\Nokia Siemens Networks\Application Launcher Client\cache\<BUILD>
In Windows Vista or Windows 7, the cached data is stored in
C:\Users\<USER>\AppData\Local\Nokia Siemens Networks\Application Launcher Client\cache\<BUILD>
In Linux, it is stored in the directory
~/.Nokia/ApplicationLauncher/cache/<BUILD>
where ~/ is the user's home directory.
2
Delete the cached EM applications.
Delete the entire cache, including all directories under it.
In Windows XP, delete the directory
C:\Documents and Settings\<USER>\Local Settings\Application Data\Nokia Siemens Networks\Application Launcher Client\cache\<BUILD>
In Windows Vista or Windows 7, delete the directory
C:\Users\<USER>\AppData\Local\Nokia Siemens Networks\Application Launcher Client\cache\<BUILD>
In Linux, delete the directory
~/.Nokia/ApplicationLauncher/cache/<BUILD>, where ~/ is the user's home directory.
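On a Linux workstation, the deletion can be scripted. The following is a minimal sketch rather than a supplied tool: the function names al_cache_dir and remove_al_cache are hypothetical, and the build name has to be supplied by the user. The helper refuses an empty build name so that it cannot remove the whole cache tree by accident.

```shell
# Sketch only: al_cache_dir and remove_al_cache are hypothetical helper names.

# Print the Linux cache directory for one Application Launcher build.
al_cache_dir() {
    build=$1
    [ -n "$build" ] || { echo "build name required" >&2; return 1; }
    printf '%s/.Nokia/ApplicationLauncher/cache/%s\n' "$HOME" "$build"
}

# Delete the cached EM applications of one build, if the directory exists.
remove_al_cache() {
    dir=$(al_cache_dir "$1") || return 1
    if [ -d "$dir" ]; then
        rm -rf -- "$dir"
        echo "removed $dir"
    else
        echo "nothing to remove at $dir"
    fi
}

# Usage: remove_al_cache <BUILD>
```

AL downloads the EM applications again during the next startup, as described in Purpose.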
13.3
Element Manager fails to connect to iOMS
Description
Element Manager is not able to obtain the initial reference. The reason for failure may
be that:
• iOMS services are not up and running.
• A connection to the HTTPS server cannot be established.
The Element Manager uses HTTPS and CORBA to connect to iOMS. HTTPS is used
for obtaining the initial CORBA reference and CORBA for subsequent communication.
Symptoms
The Element Manager fails to connect to iOMS.
Recovery procedures
Checking why the Element Manager fails to connect to iOMS
Steps
1
Ping iOMS on network level.
2
Check iOMS services status with command zstatus -d.
3
Try to connect to the HTTPS server.
For example, go to the Element Manager homepage.
4
Check iOMS master-syslog.
You can view error logs while trying to open the Element Manager session. To do so,
monitor the iOMS master-syslog with the tail -f /var/log/master-syslog
command.
13.4
System response is delayed after Element Manager user
actions
Description
Delayed responses to user actions on AL and EM applications may be caused by one
of the following reasons:
• The connection to the NE is broken.
• The network is heavily loaded.
• The EM application is not fully loaded.
Symptoms
System responses to user actions on AL and EM applications are severely delayed.
System responses after high availability services (HAS) recovery actions for
/HTTPDPlat recovery group are delayed.
Recovery procedures
Closing the EM application and AL
If the connection is not broken but the network is heavily loaded, the response may just
take a long time. If the system does not return to its normal condition, close the EM application and AL and restart them.
Removing cached EM applications
If the system is slow in loading the application from the NE, the application may not be
fully loaded. This is the case if, in the next start attempt, the application cannot be started
and you get a clear error message. To correct this, delete the application from the cache
and let the AL download the application from the beginning during the next start up.
If the connection to the NE is fast enough (over 2 Mbit/s), removing cached EM applications does not cause a problem, because downloading the EM applications does not
take long.
For instructions, see Starting the Element Manager application fails.
13.5
Application Launcher troubleshooting
Although the Application Launcher and Element Manager applications have
been carefully tested, situations can occur that could not be anticipated.
Because the details of the situations that a user might meet
are unknown, the instructions presented here are general in nature.
Application does not start from Application Launcher
If the Element Manager application does not start, but it has been previously started successfully when connected to the same iOMS, the application files might be corrupted.
You can delete the application from the cache and let Application Launcher load the
application during the next start.
If you try to start the application for the first time but it cannot be started, the loading of
the application may have failed in a way that could not have been detected. This kind of
situation may occur when the network connection breaks while the application is loaded
from the OMS. You can delete the application from the cache and let Application
Launcher try to fully load the application during the next start.
Application Launcher does not start from a certain iOMS
If Application Launcher seems to load data from OMS during the login to the iOMS, the
iOMS is among the supported elements. If you get an error message that does not
specify the problem concretely, for example Can't start Application
Launcher or Unexpected error occured, it is possible that some parts of Application Launcher were not properly loaded from the iOMS or that they are corrupted.
To solve the problem, you can:
1. Reload cached parts of Application Launcher from the iOMS. The reloading is done
automatically if you remove the cached parts of Application Launcher.
If this does not help, do the following:
2. Uninstall Application Launcher and then install it again.
System hangs after a user action
This kind of situation may occur if the connection to the OMS breaks while a user action
is being performed by the application. An error message cannot be shown to the user if
the underlying operating system does not notify the application about errors in the connection.
If the connection is not broken but the network is heavily loaded, the action may just take
a long time.
If the system does not return to its normal condition, you must close Application
Launcher, and start it and the application again. If the hanging occurred while loading
the application from the iOMS, the application may not be fully loaded. This is the case
if on the next start attempt the application cannot be started (you get a clear error message). To solve the problem, you can delete the application from the cache and let Application Launcher try to fully load the application during the next start.
NWI3Adapter restart operation causes an error Management service could
not be initialised
When NWI3Adapter is restarted using Application Launcher, Application Launcher also
needs to be restarted. If NWI3Adapter restart request is sent, an error message
Management service could not be initialised appears after up to 10
minutes. Before this error occurs, you cannot open any new Element Manager windows
(for example new Fault Management application). You can restart Application Launcher
manually in order to open new applications without waiting for an error to occur.
13.5.1
Solutions
Removing cached parts of Application Launcher
A fast connection to OMS (over 2 Mb/s) is preferred, as downloading the cached EM
applications may take some time with slower connections.
To locate the cached EM applications, you do not need to remember where you
installed AL. You can find the cache location in AL's System Info dialog, which is
accessible from the About Application Launcher dialog. The default AL cache
location is C:\Documents and Settings\<USER>\Application Data\Nokia\
ApplicationLauncher\cache\ in Windows and
~/.Nokia/ApplicationLauncher/cache/ in Linux. Under these directories you
can find subdirectories containing different builds of AL. If you do not know which one to
delete, you can delete all of them. Note that in this case the EM applications that
AL has previously cached from the iOMS are also deleted. Those applications are
loaded again when you start them from AL.
The AL build information can be found in the About Application Launcher dialog. Build
information is given under the version information.
The cache location variable (-cachebaselocation) is defined in the AL start script.
The -Loadall parameter loads all files from the OMS server, ignoring the cache.
Removing cached Element Manager applications
A fast connection to iOMS (over 2 Mb/s) is preferred, as downloading the cached
Element Manager applications may take some time with slower connections.
You cannot identify and remove a single application; you can only remove all applications
that have the same version (build). To locate the cache folder of the Application Launcher, see Removing
cached parts of Application Launcher.
By default, applications are cached under C:\Documents and Settings\<USER>\
Application Data\Nokia\ApplicationLauncher\cache\<BUILD>\
<APPLICATION>\<APPLICATION.jar> (in Windows) and
~/.Nokia/ApplicationLauncher/cache/<BUILD>/<APPLICATION>/
<APPLICATION.jar> (in Linux).
To remove a cached EM application, delete this directory.
13.6
Printouts and error codes in Element Manager
13.6.1
Application is already running
The possible reasons and recovery actions for this error are given in the table below.
Cause
You are trying to start an iOMS application that is defined as a single-instance application, and an instance of that application is already running.
Recovery actions
Close the previously started application and try to start the application again.
If you are sure that you have closed the application but you still get the error message,
the application may still be closing. If the application has no visible window and you still
get this error message after several minutes of waiting, close Application Launcher and
start it again. After that you will be able to start the
application.
Table 2
Application is already running
For more information see Application Launcher troubleshooting.
13.6.2
Connection to the Network Element could not be established.
Network cable may be broken
The possible reasons and recovery actions for this error are given in the table below.
Cause
Application Launcher is trying to connect to the iOMS but the connection cannot be
created. The possible reasons are:
• User has misspelled the iOMS IP address.
• User has entered an iOMS IP address (for example, 123.12.12.123) that does not exist in the network.
• The iOMS at the entered IP address is not connected to the network.
• The network where the iOMS is located is not accessible from the network where the user's workstation is located.
• A network cable is broken somewhere between the workstation and the iOMS.
Recovery actions
Check the IP address. If it is correctly entered, check the connection to the iOMS (you
may ask your network administrator to help you with this).
Table 3
Connection to the Network Element could not be established. Network
cable may be broken.
13.6.3
Given Network Element was unknown to the system
The possible reasons and recovery actions for this error are given in the table below.
Cause
The entered iOMS name is not valid. Reasons for that can be:
• User has misspelled the iOMS name.
• User has entered an iOMS name that does not exist in the network.
• The iOMS at the entered IP address is not connected to the network.
• The network where the iOMS is located is not accessible from the network where the user's workstation is located.
Recovery actions
Check the iOMS name. If it is correctly entered, ask your network administrator
whether you have access to the network where the iOMS is located.
Table 4
Given Network Element was unknown to the system.
13.6.4
Setting NE account to iOMS unsuccessful
The possible reasons and recovery actions for this error are given in the table below.
Description
The request to set the NE Account to iOMS is always made by NetAct. The reply
to the request is also sent to NetAct automatically, and Remote Users Administration takes further action. If the NE Account setting is unsuccessful, the
Syslog contains some information about the situation.
Cause
• NE Account setting unsuccessful. Invalid NE Account.
• NE Account setting unsuccessful. Region of NE Account was not found from LDAP server!
• NE Account setting unsuccessful.
Recovery actions
Contact the Remote User Administration.
Table 5
Setting NE Account to iOMS unsuccessful.
14 Troubleshooting iOMS Fault Management
application
14.1
iOMS alarm system is not responding
Description
The fault management application you are using is indicating that the iOMS alarm
system is not operating. Possible causes for the failure are:
• The alarm processor is not running properly.
• The alarm system database is not running.
If none of the recovery procedures solves the problem, contact your Nokia Siemens
Networks representative.
Symptoms
Alarm management application operations fail. The application can also stop operating
or display error messages.
Recovery procedures
Restarting alarm processor
Check the status of the alarm processor and restart it if necessary.
The recovery group for alarm processor is AlarmSystem.
Steps
1
Log into the iOMS as _nokfsoperator.
2
Change permissions to root user.
Enter the following command:
su
3
Check the status of alarm processor.
Enter the following command:
fshascli -s /AlarmSystem
4
Restart the alarm processor if necessary.
If the AlarmSystem recovery group is not active, restart the alarm processor by entering the following command:
fshascli -r /AlarmSystem
5
Unlock the alarm processor if necessary.
If the AlarmSystem recovery group is locked, unlock the alarm processor by entering the following command:
fshascli -u /AlarmSystem
Restarting iOMS alarm system database
If the operations related to iOMS alarm parameters (for example, viewing or modifying
alarm parameters) fail or if you received an error message indicating problems with the
iOMS alarm system database, it is possible that the iOMS alarm system database is not
running. Check the state of the OMS alarm system database and restart it if necessary.
The recovery group for the alarm system database is AlarmDB.
Steps
1
Log into the iOMS as _nokfsoperator.
2
Change permissions to root user.
Enter the following command:
su
3
Check the state of the alarm system database.
Enter the following command:
fshascli -s /AlarmDB
4
Restart the alarm system database if necessary.
If the AlarmDB recovery group is not active, restart the alarm system database by entering the following command:
fshascli -r /AlarmDB
14.2
Fault Management GUI is not updating the list of alarms
Note that the FM GUI requires an active TCP/IP connection between
OMS and Element Manager to function correctly. If there is a network firewall device between OMS and
the Element Manager workstation and the FM GUI has been running for a long time, the
firewall may, depending on its configuration, drop the active TCP/IP connection of the
FM GUI, so that the alarm situation in the FM GUI is no longer updated.
If you experience or suspect this situation, you can reactivate the TCP/IP connection
and update the alarm situation by pressing the Refresh button. In that case, it is also recommended to check the configuration of the firewall(s) in question to avoid this problem in
the future.
15 Checking for problems in iOMS processes
Description
You suspect that something is wrong with iOMS.
Symptoms
Symptoms can be, for example, that services are not working as they should and there
are a lot of error events in the event log.
Recovery procedures
Checking for problems in iOMS processes
Steps
1
Check the iOMS services status and coredumps with the zstatus command.
Note that zstatus cannot be executed by _nokfsoperator.
Recovery groups and units that are not running are listed with zstatus -d -c. Coredumps
are produced by crashed services. If coredumps are found, gather error information as
described in section Gathering error information.
Monitoring cluster:
# zstatus -d -c
Monitoring cluster
CLA-0
R_GOMS1_1.19.debug_oms.corr4
RecoveryGroups(@OMS, disabled):
RecoveryUnits(@/, disabled):
Processes(@/, disabled):
total 22668
-rw------- 1 root root 562946048 Nov 19 16:12 OMSBTSOM-6955.core
2
Check the hard disk space with the df -k command.
df -k
Filesystem  1K-blocks     Used  Available  Use%  Mounted on
none          2017856      236    2017620    1%  /dev
/dev/md0      5039552  1505384    3278172   32%  /var/mnt/local/localimg
/dev/md3       198273    15858     172179    9%  /var/mnt/local/cmf
/dev/md1     30237584  3408308   25293280   12%  /var/mnt/local/sysimg
/dev/md4      2015760   178276    1735088   10%  /var/mnt/local/log
/dev/md5     10079020    55296    9511728    1%  /var/mnt/local/backup
/dev/md1     30237584  3408308   25293280   12%  /var/mnt/remote/sysimg
/dev/md4      2015760   178276    1735088   10%  /var/mnt/remote/log
/dev/md9      2015760   209288    1704076   11%  /var/mnt/local/MySQL_DB_Alarm
/dev/md8       297421   184336      97729   66%  /var/mnt/local/MySQL_DB_CosNaming
/dev/md1     30237584  3408308   25293280   12%  /var/mnt/remote/sysimg/R_GOMS1_1.19.debug/opt/Nokia/var/ftp/nokfsFtpUser/PM
/dev/md16      198273   128047      59990   69%  /var/mnt/local/MySQL_DB_AidS
/dev/md18     1007832   139748     816888   15%  /var/mnt/local/MySQL_DB_Topology
/dev/md19      495780   132790     337394   29%  /var/mnt/local/MySQL_DB_CMPlan
/dev/md17    10079020   638832    8928192    7%  /var/mnt/local/MySQL_DB_PMData
The iOMS file system is divided into several partitions so that one full partition does not
shut the whole system down. However, a partition that is 100% full still
hinders iOMS operation.
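Scanning a long df listing by eye is error-prone, so the check can be scripted. The following is a minimal sketch: the function name full_partitions and the default limit of 90% are assumptions made for illustration, not values from this document. It expects df to be run with the POSIX -P option in addition to -k, so that each file system stays on one line.

```shell
# Sketch only: full_partitions is a hypothetical helper name.
# Reads `df -kP` output on stdin and prints "mount use%" for every file system
# whose usage is at or above the limit (default 90, an illustrative value).
full_partitions() {
    limit=${1:-90}
    awk -v limit="$limit" '
        NR > 1 {
            use = $5
            sub(/%$/, "", use)          # "66%" -> "66"
            if (use + 0 >= limit + 0)
                print $6, $5
        }'
}

# Usage: df -kP | full_partitions 80
```

Mount points that contain spaces are not handled by this sketch.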
3
Check iOMS memory usage by free command.
# free
             total       used       free     shared    buffers     cached
Mem:       4034240    2968640    1065600          0     358380    1110080
-/+ buffers/cache:    1500180    2534060
Swap:            0          0          0
iOMS does not use swap. Cached memory is used by the Linux disk cache to speed up
operations and is released to programs when they need it. The actual free memory in iOMS
is calculated by adding the free and cached memory together.
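The sum of free and cached memory can be computed directly from the free output. The following is a minimal sketch, with available_kb as a hypothetical function name. It looks up the free and cached columns by their header names, so it only applies to free versions that print a cached column, as in the listing in this step; it returns a non-zero status otherwise.

```shell
# Sketch only: available_kb is a hypothetical helper name.
# Reads `free` output on stdin, locates the "free" and "cached" columns by
# their header names, and prints their sum from the Mem: line (in kB).
available_kb() {
    awk '
        NR == 1 {
            # The Mem: line carries a label in field 1, hence the offset.
            for (i = 1; i <= NF; i++) col[$i] = i + 1
        }
        $1 == "Mem:" {
            if (!("free" in col) || !("cached" in col)) exit 1
            print $(col["free"]) + $(col["cached"])
        }'
}

# Usage: free | available_kb
```

For a Mem: line with 1065600 kB free and 1110080 kB cached, the function prints 2175680. The -/+ buffers/cache line of free additionally counts buffers as reclaimable, so its value is somewhat higher.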
4
If erroneous log entries or behavior are detected (for example, some RG, RU or
process is in the disabled state), gather error information as described in section
Gathering iOMS trace logs to help Nokia Siemens Networks solve the
problem.
16 Gathering error information
Description
If something is not working as it should in iOMS, it usually leaves entries in the
event log and in the trace log.
For more information, see Gathering iOMS trace logs.
17 Using EnvCam script
17.1
EnvCam script overview
EnvCam (Environment Camera) is a script that is delivered as part of the software delivery. The purpose of the script is to collect basic data about the network element to aid
support in troubleshooting problem reports.
By default, the script prints the collected information into a standard output (usually a
console or a terminal session). The output can also be easily redirected into a file for offsite storage or for forwarding to your Nokia Siemens Networks representative.
The script collects relevant basic information on the state and configuration of the
network element.
General information
EnvCam collects the following general information (the services that EnvCam depends on are listed in parentheses):
• Information about available and non-available nodes in the network element. (hwcli or running LDAP server)
• The cluster topology produced by the hwcli command. (hwcli)
  Note that this information is collected only in clustered environments that are supported by the hwcli command.
• Information about the deliveries installed to the network element. (currentdelivery)
• Basic information about the network element obtained from LDAP. For example, Cluster ID, fshwPIUId, fshwPosition, fshwHASNodeId, fshwSWReleaseVersion, fshwPIUSpecificType, fshwMemoryInstalled, fshwVersion, fshwSerialNumber, fshwVendorName. (running LDAP server)
• Information about active software sets. (setmap file)
• Information about the kernel version of the operating system. (ssh, uname)
• Information about the uptime of the network element. (ssh, uptime)
• Information about the status of the network element: the status of the node, recovery groups, recovery units and processes obtained with the fshascli command. (fshascli, running LDAP server)
• Information about virtual memory statistics of the network element. (ssh, vmstat)
• Process tree information of the network element. (ssh, pstree)
• Information about zombie processes in the network element. (ssh, ps)
• Disk usage and mount status of the network element. (ssh, df, mount)
• Loaded kernel modules in the network element. (ssh, lsmod)
• Core files in the network element. (ssh)
• Active alarms. (mysql)
If some or all of the services that are listed in parentheses are not available to EnvCam,
the report of system state produced by EnvCam is undefined. The node where EnvCam
is being run must also be in a stable state.
17.2
Using the EnvCam tool
Before you start
Decide where you want to store the EnvCam information.
Summary
The EnvCam script is located in /opt/Nokia_BP/SS_SysReport/envcam. Please
note that the script command fsenvcam.sh has no command line arguments.
EnvCam is meant to be run over a serial console session. This way the output produced
by EnvCam is easily accessible from outside the network element. The output can also
be redirected into a file in the network element and transferred for off-site storage
provided that outside connections work in the network element.
Steps
1
Log into the iOMS as root user.
2
Change to the correct directory.
Change to the directory where the EnvCam script resides:
cd /opt/Nokia_BP/SS_SysReport/envcam
3
Run the script and direct the output to a file.
Run the EnvCam script and direct the result to the file
envcam_output_<current_date>.txt in the directory /home:
./fsenvcam.sh | tee /home/envcam_output_<current_date>.txt
4
Verify the result.
Verify that the file contains sensible information by displaying the contents using the
Linux cat command:
cat /home/envcam_output_<current_date>.txt
If required, you can transfer the file containing the EnvCam information to an external
server or forward the file to your Nokia Siemens Networks representative.
Expected outcome
You have stored the EnvCam information in the desired file.
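The <current_date> placeholder in the file name can be filled in automatically. The following is a minimal sketch: the function name envcam_output_path is hypothetical, and the YYYY-MM-DD date format is an assumption, since this document does not prescribe one.

```shell
# Sketch only: envcam_output_path is a hypothetical helper name.
# Builds the output file name used in step 3, substituting <current_date>
# with today's date (or with the date given as the first argument).
envcam_output_path() {
    stamp=${1:-$(date +%Y-%m-%d)}
    printf '/home/envcam_output_%s.txt\n' "$stamp"
}

# Usage, from /opt/Nokia_BP/SS_SysReport/envcam as root:
#   out=$(envcam_output_path)
#   ./fsenvcam.sh | tee "$out"
#   [ -s "$out" ] && echo "EnvCam report stored in $out"
```

The final check in the usage lines only confirms that the file is not empty; the contents still need to be reviewed as described in step 4.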
18 Replacing a faulty disk in HP BladeSystem
iOMS hardware
Purpose
To change the SAS hard disk attached to a node running on the HP BladeSystem iOMS
hardware.
Before you start
Check that the storage capacity of the new disk is at least as large as that of the disk to be
replaced. If it is smaller, identical logical volumes cannot be created on the new disk.
Summary
The new hard disk must be configured to belong to the same volume group as the
replaced disk. The logical drives and disk partitions must also be re-created. Use the
provided scripts to prepare the disk for removal and to create the volumes and disk partitions on the replacement disk.
Steps
1
Log in as root user.
2
Prevent the periodical md recovery.
The mdrecovery.sh script entry in the mdrecovery file in the /etc/cron.d directory must be commented out with hash mark (#) to prevent the system from automatically detecting and adding RAIDs to an array.
a) Open the mdrecovery file in the /etc/cron.d directory in any text editor, for
example vi. Enter the following command:
vi /etc/cron.d/mdrecovery
b) Insert a hash mark to the beginning of the line that has the mdrecovery.sh script
entry:
# */30 * * * * root /opt/Nokia_BP/bin/mdrecovery.sh
c) Save the file.
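As an alternative to editing the file by hand, the hash mark can be added and later removed with a small script. The following is a minimal sketch with hypothetical function names (disable_mdrecovery, enable_mdrecovery); it assumes that the only lines mentioning mdrecovery.sh in the file are the cron entries to be toggled. The file is rewritten in place so that its ownership and permissions are preserved.

```shell
# Sketch only: both function names are hypothetical.

# Comment out every active line that mentions mdrecovery.sh.
disable_mdrecovery() {
    file=$1
    tmp=$(mktemp) || return 1
    awk '/mdrecovery\.sh/ && $0 !~ /^[ \t]*#/ { print "# " $0; next } { print }' \
        "$file" > "$tmp" && cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return $status
}

# Remove the leading hash mark from the mdrecovery.sh line again.
enable_mdrecovery() {
    file=$1
    tmp=$(mktemp) || return 1
    awk '/mdrecovery\.sh/ { sub(/^[ \t]*#[ \t]*/, "") } { print }' \
        "$file" > "$tmp" && cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return $status
}

# Usage: disable_mdrecovery /etc/cron.d/mdrecovery
```

Running disable_mdrecovery twice leaves a single hash mark, and enable_mdrecovery performs the reverse edit that is needed once the disk has been replaced.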
3
Find out the volume group of the disk.
To find out which volume group the disk belongs to, enter the following command:
cciss_info.sh <logical drive number>
For example, to find out the information about logical drive number 2 enter the following
command:
[root@CLA-0]# cciss_info.sh 2
device /dev/cciss/c0d1
volumegroup VG_63
port 1I box 1 bay 2
In this example, the hard disk on bay 2 belongs to the volume group VG_63.
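The volume group name is needed again as the argument of the disk change scripts, so it can be convenient to extract it from the output. The following is a minimal sketch, with volume_group_of as a hypothetical function name; it assumes the "volumegroup <name>" line format shown in the example output.

```shell
# Sketch only: volume_group_of is a hypothetical helper name.
# Reads cciss_info.sh output on stdin and prints the volume group name.
# Returns a non-zero status if no "volumegroup" line is present.
volume_group_of() {
    awk '$1 == "volumegroup" { print $2; found = 1 } END { exit !found }'
}

# Usage: vg=$(cciss_info.sh 2 | volume_group_of) && echo "$vg"
```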
4
Prepare the disk for hotswapping.
The disk is prepared for hotswapping by executing the disk_change_pre.sh
<volume group> script. For example, to prepare the disk that belongs to the volume
group VG_63, enter the following command:
[root@CLA-0]# disk_change_pre.sh VG_63
### Checking cciss devices.
### Deactivating volume group VG_63.
raid1: Disk failure on dm-3, disabling device.
mdadm: set /dev/mapper/VG_63-backup faulty in /dev/md5
mdadm: hot removed /dev/mapper/VG_63-backup
raid1: Disk failure on dm-2, disabling device.
mdadm: set /dev/mapper/VG_63-log faulty in /dev/md4
mdadm: hot removed /dev/mapper/VG_63-log
raid1: Disk failure on dm-0, disabling device.
mdadm: set /dev/mapper/VG_63-sysimg faulty in /dev/md1
mdadm: hot removed /dev/mapper/VG_63-sysimg
raid1: Disk failure on dm-4, disabling device.
mdadm: set /dev/mapper/VG_63-cmf faulty in /dev/md3
mdadm: hot removed /dev/mapper/VG_63-cmf
raid1: Disk failure on dm-1, disabling device.
mdadm: set /dev/mapper/VG_63-localimg_CLA--0 faulty in /dev/md0
mdadm: hot removed /dev/mapper/VG_63-localimg_CLA--0
0 logical volume(s) in volume group "VG_63" now active
### Deactivation OK.
### Ready to hotswap drive from bay 2.
5
Change the hard disk.
The hard disk can be replaced with a new one when the disk_change_pre.sh
outputs the following message to the screen:
### Deactivation OK.
### Ready to hotswap drive from bay <bay number>.
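If the output of disk_change_pre.sh is saved to a file, the presence of both messages can be checked before the disk is pulled. The following is a minimal sketch, with hotswap_bay as a hypothetical function name; it assumes that the two messages appear in the captured output exactly as shown above.

```shell
# Sketch only: hotswap_bay is a hypothetical helper name.
# Reads disk_change_pre.sh output on stdin. Prints the bay number and returns 0
# only if both the "Deactivation OK" and "Ready to hotswap" lines were seen.
hotswap_bay() {
    awk '
        /^### Deactivation OK\./ { ok = 1 }
        /^### Ready to hotswap drive from bay / {
            bay = $NF
            sub(/\.$/, "", bay)     # drop the trailing full stop
        }
        END {
            if (ok && bay != "") { print bay; exit 0 }
            exit 1
        }'
}

# Usage: disk_change_pre.sh VG_63 2>&1 | tee pre.log | hotswap_bay
```

A non-zero return status means that at least one of the two messages is missing, in which case the disk should not be removed yet.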
6
Create the logical drives, volumes and disk partitions.
The logical drives, volumes and disk partitions can be created to the new disk by executing the disk_change_post.sh <volume group> script. For example, if the
replaced disk belonged to the volume group VG_63, enter the following command:
[root@CLA-0]# disk_change_post.sh VG_63
### Figuring out cciss information.
### Deleting logical drive 2.
Warning: Deleting an array can cause other array letters to
become renamed.
E.g. Deleting array A from arrays A,B,C will result in
two remaining
arrays A,B ... not B,C
### Creating logical drive 2.
### Waiting...
### Logical drive 2 successfully created.
### Size of existing disk is 72 GB, the new one is 72 GB in
size.
### Size of new disk ok.
### Copying partition table from HDF_62.
1+0 records in
1+0 records out
### Partition table copied.
### Waiting...
### Running updatepreboot.sh script.
[output of updatepreboot.sh not shown]
### Updating done.
### Creating physical volume p2.
Physical volume "/dev/cciss/c0d1p2" successfully created
### Physical volume created.
### Creating volume group VG_63.
Volume group "VG_63" successfully created
### Volume group created.
### Creating logical volumes for VG_63.
Logical volume "sysimg" created
Logical volume "localimg_CLA-0" created
Logical volume "log" created
Logical volume "backup" created
Logical volume "cmf" created
### Logical volumes created.
### Logical volumes of VG_63 now identical with VG_62.
### Recovering md devices (this will take a LONG time).
mdadm: hot added /dev/mapper/VG_63-backup
mdadm: hot added /dev/mapper/VG_63-log
mdadm: hot added /dev/mapper/VG_63-sysimg
mdadm: hot added /dev/mapper/VG_63-cmf
mdadm: hot added /dev/mapper/VG_63-localimg_CLA--0
### Recovery done.
### All finished.
Note that the execution of the script will take a long time since it waits until the recovery
actions are complete.
7
Restore the periodical md recovery.
Open the mdrecovery file in the /etc/cron.d directory in any text editor and remove
the hash mark (#) that was inserted before the operation, that is, from the beginning of
the following line:
*/30 * * * * root /opt/Nokia_BP/bin/mdrecovery.sh
Save the file after editing.
Expected outcome
The scripts are finished normally, that is, data is successfully copied to the newly added
disk and you can use the system normally.
Unexpected outcome
The post script outputs one of the following messages:
!!! Couldn't determine the device of existing vg VG_62.
!!! Try doing the reactivation manually.
In this case there might be a problem with the other disk as well.
!!! Partition copying didn't produce correct partitions on
/dev/cciss/c0d1.
Check that the other volume group has correct partition table and execute the post script
again.
!!! Wiping failed, disk already mounted?
Check that the periodic mdrecovery is not being run and try running the post script again.
!!! Did not manage to create identical logical volumes for VG_63.
Check that the newly added disk is at least as big as the existing one.
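Because the post script runs for a long time, its output is easier to evaluate from a saved log. The following is a minimal sketch, with post_script_result as a hypothetical function name; it assumes that error messages start with "!!!" and that a successful run ends with the "### All finished." line, as in the listings above.

```shell
# Sketch only: post_script_result is a hypothetical helper name.
# Reads disk_change_post.sh output on stdin. Prints "ok" if the completion
# message was seen and no "!!!" line occurred; otherwise prints the error
# lines (or a note that the completion message is missing) and returns 1.
post_script_result() {
    awk '
        /^!!!/ { errors = errors $0 "\n" }
        /^### All finished\./ { finished = 1 }
        END {
            if (errors != "") { printf "%s", errors; exit 1 }
            if (!finished) { print "no completion message found"; exit 1 }
            print "ok"
        }'
}

# Usage: disk_change_post.sh VG_63 2>&1 | tee post.log | post_script_result
```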
19 LTE iOMS alarms
19.1
70001 CONFIGURATION OF SNMP MEDIATOR IS OUT OF
ORDER
Probable cause: Corrupt data
Event type: Processing error
Default severity: Minor
Meaning
The configuration of the SNMP mediator contains values that are unacceptable.
The invalid part of configuration is ignored. This causes partial loss of functionality. The
SNMP traps may be lost.
Identifying additional information fields
Configuration entry
• The name and value of the attribute that is out of order under the fssnmpMediatorName=1, fsFragmentId=SNMP, fsClusterId=ClusterRoot branch.
Additional information fields
Instructions
Use the parameter management application to correct the configuration branch that is
out of order. The Application Additional Information field displays the attribute or
entry name that has an unacceptable value. For example, the following entry causes
the alarm 70001, if xxx is not a hostname that can be resolved:
fssnmpNEId=xxx,
fssnmpAttributeType=NEattrs,
fssnmpMediatorName=1,fsFragmentId=SNMP,
fsClusterId=ClusterRoot
The Testing instructions section below provides instructions for creating the invalid entry.
Clearing
The alarm is cleared automatically by the alarm system after five minutes. If the configuration is still out of order after that, the alarm is raised again.
Testing instructions
1. Open parameter management application and use it in the extended mode (select
Browse > Mode > Extended Mode).
2. Add an invalid hostname to SNMP mediator’s LDAP configuration:
a) Expand the entry tree below fsFragmentID=SNMP: In the parameter management application main window, click the arrow next to the SNMP fragment in the
entry tree (fsFragmentID=SNMP).
b) Click the arrow next to fssnmpMediatorName=1 to further expand the entry tree.
c) Select fssnmpAttributeType=NEattrs and click the arrow next to it to display the
managed NEs.
d) Select Entry > New Child or right-click fssnmpAttributeType=NEattrs and select
New Child.
e) In the Add new entry dialog box, enter any value for attribute fssnmpMOID and
value xxx for fssnmpNEId.
f) Click OK and select Forced Activation in the Select Operation window.
3. Restart /SNMPMediator.
Alarm 70001 with IAAI=”fssnmpNEId=xxx” is raised.
19.2
70002 INVALID SNMP TRAP COMMUNITY STRING
Probable cause: Corrupt data
Event type: Processing error
Default severity: Warning
Meaning
The SNMP Mediator has received an SNMP trap that contains an invalid trap community
string, that is, the community string in the trap does not match the community string in
SNMP Mediator's configuration. The community strings are passwords that are used to
authenticate the senders of SNMP traps.
Identifying additional information fields
Additional information fields
1. IP address of the SNMP agent that sent the trap
2. The received trap community string
3. Version of the used SNMP, possible values are:
• SNMPv1
• SNMPv2c
4. Object identifier of the received trap
Instructions
1. Check the IP address of the SNMP agent that sent the trap. The IP address is displayed in Identifying additional information field #1 of the alarm.
2. Check the community string that was received in the trap. The community string is
displayed in Application Additional Information field #1 of the alarm.
3. Use the parameter management tool to check the community string that the SNMP
Mediator expects. Attribute fssnmpCommunityString of the following entry
defines the community string:
fssnmpTrapSource=<agent ip / hostname>,
fssnmpAttributeType=Commstrings,
fssnmpMediatorName=1,
fsFragmentId=SNMP,
fsClusterId=ClusterRoot
4. Modify the community string in the LDAP directory to match the community string
received in the trap, or configure the SNMP agent to use the community string that
the SNMP Mediator expects. Note that if no community string has been specified for
an IP address in the LDAP, the SNMP Mediator accepts all community strings from
that address.
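The acceptance rule described in step 4 can be sketched as follows. This is an illustration only, not the SNMP Mediator's code: the function name and the dictionary that stands in for the Commstrings entries are assumptions.

```python
def trap_community_accepted(configured, source, community):
    """Decide whether a trap's community string would be accepted.

    `configured` maps a trap source (IP address or hostname, as in
    fssnmpTrapSource) to the expected fssnmpCommunityString.
    """
    expected = configured.get(source)
    if expected is None:
        # No community string specified for this source:
        # all community strings from that address are accepted.
        return True
    return community == expected
```

With `configured = {"CLA-0": "secret"}`, a trap from CLA-0 carrying `public` is rejected, which corresponds to the situation in which alarm 70002 is raised.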
Clearing
Clear the alarm with the alarm management application after correcting the fault as presented in Instructions.
Testing instructions
1. Open the parameter management application and use it in normal mode, when
SNMP Mediator is running.
2. Define the trap community for address CLA-0 to be "secret" by adding the following
entry to SNMP Mediator's LDAP configuration:
dn: fssnmpTrapSource=CLA-0,
fssnmpAttributeType=Commstrings,
fssnmpMediatorName=1,
fsFragmentId=SNMP,
fsClusterId=ClusterRoot
fssnmpCommunityString: secret
fssnmpTrapSource: CLA-0
objectClass: FSSNMPTrapCommunityString
objectClass: top
objectClass: FSMOCBase
3. Log into CLA-0.
4. Send a trap to SNMP Mediator with the following command:
# snmptrap -v 1 -c public SNMPMediator "" <CLA-0 IP address> 0 0 ""
Alarm 70002 INVALID SNMP TRAP COMMUNITY STRING with IAAI=<CLA-0 IP
address> and AAI="public SNMPv1 .1.3.6.1.6.3.1.1.5.1" is raised.
19.3
70003 NO REPLY TO SNMP REQUEST
Probable cause: Corrupt data
Event type: Processing error
Default severity: Warning
Meaning
SNMP Mediator has sent an SNMP request to an SNMP agent but it has not received a
response.
Example:
1. A filter condition has been added for the authenticationFailure
(.1.3.6.1.6.3.1.1.5.5) trap. Thus the following entry can be viewed with the
parameter management tool:
fssnmpV2TrapId=.1.3.6.1.6.3.1.1.5.5,
fssnmpAttributeType=V2traps,
fssnmpMediatorName=1,
fsFragmentId=SNMP,
fsClusterId=ClusterRoot
The filter condition is defined by the attribute fssnmpFilterCondition.
fssnmpFilterCondition may have, for example, the value
(.1.3.6.1.2.1.1.1.0=*Linux*). See RFC 2254 for more information about
the filter syntax.
2. The SNMP Mediator receives an authenticationFailure trap that
does not contain the value of variable .1.3.6.1.2.1.1.1.0.
3. The SNMP Mediator queries the value of .1.3.6.1.2.1.1.1.0 from the SNMP agent, but does not
receive a response.
The SNMP Mediator is not able to handle the trap correctly, because it is not able to query or
modify variables in the SNMP agent.
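The example above can be illustrated with a small sketch of how a simple filter condition relates to the variables carried in a trap. It is an assumption-laden illustration, not the SNMP Mediator's implementation: it handles only a single `(OID=value)` item with `*` wildcards, and the function name `evaluate_simple_filter` is hypothetical. A return value of `None` marks the case in which the variable is missing from the trap, that is, the case in which the agent would have to be queried.

```python
import re

# One filter item of the form (<numeric OID>=<value with optional * wildcards>).
_SIMPLE_FILTER = re.compile(r"^\(([0-9.]+)=([^()]*)\)$")


def evaluate_simple_filter(filter_text, varbinds):
    """Evaluate a simple filter against the variables of a trap.

    Returns True or False when the variable is present in `varbinds`
    (a dict of OID -> value), and None when it is absent.
    """
    match = _SIMPLE_FILTER.match(filter_text.strip())
    if match is None:
        raise ValueError("only simple (OID=value) filters are handled here")
    oid, pattern = match.group(1), match.group(2)
    if oid not in varbinds:
        return None
    # Turn each * into "any sequence of characters"; everything else is literal.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.fullmatch(regex, str(varbinds[oid]), re.DOTALL) is not None
```

For the filter `(.1.3.6.1.2.1.1.1.0=*Linux*)`, a trap without the variable `.1.3.6.1.2.1.1.1.0` yields `None`, which corresponds to step 3 of the example.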
Identifying additional information fields
Additional information fields
IP address of the SNMP agent that does not answer
Instructions
1. Check the IP address of the SNMP agent that sent the trap. The IP address is displayed in the Application Additional Information field #1 of the alarm.
2. The net-snmp command line tools (snmpget, snmpset and so on) provided by the
operating system may be used to verify the functionality of the SNMP agent.
3. To check the attributes defined for the SNMP agent, use the parameter management tool. The attributes are located under the following entry:
fssnmpNEId=<agent IP / hostname>,
fssnmpAttributeType=NEattrs,
fssnmpMediatorName=1,
fsFragmentId=SNMP,
fsClusterId=ClusterRoot
4. Verify that the optional attribute fssnmpUDPPort is set to the port that the SNMP
agent is listening on. The default value is 161.
5. Verify that the optional attribute fssnmpProtocolVersion matches the protocol version that the
SNMP agent supports. The default value is V2c.
6. Verify that the optional attributes fssnmpReadCommString and
fssnmpWriteCommString are the ones that the SNMP agent expects.
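Steps 2 and 4 to 6 can be combined into one manual check. The sketch below applies the documented defaults (port 161, version V2c) to the attributes of an entry and builds a net-snmp snmpget command line for verifying the agent. It is a hypothetical helper: the value `V1` for fssnmpProtocolVersion, the `host:port` agent notation and the function name are assumptions, and no default is assumed for the read community string.

```python
# Defaults stated in the instructions above.
DEFAULTS = {"fssnmpUDPPort": "161", "fssnmpProtocolVersion": "V2c"}

# Assumed mapping from attribute values to the snmpget -v option.
_VERSION_FLAG = {"V1": "1", "V2c": "2c"}


def snmpget_command(ne_id, attrs, oid="system.sysDescr.0"):
    """Build an snmpget argument list from the attributes of an NE entry."""
    effective = {**DEFAULTS, **attrs}
    community = effective.get("fssnmpReadCommString")
    if community is None:
        raise ValueError("fssnmpReadCommString is not set; supply the community string")
    version = _VERSION_FLAG[effective["fssnmpProtocolVersion"]]
    agent = "{}:{}".format(ne_id, effective["fssnmpUDPPort"])
    return ["snmpget", "-v", version, "-c", community, agent, oid]
```

If the resulting command does not return a value when run from the node where SNMP Mediator runs, the mismatch is likely in one of the three attributes.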
Clearing
Clear the alarm with the alarm management application after correcting the fault as presented in Instructions.
Testing instructions
1. Open the parameter management tool and use it in normal mode, when SNMP
Mediator is running.
2. Add entry "fssnmpV2TrapId=.1.3.6.1.6.3.1.1.5.1" under branch "fssnmpAttributeType=V2traps,fssnmpMediatorName=1,fsFragmentId=SNMP,fsClusterId=ClusterRoot".
3. Add attribute fssnmpFilterCondition to the entry created in step 2 and give it the
value (.1.3.6.1.2.1.1.5.0=anystring). The grammar for the filter condition is specified
in http://www.ietf.org/rfc/rfc2254.txt?number=2254.
4. Verify that there is no SNMP agent process such as snmpd running on CLA-0. If one is running, stop it:
# netstat -alp | grep snmp
tcp   0   0   *:smux   *:*   LISTEN   11017/snmpd
udp   0   0   *:snmp   *:*            11017/snmpd
# kill 11017
root@CLA-0(GUI):~# netstat -alp | grep snmp
#
5. Send a trap to SNMP Mediator with the following command (use the IP address of
CLA-0 as agent IP):
# snmptrap -v 1 -c public SNMPMediator "" 192.168.128.1 0 0 ""
Alarm 70003 NO REPLY TO SNMP REQUEST is raised with AAI=192.168.128.1,
because:
• SNMP Mediator receives trap ".1.3.6.1.6.3.1.1.5.1", which does not contain the
variable ".1.3.6.1.2.1.1.5.0" that is part of the filter condition.
• SNMP Mediator tries to get the value of ".1.3.6.1.2.1.1.5.0" from an SNMP agent
running at address 192.168.128.1.
• SNMP Mediator does not get a response from 192.168.128.1, because no SNMP
agent is running at that address.
19.4
70004 UNKNOWN SNMP TRAP
Probable cause: Corrupt data
Event type: Processing error
Default severity: Warning
Meaning
The SNMP Mediator has received an SNMP trap that it is unaware of. The trap is
unknown to the SNMP Mediator, if 1) the IP address of the SNMP agent that sends the
trap is missing from the SNMP Mediator's configuration, or 2) the OID (object identifier)
of the trap is unknown to the SNMP Mediator.
1. Unknown traps may contain information that could be useful.
2. Unnecessary traps waste network capacity.
Identifying additional information fields
Additional information fields
1. IP address of the SNMP agent that sent the trap
2. Version of the used SNMP, possible values:
• SNMPv1
• SNMPv2c
3. Object identifier of the received trap
Instructions
1. Using the parameter management application, check that the IP address of the
SNMP agent is stored in the SNMP Mediator's configuration. An entry of the following format should be found:
fssnmpNEId=<agent IP or hostname>,
fssnmpAttributeType=NEattrs,
fssnmpMediatorName=1,
fsFragmentId=SNMP,
fsClusterId=ClusterRoot
2. If the trap is unnecessary, check whether there is a way to disable the sending of the
trap in the SNMP agent or use filtering in the SNMP Mediator. The SNMP Mediator
may be configured to filter out traps by adding an entry of the following format:
fssnmpV2TrapId=<trap OID>,
fssnmpAttributeType=V2traps,
fssnmpMediatorName=1,
fsFragmentId=SNMP,
fsClusterId=ClusterRoot
If the above entry without attributes exists in the configuration, the SNMP Mediator
will ignore the trap and no alarm is raised. Additionally, filtering attributes
fssnmpAcceptFrom or fssnmpDiscardFrom may be used to define the IP
addresses from where the trap should be accepted or ignored. Attribute fssnmpFilterCondition may be used for filtering away traps based on variables within the trap
itself. See RFC 2254 for information about the filter syntax ("approx", "extensible"
and "escaping mechanism" are not supported).
3. If the trap contains important information, the implementation of the SNMP Mediator
should be updated. The rules that define what the SNMP Mediator does when it
receives traps are part of the implementation. Fill in a problem report and send it to
your local Nokia Siemens Networks representative.
Clearing
Clear the alarm with the alarm management application after correcting the fault as presented in Instructions.
Testing instructions
1. Log into the active CLA.
2. Send a coldStart trap to SNMP Mediator using an agent IP address that is not in SNMP Mediator's configuration (127.0.0.1):
# snmptrap -v 1 -c public SNMPMediator "" 127.0.0.1 0 0 ""
3. Alarm 70004 UNKNOWN SNMP TRAP with IAAI=127.0.0.1 and AAI="SNMPv1
.1.3.6.1.6.3.1.1.5.1" is raised.
19.5
70005 INCORRECT ALARM DATA
Probable cause: Invalid parameter
Event type: Processing error
Default severity: Major
Meaning
The alarm system has been requested to raise or clear an alarm with incorrect alarm
data. One or more arguments provided with the request might have an invalid value or
meaning:
• null
• empty
• too long
• out of specified range
• contain non-printable characters
• have an incorrect format
The alarm number (Specific Problem) might also be unknown. An incorrect format in
this case means, for example, that a character value was entered where a numeric value
was expected. A special case of an incorrect format is if the quotes (") surrounding the
value of an information field are missing from an alarm notification record in the syslog.
The alarm which is requested to be raised or cleared with incorrect data is not processed
further but the information is put as additional information in this alarm. If the alarm
number is unknown, then the actual fault for which the alarm has been raised is also left
unknown.
Identifying additional information fields
1. Erroneous data
• Identifies the alarm data that was incorrect or that was totally missing. Only the
name of the first field containing invalid data is mentioned here.
• Possible values are:
• SP: Specific Problem given in the data is not known by the alarm system, or is
not reasonable;
• MOId: Managed Object Id given in the data is not reasonable;
• PS: Perceived Severity given in the data is not reasonable;
• applId: Application Id given in the data is not reasonable;
• AAI: Additional Information given in the data is not reasonable;
• IAAI: Identifying Additional Information given in the data is not reasonable;
• alarmTime: Alarm time is presented in too long a format, or is in non-numerical
format;
• length: The combined length of the string type fields (Managed Object Id, Application Id, Application Additional Information, Identifying Application Additional
Information) given in the data exceeds the maximum value of 896 characters.
Note that in this case, both Application Id and Managed Object Id in the given
data are considered as invalid, as only the combined length is verified.
In addition, these values are also possible for RNC alarms:
• rncLocalMOId: the Local Managed Object Id given in the data is not reasonable;
• rncApplicationId: the RNC Application Id given in the data is not reasonable;
• rncNotificationId: the RNC Notification Id given in the data is not reasonable;
• rncFlowControl: the RNC Flow Control given in the data is not reasonable.
2. Specific Problem
• Specific problem (the alarm number) of the invalid alarm can also contain the
original invalid value if this was the invalid field.
Additional information fields
Managed Object Id
• Distinguished name of the managed object that was given as the Managed Object
Id in the invalid alarm. If the MOId itself was the incorrect data, then the value
fsManagedObjectId=invalid, fsClusterId=ClusterRoot is displayed in
this field.
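The way the Erroneous data field names only the first invalid field can be sketched as follows. This is not the alarm system's validation code: the order of the checks, the choice of mandatory fields and the function name are assumptions made for illustration. Only the field names and the 896-character limit come from the description above.

```python
MAX_COMBINED_LENGTH = 896  # combined limit for the string type fields


def first_erroneous_field(alarm, known_sps):
    """Return the name of the first invalid field of an alarm request, or None.

    `alarm` is a dict with the keys SP, MOId, applId, AAI and IAAI;
    `known_sps` is the set of alarm numbers known to the alarm system.
    """
    sp = alarm.get("SP")
    if not isinstance(sp, int) or sp not in known_sps:
        return "SP"
    for name in ("MOId", "applId"):
        value = alarm.get(name)
        # Null, empty, or containing non-printable characters.
        if not value or not value.isprintable():
            return name
    combined = sum(len(alarm.get(n) or "") for n in ("MOId", "applId", "AAI", "IAAI"))
    if combined > MAX_COMBINED_LENGTH:
        return "length"
    return None
```

A request with the unknown Specific Problem 700111 returns `SP` in this sketch.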
Instructions
Fill in a problem report and send it to your local Nokia Siemens Networks representative.
Clearing
Clear the alarm with the alarm management application after correcting the fault as presented in Instructions, in other words, after sending the report to your local Nokia
Siemens Networks representative.
Testing instructions
Use, for example, the alarm system command line interface (CLI) command
flexalarm to send a request to raise or clear an alarm with a Specific Problem that
does not exist.
For example:
$> flexalarm -raise -mo=<myMO> -ap=<myAP> -sp=700111
where <myMO> and <myAP> have the correct format.
Since the 700111 Specific Problem does not exist, alarm 70005 is raised.
19.6
70007 AUTHENTICATION FAILURE IN ETHERNET DEVICE
Probable cause: Protection path failure
Event type: Equipment
Default severity: Minor
Meaning
An Authentication Failure SNMP trap signifies that the sending protocol entity is the
addressee of a protocol message that is not properly authenticated. The SNMP agent
generates this trap when some actor sends it SNMP requests with a wrong authentication
method or key. In SNMP, this authentication key is called the community string. The most
likely cause is a misconfigured SNMP manager or MIB browser, but the trap may also indicate
malicious activity, that is, a user trying to obtain information by sending SNMP
requests. The trap is not triggered by CLI (Command Line Interface) or Web login failures.
The SNMP request fails and no information is returned.
Identifying additional information fields
IP address
• The IP address of the entity whose request carried a wrong community string and
caused the trap to be generated.
Additional information fields
Instructions
If there are no misconfigured SNMP managers, there is a danger that some
entity is inside the network without authorization, and this actor must be found. The
entity can be identified from the authentication failure SNMP trap sent by the SNMP agent.
If the SNMP configuration in the manager is incorrect, the SNMP community string
must be updated.
Clearing
Clear the alarm with the alarm management application after correcting the fault as presented in Instructions.
Testing instructions
1. Log into the switch. For example:
[root@CLA-0(MIKAEL_R_FSPR4EDC_1.9) /root]
# ssh switch-1
Linux swsea 2.4.17_mvl21-swsea #1 Wed May 17 11:59:44 CDT 2006 ppc unknown
Linux swsea 2.4.17_mvl21-swsea #1 Wed May 17 11:59:44 CDT 2006 ppc unknown
2. Start the swc command line tool:
root@swsea@1-1-8:~# swc
(RadiSys SWSE-A Switch) >
3. Display the community strings by "show snmpcommunity":
(RadiSys SWSE-A Switch) >show snmpcommunity
SNMP Community Name   Client IP Address   Client IP Mask   Access Mode   Status
tstcomm               192.168.128.1       0.0.0.0          Read Only     Enable
com                   192.168.128.1       0.0.0.0          Read Only     Enable
4. Exit the switch:
(RadiSys SWSE-A Switch) >quit
The system has unsaved changes.
Would you like to save them now? (y/n) n
root@swsea@1-1-8:~# exit
logout
Connection to switch-1 closed.
5. Perform an SNMP Get request with a valid community string:
# snmpget -c tstcomm -v 2c switch-1 system.sysDescr.0
SNMPv2-MIB::sysDescr.0 = STRING: RadiSys SWSE-A Switch
6. Perform an SNMP Get request with an invalid community string:
# snmpget -c invalid -v 2c switch-1 system.sysDescr.0
As described in Meaning, the request fails and no information is returned.
Alarm 70007 is raised after step 6 because of the invalid community string.
19.7
70011 NODE NOT RESPONDING
Probable cause: Equipment malfunction
Event type: Equipment
Default severity: Major
Meaning
A physical computing node has not restarted despite restart attempts. The node may
be broken, unable to restart, or stuck.
Any important services/functions that are provided with an active-standby recovery
group may have been taken over by other operational nodes. Services may be down if
standby nodes are also down.
Identifying additional information fields
Additional information fields
Any further information if available.
Instructions
Perform the following steps to verify the state of the node:
1. Log into the cluster as root user.
2. Use the hwcli command to verify the state of the node. For example, the state of
the node CLA-0 can be checked as follows:
$ hwcli CLA-0
CLA-0: available (FlexiSvr CPI1 000157:0108 01.02)
3. The previous hwcli output shows that the CLA-0 node is physically available. The high
availability services (HAS) of the system attempt, after about 30 minutes, to restart
a failed node by issuing a power-off, power-on and restart sequence. If you do not
want to wait for this, you can perform the power-off, power-on and restart sequence
manually.
For example:
$ hwcli --power off CLA-0
ATTEMPTING TO POWER OFF NODE
CLA-0
ARE YOU SURE YOU WANT TO PROCEED? yes
Powering off CLA-0: OK
$ hwcli --power on CLA-0
Powering on CLA-0: OK
$ hwcli --reset CLA-0
ATTEMPTING TO RESET NODE
CLA-0
ARE YOU SURE YOU WANT TO PROCEED? yes
Resetting CLA-0: OK
4. If the node does not start within a few minutes or the hwcli does not show that the
node is available, check if the CPU board has any error lights on. If it does, you can
try to restore the node into service by removing and re-inserting the node.
5. Contact your Nokia Siemens Networks representative even if these operations bring
the node up, because it is possible that the computing node needs to be replaced or
it may, for example, need a BIOS upgrade.
Clearing
The system clears the alarm automatically when the fault has been corrected.
Testing instructions
1. Power off an operational unlocked node using hwcli. You can check the state of
the node using fshascli. For example,
$ fshascli --state /AS-1
/AS-1
administrative(UNLOCKED)     <== Unlocked
operational(ENABLED)         <== Operational
usage(IDLE)
procedural()
availability()
unknown(FALSE)
alarm()
$ hwcli --power off AS-1
ATTEMPTING TO POWER OFF NODE
AS-1
ARE YOU SURE YOU WANT TO PROCEED? yes
Powering off AS-1: OK
2. Wait for the node to change its state to DISABLED. By default, the alarm is raised
about 10 minutes after the node has been declared faulty because attempts to
restart it have failed. A faulty node has OFFLINE and FAILED in the availability
status. For example,
$ fshascli --state /AS-1
/AS-1
administrative(UNLOCKED)     <== Unlocked
operational(DISABLED)        <== Not operational
usage(IDLE)
procedural(INITIALIZING)
availability(OFFLINE)        <== Not yet failed
unknown(FALSE)
alarm(MAJOR,OUTSTANDING)
$ sleep 11m
$ fshascli --state /AS-1
/AS-1
administrative(UNLOCKED)     <== Unlocked
operational(DISABLED)        <== Not operational
usage(IDLE)
procedural(NOTINITIALIZED)
availability(OFFLINE,FAILED) <== FAILED!
unknown(FALSE)
alarm()
The alarm raising is also visible in the syslog as a message that begins as follows:
ALARM RAISE SP=70011 . . .
3. The alarm is automatically cancelled when the node has successfully restarted.
Issue a power-on for the node using hwcli and wait for the node restart to complete. For example,
$ hwcli --power on AS-1
Powering on AS-1: OK
$ sleep 3m
$ fshascli --state /AS-1
/AS-1
administrative(UNLOCKED)     <== Unlocked
operational(ENABLED)         <== Operational
usage(IDLE)
procedural()
availability()
unknown(FALSE)
alarm()
The alarm cancellation is also visible in the syslog as a message that begins as
follows:
ALARM CANCEL SP=70011 . . .
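When the test is repeated often, the fshascli state listings shown in the steps above can be checked by a script instead of by eye. The following is a minimal sketch that assumes the output has exactly the `name(VALUE,...)` line format of the examples; the function names are hypothetical and the sketch does not call fshascli itself.

```python
import re

# A state line such as "availability(OFFLINE,FAILED)"; trailing text is ignored.
_STATE_LINE = re.compile(r"^(\w+)\(([^)]*)\)")


def parse_state(output):
    """Parse a state listing into a dict of state name -> list of values."""
    states = {}
    for line in output.splitlines():
        match = _STATE_LINE.match(line.strip())
        if match:
            states[match.group(1)] = [v for v in match.group(2).split(",") if v]
    return states


def node_has_failed(states):
    """True when the node is DISABLED and FAILED is in its availability status."""
    return ("DISABLED" in states.get("operational", [])
            and "FAILED" in states.get("availability", []))
```

Applied to the listing taken after `sleep 11m` in step 2, `node_has_failed` returns True; applied to the listing taken immediately after the power-off, it returns False because the availability status is still only OFFLINE.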
19.8
70025 POSSIBLE SECURITY THREAT IN NETWORK ELEMENT
Probable cause: Threshold crossed
Event type: Quality of Service
Default severity: Warning
Meaning
There is reason to suspect that someone is trying to intrude into a network element. This
condition arises if there are too many failed login attempts.
Identifying additional information fields
Additional information fields
Instructions
Check the security log data. In particular, investigate the login entries made just before
the alarm was raised.
Clearing
After correcting the fault as presented in Instructions, clear the alarm with the alarm
management application.
Testing instructions
Prerequisites for the testing: Create an internal test account (that is, an account that resides in the network
element's LDAP server) by using either the parameter management application or the
fsuseradd CLI command, and set its password.
1. Log into a node with ssh and with a valid user account and password so that a
session is successfully started.
2. Log out from the node.
3. Log in with the same user account but with a wrong password the predefined
number of times. The number is defined in the file /etc/pam.d/ssh, on the row
/opt/Nokia_BP/lib/security/$ISA/PamAlarm.so file=/var/log/faillog alarmThreshold=<number> validfor=internal
where the parameter alarmThreshold=<threshold_for_number_of_failed_logins> sets the threshold.
The default number of consecutive failed logins needed is 5. Make sure that there
are no successful logins for the user between the failed ones.
An alarm should be raised after the predefined number of failed logins. Check the
alarm list with the alarm management application.
Tip: You can also use Element Manager instead of ssh for the test.
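The threshold lookup and the "no successful login in between" condition of step 3 can be sketched as follows. This is an illustration and not the PamAlarm.so logic: the function names are hypothetical, and falling back to 5 when the parameter is absent is an assumption based on the stated default.

```python
import re

DEFAULT_THRESHOLD = 5  # default number of consecutive failed logins


def alarm_threshold(pam_config_text):
    """Read alarmThreshold from the PamAlarm.so row of /etc/pam.d/ssh content."""
    for line in pam_config_text.splitlines():
        if "PamAlarm.so" not in line or line.lstrip().startswith("#"):
            continue
        match = re.search(r"\balarmThreshold=(\d+)\b", line)
        if match:
            return int(match.group(1))
    return DEFAULT_THRESHOLD


def alarm_expected(login_results, threshold):
    """True if `threshold` failed logins occur in a row.

    `login_results` is a sequence of booleans, True meaning a successful login.
    """
    consecutive = 0
    for succeeded in login_results:
        consecutive = 0 if succeeded else consecutive + 1
        if consecutive >= threshold:
            return True
    return False
```

With the default threshold, five failed logins in a row are expected to raise the alarm, whereas a successful login after the second failure restarts the count.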
19.9
70030 DISK DATABASE IS GETTING FULL
Probable cause: Storage capacity problem
Event type: Processing error
Default severity: Major
Meaning
The disk storage area reserved for disk database is filling up.
The disk database is still fully operational. If the database fills up completely, its services
cannot be properly used anymore.
Identifying additional information fields
Additional information fields
1. Max size: the maximum size of database in kB
2. Fill ratio: the fill ratio of the database (the percentage of the database that is
filled)
Instructions
The actions to be done in order to avoid a completely full database are database-specific, so contact your local Nokia Siemens Networks representative immediately and
provide them with the information you obtained from the alarm notification's fields.
Clearing
Clear the alarm with the alarm management application after correcting the fault as presented in Instructions.
Testing instructions
You can test the alarm either by filling the database until the allocated space exceeds
the fill ratio alarm limit, or by decreasing the fill ratio alarm limit under the current fill ratio
of the database. You can also combine these two approaches.
• In the first approach, you simply create a dummy table in the database and insert
rows into it until the fill ratio exceeds the fill ratio alarm limit (see attribute
fsdbFillRatioAlarmLimit in the DB fragment in LDAP - Lightweight Directory
Access Protocol).
• In the second approach, you must use a parameter management tool to change the
fsdbFillRatioAlarmLimit attribute of the DB fragment to a smaller value than
the current fill ratio of the database. After this, you must restart the recovery group
of the database (fshascli -r /<RG>). The current fill ratio of the database can
be estimated as follows:
1. Get the maximum size of the database either by checking the
innodb_data_file_path attribute from the MySQL instance configuration
file (/var/mnt/local/MySQL_<DBName>/my.cnf) or by connecting to the
instance and entering the following command:
SHOW GLOBAL VARIABLES LIKE 'innodb_data_file_path'\G
The maximum size is the sum of the maximum size of each InnoDB data file
listed in the value. For example, the following result means that the maximum
size is 500 MB (512,000 kB):
*************************** 1. row ***************************
Variable_name: innodb_data_file_path
Value: ibdata1:500M
2. Get the free space of the database by connecting to the instance and entering
the following command for any InnoDB table:
SHOW TABLE STATUS FROM <schema> LIKE '<table>'\G
where <schema> is the schema name of the InnoDB table and <table> is the
name of the table. The comment column of the result set shows the free space.
For example, the following result means that the database has 492,544 kB free
space (when using the example size of step 1, the result leads to a fill ratio of
3.8%):
mysql> SHOW TABLE STATUS FROM test LIKE 'mysqlwdtest'\G
*************************** 1. row ***************************
Name: mysqlwdtest
...
Comment: InnoDB free: 492544 kB
It does not matter which InnoDB table is used in the query.
3. Check the schema and the name of an arbitrary InnoDB table by using the following query:
SELECT table_schema,table_name
FROM information_schema.tables
WHERE engine = 'InnoDB'
LIMIT 1;
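The estimate described in steps 1 and 2 can be written out as a short calculation. The sketch below only reproduces the arithmetic of the example (1 MB counted as 1024 kB); the function names are hypothetical, and the handling of autoextend data files with a max size is an assumption based on the standard innodb_data_file_path syntax.

```python
import re

_UNIT_KB = {"K": 1, "M": 1024, "G": 1024 * 1024}


def _size_to_kb(text):
    match = re.fullmatch(r"(\d+)([KMG])", text.strip().upper())
    if match is None:
        raise ValueError("unsupported size: %r" % text)
    return int(match.group(1)) * _UNIT_KB[match.group(2)]


def max_size_kb(innodb_data_file_path):
    """Sum the maximum sizes of all data files listed in the value."""
    total = 0
    for spec in innodb_data_file_path.split(";"):
        fields = spec.strip().split(":")
        if "autoextend" in fields:
            if "max" not in fields:
                raise ValueError("autoextend file without max has no fixed maximum")
            total += _size_to_kb(fields[fields.index("max") + 1])
        else:
            total += _size_to_kb(fields[1])
    return total


def fill_ratio_percent(innodb_data_file_path, table_comment):
    """Fill ratio in percent from the data file path and the Comment column."""
    free = int(re.search(r"InnoDB free:\s*(\d+)\s*kB", table_comment).group(1))
    maximum = max_size_kb(innodb_data_file_path)
    return 100.0 * (maximum - free) / maximum
```

With the example values `ibdata1:500M` and `InnoDB free: 492544 kB`, the used space is 512000 - 492544 = 19456 kB, which gives the 3.8% of step 2. The alarm is expected once this value exceeds fsdbFillRatioAlarmLimit.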
19.10
70064 BACKUP ERROR
Probable cause: Application subsystem failure
Event type: Processing error
Default severity: Minor
Meaning
Backup has failed because of a fatal error or it has been interrupted.
As a result, either the backup archive does not exist or it is corrupted and unusable.
Identifying additional information fields
1. Backup log file. Identifies the name of the backup log file without the path.
The format is $BUTYPE_$BASE_$DATE, where $BUTYPE is either "FULL", "PARTIAL"
or "CUSTOM", $BASE is the name of the base delivery or the hostname (if the flexiserver link
is not present in the system), and $DATE is the current date in the format
YYYYMMDD_HHMMSS.
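The log file name format can be split into its parts as sketched below. The sketch assumes that the name carries no file extension and that $BASE may itself contain underscores; the function name and the base name used in the usage note are hypothetical.

```python
import re
from datetime import datetime

# $BUTYPE_$BASE_$DATE, with $DATE in the format YYYYMMDD_HHMMSS.
_LOG_NAME = re.compile(r"^(FULL|PARTIAL|CUSTOM)_(.+)_(\d{8}_\d{6})$")


def parse_backup_log_name(name):
    """Return (backup type, base, timestamp) parsed from a backup log file name."""
    match = _LOG_NAME.match(name)
    if match is None:
        raise ValueError("not a backup log file name: %r" % name)
    butype, base, stamp = match.groups()
    return butype, base, datetime.strptime(stamp, "%Y%m%d_%H%M%S")
```

For a name such as `PARTIAL_mybase_20101029_134500`, the function returns the type PARTIAL, the base `mybase` and the timestamp 2010-10-29 13:45:00.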
Additional information fields
Instructions
1. Locate the backup log from /var/mnt/local/backup/SS_Backup. The name of
the log file is given in the alarm.
2. See the backup summary at the end of the log.
3. Search the log contents for "ERROR" and "WARNING" statements to see which
backup module has failed.
4. Refer to the backup and restore troubleshooting instructions.
5. If the backup has failed before the log file has been created, search the syslog for
the latest fsbackup entries.
6. After the failure, re-execute the backup.
However, if the failure was caused by incorrect environment and/or configuration, refer
to backup and restore troubleshooting instructions and correct the environment and/or
configuration before re-executing the backup.
Clearing
Clear the alarm with an alarm management application after correcting the fault as presented in Instructions.
Testing instructions
1. Start a partial backup. For example:
fsbackup -p -v
2. Interrupt the process by pressing Ctrl-C.
The backup process raises an alarm.
Or
1. Lock a database recovery group (for example, TimesTen and Solid)
2. Execute custom backup, for example:
fsbackup -d -v
The backup process raises an alarm.
19.11
70110 CONFIGURATION OF NWI3 ADAPTER IS OUT OF ORDER
Probable cause: Configuration or customizing error
Event type: Processing error
Default severity: Minor
Meaning
The configuration file of the NWI3 adapter contains invalid attribute values. Depending on
the release, the configuration is stored either only in files or in both files and LDAP (Lightweight Directory Access Protocol).
The system ignores the invalid parameters and uses the default values or the closest
acceptable value. For example, the value 2000 is greater than the highest acceptable
value (1440) for heartbeatPeriod (see the table in the Instructions) and causes this
alarm. In this case, 1440 would be used as the heartbeatPeriod.
Identifying additional information fields
Attribute name: name of the attribute that has an invalid value
Additional information fields
File path: the path of the file that includes invalid attribute values; or LDAP branch: the
LDAP branch that includes invalid attribute values
Instructions
1. Correct the invalid attribute value. The attribute name is displayed in the Identifying
additional information field. The name of the configuration file is displayed in the
Additional information field.
The attributes that can cause this alarm are mainly stored in file
/var/opt/Nokia/SS_Nwi3Adapter/config/nwi3mdcorba.ini or LDAP
branch fsFragmentId=mediator, fsFragmentId=NWI3, fsClusterId=ClusterRoot. The
valid as well as default values of these attributes are presented in the table below.
The attribute names in LDAP are prefixed with fsnwi3.
Name                                      Type                                                                    Default
(fsnwi3)takeIntoUseNext                   boolean: (0=false,1=true) in nwi3mdcorba.ini and (false,true) in LDAP   0
(fsnwi3)registrationServiceIOR            string, a valid IOR to NetAct’s registration service                    empty string
(fsnwi3)heartbeatPeriod                   short: [0..1440] minutes, granularity: 1 minute                         15
(fsnwi3)reRegistrationPeriod              short: [15..1440] minutes, granularity: 1 minute                        60
(fsnwi3)registrationRetryBasePeriod       short: [5..240] minutes, granularity: 1 minute                          15
(fsnwi3)retryRandom                       short: [5..240] minutes, granularity: 1 minute                          5
(fsnwi3)rePublicationPeriod               short: [1..60] minutes, granularity: 1 minute                           3
(fsnwi3)getPublicationServiceRetryPeriod  short: [1..60] minutes, granularity: 1 minute                           15
Table 6 Valid and default attribute values of the NWI3 adapter configuration file
2. This alarm can also be caused by the parameter mediatorSessionManagerIOR
located in file /var/opt/Nokia/www/SessionManager_V1.ior.
Restart the NWI3 adapter to generate mediatorSessionManagerIOR into
SessionManager_V1.ior. In normal conditions, the restart generates the parameter with valid value.
3. If the problem results from the parameter systemID in file
/var/opt/Nokia/www/systemid.txt, the probable cause is that the file
systemid.txt is missing. The value in systemID should be the same as in the file
/etc/cluster-id.
Copy /etc/cluster-id to /var/opt/Nokia/www/systemid.txt and restart
the NWI3 adapter.
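The numeric ranges of Table 6 and the fallback described in Meaning (default value or closest acceptable value) can be sketched as follows. This is not the NWI3 adapter's code: the function name is hypothetical, and the rule that a non-numeric value falls back to the default while an out-of-range value is replaced by the closest limit is an assumption of this sketch.

```python
# (minimum, maximum, default) in minutes, taken from Table 6.
RANGES = {
    "heartbeatPeriod": (0, 1440, 15),
    "reRegistrationPeriod": (15, 1440, 60),
    "registrationRetryBasePeriod": (5, 240, 15),
    "retryRandom": (5, 240, 5),
    "rePublicationPeriod": (1, 60, 3),
    "getPublicationServiceRetryPeriod": (1, 60, 15),
}


def effective_value(name, raw):
    """Return (value taken into use, whether the configured value was invalid)."""
    low, high, default = RANGES[name]
    try:
        value = int(raw)
    except (TypeError, ValueError):
        return default, True  # assumed: unparsable values fall back to the default
    clamped = min(max(value, low), high)
    return clamped, clamped != value
```

For the example in Meaning, `effective_value("heartbeatPeriod", "2000")` returns `(1440, True)`: the value 1440 is used and the attribute is one that would be reported in the alarm.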
Clearing
Clear the alarm with alarm management application after correcting the fault as presented in Instructions.
Testing instructions
1. If file /var/opt/Nokia/SS_Nwi3Adapter/config/nwi3mdcorba.ini exists,
set the following content to it (no value for registrationServiceIOR and
takeIntoUseNext=1):
[DN:N3CF-1]
objectClassVersion=1
N3CFId=1
objectClass=N3CF
configurationActive=0
takeIntoUseNext=1
registrationServiceIOR=
registrationServiceUsername=Nemuadmin
registrationServicePassword=nemuuser
heartbeatPeriod=15
reRegistrationPeriod=60
registrationRetryBasePeriod=15
retryRandom=5
rePublicationPeriod=3
getPublicationServiceRetryPeriod=15
userLabel=
2. If branch fsFragmentId=mediator, fsFragmentId=NWI3, fsClusterId=ClusterRoot
exists in the LDAP, use parameter management application for creating a new child
to the branch. Enter the following attributes in the Add New Entry dialog:
• fsnwi3N3CFId=1
• takeIntoUseNext=1
3. Restart NWI3Adapter.
If file nwi3mdcorba.ini was modified in step 1, alarm 70110 with IAAI=registrationServiceIOR and AAI=/var/opt/Nokia/SS_Nwi3Adapter/config/nwi3mdcorba.ini is raised.
If LDAP was modified in step 2, alarm 70110 with IAAI=fsnwi3registrationServiceIOR
and AAI=fsnwi3N3CFId=1,fsFragmentId=mediator,fsFragmentId=NWI3,fsClusterId=ClusterRoot is raised.
19.12
70111 FAILED TO CREATE NETACT CONNECTION
Probable cause: Connection establishment error
Event type: Communications
Default severity: Major
Meaning
The NWI3 adapter failed to register to Nokia NetAct.
NetAct cannot subscribe to notifications or be used for managing the network element
(NE) via NWI3.
Identifying additional information fields
Additional information fields
Depending on the release
N3CFId: the naming attribute of the active N3CF instance in file
/var/opt/Nokia/SS_Nwi3Adapter/config/nwi3mdcorba.ini; or Distinguished name of the active N3CF instance in LDAP.
Instructions
1. If file /var/opt/Nokia/SS_Nwi3Adapter/config/nwi3mdcorba.ini
exists:
a) Make sure that the NetAct Registration Service IOR (parameter registrationServiceIOR) is filled in file
/var/opt/Nokia/SS_Nwi3Adapter/config/nwi3mdcorba.ini and
check the correctness of the IOR. The command printIOR <IOR> can be used
for viewing the IP address and port included in the IOR.
b) Verify that there is a valid username (parameter registrationServiceUsername)
and password (registrationServicePassword) to the registration service of
NetAct in file
/var/opt/Nokia/SS_Nwi3Adapter/config/nwi3mdcorba.ini.
c) Check the value of the takeIntoUseNext parameter in the nwi3mdcorba.ini
file. The value of the parameter in an active section should be 1, and the value
of the configurationActive parameter should also be 1.
The system sets the value of the configurationActive parameter automatically to
1 when a parameter set is taken into use.
2. If file /var/opt/Nokia/SS_Nwi3Adapter/config/nwi3mdcorba.ini does
not exist and NWI3 adapter's configuration is stored under branch fsFragmentId=mediator, fsFragmentId=NWI3, fsClusterId=ClusterRoot in the LDAP:
a) Verify that there is an LDAP entry fsnwi3N3CFId=<id>,fsFragmentId=mediator, fsFragmentId=NWI3 with attribute fsnwi3takeIntoUseNext=true, which
defines the active attribute set.
b) Make sure that the NetAct Registration Service IOR (attribute
fsnwi3registrationServiceIOR) has been specified for the active set and check
the correctness of the IOR. Command printIOR can be used for viewing the
IP address and port included in the IOR.
c) If attributes fsnwi3NEAccountUsername and fsnwi3NEAccountPassword exist
under branch fsFragmentId=security, fsFragmentId=NWI3, they are used for
NetAct registration. Verify that they are valid.
d) If attributes fsnwi3NEAccountUsername and fsnwi3NEAccountPassword do not
exist under branch fsFragmentId=security, fsFragmentId=NWI3, the initial
username (attribute fsnwi3initialRegistrationUsername) and password
(fsnwi3initialRegistrationPassword) defined in the active set are used for NetAct
registration. Verify that they are valid.
3. Verify that NetAct is up and running and check the connection between the NE and
NetAct. Ping NetAct from the node where the NWI3 adapter is running: ping -I
<node's external IP address> <NetAct's IP (see step 1)>.
4. Check that the NetAct hostname is configured in the external domain name system
(DNS) in use.
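The file-based checks in step 1 can be partly scripted. The following is a minimal sketch, assuming that nwi3mdcorba.ini contains a single parameter set in the plain key=value layout shown in the testing instructions. The function name check_nwi3_ini is hypothetical and not part of the product, and the sketch does not verify that the IOR or the credentials are actually valid.

```shell
#!/bin/sh
# Hypothetical helper (not part of iOMS): report obviously missing values in
# an nwi3mdcorba.ini-style file. Assumes plain key=value lines and a single
# parameter set in the file.
check_nwi3_ini() {
    ini="$1"
    status=0
    # These parameters must not be empty for NetAct registration.
    for key in registrationServiceIOR registrationServiceUsername registrationServicePassword; do
        value=$(sed -n "s/^${key}=//p" "$ini" | head -n 1)
        if [ -z "$value" ]; then
            echo "MISSING: $key"
            status=1
        fi
    done
    # The parameter set to be taken into use must have takeIntoUseNext=1.
    if ! grep -q '^takeIntoUseNext=1$' "$ini"; then
        echo "NOT ACTIVE: no parameter set has takeIntoUseNext=1"
        status=1
    fi
    return $status
}
```

It could be called as check_nwi3_ini /var/opt/Nokia/SS_Nwi3Adapter/config/nwi3mdcorba.ini; a non-zero return value means that at least one of the checks failed.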
Clearing
The alarm system clears the alarm automatically after the fault has been corrected.
Testing instructions
1. If file /var/opt/Nokia/SS_Nwi3Adapter/config/nwi3mdcorba.ini exists:
a) Set the following content to it (a valid registrationServiceIOR of a non-existent
NetAct object and takeIntoUseNext=1):
[DN:N3CF-1]
objectClassVersion=1
N3CFId=1
objectClass=N3CF
configurationActive=0
takeIntoUseNext=1
registrationServiceIOR=IOR:000000000000002449444c3a4e574933
2f526567697374726174696f6e536572766963655f56313a312e3000000
000010000000000000064000102000000000e3137322e32312e3232302e3
631009c3f0000002400504d43000000040000000a2f4e657441637452530
02020000000084e657441637452530000000256495303000000050005070
17d00000000000000000000080000000056495300
registrationServiceUsername=Nemuadmin
registrationServicePassword=nemuuser
heartbeatPeriod=15
reRegistrationPeriod=60
registrationRetryBasePeriod=15
retryRandom=5
rePublicationPeriod=3
getPublicationServiceRetryPeriod=15
userLabel=
b) Verify that NetAct's registration service is not running in the IP address and port
defined by registrationServiceIOR.
c) Restart NWI3Adapter.
Alarm 70111 with AAI=1 is raised.
2. If branch fsFragmentId=mediator, fsFragmentId=NWI3, fsClusterId=ClusterRoot
exists in the LDAP:
a) Use parameter management tool for creating a new child to the branch. Enter
the following attributes in the Add New Entry dialog:
• fsnwi3N3CFId=1
• takeIntoUseNext=1
• fsnwi3registrationServiceIOR=
IOR:000000000000002449444c3a4e5749332f526567697374
726174696f6e536572766963655f56313a312e300000000001
0000000000000064000102000000000e3137322e32312e3232
302e3631009c3f0000002400504d43000000040000000a2f4e
65744163745253002020000000084e65744163745253000000
025649530300000005000507017d000000000000000000000
80000000056495300
b) Restart NWI3Adapter.
Alarm 70111 with AAI="fsnwi3N3CFId=1,fsFragmentId=mediator,fsFragmentId=NWI3,fsClusterId=ClusterRoot" is raised.
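The sample IOR used above carries the NetAct host and port as hexadecimal bytes, which is the information that printIOR displays. As an illustration only, the sketch below decodes two fragments copied by hand from the sample IOR. The function name hex_to_ascii is hypothetical, the fragment positions are not computed, and this is not a full IOR parser; printIOR remains the tool to use on a real system.

```shell
#!/bin/sh
# Hypothetical helper: decode a string of hexadecimal byte pairs into ASCII.
hex_to_ascii() {
    hex="$1"
    # Refuse input that is not made of whole byte pairs.
    [ $(( ${#hex} % 2 )) -eq 0 ] || return 1
    while [ -n "$hex" ]; do
        pair=${hex%"${hex#??}"}   # first two characters
        hex=${hex#??}             # remaining characters
        # Convert the pair to a three-digit octal escape and print that byte.
        printf "\\$(printf '%03o' "0x$pair")"
    done
    printf '\n'
}

# Host fragment copied from the sample IOR in the testing instructions.
hex_to_ascii 3137322e32312e3232302e3631   # prints 172.21.220.61
# The two bytes after the 00 that terminates the host string hold the port.
echo $((0x9c3f))                           # prints 39999
```

The decoded address and port are the ones on which, according to step 1 b), the NetAct registration service must not be running for the test to raise the alarm.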
19.13
70156 DISK DATABASE WATCHDOG START-UP FAILED
Probable cause: Configuration or Customizing Error
Event type: Processing error
Default severity: Critical
Meaning
Start-up of the disk database watchdog has failed due to a configuration error, or other
reasons.
Because the disk database and its watchdog belong to the same Recovery Unit (RU),
the disk database watchdog start-up failure means that the database is not available.
Identifying additional information fields
Additional information fields
1. Reason. Possible values:
• Disk database watchdog failed to read the parameters from the parameter management system.
• Invalid or missing parameter value.
2. List of invalid or missing parameters if the reason for the alarm is 2.
Instructions
Check the Application Additional Information field for a reason for the configuration
error:
• Reason 1: Disk database watchdog failed to read the parameters from parameter management
• Reason 2: Invalid or missing parameter value
Continue according to the following procedure:
1. Check that the following parameters exist in parameter management for each
database entry in the database fragment with the DN (Distinguished Name) "fsFragmentId=DB, fsClusterId=ClusterRoot":
fsdbRedundancyModel
fsdbDataSourceName
fsdbFillRatioAlarmLimit
fsdbFillRatioCheckFreq
2. Use parameter management application to get the values of those parameters for
the database in question. To find those parameters, use the value of the Managed
Object field in alarm management application, for example:
fsdbName=DB_Alarm,fsFragmentId=DB,fsClusterId=ClusterRoot
3. Send the found values and/or parameters that do not exist (parameters for which
the fields are empty) to your local Nokia Siemens Networks representative.
Clearing
Clear the alarm using alarm management application after correcting the fault.
Testing instructions
1. Use a parameter management application to change the
fsdbFillRatioAlarmLimit or fsdbFillRatioCheckFreq attribute of the
database to a non-numeric value.
2. Restart the recovery group of the database.
19.14
70157 CPU USAGE OVER LIMIT
Probable cause: Threshold crossed
Event type: Quality of service
Default severity: Major
Meaning
A processor is being used at a very high throughput level because the execution of some
processes is taking a lot of CPU time.
There is a risk that the node is unable to fulfill the tasks allocated to it. This depends on
the extent to which the processes taking most of the CPU time are blocking other processes from getting runtime on the CPU, and on whether the increase in throughput is temporary or permanent.
If the processor is constantly used at a very high throughput level, the system might
appear very slow. For example, the execution of commands takes an unusually long
time to finish.
Identifying additional information fields
1. CPU index (optional).
Additional information fields
Instructions
1. Run the
top
Linux command on the node that reports the alarm. The command gives a repetitive
update of processor activity in real time. It gives a listing of the most CPU-intensive
tasks of the system.
2. If the problem persists, contact your local Nokia Siemens Networks representative
and provide the information gathered in the previous step.
Clearing
The alarm is cleared automatically by the operating system's fault detector once the
CPU usage is on a low enough level. The raising and clearing thresholds are different to
prevent unnecessary thrashing.
Testing instructions
Do not test this alarm, because testing it will result in reduced quality of service.
19.15
70158 FILE SYSTEM USAGE OVER LIMIT
Probable cause: Threshold Crossed
Event type: Quality of service
Default severity: Major
Meaning
The available disk space on a partition is smaller than the minimal requirement. The partition can be filled up, for example, by crashing programs that produce large core files or by
large log files, if the rotation of logs does not function.
There is a risk that some data cannot be written to the disk.
Identifying additional information fields
Mountpoint
Additional information fields
Instructions
1. Run the
df -k <mountpoint>
Linux command on the node that reports the alarm to get a report of the usage of
the file system disk space in 1 kilobyte blocks.
See the mountpoint in the Identifying additional information fields of the alarm.
Alternatively, run the Linux command
df -h <mountpoint>
to see the information in a human readable format.
2. Run the Linux command du -k or du -h on the node that reports the alarm to
discover the directories that consume most of the space.
3. Check with
du -h /var/tmp/.
if /var/tmp is among the large directories. If it is, remove the unnecessary files.
4. Check with
du -h /var/log/.
if /var/log is among the large directories. If it is, move the old files outside the
Network Element (NE) using the appropriate network management tools.
5. Check with
du -h /var/crash/.
if /var/crash is among the large directories. If it is, move the core files outside the
NE using the appropriate network management tools.
6. If the alarm is not cleared, contact your local Nokia Siemens Networks representative.
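Steps 3 to 5 repeat the same check for three directories. A minimal sketch of doing this in one pass is given below. The function name largest_dirs is hypothetical; only the standard du and sort commands are used.

```shell
#!/bin/sh
# Hypothetical helper: print the disk usage (in kilobytes) of each given
# directory, largest first. Directories that do not exist are skipped.
largest_dirs() {
    for dir in "$@"; do
        [ -d "$dir" ] && du -sk "$dir" 2>/dev/null
    done | sort -rn
}
```

For the directories named in the steps above, it could be called as largest_dirs /var/tmp /var/log /var/crash. The sketch only reports sizes; removing or moving files remains a manual decision.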
Clearing
The alarm is automatically cleared by the operating system's fault detector once the
amount of available disk space increases above the specified limit. The raising and clearing
thresholds are different to prevent unnecessary thrashing.
Testing instructions
Do not test this alarm, because testing it in a live system will reduce the quality of
service.
19.16
70159 MANAGED OBJECT FAILED
Probable cause: Software program abnormally terminated
Event type: Processing error
Default severity: Major
Meaning
The named managed object (MO) has failed. The managed object can be a software,
hardware or logical entity. The type of the managed object identifies the following:
• Node: The physical computing node, its system software, or operating system has
failed, or the node has been manually restarted.
• Recovery Unit (RU): A recovery unit contains one or more processes. A recovery
unit failure is usually caused by a process failure.
• Process: The process has crashed, terminated abnormally or stopped responding.
• Recovery Group (RG): A recovery group consists of one or more recovery units. A
recovery group failure alarm is raised for an active-standby configuration, when both
redundant components (recovery units of the recovery group) have failed. This is
always a serious situation as it indicates a double failure (for example, two nodes
have failed at the same time).
The effect of the situation depends on the managed object type:
• Node: Any important services/functions that are provided with an active-standby or
N+M recovery group may be taken over by other operational nodes. Services may
be down if standby/spare nodes are also down.
• Recovery Unit (RU): If the recovery unit belongs to an active-standby or N+M
recovery group, the service may be taken over by an operational standby/spare
recovery unit.
• Process: The service or function that the process provides is not available. A
process failure can cause a recovery unit level recovery action or the system may
attempt to restart the failed process.
• Recovery Group (RG): The service provided by the recovery group is not available.
Manual correction is required, as the automatic system repair actions have not
solved the problem.
The system High Availability Services (HAS) will periodically attempt to solve the
problem with corrective actions, such as switchovers or restarts. The alarm system also
clears the obsolete alarms that may have been raised by this managed object or by its
child managed objects.
Identifying additional information fields
Additional information fields
1. Identifies the managed object type: "Node", "Recovery unit", "Process" or "Recovery
group".
2. Explains the string of the fault type (if that information is available) or just the string
"failure".
For example:
"Process has stopped responding to heartbeats"
"Node connection heartbeat failure"
"Recovery group failure"
Instructions
1. Log into the cluster and check that the named managed object has been successfully restarted.
2. Verify also that the MO did not raise any new alarms that would explain the failure.
You can check the status of an MO with the HAS user interface tool fshascli. An operational MO has the value ENABLED in the operational state attribute and an empty procedural status attribute.
For example, the state of the process NodeDNS in the recovery unit FSNodeDNSServer
of the node AS-5 can be seen as follows:
$ fshascli --status /AS-5/FSNodeDNSServer/NodeDNS
/AS-5/FSNodeDNSServer/NodeDNS:
administrative(UNLOCKED)
operational(ENABLED)
usage(ACTIVE)
procedural()
availability( )
unknown(FALSE)
role(ACTIVE)
If the MO is not operational, perform the following steps:
1. With a node MO, you can wait for a node restart. The system will raise another alarm
(70011 NODE NOT RESPONDING) if the node does not come up within some time.
2. Check the system logs (/var/log/master-syslog on the active CLA node) for
error(s) that have occurred by searching for the MO's name and/or by looking at
events that occurred before this alarm was raised.
3. You can also use the HAS user interface tool to initiate an immediate restart attempt
of the failed MO using the -r (--restart) command line option:
$ fshascli --restart /AS-5/FSNodeDNSServer
The restart operation is mostly useful after a problem has been corrected. Verify the
result from the syslog and by checking the status of the MO.
4. An alarm for a recovery group implies a multiple error situation (for example, multiple
node failures) or a persistent configuration or corruption problem. In this case,
contact your local Nokia Siemens Networks representative.
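The operational check described above (ENABLED operational state and an empty procedural status) can be sketched as a small filter. This assumes that fshascli prints the state attributes in the form shown in the NodeDNS example; the function name mo_is_operational is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read "fshascli --status <MO>" output on standard input
# and succeed only if the MO looks operational, that is, it reports
# operational(ENABLED) and an empty procedural status.
mo_is_operational() {
    status_text=$(cat)
    printf '%s\n' "$status_text" | grep -q '^[[:space:]]*operational(ENABLED)' || return 1
    printf '%s\n' "$status_text" | grep -q '^[[:space:]]*procedural()' || return 1
    return 0
}
```

A possible use is fshascli --status /AS-5/FSNodeDNSServer/NodeDNS | mo_is_operational && echo "MO is operational". The sketch only interprets the text it is given and does not call fshascli itself.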
Clearing
The system clears the alarm automatically when the fault has been corrected.
Testing instructions
Scenario 1: Alarm for a node
1. Restart an operational unlocked node using fshascli. For example,
$ fshascli --state /AS-1
/AS-1
administrative(UNLOCKED)    <== Unlocked
operational(ENABLED)        <== Operational
usage(IDLE)
procedural()
availability()
unknown(FALSE)
alarm()
$ fshascli --restart --nowarning /AS-1
/AS-1 is restarted successfully
2. Wait for a few seconds for the node to turn DISABLED. The alarm is raised after this.
For example,
$ fshascli --state /AS-1
/AS-1
administrative(UNLOCKED)    <== Unlocked
operational(DISABLED)       <== No longer operational
usage(IDLE)
procedural(TERMINATING)
availability()
unknown(FALSE)
alarm()
$ fshascli --state /AS-1
/AS-1
administrative(UNLOCKED)    <== Unlocked
operational(DISABLED)       <== No longer operational
usage(IDLE)
procedural(TERMINATING)
availability()
unknown(FALSE)
alarm(MAJOR,OUTSTANDING)    <== Alarm has been raised
The alarm raising is also visible in the syslog as a message that begins as follows:
ALARM RAISE SP=70159 . . .
The alarm is automatically cancelled when the node has successfully restarted.
The alarm cancellation is also visible in the syslog as a message that begins as
follows:
ALARM CANCEL SP=70159 . . .
Scenario 2: Alarm for a process
1. Terminate an operational and unlocked "modest" severity process. An operational
process has ENABLED operational state and an empty procedural status. You can
search for modest criticality processes with the fshascli command --view. For
example,
$ fshascli --view --filter process "/*/*/*"
. . .
/TA-A/TestApplAServer/TestProcA:
Process
/TA-A/TestApplAServer/TestProcA
command=(/opt/Nokia/SS_ABC/bin/testProcA)
status=(fullHA)
startMethod=(requested)
severity(modest)
. . .
$ fshascli --state /TA-A/TestApplAServer/TestProcA
/TA-A/TestApplAServer/TestProcA:
administrative(UNLOCKED)    <== Unlocked
operational(ENABLED)        <== Operational
usage(ACTIVE)
procedural()                <== Empty PS = running
availability()
unknown(FALSE)
alarm()
role(ACTIVE)
$ ssh TA-A killall testProcA
2. Verify that the alarm was raised and (very likely) also immediately cancelled. The
HAS cancels the alarm immediately if the process repair cycle allowed an immediate
restart.
The alarm raising is also visible in the syslog as a message that begins as follows:
ALARM RAISE SP=70159 . . .
Similarly, the alarm cancellation is also visible in the syslog as a message that begins
as follows:
ALARM CANCEL SP=70159 . . .
Scenario 3: Alarm for a recovery unit
1. Terminate an operational and unlocked "important" severity process. This causes a
failure of the recovery unit. An operational process has ENABLED operational state
and an empty procedural status. You can search for important criticality processes
with the fshascli command --view. For example,
$ fshascli --view --filter process "/*/*/*"
. . .
/TA-A/TestApplBServer/TestProcB:
Process
/TA-A/TestApplBServer/TestProcB
command=(/opt/Nokia/SS_ABC/bin/testProcB)
status=(fullHA)
startMethod=(requested)
severity(important)
. . .
$ fshascli --state /TA-A/TestApplBServer/TestProcB
/TA-A/TestApplBServer/TestProcB:
administrative(UNLOCKED)    <== Unlocked
operational(ENABLED)        <== Operational
usage(ACTIVE)
procedural()                <== Empty PS = running
availability()
unknown(FALSE)
alarm()
role(ACTIVE)
$ ssh TA-A killall testProcB
2. Verify that the alarm was raised and (very likely) also immediately cancelled. The
HAS cancels the alarm immediately if the recovery unit repair cycle allowed an
immediate restart.
The alarm raising is also visible in the syslog as a message that begins as follows:
ALARM RAISE SP=70159 . . .
Similarly, the alarm cancellation is also visible in the syslog as a message that
begins as follows:
ALARM CANCEL SP=70159 . . .
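In all three scenarios the raise and cancel events are verified from the syslog. A minimal sketch of a filter for this is shown below, assuming only the "ALARM RAISE SP=<number>" and "ALARM CANCEL SP=<number>" message text quoted above. The function name alarm_events is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the raise and cancel messages of one alarm
# number from a syslog file. Assumes the "ALARM RAISE SP=<number>" and
# "ALARM CANCEL SP=<number>" message text quoted in the scenarios.
alarm_events() {
    alarm="$1"
    logfile="$2"
    # The trailing group prevents 70159 from also matching longer numbers.
    grep -E "ALARM (RAISE|CANCEL) SP=${alarm}([^0-9]|\$)" "$logfile"
}
```

For example, alarm_events 70159 /var/log/master-syslog on the active CLA node lists the events in the order in which they were logged, so a raise followed by a cancel indicates that the repair succeeded.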
19.17
70160 MEMORY USAGE OVER LIMIT
Probable cause: Threshold crossed
Event type: Quality of service
Default severity: Major
Meaning
Memory consumption is too high because some processes are using too much memory.
There is a risk that the node is unable to fulfil the tasks allocated to it because the processes cannot reserve enough memory for their use. As a result, the processes cannot
perform the tasks allocated to them.
Identifying additional information fields
Additional information fields
Instructions
1. Run the
top
Linux command on the node that reports the alarm to view a snapshot of the current
global memory usage.
Press M to sort the processes in the node based on their memory resident size to
check which processes consume the most memory.
2. If the problem persists, contact your local Nokia Siemens Networks representative
and provide them with the information gathered in the previous step.
Clearing
The alarm is automatically cleared by the operating system's fault detector once the
memory usage is on a low enough level. The raising and clearing thresholds are different
to prevent unnecessary thrashing.
Testing instructions
Do not test this alarm, because testing it will result in reduced quality of service.
19.18
70161 OPERATING SYSTEM MONITORING FAILURE
Probable cause: System call unsuccessful
Event type: Processing error
Default severity: Major
Meaning
The fault detector in the operating system has failed to capture the statistics of the usage
of a given resource.
The state of the named device cannot be discovered, which may indicate that there are
some fundamental problems with it.
Identifying additional information fields
1. Failed subsystem
2. Failed resource, where the values are
• CPU: Index of the processor
• FILESYSTEM: Name of the mountpoint
• ETHERNET: Name of the interface
• MEMORY:
• RAID: Name of the device
• FC (Fibre Channel):
Additional information fields
Instructions
If the alarm is not cleared automatically, contact your Nokia Siemens Networks representative.
Clearing
Do not clear the alarm. The alarm is automatically cleared when the fault detector of the
operating system is able to capture the statistics of the failed resource.
Testing instructions
This alarm is difficult to test, because the hardware problem cannot be simulated.
19.19
70162 RAID ARRAY HAS BEEN DEGRADED
Probable cause: Disk problem
Event type: Equipment
Default severity: Major
Meaning
Redundancy of the RAID array is lost. A device belonging to the RAID array can be
marked faulty by the system. The alarm may be caused either by errors in the fibre
channel (FC) or small computer system interface (SCSI) bus, or by potentially broken
disk media.
In the case of a subsequent disk failure, data will be lost.
Identifying additional information fields
1. RAID array.
Additional information fields
2. Faulty device (optional).
Instructions
If the hardware is FlexiServer Blade Hardware, then follow these instructions:
1. Use the command cat /proc/mdstat to check the status of the RAID array found in
the Identifying additional information field of the alarm on the node that reports the
alarm.
The [UU] field printed by the command describes whether both of the disks are in
the RAID array or not. If this field contains [_U] or [U_], one of the disks is not in the
RAID array.
2. The redundancy of the RAID array should be automatically restored by the system
within an hour. If the problem persists and the alarm is not cleared within an hour,
contact your local Nokia Siemens Networks representative.
3. If the problem persists, try changing the faulty disk according to the hardware maintenance instructions. If that does not help, contact your local Nokia Siemens
Networks representative.
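The [UU] check in step 1 can be sketched as a small filter over /proc/mdstat. The sketch assumes the usual Linux mdstat layout, in which each array starts with a line such as "md8 : active raid1 ..." followed by a line carrying the member-status field. The function name degraded_arrays is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: list md arrays whose member-status field contains "_",
# for example [_U] or [U_]. Reads /proc/mdstat-style text from the given file,
# or from /proc/mdstat if no file is given.
degraded_arrays() {
    awk '
        /^md[^ ]* :/ { array = $1 }    # remember the current array name
        /\[[U_]*_[U_]*\]/ && array != "" { print array; array = "" }
    ' "${1:-/proc/mdstat}"
}
```

An empty result means that no array in the file shows a missing member. The result can be compared with the RAID array named in the Identifying additional information field of the alarm.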
If the hardware is IBM BladeCenter, then follow these instructions:
1. Check the Maintenance Module and find the faulty disk and the possible cause of
the fault. Replace the faulty disk with a new disk, referring to the hardware maintenance documentation for detailed replacement instructions.
2. The redundancy of the RAID array should be automatically restored by the system
within an hour. If the problem persists and the alarm is not cleared within an hour,
contact your local Nokia Siemens Networks representative.
Clearing
The alarm is automatically cleared by the operating system's fault detector once the
redundancy of the RAID array is restored.
Testing instructions
Do not test this alarm in a live system. Any real disk faults during the execution of this
test may lead to data corruption.
19.20
70163 ETHERNET INTERFACE USAGE OVER LIMIT
Probable cause: Threshold Crossed
Event type: Quality of service
Default severity: Minor
Meaning
The Ethernet interface is used at a very high level. This alarm may be raised, for
example, when large files are copied over the network causing a lot of network file
system (NFS) traffic.
Packets are not lost yet, but if the load on the interface keeps increasing, packets might
eventually be lost.
Identifying additional information fields
1. Bonding interface
2. Ethernet interface
Additional information fields
Instructions
This is an informative alarm and does not require direct actions.
Clearing
The alarm is automatically cleared by the operating system's fault detector once the
Ethernet load has decreased to a tolerable level.
Testing instructions
Do not test this alarm, because testing it will create instability in the system.
19.21
70164 ETHERNET LINK FAILURE
Probable cause: Link failure
Event type: Equipment
Default severity: Minor
Meaning
The redundancy of Ethernet is lost because of an Ethernet link failure. The error might
have been caused by a hardware failure (that is, a potentially broken Ethernet port), by
an unplugged cable on the front panel of the gateway (GW) node, or by a program or
user issuing a command that shuts down the Ethernet interface.
In case of a subsequent link failure, Ethernet packets are lost, which means that the
node cannot receive or transmit data over the network.
Identifying additional information fields
1. Bonding interface
2. Ethernet interface
Additional information fields
Instructions
1. If the alarm is raised for an external Ethernet interface, check that the cable is
properly connected in the front panel of the GW node.
2. Take a console connection to the node with the alarming interface.
3. Check the status of the interface with the following command:
ifconfig -a <interface>
For example, ifconfig -a eth0
4. If the interface does not have the UP and RUNNING flags set, try to
bring the interface up with the following command:
ifup <interface>
For example, ifup eth0
5. If the previous steps have not resolved the situation, contact your local Nokia
Siemens Networks representative.
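The flag check in steps 3 and 4 can be sketched as a filter over the ifconfig output. The sketch assumes that the flags appear in the output as the separate words UP and RUNNING; the function name link_is_up is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read "ifconfig <interface>" output on standard input
# and succeed only if both the UP and RUNNING flags are present.
link_is_up() {
    flags=$(cat)
    printf '%s\n' "$flags" | grep -qw 'UP' || return 1
    printf '%s\n' "$flags" | grep -qw 'RUNNING' || return 1
    return 0
}
```

A possible use is ifconfig -a eth0 | link_is_up || echo "eth0 is not UP and RUNNING", after which the ifup command from step 4 can be tried manually.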
Clearing
The alarm is automatically cleared by the operating system's fault detector when the
Ethernet link comes up.
Testing instructions
Do not test this alarm, because testing it will create instability in the system.
19.22
70166 MANAGED OBJECT LOCKED
Probable cause: Software program abnormally terminated
Event type: Processing error
Default severity: Warning
Meaning
The administrative state of the named managed object (MO), which can be a cluster, a
node, or a recovery unit (RU), has changed to LOCKED as a result of a user action (graceful shutdown or lock operation).
The named MO and its child MOs have been stopped and will not be started before a
corresponding unlock operation is performed by the user. The service provided by the
MO is not available, unless the MO is an RU with some operational and UNLOCKED redundant resources.
When an MO is locked, the alarm system of the cluster clears the alarms raised by the
MO and its child MOs.
Identifying additional information fields
Additional information fields
Identifies the MO type: a cluster, a node, or an RU.
Instructions
This is an informative alarm and does not require any actions.
Clearing
Do not clear the alarm. This is an informative alarm and will be cleared automatically by
the alarm system after its time to live has expired.
Testing instructions
Lock the managed object using fshascli. For example:
$ fshascli --lock --nowarning /AS-1/FSNodeDNSServer
The alarm raising is also visible in the syslog as a message that begins as follows:
ALARM RAISE SP=70166...
Note that the test case for alarm 70189 MANAGED OBJECT UNLOCKED BY OPERATOR
should be run after this to restore the initial situation.
19.23
70168 CLUSTER STARTED (RESTARTED)
Probable cause: Software environment problem
Event type: Processing error
Default severity: Major
Meaning
The whole cluster is starting or restarting.
Starting or restarting of the whole cluster means (re)starting of all managed objects
within the cluster.
The (re)start may have been initiated by an operator or be caused by fatal errors in some
critical hardware or software component. When the cluster is restarted, the alarm
system clears all alarms that were raised by the cluster's managed objects before the
restart.
Identifying additional information fields
Additional information fields
Instructions
This alarm is an informative alarm indicating that the whole cluster has been (re)started.
As this operation is critical for software and hardware, check carefully the alarm status
in the cluster after the restart.
Clearing
Clear the alarm after carefully checking the alarm status in the cluster.
Testing instructions
1. Restart the cluster using fshascli:
$ fshascli --restart --nowarning /
2. Wait for the cluster to restart
The alarm is visible in the alarm database (if configured) and in syslog as a message
that begins as follows:
ALARM RAISE SP=70168 ...
3. Note that all services are unavailable during restart.
19.24
70173 BACKEND DATABASE REQUIRED BY CORBA NAMING SERVICE IS UNAVAILABLE
Probable cause: Underlying Resource Unavailable
Event type: Processing error
Default severity: Major
Meaning
The MySQL database instance DB_CosNaming, used by the private CORBA naming
service (NaS) instance, cannot be contacted by the NaS wrapper. Note that the recovery
group that owns the backend database is NamingServiceDB and CORBA NaS
instances belong to recovery group PrivateCosNaming.
The CORBA NaS is not able to store data in the database. Therefore the CORBA NaS
is not functional and replies to the high availability services (HAS) heartbeats with a
failure indication.
Identifying additional information fields
Additional information fields
Instructions
• Check that the error situation still exists:
/opt/Nokia/SS_Naming/bin/ns_listall
This command should list the content of the private naming graphs when the NaS
is working correctly. If the command throws exceptions, the NaS is not working correctly, which may result, for example, from an unavailable backend database.
• Check if the backend database DB_CosNaming (RG NamingServiceDB) is
unlocked and active.
fshascli -s /NamingServiceDB
If the NamingServiceDB is locked, unlock it.
fshascli -u /NamingServiceDB
After a few seconds the database should have restarted and the NaS should have
automatically re-established connections. Verify the restart and the re-established
connections by issuing the ns_listall command mentioned above.
• If this does not solve the problem, there is something wrong with the database
deployment or configuration. In that case, the alarm 70156 DISK DATABASE
WATCHDOG START-UP FAILED should also have been raised by the MySQL DB watchdog
dedicated to the DB_CosNaming database instance.
The following steps describe the error checking procedure if NamingServiceDB RG
fails (see alarm description 70156 DISK DATABASE WATCHDOG START-UP
FAILED for more information).
1. Check the master-syslog for any indication of errors.
less /var/log/master-syslog
2. Check that the LDAP (Lightweight Directory Access Protocol) server is up and
running.
• Check that the RG owning the LDAP server is unlocked.
fshascli -s /Directory
• Check that the LDAP server is really working by listing the content of the LDAP
tree (CTRL-C aborts the listing).
ldapsearch
3. If the LDAP is working correctly, check that the DB directory mount is functional:
• Lock the NamingServiceDB RG (if not yet locked).
• Mount the database directory manually.
a) Create the SW RAID (md device) on which the DB_CosNaming directory is
stored.
create_sw_raid /dev/md8 \
/dev/VG_62/MySQL_DB_CosNaming \
/dev/VG_63/MySQL_DB_CosNaming
Note that the device paths given as arguments above may be different in
your system.
Check the correct device paths from:
/opt/Nokia_BP/etc/ldapfile/ldif_in/PFSAN*.ldif
The device paths are defined under an entry defining the FSHWSWRAID
object class for the NaS:
dn: fshwStorageResourceName=/dev/md8, fshwSANName=0,
fsFragmentId=HW, fsClusterId=ClusterRoot
fshwStorageResourceName: /dev/md8
objectClass: FSHWStorageResource
objectClass: FSHWSWRAID
objectClass: extensibleObject
fshwRAIDLevel: 1
fshwPartitionName: /dev/VG_62/MySQL_DB_CosNaming
fshwPartitionName: /dev/VG_63/MySQL_DB_CosNaming
fsUserComment: MySQL DB for CORBA Naming Service
b) Mount the directory.
mkdir /tmp/tmp_nasDB
mount /dev/md8 /tmp/tmp_nasDB
Remember to unmount the directory and to stop the md device after the following
checks have been performed (see the last step).
4. Check that the database disk content is accessible and readable
ls -la /tmp/tmp_nasDB
5. Check that the my.cnf and odbc.ini files exist in that directory and have read
access rights. Check also that these files are identical to those under the
SS_Naming home directory.
diff /tmp/tmp_nasDB/odbc.ini /opt/Nokia/SS_Naming/etc/odbc.ini
diff /tmp/tmp_nasDB/my.cnf /opt/Nokia/SS_Naming/etc/my.cnf
6. Check the mysql.err file for any error indications. This file is also located in
the /tmp/tmp_nasDB directory.
7. Remove the mount and stop the md devices
a) Unmount and remove the directory.
umount /tmp/tmp_nasDB
rmdir /tmp/tmp_nasDB
b) Stop the md device.
mdadm --manage -S /dev/md8
If any of the preceding checks fail, a major software failure exists in the system. In
that case, contact your Nokia Siemens Networks representative with the information
gathered during the preceding steps.
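The mount checks above (steps 3b to 7) can be sketched as a small shell helper that always performs the clean-up, even when a check fails. This is only a sketch under stated assumptions: the names run, check_nas_db_dir and DRY_RUN are hypothetical, the md device is expected to have been created already with create_sw_raid (step 3a), the NamingServiceDB RG is expected to be locked, and the device path /dev/md8 may be different in your system. The mysql.err file (step 6) still has to be inspected manually.

```shell
#!/bin/sh
# Sketch of steps 3b to 7 above. With DRY_RUN=1 (the default here) the
# commands are only printed, so the sequence can be reviewed first.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Usage: check_nas_db_dir [md-device] [mount-point]
# Returns 0 when the listing and both diffs succeed, 1 otherwise.
check_nas_db_dir() {
    md_dev=${1:-/dev/md8}            # may differ in your system
    mnt=${2:-/tmp/tmp_nasDB}
    etc=/opt/Nokia/SS_Naming/etc
    status=0

    run mkdir "$mnt" || return 1
    if run mount "$md_dev" "$mnt"; then
        run ls -la "$mnt" || status=1
        for f in odbc.ini my.cnf; do
            run diff "$mnt/$f" "$etc/$f" || status=1
        done
        run umount "$mnt" || status=1
    else
        status=1
    fi
    # Clean up even when a check failed.
    run rmdir "$mnt"
    run mdadm --manage -S "$md_dev"
    return "$status"
}
```

Calling check_nas_db_dir without arguments prints the eight commands in order. Setting DRY_RUN=0 executes them instead.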
Clearing
HAS clears the alarm automatically when it has detected the NaS to be faulty and therefore restarted the PrivateCosNaming recovery group.
However, if the backend database remains faulty, the alarm is raised again. This may
result in a restart loop constantly raising the same alarm. Therefore, if the problem
seems to be permanent, it is recommended to lock the NaS and the database recovery
groups with the following commands:
fshascli -l /NamingServiceDB
fshascli -l /PrivateCosNaming
and to clear the alarm manually before performing the steps for solving the error.
Testing instructions
1. Unlock the NamingServiceDB RG.
2. Unlock the CosNaming and PublicCosNaming RGs.
3. Running the command /opt/Nokia/SS_Naming/bin/ns_listall should list
all the objects bound in the name service. This shows that the Naming Service is functional.
4. Lock the NamingServiceDB RG.
Within some tens of seconds the alarm should be raised.
Clearing:
1. Lock the CosNaming and PublicCosNaming RGs.
2. Unlock the NamingServiceDB RG.
3. Unlock the CosNaming and PublicCosNaming RGs.
4. Check with /opt/Nokia/SS_Naming/bin/ns_listall that the naming service
is functional again.
The alarm should be cleared at this point.
The alarm is automatically cleared by the naming service when it re-establishes connections to the database.
19.25 70186 CLUSTER OPERATION INITIATED BY OPERATOR
Probable cause: Congestion
Event type: Quality of service
Default severity: Warning
Meaning
This is an informative alarm which indicates that an operator has initiated a cluster operation on the specified managed object (MO). The MO can refer to the whole cluster, a
node, a recovery unit (RU), recovery group (RG), or a process. The platform high availability services (HAS) is now executing the operation. The operation can be
• switchover
• restart
• power-off.
The operations have different effects:
• Switchover
Applicable only to recovery groups (RG). The active RU instance of the RG is terminated and a standby instance on another node started or, in case of a hot active
standby RG, activated. The service provided by the named RU is down until the switchover is complete.
• Restart
For the cluster and nodes this means a physical restart (reboot) of node(s). For other
MOs, the named MO is stopped and restarted. The services provided by the named
MO are down during the restart.
• Power-off
Applicable only to nodes. The named node is being powered off.
Identifying additional information fields
Additional information fields
1. Identifies the MO type (the cluster, a node, a process, or an RU).
Instructions
This is an informative alarm and does not require any actions.
Clearing
The alarm system clears the alarm automatically after its time to live has expired.
Testing instructions
1. Log into the cluster.
2. Restart a managed object using fshascli. For example:
fshascli --restart --nowarning /AS-1
The alarm is visible in the alarm database (if configured) and in the syslog as a message
that begins as follows:
ALARM RAISE SP=70186 ...
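To confirm from the command line that the alarm was logged, the syslog can be filtered for the alarm number. The following is a minimal sketch. It assumes that the syslog lines contain the "ALARM <action> SP=<number>" pattern shown above and that /var/log/master-syslog is the file to search. The function name alarm_events is hypothetical.

```shell
#!/bin/sh
# Print "<SP number> <action>" for every alarm record read from standard input.
# Assumes records contain "ALARM <ACTION> SP=<number>", as in the excerpt above.
alarm_events() {
    sed -n 's/.*ALARM \([A-Z][A-Z]*\) SP=\([0-9][0-9]*\).*/\2 \1/p'
}

# Example: show only the records for alarm 70186.
# alarm_events < /var/log/master-syslog | grep '^70186 '
```

Lines that do not match the pattern are skipped silently, so the helper can be fed the whole log file.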
19.26 70188 MANAGED OBJECT SHUTDOWN BY OPERATOR
Probable cause: Congestion
Event type: Quality of service
Default severity: Warning
Meaning
This is an informative alarm which indicates that the specified managed object (MO),
which can be the whole cluster, a node or a recovery unit (RU), is being shut down. The
named MO and all its unlocked sub-resources are now terminating.
The MO is being shut down by an operator. All services provided by the named MO are
terminating. Once the operation is completed, the administrative state of the MO and all
its sub-MOs will be changed to locked.
Note that a shutdown request may take a long time if the maximum duration for the operation has not been specified. The shutdown request can be forced to completion by
issuing a lock command. In that case the platform high availability services (HAS) will
terminate the services ungracefully.
Identifying additional information fields
Additional information fields
1. Identifies the MO type (a cluster, a node, or an RU)
Instructions
This is an informative alarm which requires no user actions.
Clearing
The alarm system clears this alarm automatically after its time to live has expired.
Testing instructions
The target of the shutdown command can be a cluster, node, recovery group or recovery
unit.
1. Log into the cluster.
2. Execute the shutdown command on the managed object. For example:
fshascli --shutdown /AS-1
The alarm is also visible in the syslog as a message that begins as follows:
ALARM RAISE SP=70188 ...
Note that in the example above --shutdown does not power off the node. It just gracefully shuts down all HAS managed non-critical processes in the node.
After the testing is finished, use the fshascli --unlock command to get the initial
situation restored. For example:
fshascli --unlock /AS-1
19.27 70189 MANAGED OBJECT UNLOCKED BY OPERATOR
Probable cause: Congestion
Event type: Quality of service
Default severity: Warning
Meaning
This is an informative alarm which indicates that the specified managed object (MO)
which can be the whole cluster, a node, or a recovery unit (RU) has been unlocked. The
named MO and its unlocked sub-resources (if there are any) can now be activated.
Notice that the MO (or its sub-MOs) can remain locked because of a dependency on
higher level MOs. That is, the unlock operation will not take effect on the MO in
question before the higher level MOs are unlocked. For example, an RU in a node will
remain locked, if the node or the cluster MO is locked.
The MO has been set to the unlocked state. If all the higher level MOs are unlocked as
well, the services provided by the MO are activated.
Identifying additional information fields
Additional information fields
1. Identifies the MO type (a cluster, a node, or an RU)
Instructions
This is an informative alarm and does not require any actions.
Clearing
The alarm system clears the alarm automatically after its time to live has expired.
Testing instructions
Unlock the previously locked managed object using fshascli:
1. Log into the cluster.
2. Unlock the managed object using fshascli. For example:
fshascli --unlock /AS-1/FSNodeDNSServer
The alarm is also visible in the syslog as a message that begins as follows:
ALARM RAISE SP=70189 ...
Note that this test should be run after the test case for alarm 70166 MANAGED OBJECT
LOCKED.
19.28 70236 LDAP DATABASE CORRUPTED
Severity
Major
Fault reason
A primary or secondary Lightweight Directory Access Protocol (LDAP) database is corrupted and cannot be accessed anymore. An LDAP database can get corrupted, for
example, when:
• a disk becomes full while the database is being updated
• a node failure and/or ungraceful node restart happens while the database is being
updated.
The identified LDAP database is currently unavailable.
In case of a secondary database, the only impact is that the node start-ups can take
slightly longer because some platform services attempt to use the secondary database(s) by default.
Failure of the primary database has a more significant impact. Most application processes cannot be (re)started anymore and applications that update LDAP will fail. If a
secondary database is still available, nodes can still be (re)started but only basic
platform services will be able to start. If the primary and all secondary databases have
failed, the cluster or any of its nodes cannot (re)start anymore. The system will next
automatically try to recover the corrupted database from an operational primary or secondary database.
Description
Identifying additional information fields
Additional information fields
1. Type of the database: Primary or Secondary
2. Relative path of the database directory. Notice that secondary databases are
usually located in a directory such as
/var/mnt/local/localimg/<platform release>/opt/Nokia_BP/var/pmgmt/<platform release>/fsPlatformSlave-ldbm.
The primary LDAP database directory is usually of the following format:
/var/mnt/local/sysimg/<platform release>/opt/Nokia_BP/var/pmgmt/<platform release>/fsPlatform-ldbm.
Notice especially that the lowest level directory is
fsPlatformSlave-ldbm for secondary databases and fsPlatform-ldbm for
the primary database.
Instructions
The system will automatically attempt to recover the corrupted database from a functional copy. If the automatic recovery is successful, this alarm is automatically cleared
and the system raises a new "CORRUPTED LDAP DATABASE RECOVERED" warning
alarm. The automatic recovery, if successful, takes less than a minute.
If the primary and secondary database(s) are all corrupted you must restore them from
a backup. DO NOT ATTEMPT TO RESTART THE CLUSTER OR ANY OF ITS NODES
BEFORE ENSURING THAT THE PRIMARY DATABASE IS OPERATIONAL. The applications can still be providing service normally, and a service interruption only happens if
an unsuccessful restart attempt is made.
Notice, however, that the automatic recovery will fail if the node or database disk has
become full. In this case, you can attempt to solve the situation by freeing space on the
disk, and then allowing the system to retry automatic recovery. To do this, perform the
following steps:
1. Log into the node that has the corrupted database as root user. For example, log
into the node (usually CLA-0 or CLA-1) where the directory service is active:
ssh root@mycluster-directory
<password>
2. Check the available disk space with the df command. For example,
df -k
root@CLA-1(mycluster):~# df -k
Filesystem           1k-blocks      Used Available Use% Mounted on
/dev/rd/0                15863     10698      4346  72% /
tmpfs                  1029260         8   1029252   1% /tmp
/dev/md/0              4999712   1401348   3598364  29% /var/mnt/local/localimg
directory:/var/mnt/local/sysimg
                      49998408  49998408         0 100% /var/mnt/remote/sysimg_rw
directory:/var/mnt/local/sysimg
                      49998408  49998408         0 100% /var/mnt/remote/sysimg_ro
/dev/md/1             49998404  49998408         0 100% /var/mnt/local/sysimg
/dev/md/9             19999256     32840  19966416   1% /var/mnt/local/backup
3. If the database partition (in this example the system image partition) is full, release
space, for example, by deleting excess core and syslog files. You can locate large
files on the partition using the find command: use the cd command to go to the
partition mount point directory and search for files below it. For example,
cd /var/mnt/local/sysimg
find . -type f -name "syslog*" -size +1000000
You can also locate core files using the find command. For example,
cd /var/mnt/local/sysimg
find . -type f -name "*core"
When the disk has at least 100 MB of free space, make the system retry the recovery:
• In case of a secondary database, reboot the node. For example, execute the following command:
shutdown -r now
• In case of the primary database, use fshascli to restart the Directory service:
fshascli -rnF /Directory
Note that this will terminate your terminal connection, thus you will need to log
in again.
If the database was not corrupted because of a full disk, or the automatic
recovery fails again, for example, because all LDAP databases are corrupted,
you must restore the databases from a backup copy. For instructions on the
restore process, see the backup and restore customer documentation.
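The disk space check in step 2 can also be scripted. The following is a minimal sketch. It assumes that the df command on the node supports the POSIX -P option, which keeps each file system on a single line. The function name low_space_mounts is hypothetical, and the default limit of 102400 KB corresponds to the 100 MB of free space mentioned above.

```shell
#!/bin/sh
# Read `df -Pk` style output on standard input and print every mount point
# whose available space is below the given limit in kilobytes.
# The default of 102400 KB corresponds to the 100 MB mentioned above.
low_space_mounts() {
    awk -v min="${1:-102400}" 'NR > 1 && NF >= 6 && $4 + 0 < min + 0 { print $6 }'
}

# Example (on the node): df -Pk | low_space_mounts
```

An empty result means that every listed partition has at least the requested amount of free space. Mount points containing spaces are not handled by this sketch.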
Clearing
The alarm is cleared automatically if the automatic recovery operation is successful. The
alarm must be cleared manually, in case the database has to be manually restored from
a backup.
Testing instructions
The alarm can be tested by simulating a secondary LDAP corruption. This can be done
by renaming the secondary LDAP database directory in the CLA node where the Directory recovery group is active.
Move to the directory where the secondary LDAP is located. The default location is
/var/mnt/local/localimg/flexiserver/opt/Nokia_BP/var/pmgmt/_active/.
The location and the name of the database can also be verified from the
fsPlatformSlave.conf file located under /opt/Nokia_BP/etc/ldapfiles.
The secondary LDAP database is defined after the "directory" tag.
cd /var/mnt/local/localimg/flexiserver/opt/Nokia_BP/var/pmgmt/_active
Rename the current secondary LDAP database.
mv fsPlatformSlave-ldbm fsPlatformSlave-ldbm.bkp
Execute the LDAP recovery script manually. The execution of the script may take
several minutes.
/opt/Nokia_BP/bin/fsLDAPRecoverDatabase -s
The alarm should be visible immediately after starting the recovery script. The script will
use the primary LDAP to restore the secondary LDAP database after which the alarm
will be cancelled. Also alarm "70237: CORRUPTED LDAP DATABASE RECOVERED"
will be raised.
If the alarm was cancelled successfully and a new secondary LDAP database was
created the backup database can safely be removed.
rm -rf fsPlatformSlave-ldbm.bkp
If the alarm was not cancelled, the secondary LDAP database was not created or the
script was terminated before it could finish, restore the backup database. In this case
the alarm needs to be cancelled manually. Remove the partially created secondary
LDAP database if one exists.
rm -rf fsPlatformSlave-ldbm
Restore the original database.
cp -r fsPlatformSlave-ldbm.bkp fsPlatformSlave-ldbm
Cancelling
19.29 70237 CORRUPTED LDAP DATABASE RECOVERED
Probable cause: Corrupt data
Event type: Processing error
Default severity: Warning
Meaning
A primary or secondary LDAP (Lightweight Directory Access Protocol) database was
corrupted but it has been successfully recovered. The LDAP databases can get corrupted, for example, when
• a disk becomes full while the database is being updated
• a node failure and/or ungraceful node restart happens while the database is being
updated.
The platform software has automatically recovered the database from an operational
primary or secondary database. Some applications may have been impacted by the
temporary unavailability of the LDAP database. As the platform restarts the failed applications, the problem should not have caused permanent problems.
Identifying additional information fields
Additional information fields
1. Type of the database that was corrupted; "Primary" or "Secondary".
2. Relative path of the database directory. Notice that secondary databases are
usually located in a directory such as
/var/mnt/local/localimg/<platform release>/opt/Nokia_BP/var/pmgmt/<platform release>/fsPlatformSlave-ldbm.
The primary LDAP database directory is usually in the following format:
/var/mnt/local/sysimg/<platform release>/opt/Nokia_BP/var/pmgmt/<platform release>/fsPlatform-ldbm.
Notice especially that the lowest level directory is
fsPlatformSlave-ldbm for secondary databases and fsPlatform-ldbm for
the primary database.
Instructions
This is an informative alarm. No operator actions required.
Clearing
The system clears the alarm automatically when the fault has been corrected.
Testing instructions
The alarm can be tested by simulating a secondary LDAP corruption. This can be done
by renaming the secondary LDAP database directory in the CLA node where the Directory recovery group is active.
Change the directory to the one where the secondary LDAP is located. The default
location is
/var/mnt/local/localimg/flexiserver/opt/Nokia_BP/var/pmgmt/_active/.
The location and the name of the database can also be verified from the
fsPlatformSlave.conf file located under /opt/Nokia_BP/etc/ldapfiles.
The secondary LDAP database is defined after the "directory" tag.
cd /var/mnt/local/localimg/flexiserver/opt/Nokia_BP/var/pmgmt/_active
Rename the current secondary LDAP database.
mv fsPlatformSlave-ldbm fsPlatformSlave-ldbm.bkp
Execute the LDAP recovery script manually. The execution of the script may take
several minutes.
/opt/Nokia_BP/bin/fsLDAPRecoverDatabase -s
Alarm "70236: LDAP DATABASE CORRUPTED" should be visible immediately after
starting the recovery script. The script will use the primary LDAP to restore the secondary LDAP database, after which alarm "70237: CORRUPTED LDAP DATABASE RECOVERED" will be raised. Also alarm "70236: LDAP
DATABASE CORRUPTED" will be cancelled.
If the alarm was raised successfully and a new secondary LDAP database was created
the backup database can safely be removed.
rm -rf fsPlatformSlave-ldbm.bkp
If the alarm was not raised, the secondary LDAP database was not created or the script
was terminated before it could finish, restore the backup database. Remove the partially
created secondary LDAP database if one exists.
rm -rf fsPlatformSlave-ldbm
Restore the original database.
cp -r fsPlatformSlave-ldbm.bkp fsPlatformSlave-ldbm
19.30 70242 ALARM LOG FILE INACCESSIBLE
Probable cause: File Error
Event type: Processing error
Default severity: Critical
Meaning
Alarm processor cannot open or read the alarm log file.
Alarm notifications recorded in the alarm log file cannot reach the alarm system, and as
a result the control for the alarm situation in the network element is lost.
Identifying additional information fields
Additional information fields
1. reason, possible values:
• file cannot be opened
• permanent file read error
2. additional information about the problem (for example, text of the corresponding
system exception.)
Instructions
1. Check with the parameter management application that the alarm log file name in
the alarm processor configuration in LDAP (Lightweight Directory Access Protocol)
(fsParameterId=fsLogFileName,
fsAlarmProcessorConfigurationId=Default,
fsAlarmProcessorId=AlarmProcessor1, fsFragmentId=AlarmProcessors,
fsFragmentId=AlarmMgmt, fsClusterId=ClusterRoot) is the same as the name specified in the Management Object Model in the SS_MOMfsAlarm document.
2. If the value in LDAP is different, then modify the LDAP value and restart the alarm
processor with the following command:
fshascli -r /<node>/FSAlarmSystemServer/AlarmProcessor
where <node> is the name of the node where alarm processor is deployed.
3. If the values are the same, then fill in a problem report with the alarm data and send
it to your Nokia Siemens Networks representative.
Clearing
The alarm is cleared automatically by the alarm system when access to the alarm log
file is restored.
Testing instructions
1. Use the parameter management application to set a wrong log file name in the
LDAP alarm processor configuration (fsParameterId=fsLogFileName,
fsAlarmProcessorConfigurationId=Default,
fsAlarmProcessorId=AlarmProcessor1, fsFragmentId=AlarmProcessors,
fsFragmentId=AlarmMgmt, fsClusterId=ClusterRoot).
2. Restart alarm processor with the following command:
fshascli -r /<node>/FSAlarmSystemServer/AlarmProcessor
where <node> is the name of the node where alarm processor is deployed.
3. After verifying that an alarm for the situation has been raised, correct the fault as
described in the 'Instructions for operator' field and check that the alarm is cleared.
19.31 70243 ALARM PROCESSOR CONFIGURATION IS OUT OF ORDER
Probable cause: Configuration or customising error
Event type: Processing error
Default severity: 4 Minor
Meaning
The configuration of alarm processor contains an invalid attribute value or an attribute
is missing.
The system ignores the invalid value and uses a default value.
Additional information fields
1. Invalid attribute's value or an empty string if attribute or its value is missing.
Instructions
1. Use the parameter management application to correct the invalid value of the attribute. The distinguished name of the attribute, which identifies its location in LDAP, can be found in the 'Managed Object Id' field of the alarm.
2. Restart alarm processor with the following command:
fshascli -r /<Node>/FSAlarmSystemServer/AlarmProcessor
where <Node> is the name of the node where alarm processor is deployed. The
default values of the alarm processor attributes used when correcting the situation
are listed below:
Attribute                                   Default value
fsNumProcessors                             5
fsHasSimpleAware                            true
fsLogFileName                               /var/log/master-alarms
fsLogParserSleepTime                        1
fsAlarmNotificationCollectorSleepTime       1
fsParameterNotificationProcessorSleepTime   15
fsAlarmHistoryProcessorSleepTime            60
fsAlarmHistorySize                          1000000
fsBatchSize                                 120
fsHeartbeatInterval                         300
fsAlarm70247raise                           true
fsSeverityChangeReRaise                     false
fsNotificationBatchSize                     20
fsStrictAlarmTimeOrder                      false
fsAllowedMCACAlarms                         true
fsDatSupport                                true
fsAutoAckedDAT                              true
fsTimeBasedAlarmHistorySize                 false
fsDeletedAlarmHistorySize                   4000
fsStoredAlarmNotificationsPerSecond         0
fsZeroTTLforWarnings                        False
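When correcting an attribute, it can be convenient to look up its default from the command line. The following is a minimal sketch that simply embeds the defaults listed above. The function names alarm_processor_defaults and default_for are hypothetical, and the sketch does not read anything from LDAP.

```shell
#!/bin/sh
# Default values copied from the list above (attribute name, default value).
alarm_processor_defaults() {
    cat <<'EOF'
fsNumProcessors 5
fsHasSimpleAware true
fsLogFileName /var/log/master-alarms
fsLogParserSleepTime 1
fsAlarmNotificationCollectorSleepTime 1
fsParameterNotificationProcessorSleepTime 15
fsAlarmHistoryProcessorSleepTime 60
fsAlarmHistorySize 1000000
fsBatchSize 120
fsHeartbeatInterval 300
fsAlarm70247raise true
fsSeverityChangeReRaise false
fsNotificationBatchSize 20
fsStrictAlarmTimeOrder false
fsAllowedMCACAlarms true
fsDatSupport true
fsAutoAckedDAT true
fsTimeBasedAlarmHistorySize false
fsDeletedAlarmHistorySize 4000
fsStoredAlarmNotificationsPerSecond 0
fsZeroTTLforWarnings False
EOF
}

# Usage: default_for <attribute>
# Prints the default value; returns 1 if the attribute is not in the list.
default_for() {
    alarm_processor_defaults |
        awk -v a="$1" '$1 == a { print $2; found = 1 } END { exit !found }'
}
```

For example, default_for fsBatchSize prints 120.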
Clearing
The system clears the alarm automatically when the fault has been corrected.
Testing instructions
1. Use the parameter management application to set an invalid value for an attribute.
The customized configuration, that is, the specific values for the parameters, can be
found in the Alarm System configuration in LDAP under
(fsAlarmProcessorConfigurationId=Default,
fsAlarmProcessorId=AlarmProcessor1, fsFragmentId=AlarmProcessors,
fsFragmentId=AlarmMgmt, fsClusterId=ClusterRoot).
2. Restart alarm processor with the following command:
fshascli -r /<Node>/FSAlarmSystemServer/AlarmProcessor
where <Node> is the name of the node where alarm processor is deployed.
3. After verifying that an alarm for the situation has been raised, correct the fault as
described in the 'Instructions for operator' field and check that the alarm is cleared.
19.32 70244 CORRUPTED ALARM DATA
Probable cause: Corrupt data
Event type: Processing error
Default severity: Major
Meaning
Corrupted data was found in the alarm log file.
The corrupted record in the alarm log file is ignored, meaning that it is possible that an
alarm notification was lost or a more serious system error has occurred.
Identifying additional information fields
1. Invalid record (note that the field can hold no more than approximately 390 characters, so the
original invalid record may be truncated).
Additional information fields
2. Error code, possible values:
1. missing mandatory field
2. duplicated field
3. empty record
4. non-alarm data record.
3. Field name (for missing or duplicated field).
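The four error codes can be illustrated with a small classifier for a single alarm log record. This is only a sketch and does not reproduce the alarm processor's actual validation. It assumes that a record is a single line containing the word ALARM followed by KEY=value fields, that SP, MO and SE are the mandatory fields, and that duplicates are checked before missing fields. The function name classify_record is hypothetical.

```shell
#!/bin/sh
# Usage: classify_record '<record on one line>'
# Prints 0 for a record that passes, otherwise one of the error codes above.
# Assumption: SP, MO and SE are treated as the mandatory fields.
classify_record() {
    printf '%s\n' "$1" | LC_ALL=C awk '
    {
        if (NF == 0) { print 3; exit }                  # empty record
        if ($0 !~ /(^| )ALARM /) { print 4; exit }      # non-alarm data record
        for (i = 1; i <= NF; i++)
            if (match($i, /^[A-Z]+=/))
                seen[substr($i, 1, RLENGTH - 1)]++
        for (k in seen)
            if (seen[k] > 1) { print 2; exit }          # duplicated field
        n = split("SP MO SE", must, " ")
        for (j = 1; j <= n; j++)
            if (!(must[j] in seen)) { print 1; exit }   # missing mandatory field
        print 0
    }'
}
```

Only tokens that start with an upper-case key are counted, so the lower-case attribute names inside an MO or AP value do not register as fields.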
Instructions
1. Fill in a problem report with the alarm data and send it to your local Nokia Siemens
Networks representative.
Clearing
Clear the alarm with the alarm management application after correcting the fault as presented in Instructions.
Testing instructions
1. Create a text file containing an empty row or a row with some dummy information.
2. Use the parameter management application to note the current value of the
fsParameterId=fsLogFileName,
fsAlarmProcessorConfigurationId=Default,
fsAlarmProcessorId=AlarmProcessor1,
fsFragmentId=AlarmProcessors, fsFragmentId=AlarmMgmt,
fsClusterId=ClusterRoot attribute in the alarm processor LDAP configuration,
and then replace it with the name of the created file.
3. Restart alarm processor with the following command:
fshascli -r /<node>/FSAlarmSystemServer/AlarmProcessor
where <node> is the name of the node where alarm processor is deployed.
4. After verifying that an alarm for the situation has been raised, clear it with alarm management application.
5. Use the parameter management application to restore the original name of the
alarm log file.
6. Restart alarm processor.
19.33 70245 ILLEGAL INTERNAL USAGE OF EXTERNAL ALARM NOTIFICATION FORMAT
Probable cause: Software Program Error
Event type: x2
Default severity: Major
Meaning
The application raised or cleared an alarm containing an internal MOID (Managed
Object ID) and provided its own alarm time. The application is allowed to provide an
alarm time only for external alarms (alarms with external MOIDs). This alarm is also
raised if the application raised or cleared an alarm containing an external MOID but did
not provide its own alarm time.
The original alarm is discarded.
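The rule described above can be expressed as a simple check on a single alarm record. This is only a sketch under stated assumptions: an external MOID is recognised by the text fsFragmentId=external inside the MO= field, an application-provided alarm time is recognised by the presence of a TIME= field, and the record is written on one line without spaces inside the MO value. The function name alarm_time_usage is hypothetical.

```shell
#!/bin/sh
# Usage: alarm_time_usage '<alarm record on one line>'
# Prints "ok" or "illegal" according to the rule described above.
# Assumptions: an external MOID is recognised by "fsFragmentId=external"
# inside the MO= field, and an application-provided time by a TIME= field.
alarm_time_usage() {
    mo=$(printf ' %s\n' "$1" |
        sed -n 's/.*[[:space:]]MO=\([^[:space:]]*\).*/\1/p')
    case $mo in
        *fsFragmentId=external*) external=1 ;;
        *) external=0 ;;
    esac
    case " $1 " in
        *" TIME="*) has_time=1 ;;
        *) has_time=0 ;;
    esac
    # Legal combinations: external MOID with a time, internal MOID without one.
    if [ "$external" = "$has_time" ]; then
        echo ok
    else
        echo illegal
    fi
}
```

A record with an internal MOID and a TIME= field, or with an external MOID and no TIME= field, is reported as illegal, which matches the two cases in which this alarm is raised.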
Identifying additional information fields
Data from the original alarm:
1. Managed Object ID
2. Specific problem
3. Identifying application additional information
(The application ID is present in the MOID field of the alarm)
Additional information fields
Instructions
Fill in a problem report with the alarm data and send it to your Nokia Siemens Networks
representative.
Clearing
Clear the alarm with the alarm management application after correcting the fault as presented in Instructions.
Testing instructions
1. Create a text file containing the following single row:
2008 Oct 15 18:31:39 ALARM RAISE SP=70156 \
MO=fshaProcessInstanceName=XWDforAlarmType,\
fshaRecoveryUnitName=FSAlarmDBServer,fsipHostName=WAS,\
fsFragmentId=Nodes,fsFragmentId=HA,fsClusterId=ClusterRoot \
AP=fshaProcessInstanceName=XWDforAlarmType,\
fshaRecoveryUnitName=FSAlarmDBServer,fsipHostName=WAS,\
fsFragmentId=Nodes,fsFragmentId=HA,fsClusterId=ClusterRoot \
SE=5 NINFO="1" TIME=E1224084699996
2. Use the parameter management application to note the current value of the
fsParameterId=fsLogFileName,
fsAlarmProcessorConfigurationId=Default,
fsAlarmProcessorId=AlarmProcessor1, fsFragmentId=AlarmProcessors,
fsFragmentId=AlarmMgmt,
fsClusterId=ClusterRoot attribute in the alarm processor LDAP configuration
and replace it with the name of the created file.
3. Restart alarm processor with the following command:
fshascli -r /<Node>/FSAlarmSystemServer/AlarmProcessor
where <Node> is the name of the node where alarm processor is deployed.
4. After verifying that an alarm for the situation has been raised (in the case of an
internal MOID with provided time), clear it with the alarm management application.
5. Use the parameter management application to restore the original name of the
alarm log file.
6. Restart the alarm processor.
7. Create a text file containing the following single row:
2008 Oct 15 18:32:39 ALARM RAISE SP=70159 \
MO=rncMOId=DN:NE-WBTS-34/WCEL-1,fsLogicalNetworkElemId=OMS,\
fsFragmentId=external,fsClusterId=ClusterRoot \
AP=fshaProcessInstanceName=HASNodeAgent,\
fshaRecoveryUnitName=FSNodeHAServer,\
fsipHostName=CLA-0,fsFragmentId=Nodes,fsFragmentId=HA,\
fsClusterId=ClusterRoot SE=3 NINFO="MO failed"
8. Repeat steps 2 and 3.
9. After verifying that an alarm for the situation has been raised (in the case of an
external MOID without provided time), clear it with the alarm management application.
10. Repeat steps 5 and 6.
19.34 70246 ALARM SYSTEM HEARTBEAT
Probable cause: Timeout expired
Event type: Processing error
Default severity: Warning
Meaning
This is an informative alarm, which indicates that the alarm system itself is in operational
state. The alarm system is continuously (after each expiration of a heartbeat interval)
raising or clearing this alarm, which means that the state of this alarm is constantly
changing in a loop (new alarm > cleared alarm > new alarm > cleared alarm > new alarm
> ...) and the alarm time is updated by the time of the last raise or clear operation. If the
refreshing of the alarm does not occur, it signals that the alarm system is faulty.
Note that there is a delay before the raise/clear operation becomes visible in the alarm
monitoring tool. If the system is under heavy load it might take even longer for the operation to be visible in the alarm monitoring tool.
Identifying additional information fields
Additional information fields
1. Heartbeat interval in seconds.
Instructions
1. If the used alarm monitor tool does not support an automatic alert in situations where
the alarm system heartbeating is not functioning, check occasionally that the heartbeating functions properly. The time of the alarm and the value of the heartbeat
interval (specified in the 'Application Additional Info' field) should be used in the
analysis of the situation.
2. Perform such checking also when the system does not generate any alarm events
for a long time.
3. If the checking shows that the alarm time is not continuously refreshed, restart the
alarm processor with the following command:
fshascli -r /<Node>/FSAlarmSystemServer/AlarmProcessor
where <Node> is the name of the node where the alarm processor is deployed.
4. If restarting the alarm processor does not help, also restart the alarm system
database with the following command:
fshascli -r /AlarmDB
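The check described in steps 1 to 3 can be sketched in Python as follows. This is a minimal illustration only: the function name heartbeat_is_stale and the grace_s tolerance (which stands in for the display delay mentioned in the Meaning section) are assumptions, not part of the product.

```python
from datetime import datetime, timedelta


def heartbeat_is_stale(alarm_time, now, interval_s, grace_s=60):
    """Return True if alarm 70246 has not been refreshed recently enough.

    alarm_time: time of the last raise or clear operation of the alarm.
    interval_s: heartbeat interval from the additional information field.
    grace_s:    assumed tolerance for the delay before an update is visible.
    """
    if interval_s <= 0:
        raise ValueError("heartbeat interval must be positive")
    return now - alarm_time > timedelta(seconds=interval_s + grace_s)


last_update = datetime(2010, 10, 29, 12, 0, 0)
print(heartbeat_is_stale(last_update, datetime(2010, 10, 29, 12, 4, 0), 300))
print(heartbeat_is_stale(last_update, datetime(2010, 10, 29, 12, 10, 0), 300))
```

If the function returns True, continue with the restart described in step 3.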
Clearing
The alarm system clears the alarm when the heartbeat interval expires.
Testing instructions
1. Check with the parameter management application that the alarm system heartbeating is
switched on: for example, check that the fsParameterId=fsHeartbeatInterval,
fsAlarmProcessorConfigurationId=Default, fsAlarmProcessorId=AlarmProcessor1,
fsFragmentId=AlarmProcessors, fsFragmentId=AlarmMgmt, fsClusterId=ClusterRoot
attribute in the alarm system LDAP configuration has a positive value (set a positive
value if needed).
2. Restart the alarm processor with the following command:
fshascli -r /<Node>/FSAlarmSystemServer/AlarmProcessor
where <Node> is the name of the node where the alarm processor is deployed.
3. With the alarm system heartbeating switched on, check that only one instance of this
alarm is raised or cleared within a period that is approximately equal to the heartbeat
interval.
19.35
70247 ALARM SYSTEM HEARTBEATING SWITCHED OFF
Probable cause: Configuration or Customising Error
Event type: Processing error
Default severity: Major
Meaning
The alarm system heartbeating is switched off, which means that the alarm system does
not raise or clear its heartbeat alarms.
The alarm system heartbeating is the simplest and most efficient way for the operator to
monitor that the alarm system itself is healthy. If heartbeating is switched off, the
operator cannot detect if the alarm system becomes faulty. It is therefore strongly
recommended that you keep the alarm system heartbeating switched on at all times.
Nevertheless, the alarm system heartbeating can be switched off if an alternative
heartbeating exists. Setting the fsAlarm70247raise configuration parameter to false in the
alarm system configuration disables the raising of alarm 70247.
Identifying additional information fields
Additional information fields
1. Heartbeat interval in seconds.
Instructions
1. Use the parameter management application to set a non-zero heartbeat interval in seconds
(0 means that heartbeating is switched off) for the fsParameterId=fsHeartbeatInterval,
fsAlarmProcessorConfigurationId=Default, fsAlarmProcessorId=AlarmProcessor1,
fsFragmentId=AlarmProcessors, fsFragmentId=AlarmMgmt, fsClusterId=ClusterRoot
attribute in the alarm system LDAP configuration.
2. If the alarm system heartbeating is intentionally kept switched off, use the parameter
management application to set the value of the fsParameterId=fsAlarm70247raise,
fsAlarmProcessorConfigurationId=Default, fsAlarmProcessorId=AlarmProcessor1,
fsFragmentId=AlarmProcessors, fsFragmentId=AlarmMgmt, fsClusterId=ClusterRoot
attribute to false in the alarm system LDAP configuration.
3. Restart the alarm processor with the following command:
fshascli -r /<Node>/FSAlarmSystemServer/AlarmProcessor
where <Node> is the name of the node where the alarm processor is deployed.
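The relationship between the two configuration attributes can be summarised with a short Python sketch. It assumes that a positive fsHeartbeatInterval value means heartbeating is switched on and that fsAlarm70247raise defaults to true; the function name is hypothetical.

```python
def alarm_70247_expected(heartbeat_interval_s, raise_70247=True):
    """Return True if alarm 70247 is expected after the alarm processor restarts.

    heartbeat_interval_s: value of fsHeartbeatInterval (0 means switched off).
    raise_70247:          value of fsAlarm70247raise.
    """
    heartbeating_on = heartbeat_interval_s > 0
    return (not heartbeating_on) and raise_70247


print(alarm_70247_expected(0))         # heartbeating off, alarm enabled
print(alarm_70247_expected(300))       # heartbeating on
print(alarm_70247_expected(0, False))  # heartbeating off, alarm disabled
```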
Clearing
The alarm system clears the alarm automatically after restart if the alarm system heartbeating is switched on in the configuration.
Testing instructions
1. Use the parameter management application to set the value of the
fsParameterId=fsHeartbeatInterval, fsAlarmProcessorConfigurationId=Default,
fsAlarmProcessorId=AlarmProcessor1, fsFragmentId=AlarmProcessors,
fsFragmentId=AlarmMgmt, fsClusterId=ClusterRoot attribute to zero in the alarm system
LDAP configuration. The value of the fsParameterId=fsAlarm70247raise,
fsAlarmProcessorConfigurationId=Default, fsAlarmProcessorId=AlarmProcessor1,
fsFragmentId=AlarmProcessors, fsFragmentId=AlarmMgmt, fsClusterId=ClusterRoot
attribute should be true.
2. Restart the alarm processor with the following command:
fshascli -r /<Node>/FSAlarmSystemServer/AlarmProcessor
where <Node> is the name of the node where the alarm processor is deployed.
3. After verifying that an alarm for the situation has been raised, correct the fault as
described in the 'Instructions for operator' field and check that the alarm is cleared.
19.36
70256 RESOURCE ALLOCATION OR DE-ALLOCATION FAILURE
Probable cause: Software Program Abnormally Terminated
Event type: Processing error
Default severity: Major
Meaning
Allocation or deallocation of resources to or from a computer node in the cluster has
failed.
Applications running in the cluster are often associated with resources that are allocated
to the node before the application is started and released from the node after the
application has terminated. Such resources can be, for example, TCP/IP addresses that are
associated with the service provided by the software, or a disk partition that contains
the application database. In addition, the application can allocate and deallocate other
resources (for example, start and stop third-party applications) in its control
scripts.
An operation failure has been reported for the defined recovery unit while it was starting
or stopping.
If the error occurred when an application was starting, application start-up is aborted. In
case of a permanent fault, the service provided by the application is now down. With a
transient or node-specific fault, and providing that the application has a standby, the
application may have been restarted successfully on another node.
If the fault happened while the application was terminating, the node on which the error
happened has now been restarted to restore it to a known state. If the node has
restarted successfully or the application has a standby resource, the application has
likely already restarted, and service is again available.
Identifying additional information fields
Additional information fields
1. Name of the recovery group to which the recovery unit belongs. For example,
"/Directory".
2. Situation when the failure happened: string "allocating" or "de-allocating"
3. Type of the resource allocation: "IP(address)", "disk(mount point)" or "ctrlscript". For
example, "IP(192.1.1.78)" or "disk(sysimg)".
4. Only present if argument 3 is "ctrlscript". Contains the name of the control script that
reported the failure. For example, "RUControlDirectoryServer.sh"
Instructions
1. Log into the network element as root user to check the situation.
2. Use the fshascli command to check the state of all recovery units within the
recovery group (the name of the recovery group is in the Application Additional Information field).
If the recovery group is providing service, every UNLOCKED recovery unit that
has the ACTIVE role also has the ENABLED operational state and an empty procedural
status. For example, the state of the recovery units of the /Directory recovery group can
be checked as follows:
$ fshascli --state $(fshascli -children /Directory |
grep -vE "\/.+\/.+\/" )
/CLA-0/FSDirectoryServer:
administrative(UNLOCKED)
operational(ENABLED)
usage(IDLE)
procedural(NOTINITIALIZED)
availability()
unknown(FALSE)
alarm()
role(COLDSTANDBY)
/CLA-1/FSDirectoryServer:
administrative(UNLOCKED)
operational(ENABLED)
usage(ACTIVE)
procedural()
availability()
unknown(FALSE)
alarm()
role(ACTIVE)
In the above case, the recovery unit of the CLA-0 node is acting as a cold standby
backup and the recovery unit on CLA-1 is running the service normally.
Note that the grep command in the example is used to filter out information regarding
individual processes in each recovery unit.
Since this situation may be caused by various different faults, contact your Nokia Siemens
Networks representative to analyse the root cause.
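When many recovery units have to be checked, the state listing printed by fshascli --state can be evaluated with a small script. The following Python sketch parses output in the format of the /Directory example and applies the "providing service" rule from step 2. The helper names are hypothetical, and requiring at least one ACTIVE unit is an assumption added for the sketch.

```python
import re

STATE_LINE = re.compile(r"^(\w+)\((.*)\)$")

SAMPLE = """\
/CLA-0/FSDirectoryServer:
administrative(UNLOCKED)
operational(ENABLED)
usage(IDLE)
procedural(NOTINITIALIZED)
role(COLDSTANDBY)
/CLA-1/FSDirectoryServer:
administrative(UNLOCKED)
operational(ENABLED)
usage(ACTIVE)
procedural()
role(ACTIVE)
"""


def parse_state(text):
    """Map each recovery unit name to a dict of its state attributes."""
    units, current = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("/"):
            current = line.rstrip(":")
            units[current] = {}
            continue
        match = STATE_LINE.match(line)
        if match and current is not None:
            units[current][match.group(1)] = match.group(2)
    return units


def group_provides_service(units):
    """Every UNLOCKED unit in the ACTIVE role must be ENABLED with empty procedural status."""
    active = [u for u in units.values()
              if u.get("administrative") == "UNLOCKED" and u.get("role") == "ACTIVE"]
    return bool(active) and all(
        u.get("operational") == "ENABLED" and u.get("procedural") == ""
        for u in active)


print(group_provides_service(parse_state(SAMPLE)))
```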
Clearing
Clear the alarm manually after the problem has been solved.
Testing instructions
Simulate an IP address allocation failure
1. An IP address allocation failure can be caused by manually allocating an IP address
to a node before a recovery unit is started. Select a cold active/standby recovery
group (but do not use the Directory recovery group) that has an IP address associated with it, and allocate the address to the standby node. For example:
$ fshascli --state /CLA-0/FSClusterDNSServer
/CLA-0/FSClusterDNSServer
administrative(UNLOCKED) <== Unlocked
operational(ENABLED) <== Operational
usage(IDLE)
procedural(NOTINITIALIZED)
availability()
unknown(FALSE)
alarm()
role(COLDSTANDBY)
$ grep ClusterDNS /etc/hosts
192.168.2.255 ClusterDNS
. . .
$ ip addr show | grep 192.168.2.255
inet 192.168.2.255/23 scope global secondary bond0
inet fe80::192:168:2:255/10 scope link
$ ssh cla-0
Last login: . . .
$ ip address add 192.168.2.255/23 dev bond0
2. Issue a switchover for the recovery group so that the service attempts to move to the
node that already has the IP address. For example:
$ fshascli --switchover /ClusterDNS
The switchover fails and the alarm is raised. The alarm is visible, for example, in
the alarm log. Note that you have to clear the alarm manually.
3. Remove the IP address that you added manually or reboot the node. For example:
$ ip address del 192.168.2.255/23 dev bond0
19.37
70265 RECOVERY ACTIONS BANNED FOR MANAGED OBJECT
Probable cause: Software Error
Event type: Processing error
Default severity: Major
Meaning
An operator has set the specified managed object to an inert mode. The managed object
identifies a node. If the inert mode is set for the whole cluster, this alarm is raised separately for each node. While the inert mode is on, high availability services (HAS) does
not attempt to recover services from failures, for example, by restarting nodes or applications, or by performing switchovers within the specified managed objects. Note that
the inert mode should be used only by qualified supplier's representatives when
analysing problems in the system.
The inert mode is switched on by issuing an fshascli command, for example:
$ fshascli --inert-mode on /CLA-0
The command above switches the inert mode on for the /CLA-0 node. Correspondingly, the
inert mode can be switched off by using the fshascli command:
$ fshascli --inert-mode off /CLA-0
This alarm is raised when an operator switches the inert mode on for either a set of
nodes or the cluster. The inert mode has the following effects on the behaviour of the
system in nodes for which the inert mode has been switched on:
• If there are no failures, the service provided by the network element is not affected.
• If failures occur, no recovery actions are performed and the service may be affected.
For example, if a process fails, it is not restarted by HAS.
• Process failures are still propagated to the recovery unit level, but the recovery unit
level fault recovery does not take place. In practice, this means that the propagated
process failure does not cause restarts of other recovery unit processes, and
switchovers do not take place with active/standby recovery groups.
• HAS logs pending recovery actions to master syslog (/var/log/master-syslog
on the active CLA node) in the form "INFO Inert mode set for <managed object
name>. Recovery action \"restart\" pending.".
• HAS does not raise any alarms for managed objects in the inert mode. The inert
mode for a node sets all managed objects within the node to the inert mode.
• The inert mode persists in the nodes over node or cluster restarts.
• Only the node and cluster restart, power on and power off fshascli commands
work while the inert mode is set for the nodes or the cluster.
Note that fault recovery works in the normal way in the nodes that are not in the inert mode.
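Pending recovery actions can be collected from master syslog lines with a short script. The Python sketch below matches the message form quoted in the list above. The timestamp and host prefix in the sample lines and the function name are assumptions made for illustration; the real log lines may carry a different prefix.

```python
import re

# Matches: Inert mode set for <managed object name>. Recovery action "<action>" pending.
PENDING = re.compile(r'Inert mode set for (\S+)\. Recovery action "([^"]+)" pending\.')


def pending_recovery_actions(lines):
    """Return (managed object, action) pairs found in the given log lines."""
    found = []
    for line in lines:
        match = PENDING.search(line)
        if match:
            found.append((match.group(1), match.group(2)))
    return found


sample_lines = [
    'Oct 29 12:00:01 CLA-0 HAS: INFO Inert mode set for /CLA-0. Recovery action "restart" pending.',
    'Oct 29 12:00:02 CLA-0 HAS: INFO unrelated message',
]
print(pending_recovery_actions(sample_lines))
```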
Identifying additional information fields
Additional information fields
-
Instructions
1. To ensure proper functionality of the system, switch off the inert mode after the
problem analysis is done.
2. You can switch off the inert mode from all nodes of the cluster by issuing the
fshascli command:
$ fshascli --inert-mode off /
Note that this should be done by the supplier's field engineer who is currently analysing
the system.
When the inert mode is switched off, pending recovery actions take place. For example,
if an important severity process in a cold active/standby recovery group has failed in a
node that was in the inert mode, switching the inert mode off for the node causes a switchover of the recovery group.
Clearing
The system clears the alarm when the inert mode is switched off from the managed
object.
Testing instructions
1. Switch the inert mode on for the cluster:
$ fshascli --inert-mode on /
An alarm should be raised for all present nodes of the cluster.
2. Switch the inert mode off for the cluster:
$ fshascli --inert-mode off /
The alarm should be cancelled for all present nodes of the cluster.
19.38
70267 EXTERNAL USER ACCOUNT VALIDATION FAILED
Probable cause: Configuration or Customising Error
Event type: Processing error
Default severity: Warning
Meaning
The Network Element (NE) has detected that, according to the NetAct Remote User Information
Management (RUIM) LDAP (Lightweight Directory Access Protocol) access control
lists, an external user account defined in the NetAct LDAP user database has permissions
for this NE. According to the NE security architecture, remote user accounts are replicated
locally. The validation check performed on the user account before the replication
did not pass, and therefore the user account was not replicated.
Possible reasons for a failing validation check are:
1. External username is the same as one of the NE internal usernames. This should
not happen if NetAct is following the agreed way of naming users.
2. External username is a reserved username.
3. External username is invalid, for example, too long (supported usernames are up to
31 characters long).
4. External username contains invalid characters.
5. Account is not assigned with any valid permissions.
6. External user ID is the same as one of internal user IDs.
7. External user ID is not in the supported range.
8. Some permissions do not map to any valid groups.
9. User ID is not a valid number.
The user account cannot be used to log into the NE (except for case 8 above, where
user is still able to log in).
Identifying additional information fields
Username
Additional information fields
error type (1-9 according to the list in "Meaning of alarm")
uid (numeric user ID). Note that in case of error type 9, the user ID in this field is set to -1
comma-separated list of invalid group names (for error type 8)
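For scripts that post-process this alarm, the error type can be translated back to the reasons listed in the Meaning section. The Python sketch below is a lookup table built from that list. The names ERROR_TYPES and login_possible are hypothetical.

```python
# Error types 1-9 as listed in the Meaning section of alarm 70267.
ERROR_TYPES = {
    1: "External username is the same as one of the NE internal usernames",
    2: "External username is a reserved username",
    3: "External username is invalid, for example, too long",
    4: "External username contains invalid characters",
    5: "Account is not assigned with any valid permissions",
    6: "External user ID is the same as one of internal user IDs",
    7: "External user ID is not in the supported range",
    8: "Some permissions do not map to any valid groups",
    9: "User ID is not a valid number",
}


def login_possible(error_type):
    """Only in case 8 is the user still able to log in to the NE."""
    if error_type not in ERROR_TYPES:
        raise ValueError("unknown error type: %r" % (error_type,))
    return error_type == 8


print(ERROR_TYPES[7], "->", login_possible(7))
print(ERROR_TYPES[8], "->", login_possible(8))
```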
Instructions
Check that the username complies with the restrictions imposed by the NE and correct
the account information in NetAct LDAP.
The restrictions (based on /RUIMFLEXI/) are the following:
• The username must match the pattern
[a-zA-Z0-9_.][a-zA-Z0-9_.-]{0,30}[a-zA-Z0-9_.$-]? (32 characters maximum).
• The username cannot start with one of the prefixes reserved for network elements:
"_nok", "_nsn".
• The username cannot be the same as one of the reserved names from the list
(defined in /RUIMFLEXI/): root, wheel, daemon, adm, sync, shutdown, halt, lp, mail,
uucp, operator, games, nobody, gopher, nfs, nfsnobody, named, ntp, ldap, mysql,
postgres, apache, sshd, rpm, dbus, vcsa, nscd.
• The numeric user ID of a RUIM user must be in the range [1000, 9999999], that
is, greater than or equal to one thousand and less than ten million.
• The account must be assigned at least one valid permission. Valid permissions
are those that allow mapping an external user account to one or more network
element groups.
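The username and user ID restrictions above can be checked before an account is created in NetAct LDAP. The following Python sketch is illustrative only. It assumes that the username pattern reads [a-zA-Z0-9_.][a-zA-Z0-9_.-]{0,30}[a-zA-Z0-9_.$-]?, and the function name check_account is hypothetical. The permission check is not covered because it depends on the NetAct data.

```python
import re

USERNAME_RE = re.compile(r"[a-zA-Z0-9_.][a-zA-Z0-9_.-]{0,30}[a-zA-Z0-9_.$-]?")
RESERVED_PREFIXES = ("_nok", "_nsn")
RESERVED_NAMES = {
    "root", "wheel", "daemon", "adm", "sync", "shutdown", "halt", "lp", "mail",
    "uucp", "operator", "games", "nobody", "gopher", "nfs", "nfsnobody", "named",
    "ntp", "ldap", "mysql", "postgres", "apache", "sshd", "rpm", "dbus", "vcsa",
    "nscd",
}
UID_MIN, UID_MAX = 1000, 9999999


def check_account(username, uid):
    """Return a list of restriction violations (empty if none were found)."""
    problems = []
    if not USERNAME_RE.fullmatch(username):
        problems.append("username does not match the allowed pattern")
    if username.startswith(RESERVED_PREFIXES):
        problems.append("username starts with a reserved prefix")
    if username in RESERVED_NAMES:
        problems.append("username is a reserved name")
    if not isinstance(uid, int) or not UID_MIN <= uid <= UID_MAX:
        problems.append("user ID is outside the supported range")
    return problems


print(check_account("extuser", 10009))
print(check_account("_nokadmin", 10009))
print(check_account("root", 0))
```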
Clearing
Clear the alarm with an alarm management application (for example, Alarm Browser)
after correcting the fault as presented in Instructions.
Testing instructions
The test setup must include an external LDAP server supporting the RUIM schema
(defined in /RUIMSCHEMA/).
Before you start, check that:
• FlexiPlatform cluster is commissioned and up.
• NE account is defined in the NE's internal LDAP (NWI3 Security fragment).
• External LDAP server is up.
• All RUIM-related RGs (RuimRep and PAP) are unlocked and enabled.
1. Create a user account in external LDAP in a way that conflicts with the restrictions
described in the Meaning of the alarm section.
2. Make this user a member of an LDAP ACL that is linked with ruiAuthObject that
defines a valid permission in the network element. For example,
ruiAuthObject and ruiAuthOperation.
dn: ruiAuthObjectName=fsui,ou=SystemPermissionsSet,ou=NetAct,ou=Authorization,ou=ruim,ou=region-911080,ou=regions,ou=NetAct,dc=noklab,dc=net
ruiIsStereoType: FALSE
ruiAuthObjectName: fsui
objectClass: top
objectClass: ruiAuthorizationObject
ruiMgmtDomain: ALL
dn: ruiAuthOperationName=monitor,ruiAuthObjectName=fsui,ou=SystemPermissionsSet,ou=NetAct,ou=Authorization,ou=ruim,ou=region-911080,ou=regions,ou=NetAct,dc=noklab,dc=net
ruiIsScopeDependent: FALSE
objectClass: top
objectClass: ruiAuthorizedOperation
ruiClassification:
ruiAuthOperationName: monitor
Applying the rule "_nok"+ruiAuthObject+ruiAuthOperation to this example gives the group name
_nokfsuimonitor. Making a user a member of this group gives it permissions FSNASVIEW,
FSIPVIEW, FSLBVIEW, FSLANVIEW, and so on.
3. Initiate an ssh login using the created account.
4. Observe that the alarm is raised and check that the user is not replicated to the NE's
internal LDAP RUIM cache fragment (fsFragmentId=security-ruim-cache,fsClusterId=ClusterRoot). Login is not successful.
5. Clear the alarm manually.
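The group naming rule used in step 2 ("_nok"+ruiAuthObject+ruiAuthOperation) is a plain string concatenation. A minimal Python sketch, with a hypothetical function name:

```python
def ruim_group_name(auth_object, auth_operation, prefix="_nok"):
    """Build the NE group name from ruiAuthObjectName and ruiAuthOperationName."""
    return prefix + auth_object + auth_operation


print(ruim_group_name("fsui", "monitor"))
```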
19.39
70268 EXTERNAL LDAP FAILURE
Probable cause: Underlying resource unavailable
Event type: Processing error
Default severity: Warning
Meaning
Network element (NE) experiences problems with the connection to the NetAct external
Lightweight Directory Access Protocol (LDAP) server. The alarm is raised for the following types of problems:
1. Both primary and secondary NetAct LDAP servers are down, unreachable, not
responding within certain time, or replying with a return code indicating that LDAP is
busy. This indicates a failure.
2. Neither the NE account nor the initial registration account is accepted by either the
primary or the secondary NetAct LDAP server. This indicates a failure.
3. Bad LDAP data (for example, loops in referrals, too big a result set).
4. Other types of problems, for example invalid RUIM configuration in the local LDAP
server.
The NE is trying to contact the external NetAct LDAP server in the following scenarios:
1. NE connects to the NetAct LDAP server to verify external user's password information.
2. NE connects to the NetAct LDAP server to obtain external user's authorization data.
There are several use cases when this scenario is triggered:
a) User authorization data is fetched and replicated locally during the first login of
an external user into the NE, or a login occurring after the replicated user
account is removed from NE's internal user database due to cache expiry. This
scenario occurs after external user's password has been verified in the context
of user authentication.
b) User authorization data replication that is triggered by NE Name Service Switch
(NSS) module, for example, by using the id command.
c) User authorization data is fetched and replicated after a relevant CLI command
(fsruimrepcli --refreshusers --username <username>) is executed. For more information, see the RUIM user guide.
d) User authorization data is fetched and replicated due to a scheduled cache
update. Scheduled cache updates are performed automatically and at regular intervals by
the RuimReplicator process of the RuimReplicator Recovery Group. The interval between
replications is configured by the following property in the RuimReplicator property
file (in /opt/Nokia_BP/SS_AAA/etc):
// automatic cache refresh interval in seconds
ruim.replicator.refresh_interval
Problems 1 and 2 prevent successful completion of all scenarios. The effect of the
problems is described below:
• In scenarios 1 and 2a, the external user's login is denied with an appropriate PAM
(Pluggable Authentication Module) error code.
• In scenario 2b, there can be various problems related to user-to-group mappings for
external users.
• In scenario 2c, the CLI operation fails.
• In scenario 2d, the scheduled cache update fails. If time-based replication fails due
to NetAct LDAP server unavailability (problem 1), the RuimReplicator process starts
to recover from the failure by retrying the replication according to the following properties:
// retry count in case of cache refresh failure
ruim.replicator.refresh_retry_count
// sleep between cache refresh tries in seconds
ruim.replicator.refresh_retry_interval
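The effect of the two retry properties can be pictured with a generic retry loop. The Python sketch below is not the RuimReplicator implementation. It assumes one initial attempt followed by at most refresh_retry_count retries, with refresh_retry_interval seconds of sleep between attempts. The function and parameter names are hypothetical.

```python
import time


def refresh_with_retries(refresh, retry_count, retry_interval_s, sleep=time.sleep):
    """Run refresh() and retry on failure.

    refresh:          callable returning True on success, False on failure.
    retry_count:      stands in for ruim.replicator.refresh_retry_count.
    retry_interval_s: stands in for ruim.replicator.refresh_retry_interval.
    Returns (success, number of attempts made).
    """
    attempts = 0
    while True:
        attempts += 1
        if refresh():
            return True, attempts
        if attempts > retry_count:
            return False, attempts
        sleep(retry_interval_s)


# Example: the external LDAP server becomes reachable on the third attempt.
outcomes = iter([False, False, True])
waits = []
print(refresh_with_retries(lambda: next(outcomes), 3, 5, sleep=waits.append), waits)
```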
Identifying additional information fields
1. Problem type (1 - NetAct LDAP not available, 2 - both NE account and initial registration accounts not usable, 3 - Bad data, 4 - Other)
2. Scenario (1 - PAM or NSS failure, 2 - RuimReplicator replication)
Additional information fields
1. LDAP or RUIMCppAPI error code
2. Number of retries (applicable for scenario with time-based replication (2d))
3. Retry interval (in seconds as defined by the RuimReplicator properties)
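The numeric codes in the identifying fields can be turned into readable text when alarms are processed by a script. The Python sketch below uses the code lists given above. The function name describe_70268 is hypothetical.

```python
PROBLEM_TYPES = {
    1: "NetAct LDAP not available",
    2: "both NE account and initial registration accounts not usable",
    3: "Bad data",
    4: "Other",
}
SCENARIOS = {
    1: "PAM or NSS failure",
    2: "RuimReplicator replication",
}


def describe_70268(problem_type, scenario):
    """Return a readable summary of the identifying fields of alarm 70268."""
    return "%s (%s)" % (PROBLEM_TYPES[problem_type], SCENARIOS[scenario])


print(describe_70268(1, 2))
```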
Instructions
Depending on the problem type (see Identifying Application Additional Info) the cause
for the problem can be:
• Network configuration problems.
Check that the primary and secondary NetAct LDAP server addresses (related attributes
in LDAP are fsnwi3PrimaryLDAPServer and fsnwi3SecondaryLDAPServer) defined in the
active configuration fragment under the NWI3 Mediator fragment
(fsClusterId=ClusterRoot fsFragmentId=NWI3 fsFragmentId=mediator
fsnwi3N3CFId=<your number>) are reachable.
• NE and the initial registration accounts are both invalid as compared to NetAct
(wrong account name, password, and so on).
Check that the accounts (related attributes in LDAP are
fsnwi3NEAccountUsername and fsnwi3InitialRegistrationUsername)
stored in the internal LDAP server (fsClusterId=ClusterRoot
fsFragmentId=NWI3 fsFragmentId=security and
fsClusterId=ClusterRoot fsFragmentId=NWI3
fsFragmentId=mediator fsnwi3N3CFId=<your number>) exist also in the
NetAct LDAP servers, have not expired, have correct passwords, and so on.
• NetAct LDAP is overloaded or shut down.
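A basic reachability check of the configured LDAP server addresses can be done by opening a TCP connection. The Python sketch below assumes the standard LDAP port 389; the port actually used by the NetAct LDAP servers may differ. The function name is hypothetical. A successful connection shows only that the port accepts connections, not that the accounts are valid.

```python
import socket


def ldap_port_reachable(host, port=389, timeout_s=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


# Replace the address with the value of fsnwi3PrimaryLDAPServer or
# fsnwi3SecondaryLDAPServer from the NE configuration.
print(ldap_port_reachable("127.0.0.1", 389, 1.0))
```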
Clearing
The alarm is automatically cleared by the RuimReplicator when replication is successful.
The alarm is also cleared when the RuimReplicator raises a new alarm with the same
specific problem but with different Identifying Application Additional Info.
Testing instructions
The test setup must include an external LDAP server populated according to NetAct
Remote User Information Management (RUIM) schema (/RUIMSCHEMA/).
Before you start, check that:
• NE is commissioned and functioning.
• Connection with the external LDAP is established.
• All RUIM-related RGs (RuimReplicator and PAP) are unlocked and enabled.
Execution scenario 1:
1. Shut both the primary and secondary NetAct LDAP servers down.
2. Log in through ssh with a valid external (RUIM) user to the NE.
3. If login is unsuccessful, observe that the alarm is raised in NE with the following SCLI
command:
show alarm active filter-by specific-problem 70268
Alarm additional info must indicate the problem correctly.
4. Start the NetAct LDAP servers.
5. Log in through ssh with a valid external (RUIM) user to the NE.
6. If login is successful, use the SCLI command provided in step 3 to observe that the
alarm has been cleared automatically by RuimReplicator in NE.
Execution scenario 2:
1. In NE, modify the registration accounts (the NE account and the initial registration
account; related attributes in LDAP are fsnwi3NEAccountUsername and
fsnwi3InitialRegistrationUsername) so that neither the primary nor the secondary
NetAct LDAP server can be accessed with them. The accounts are stored under
fsClusterId=ClusterRoot fsFragmentId=NWI3 fsFragmentId=security and
fsClusterId=ClusterRoot fsFragmentId=NWI3
fsFragmentId=mediator fsnwi3N3CFId=<your number>.
2. Initiate an ssh login with an external account.
3. Observe that the login is denied and an alarm is raised. Alarm additional info must
indicate the problem correctly.
19.40
70269 INVALID ACTIVE SESSIONS
Probable cause: Database inconsistency
Event type: Processing error
Default severity: Critical
Meaning
Currently there are open sessions to the Network Element (NE) that operate according
to outdated authorisation profiles. This situation occurs when there are changes in
NetAct Lightweight Directory Access Protocol (LDAP) affecting those external Remote
User Information Management (RUIM) user accounts (or permissions associated with
those accounts) which were replicated into the NE's local user database.
The change can be one of the following:
• The user account has been removed from NetAct.
• The user account cannot be used to access the NE anymore.
• The permissions associated with this account have changed in NetAct.
Currently there are active user sessions that were opened before the above-mentioned
changes were detected in the NE. Within those existing user sessions, access control
changes do not take effect automatically. Users logged in with affected user
accounts continue to operate with the old permission set.
Note that only sessions maintained in /var/run/utmp are monitored. Currently only
SSH sessions are monitored. FTP sessions opened with vsftpd are also visible in
/var/run/utmp, but FTP sessions are not possible with external user accounts according
to the platform configuration. For other types of sessions, no alarm is raised.
This alarm can indicate that some users operate within the NE with higher permissions
than allowed by NetAct according to a changed user account authorisation profile. There
are four possible reasons for this:
1. A non-existent user is still logged into the NE (user account removed from NetAct).
2. A user with no permissions for the NE is logged in (user account has been detached
from the NE according to RUIM Access Control Lists).
3. A user has higher permissions than defined in NetAct (permissions for the user
account were lowered).
4. A user has lower permissions than defined in NetAct (permissions for the user
account were raised).
Note that cases 1-3 indicate a security risk.
Identifying additional information fields
username, for which changes were detected
Additional information fields
change type (user was removed or denied access to the NE (1), user's permissions
changed (2))
Instructions
All currently active ssh sessions based on user accounts mentioned in the Application
Additional Info field of the alarm must be closed and reopened, if needed. After reopening a session, correct permissions are taken into use, if the account is still in use for the
NE.
• To check open ssh sessions:
1. Log into the active CLA.
2. Execute the following command:
# utmpdump /var/run/utmp
For example, the result of invoking utmpdump may look as follows:
# utmpdump /var/run/utmp
...
[6] [06306] [co  ] [LOGIN   ] [ttyS1] [                      ] [196.144.10.0   ] [Tue Nov 14 16:20:58 2006 EET]
[7] [32610] [ts/0] [testuser] [pts/0] [fle4gr01.ntc.nokia.com] [172.21.216.104 ] [Tue Nov 21 19:06:11 2006 EET]
[7] [32679] [ts/1] [testuser] [pts/1] [fle4gr01.ntc.nokia.com] [172.21.216.104 ] [Tue Nov 21 19:07:07 2006 EET]
[7] [32743] [ts/2] [testuser] [pts/2] [fle4gr01.ntc.nokia.com] [172.21.216.104 ] [Tue Nov 21 19:07:45 2006 EET]
[7] [00361] [ts/3] [testuser] [pts/3] [fle4gr01.ntc.nokia.com] [172.21.216.104 ] [Tue Nov 21 19:08:50 2006 EET]
[7] [17382] [ts/4] [root    ] [pts/4] [flegrp13.ntc.nokia.com] [172.21.220.61  ] [Fri Dec 01 14:59:47 2006 EET]
[7] [01256] [ts/5] [extuser ] [pts/5] [esfleg03.ntc.nokia.com] [172.21.216.127 ] [Sun Dec 03 13:05:44 2006 EET]
[7] [04574] [ts/6] [root    ] [pts/6] [esfleg02.ntc.nokia.com] [172.21.216.126 ] [Fri Dec 01 12:29:21 2006 EET]
...
The preferred way of closing a session is a graceful exit. It is, however, possible to
close it forcefully. The following example illustrates a forceful cleanup of a session
for user extuser.
1. First, check the sshd process ID of the child process of 01256:
# ps -ef | grep 1256
root      1256  7701  0 13:05 ?      00:00:00 sshd: extuser [priv]
10009     1276  1256  0 13:05 ?      00:00:00 sshd: extuser@pts/5
root      2504 17382  0 13:06 pts/4  00:00:00 grep 1256
2. Terminate the session:
# kill -9 1276
ssh session for user extuser is terminated.
• Other instructions
The following gives some information about other types of sessions, even though
they cannot be reported in this alarm.
• Authorisation handling for Element Manager over Nwi3 (secure CORBA) implies
automatic refreshing of the authorisation data according to Nwi3Adapter Secure
CORBA properties. To achieve faster refreshing of the authorisation data for
Nwi3Adapter (for example, if you know that authorisation data for a logged-in
user has changed), invoke the following command:
# fscorbaseccli -c updatetoken
This triggers the authorisation profile update for all users after at most as many
seconds as specified by the property:
com.nokia.flexiplatform.corba.security.cache.tokenpollrefresh.pollinterval in
/opt/Nokia/SS_Nwi3Adapter/etc/secfwk.properties
• For Element Manager over HTTP, the user's permissions are checked each time a method is
accessed (according to the default configuration), so a changed authorisation profile
takes effect immediately. The local NE LDAP sessions are not
affected. It is not possible to bind to a local NE LDAP with an external account.
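To find the sessions of a particular user in utmpdump output, the bracketed fields of each record can be split with a short script. The Python sketch below assumes the eight-field record layout shown in the utmpdump example in these instructions and treats record type 7 as a user process. The function name user_sessions is hypothetical.

```python
import re

FIELD = re.compile(r"\[([^\]]*)\]")

SAMPLE = """\
[7] [17382] [ts/4] [root    ] [pts/4] [flegrp13.ntc.nokia.com] [172.21.220.61  ] [Fri Dec 01 14:59:47 2006 EET]
[7] [01256] [ts/5] [extuser ] [pts/5] [esfleg03.ntc.nokia.com] [172.21.216.127 ] [Sun Dec 03 13:05:44 2006 EET]
"""


def user_sessions(dump, username):
    """Return (pid, terminal, host) for each user-process record of username."""
    sessions = []
    for line in dump.splitlines():
        fields = [f.strip() for f in FIELD.findall(line)]
        if len(fields) < 8:
            continue  # not a complete utmpdump record
        rec_type, pid, _ident, user, tty, host = fields[:6]
        if rec_type == "7" and user == username:
            sessions.append((int(pid), tty, host))
    return sessions


print(user_sessions(SAMPLE, "extuser"))
```

The process ID returned for each session is the privileged sshd process; the child process that is terminated in the example above still has to be looked up with ps.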
Clearing
Clear the alarm with an alarm management application (for example, Alarm Browser)
after correcting the fault as presented in Instructions.
Testing instructions
The test setup must include an external LDAP server populated according to the RUIM
schema.
Before you start, check that:
• FlexiPlatform cluster is commissioned and up.
• All RUIM-related RGs (RuimRep and PAP) are unlocked and enabled.
Execution scenario for ssh:
1. Open an ssh session to the NE using an account defined in RUIM LDAP, for
example, extuser. Check with the command:
$ utmpdump /var/run/utmp
that the session is opened. You get the following entry:
[7] [01505] [ts/1] [extuser ] [pts/1] [flegrp13.ntc.nokia.com] [172.21.220.61  ] [Sun Nov 26 19:36:26 2006 EET]
2. Remove extuser from RUIM LDAP. Execute the following CLI command:
$ fsruimrepcli --refreshcache
to enforce synchronisation between RUIM LDAP and the local replicated security
fragment.
3. Observe that an alarm is displayed and it indicates user extuser as the one for which
sessions should be restarted.
4. Check that there is an sshd process corresponding to the session.
# ps -ef | grep extuser
root      1505 26013  0 19:36 ?      00:00:00 sshd: extuser [priv]
10009     1584  1505  0 19:36 ?      00:00:00 sshd: extuser@pts/1
5. Terminate the process:
# kill -9 1584
6. Observe that the session is terminated.
7. Try to log in again using account extuser.
Access must be denied.
19.41 70280 UNKNOWN SPECIFIC PROBLEM
Probable cause: Configuration or customising error
Event type: Processing error
Default severity: Warning
Meaning
This alarm is raised when an alarm notification is detected for a specific problem (alarm
number) that is unknown to the alarm system (the corresponding alarm type is not
defined in the reference data).
The unknown specific problem can be the result of either using a dynamic alarm type (a
type that is not inherently predefined and correspondingly not ported to the alarm
system) or a mistake due to a missing import of the existing alarm definition in the alarm
system.
The alarm is raised in two cases:
1. When the alarm system is configured for supporting dynamic alarm types (the
fsDatSupport attribute in the alarm system's LDAP configuration is set to true).
2. When the alarm system doesn't support dynamic alarm types but is configured for
raising alarm 70280 instead of alarm 70005 for unknown specific problems (the
fsRaise70280insteadOf70005forUnknownSP attribute in the alarm system's
LDAP configuration is set to true).
In the first case, the alarm system immediately creates a new alarm type using
the data from the alarm notification. It sets the alarm type parameters and applies the
new type to the alarm notification in question, which is not discarded.
The new alarm type is stored persistently in the reference data of the alarm system database. It is then applied to subsequent alarm notifications that contain the
same specific problem. As a result, alarm 70280 is no longer raised for a specific
problem that has already been registered.
In the second case, the alarm system discards the alarm notification in question and
raises alarm 70280, which includes data from the original alarm notification.
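The two cases can be summarised as a small decision model. The Python sketch below is illustrative only and is not the alarm system's implementation: the class and field names are hypothetical, and the fallback to alarm 70005 when both attributes are false is an assumption inferred from the attribute name fsRaise70280insteadOf70005forUnknownSP.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class AlarmSystemModel:
    """Toy model of how an unknown specific problem is handled."""
    dat_support: bool          # models the fsDatSupport attribute
    raise_70280_instead: bool  # models fsRaise70280insteadOf70005forUnknownSP
    known_types: Set[int] = field(default_factory=set)

    def handle(self, specific_problem: int) -> List[int]:
        """Return the alarm numbers raised for one incoming notification."""
        if specific_problem in self.known_types:
            # Known alarm type: the notification is processed normally.
            return [specific_problem]
        if self.dat_support:
            # Case 1: register the new type, keep the notification,
            # and raise alarm 70280 as well.
            self.known_types.add(specific_problem)
            return [70280, specific_problem]
        if self.raise_70280_instead:
            # Case 2: discard the notification and raise 70280 with its data.
            return [70280]
        # Assumption: with both attributes false, alarm 70005 is raised.
        return [70005]


system = AlarmSystemModel(dat_support=True, raise_70280_instead=False)
print(system.handle(79999))  # first notification for 79999
print(system.handle(79999))  # the type is now registered
```

In this model the first call raises both 70280 and the new alarm, while the second call raises only the new alarm, which matches the description of the first case.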
Identifying application additional information fields
1. Unknown specific problem in the original alarm notification.
2. Managed object ID in the original alarm notification.
3. Identifying application additional information in the original alarm notification.
Application Additional information fields
1. Perceived severity in the original alarm notification.
2. Application additional information in the original alarm notification.
Instructions
The alarm either announces the use of a dynamic alarm type in an alarm notification or indicates an undefined alarm in the alarm system (the exact reason can be identified by
checking the list of known alarms in the customer documentation). In the latter case, contact
your Nokia Siemens Networks representative to upgrade the system with the definition
of the missing alarm.
The alarm system creates a new alarm type using the following values for its parameters:
A. Static Parameters:
Parameter: Value
Alarm text: The value of a special field in the alarm notification; if the field is not defined, the text takes the form "ALARM NNN", where NNN is the specific problem in question.
Probable cause: 0 (INDETERMINATE).
Event type: Environmental.
Specific problem: The specific problem in question.
Clearing info: Automatic clearing.
B. Dynamic Parameters:
Parameter: Value
Default severity: The perceived severity of the alarm notification; if it is not set, the INDETERMINATE value is used.
Autoacknowledgment: Yes, if the fsParameterId=fsAutoAckedDAT, fsAlarmProcessorConfigurationId=Default, fsAlarmProcessorId=AlarmProcessor1, fsFragmentId=AlarmProcessors, fsFragmentId=AlarmMgmt, fsClusterId=ClusterRoot attribute defined in the alarm system's configuration in Configuration Directory is set to "true"; otherwise, no.
Switch over update: No
Clearing delay: 0
Informing delay: 0
Time to live: 0
Operation Instructions: "Not defined".
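The default values listed above can be expressed as a short Python sketch. It is a reading aid under stated assumptions, not the alarm system's code: the function name, the dictionary keys and the auto_ack_configured flag (standing in for the fsAutoAckedDAT configuration attribute) are all hypothetical.

```python
from typing import Optional


def build_dynamic_alarm_type(specific_problem: int,
                             alarm_text: Optional[str] = None,
                             perceived_severity: Optional[str] = None,
                             auto_ack_configured: bool = False) -> dict:
    """Assemble the parameters of a dynamically created alarm type."""
    return {
        # Static parameters
        "alarm_text": alarm_text or f"ALARM {specific_problem}",
        "probable_cause": "0 (INDETERMINATE)",
        "event_type": "Environmental",
        "specific_problem": specific_problem,
        "clearing_info": "Automatic clearing",
        # Dynamic parameters
        "default_severity": perceived_severity or "INDETERMINATE",
        "auto_acknowledgment": auto_ack_configured,
        "switch_over_update": False,
        "clearing_delay": 0,
        "informing_delay": 0,
        "time_to_live": 0,
        "operation_instructions": "Not defined",
    }


alarm_type = build_dynamic_alarm_type(79999)
print(alarm_type["alarm_text"], alarm_type["default_severity"])
```

When the notification carries neither an alarm text nor a perceived severity, the sketch falls back to "ALARM 79999" and INDETERMINATE, as the tables describe.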
If required, the static parameters can be changed by using SCLI commands:
Clearing
The alarm system clears the alarm automatically after its time to live has expired.
Testing instructions
Scenario 1 (dynamic alarm type support is switched on).
1. Check with the parameter tool that dynamic alarm type support is switched on, i.e.
the fsDatSupport attribute in the alarm system LDAP configuration is set to true
(modify the configuration if necessary and restart the Alarm Processor using the
fshascli -rn /AlarmSystem command).
2. Raise any unknown test alarm using the flexalarm tool - for instance 79999:
# flexalarm --raise --sp=79999 --mo=/ --ap=/CLA0/TestRU/TestApp --se=3
3. Observe that a new alarm type with the parameters described in the Instructions
field has been added to the alarm system reference data.
4. Observe that alarm 70280 has also been raised.
5. Observe that an alarm of the newly created alarm type (for the previously unknown
specific problem) has also been raised.
6. Observe that alarm 70280 has been cleared after its time to live has expired.
Scenario 2 (dynamic alarm type support is switched off, but alarm 70280 is raised
instead of alarm 70005).
1. Check with the parameter tool that dynamic alarm type support is switched off, i.e.
the fsDatSupport attribute in the alarm system's LDAP configuration is set to
false; and that the fsRaise70280insteadOf70005forUnknownSP attribute is
set to true (modify the attributes if necessary and restart the Alarm Processor using
the fshascli -rn /AlarmSystem command).
2. Raise any unknown test alarm using the flexalarm tool - for instance 89999:
# flexalarm --raise --sp=89999 --mo=/ --ap=/CLA0/TestRU/TestApp --se=3
3. Observe that a new alarm type has not been added to the reference data of the
alarm system.
4. Observe that alarm 70280 has been raised with the data from the original alarm notification.
5. Observe that the original alarm notification has been discarded.
6. Observe that alarm 70280 has been cleared after its time to live has expired.
19.42 71000 PM FTP CONNECTION FAILED
Probable cause: Communication Protocol Error
Event type: Communications
Default severity: Minor
Meaning
A file transfer operation failed when trying to upload a measurement file. The IP address in the
additional information field indicates which interface the problem concerns.
This alarm is not set immediately after a file transfer operation fails, but only after
file transfers to the same IP address have failed consecutively over the duration
defined by the LDAP parameter
OMS/OMSRNC/SS_RNCPM/OMSMeaHandler/BTSFTPAlarmSetDelay.
Measurement data may be lost or delayed.
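The delayed setting of the alarm can be pictured with the following Python sketch. It is a simplified model, not the OMS implementation: the class name is hypothetical, the delay is assumed to be expressed in seconds (the unit of BTSFTPAlarmSetDelay is not stated here), and the IP address is an example value.

```python
from typing import Dict


class FtpFailureTracker:
    """Set an alarm only after transfers to one IP keep failing for `set_delay`."""

    def __init__(self, set_delay: float):
        self.set_delay = set_delay  # assumed to be in seconds
        self.first_failure: Dict[str, float] = {}

    def record(self, ip: str, success: bool, now: float) -> bool:
        """Record one transfer result; return True if the alarm is active for `ip`."""
        if success:
            # A successful transfer ends the failure streak and cancels the alarm.
            self.first_failure.pop(ip, None)
            return False
        # Remember when the current streak of failures started.
        started = self.first_failure.setdefault(ip, now)
        return now - started >= self.set_delay


tracker = FtpFailureTracker(set_delay=600)
print(tracker.record("192.0.2.1", success=False, now=0))    # streak starts
print(tracker.record("192.0.2.1", success=False, now=600))  # delay reached
print(tracker.record("192.0.2.1", success=True, now=700))   # transfer works again
```

The streak is tracked per IP address, so a failure towards one server does not affect the alarm state of another.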
Identifying additional information fields
1. IP-address of FTP/HTTP/HTTPS server
Additional information fields
2. Cause information: "Connect_failed", "Get_failed", "Other_error"
3. Network element identifier ("WMBTS-xxx" for BTS-failures, "ASNGW-xxx" for ASN
GW failures, "IADA-xxx" for I-HSPA Adapter failures, "FGW-xxx" for Femto Gateway
failures)
Instructions
Normally, the alarm does not need to be cleared manually; the system cancels it automatically when a file transfer operation succeeds. However, if the related network
element is removed altogether from the network or its IP address is changed, it may be
necessary to cancel the alarm manually using Element Manager.
Clearing
-
19.43 71001 MEASUREMENT DATA NOT TRANSFERRED
Probable cause: Queue Size Exceeded
Event type: Quality of service
Default severity: Minor
Meaning
The number of files waiting to be transferred to NetAct has exceeded a defined threshold.
Some measurement data may not have been transferred to NetAct or file transfer
acknowledgements from NetAct to OMS are not working correctly.
Identifying additional information fields
Additional information fields
Instructions
Do not clear the alarm. Alarm System clears the alarm when the number of untransferred files decreases below a defined threshold.
If the NetAct connection is to be disabled, set the LDAP parameter PMFileBufferAlarmEnabled to
value 0 (zero); the alarm is then cancelled automatically within 10 minutes.
Clearing
-
19.44 71002 MEASUREMENT DATA ERROR
Probable cause: Corrupt data
Event type: Processing error
Default severity: Warning
Meaning
Measurement file could not be processed.
Some measurement data could have been lost due to invalid measurement file content.
Identifying additional information fields
Additional information fields
1. Error info, possible values: "Decompression_failed", "File_corrupted", "Other_failure"
2. File name
3. IP-address of data provider
4. Detailed error code for troubleshooting
Instructions
Do not clear the alarm. Alarm System will clear the alarm automatically.
Clearing
-
19.45 71003 OMS MEASUREMENT DATA PROCESSING OVERLOAD
Probable cause: System Resources Overload
Event type: Quality of service
Default severity: Minor
Meaning
The time used for processing performance measurement data in OMS has exceeded
the defined limit. This does not necessarily indicate any loss of measurement data but
the measurement parameters should be changed to decrease load and prevent possible
problems caused by the overload.
The limits used to set and cancel this alarm can be changed by the user through OMS
LDAP parameters.
Too much measurement data is produced in the network elements, and the OMS overload
creates a risk of losing some data.
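Separate limits for setting and cancelling an alarm are commonly used so that the alarm does not toggle when the load hovers around a single value. The Python sketch below illustrates that idea only; it assumes that the cancel limit is not higher than the set limit, and the class name and the limit values are hypothetical rather than actual OMS LDAP parameters.

```python
class OverloadAlarm:
    """Alarm with separate set and cancel limits (hysteresis)."""

    def __init__(self, set_limit: float, cancel_limit: float):
        if cancel_limit > set_limit:
            raise ValueError("cancel_limit must not exceed set_limit")
        self.set_limit = set_limit
        self.cancel_limit = cancel_limit
        self.active = False

    def update(self, processing_time: float) -> bool:
        """Feed one processing-time sample; return the resulting alarm state."""
        if not self.active and processing_time > self.set_limit:
            self.active = True   # limit exceeded: set the alarm
        elif self.active and processing_time < self.cancel_limit:
            self.active = False  # load back to normal: cancel the alarm
        return self.active


alarm = OverloadAlarm(set_limit=10, cancel_limit=5)
print([alarm.update(t) for t in (4, 11, 7, 4)])
```

In this model a sample between the two limits leaves the alarm state unchanged, so the alarm stays active at 7 and is cancelled only at 4.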
Identifying additional information fields
1. Measurement category, possible values: "RNW_meas", "Transm_hw_meas",
"WBTS_meas", which also covers WMBTS, ASN GW and FGW measurements.
Additional information fields
Instructions
Do not clear the alarm. Alarm System clears the alarm after the data processing load
decreases to a normal level.
Clearing
-
19.46 71052 OMS FTP CONNECTION COULD NOT BE OPENED
Probable cause: Communication Protocol Error
Event type: Communications
Default severity: Minor
Meaning
Starting a new FTP connection has failed.
File transfer between OMS and target network element is not working.
Identifying additional information fields
1. IP address of the failed target
Additional information fields
Instructions
The error can have many different causes (a configuration error such as a
faulty IP address, lack of memory, excessive load, and so on). To find out the reason for
the error:
1. Open web browser.
2. Go to page https://<OMS IP address>/logviewer.html.
3. Check the log for errors. If the problem persists, see for how to fix the problem. If that
does not provide a solution, contact the local Nokia Siemens Networks representative.
Clearing
Normally, the alarm does not need to be cleared manually; the system cancels it automatically when an FTP operation succeeds. However, if the related network element is
removed altogether from the network or its IP address is changed, it may be necessary
to cancel the alarm manually using Element Manager.
19.47 71054 O&M MEDIATION FAILURE
Probable cause: Communication Protocol Error
Event type: Communications
Default severity: Minor
Meaning
There is an NWI3 connection problem between OMS and NetAct.
The OMS unit sets this alarm when sending a WBTS O&M operation reply from OMS to
NetAct has failed.
While the NWI3 problem persists, the WBTS O&M mediation tasks handled by the OMS unit
(SW download, SW version upload, HW configuration upload) cannot be performed.
Identifying additional information fields
Additional information fields
Instructions
No actions required from the operator.
Clearing
Do not clear the alarm. This alarm is cancelled automatically by the system.
After the problem in the NWI3 connection has been corrected, the system cancels the
alarm only when the next O&M mediation event is successfully sent to NetAct. It
is therefore normal for the alarm to stay active for a while after the problem has been corrected.
19.48 71057 NWI3 NOTIFICATION MISSING
Probable cause: Communication Protocol Error
Event type: Communications
Default severity: Minor
Meaning
The OMS sends notifications to the NMS when the configuration in the network element
has been updated. The notification event is related to either configuration or topology
changes in the network element. The alarm is set if all the notifications related to the
configuration and topology changes cannot be sent to the NMS. The reason is a notification handling error in the OMS or in the NMS. The alarm is set for OMS.
The information in the NMS about the network configuration and/or topology
might be inconsistent.
Identifying additional information fields
Additional information fields
1. Notification event type, possible values: "configuration", "topology".
Instructions
Upload the information related to the NWI3 fragment in question from the NMS to bring
the configuration information of the network elements up to date.
Clearing
The alarm does not need to be cleared manually; the system cancels it automatically
when the error situation is cleared.
19.49 71058 NE O&M CONNECTION FAILURE
Probable cause: Communication Protocol Error
Event type: Communications
Default severity: Major
Meaning
This alarm is raised when the BTS O&M connection between OMS and a network element
fails.
The managed network element is not reachable by centralized O&M systems such as OMS
or NetAct.
Identifying additional information fields
1. IP-address of BTS-O&M interface
Additional information fields
Instructions
Normally, the alarm does not need to be cleared manually; the system cancels it automatically when the connection is working again. However, if the related network element
is removed altogether from the network or its IP address is changed, it may be necessary to cancel the alarm manually using Element Manager.
Clearing
Do not clear the alarm unless the conditions for manual clearing given in Instructions
are fulfilled. This alarm is cancelled automatically by the system after the fault has
been corrected.
19.50 71101 OMS ALARM UPLOAD FROM NE FAILED
Probable cause: Communication Protocol Error
Event type: Communications
Default severity: Minor
Meaning
Alarm upload from NE to OMS has failed.
The active alarm situation may be out of sync between NE and OMS.
Identifying additional information fields
1. NE info
Additional information fields
1. Explanation of the fault observed in alarm upload scenario between NE and OMS.
Possible values are:
- No response from NE
- NE disconnected before upload finished
- NE sent AckNack(<NackReasonCode>) NackReason=<NackReasonText>
Instructions
1. Check the cause of failure in the Application Additional Information field.
2. If the cause of failure indicates a BTS O&M connection problem, check the BTS with BTS Site Manager.
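When alarms are post-processed by a script, the cause text can be classified as follows. This Python sketch assumes that the field contains exactly one of the three texts listed under Additional information fields; the function name, the returned labels and the sample Nack code and reason are hypothetical.

```python
import re
from typing import Optional, Tuple

# Matches "NE sent AckNack(<NackReasonCode>) NackReason=<NackReasonText>".
NACK_RE = re.compile(r"^NE sent AckNack\((?P<code>[^)]*)\) NackReason=(?P<text>.*)$")


def classify_upload_failure(cause: str) -> Tuple[str, Optional[str], Optional[str]]:
    """Return (kind, nack_code, nack_text) for an alarm upload failure cause."""
    cause = cause.strip()
    if cause == "No response from NE":
        return ("no_response", None, None)
    if cause == "NE disconnected before upload finished":
        return ("disconnected", None, None)
    match = NACK_RE.match(cause)
    if match:
        return ("nack", match.group("code"), match.group("text"))
    # Anything else is reported as unknown instead of being guessed at.
    return ("unknown", None, None)


# The code "3" and the reason "busy" are made-up sample values.
print(classify_upload_failure("NE sent AckNack(3) NackReason=busy"))
```

The first two kinds point to a connection problem, so they are the ones for which step 2 applies.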
Clearing
Manual clearing is not necessary.
The alarm is cleared automatically when the next alarm upload from NE to OMS succeeds
or when its time to live has expired.
19.51 71103 ID CONFLICT IN BTS O&M CONNECTION
Probable cause: Communication Protocol Error
Event type: Communications
Default severity: Major
Meaning
Two or more network elements that try to communicate with OMS have the same network
element ID.
OMS cannot communicate with network elements that share the same ID.
Identifying additional information fields
1. Conflicting network element ID
Additional information fields
1. IP address of the network element.
Instructions
Assign a unique ID to each network element.
Clearing
Clear the alarm when the ID conflict has been resolved.
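Before reassigning IDs, it can help to list which network elements share an ID. The Python sketch below assumes that the IP address and network element ID of each element have already been collected into a dictionary, for example from the alarm's additional information fields; the function name and all sample values are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def find_id_conflicts(ne_id_by_ip: Dict[str, str]) -> Dict[str, List[str]]:
    """Return each network element ID used by more than one IP address."""
    ips_by_id = defaultdict(list)
    for ip, ne_id in ne_id_by_ip.items():
        ips_by_id[ne_id].append(ip)
    # Keep only the IDs that are shared by two or more elements.
    return {ne_id: sorted(ips) for ne_id, ips in ips_by_id.items() if len(ips) > 1}


# Example inventory with made-up addresses and IDs.
inventory = {
    "192.0.2.1": "BTS-1",
    "192.0.2.2": "BTS-2",
    "192.0.2.3": "BTS-1",
}
print(find_id_conflicts(inventory))
```

An empty result means that every element in the inventory already has a unique ID.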
Related information
Troubleshooting Recommendations
Instructions: Generic troubleshooting procedure
Information Sources in Fault Situations
Instructions: Generic troubleshooting procedure
Problem Types
Descriptions: Troubleshooting Recommendations, Introduction to Problem Reporting
Instructions: Generic Troubleshooting Procedure
Introduction to Problem Reporting
Descriptions: Problem Types
Instructions: Generic Troubleshooting Procedure