TROUBLESHOOTING IN LIVE WCDMA NETWORKS Master’s Thesis, Mikko Nieminen Espoo, February 14th, 2006

Supervisor: Professor Heikki Hämmäinen
Background to the Study
• The number of live WCDMA networks is growing
quickly.
• The first commercial Third Generation Partnership
Project (3GPP) compliant network, J-phone, was
opened in December 2002.
• By October 2005, there were 80 live commercial WCDMA networks with nearly 40 million subscribers. By that time, around 140 WCDMA licenses had been awarded, and the license holders had more than 500 million subscribers in their Second Generation (2G) networks.
• Especially in Europe and Asia, WCDMA network
deployment after successful field trials and service
launches has entered a new critical stage: the phase
of network optimisation and network troubleshooting.
Research Problem
• As the number of WCDMA subscribers quickly increases, operators and equipment vendors are facing big challenges in maintaining and troubleshooting their networks.
– How can one efficiently narrow down the root causes of problems when a live WCDMA network carries a very large number of subscribers and a high volume of traffic?
– What are the principles of examination of the fault scenarios and
narrowing down the problem investigation into logical manageable
pieces?
– Which are the tools and methods that are in practice used in
WCDMA network troubleshooting today?
• In order to tackle these questions and challenges, this Thesis presents a Framework for KPI-triggered troubleshooting in live WCDMA networks.
• The applicability of the Framework is demonstrated by applying
it to a selection of real troubleshooting cases that have
occurred in commercial WCDMA networks.
Scope of the Study
• This study concentrates on the KPI-triggered problems in live WCDMA networks.
• In general, the faults can be classified into three categories:
– Critical, which are emergency problems that require immediate action,
– Major (which we refer to in this study as KPI-triggered problems),
– Minor, which do not affect the services of the network.
• The viewpoint is that of the equipment vendor, the main objective being to create guidelines for troubleshooting experts and technical support personnel of WCDMA network manufacturers, so that they can perform troubleshooting and narrow the problems down following a defined logic.
• This Thesis mainly concentrates on WCDMA network troubleshooting from a Radio Access Network perspective. The reasoning behind this approach is that the UTRAN covers most of the WCDMA-specific functionality and intelligence, and therefore also brings the majority of the troubleshooting challenges.
Research Methods
• This Thesis is mainly based on the study of
various technical specifications and
interviews of WCDMA network
troubleshooting experts.
• The main literature sources are the 3GPP
specifications of release 99, since the
majority of the live WCDMA networks were
based on 3GPP release 99 during the writing
of this Thesis.
• 3GPP release 4 networks are currently gaining a foothold among the live WCDMA networks. However, there are only minor differences in the Radio Access functionality of these two 3GPP specification releases.
Structure of the Thesis
• Introduction to WCDMA Networks
• UTRAN Protocols
• Call Trace Analysis
• Key Performance Indicators
• Framework for KPI-Triggered Troubleshooting
• Cases from Live WCDMA Networks
WCDMA network architecture
[Figure: WCDMA network architecture]
• UE: ME and USIM
• UTRAN: Node Bs, each serving several cells, connected to RNCs
• Core Network: MSC/VLR, GMSC, SGSN, GGSN, HLR, AuC, EIR
• External networks: PSTN, Internet
UTRAN architecture
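[Figure: UTRAN architecture]
• Uu: between the User Equipment (UE) and the Node B
• Iub: between the Node B and the RNC
• Iur: between two RNCs
• Iu-CS: between the RNC and the 3G MSC in the Core Network (CN)
• Iu-PS: between the RNC and the SGSN in the Core Network (CN)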
UMTS Bearer Services
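[Figure: UMTS bearer services between UE, RAN and CN, across the Uu and Iu interfaces; services are offered through service access points (SAP)]
• Non-Access Stratum: Radio Access Bearer, Signalling connection
• Access Stratum: Radio bearer service and RRC connection (over Uu), Iu bearer service and Iu connection (over Iu)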
Summary of Protocols (CS user plane)
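[Figure: protocol stacks of the CS user plane, listed from the top layer down]
• UE (Uu): CS application and coding / RLC / MAC / WCDMA L1
• Node B (Uu side): WCDMA L1
• Node B (Iub side): FP / AAL2 / ATM / PDH/SDH
• RNC (Iub side): RLC / MAC / FP / AAL2 / ATM / PDH/SDH
• RNC (Iu side): Iu-UP protocol / AAL2 / ATM / PDH/SDH
• MSC (Iu): CS application and coding / Iu-UP protocol / AAL2 / ATM / PDH/SDH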
Summary of Protocols (UE control plane)
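[Figure: protocol stacks of the UE control plane, listed from the top layer down]
• UE (Uu): NAS / RRC / RLC / MAC / WCDMA L1
• Node B (Uu side): WCDMA L1
• Node B (Iub side): FP / AAL2 / ATM / PDH/SDH
• RNC (Iub side): RRC / RLC / MAC / FP / AAL2 / ATM / PDH/SDH
• RNC (Iu side): RANAP / SCCP / MTP3b / SSCF-NNI / SSCOP / AAL5 / ATM / PDH/SDH
• CN (Iu): NAS / RANAP / SCCP / MTP3b / SSCF-NNI / SSCOP / AAL5 / ATM / PDH/SDH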
Overview of WCDMA Call Setup
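[Figure: phases of WCDMA call setup]
• MT Call: Paging, then RRC Connection Establishment, Radio Access Bearer Establishment, User Plane Data Flow
• MO Call: RRC Connection Establishment, Radio Access Bearer Establishment, User Plane Data Flow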
RRC connection establishment (DCH)
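[Figure: signalling sequence between UE, Node B and RNC]
1. RRC CONNECTION REQUEST (RRC, UE to RNC)
2. Admission Control (in the RNC)
3. RADIO LINK SETUP REQUEST (C-NBAP, RNC to Node B)
4. Start RX (in the Node B)
5. RADIO LINK SETUP RESPONSE (C-NBAP, Node B to RNC)
6. ESTABLISH REQUEST (ALCAP, RNC to Node B)
7. ESTABLISH CONFIRM (ALCAP, Node B to RNC)
8. UPLINK & DOWNLINK SYNC (FP, between Node B and RNC)
9. Start TX (in the Node B)
10. RRC CONNECTION SETUP (RRC, RNC to UE)
11. L1 SYNCH (between UE and Node B)
12. RL RESTORE INDICATION (D-NBAP, Node B to RNC)
13. RRC CONNECTION SETUP COMPLETE (RRC, UE to RNC)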
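The sequence above lends itself to a simple trace check: given the message names decoded from a protocol analyser trace, find the first expected message that never appears. The following is a minimal sketch, assuming the trace has already been reduced to an ordered list of message names. The function name and the input format are hypothetical, and only the steps that are visible as messages on the interfaces are included (internal steps such as Admission Control, Start RX, Start TX and L1 synchronisation are left out).

```python
from typing import List, Optional

# Messages of the RRC connection establishment (DCH) that can be seen in a
# trace, in the order given on the slide. Internal steps are omitted.
EXPECTED_MESSAGES = [
    "RRC CONNECTION REQUEST",
    "RADIO LINK SETUP REQUEST",
    "RADIO LINK SETUP RESPONSE",
    "ESTABLISH REQUEST",
    "ESTABLISH CONFIRM",
    "UPLINK & DOWNLINK SYNC",
    "RRC CONNECTION SETUP",
    "RL RESTORE INDICATION",
    "RRC CONNECTION SETUP COMPLETE",
]


def first_missing_message(trace: List[str]) -> Optional[str]:
    """Return the first expected message not found in order, or None."""
    position = 0
    for expected in EXPECTED_MESSAGES:
        try:
            # Search only after the previously matched message.
            position = trace.index(expected, position) + 1
        except ValueError:
            return expected
    return None


# Hypothetical trace in which the procedure stops after the ALCAP setup.
partial_trace = EXPECTED_MESSAGES[:5]
print(first_missing_message(partial_trace))
```

The first missing message gives a starting point for narrowing down which interface or network element to investigate; it does not by itself identify the root cause.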
Protocol Analysers
Company          Product                      Home Country
Nethawk [47]     3G Analyser                  Finland
Agilent [48]     Signaling Analyzer           United States
Tektronix [49]   K15                          United States
Radcom [50]      Performer Analyser           Israel
Acterna [51]     Telecom Protocol Analyzer    United States
RRC Connection Events and KPIs
[Figure: RRC connection events between UE, RNC and CN]
• Event 1: RRC CONNECTION REQUEST (UE to RNC). RRC_CONN_ATT_EST incremented. Setup phase follows.
• Event 2: RRC CONNECTION SETUP (RNC to UE). RRC_CONN_ATT_COMP incremented. Access phase follows.
• Event 3: RRC CONNECTION SETUP COMPLETE (UE to RNC). RRC_CONN_ACC_COMP incremented. Active phase follows.
• Event 4: IU RELEASE COMMAND (CN to RNC). RRC_CONN_ACT_COMP incremented.
RRC Setup Complete Rate = (Sum of RRC_CONN_STP_COMP / Sum of RRC_CONN_STP_ATT) x 100 %
RRC Establishment Complete Rate = (Sum of RRC_CONN_ACC_COMP / Sum of RRC_CONN_STP_ATT) x 100 %
RRC Retainability Rate = (Sum of RRC_CONN_ACT_COMP / Sum of RRC_CONN_ACC_COMP) x 100 %
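As an illustration, the three RRC rates can be computed directly from the summed counters. The sketch below assumes the counters are available as a dictionary keyed by the counter names used in the formulas. The function names and the sample values are hypothetical and do not come from a live network.

```python
from typing import Dict, Optional


def percentage(numerator: int, denominator: int) -> Optional[float]:
    """Return numerator / denominator as a percentage, or None if undefined."""
    if denominator == 0:
        return None
    return 100.0 * numerator / denominator


def rrc_kpis(counters: Dict[str, int]) -> Dict[str, Optional[float]]:
    """Compute the three RRC KPIs from already summed counters."""
    return {
        "RRC Setup Complete Rate": percentage(
            counters["RRC_CONN_STP_COMP"], counters["RRC_CONN_STP_ATT"]),
        "RRC Establishment Complete Rate": percentage(
            counters["RRC_CONN_ACC_COMP"], counters["RRC_CONN_STP_ATT"]),
        "RRC Retainability Rate": percentage(
            counters["RRC_CONN_ACT_COMP"], counters["RRC_CONN_ACC_COMP"]),
    }


# Hypothetical counter sums for one measurement period.
sample = {
    "RRC_CONN_STP_ATT": 1000,
    "RRC_CONN_STP_COMP": 980,
    "RRC_CONN_ACC_COMP": 970,
    "RRC_CONN_ACT_COMP": 950,
}
for name, value in rrc_kpis(sample).items():
    print(name, value)
```

Returning None for a zero denominator keeps an idle measurement period from being reported as a 0 % success rate.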
RRC connection Phases
[Figure: RRC connection phases]
• Phases: Setup, Access, Active
• Successful outcomes: Setup Complete, Access Complete, Active Complete (Active Release)
• Failures counted against the attempts: Setup Failures and Blocking, Access Failures, Active Failures (RRC Drop)
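The failures of each phase can be approximated from the same counters that feed the KPI formulas. The sketch below is a simplification under a stated assumption: every connection that completes one phase but not the next is counted as a failure of the next phase, so connections still in progress at the end of the measurement period are counted as failures too. The function name and sample values are hypothetical.

```python
from typing import Dict


def phase_failures(counters: Dict[str, int]) -> Dict[str, int]:
    """Approximate per-phase failure counts as differences of counters."""
    return {
        # Attempts that never completed the setup phase.
        "Setup Failures, Blocking":
            counters["RRC_CONN_STP_ATT"] - counters["RRC_CONN_STP_COMP"],
        # Setups that never completed the access phase.
        "Access Failures":
            counters["RRC_CONN_STP_COMP"] - counters["RRC_CONN_ACC_COMP"],
        # Established connections that did not end with a normal release.
        "RRC Drop":
            counters["RRC_CONN_ACC_COMP"] - counters["RRC_CONN_ACT_COMP"],
    }


# Hypothetical counter sums for one measurement period.
sample_counters = {
    "RRC_CONN_STP_ATT": 1000,
    "RRC_CONN_STP_COMP": 980,
    "RRC_CONN_ACC_COMP": 970,
    "RRC_CONN_ACT_COMP": 950,
}
print(phase_failures(sample_counters))
```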
Other WCDMA network KPIs
RAB Setup Complete Rate = (Sum of RAB_STP_COMP / Sum of RAB_STP_ATT) x 100 %
RAB Establishment Complete Rate = (Sum of RAB_ACC_COMP / Sum of RAB_STP_ATT) x 100 %
RAB Retainability Rate = (Sum of RAB_ACT_COMP / Sum of RAB_ACC_COMP) x 100 %
CSSR = (Sum of RAB_ACC_COMP / Sum of RRC_CONN_STP_ATT) x 100 %
CCSR = (Sum of RAB_ACT_COMP / Sum of RRC_CONN_STP_ATT) x 100 %
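The "Sum of" in each formula means that the counters are first added up over the measurement periods (or network elements) of interest and only then divided. A minimal sketch of that order of operations follows, assuming each measurement period is delivered as a dictionary of counter values. The function names and the sample data are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Optional


def percentage(numerator: int, denominator: int) -> Optional[float]:
    """Return numerator / denominator as a percentage, or None if undefined."""
    if denominator == 0:
        return None
    return 100.0 * numerator / denominator


def call_kpis(periods: Iterable[Dict[str, int]]) -> Dict[str, Optional[float]]:
    """Sum the counters over all periods first, then compute the KPIs."""
    total: Counter = Counter()
    for period in periods:
        total.update(period)  # adds the counts key by key
    return {
        "RAB Setup Complete Rate":
            percentage(total["RAB_STP_COMP"], total["RAB_STP_ATT"]),
        "RAB Establishment Complete Rate":
            percentage(total["RAB_ACC_COMP"], total["RAB_STP_ATT"]),
        "RAB Retainability Rate":
            percentage(total["RAB_ACT_COMP"], total["RAB_ACC_COMP"]),
        "CSSR":
            percentage(total["RAB_ACC_COMP"], total["RRC_CONN_STP_ATT"]),
        "CCSR":
            percentage(total["RAB_ACT_COMP"], total["RRC_CONN_STP_ATT"]),
    }


# Two hypothetical measurement periods.
periods = [
    {"RRC_CONN_STP_ATT": 600, "RAB_STP_ATT": 500, "RAB_STP_COMP": 495,
     "RAB_ACC_COMP": 490, "RAB_ACT_COMP": 480},
    {"RRC_CONN_STP_ATT": 400, "RAB_STP_ATT": 400, "RAB_STP_COMP": 395,
     "RAB_ACC_COMP": 390, "RAB_ACT_COMP": 380},
]
for name, value in call_kpis(periods).items():
    print(name, value)
```

Summing before dividing weights each period by its traffic, which averaging the per-period percentages would not do.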
Fault Classification
Fault class: A-CRITICAL
Total or major outages that are not avoidable with a workaround solution.
Description: Critical (emergency duty contacted) problems severely affect service, capacity/traffic, billing, and maintenance capabilities and require immediate corrective action, regardless of time of day or day of the week, as viewed by the operator.
Examples:
• System restart, all links down
• Simultaneous restarts of active computer units
• More than 50 per cent of traffic handling capacity out of use
• Subscriber related network element functionality is not working

Fault class: B-MAJOR
The problem leads to degradation of network performance or the fault affects traffic randomly.
Description: Major problems cause conditions that seriously affect system performance, operation, maintenance, and administration and require immediate attention as viewed by the operator. The urgency is less than in critical situations because of a lesser immediate or impending effect on system performance, customers, and the customer’s operation and revenue.
Examples:
• Capacity/quality related functionality is not working as intended
• Problems seriously affecting end user service, but avoidable with a workaround solution
• Configuration changes (network, HW, and SW) are not working as intended
• Subscriber related functions are not working completely
• Performance measurement, alarm management or activation of a new feature fails
• Single restart of computer units

Fault class: C-MINOR
Minor fault not affecting operation or service quality.
Description: Other problems that the operator does not view as critical or major are considered minor. Minor problems do not significantly impair the functioning of the system or affect the service to customers. These problems are tolerable during system use.
Examples:
• Failures not seriously affecting traffic
• Errors in operating command syntax
• Cosmetic errors in operational commands or statistics output
• Minor errors in documentation
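The classification can be sketched as a small triage function. This is only an illustration under loud assumptions: the three input fields and the decision rules are a hypothetical simplification of the table (a capacity loss above 50 per cent with no workaround is treated as critical, any other service-affecting fault as major), whereas in practice the class is assigned as viewed by the operator.

```python
from dataclasses import dataclass
from enum import Enum


class FaultClass(Enum):
    CRITICAL = "A-CRITICAL"
    MAJOR = "B-MAJOR"
    MINOR = "C-MINOR"


@dataclass
class FaultReport:
    # Hypothetical fields chosen for this sketch only.
    capacity_out_of_use_pct: float
    workaround_available: bool
    affects_service: bool


def classify(report: FaultReport) -> FaultClass:
    """Assign a fault class using simplified rules derived from the table."""
    if report.capacity_out_of_use_pct > 50 and not report.workaround_available:
        return FaultClass.CRITICAL
    if report.affects_service:
        return FaultClass.MAJOR
    return FaultClass.MINOR


def is_kpi_triggered(fault_class: FaultClass) -> bool:
    """In this study, B-MAJOR faults are the KPI-triggered problems."""
    return fault_class is FaultClass.MAJOR


# Hypothetical report: service degraded, workaround exists, no capacity loss.
report = FaultReport(capacity_out_of_use_pct=0.0,
                     workaround_available=True,
                     affects_service=True)
print(classify(report).value, is_kpi_triggered(classify(report)))
```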
Framework for KPI-Triggered
Troubleshooting
• The Framework is designed for investigating and solving B-MAJOR level, i.e. “KPI-triggered”, faults.
• Before applying the Framework:
– The general alarm status of the network has been checked. No clear network alarms pointing to the root cause of the fault can be detected.
– Traces from the external interfaces of the RNC have been taken with a protocol analyser in order to record the fault scenario. An RNC internal trace has also been taken at the time the fault occurred.
– The basic fault scenario has been analysed and clarified.
[Figure: flowchart of the Framework, with decision and action nodes labelled A to R]
Decision nodes (each answered Yes or No):
• Is the problem new in the operator network?
• New SW, HW, parameters, UE model or feature introduced?
• Is the fault operator specific?
• Has average network load increased significantly and/or does the problem occur at a specific time of day?
Action nodes:
• Perform simulation of the fault in test bed. Does the fault still occur? (Yes/No)
• Perform simulation of the fault with reference conditions. Does the fault still occur? (Yes/No)
• Use RNC Performance Tester to generate load in test bed and perform analysis.
• Analyse and investigate the differences between the working and faulty conditions.
• Analyse the traces. Investigate fault scope.
Fault scopes:
• CN specific, RNC specific, Node B specific, Transmission specific, Service specific, Country specific, UE specific
Final actions:
• Analyse network element and interface specific alarms, parameters, capacity, logs and traces. Take specific actions depending on problem scope (refer to detailed Framework notes).
• In case of MVI environment, check IOT results and contact foreign vendor.
• Investigate own vendor’s default parameters and compare the implementation against 3GPP specifications.
• Compare own default parameters with the default parameters of other vendors.
• Execute air interface protocol analysis and drive tests.
Case: Increased AMR call drop rate
• A decrease in RAB Retainability Rate KPI for AMR
telephony service was experienced during the last
three months in an operator network.
• The decrease was around 2% on each RNC
compared to the time when the network was
performing well. Actions that had already been taken
with no positive effect:
– Soft reset for all Node Bs and for all RNCs
– Hard reset and re-commissioning of Node Bs
– Alarms checked and no major alarms found
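A KPI trigger of this kind can be expressed as a comparison of the current KPI value against a baseline from the period when the network was performing well. The sketch below assumes per-RNC RAB Retainability Rates are available as (baseline, current) pairs in per cent. The function name, the RNC names, the sample values and the 1.5 percentage point threshold are all hypothetical.

```python
from typing import Dict, List, Tuple


def degraded_rncs(kpi_by_rnc: Dict[str, Tuple[float, float]],
                  threshold_pp: float = 1.5) -> List[str]:
    """Return RNCs whose KPI dropped by at least threshold_pp points."""
    flagged = []
    for rnc, (baseline, current) in sorted(kpi_by_rnc.items()):
        if baseline - current >= threshold_pp:
            flagged.append(rnc)
    return flagged


# Hypothetical (baseline, current) RAB Retainability Rates in per cent.
retainability = {
    "RNC-1": (99.0, 97.0),   # drop of about 2 points, as in the case
    "RNC-2": (98.8, 98.7),   # within normal variation
}
print(degraded_rncs(retainability))
```

When every RNC is flagged, as in this case, the fault is unlikely to be specific to a single network element.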
Case: Increased AMR call drop rate
I. Node A: Is the problem new in the operator network? Yes.
II. Node C: New SW, HW, parameters, UE model or feature introduced? Yes.
III. Node E: Perform simulation of the fault in reference conditions. Does the fault still occur? No.
IV. Node G: Analyse and investigate the differences between the working and faulty conditions.
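The path taken in this case can be replayed as a small decision-graph walk. The sketch below encodes only the three transitions that this case confirms (A to C, C to E, E to G). The remaining branches of the Framework are deliberately left out rather than guessed, and the function and variable names are hypothetical.

```python
from typing import Dict, List, Tuple

# Node texts as given in the case walk-through.
NODES: Dict[str, str] = {
    "A": "Is the problem new in the operator network?",
    "C": "New SW, HW, parameters, UE model or feature introduced?",
    "E": "Perform simulation of the fault in reference conditions. "
         "Does the fault still occur?",
    "G": "Analyse and investigate the differences between the working "
         "and faulty conditions.",
}

# Only the transitions confirmed by this case; other branches are omitted.
TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("A", "Yes"): "C",
    ("C", "Yes"): "E",
    ("E", "No"): "G",
}


def walk(start: str, answers: List[str]) -> List[str]:
    """Follow the answers from the start node and return the visited nodes."""
    path = [start]
    node = start
    for answer in answers:
        key = (node, answer)
        if key not in TRANSITIONS:
            raise KeyError(f"Branch {key} is not covered by this sketch")
        node = TRANSITIONS[key]
        path.append(node)
    return path


for node in walk("A", ["Yes", "Yes", "No"]):
    print(node, "-", NODES[node])
```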
Case: Increased AMR call drop rate
• Solution
– The short term solution was that the
parameter for planned maximum downlink
transmission power of all the Node Bs in
the operator network was changed to the
default value of 34 dBm. In this way, the
problem disappeared in the operator
network.
– The long term solution was to implement a
fix of the bug into the next software
release of the Node B.
Results
• As a result of the research conducted for this Thesis, a Framework for KPI-triggered troubleshooting in live WCDMA networks was developed.
• The Framework is mainly targeted at WCDMA network equipment vendors, to help them in solving major service-affecting faults occurring in the live WCDMA networks of today.
• Troubleshooting cases from live WCDMA networks were solved using the Framework, in order to verify the results and test the applicability and practicality of the Framework.
Assessment of the results
• The applicability and relevance of the troubleshooting Framework were tested against three different fault cases from live WCDMA networks.
• The results were promising, since all three cases were successfully solved by utilising the Framework. The Framework was found to be practical and suitable for solving KPI-triggered problems in live WCDMA networks.
• However, the Framework was tested with a limited number of cases because of time and resource limitations. More extensive testing and verification with a large number of cases could reveal optimisations and improvements to the Framework.
• Still, the basic logic of the Framework was shown to be sound. The results presented in this study can be tested in the future against a larger number of cases in order to verify them with greater statistical reliability.
Exploitation of the results
• The results of this study will be used as source material for developing UTRAN troubleshooting competence and creating advanced learning solutions, targeted at troubleshooting experts and customer support engineers of one of the leading WCDMA network equipment vendors.
• Also, the results of the Thesis will be used as an input in the creation of customer documentation for UTRAN troubleshooting.
• There is also an intention to further test the relevance and reliability of the results of this Thesis by applying them in the 24/7 RAN technical support operator service of the equipment vendor in question.
Future Research
• The significance of Performance Indicator based
troubleshooting is increasing continuously in live
WCDMA networks.
• Once the PI and KPI specifications become more
mature, more extensive study of the most relevant
Performance Indicators used in WCDMA network
troubleshooting is essential.
• Also, there is a need to develop a Framework and
logic for solving emergency problems in WCDMA
networks.
• As the complexity of telecommunication networks grows, effective and efficient troubleshooting procedures are essential in order to manage the diversity of network technologies and the increasing quality requirements of the operators.