Master Thesis in Statistics and Data Mining
Dynamic Call Drop Analysis
Martin Arvidsson
Division of Statistics
Department of Computer and Information Science
Linköping University
Supervisor
Patrik Waldmann
Examiner
Mattias Villani
“It is of the highest importance in the art of detection to
be able to recognize, out of a number of facts, which are
incidental and which vital. Otherwise your energy and
attention must be dissipated instead of being
concentrated.” (Sherlock Holmes - Arthur Conan Doyle)
Contents

Abstract
Acknowledgments
1. Introduction
   1.1. Background
   1.2. Objective
   1.3. Definitions
2. Data
   2.1. Data sources
   2.2. Raw data
        2.2.1. Data variables
3. Methods
   3.1. Text Mining and Variable creation
   3.2. Sampling strategies
   3.3. Evaluation techniques
   3.4. Online drop analysis through classification
        3.4.1. Dynamic Logistic Regression
        3.4.2. Dynamic Model Averaging
        3.4.3. Dynamic Trees
   3.5. Drop description
        3.5.1. Association Rule Mining
   3.6. Technical aspects
4. Results
   4.1. Exploratory analysis
   4.2. Online classification
        4.2.1. Sampling strategies
        4.2.2. Dynamic Trees
        4.2.3. Dynamic Logistic Regression
        4.2.4. Summary of results
        4.2.5. Static Logistic Regression vs. Dynamic Logistic Regression
   4.3. Online drop analysis
        4.3.1. DMA posterior inclusion probabilities
        4.3.2. Evolution of odds-ratios and reduction in entropy
        4.3.3. Static Logistic Regression vs. Dynamic Logistic Regression
5. Discussion
6. Conclusions
A. Figures
   A.1. Results: Online classification
   A.2. Results: Online drop analysis
        A.2.1. Single dynamic logistic regression vs. Univariate DMA
        A.2.2. Significant covariates in interesting period
        A.2.3. Static vs. Dynamic Logistic Regression: covariate effects
B. Tables
   B.1. Results: Online classification
        B.1.1. Dynamic Trees
        B.1.2. Dynamic Logistic Regression
Bibliography
Abstract
This thesis sets out to analyze the complex and dynamic relationship between mobile phone call connections that terminate unexpectedly (dropped calls) and those that terminate naturally (normal calls). The main objective is to identify temporally discriminative features, so as to assist domain experts in troubleshooting mobile networks. For this purpose, dynamic extensions of logistic regression and partition trees are considered.

The data consist of information recorded in real time from mobile phone call connections, and each call is labeled by its category of termination. Characterizing features of the data that pose considerable challenges are: (i) class imbalance, (ii) high dimensionality, (iii) non-stationarity, and (iv) sequential arrival in a stream. To address the issue of class imbalance, two sampling techniques are considered. Specifically, an online adaptation of the random undersampling technique is implemented, as well as an extension (proposed in this thesis) that accounts for the possibility of a changing degree of imbalance. The results suggest that the former is preferable for this data, but that both improve the identification of the minority class (dropped calls).

Another characterizing feature of this dataset is that several of the covariates are temporally sparse. This is shown to cause problems in the recursive estimation step of the dynamic logistic regression model. Consequently, this thesis presents an extension that accounts for temporal sparsity, and it is shown that this extension allows for the inclusion of temporally sparse attributes and improves predictive capability.

A thorough evaluation of the considered models is performed, and it is found that the best model is the single dynamic logistic regression, achieving an area under the curve (AUC) of 99.96%. Based on odds ratios, posterior inclusion probabilities, and posterior model probabilities from the dynamic logistic regression, and reduction in entropy from the dynamic trees, an analysis of temporally discriminative features is performed. Specifically, two sub-periods of abnormally high call drop rate are analyzed in closer detail, and several interesting findings are made, demonstrating the potential of the proposed approach.
Acknowledgments
Several people deserve and have my deepest appreciation for their aid and support in making this thesis possible.

First, I would like to thank Ericsson for giving me the opportunity to work with them, as well as for providing the data for this thesis. Special thanks to my co-supervisors Paolo Elena and Henrik Schüller for, on the one hand, defining a really interesting problem and, on the other, providing good support. Thanks also to Leif Jonsson, who oversaw the thesis projects and provided valuable input. Another person who cannot be left out is domain expert Håkan Bäcks, who provided very useful insights about the data and the functionality of the network.

I would also like to thank my supervisor at Linköping University, Patrik Waldmann, who provided good advice and participated in many fruitful discussions.

Finally, I would also like to thank my opponent, Andreea Bocancea, for her improvement suggestions. These undoubtedly strengthened the subsequent versions of the thesis.
1. Introduction
1.1. Background
Besides selling hardware and software, network equipment providers (NEPs) also provide support to mobile network operators (MNOs). One imperative support-related task is troubleshooting, which consists of detecting problems in the network and understanding their causes. This task poses considerable challenges, not just because of the complexity of the systems, but also because of the enormous quantities of information that are collected from the networks every day.
In this thesis, troubleshooting will be considered from a statistics and data analysis point of view. More specifically, this thesis sets out to analyze the complex and dynamic relationship between dropped calls and normal calls, where a dropped call may be defined as a call that ends without the intention of either participant. While a certain number of dropped calls are expected, inevitable, and uninteresting, there are also dropped calls that are unexpected and may be caused by system malfunctions. Hence, from the perspective of the NEPs, it is of great interest to quickly identify and understand the causes of dropped calls, so that any underlying problems can be correctly addressed. In periods of abnormally high call drop rates (the percentage of calls that are dropped), the identification of drop causes is especially important. System degradation can have a wide range of causes and explanations, so the problem becomes quite complex. Two examples of high-level causes are: (i) system updates in the network, and (ii) new phones or software updates in already existing phones. In this thesis, statistical and machine learning methods are applied to identify low-level indicators of dropped calls, which can later be interpreted by domain experts to put potential problems into context.
The issue of detecting problems in mobile networks has been considered with a range of approaches in the literature, in particular within the subdisciplines of anomaly detection, fault detection, and fault diagnosis. A substantial amount of research has been done in these areas, and there are quite a few papers that consider these problems within the context of mobile networks, for example Brauckhoff et al. (2012), Watanabe et al. (2008), Cheung et al. (2005), and Rao (2006). The bulk of these papers are concerned with identifying problems at the level of defined geographical regions, and the data describe the characteristics of particular regions (cells/radio base stations or radio network controllers) rather than individual calls, as is the case in this thesis. A common approach is to work within the unsupervised framework, where the detection of a fault or anomaly is often the result of a setup whereby one tracks and/or models a selected number of features to gain an idea of the normal behavior; when large deviations from this normal behavior are observed, for instance through threshold violations, as in Cheung et al. (2005) and Rao (2006), an anomaly or fault is flagged. Various techniques have been explored to extract and describe anomalies and faults: one approach is to apply association rule mining, as in Brauckhoff et al. (2012).
There are relatively few papers that, within the context of mobile networks, consider the problem of fault detection or fault diagnosis in a supervised setting. In one of the exceptions, Khanafer et al. (2006), a naive Bayes classifier is considered for predicting a set of labeled faults. Zhou et al. (2013) and Theera-Ampornpunt et al. (2013) also work within the supervised framework, with similar data (identical response and similar input) to that of this thesis, but with a slightly different objective: to perform early classification, such that proactive management can be implemented to deter certain types of calls from dropping. The classification methods considered in these two papers are AdaBoost and Support Vector Machines.
A limitation of the aforementioned approaches is that they, to a varying degree, implicitly assume a stationary and static environment, and mobile networks are in general not static systems: as previously mentioned, internal and external modifications and updates occur irregularly. This motivates a dynamic approach, rather than a static one. An additional limitation of the aforementioned approaches is that they also, to some extent, assume that the data can be stored. In the context of processing data from mobile networks, however, this assumption is problematic, since the volume of the data that is processed every day is astronomical: in 2014, Ericsson, the company at which this work was carried out, had 6 billion mobile subscribers, with a global monthly traffic of ∼2400 petabytes (Ericsson, 2014). While it may not be feasible to thoroughly analyze all the data, it does appear intuitively appealing to be able to analyze more data for the same cost, and thus approaches with such characteristics ought to be preferable. A research discipline that has gained a lot of attention recently, and which deals with limitations of the sort described above, is online learning. In this thesis, a framework centered around online learning is proposed for the problem of predicting dropped calls and explaining their causes.
In addition to being non-stationary, the data are also greatly imbalanced with respect to the response variable. To address the challenges that come with imbalanced data, sampling techniques are explored. In particular, an adaptive undersampling scheme is developed, where less data is sampled during periods with few dropped calls, and more data is sampled during periods with an increased number of dropped calls.
Another challenge is that several attributes in the data are temporally sparse. This presents a limitation for one of the selected methods. Consequently, in this thesis, an extension of the forgetting factor framework originally proposed by McCormick et al. (2012) is developed and evaluated.
1.2. Objective
The aim of this master's thesis is to develop a framework that can identify temporally discriminative features for explaining dropped calls. A key challenge is that the underlying distribution of the data is non-stationary, and changes are expected to occur irregularly and unpredictably over time. Consequently, this thesis sets out to tackle this problem using an online learning approach, wherein dynamic extensions of logistic regression and partition trees are explored. Another (not completely orthogonal) aim of this thesis is to predict dropped calls with high precision. This latter objective is motivated by the fact that only information recorded up to a certain time before call termination is used, and as such, it may be thought of as a first step in exploring the possibilities of early classification for this type of data. Finally, to evaluate the decision to use the dynamic approach, a set of scenarios is simulated in which the best dynamic classifier is compared to its static equivalent, both in terms of predictive performance and exploratory insights.
1.3. Definitions
The following definitions are needed to fully understand the context of the problem.
Troubleshooting
Troubleshooting is an approach to problem solving. Specifically, it is the systematic search for the source of a problem, so that the problem can be solved.
User Equipment (UE)
User equipment (UE) consists of phones, computers, tablets, and other devices that connect to the network.
Network Equipment Provider (NEP)
Companies that sell products and services to communication service providers, such
as mobile network operators, are referred to as network equipment providers (NEPs).
Mobile Network Operator (MNO)
Companies that provide wireless communication services, and that either own or control the necessary elements to sell and deliver services to end users, are referred to as mobile network operators (MNOs). Examples of such companies are Telia, Tele2, and Telenor.
Normal calls
Normal calls refer to connections between user equipment (UE) and the network
where the connection terminates as expected.
Dropped calls
Dropped calls refer to connections between user equipment (UE) and the network,
with the outcome of unexpected termination.
UMTS Network
A Universal Mobile Telecommunications System (UMTS), also referred to as 3G, is a third-generation mobile cellular system for telecommunication networks. The system supports standard voice calls, mobile internet access, and simultaneous use of both voice calls and internet access. Although 4G has been introduced, 3G remains the most widely used standard for mobile networks.
Radio Network Controller (RNC)
The Radio Network Controller (RNC) is the governing element in the UMTS network and is responsible for controlling the radio base stations that are connected to
it.
Radio Base Station (RBS)
Radio base stations (RBS) constitute the elements of a network that provide the connection between UE and the RNC.
2. Data
2.1. Data sources
The data were supplied by Ericsson AB and consist of machine-produced trace logs. These so-called trace logs were originally collected from a lab environment at the Ericsson offices in Kista. As such, the information contained in the data does not reflect the behavior of any real people, but rather programmed systems. However, these systems are programmed to reflect human behavior: a simulated call may, for instance, consist of texting, browsing the internet, physical movements, and so on. Moreover, even though it is a lab environment, the implemented system technology is equivalent to that used in most live networks: the so-called Universal Mobile Telecommunications System (UMTS), also known as 3G.
Introduced in 2001, 3G is the third generation of mobile systems for telecommunication networks, and supports standard voice calls, mobile internet access, and simultaneous use of both. Although 4G has been introduced, 3G still remains the most widely used system for mobile networks. The 3G network is structured hierarchically and by geographical region. More specifically, the network consists of three primary, interacting elements: the user equipment (UE), radio base stations (RBS), and radio network controllers (RNC). At the bottom of the hierarchy are the cells, which define the smallest geographical regions in the network. RBSs are deployed such that they may be responsible for multiple cells, and as described in sec. 1.3, an RBS acts similarly to a router: it provides the connection between the UE and the RNC. The RNC is the governing element of the 3G network and is responsible for managing the division of resources at lower levels, for example, which RBS a particular UE should use.
2.2. Raw data
For every call that is initiated, a trace log is produced. The contents of these logs are recorded in real time and contain information that corresponds to signals sent between the user equipment (UE), radio base stations (RBS), and the radio network controller (RNC). These signals may contain connection details, configuration information, measurement reports, failure indications, and others. This information, originally formatted as text, was first transformed into a suitable format (as described in sec. 3.1) and later used as the input to the statistical models evaluated in this thesis. Finally, for each call, there is a recorded outcome, {normal, dropped}, which defines the response variable. More details about specific variables follow in sec. 2.2.1.
The period for which the data were collected is January 26, 2015 to April 10, 2015, corresponding to approximately two and a half months' worth of data. During this period, a total of 7,200 dropped calls were recorded. The total number of normal calls in the same period was much greater: 670,000. That is, approximately 99% of the calls terminated as expected (normal), and only 1% terminated unexpectedly (dropped). Datasets with this characteristic are often referred to as imbalanced in the machine learning and statistics literature. For classifiers that seek to separate two or more classes, imbalance can be problematic. In sec. 3.2, techniques for addressing the challenges accompanying imbalanced datasets are described.
In Figure 2.1, a time-series plot is presented, displaying the number of dropped calls over the period of interest. Note that the time scale of the plot is not in minutes, hours, or days: instead, the data were divided into 100 equally large subsets, and the sum was then calculated within each subset. The rationale for presenting the data like this, rather than in relation to actual time, is twofold: (i) this lab data do not have any periodic dependencies, and (ii) an unequal number of calls was traced during different periods, and during some days, no calls were recorded.
Figure 2.1.: Number of dropped calls as divided into 100 (ordered) equally large subsets.
As one may observe in Figure 2.1, the number of drops is approximately constant for most of the period, with no apparent trend. There are, however, multiple time periods in which the number of drops increases quite drastically. Intuitively, these periods represent some form of degradation in the system with systematic errors. One of the goals of this thesis is to identify which factors were important during such periods. In an online implementation, such a framework could potentially be used to detect causes of problems early on.
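The subset counts plotted in Figure 2.1 can be sketched in a few lines. The function name and the 0/1 encoding of outcomes (1 = dropped, 0 = normal) are assumptions for illustration, not taken from the thesis:

```python
def drops_per_subset(labels, n_subsets=100):
    """Split the ordered call outcomes into n_subsets equally large
    subsets and count the dropped calls (label 1) within each."""
    size = len(labels) // n_subsets
    return [sum(labels[i * size:(i + 1) * size]) for i in range(n_subsets)]
```

Plotting the returned counts against the subset index 0..99 yields a display of the kind shown in Figure 2.1.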
2.2.1. Data variables
From the original trace logs, a total of 188 attributes were initially extracted. Exploratory analysis revealed that quite a large proportion of them were redundant, which resulted in a final input space of 122 attributes. To reduce the degree of distortion from events occurring a long time prior to the termination of the calls, it was decided (together with domain experts at Ericsson) that only the last 20 seconds of each call should be kept for analysis. Note that the main contributing factor for including a particular variable was not its known significance, but rather its intrinsic and potential relevance in terms of future events (such that observing changes in its degree of relevance is useful for troubleshooting the network). In this section, a brief summary and explanation of each category of variables is presented.
2.2.1.1. Cell Ids
As described in the previous section, cells are defined geographical areas of the network, and hence, in a model context, these variables contain information about the location of the call events. From the considered period, 17 cell ids were recorded, resulting in 17 binary dummy variables. In a real setting with live networks, the number of cells would be much larger. In such a situation, clustering methods could potentially be used to merge cells that are (i) geographically close to each other and (ii) similar by some relevant metric, in order to reduce the number of variables to include and evaluate in the model.
2.2.1.2. tGCP
GCP, short for Generic Connection Properties, describes the range of possible connection properties that a call may possess. tGCP, or target-GCP, denotes the connection properties that are targeted or requested by a particular device at a particular time point. A maximum of 31 connection properties can be possessed by a particular call. The presence or absence of a particular connection property is registered as 1 or 0, respectively. In this work, the last set of registered connection properties for each call is used as input to the model, in order to capture the connection properties that were requested at the time of the drop. The 31 connection properties are treated as binary dummy variables.
2.2.1.3. Trace 4 Procedures
Trace, in the context of this dataset, refers to the process of monitoring the execution of RNC functions that are relevant to a particular call. Traces are grouped such that similar events (executions of RNC functions) are traced by the same trace group. For the considered STP, over the relevant time period, three trace groups were observed: trace1, trace3, and trace4, the latter being (by far) the most frequent one. Trace4 describes events such as the Importation and Deportation of processes and program procedures. More specifically, trace4 can be divided into 37 different events, referred to as procedures. For example, procedure 10 describes the Importation or Deportation of a “soft handover” event. In this thesis, these procedures are treated as binary dummy variables.
2.2.1.4. UeRcid
UeRcid, short for UE Radio Connection Id, defines, as the name suggests, the type of radio connection that a particular UE has activated. For the considered data, approximately 150 different such ids exist. In this work, we group these by their inherent properties. Specifically, we differentiate between PS (Packet Switched), CS (Circuit Switched), SRB (Signaling Radio Bearer), and Mixed (a combination of the aforementioned). PS is the connection type concerned with data traffic, whilst CS is the one concerned with conversation/speech. SRB is the result of the initial connection establishment, as well as the release of the connection. The presence or absence of a particular radio connection is registered as 1 or 0, respectively.
2.2.1.5. evID
EvID, short for Event Id, is found in the measurement reports of the trace logs and consists of reports related to radio quality, signal strength, and others. As such, a specific evID defines a specific type of such a report or event. Consider, for instance, “evID=e2d”, which defines “Quality of the currently used frequency is below a certain threshold”. In this work, these evIDs are treated as binary dummy variables.
3. Methods
In this chapter, the framework and subsequent methods used in this thesis are explained. The framework is divided into four parts. The first step, text mining and variable creation, is the step in which the data are transformed from machine-generated text to structured matrices apt for statistical methods. The second step, sampling, addresses the challenges of imbalanced data. The third step of the framework is the main part and consists of dynamic classification of streaming data. The fourth and final part of the framework seeks to derive intuitive descriptions of the results obtained from step 3, through the application of association rules.
3.1. Text Mining and Variable creation
As previously mentioned, the original format of the data was text, so that direct input to statistical methods was not possible. To address this, techniques commonly associated with the area of text mining were applied. More specifically, text variables were created and defined as binary dummy variables. For instance, if “configuration request” appears in a particular call, then the value of that variable is “1”. Initially, the count of specific words was also considered, but it was found that it did not add any discriminative value, and it was consequently dismissed.
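A minimal sketch of this dummy-variable step; the vocabulary terms and the helper name are hypothetical, not taken from the thesis:

```python
def make_dummies(log_text, vocabulary):
    """Map one raw trace log to binary indicator variables:
    1 if the term appears anywhere in the log, else 0."""
    return {term: int(term in log_text) for term in vocabulary}

# Hypothetical vocabulary and log excerpt:
vocab = ["configuration request", "measurement report", "failure indication"]
log = "10:32:01 configuration request ... 10:32:04 measurement report"
features = make_dummies(log, vocab)
# features["configuration request"] == 1; features["failure indication"] == 0
```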
Some numerical measurements were also found in the logs. These do, however, (i) not occur in all of the logs, and (ii) they are not missing at random: some measurements are only triggered under certain circumstances. To cope with this type of missing data, discretization techniques were applied such that categorical variables could be derived from the original numerical ones (including a category 'missing'). Specifically, the CAIM discretization algorithm, proposed by Kurgan and Cios (2004), was used. For a continuous-valued attribute, the CAIM algorithm seeks to divide its range of values into a minimal number of discrete intervals, whilst at the same time minimizing the loss of class-attribute interdependency. It is beyond the scope of this thesis to cover the details of this algorithm; we refer to Kurgan and Cios (2004) for more details.
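The CAIM algorithm itself is not reproduced here; the sketch below only illustrates the general pattern of mapping a numeric measurement to a categorical variable with an explicit 'missing' level, assuming the cut points have already been produced by some discretization algorithm (the cut-point values are invented for illustration):

```python
def discretize(value, cut_points):
    """Map a numeric measurement to a categorical label, using an
    explicit 'missing' level instead of imputing a number."""
    if value is None:
        return "missing"
    for i, cp in enumerate(cut_points):
        if value <= cp:
            return f"bin_{i}"
    return f"bin_{len(cut_points)}"

cuts = [-15.0, -10.0, -5.0]        # hypothetical cut points
print(discretize(-12.3, cuts))     # bin_1
print(discretize(None, cuts))      # missing
```

Treating missingness as its own category preserves the information that a measurement was never triggered, which, as noted above, is itself informative.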
3.2. Sampling strategies
Sampling, in the most general sense of the word, is concerned with selecting a subset of observations from a particular population. In the context of classification, sampling techniques are popular for dealing with the issue of class imbalance. An imbalanced dataset is defined as one in which the distribution of the response variable is skewed towards one of the classes (He and Garcia, 2009). The motivation for considering sampling techniques in this thesis is three-fold: (i) due to limitations in memory and computational power, and the overwhelming size of the unformatted source (txt) files, only a limited number of logs could feasibly be extracted from these source files; (ii) for imbalanced datasets, classifiers tend to learn the response classes unequally well, where the minority class is often ignored, such that the separation capability becomes poor (Wang et al., 2013); and (iii) sampling techniques have been shown to be effective for addressing class imbalance in other works (He and Garcia, 2009).
Sampling is a well-researched subject, and a wide range of techniques have been proposed over the years. The great bulk of these techniques are, however, limited to environments where the data are assumed to be fixed and static. For example, the random undersampling technique has a simple and intuitive appeal: observations from the majority class are selected at random and removed until the ratio between the response classes has reached a satisfactory level. Japkowicz et al. (2000) evaluated this simpler technique, compared it to more sophisticated ones, and concluded that random undersampling held up well. The issue of online class imbalance learning has so far attracted relatively little attention (Wang et al., 2013). Most of the proposed methods for addressing non-static environments assume that the data arrive in batches (Nguyen et al., 2011), and are thus not directly applicable to online learning. One of the first papers to address the issue of imbalanced data in an online learning context was Nguyen et al. (2011). In it, a technique, here referred to as ORUS, was proposed that allows the analyst to choose a fixed rate at which undersampling should occur: observations from the minority class are always accepted for inclusion, whilst observations from the majority class are included only with a fixed probability; in other words, random undersampling in an online context. This simple implementation is described more formally in equation (3.1), where q is the parameter determining the fixed sampling rate. Nguyen et al. (2011) show that this approach is able to provide good results for an online implementation of the naive Bayes classifier:
\[
\text{ORUS:} \qquad p(\mathrm{inclusion}_{x_t}) =
\begin{cases}
1 & y_t = 1 \\
q & y_t = 0
\end{cases}
\tag{3.1}
\]
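Equation (3.1) translates directly into code. The function name and the injectable random source are conveniences for illustration, not part of the original proposal:

```python
import random

def orus_include(y_t, q, rng=random.random):
    """ORUS, eq. (3.1): decide whether observation t enters the stream.
    Minority examples (y_t = 1) are always included; majority examples
    (y_t = 0) are included with fixed probability q."""
    if y_t == 1:
        return True
    return rng() < q
```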
This technique does, however, not account for the possibility of changing levels of imbalance over time; it assumes a fixed rate, known a priori. In Wang et al. (2013), an extension was proposed in which the degree of imbalance is continuously estimated using a decay factor, such that the inclusion probability is allowed to change over time.

In this thesis, a simple adaptive sampling scheme, sharing traits with both Wang et al. (2013) and Nguyen et al. (2011), is developed. Specifically, a sliding window is used to estimate the local imbalance at different time points, such that the undersampling rate (inclusion probability) of the majority class is allowed to change over time. If the proportion of dropped calls during a particular period is relatively high, the inclusion probability for normal calls is increased. If, on the other hand, the proportion of dropped calls is relatively low, the inclusion probability for normal calls is decreased. More formally, as in Nguyen et al. (2011), we let the analyst select a constant q: the baseline expectation of the class imbalance prior to observing any data. In the case of mobile networks, the call drop rate is well understood, such that this "pseudo prior" can be set with confidence. The idea is then to use this baseline expectation to construct the sliding window: w = 1/q. This sliding window moves incrementally, one observation at a time, and estimates the local imbalance rate at every time point from the number of minority observations found in that particular time window. This is described mathematically in equation (3.2), where q is the constant describing the baseline expectation:
\[
\text{O-ARUS:} \qquad p(\mathrm{inclusion}_{x_t}) =
\begin{cases}
1 & y_t = 1 \\[4pt]
\dfrac{\sum_{i=t-w}^{t-1} y_i}{w} & y_t = 0 \ \text{and}\ \sum_{i=t-w}^{t-1} y_i > 1 \\[8pt]
q & y_t = 0 \ \text{and}\ \sum_{i=t-w}^{t-1} y_i \le 1
\end{cases}
\tag{3.2}
\]
For instance, let's consider a scenario in which the analyst has set a baseline expectation of 1%: the sliding window would then become w = 1/0.01 = 100. Consider further that we stand at time point t, and that in the past 100 observations, 3 observations from the minority class have been encountered. The inclusion probability for a majority class observation at time point t would then be equal to 3/100 = 3%.
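A sketch of the O-ARUS scheme of equation (3.2), using a sliding window of length w = 1/q. The names are illustrative, and the exact order in which the window is updated relative to the inclusion decision is an assumption:

```python
from collections import deque

def make_oarus(q):
    """Return a function computing the O-ARUS inclusion probability
    of eq. (3.2), with sliding-window length w = 1/q."""
    w = int(round(1.0 / q))
    window = deque(maxlen=w)          # the w most recent labels

    def inclusion_probability(y_t):
        minority = sum(window)        # minority observations in the window
        if y_t == 1:
            p = 1.0                   # minority calls are always included
        elif minority > 1:
            p = minority / w          # local imbalance estimate
        else:
            p = q                     # fall back to the baseline expectation
        window.append(y_t)            # slide the window forward
        return p

    return inclusion_probability
```

With q = 0.01 (so w = 100) and 3 minority labels among the last 100 observations, a majority observation receives inclusion probability 3/100 = 3%, matching the worked example above.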
3.3. Evaluation techniques
The question of how one should evaluate a classifier depends on the data, and the
objective of the classification. What is of particular interest in this thesis is to
identify and discriminate positive occurrences from negative ones, i.e. identifying
and separating ’dropped calls’ from ’normal calls’ - largely because the main objective of this thesis is to explore what factors contribute towards the classification
of ’dropped calls’. The most commonly used metric for evaluating classifiers is the
Accuracy measure:
Accuracy = (TP + TN) / (TP + TN + FP + FN)    (3.3)
where TP = True Positives, TN = True Negatives, FP = False Positives,
FN = False Negatives. It simply describes the total number of correct predictions as a ratio of the total number of predictions.
In cases where the number of positive and negative instances differ greatly (imbalanced data), the accuracy measure can be misleading. For instance, with an
imbalance-ratio of 99:1, it would be possible to achieve an accuracy of 99% simply
by classifying all observations as negative instances. To avoid such pitfalls, a variety of evaluation metrics have been proposed: one being AUC, which represents the
area under the ROC curve (Hanley and McNeil, 1982). The ROC curve displays the
relationship between the true positive rate, TPR (Sensitivity) and the false positive
rate, FPR (1 − Specif icity). More specifically, the ROC curve is constructed by
considering a range of operating points or decision thresholds, and for each such point
(or threshold), it calculates the true positive rate and false positive rate. The intersection of these two scores, at each threshold, produces a dot in a two-dimensional
display. Between the plotted dots, a line is drawn: this constitutes the ROC curve
(Obuchowski, 2003).
Sensitivity = TPR = TP / (TP + FN)    (3.4)

1 − Specificity = FPR = FP / (FP + TN) = 1 − TN / (TN + FP)    (3.5)
AUC can be interpreted as the probability that a randomly selected observation
from the positive class is ranked higher than a randomly selected observation from
the negative class - in terms of belonging to the positive class.
It should be emphasized that, in the context of online learning, and hence for the
methods considered in this thesis, there is no training or test dataset: as the
models are constructed and updated sequentially, we instead evaluate the one-step-ahead predictions of the models. The first papers to address the issue of imbalanced
online learning, Nguyen et al. (2011) and Wang et al. (2013), proposed the use of
the G-mean as an evaluation metric. G-mean is short for geometric mean and is
constructed as follows (Powers, 2011):
G-mean = sqrt(precision × recall)    (3.6)

where

precision = TP / (TP + FP),    recall = TP / (TP + FN)    (3.7)
AUC and G-mean will constitute the main metrics upon which comparisons
and evaluations are based in this thesis.
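For concreteness, the measures above can be computed directly from labels and scores. A small sketch (the function names are our own); AUC is computed through its rank interpretation, with ties counted as one half:

```python
import math

def confusion(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def g_mean(y_true, y_pred):
    """Geometric mean of precision and recall, eqs. (3.6)-(3.7)."""
    tp, tn, fp, fn = confusion(y_true, y_pred)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return math.sqrt(precision * recall)

def auc(y_true, scores):
    """P(a random positive is scored above a random negative)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```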
3.4. Online drop analysis through classification
Classification, as a statistical framework, defines the process of modeling the relationship between a set of input variables, X, and an outcome variable, y, where the
outcome variable is discrete. As the main objective of this thesis is to study the
relationship between ’dropped calls’ and ’normal calls’, it is naturally framed as a
classification problem - with the response: {Dropped call, Normal call}.
An extensive number of classification techniques have been proposed over the years,
and what amounts to the “best one” is often data and task specific. In the case of
this thesis, there are four fundamental criteria that a classifier must meet: (i) it must
be transparent, in the sense that insight into which variables contribute to a certain
outcome is required; (ii) it must be able to cope with high-dimensional input, as
there is a great deal of interesting information recorded for each call; (iii) it must be
able to handle the sequential nature of the data, i.e. that data arrives continuously
in a stream; and finally (iv) it should be adaptive and able to capture local
behaviors, since - as explained before - the causes of drops are expected to change
over time. These criteria drastically reduce the space of apt classifiers: popular
techniques such as Support Vector Machines and Artificial Neural Networks are good
alternatives to deal with complex high-dimensional input (and may be extended to
deal with streaming data), but they fail on the important issue of transparency in
regards to variable importance.
The sequential and adaptive aspects described above are naturally addressed in
the field of online learning, which assumes that data is continuously arriving and
may not be stationary. Hence, an ideal intersection would be an online learning
classifier that is transparent and can handle higher dimensions. Two such techniques
were identified, the Dynamic Logistic Regression and Dynamic Trees. The static
versions of these two, the logistic regression and partition trees, are known for their
transparency in regards to variable contribution, and hence the dynamic extensions
are appealing for this work.
3.4.1. Dynamic Logistic Regression
This technique, originally proposed by Penny and Roberts (1999), extends the standard logistic regression by considering an additional dimension: time. Through a
Bayesian sequential framework, the parameter estimates are recursively estimated,
and hence allowed to change over time. The particular version of the dynamic logistic regression applied in this work follows McCormick et al. (2012), and is
described below. But first, let us consider what in this thesis is referred to as the static logistic regression.
3.4.1.1. Static Logistic Regression
The static logistic regression, or just logistic regression, is a technique for predicting
discrete outcomes. It was originally developed by Cox (1958), and still remains
one of the most popular classification techniques. Logistic regression has several
attractive characteristics, in particular its relative transparency, and the way in
which one is able to evaluate the contribution of the covariates to the predictions.
Logistic regression is a special case of generalized linear models, and may be seen
as an extension of the linear regression model. Since the dependent variable is
discrete, or more specifically Bernoulli distributed, it is not possible to derive the
linear relationship between the response and the predictors directly, and hence a
transformation is needed.
y ∼ Bernoulli(p)
In the case of the logistic regression, a logit-link is used for the purpose of transformation. Consider the logistic function in equation (3.8):
F(x) = 1 / (1 + e^{−(β_0 + β_1 x_1 + ...)})    (3.8)
Where the exponent describes a function of a linear combination of the independent
variables. The logit-link is derived through the inverse of the logistic function, as in
equation (3.9):
logit(p) = g(F(x)) = ln[ F(x) / (1 − F(x)) ] = β_0 + β_1 x_1 + ... = x^T θ    (3.9)
3.4.1.2. State-space representation
Given the objective of exploring temporal significance of independent variables, a
natural extension of the static logistic regression model is to add a time dimension.
As in McCormick et al. (2012), we do so by defining the logistic regression through
the Bayesian paradigm, and by applying the concept of recursive estimation: this
allows sequential modeling of the data, and - what in the literature commonly is
referred to as - online learning. Equation (3.9) is hence updated to:
logit(p_t) = x_t^T θ_t    (3.10)
Notice the added subscript t. The recursive estimation is computed in two steps:
the prediction step and the updating step:
Prediction step:
At a given point in time, (t), the posterior mode of the previous time step (t − 1)
is used to form the prior for time (t). The parameter estimates at time (t) are
hence based on the observed data up to and including time (t − 1). Using these estimates, a
prediction of the outcome at time (t) is made.
More formally, we let the regression parameters θ_t evolve according to the state equation θ_t = θ_{t−1} + δ_t, where δ_t ∼ N(0, W_t) is a state innovation. That is, the parameter
estimates at time (t) are based on the parameter estimates at time (t − 1) plus a
delta term. Inference is then performed recursively using Kalman filter updating.
Suppose that, for the set of past outcomes Y^{t−1} = {y_1, ..., y_{t−1}}:
θ_{t−1} | Y^{t−1} ∼ N(θ̂_{t−1}, Σ̂_{t−1})
The prediction equation is then formed as:
θ_t | Y^{t−1} ∼ N(θ̂_{t−1}, R_t)    (3.11)

where

R_t = Σ̂_{t−1} / λ_t    (3.12)
λ_t is a forgetting factor, and is typically set slightly below 1. The forgetting factor
acts as a scaling factor on the covariance matrix from the previous time point,
in order to calibrate the influence of past observations. The concept of using forgetting
factors for this particular purpose is quite common in the area of dynamic modeling,
and there has been a range of proposed forgetting strategies. For a review, see
(Smith, 1992). In this work, we apply the adaptive forgetting scheme proposed by
McCormick et al. (2012), which allows the amount of change in the model parameters
to change over time - an attractive feature, considering the complex dynamics of the
mobile network systems. More about the specifics of the forgetting factor later in
this section.
Updating step:
The prediction equation in (3.11) is, together with the observation arriving at time
(t), used to construct the updated estimates. More specifically, having observed yt ,
the posterior distribution of the updated estimate θt is:
p(θ_t | Y^t) ∝ p(y_t | θ_t) p(θ_t | Y^{t−1})    (3.13)
where p(y_t | θ_t) is the likelihood at time (t), and the second term is the prediction
equation (which now acts as a prior). Since the Gaussian distribution is not the conjugate prior of the likelihood function in logistic regression, the posterior is non-standard,
and there is no closed-form solution to equation (3.13). Consequently, McCormick
et al. (2012) approximate the right-hand side of equation (3.13) with a normal distribution, as is common practice. More formally, θ̂_{t−1} is used as a starting value,
and the mean of the approximating normal distribution at time (t) is then:

θ̂_t = θ̂_{t−1} − D²l(θ̂_{t−1})^{−1} Dl(θ̂_{t−1})    (3.14)
where the second and third terms on the right-hand side are the second and first derivatives of l(θ) = log p(y_t|θ)p(θ|Y^{t−1}) respectively, i.e. the logarithm of the likelihood
times the prior. The variance of the approximating normal distribution, which is
used to update the state variance, is estimated using:

Σ̂_t = {−D²l(θ̂_{t−1})}^{−1}    (3.15)
In McCormick et al. (2012), a static (frequentist) logistic regression is used in a training period to obtain reasonable starting points for the coefficient estimates.
Since the data used in this thesis is sparse with regard to several of
the input variables, this approach cannot straightforwardly be implemented: for
some of the covariates, no or very few occurrences are recorded
during the first part of the data. Consequently, we here apply a pseudo-Bayesian
framework, introducing two pseudo priors (mean, variance): θ_0, σ_0², for every coefficient. If no occurrences are observed during the training period, these priors are
simply not updated.
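One predict/update cycle of the dynamic logistic regression (equations 3.11 to 3.15) can be sketched as follows, assuming a single fixed forgetting factor and taking the Newton step of (3.14) from the prior mean θ̂_{t−1}, where the gradient of the Gaussian prior vanishes; this is an illustrative sketch, not the thesis implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dlr_step(theta, Sigma, x, y, lam=0.99):
    """One predict/update cycle: returns (theta_t, Sigma_t, p_t)."""
    # prediction step, eq. (3.12): inflate the covariance, R_t = Sigma / lambda
    R = Sigma / lam
    p = sigmoid(x @ theta)                # one-step-ahead prediction
    # updating step, eq. (3.14): one Newton step on
    # l(theta) = log p(y|theta) + log prior, evaluated at the prior mean
    grad = x * (y - p)                    # prior gradient is zero at its mean
    hess = -p * (1 - p) * np.outer(x, x) - np.linalg.inv(R)
    theta_new = theta - np.linalg.solve(hess, grad)
    Sigma_new = np.linalg.inv(-hess)      # eq. (3.15)
    return theta_new, Sigma_new, p
```

Feeding an observation with y_t = 1 from a zero prior mean moves the coefficient estimates upward, with the step length governed by the inflated covariance R_t.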
The forgetting factor, λ
In Raftery et al. (2010), the predecessor to McCormick et al. (2012), a forgetting
scheme where λ is a fixed constant was introduced; specifically, they set
λ = 0.99. It is noted that this constant ought to be determined based on the belief
about the stability of the system. If the process is believed to be more volatile and non-stationary, a smaller λ is preferable, since the posterior update at each time-point
then weighs the likelihood - relative to the prior - higher, and hence the parameter
estimates are more locally fitted and updated more rapidly. More formally, this
forgetting specification implies that an observation encountered j time-points in the
past is weighted by λ^j (Koop and Korobilis, 2012). For instance, with λ = 0.99, an
observation encountered 100 time-points in the past receives approximately 37% as
much weight as the current observation.
McCormick et al. (2012), in addition to extending (Raftery et al., 2010) from dynamic linear regression to dynamic binary classification, also proposed a new adaptive forgetting scheme. The forgetting factor, λt (now defined with a subscript: t), is
extended such that it is allowed to assume different values at different time-points.
This has the effect of allowing the rate of change in the parameters to change over
time. The predictive likelihood is used to determine the λ to be used at each time-point. More specifically, the λ_t that maximizes the following expression is selected:

λ̂_t = arg max_{λ_t} ∫_{θ_t} p(y_t | θ_t, Y^{t−1}) p(θ_t | Y^{t−1}) dθ_t    (3.16)
However, since this integral is not available in closed form, McCormick et al. (2012)
use a Laplace approximation:

f(y_t | Y^{t−1}) ≈ (2π)^{d/2} |{D²l(θ̂_t)}^{−1}|^{1/2} p(y_t | Y^{t−1}, θ̂_t) p(θ̂_t | Y^{t−1})    (3.17)
This approximation, according to Lewis and Raftery (1997), should be quite accurate. Instead
of evaluating a whole range of different λ_t's to maximize the expression in equation
(3.16), McCormick et al. (2012) use a simpler approach that only considers two
possible states: some forgetting (λ_t = c < 1) and no forgetting (λ_t = 1). Different parameters are allowed to have different forgetting factors, and hence it would
be computationally difficult to evaluate multiple λ's for models consisting of more than
just a few variables, because the combinatorics grow exponentially. In their experiments, they conclude that the results were not sensitive to the chosen constant. In
this thesis, both single and multiple λ's will be evaluated. In the case of multiple
λ's, the model will share a common forgetting factor.
Quite early on, it was empirically found that the forgetting schemes described above
encountered problems with temporally sparse covariates, and that the smaller the
λ, the bigger the trouble. In an attempt to remedy this issue, we propose a simple, yet intuitively reasonable, modification. The basic idea is that c, the constant
selected by the analyst, is - for each observation, and each attribute - scaled based
on an estimate of the local sparsity, such that, during periods of mostly zeros for a
particular covariate, λ is scaled towards 1:
λ_t^{(2)} = λ_t^{(1)} + (1 − λ_t^{(1)}) / ( 1 + (Σ_{i=t−w}^{t} x_i)³ / w )    (3.18)

where w is a constant to be selected by the user: it is the window upon which the
local sparsity is estimated. The summation in the denominator reflects the number
of non-zero occurrences in the past w observations. The more occurrences that are
observed, the larger the number that (1 − λ_t^{(1)}) is divided by, and consequently the
less λ_t^{(1)} is scaled.
For instance, consider a fictive scenario in which an analyst has selected c = 0.95
and w = 10, and for a particular covariate, at a particular time-point, 9 out of the
last 10 observations are zero for this attribute, i.e. sparse. Equation (3.18) would
have the effect of modifying λ_t^{(1)} = 0.95 to λ_t^{(2)} ≈ 0.995. If, at another time-point,
say 8 out of the 10 occurrences in w are non-zero values, λ_t^{(1)} = 0.95 is only changed
to λ_t^{(2)} ≈ 0.951. The effect of this modification is further analyzed in sec. 4.2.
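The scaling in equation (3.18) is straightforward to compute; a small sketch, assuming binary (0/1) covariates so that the window sum counts the non-zero occurrences:

```python
def scaled_lambda(lam, recent_x, w):
    """Sparsity-aware forgetting factor of eq. (3.18).

    lam      : the analyst-chosen constant c (lambda^(1))
    recent_x : the last w values of the covariate (assumed binary here,
               so their sum counts the non-zero occurrences)
    w        : window length
    """
    s = sum(recent_x)
    return lam + (1 - lam) / (1 + s ** 3 / w)
```

With c = 0.95 and w = 10, one non-zero value in the window yields roughly 0.995, while eight non-zero values leave the factor close to its original 0.95, as in the example above.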
Evolution of the odds-ratios
In McCormick et al. (2012); Koop and Korobilis (2012), two approaches were considered for studying the temporal significance of covariates and how the conditional
relationships change over time; one being through the evolution of odds-ratios for
specific covariates.
Just as in the static logistic regression, odds-ratios are obtained by exponentiating the logit coefficients. Odds-ratios may be interpreted as the effect of a one-unit
change in X on the predicted odds, with all other independent variables held constant
(Breaugh, 2003). An odds-ratio > 1.0 implies that a particular covariate potentially
has a positive effect, while an odds-ratio < 1.0 implies a potential negative effect.
The farther the odds ratio is from 1.0, the stronger the association. In (Haddock
et al., 1998), guidelines for interpreting the magnitude of an odds ratio are provided,
and in particular a rule of thumb which states that odds ratios close to 1.0 represent a ’weak relationship’, whereas odds ratios over 3.0 indicate ’strong (positive)
relationships’. In (McCormick et al., 2012), ±2 standard errors are computed, and
if the confidence interval doesn’t overlap 1.0, a covariate is concluded to have a
significant effect.
In this thesis, both of the aforementioned approaches are considered in the process
of reflecting upon temporal significance of covariates.
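The ±2 standard-error check on the odds-ratio scale can be sketched as follows (the function name is our own):

```python
import math

def odds_ratio_summary(theta, se):
    """Odds-ratio for a logit coefficient with a +/- 2 standard-error
    interval, flagged as significant when the interval excludes 1.0
    (as in McCormick et al., 2012)."""
    lo, hi = math.exp(theta - 2 * se), math.exp(theta + 2 * se)
    significant = not (lo <= 1.0 <= hi)
    return math.exp(theta), (lo, hi), significant
```

In a dynamic model this check is repeated at every time-point, producing a trajectory of odds-ratios and significance flags for each covariate.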
3.4.2. Dynamic Model Averaging
Dynamic model averaging (DMA), originally proposed by Raftery et al. (2010), is an
extension of Bayesian Model Averaging (BMA) that introduces the extra dimension
of time through state-space modeling. In this thesis, DMA is used together with
the dynamic logistic regression, as in McCormick et al. (2012). This combination
is attractive, considering the objectives of this work, in that the dynamic logistic
regression allows the marginal effects of the predictors to change over time, whilst
the dynamic model averaging allows for the set of predictors to change over time.
BMA, as presented in Hoeting et al. (1999), addresses the issue of model uncertainty by considering multiple models (M_1, ..., M_K) simultaneously, and computes
the posterior distribution of a quantity of interest, say θ, by averaging the posterior
distribution of θ under every considered model - weighting their respective contributions
by their posterior model probability (Hoeting et al., 1999), as in equation (3.19):
p(θ | X) = Σ_{k=1}^{K} p(θ | M_k, X) p(M_k | X)    (3.19)
The posterior model probability for model M_k can be written as follows:

p(M_k | X) = p(X | M_k) p(M_k) / Σ_{l=1}^{K} p(X | M_l) p(M_l)    (3.20)

where p(X | M_k) = ∫ p(X | θ_k, M_k) p(θ_k | M_k) dθ_k is the integrated likelihood of model
M_k, and θ_k is the vector of parameters of model M_k.
3.4.2.1. State-space representation
By introducing a state-space representation of the BMA, leading to DMA, the posterior model probabilities become dynamic, and are hence allowed to change over
time. Just as in regular BMA, one considers K candidate models {M1 , ..., MK }.
Considering the specific combination of DMA and dynamic logistic regression, we
re-define equation (3.10) as follows:

logit(p_t^{(k)}) = x_t^{(k)T} θ_t^{(k)}    (3.21)

Notice the superscript (k) that is present for both x_t^{(k)T} and θ_t^{(k)}, implying that candidate models may have different setups of covariates, and their parameter estimates
may also differ.
Estimation with DMA, following McCormick et al. (2012), is computed using the
same framework as in the (single-) dynamic logistic regression, e.g. the two steps
of prediction and updating. Different from the single-model case, however, is the
definition of the state space, which here consist of the pair (Lt , Θt ), where Lt is a
model indicator - such that if Lt = k, the process is governed by model Mk at time
(1)
(k)
(t), and Θt = {θt , ..., θt }. Recursive estimation is performed on the pair (Lt , Θt ):
K
X
(l)
p(θt |Lt = l, Y t−1 )p(Lt = l|Y t−1 )
(3.22)
l=1
Equation (3.22) may be compared to (3.19), which is the corresponding equation for
BMA. An important aspect of (3.22) is that θtk is only present conditionally when
Lt = l.
Before we consider the prediction and updating steps, it is worth noting that, as
in McCormick et al. (2012), a uniform prior is specified for the candidate models:
p(Lt = l) = 1/K.
Prediction step
We here consider the second term of equation (3.22), which is the prediction equation
of the model indicator L_t: in other words, the probability that the considered model
is the governing model at time (t), given data up to and including (t − 1). The prediction
equation is defined as follows:

P(L_t = k | Y^{t−1}) = Σ_{l=1}^{K} p(L_{t−1} = l | Y^{t−1}) p(L_t = k | L_{t−1} = l)    (3.23)
The term p(L_t = k | L_{t−1} = l) implies that a K × K transition matrix needs to be
specified. To avoid this, Raftery et al. (2010) redefine equation (3.23) and introduce
another forgetting factor, α_t:

P(L_t = k | Y^{t−1}) = P(L_{t−1} = k | Y^{t−1})^{α_t} / Σ_{l=1}^{K} P(L_{t−1} = l | Y^{t−1})^{α_t}    (3.24)

where α_t has the effect of flattening the distribution of L_t, and hence increases the
uncertainty. Just as with λ_t, α_t is adjusted over time using the predictive likelihood
(but here across candidate models).
Updating step
The (model-) updating step is defined through equation (3.25):

P(L_t = k | Y^t) = ω_t^{(k)} / Σ_{l=1}^{K} ω_t^{(l)}    (3.25)

where

ω_t^{(l)} = P(L_t = l | Y^{t−1}) f^{(l)}(y_t | Y^{t−1})    (3.26)
Notice that the first term on the right side of equation (3.26) is the prediction
equation and the second term is the predictive likelihood for model (l). An important
feature here is that this latter term (the predictive likelihood) has already been
calculated (recall that it was used to determine the model-specific forgetting factor
λt ).
Just as λ_t is allowed to take different values at different time-points, so is the forgetting
factor for the model indicator, α_t. To determine which α_t to use at
time t, McCormick et al. (2012) suggest maximizing:

arg max_{α_t} Σ_{k=1}^{K} f^{(k)}(y_t | Y^{t−1}) P(L_t = k | Y^{t−1})    (3.27)
That is, maximizing the predictive likelihood across the candidate models. The first
term in equation (3.27) is the model-specific predictive likelihood (which has already
been computed), and the second term is (3.24). As such, this adds minimal additional computation. In practice, McCormick et al. (2012) take the approach
of evaluating two α values at each time-point: {some forgetting, no forgetting}.
Finally, to predict y_t at time t, equation (3.28) is applied:

ŷ_t^{DMA} = Σ_{l=1}^{K} P(L_t = l | Y^{t−1}) ŷ_t^{(l)}    (3.28)

where ŷ_t^{(l)} is the predicted response for model l at time t. That is, to form the
DMA prediction, each candidate model’s individual prediction is weighted by its
posterior model probability.
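The model-probability recursion (equations 3.24 to 3.26) and the averaged prediction (3.28) can be sketched as follows, assuming the per-model predictive likelihoods have already been computed; the function names are our own:

```python
import numpy as np

def dma_weights(prev_post, pred_lik, alpha=0.99):
    """One DMA cycle for the model probabilities, eqs. (3.24)-(3.26).

    prev_post : P(L_{t-1} = k | Y^{t-1}) for the K candidate models
    pred_lik  : f^(k)(y_t | Y^{t-1}), each model's predictive likelihood
    """
    prev_post = np.asarray(prev_post, dtype=float)
    pred_lik = np.asarray(pred_lik, dtype=float)
    # prediction step, eq. (3.24): flatten with the forgetting factor alpha
    pred = prev_post ** alpha
    pred /= pred.sum()
    # updating step, eqs. (3.25)-(3.26)
    w = pred * pred_lik
    return w / w.sum()

def dma_predict(model_preds, model_probs):
    """Eq. (3.28): probability-weighted average of model predictions."""
    return float(np.dot(model_probs, model_preds))
```

With alpha below 1 the prediction step pulls the model probabilities toward uniform, so that a model which fit poorly in the past can regain weight quickly once its predictive likelihood improves.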
Evolution of the inclusion probabilities
The second approach considered by McCormick et al. (2012); Koop and Korobilis
(2012) for the purpose of studying the temporal significance of covariates is that
which is centered around posterior inclusion probabilities. These are derived by
summing the posterior model probabilities of those models that include a particular
variable at a particular time. To do so, all 2^p combinations of the input variables
first need to be computed so as to construct 2^p candidate models - where p is the number of
predictors. More formally, the posterior inclusion probability for variable i at time
t is (Barbieri and Berger, 2004):

p_{i,t} ≡ Σ_{l: l_i = 1} P(M_l | Y^t)    (3.29)

where l_i = 1 indicates that variable i is included in model M_l.
This approach is feasible in both McCormick et al. (2012) and Koop and Korobilis
(2012) since the number of covariates is - in comparison to this thesis - relatively
small, and the time-series are also relatively short. For this thesis,
the ideal would have been to set up candidate models representing all possible
combinations of variables, but since the number of covariates is quite large (> 100),
and the time-series is long, that is not computationally feasible. Consequently, we do not consider all possible combinations of all covariates, but rather use
the “interesting variable groups” (as defined in sec. 4.2), and consider all the possible
combinations of these. Although limiting, this approach is reasonable since many of
the covariates have quite a clear group structure, and the motivation for exploring
this approach is that it may give some (high-level) insights into what variable groups
are important at different time-points.
The univariate scanner
An additional approach considered in this thesis for exploring temporal significance
of covariates is one in which the candidate models are constructed to be univariate.
This approach is explored because (i) it allows for covariate-specific updating of the
forgetting factor in a computationally feasible way, and (ii) it avoids eventual issues
of multicollinearity that the first approach of McCormick et al. (2012) may suffer
from.
To determine the significance of a particular variable at a particular time, the odds-ratios may be interpreted as described in the last section, or through the posterior
model probabilities (> 0.5), as recommended by Barbieri and Berger (2004).
3.4.3. Dynamic Trees
Dynamic trees, first proposed by Taddy et al. (2011), are an extension of the popular
non-parametric technique of partition trees. This thesis follows the particular version
developed by Anagnostopoulos and Gramacy (2012), which extends the former by
introducing a retiring scheme that allows the model complexity of the tree to change
in accordance with local structures of the data, rather than increase monotonically
over time. We first outline some basic concepts of partition trees and
relevant notation, and then introduce the dynamic extension.
3.4.3.1. Static partition trees
The basic idea of (static) partition trees is to hierarchically partition a given input space X into hyper-rectangles (leaves), by applying nested logical rules. The
standard approach is to use binary recursive partitioning.
A tree, here denoted by T, consists of a set of hierarchically ordered nodes η ∈ T,
each of which is associated with a subset of the input covariates x^t = {x_s}_{s=1}^{t}. These
subsets are the result of a series of splitting rules.
Considering the tree structure in a bit more detail, one may differentiate between
different types of nodes: (i) at the top of every tree one finds the root node, R_T,
which includes all of x^t; (ii) using binary splitting rules, a node η may be split into
two new nodes placed lower in the hierarchy; these are referred to as η’s child
nodes, or more specifically η’s left and right children, C_l(η) and C_r(η) respectively,
and are disjoint subsets of η such that C_l(η) ∪ C_r(η) = η; (iii) the parent node, P(η),
on the other hand, is placed above η in the hierarchy, and contains both η and its
sibling node S(η), such that P (η) = η ∪ S(η). A node that has children is defined
as an internal node, whilst nodes that do not are referred to as leaf nodes. The sets
of internal nodes and leaf nodes in T are denoted by IT and LT respectively.
At every leaf node, a decision rule is deployed, parametrized by θ_η. Independence across tree partitions leads to the likelihood p(y^t | x^t, T, θ) = Π_{η ∈ L_T} p(y^η | x^η, θ_η),
where [x^η, y^η] is the subset of data allocated to η. This way of considering the
leaf nodes is often referred to as a Bayesian treed model in the literature. Whilst
flexible, this approach poses challenges in terms of selecting a suitable tree structure. To address this problem, Chipman et al. (1998) designed a prior distribution,
π(T) (often referred to as the CGM tree prior), over the range of possible partition
structures, allowing for a Bayesian approach with inference via the posterior:
p(T | [x, y]^t) ∝ p(y^t | T, x^t) π(T), where [x, y]^t is the complete data set. The CGM
prior specifies a tree probability by placing a prior on each partition rule:
π(T) ∝ Π_{η ∈ I_T} p_split(T, η) × Π_{η ∈ L_T} [1 − p_split(T, η)]    (3.30)

where p_split(T, η) = α(1 + D_η)^{−β} is the depth-dependent split probability (α, β > 0
and D_η = the depth of η in the tree). Equation (3.30) implies that the tree prior is
the probability that internal nodes have split and leaves have not. In Chipman
et al. (1998), a Metropolis-Hastings MCMC approach is developed for sampling
from the posterior distribution of partition trees. Specifically, stochastic modifications of T, referred to as “moves” (grow, prune, change, and swap), are proposed
incrementally, and accepted according to the Metropolis-Hastings ratio. It is upon
this framework that Taddy et al. (2011) base their dynamic extension.
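The CGM tree prior of equation (3.30) is simple to evaluate given the node depths; a small sketch with illustrative (not the thesis') values for α and β:

```python
import math

def p_split(depth, a=0.95, b=2.0):
    """Depth-dependent split probability a(1 + depth)^(-b) of the CGM prior;
    a and b here are illustrative defaults, not the thesis' settings."""
    return a * (1.0 + depth) ** (-b)

def log_tree_prior(internal_depths, leaf_depths, a=0.95, b=2.0):
    """Log of eq. (3.30): internal nodes have split, leaves have not."""
    lp = sum(math.log(p_split(d, a, b)) for d in internal_depths)
    lp += sum(math.log(1.0 - p_split(d, a, b)) for d in leaf_depths)
    return lp
```

Because p_split decays with depth, deep trees are penalized, which keeps the prior mass on parsimonious partition structures.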
3.4.3.2. Dynamic Trees
The extension from static partition trees (or more specifically, Bayesian static treed
models) to dynamic trees is the result of defining the tree as a latent state which is
allowed to evolve according to a state transition probability P(T_t | T_{t−1}, x_t), referred
to as the evolution equation, where T_{t−1} represents the set of recursive partitioning
rules observed up to time t − 1. A key insight here is that the transition probability
depends on x_t, which implies that only moves (grow, prune, etc.) that
are local to the current observation (i.e. the leaf η(x_t)) are considered. This makes
the approach computationally feasible. Following Anagnostopoulos and Gramacy
(2012), we let:

P(T_t | T_{t−1}, x_t) =
  0,             if T_t is not reachable from T_{t−1} via moves local to x_t
  p_m π(T_t),    otherwise
                                                                      (3.31)
where p_m is the probability of a particular move, and π(T_t) is the tree prior. The
moves considered in this sequential approach are {grow, prune, stay}.
Taddy et al. (2011) argue that the exclusion of the change and swap moves allows considerably more efficient processing. The three considered moves are equally
probable, and are defined as follows:
• Stay: The tree remains the same: T_t = T_{t−1}.
• Prune: The tree is pruned such that η(x_t) and all of the nodes below it in the
hierarchy are removed, including η(x_t)’s sibling node S(η(x_t)). This implies
that η(x_t)’s parent node P(η(x_t)) becomes a leaf node after the prune.
• Grow: A new partition is created within the hyper-rectangle defined for η(x_t).
More specifically, this move first uniformly chooses a split dimension (covariate
dimension) j and a split point x_j^{grow}. Then the observations of η(x_t) are divided
according to the defined split rule.
3.4.3.2.1 Prediction and the Leaf Classification Model
For posterior inference with dynamic trees, two quantities are imperative: (i) the
marginal likelihood for a given tree, and (ii) the posterior predictive distribution for
new data.
The marginal likelihood is obtained by marginalizing over the regression model parameters, which in this case are the leaves η ∈ L_T, each parametrized by θ_η ∼ π(θ):

p(y^t | T_t, x^t) = Π_{η ∈ L_{T_t}} p(y^η | x^η) = Π_{η ∈ L_{T_t}} ∫ p(y^η | x^η, θ_η) dπ(θ_η)    (3.32)
That is, by conditioning on a given tree, the marginal likelihood is simply the product of
independent leaf likelihoods. Combining (3.32) with the prior described earlier, we
obtain the posterior p(T_t | [x, y]^t, T_{t−1}). Consider next the predictive distribution
for y_{t+1}, given x_{t+1}, T_t, and data [x, y]^t:

p(y_{t+1} | x_{t+1}, T_t, [x, y]^t) = p(y_{t+1} | x_{t+1}, T_t, [x, y]^{η(x_{t+1})})
                                 = ∫ p(y_{t+1} | x_{t+1}, θ) dP(θ | [x, y]^{η(x_{t+1})})    (3.33)
Notice that in the second step of the derivation, [x, y]^t is re-written as [x, y]^{η(x_{t+1})}:
this is because we only consider the leaf partition which contains x_{t+1}. The
second term in (3.33), dP(θ | [x, y]^{η(x_{t+1})}), is the posterior distribution over the leaf
parameters (classification rules), given the data in η(x_{t+1}). As such, the predictive
distribution is simply the classification function at the leaf containing x_{t+1}, integrated
over the conditional posterior for the leaf parameters.
The model defined at each of the leaves may be linear, constant or multinomial.
Since the response variable in this work is binary, the approach of binomial leaves is
applied. As such, each leaf response y_s^η is equal to one of 2 alternative factors. The
set of outcomes for a particular leaf is summarized by a count vector z_η = [z_{1η}, z_{2η}]′,
such that the total count for each class is z_{cη} = Σ_{s=1}^{|η|} 1(y_s^η = c). Following Taddy
et al. (2011), we then model the summary counts for each leaf as follows:

z_η ∼ Bin(p_η, |η|)    (3.34)

where Bin(p, n) is a binomial with expected count n p_c for each category. A Dirichlet Dir(1_C / C) prior is assumed for each leaf probability vector, and as such, the
posterior information about p_η is given by:

p̂_η = (z_η + 1/C) / (|η| + 1) = (z_η + 1/2) / (|η| + 1)
The marginal likelihood for leaf node η is then defined by equation (3.35):

p(y^\eta \mid x^\eta) = p(z_\eta) = \prod_{c=1}^{C} \frac{\Gamma(z_c^\eta + 1/C)}{z_c^\eta!\, \Gamma(1/C)} = \prod_{c=1}^{2} \frac{\Gamma(z_c^\eta + 1/2)}{z_c^\eta!\, \Gamma(1/2)} \qquad (3.35)
Finally, the predictive response probability for leaf node η with covariates x is:

p(y = c \mid x, \eta, [x, y]^\eta) = p(y = c \mid z_\eta) = \hat{p}_c^\eta \quad \text{for } c = 1, 2 \qquad (3.36)
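The binomial-leaf quantities in (3.35) and (3.36) can be sketched in a few lines. This is a minimal illustration assuming C = 2 and the Dir(1/2, 1/2) prior from the text; the function names are our own:

```python
from math import gamma, factorial

def leaf_marginal(z):
    """Leaf marginal likelihood p(z_eta) for C = 2 classes under a
    Dirichlet(1/2, 1/2) prior, following eq. (3.35)."""
    out = 1.0
    for zc in z:
        out *= gamma(zc + 0.5) / (factorial(zc) * gamma(0.5))
    return out

def leaf_predictive(z):
    """Posterior predictive class probabilities
    p_hat = (z_c + 1/2) / (|eta| + 1), following eq. (3.36)."""
    n = sum(z)
    return [(zc + 0.5) / (n + 1.0) for zc in z]
```

For a leaf holding three dropped and one normal call, `leaf_predictive([3, 1])` gives [0.7, 0.3], the shrunken class proportions that feed the SMC weights below.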
3.4.3.2.2 Particle Learning for Posterior Simulation
As in the static version of Chipman et al. (1998), a sampling scheme is applied to
approximate the posterior distribution of the tree. More specifically, Taddy et al.
(2011) use a Sequential Monte Carlo (SMC) approach: at time t − 1, the posterior
distribution over the trees is characterized by N equally weighted particles, each
of which includes a tree T_{t−1}^{(i)} as well as sufficient statistics S_{t−1}^{(i)} for each of its leaf
classification models. This tree posterior, {T_{t−1}^{(i)}}_{i=1}^{N}, is updated to {T_t^{(i)}}_{i=1}^{N} through
a two-step procedure of (i) resampling and (ii) propagating. In the first step, particles are resampled, with replacement, according to their predictive probability for
the next (x, y) pair: w_i = p(y_t | T_{t−1}^{(i)}, x_t). In the second step, each tree particle is
updated by proposing local changes, T_{t−1}^{(i)} → T_t^{(i)}, via the moves {stay, prune,
grow}, resulting in three candidate trees: {T^{stay}, T^{prune}, T^{grow}}. As the candidate trees are equivalent above the parent node of the leaf containing x_t, P(η(x_t)), one only needs to
calculate the posterior probabilities for the subtrees rooted at this particular node.
Denoting subtrees by T_t^{move}, the new T_t is sampled with probabilities proportional
to π(T_t^{move}) p(y^t | x^t, T_t^{move}), where the first term, the prior, is equal to (3.31) and the
second term, the likelihood, is (3.32) with leaf marginal (3.35). As noted in Taddy
et al. (2011) and Anagnostopoulos and Gramacy (2012), this sequential filtering
approach enables the model to inherit a natural division of labor that mimics the
behavior of an ensemble method - without explicitly maintaining one.
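The resample-propagate cycle above can be sketched schematically as follows. This is a simplified illustration of the two SMC steps, not the actual dynaTree implementation; the `score` argument is a hypothetical stand-in for the product of prior (3.31) and marginal likelihood (3.32):

```python
import random

def resample(particles, weights):
    """Step (i): resample particles with replacement, with probability
    proportional to the predictive weight w_i = p(y_t | T^(i)_{t-1}, x_t)."""
    total = sum(weights)
    return random.choices(particles,
                          weights=[w / total for w in weights],
                          k=len(particles))

def propagate(tree, score):
    """Step (ii): propose the local moves and sample the new subtree with
    probability proportional to prior x marginal likelihood; score(tree, move)
    stands in for pi(T^move) p(y^t | x^t, T^move)."""
    moves = ["stay", "prune", "grow"]
    weights = [score(tree, m) for m in moves]
    return random.choices(moves, weights=weights, k=1)[0]
```

Because only the subtree rooted at P(η(x_t)) differs between the candidates, the scoring in `propagate` is local and cheap relative to rescoring the whole tree.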
3.4.3.2.3 Data retirement
What has been considered so far for this method is the original approach developed
by Taddy et al. (2011). Whilst sequential, this approach is not strictly online,
because the tree moves may require access to the full data history. Furthermore, the
complexity of the original dynamic trees model grows with log t, and in terms of
classification in non-stationary environments this is not ideal, as we suspect that
the data-generating mechanism may change over time. In Anagnostopoulos and
Gramacy (2012), an extension is proposed where data is sequentially discarded and
down-weighted. Specifically, an approach referred to as data point retirement is
developed, where only a constant number, w, of observations are active in the tree
(referred to as the 'active data pool'). Whilst data points are sequentially discarded,
they are still 'remembered' in this approach. This is achieved by retaining the
discarded information in the form of informative leaf priors.
More specifically, suppose we have a single leaf η ∈ T_t, for which we have already
discarded some data, (x_s, y_s)_{\{s\}}, that was in η at some time t′ ≤ t in the past.
Anagnostopoulos and Gramacy (2012) suggest that this information can be "remembered" by taking the leaf-specific prior, π(θ_η), to be the posterior of θ_η, given
only the retired data. If we generalize this to trees of more than one leaf, we may
take:

\pi(\theta) \,\stackrel{\text{df}}{=}\, P(\theta \mid (x_s, y_s)_{\{s\}}) \propto L(\theta; (x_s, y_s)_{\{s\}})\, \pi_0(\theta) \qquad (3.37)
where π_0(θ) is a baseline non-informative prior common to all of the leaves. Following Anagnostopoulos and Gramacy (2012), we update the retired information through the
recursive updating equation:
\pi^{(\text{new})}(\theta) \,\stackrel{\text{df}}{=}\, P(\theta \mid (x_s, y_s)_{\{s\}, r}) \propto L(\theta; x_r, y_r)\, P(\theta \mid (x_s, y_s)_{\{s\}}) \qquad (3.38)
where (x_r, y_r) is the new data point that is retired. Anagnostopoulos and Gramacy
(2012) show that equation (3.38) is tractable whenever conjugate priors are employed.
In our case, with the binomial model, the discarded response values y_s are represented as indicator vectors z_s, where z_{js} = 1(y_s = j). The natural conjugate is
the Dirichlet D(a), where a is a hyperparameter vector that may be interpreted as
counts. It is updated through a^{(new)} = a + z_r, where z_{jr} = 1(y_r = j). Anagnostopoulos and Gramacy (2012) show that through this approach, retirement preserves
the posterior distribution, and as such, the posterior predictive distributions and
marginal likelihoods required for the SMC updates are also unchanged.
A dynamic tree with retirement manages two types of information: (i) a non-parametric
memory in the form of an active data pool of (constant) size w < t, as well as (ii) a parametric
memory of possibly informative leaf priors. The algorithm proposed by Anagnostopoulos and Gramacy (2012) may be summarized by the following steps:
1. At time t, add the t-th data point to the active data pool.
2. Update the model through the Sequential Monte Carlo scheme described in
3.4.3.2.
3. If t exceeds w, select some data point (x_r, y_r) and remove it from the active
data pool. Before doing so, update the associated leaf prior for η(x_r)^{(i)} for
each particle i = 1, ..., N, so as to 'remember' the information present in (x_r, y_r).
More details are found in Anagnostopoulos and Gramacy (2012).
3.4.3.2.4 Temporal adaptivity using forgetting factors
To address the possibility of a changing data generating mechanism in a streaming
context, Anagnostopoulos and Gramacy (2012) further introduced a modification
of the retiring scheme described in the previous section. Specifically, retired data
history, s, is exponentially down-weighted when a new point y_m arrives:

\pi_\lambda^{(\text{new})}(\theta) \propto L(\theta; y_m)\, L^\lambda(\theta; (y_s, x_s)_{\{s\}})\, \pi_0(\theta) \qquad (3.39)

where λ is a forgetting factor. At the two extremes, when λ = 1, the standard
conjugate Bayesian updating is applied, as in the previous section, and when λ = 0,
the retired history is disregarded completely. A λ in between these two extremes has
the effect of placing more weight on recently retired data points. More specifically,
in the context of the binomial model, the conjugate update is modified from a^{(new)} =
a + z_r to a^{(new)} = λa + z_m.
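The conjugate retirement update, including the forgetting variant just described, reduces to one line of arithmetic on the pseudo-count vector. A minimal sketch (the function name is our own; λ = 1 recovers the standard conjugate update of the previous section):

```python
def retire(a, z_r, lam=1.0):
    """Update the Dirichlet hyperparameter (pseudo-count) vector a when a
    data point with class-indicator vector z_r is retired:
    a_new = lam * a + z_r.  lam = 1 gives standard conjugate updating;
    lam < 1 exponentially down-weights previously retired history."""
    return [lam * ac + zc for ac, zc in zip(a, z_r)]
```

For example, starting from the baseline Dir(1/2, 1/2) prior, retiring one dropped call (z_r = [0, 1]) with λ = 1 simply adds a pseudo-count to the drop class.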
In the algorithm described in 3.4.3.2, one of the steps noted “select some data point
(xr , yr ) and remove it”. We may here specify that, in the context of this thesis, and
following Anagnostopoulos and Gramacy (2012), this data point is the oldest data
point in the active data pool.
3.4.3.2.5 Variable Importance
To measure the importance of predictors for dynamic trees, where the response
variable is discrete, Gramacy et al. (2013) proposed the use of predictive entropy
based on the posterior predictive probability (p̂) of each class c in node η. This leads
to the entropy reduction:
\Delta(\eta) = n_\eta H_\eta - n_l H_l - n_r H_r \qquad (3.40)

where H_\eta = -\sum_c \hat{p}_c \log \hat{p}_c and n_η is the number of data points in η. The second and
third terms on the right-hand side of equation (3.40) describe the entropy for node
η's left and right children respectively. In Gramacy et al. (2013), however, variable
importance is not considered in an online setting: each covariate's predictive entropy
is calculated based on results from the full dataset. In this thesis, we are interested
in the temporal variable importance, and as such, we instead consider the mean
entropy reduction for a particular covariate at each time point, by averaging over
the N particles. This allows us to display the variable importance as a time-series;
a simple and intuitive way to study its relative importance over time.
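The entropy-reduction computation in (3.40) amounts to a few lines; a minimal sketch, where the probability vectors are the posterior predictive class probabilities p̂ of the parent and child nodes:

```python
from math import log

def entropy(p):
    """Predictive entropy H = -sum_c p_c log p_c
    (terms with p_c = 0 contribute zero)."""
    return -sum(pc * log(pc) for pc in p if pc > 0)

def entropy_reduction(n_parent, p_parent, n_left, p_left, n_right, p_right):
    """Entropy reduction of a split, eq. (3.40):
    n_eta * H_eta - n_l * H_l - n_r * H_r."""
    return (n_parent * entropy(p_parent)
            - n_left * entropy(p_left)
            - n_right * entropy(p_right))
```

A split that separates a 50/50 parent into two pure children attains the maximal reduction n_η log 2, which is the kind of signal the per-particle averages pick up over time.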
3.5. Drop description
3.5.1. Association Rule Mining
From the analysis of sec. 3.4, one may gain insights about which variables are relevant
at different time-points. As an additional layer to this analysis, we
further consider the application of association rule mining, originally proposed by
Agrawal et al. (1993), with the objective of obtaining intuitive descriptions that are
easy to interpret for domain experts. This approach is convenient since the data
has been formatted such that it consists of binary variables. Knowing which
variables are interesting at different time-points (inherited from sec. 3.4), and hence
which variables to consider when deriving association rules at those time-points, has
the positive effect of reducing the search space that needs to be explored to obtain
association rules.
Specifically, the Apriori algorithm is used to generate the association rules. The
Apriori algorithm is designed to operate on transaction databases, and hence the
first step consists of transforming the original data into a transaction database
format. Following the transformation, the data consist of a set of transactions, where
each transaction (T) is a set of items (I), and is identified by its unique TID (=
transaction identifier). An association rule is an implication of the form X → Y,
where X ⊆ I, Y ⊆ I, and X ∩ Y = ∅. The Apriori algorithm works in a bottom-up
fashion: it first identifies frequent individual items in the database and then
extends them into larger item sets, as long as those item sets appear sufficiently
often in the database (Agrawal et al., 1993).
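The bottom-up search just described can be sketched as follows. This is a brute-force illustration of the level-wise idea, not an optimized Apriori implementation (it omits the subset-based candidate pruning and hash structures of the real algorithm, as used in the arules package):

```python
from itertools import combinations

def frequent_itemsets(transactions, minsup):
    """Level-wise, bottom-up search: keep the 1-itemsets meeting minsup, then
    repeatedly extend surviving k-itemsets by one item, discarding any
    candidate whose support falls below minsup."""
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions containing every item of `itemset`.
        return sum(1 for t in transactions if itemset <= t) / n

    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items if support(frozenset([i])) >= minsup]
    frequent = list(level)
    while level:
        # Join step: merge pairs of k-itemsets into (k+1)-itemsets.
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == len(a) + 1}
        level = [c for c in candidates if support(c) >= minsup]
        frequent += level
    return frequent
```

The Apriori property (every subset of a frequent itemset is itself frequent) guarantees that this level-wise pruning misses no frequent set.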
Given a set of transactions, the problem of mining association rules is to generate
all association rules that have support and confidence greater than the user-specified
minimum support (minsup) and minimum confidence (minconf) respectively. Support is simply the number of transactions in the dataset that contain
the association rule's items, divided by the total number of transactions, whilst confidence measures the fraction of the transactions containing one item set (say X)
that also contain another (say Y). More formally, the support for association
rule X → Y is defined by equation (3.41):
\text{Support}(X \rightarrow Y) = \frac{\text{count}(X \cup Y)}{N} \qquad (3.41)
where N is the number of transactions (observations). The confidence for association
rule X → Y is obtained through equation (3.42):
\text{Confidence}(X \rightarrow Y) = \frac{\text{count}(X \cup Y)}{\text{count}(X)} \qquad (3.42)
In this thesis we are interested in those association rules which have {Drop = 1} in
the right-hand-side of the association rule, and hence we introduce such a constraint
into the process - in addition to minsup and minconf.
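Equations (3.41) and (3.42), together with the {Drop = 1} right-hand-side constraint, can be sketched directly on a transaction list. A minimal illustration; the item label "Drop=1" is our own encoding of the constrained consequent:

```python
def support(itemset, transactions):
    """Support, eq. (3.41): fraction of transactions containing all items."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(lhs, rhs, transactions):
    """Confidence, eq. (3.42): support(X u Y) / support(X)."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)

def drop_rules(candidates, transactions, minsup, minconf):
    """Keep only rules X -> {Drop=1} meeting the minsup/minconf thresholds."""
    rhs = frozenset(["Drop=1"])
    return [(lhs, rhs) for lhs in candidates
            if support(lhs | rhs, transactions) >= minsup
            and confidence(lhs, rhs, transactions) >= minconf]
```

Fixing the consequent to {Drop=1} means only the left-hand-side item sets need to be enumerated, which further shrinks the search space on top of the variable pre-selection from sec. 3.4.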
3.6. Technical aspects
For the purpose of data cleaning, pre-processing, and sampling, the Python programming language was used. The analysis part of this thesis was carried out using
the R programming language. For the dynamic logistic regression and dynamic
model averaging, code was first extracted from the dma library - and on the basis of
this code, various extensions and modifications were implemented. For the dynamic
trees model, the dynaTree package was used. Finally, association rule mining relied
on the arules package.
4. Results
This section presents the results of this thesis and is divided into three parts: the
first contains a brief exploratory analysis of the data. In the second part, the
task of deriving an online classifier with high predictive capability is tackled. In
the third and final part, the temporal significance of covariates is explored, and
interesting periods are analyzed in more detail - with the objective
of identifying potential causes for drops in those particular periods.
4.1. Exploratory analysis
In Figure 4.1 the number of dropped calls over the relevant period is displayed.
Figure 4.1.: Number of dropped calls over the period January 26 - April 11 for
STP9596, as divided by 100 equally large time-ordered subsets.
As one may observe, there are at least ~5 time-periods in which the call drop rate
increases considerably. Upon exploring the temporal significance of covariates in
sec. 4.3, one of these periods will receive special attention.
Worth noting is that no effort has been made to account for periodicity; this
is because the data originates from programmed systems that do not have any
periodicity-dependencies. Even if there had been any, the sequential Bayesian
framework would naturally have incorporated that aspect by updating
the parameters accordingly.
Initially, 188 covariates were extracted. Having considered the aspects of redundancy
and multicollinearity, several covariates could be removed. For instance, many signal
types have both a 'request' and a 'response' signal, and hence the two almost always occur
together. In such cases, one of them was removed. The resulting dataset of 122
covariates is one in which the degree of multicollinearity is low, as one may observe
from the heatmap plot of the correlation matrix in Figure 4.2:
Figure 4.2.: Heatmap of correlation matrix
To demonstrate the concept of temporal significance, let us consider two of the 122
covariates, starting with PS. As mentioned in sec. 2.2.1, this covariate describes the
"type of radio connection that a particular UE has", or more specifically, that it
has a data connection activated. In Figure 4.3, the percentage of such calls that
terminate unexpectedly is displayed, as divided by four equally sized (ordered) time
periods. Considered as a univariate classifier, the red bars in this plot represent
the true positive rate at four different time periods.
It can be observed that the proportion of calls with PS that drop is not constant
over the considered time period. In the first time period, the percentage of normal
outcomes outweighs the dropped ones. This changes quite drastically in the second
period, where more than 75% of the calls with PS terminate unexpectedly. In
period three, the percentage of dropped calls still outweighs that of normal
ones considerably. In the fourth and final period, the proportions are almost equal.
Next, we consider one of the GCP covariates, more specifically the GCP combination
“000011000000011000011011 ”. In Figure 4.4, the proportion of calls that has this
Figure 4.3.: Percentage of PS that terminates unexpectedly, as divided by 4
equally sized time periods
specific combination of generic connection properties and terminates unexpectedly
is displayed - again, as divided by four equally sized time periods.
Figure 4.4.: Percentage of calls with GCP=000011000000011000011011 that terminates unexpectedly, as divided by 4 equally sized time periods
One may observe that the proportion of calls having this particular GCP that drop
changes over time. Specifically, during the first quarter of the time-series, close to 70%
of the calls that attain this GCP terminate unexpectedly. For the following three
periods, however, this relationship shifts, such that the calls attaining this property
instead tend to correlate with normal calls.
4.2. Online classification
In this section, the sampling and classification techniques described in chapter 3
are evaluated so as to derive a model that can discriminate between dropped calls and
normal calls with high precision. To conclude this section, the best online classifier
is compared to its static equivalent in a few fictive scenarios.
4.2.1. Sampling strategies
As previously mentioned, the number of normal calls far outweighs the number of
dropped calls. This part of the results is concerned with studying the effects of
this imbalance on the capability of the classifiers. The first question one
may reasonably pose is whether sampling is needed at all. If so, what sampling technique
and what sampling rate are suitable? To answer these questions, the online random
undersampling (ORUS) technique, as well as the proposed extension, adaptive online
random undersampling (A-ORUS), both described in sec. 3.2, are evaluated using the same evaluation metric as in Nguyen et al. (2011) and Wang et al. (2013): the
geometric mean.
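The geometric mean balances accuracy on both classes, which is why it is preferred over plain accuracy under heavy imbalance. A minimal sketch from confusion-matrix counts (the function name is our own):

```python
from math import sqrt

def g_mean(tp, fn, tn, fp):
    """Geometric mean of the true positive rate (sensitivity) and the true
    negative rate (specificity): sqrt(TPR * TNR).  A trivial classifier that
    always predicts the majority class scores 0, however high its accuracy."""
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    return sqrt(tpr * tnr)
```

On data with ~1% drops, predicting "normal" for every call yields ~99% accuracy but a G-mean of exactly 0, since the TPR factor is zero.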
Datasets with different rates of imbalance were created via these sampling techniques,
and then the dynamic logistic regression model and the dynamic trees model (with
fixed parameter settings) were applied to these datasets - i.e. holding everything
except the sampling size constant. For ORUS, the considered imbalance rates are: (i)
10%/90%, (ii) 30%/70%, and (iii) 50%/50%. For A-ORUS, the considered imbalance
rate is 50%/50%. The original imbalance rate of 1.2%/98.8% is also considered.
Let us first evaluate the results of the dynamic logistic regression. In Table 4.1, the
results for this evaluation are presented.
Sampling Strategy   G-mean
ORIG 1/99           0.487
ORUS 10/90          0.810
ORUS 30/70          0.875
ORUS 50/50          0.913
A-ORUS              0.890

Table 4.1.: Evaluation of Sampling strategies using Dynamic Logistic Regression
It can be seen that the original imbalance rate (∼1% dropped calls) has resulted
in a G-mean score that is considerably worse than the other four; an indication that sampling may be justified. A general tendency one may observe is
that, as the undersampling rate increases for ORUS - and the distribution over the
classes becomes more uniform - the G-mean score increases, reflecting the increased
capability of the model to predict positive instances (dropped calls) correctly. Considering the proposed adaptive technique, it can be observed that it does not affect
the G-mean as positively as the 50/50 sampling rate of ORUS. Let us next consider
the corresponding results for the dynamic trees.
Sampling Strategy   TPR    TNR    G-mean
ORIG 1/99           0.370  0.994  0.545
ORUS 10/90          0.610  0.984  0.719
ORUS 30/70          0.718  0.946  0.789
ORUS 50/50          0.779  0.876  0.827
A-ORUS              0.743  0.911  0.798

Table 4.2.: Evaluation of Sampling strategies using Dynamic Trees
The results in Table 4.2 align well with those of Table 4.1, in that the G-mean
score steadily increases as the distribution between the classes becomes more even.
To confirm the conclusions based on the G-mean, one may further consider the
TPR and TNR values, and in particular the general trend that the TPR increases as
the undersampling rate increases. This, however, comes at the cost of reductions in
the TNR. Even so, the overall performance of the classifier is improved. Since what
is of particular interest in this thesis is to discriminate dropped calls from normal
calls, i.e. the positive cases, the G-mean and TPR are of particular importance.
Considering the proposed adaptive technique, it can again be observed that it does
not affect the G-mean or TPR as positively as the 50/50 sampling rate for ORUS.
Based on the results presented in Tables 4.1 and 4.2, the decision was made to use
the data resulting from 50/50 ORUS (i.e. the dataset with a 50%/50% distribution
between the classes) for the remainder of the analysis.
4.2.2. Dynamic Trees
The first online classification technique to be considered is the dynamic trees. As
previously described, the tree prior (affecting the split probability) is specified by
two parameters: α and β. A sensitivity analysis of these was performed, and it was
found that the results were only marginally affected by their specification. Based on
this analysis, the parameters were set to α = 0.99 and β = 2. These settings align
well with what is usually applied in the literature. Tables displaying the sensitivity
analysis are found in Table A.1 and A.2 in the Appendix.
In addition to the tree prior, there is also the forgetting factor (λ), the active data
pool size (w), and the number of particles (N). The latter was set in accordance with
the literature: N = 1000. For the former two, an empirical evaluation was performed
so as to derive the best DT. Let us first consider the forgetting factor λ, holding w
constant. In Table 4.3, the result of this evaluation is presented:
Lambda  w     TPR    TNR    AUC    G-mean
1.00    1000  0.747  0.858  0.869  0.799
0.99    1000  0.839  0.841  0.921  0.850
0.95    1000  0.831  0.871  0.926  0.856
0.90    1000  0.850  0.875  0.933  0.864
0.85    1000  0.823  0.877  0.927  0.857
0.80    1000  0.839  0.889  0.934  0.864
0.70    1000  0.802  0.900  0.924  0.849
0.60    1000  0.819  0.882  0.927  0.855
0.50    1000  0.813  0.894  0.927  0.852

Table 4.3.: Evaluation of forgetting factors for the Dynamic Trees

It can be observed that, between λ = 1 and λ = 0.90, the prediction capability
- as measured by AUC and G-mean - improves. At λ = 0.80, the
best score is obtained. Lowering λ further does not present any improvement. One
important takeaway here is that a λ < 1 is rewarding, which implies that assigning
higher weight to more recently retired observations improves the result.
This reflects the time-dependence of the system. Let us next consider the active
data pool size, w, holding λ constant.
w     TPR    TNR    AUC    G-mean
50    0.727  0.886  0.874  0.799
100   0.777  0.875  0.894  0.823
250   0.808  0.874  0.913  0.843
500   0.797  0.884  0.917  0.839
750   0.815  0.885  0.926  0.851
1000  0.823  0.889  0.928  0.856
1500  0.817  0.894  0.931  0.859
2000  0.837  0.883  0.930  0.866
3000  0.827  0.886  0.931  0.871
4000  0.800  0.903  0.929  0.874

Table 4.4.: Evaluation of active data pool size (w) for the Dynamic Trees
From Table 4.4, one can observe that, between w = 50 and w = 1000, the performance of the classifier monotonically increases. After this point, the AUC and
G-mean do still increase (up until w = 3000), but only marginally (relative to
the increase in w). Considering the notable increase in computational cost
(see Table A.3), the marginal gains in performance are not enough to displace w = 1000
as the best alternative.
The observation that the performance improves as w is increased is not that surprising because, as previously described, the size of the active data pool determines
the total number of observations stored in the tree (at any given time point),
and hence a lower w has the effect of forcing the tree to be smaller, whilst a larger
w allows the tree to grow larger. A larger tree has the advantage of being able to
capture more complex structures in the data, but this comes at the cost of potentially
not being as flexible as a smaller tree.
Based on the results and analysis of Tables 4.3 and 4.4, it is concluded that the best
DT is the one with parameter settings λ = 0.80 and w = 1000: it achieved an AUC
of 93.4% and a G-mean of 86.4%. To gain insight into how well this model performed
over time, we consider, in Figure 4.5, a rolling window displaying the accuracy at
different time-points.
Figure 4.5.: Rolling window measuring the Accuracy for the best Dynamic Trees
model over the considered period
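The rolling-window accuracy shown in Figure 4.5 can be computed as follows; a minimal sketch, where the default window length is an illustrative choice, not necessarily the one used for the figure:

```python
def rolling_accuracy(y_true, y_pred, window=500):
    """Accuracy over a sliding window of the most recent `window` predictions,
    producing one value per time point once the window is full."""
    correct = [int(t == p) for t, p in zip(y_true, y_pred)]
    out = []
    running = sum(correct[:window])      # accuracy of the first full window
    out.append(running / window)
    for i in range(window, len(correct)):
        # Slide the window: add the newest outcome, drop the oldest.
        running += correct[i] - correct[i - window]
        out.append(running / window)
    return out
```

A windowed (rather than cumulative) accuracy is what makes the temporary degradations visible: a cumulative average would smooth the drops in performance away.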
One can observe that at approximately seven time-points the performance of the classifier
degrades to an accuracy of less than 60%. The time-points of the degradations may
be compared to the call drop time-series for the undersampled data in Figure 4.7. In
doing so, one finds that four of the degradations happen during the second period of
abnormally high drop-rate (corresponding to subsets 13-22 in Figure 4.7). The sixth
and seventh degradations are related to the third and fifth periods of abnormally high
drop-rate respectively. As described in sec. 3.4.3, the latent state of the dynamic
tree consists of the tree structure. The degradations in this plot reflect the inability
of the model to update its structure fast enough.
A general observation one might make is that there is no clear trend, which can imply
one of two things: either (i) no structural changes occurred over this
time-period, or (ii) the classifier is able to adapt to changing circumstances, and thus
avoids any longer period of degradation. Since there clearly are reductions
in performance, but the classifier recovers, the second alternative appears
more likely. In sec. 4.3, we will consider the period 4200-6400 in more detail, to
explore what might have caused these degradations.
4.2.3. Dynamic Logistic Regression
In this subsection, the dynamic logistic regression, as well as the extension with
dynamic model averaging are evaluated.
4.2.3.1. Allowing for the inclusion of sparse attributes
As previously mentioned, in addition to being imbalanced, the data is also sparse
with regard to several of the input variables, and this proved to pose another
challenge. This part of the results is concerned with shedding light on this problem,
as well as with evaluating the proposed forgetting-factor modification and comparing it to
the original proposed in McCormick et al. (2012).
To explore how many, and which, variables the original forgetting framework has
trouble with, an experiment was set up such that 122 univariate dynamic regression
models were fitted, one for each covariate. If the model-fitting failed to execute
correctly during the recursive-updating step, a "1" was recorded for that attribute. If
no problem occurred, a "0" was registered. The same experiment was then performed
for the modified forgetting framework. The forgetting factor λ was set to 0.95 for
both versions. For the modified version, the additional parameter w was set to 10.
In Table 4.5, the outcome of these runs is presented:
Forgetting scheme   Success  Failed
Original            92       30
Modified            122      0

Table 4.5.: Evaluation of forgetting frameworks
One may observe that approximately 1/4 of the covariates could not be used with
the original forgetting framework. By applying the modified version, however, all
covariates could be included. The full list of variables for which the original forgetting framework failed is found in Table A.4. One characteristic that they all share
is that they are temporally sparse. Let us consider one of these covariates, cell_526,
in a bit more detail. The updating step fails to converge at time-point 3441, and in
Figure 4.6 the log-odds, the values of the covariate, and the values of the response
variable are presented between time-points 3350-3450.
From Figure 4.6 it can be observed that during this sub-period, 6 calls were made
from cell_526, of which 3 were dropped. At time-point 22 (in this plot), a match
occurs (cell_526 = 1 and y = 1), and the model reacts by updating the parameter estimate to ∼500+. At time-point 36, the next call from cell_526 is made; however,
this time it is not a match (cell_526 = 1, y = 0), and consequently the model updates
the parameter estimate to ∼−200. Finally, at time-point 91, we observe two consecutive matches, and this is what causes the model updating to crash: the model
updates the parameter to such an extent that when the logit prediction is made, exponentiation of the log-odds produces an infinite value. As previously mentioned, by
lowering the λ value, we assign larger weights to more recent observations (according to weight_j = λ^j), and what we observe here is that, during periods
of sparsity, this has the potential effect of causing extreme inflation of parameter
estimates, which crashes the algorithm. The proposed modified forgetting framework
addresses this problem by scaling λ closer to 1 during periods of sparsity, and hence
bases its parameter updates on longer spans of observations.
Figure 4.6.: Example of breakdown for original forgetting framework: cell_526
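The failure mode just described - exponentiating an inflated log-odds estimate - can be reproduced numerically. The stable variant below is a standard numerical remedy shown only for illustration; the thesis instead addresses the root cause, the inflated estimates themselves, via the modified forgetting framework:

```python
import math

def unstable_sigmoid(log_odds):
    """Naive logistic transform: math.exp overflows once the log-odds exceed
    roughly 709 in double precision, mirroring the crash in Figure 4.6."""
    return math.exp(log_odds) / (1.0 + math.exp(log_odds))

def stable_sigmoid(log_odds):
    """Numerically stable variant: never exponentiates a positive argument."""
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    z = math.exp(log_odds)
    return z / (1.0 + z)
```

With a log-odds estimate of ∼500+, as observed for cell_526, the naive transform raises an overflow while the stable form returns a probability of (numerically) 1.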
4.2.3.2. Evaluation of forgetting factors
Given the central importance of the concept of forgetting, this section is dedicated to
evaluating the different forgetting strategies described in sec. 3.4.1, as well as
different forgetting constants (c = λ < 1), so as to obtain the best fit and prediction
capability. Note that, when using the modified forgetting framework, the additional
parameter w (defining the window over which local sparsity is estimated) is
set to 10 throughout this work: it was empirically found to be a suitable value. First
we consider the simplest strategy, that of a fixed λ, using the original forgetting
framework.
Lambda  AUC      G-mean
1.000   0.94746  0.88126
0.999   0.97700  0.92302

Table 4.6.: Evaluation of Simple Forgetting Strategy
Lambda  AUC      G-mean
0.99    0.97009  0.91377
0.95    0.99656  0.98119
0.90    0.99930  0.99508
0.85    0.99954  0.99649
0.80    0.99963  0.99688
0.75    0.99956  0.99657

Table 4.7.: Evaluation of Adaptive Forgetting Strategy
Lambda                                     AUC      G-mean
Multiple: 1, 0.99, 0.95, 0.90, 0.85, 0.80  0.99972  0.99867

Table 4.8.: Evaluation of Multiple Adaptive Forgetting Strategy
From Table 4.6, it can be observed that, of the two considered λ values, the
best result is obtained with λ = 0.999. As this strategy was implemented via
the original forgetting framework (because we want a fixed λ), the process collapses
for λ values lower than 0.999, and hence lower values were not explored for this
approach. The results in Table 4.6 do nonetheless hint that a more local fit
is probably preferable - giving relatively higher weights to more recently observed
data points.
From Table 4.7, one can observe that, using the extended forgetting strategy proposed by McCormick et al. (2012) coupled with the modification proposed in this
thesis, we are able to improve considerably on the results of the fixed λ. The forgetting
factor which resulted in the best prediction capability is λ = 0.80, obtaining
an AUC of 99.963% and a G-mean of 99.688%. This again implies that a
local fit is preferable to a global one. A λ value of 0.80 implies that an observation
occurring 10 time-points back is assigned approximately 1/10th of the weight of
the most recent observation.
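The quoted factor follows directly from the exponential weighting w_j = λ^j:

```python
def retired_weight(lam, j):
    """Weight assigned to an observation j time-points back under
    exponential forgetting: w_j = lam ** j."""
    return lam ** j

# lam = 0.80, ten steps back: 0.8 ** 10 is about 0.107, i.e. roughly 1/10th.
```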
In Table 4.8, results for the third strategy, that of multiple λ's, are displayed. It is found
that extending the number of λ's to evaluate at each iteration leads to a marginal
improvement in this case (AUC = 99.972% and G-mean = 99.867%). This
improvement comes at the cost of slower computation, such that a trade-off
has to be made. Since the improvement is only marginal in this case, the extension
may not be worth the computational cost. However, if the degree of change itself
varies over time, multiple λ's may be worthwhile.
A final comment is that, using the original (adaptive) forgetting framework proposed
by McCormick et al. (2012), the lowest λ value that could be used was λ = 0.98,
which resulted in AUC = 99.688% and G-mean = 98.234%. Hence, the modified
forgetting framework, in addition to being able to include more covariates, is able
to outperform the original approach on this data.
4.2.3.3. Extension with Dynamic Model Averaging
Two approaches for the construction of candidate models are considered: (i) one
candidate model per "variable group", and (ii) one candidate model for each possible
combination of "the most interesting variable groups". It should be emphasized
that the sets of variables and variable groups differ between (i) and (ii): the former
consists of all 122 variables (22 variable groups), whilst the latter only contains
92 variables and 6 variable groups. Let us begin by considering the former:
Strategy 1
First, the model forgetting factor, α, is considered (holding the within-model forgetting factor, λ, constant):
Lambda  Alpha  AUC    G-mean
0.99    0.99   0.881  0.798
0.99    0.95   0.917  0.839
0.99    0.90   0.931  0.855
0.99    0.85   0.935  0.860
0.99    0.80   0.937  0.862
0.99    0.75   0.937  0.863
0.99    0.70   0.937  0.863

Table 4.9.: Evaluation of alpha for DMA
From Table 4.9, one can observe that as the α value is lowered, the predictive
capability of the model steadily increases. This reflects, on the one hand, that
we have many small models that by themselves may not be very predictive, and
on the other hand, that these models discriminate the data with varying quality
over the span of the time-series: e.g. one variable group (candidate model)
may explain the data relatively well at one point in time, but not at another.
The gains in predictive capability from a lowered α take a decaying form, and
stop around 0.75 (AUC is marginally lower at α = 0.70). As such, we move on
to the second parameter, the within-model forgetting factor, λ (holding the model
forgetting factor, α, constant):
Lambda  Alpha  AUC    G-mean
1.00    0.75   0.928  0.847
0.99    0.75   0.936  0.863
0.98    0.75   0.945  0.875
0.97    0.75   0.950  0.883
0.96    0.75   0.955  0.890
0.95    0.75   0.958  0.895
0.94    0.75   0.960  0.897
0.93    0.75   0.962  0.899
0.92    0.75   0.963  0.902
0.91    0.75   0.965  0.905
0.90    0.75   0.966  0.907
0.89    0.75   0.967  0.909
0.88    0.75   0.968  0.911
0.87    0.75   0.969  0.913
0.86    0.75   0.970  0.913
0.85    0.75   0.970  0.913
Table 4.10.: Evaluation of lambda for DMA
In Table 4.10 it can be seen that the comments made for α also apply to λ: as
the forgetting factor is lowered, the overall performance increases. As previously
described, what this translates into, in practice, is that the candidate models adapt
to local behaviors through rapid and local updating of the coefficient estimates.
It should be noted that some trouble was encountered here when lowering the λ
value (even with the modified forgetting factor). These problems were limited to
specific candidate models, and as such, a specific (higher) λ value was set for those.
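A sketch of how α acts in the DMA updating, assuming the standard forgetting step from the DMA literature (McCormick et al., 2012): between time-points the posterior model probabilities are raised to the power α and renormalized, so a lower α flattens the distribution and lets weight shift faster between candidate models. The probabilities below are made up:

```python
# DMA model-probability "forgetting" step: raise each posterior model
# probability to the power alpha and renormalize. Lower alpha flattens
# the distribution, allowing faster shifts of weight between models.
def forget_model_probs(probs, alpha):
    powered = [p ** alpha for p in probs]
    total = sum(powered)
    return [p / total for p in powered]

posterior = [0.70, 0.20, 0.10]        # illustrative posterior at time t-1
flattened = forget_model_probs(posterior, alpha=0.75)
print([round(p, 3) for p in flattened])
```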
Strategy 2
In the second strategy, we construct candidate models by considering all the possible
model-combinations of "the most interesting variable groups". Selecting six variable
groups translates into 2^6 = 64 candidate models. In Table 4.11 these variable groups
are displayed. Given the encountered limitations of the previous strategy, we here set
a rather conservative λ of 0.95 to ensure stability.
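The candidate-model construction just described can be sketched by enumerating every subset of the six variable groups (group names from Table 4.11; whether the empty subset is counted as one of the models is an assumption here):

```python
from itertools import combinations

groups = ["tProc4", "Cell ID", "GCP", "Radiolink", "UeRcid", "EvID"]

# Every subset of the six variable groups (including the empty set)
# gives 2^6 = 64 candidate models.
candidate_models = [
    subset
    for r in range(len(groups) + 1)
    for subset in combinations(groups, r)
]
print(len(candidate_models))  # 64
```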
Variable Groups  Variables
tProc4           t_proc1,...,t_proc37
Cell ID          cell_1,...,cell_17
GCP              X1,...,X23
Radiolink        Radiolinkrequest,...,Radiolinkfailure
UeRcid           PS, CS, SRB, Mixed
EvID             e1a, e1d,...e1f
Table 4.11.: Variable groups
Model       Alpha  AUC    G-mean
DMA         0.99   0.983  0.939
DMA         0.9    0.990  0.954
DMA         0.8    0.993  0.961
DMA         0.7    0.994  0.965
DMA         0.6    0.995  0.968
DMA         0.5    0.995  0.970
DMA         0.4    0.996  0.971
DMA         0.3    0.996  0.972
DMA         0.2    0.996  0.973
DMA         0.1    0.997  0.974
DMA         0.01   0.997  0.974
Full Model  –      0.990  0.954
Table 4.12.: Strategy 2: Evaluation of alpha for DMA
From Table 4.12 it can be observed that as the model-forgetting factor α is lowered,
the classification capability of DMA monotonically increases, matching the single
(full) model at α = 0.90 and outperforming it for lower values. The gains in AUC
from lowering α gradually decay, and at α = 0.10, only the 5th decimal changes, and
hence we stop there. These results imply, on the one hand, that the variable groups
have a non-constant and varying degree of importance over the time-period, and
that being able to shift more weight to models excluding less relevant variables is
rewarding. Decreasing α as low as 0.10 has the effect of flattening the distribution
of the model indicator quite extensively, such that all candidate models are assigned
relatively low weights and the prediction of DMA becomes more of an averaged
prediction over many candidate models, rather than a few. Whilst presenting
promising results, this approach has the downside of having to consider 64 candidate
models rather than 1, implying a hefty decrease in computational speed. However,
as the candidate models are updated independently of one another, it is possible to
parallelize this process, so as to reduce the computational constraints.
4.2.4. Summary of results
In Table 4.13, the best results for each of the considered approaches are presented.
Model          AUC     G-mean
Single DLR     0.9996  0.9969
Group DMA      0.9965  0.9737
Dynamic Trees  0.9341  0.8635
Table 4.13.: Summary of results
It can be observed that the single dynamic logistic regression is the clear winner; it
has obtained the highest AUC and G-mean scores. Recall that in contrast to the
single dynamic logistic regression and the dynamic trees model, the Group DMA
model only consists of 92 variables, divided into 6 variable groups. This is worth
underscoring since we in sec. 4.2.3.3 concluded that the Group DMA outperformed
the single model.
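For reference, the G-mean reported in these tables is commonly defined as the geometric mean of sensitivity and specificity; a minimal sketch with made-up confusion-matrix counts:

```python
from math import sqrt

def g_mean(tp, fn, tn, fp):
    """Geometric mean of sensitivity (recall on drops) and specificity."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sqrt(sensitivity * specificity)

# Hypothetical counts: 98 of 100 dropped calls and 95 of 100 normal
# calls correctly classified.
print(round(g_mean(tp=98, fn=2, tn=95, fp=5), 3))  # 0.965
```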
From Table 4.13, one can also observe the rather big difference in predictive
capability between the models centered around the dynamic logistic regression and
the dynamic trees. One possible explanation for this is that the process of updating
the tree-structure may not be as rapid as the process of updating the parameters
in the dynamic logistic regression. To demonstrate this point, consider Figure A.1.
It can be seen that the covariate cell_id220524 was included in 100% of the N
particles (trees) between time-points 4200-9800. If we next consider Figure A.2,
where the reduction in entropy for the same covariate is displayed, one may observe
that the period for which the reduction in entropy is large (4200-6000) is considerably
shorter than the period for which the covariate is included in the trees (4200-9800).
4.2.5. Static Logistic Regression vs. Dynamic Logistic Regression
The two classifiers that were selected for this thesis have in common that they are
dynamic and updated online. In this subsection, the question of whether a dynamic
model is preferable to a static one is evaluated in terms of the predictive capability
over the considered period. Seven scenarios are considered, and what differentiates
them is the size of the batch in the training set compared to the test set: from {10%
training, 90% test}, to {90% training, 10% test}. For all of the scenarios, a static
logistic regression is fitted in the training period, and then used for predicting the
incoming calls for the next period. The results are presented in Table 4.14.
It can be observed that the static classifier performs gradually better as it is fed
more data: the highest AUC score is obtained when 80% of the data is in the
training set (AUC = 91.11%). In comparison to the dynamic logistic regression, the
performance is considerably worse: recall that the best dynamic logistic regression
obtained an AUC of 99.96%.

Training set proportion  AUC
0.100                    0.846
0.200                    0.855
0.333                    0.878
0.500                    0.881
0.667                    0.900
0.800                    0.911
0.900                    0.880
Table 4.14.: AUC Static Logistic Regression
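The evaluation scheme described above can be sketched as follows: fit on the first fraction of the time-ordered data, score the remainder, and compute AUC. Here `fit_logistic` is a placeholder for any static fitting routine, and the AUC is the rank-based (Mann-Whitney) formulation:

```python
# Sequential evaluation of a static classifier: train on the first fraction
# of the time-ordered data, score the rest.

def auc(scores, labels):
    """Probability that a random positive outscores a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def evaluate_split(data, labels, train_frac, fit_logistic):
    cut = int(len(data) * train_frac)            # respect the time order
    model = fit_logistic(data[:cut], labels[:cut])
    scores = [model(x) for x in data[cut:]]
    return auc(scores, labels[cut:])
```

In practice `fit_logistic` would wrap an actual logistic-regression fit; here it is only an assumed interface (returns a callable mapping an observation to a score).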
In addition to the scenarios described above, in which data is accumulated
sequentially, let us consider a scenario in which all the data is available. First, we
randomly assign 70% of the observations to a training set (without accounting for
the sequential order of the data), and use this data to fit a logistic regression model.
Secondly, we use this fitted model to predict the outcomes of the remaining 30%
of the data (the test set). Following steps 1 and 2, one obtains predictions that
result in an AUC of 93.8%. In other words, an improvement over the sequential
modeling scheme of the static classifier, but still far worse than the dynamic
logistic regression.
4.3. Online drop analysis
Besides producing actual predictions, the chosen classifiers - as previously
mentioned - leave varying degrees of "traces" (not to be confused with the covariate)
as to how these predictions were made. The idea of this part of the results, which
is termed online drop analysis, is to explore these "traces", and in particular what
they can tell us about periods with an abnormal number of dropped calls.
Having randomly undersampled the data as if it arrived online (according to the
sampling rate concluded to be best: 50/50), the call-drop time-series from Figure 4.1
is transformed so as to appear as in Figure 4.7.

Figure 4.7.: Number of dropped calls, as divided into 50 equally large time-ordered
subsets - based on the undersampled data (ORUS 50/50)
In the previous section, it was shown that the dynamic logistic regression clearly
outperformed the dynamic trees model, and hence this section will be centered
around the use of the dynamic logistic regression, although the results of the DT
are supplied for comparative and confirmatory purposes. As previously mentioned,
in McCormick et al. (2012); Koop and Korobilis (2012), two approaches were
considered for studying the temporal significance of covariates and how the
conditional relationships change over time: the first considers posterior inclusion
probabilities, and the second odds-ratios. Both of these approaches are considered
in this section.
4.3.1. DMA posterior inclusion probabilities
As mentioned in sec. 3.4.2, candidate models are - in this work - not constructed on
the basis of individual variables, but by variable groups, such that we do not consider
all the possible combinations of variables, but rather all the possible combinations of
variable groups: 2^6 = 64. A DMA model was set up with the parameter settings that
were found to produce the best results in sec. 4.2.3.3. In Figure 4.8 and Figure 4.9,
posterior inclusion probabilities for these variable groups over the considered period
are displayed.
A first broad observation is that the inclusion probabilities are quite volatile for
all of the variable groups. This is a byproduct of setting a low α value, since
this parameter controls how rapid the dynamic updating of the model probabilities
should be. It may further be noted that none of the individual candidate models
assumes a posterior model probability higher than 0.5 for any notable length of time,
and that none of the variable groups assumes a posterior inclusion probability of 1 or
0 for the whole period. The two variable groups with the overall lowest inclusion
probabilities are Variable Group 1 (Trace 4 Procedures) and Variable Group 4 (UeRcid).
The two variable groups with the highest overall inclusion probabilities are Variable
Group 5 (Radiolink) and Variable Group 2 (GCP).
A more specific, and possibly more interesting, observation concerns Variable
Group 6 (Cell IDs). In Figure 4.9 one can observe that for three periods (one
around 4500, one around 9000, and one around 11500) the inclusion probabilities
remain close to 1, in a way that is not observed for the rest of the time-series.
Figure 4.8.: Posterior inclusion probabilities: Variable Group 1: Trace 4 Procedures
|| Variable Group 2: GCP || Variable Group 3: evID
Figure 4.9.: Posterior inclusion probabilities: Variable Group 4: UeRcid || Variable
Group 5: Radiolink || Variable Group 6: Cell ID
4.3.2. Evolution of odds-ratios and reduction in entropy
In this subsection, the temporal significance of covariates is analyzed by considering
the evolution of the odds-ratios as well as the posterior model probabilities from the
dynamic logistic regression and the univariate scanner. The reduction in entropy
from the DT model is also considered as a way to confirm the main results.
To explore all of the covariates individually is beyond the scope of this thesis. As
such, we limit the analysis to one interesting period, or more specifically, a period
of abnormal call-drop ratio. It is worth emphasizing that whilst we here first select
a period and then evaluate which variables were important, this order of events
is not required. That is, one could just as well have assumed that no knowledge
of the actual call-drop ratio was possessed, and instead have monitored the actual
evolution of the output from the considered models.
Before considering the “interesting period”, we shall first compare the degree of
insight from the (full) single dynamic logistic regression to that of the univariate
DMA (also referred to as the “univariate scanner”), as to determine which to use
for analyzing the “interesting period”.
4.3.2.1. Single Dynamic Logistic Regression vs. Univariate DMA
As previously described, in the single dynamic logistic regression, all of the covariates are included in the same model, whilst in the univariate DMA (the “univariate
scanner”) there are as many candidate models as there are covariates, one covariate in each. An important difference between these two approaches concerns the
forgetting factor: in the full model, we set a common forgetting factor for all of
the covariates, whilst in the univariate case, we allow for each covariate to have its
own forgetting factor. A priori, this suggests that the latter allows for a more
precise recursive estimation for each of the covariates. To evaluate whether
this is the case, we compare the recursive estimates from the single dynamic logistic
regression that were found to obtain the best results in sec. 4.2 (λt = 0.80), to a univariate DMA (with λt = 0.95). In Figure 4.10 and Figure 4.11, one such comparison
is displayed.
Figure 4.10 and Figure 4.11 demonstrate a general finding: the univariate DMA
is indeed able to update the coefficient estimates more precisely. For instance,
note that the confidence band in Figure 4.11 is narrower than in Figure 4.10,
reflecting the ability of the former to not update unnecessarily. The
degree of similarity between the recursive estimates of the two approaches is linked
to the posterior model probabilities in the univariate DMA, such that if the posterior
model probability - for a particular covariate - is large at a particular time-point,
the updating (at that time-point) of the single dynamic logistic regression is likely
to be similar to that of the univariate DMA. For instance, consider Figure A.3 and
Figure A.4 in the Appendix. This is reasonable, since what determines whether
forgetting should be applied or not at a particular time-point is the predictive
likelihood, and hence, if there is a covariate that is predominant at this time-point,
it will also have a great impact on the predictive likelihood in the full model. The
increased precision of the univariate DMA comes at the cost of computational speed.

Figure 4.10.: Log-odds from the single dynamic logistic regression: "GCP
000011000000001100001000"

Figure 4.11.: Log-odds from the "univariate scanner": "GCP
000011000000001100001000"
Given the analysis of the previous paragraph, the univariate DMA is used as the
tool for obtaining log-odds and odds-ratios for the remainder of this section.
4.3.2.2. Interesting period
The period under consideration is that which occurs between time-points 4200-6400,
corresponding to 14-23 in Figure 4.7 (i.e. the second period of high drop rate).
In Table 4.15, the covariates that were found to attain a significant positive effect
during this period are listed. Time-series plots of the recursive coefficient estimates
for these covariates are found in sec. A.2.2.
Coefficient                    Time-period
cell_id220524                  4200-6400
X14                            4200-5200
X15                            4200-6400
X16                            4200-5200
X21                            4200-6400
X22                            4200-5200
X23                            5200-6400
GCP 000011000000001100001000   4200-5200
GCP 000011000000011100011100   4200-5200
GCP 000000000000011000011011   5200-6400
GCP 000000000000001000011011   5200-6400
PS                             4200-6400
radiolinksetupfailurefdd       5200-6400
Table 4.15.: Significant coefficients during period of consideration: 4200-6400

As one may observe from Table 4.15 (or from sec. A.2.2), some of the covariates are
only relevant for a certain part of this period. More specifically, time-point ~5200
appears to be a division-point. Hence, this period can be thought of as consisting
of two sub-periods. If we take a look at Figure 4.7, this appears reasonable: at
time-point 18 there is a noticeable increase (and this corresponds to time-point ~5200
in the figures displayed in sec. A.2.2).
Sub-period 1
Considering the evolution of the odds-ratios for the first sub-period, there is one
covariate that presents a particularly interesting behavior, namely Cell_220524.
Let us therefore consider it in a bit more detail. Recall that cells define the
geographical area in which a call is made. For most of the full period, i.e. 1-14200,
this covariate is insignificant, with an odds-ratio hovering around 1, but for the
particular period under concern, the odds-ratio shoots up significantly, easily passing
the previously mentioned rule-of-thumb of > 3. See Figure 4.12.
Exploring this covariate during this particular sub-period more closely, one finds
that 52.7% of the calls were made from Cell_220524, and that out of these, 98.5%
were dropped. This can be compared to the prior period (1-4200), in which 5%
of the calls were recorded for this cell, and out of these, only 15% were dropped.
Considering the posterior model probabilities for the (univariate) candidate
model consisting of this covariate, one may observe from Figure 4.13 that DMA
has assigned 100% posterior model probability to this candidate model during this
period. Finally, if we consider the measure of reduction in entropy obtained from
the dynamic trees, as displayed in Figure 4.14, one can observe that it replicates
Figure 4.12 and Figure 4.13 quite well.
Deriving association rules for the first sub-period (4200-5200), one obtains rules
that confirm the significance of the covariates. One finding, for instance, is that
~20% of the calls during this period were made from phones with the particular
GCP combination "000011000000011100011100", and out of these, 99.58% were
terminated unnaturally. Another finding concerns the covariate PS: 52.3% of
the calls originated from phones transmitting data, and out of these, 86.7% dropped.

Figure 4.12.: Odds-ratios: Cell 220524

Figure 4.13.: Posterior Model Probabilities: Cell 220524
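Support and confidence figures of the kind quoted in this section can be computed directly from call-level records; a minimal sketch with made-up records (the field names are hypothetical):

```python
# Support/confidence of a rule {antecedent} -> {dropped}, computed over a
# list of call records. The records below are made up for illustration.
def rule_stats(calls, antecedent):
    covered = [c for c in calls if antecedent(c)]
    support = len(covered) / len(calls)
    confidence = sum(c["dropped"] for c in covered) / len(covered)
    return support, confidence

calls = (
    [{"ps": 1, "dropped": 1}] * 45    # PS calls that dropped
    + [{"ps": 1, "dropped": 0}] * 7   # PS calls that did not drop
    + [{"ps": 0, "dropped": 0}] * 48  # non-PS calls
)
support, confidence = rule_stats(calls, lambda c: c["ps"] == 1)
print(round(support, 2), round(confidence, 2))  # 0.52 0.87
```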
Sub-period 2
Considering the evolution of the odds-ratios for the second sub-period, there is
again one covariate that presents a particularly interesting behavior, in this case
radiolinksetupfailurefdd. For the greater part of the full period, this covariate
mostly registers 0-values, but during the particular sub-period of concern, it registers
a lot of 1's, indicating its presence in the calls. In Figure 4.15, the evolution of the
odds-ratio for this covariate is presented. Furthermore, in Figure 4.16 the posterior
model probabilities for the candidate model representing this covariate is displayed.
From Figure 4.15, we can observe that the odds-ratio shoots up significantly around
~5200. One may further notice that the estimate then stabilizes around ~70 for
the remainder of the period. This is because very few subsequent observations
with this attribute are encountered, and hence the coefficient is not updated. From
Figure 4.16, it can be seen that DMA has assigned a posterior model probability
of 100% for the greater part of this sub-period. Furthermore, in Figure A.12 the
reduction in entropy for this covariate is displayed, and it can be observed that it
replicates the two aforementioned plots quite well. Exploring this covariate more
closely, one finds that 74.6% of the calls during this sub-period had a radiolink
setup failure (radiolinksetupfailurefdd = 1), of which 98.8% dropped.

Figure 4.14.: Reduction in Entropy: Cell 220524

Figure 4.15.: Odds-ratios: Radiolinksetupfailurefdd
Deriving association rules for the second sub-period (5200-6400), one again finds
rules that confirm the significance of the covariates. During this sub-period, we
find that two particular GCP combinations, "000000000000011000011011" and
"000000000000001000011011", are relevant and correlate with dropped calls to a
great degree. They represent 18.1% and 27.8% of the calls during this sub-period,
respectively, and out of these, 99.5% and 98.5% were dropped, respectively.
Figure 4.16.: Posterior Model Probabilities: Radiolinksetupfailurefdd

4.3.3. Static Logistic Regression vs. Dynamic Logistic Regression

In sec. 4.2.5, the question of whether a dynamic approach is preferable to a static
one was evaluated in terms of predictive capability. In this section, this question is
considered in terms of the degree of insight about variable effects. In Table B.5 in
the Appendix, one finds the coefficient estimates obtained from the static logistic
regression model fitted on 70% of the dataset, without accounting for the orderdependence. Below, a few examples where the results of the two approaches do not
align are presented.
Cell_id220517
As one may observe from Table B.5, using the static logistic regression, the coefficient
for this covariate was estimated at −0.528371 (with a p-value of 0.022355),
implying a significant negative effect. If we instead consider the recursive estimates
obtained from the dynamic logistic regression, displayed in Figure A.17, one can
observe that whilst this covariate presents a significant negative effect for some
sub-periods, it also presents periods of significant positive effect.
Cell_id220518
From Table B.5, it can be seen that the coefficient for this covariate was estimated
at −0.285032 (with the static logistic regression), however with a p-value
of 0.17, and hence, by most standards, it would be considered insignificant. If we take
a look at the recursive estimates displayed in Figure A.18, one can see that the
coefficient is indeed insignificant for large sections of the period, but that for a few
sub-periods it attains (both positive and negative) estimates that are at least ±2
standard errors away from 0.
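The "±2 standard errors" criterion used in these comparisons corresponds to an approximate 95% Wald check, and odds-ratios follow from the log-odds by exponentiation. A small sketch; the standard error below is an illustrative value back-derived from the reported p-value, not a figure taken from the thesis:

```python
from math import exp

# A coefficient is treated as significant at a time-point when its estimate
# lies at least two standard errors away from zero; the odds-ratio is exp(beta).
def significant(beta, se, z=2.0):
    return abs(beta) >= z * se

beta, se = -0.285, 0.208   # illustrative values, roughly matching cell_id220518
print(significant(beta, se), round(exp(beta), 3))  # False 0.752
```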
X23
In Table B.5, one can observe a coefficient estimate of −0.506021 (with a p-value
of 0.000606), implying a significant negative effect. Considering the recursive
estimates from the dynamic logistic regression, displayed in Figure A.10, one may
indeed observe that for large parts of the period the coefficient takes a negative
value, but that for a sub-period of approximately 1000 time-points, it attains a
significantly positive value (where it reaches an odds-ratio of ~15).
4.3.3.1. Summary
For the three examples described above, the common theme is that the static
framework does not capture temporal behaviors, and as a consequence, the
resulting estimated effects could be misleading to interpret. For instance, interpreting
the effect of "X23" as "significantly negative over the considered period", although
methodologically correct, may be misleading in practical applications. There are
dozens of additional cases like those described above. These results may be seen
as a partial explanation of why the dynamic logistic regression also proved to perform
better in terms of predictive capability.
5. Discussion
To our knowledge, this thesis work is the first to approach the problem of analyzing
drop causes in mobile networks by using online learning classification techniques.
This approach was motivated by, on the one hand, the availability of class labels,
and on the other, the assumption that the data is non-stationary and that what
correlates with a particular class may change over time. A natural approach would
otherwise have been to treat this as an anomaly detection problem, in which
abnormal periods could be identified by large increases in the number of dropped calls.
By instead framing the problem as an online classification problem, one arguably
addresses the core of the problem more directly, in that one does not necessarily have
to first detect a suspicious period in order to detect changes in drop causes.
In its original format, the data consisted of very large .txt files, such that any direct
application of statistical or machine learning methods was not possible. Consequently,
quite a lot of time was initially spent on processing the data using various
text mining techniques. The collective size of the .txt files was enormous, such
that pre-processing on a regular computer faced memory issues. The decision was
made to limit the analysis to one STP, and further to apply sampling techniques in
the parsing step, such that only a limited number of observations (calls) had to be
pre-processed. Making the parsing scripts more efficient, and possibly running them
on a more powerful computer, would increase both the possible scope of the analysis
and the practical applicability of the proposed approach; this is left for future work.
A lot of effort was invested trying to understand the data as well as exploring what
methods had been used previously to analyze similar data. One characteristic of
the data that was of particular focus initially was that every observation has a
sequential structure: every call has a beginning and an end, and in between these
two events, data are successively recorded - denoted by time-stamps. As such,
an initial idea was to use sequential pattern-mining methods for the purpose of
detecting new behavior in the data. However, after having explored simpler (static)
classification techniques on unordered data from the logs, it became clear that the
sequential structure may not be as important as initially expected (to discriminate
between normal and dropped calls). As such, the problem was instead defined as a
classification problem in which another sequential aspect was emphasized: that
between logs, rather than within logs.
On the basis of the characteristics of the data (high-dimensional, sequentially arriving, and non-stationary) and the objective of the thesis (to explore discriminative
features), four criteria were set up so as to determine the specific classification
techniques to be used. From the literature review that was performed, it was found that
these criteria quite drastically narrowed the space of apt classifiers. In the end,
(dynamic) extensions of the logistic regression and partition trees were selected, both
satisfying the four criteria relatively well.
More specifically, considering the former first, two dynamic extensions of the logistic regression were considered, one being the (single-) dynamic logistic regression,
and the other being a further extension of the former, also accounting for model
uncertainty through an extension of BMA (DMA). The initial expectation was that
the latter would pose a stronger alternative compared to the former in terms of performance. It was however found that the standard approach of considering all the
possible variable-combinations was not computationally feasible for this data, due
to high-dimensionality as well as a long time-series. An interesting extension of the
DMA was proposed in Onorante and Raftery (2014) to address the issue of large
model spaces, using the concept of Occam's window to reduce the number of possible
models considered at every time-point. This approach was however ultimately
disregarded as (i) it is suited for shorter time-series, and (ii) it only occasionally tests
to include candidate models from the larger space of models, and as such does not align
so well with the concept of detecting change in drop causes. Instead, two alternative
strategies for constructing candidate models were considered. Whilst showing strong
performance, neither could outperform the single dynamic logistic regression, which
was shown to perform excellently on this data: the best one resulting in an AUC of
99.96% and a G-mean of 99.7%.
The other online learning classification technique considered in this thesis was the
Dynamic Trees. A careful evaluation of the model parameters was performed so as
to derive the best DT. Just as with the dynamic logistic regression, it was found
that a λ < 1 improved the results, implying that a local fit is preferable to a global
one. In terms of predictive capability, this technique did not perform as well as the
dynamic logistic regression: the best DT was shown to obtain an AUC and G-mean
of 93.41% and 86.35% respectively. Figure 4.5 displays the performance of the DT
over the considered period, and as previously noted, several degradations in
performance are present, implying that the tree-structure was not able to adapt
as quickly as needed. Such degradations were not found for the dynamic logistic
regression.
The performance of the best dynamic logistic regression classifier was further
compared to that of the standard (static) logistic regression in two experimental setups
in which (i) data were gradually observed, and (ii) all data were available. In both
cases, the dynamic logistic regression was shown to outperform the static logistic
regression with comfortable margins, supporting the hypothesis of non-stationarity,
as well as motivating the dynamic extension. In addition, it may be worth
noting that ANN and SVM - known for their ability to classify complex and
high-dimensional data - were also applied to this data (the full dataset); it was found
that neither could beat the performance of the dynamic logistic regression. These
are promising results, as the covariates extracted from each call are of such a type
that extending the proposed approach to early prediction seems feasible. This could
be an interesting approach to explore in the future.
In addition to the three main characteristics of the data listed above, one may
add (iv) temporal sparsity. This turned out to cause problems for the dynamic
logistic regression, where the original method failed to converge for about 1/4 of
the covariates. To address this problem, this work presented a modification of
the forgetting framework originally proposed by McCormick et al. (2012), which in
addition to considering the predictive likelihood also considers local sparsity. This
modification was shown to allow the inclusion of multiple variables that could not be
used with the original forgetting framework. An evaluation in terms of predictive
capacity was also performed, which showed that the modified framework achieved
a slightly stronger performance. The basic idea of this modification that during
periods of sparsity, lower degree of forgetting is applied, and hence the update of
the parameters is based on a longer span of data, appears intuitive. However, whilst
allowing more attributes to be included and a lower λ to be set, this modification
it did not completely solve the issue. It was found that under some circumstances,
this framework still had problems with convergence; when λ were set very low. As
such, when stability rather than maximum predictability is the objective, a more
conservative selection of λ is likely to be preferable. By scaling the forgetting factor
closer to 1 during periods of sparsity - for a particular covariate - one also runs the
risk of potentially not capturing some interesting local behavior at such periods.
The modification further introduced an additional parameter, defining the width
of the window in which sparsity should be considered - and everything else equal,
more parameters is not to be preferred. Further refinement and evaluation of this
modification is needed, and is left for future work.
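To make the idea concrete, the sparsity-scaled forgetting factor can be sketched as follows. The function name and the linear scaling rule are illustrative assumptions, not the exact formula used in the thesis:

```python
def effective_lambda(base_lambda, recent_active, w):
    """Sparsity-adjusted forgetting factor (illustrative sketch).

    base_lambda  : nominal forgetting factor, e.g. 0.95 (hypothetical value)
    recent_active: number of the last w calls in which the covariate
                   was observed (non-zero)
    w            : width of the sparsity window

    During sparse periods the factor is scaled toward 1, i.e. less
    forgetting, so that parameter updates draw on a longer span of data.
    """
    activity = recent_active / w          # fraction of active calls in [0, 1]
    return 1.0 - (1.0 - base_lambda) * activity

# Dense period: the nominal forgetting applies essentially unchanged.
assert abs(effective_lambda(0.95, 100, 100) - 0.95) < 1e-9
# Completely sparse period: no forgetting at all (lambda = 1).
assert effective_lambda(0.95, 0, 100) == 1.0
```

This also makes visible the trade-off discussed above: during sparse stretches the factor approaches 1, so local behavior in those stretches is smoothed away.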
Regarding the part of the analysis concerned with the temporal significance of covariates, which we termed online drop analysis, it was demonstrated that the selected
online learning classification techniques were able to provide good insights into which
variables are important at different time-points. Two levels of granularity were considered: one centered on 'variable groups', and the other focusing, as in the standard
case, on the actual variables. In the former, posterior inclusion probabilities obtained
from DMA were analyzed; in the latter, the evolution of the log-odds or odds-ratios.
The scope of insights from the former was quite limited. Concerning the latter, an
evaluation was performed to determine whether the 'univariate scanner' is more
suitable than the single dynamic logistic regression. It was found that the 'univariate
scanner', through its variable-specific forgetting, could more precisely describe the
evolution of the coefficient estimates. In the case of the 'univariate scanner', in
addition to evaluating the log-odds or odds-ratios, we also analyzed the posterior
(univariate) model probabilities. Finally, for the dynamic trees, the reduction in
entropy was studied. Using these approaches, two sub-periods of abnormal call-drop
rates were analyzed, and several interesting findings were made. One, for instance,
is that the geographical area from which calls were made (the cells) played a key
role in the first sub-period. Furthermore, it was shown that the aforementioned
approaches to analyzing temporal significance resulted in similar conclusions.

Reflecting briefly on the effect of the forgetting factor in the context of online drop
analysis, it may again be underscored that a lower λ has the effect of giving greater
weight to more recently observed data points, and hence produces more volatile
updates of the odds ratios. This increases the degree to which one is able to detect
local temporal significance of covariates. However, to ensure stability of the system
and convergence of the algorithm, λ values below 0.9 were not considered for the
univariate scanner. Further refining the modified forgetting factor, and hence potentially allowing for greater granularity in identifying local behavior, is left for future
work. Since the objective of this thesis was to develop and demonstrate a framework
rather than to extract specific covariates, collaboration with domain experts could
improve the selection of variables to extract, so as to better serve the specific
objectives of the troubleshooting team. Another extension of this approach could be
to automate the "detection-step", such that an alert would be triggered if, for
instance, a covariate's odds ratio exceeds a certain level.
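Such a detection step could, as a sketch, be a simple threshold rule applied to the recursively estimated coefficients. Both the function and the threshold value below are hypothetical:

```python
import math

def check_alerts(log_odds, threshold=3.0):
    """Return covariates whose current odds ratio exceeds a threshold.

    log_odds : dict mapping covariate name -> current recursively
               estimated coefficient (log-odds scale)
    threshold: odds-ratio level at which an alert is raised
               (illustrative value; in practice it would be tuned,
               possibly per covariate, with domain experts)
    """
    return [name for name, b in log_odds.items()
            if math.exp(b) >= threshold]

alerts = check_alerts({"Cell_id220524": 1.8, "radiolinkfailurefdd": 0.4})
# exp(1.8) is about 6.0 and exceeds 3.0; exp(0.4) is about 1.5 and does not.
assert alerts == ["Cell_id220524"]
```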
In addition to reducing the computational burden of pre-processing, sampling techniques were also used to address the issue of class imbalance. Sampling techniques
generally have the positive effect of helping classifiers learn the minority class better,
and this was shown to be the case in this thesis as well. Specifically, two sampling
techniques were evaluated: (i) online random undersampling, and (ii) adaptive-online
random undersampling, the latter developed in this thesis. Both were shown to
increase the classifiers' ability to correctly identify minority instances. This improvement does, however, come at the cost of potential information loss: by undersampling
to such a great extent as done in this thesis, one runs the risk of missing out on
potentially useful information. To evaluate the degree to which this was a problem,
sensitivity analyses were performed to ensure that different sampling rates resulted
in approximately the same conclusions. This analysis did not reveal any noteworthy
issues. Additional evaluation of this aspect would nonetheless be rewarding and is
left for future work.
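As an illustration, the basic online random undersampler amounts to a one-pass filter over the labelled call stream; the keep-probability and the stream representation here are assumptions for the sketch (the adaptive variant would additionally adjust the keep-probability over time):

```python
import random

def online_undersample(stream, keep_prob, seed=0):
    """Online random undersampling of a labelled call stream.

    Every dropped call (label 1, the minority class) is kept;
    each normal call (label 0) is kept with probability keep_prob.
    Requires no storage of past observations.
    """
    rng = random.Random(seed)
    for x, y in stream:
        if y == 1 or rng.random() < keep_prob:
            yield x, y

# Toy stream: 2 dropped calls among 10,000 normal calls.
stream = [(i, 1 if i < 2 else 0) for i in range(10002)]
sample = list(online_undersample(stream, keep_prob=0.01))
minority = sum(y for _, y in sample)
assert minority == 2        # every minority instance survives
assert len(sample) < 1000   # the majority class is heavily reduced
```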
In terms of implementation, online learning has the advantage of not requiring the
storage of any data, which is valuable given that the amount of data Ericsson
accumulates every day is enormous. The DMA approach can further be parallelized,
since each candidate model is updated independently of the others, which speeds up
computation considerably.
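Schematically, a parallel DMA step maps the same new observation over all candidate models; the update rule below is a trivial stand-in for the actual dynamic logistic regression update, and the function names are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def update_model(state, observation):
    """Placeholder for one dynamic-logistic-regression update
    (predict, then correct) of a single candidate model."""
    return state + observation            # stand-in for the real update

def dma_step(states, observation, pool):
    """Update all candidate models for one new observation in parallel;
    each update only reads its own state, so no synchronization is needed."""
    return list(pool.map(lambda s: update_model(s, observation), states))

with ThreadPoolExecutor(max_workers=4) as pool:
    states = [0, 10, 20]                  # three toy candidate-model states
    states = dma_step(states, 5, pool)
assert states == [5, 15, 25]
```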
6. Conclusions
This work presents a, to our knowledge, new approach for analyzing dropped calls
in mobile networks. Compared to the static state-of-the-art approaches, the developed
framework, firstly, enables the detection of changes in drop causes without first
determining a suspicious period and, secondly, does not require any storage of data.
To address the issue of class imbalance, this thesis applied an online adaptation
of the random undersampling technique, as well as an extension developed in this
thesis. Whilst the developed technique did not succeed in improving on the results of
the online random undersampler, both techniques were shown to significantly improve
the discrimination of dropped calls compared to using the original data.
Two online learning classification techniques, dynamic logistic regression and dynamic trees, were explored in this thesis. The former was shown to have considerable
problems with temporally sparse covariates. To remedy this problem, this work
proposed a modification to the forgetting framework originally developed by
McCormick et al. (2012). The modification was shown both to allow for the inclusion
of sparse covariates and to improve the overall classification capability. Having
carefully evaluated the parameters for both of the models, the best dynamic logistic
regression model was shown to achieve excellent results, with an AUC of 99.96%
and a G-mean of 99.7%, whilst the best dynamic trees model achieved an AUC of
93.4% and a G-mean of 86.4%. That is, the dynamic logistic regression achieved
considerably stronger performance than the dynamic trees. To evaluate the choice
of online learning, a comparison was also made to the static logistic regression,
which was found to achieve an AUC of between 80% and 92%, depending on the
amount of data fed to the training set, providing strong support for the online
learning approach.
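For reference, the G-mean cited here is the geometric mean of the true-positive rate (TPR) and true-negative rate (TNR), which penalizes classifiers that neglect the minority class; a minimal sketch:

```python
import math

def g_mean(tpr, tnr):
    """Geometric mean of sensitivity (TPR) and specificity (TNR)."""
    return math.sqrt(tpr * tnr)

# A classifier that ignores the minority class entirely scores 0,
# however high its accuracy on the majority class.
assert g_mean(0.0, 1.0) == 0.0
assert abs(g_mean(0.9, 0.9) - 0.9) < 1e-9
```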
In addition to showing that the online learning approach was able to predict mobile
phone call data with great precision, this thesis also shows that the selected online
learning techniques are able to provide useful insights regarding temporally important variables for discriminating dropped calls from normal calls. A comparison
to static models was also made in terms of variable-importance insights. It was
found that considering the dataset as "a whole", rather than sequentially, led to
misleading effect interpretations due to the temporal nature of the system, further
supporting the online learning approach.
Whilst showing a lot of potential, the proposed approach needs to undergo further
refinement and testing to ensure stability and confirm its practical use. There are
several dimensions along which this work could be extended. A natural next step
would be to consult domain experts and more carefully select which variables ought
to be monitored. Another possibility is to evaluate whether the framework can be
extended to address the task of early classification of dropped calls, as in Zhou
et al. (2013).
A. Figures
A.1. Results: Online classification

Figure A.1.: Proportion of the N particles that include the covariate "Cell_id220524" (y-axis: varprop; x-axis: Time; figure not reproduced)

Figure A.2.: Reduction in entropy for covariate "Cell_id220524" from Dynamic Trees (y-axis: Variable Importance; x-axis: Time; figure not reproduced)
A.2. Results: Online drop analysis

A.2.1. Single dynamic logistic regression vs. Univariate DMA

Figure A.3.: Log-odds from the single dynamic logistic regression: "Cell_id220524" (log-odds vs. Time; figure not reproduced)

Figure A.4.: Log-odds from the "univariate scanner": "Cell_id220524" (log-odds vs. Time; figure not reproduced)
A.2.2. Significant covariates in interesting period

A.2.2.1. Log odds from the Dynamic Logistic Regression

Figure A.5.: Recursively estimated coefficient for "X14" from the dynamic logistic regression (log-odds vs. Time; figure not reproduced)

Figure A.6.: Recursively estimated coefficient for "X15" from the dynamic logistic regression (log-odds vs. Time; figure not reproduced)
Figure A.7.: Recursively estimated coefficient for "X16" from the dynamic logistic regression (log-odds vs. Time; figure not reproduced)

Figure A.8.: Recursively estimated coefficient for "X21" from the dynamic logistic regression (log-odds vs. Time; figure not reproduced)

Figure A.9.: Recursively estimated coefficient for "X22" from the dynamic logistic regression (log-odds vs. Time; figure not reproduced)
70
1
0
−3
−2
−1
Log odds
2
3
4
A.2 Results: Online drop analysis
0
2000
4000
6000
8000
10000
12000
14000
Time
0
−2
Log odds
2
4
Figure A.10.: Recursively estimated coefficient for “X23” from the dynamic logistic regression
0
2000
4000
6000
8000
10000
12000
14000
Time
“GCP
0
−10
−5
Log odds
5
10
Figure A.11.: Recursively
estimated
coefficient
for
000011000000001100001000” from the dynamic logistic regression
0
2000
4000
6000
8000
10000
12000
14000
Time
Figure A.12.: Recursively
estimated
coefficient
for
000011000000011100011100” from the dynamic logistic regression
“GCP
Figure A.13.: Recursively estimated coefficient for "Cell_id220524" from the dynamic logistic regression (log-odds vs. Time; figure not reproduced)

Figure A.14.: Recursively estimated coefficient for "radiolinkfailurefdd" from the dynamic logistic regression (log-odds vs. Time; figure not reproduced)

Figure A.15.: Recursively estimated coefficient for "PS" from the dynamic logistic regression (log-odds vs. Time; figure not reproduced)
72
A.2 Results: Online drop analysis
0.020
0.010
0.000
Variable Importance
0.030
A.2.2.2. Reduction in entropy from the Dynamic Trees
0
2000
4000
6000
8000
10000
12000
14000
Time
Figure A.16.: Reduction in entropy for “radiolinkfailurefdd” from the dynamic
trees model
73
Chapter A
Figures
A.2.3. Static vs. Dynamic Logistic Regression: covariate effects

A.2.3.1. Dynamic Logistic Regression

Figure A.17.: Recursively estimated coefficient for "Cell_id220517" from the dynamic logistic regression (log-odds vs. Time; figure not reproduced)

Figure A.18.: Recursively estimated coefficient for "Cell_id220518" from the dynamic logistic regression (log-odds vs. Time; figure not reproduced)
74
1
−1
0
Log odds
2
A.2 Results: Online drop analysis
0
2000
4000
6000
8000
10000
12000
14000
Time
2
−2
0
Log odds
4
6
Figure A.19.: Recursively estimated coefficient for “Cell_id220519” from the dynamic logistic regression
0
2000
4000
6000
8000
10000
12000
14000
Time
1
0
−2
−1
Log odds
2
3
Figure A.20.: Recursively estimated coefficient for “Cell_id220521” from the dynamic logistic regression
0
2000
4000
6000
8000
10000
12000
14000
Time
Figure A.21.: Recursively estimated coefficient for “Cell_id220523” from the dynamic logistic regression
75
B. Tables
B.1. Results: Online classification
B.1.1. Dynamic Trees
Alpha  Beta  TPR    TNR    AUC    G-mean
0.99   2     0.820  0.855  0.908  0.845
0.90   2     0.814  0.852  0.906  0.841
0.80   2     0.810  0.850  0.904  0.838
0.70   2     0.815  0.842  0.904  0.839

Table B.1.: Evaluation of tree prior alpha for Dynamic Trees
Beta  Alpha  TPR    TNR    AUC    G-mean
1.75  0.99   0.811  0.854  0.908  0.840
2.00  0.99   0.820  0.855  0.908  0.845
2.25  0.99   0.815  0.855  0.905  0.843
2.50  0.99   0.811  0.848  0.901  0.836

Table B.2.: Evaluation of tree prior beta for Dynamic Trees
w     user.self  sys.self   elapsed
50      342.300     0.150   342.420
100     388.100     0.300   388.330
250     520.190     0.190   520.370
500     768.580     0.110   768.660
1000   1057.950     0.170  1058.130
2000   1796.790     5.360  1802.110
4000   2489.360    27.440  2516.830

Table B.3.: Evaluation of the effect of active pool size (w) on computational time for Dynamic Trees
B.1.2. Dynamic Logistic Regression
Covariate
X10
X12
imsi_factor1
imsi_factor2
imsi_factor3
imsi_factor4
imsi_factor5
imsi_factor6
imsi_factor7
X_uraupdate.orig
last_tGCP000011000000001100001000
last_tGCP000011100000000000000011
last_tGCP000000100000000000000011
last_tGCP000000100000000000000000
last_tGCP000011000000011100011100
last_tGCP000000000000001000011011
last_tGCP000000000000001000011000
last_tGCP000011000001011100011100
last_tGCP000011000000001100011100
cell_id220412
cell_id220511
cell_id220521
cell_id220524
cell_id220526
RSCP_avg.1
X_interrathandoverinfo.orig
t_proc22
radiobearerrelease
locationreport
securitymodereject
Table B.4.: Covariates for which the original forgetting framework failed to
converge
Coefficient                    Estimate  Standard.Error  z.value  p.value
Intercept                         -2.59       0.31        -8.40   < 2e-16
X6                                 0.16       0.23         0.72   0.473546
X10                                0.47       0.35         1.33   0.183486
X12                                1.13       0.23         4.88   1.08e-06
X13                               -0.82       0.19        -4.37   1.23e-05
X14                               -0.03       0.14        -0.22   0.822596
X15                                1.41       0.18         7.95   1.86e-15
X16                               -0.96       0.17        -5.76   8.42e-09
X19                               -1.37       0.48        -2.86   0.004237
X22                                1.09       0.14         7.55   4.39e-14
X23                               -0.51       0.15        -3.43   0.000606
imsi_factor1                      -0.19       0.20        -0.95   0.343919
imsi_factor2                      -0.74       0.21        -3.46   0.000545
imsi_factor3                      -1.19       0.22        -5.36   8.20e-08
imsi_factor4                      -0.73       0.22        -3.30   0.000969
imsi_factor5                      -0.47       0.23        -2.01   0.044828
imsi_factor6                      -0.76       0.22        -3.42   0.000618
imsi_factor7                      -1.09       0.23        -4.81   1.54e-06
imsi_factor8                       0.01       0.23         0.05   0.962124
imsi_factor9                      -0.06       0.23        -0.25   0.801791
imsi_factor10                     -0.84       0.26        -3.26   0.001127
imsi_factor11                      1.52       0.22         6.80   1.04e-11
imsi_factor12                      0.55       0.21         2.60   0.009346
imsi_factor13                      0.52       0.22         2.32   0.020193
imsi_factor14                      0.69       0.22         3.11   0.001844
imsi_factor15                      0.64       0.24         2.67   0.007568
X_cellupdate.orig                 -0.69       0.09        -7.35   1.99e-13
X_uraupdate.orig                  -2.39       0.34        -6.95   3.61e-12
activate                           0.22       0.15         1.43   0.152776
activesetupdatecomplete           -0.37       0.15        -2.44   0.014902
X_physicalchannel.orig             0.85       0.11         7.42   1.13e-13
X_compressed.orig                 -0.02       0.13        -0.19   0.852600
X_dlpower.orig                     0.15       0.10         1.39   0.165634
GCP_000011000000000000000000       1.21       0.23         5.24   1.60e-07
GCP_000011100000000000000000      -1.02       0.28        -3.60   0.000314
GCP_000011000000001100001000       1.57       0.36         4.39   1.11e-05
GCP_000011000000001000011000       0.52       0.26         1.97   0.049023
GCP_000011100000000000000011      -0.86       0.31        -2.80   0.005092
GCP_000000100000000000000011      -2.19       0.37        -5.89   3.86e-09
GCP_000011000000011000011011       0.88       0.30         2.95   0.003212
GCP_000011000000011000011100      -0.64       0.33        -1.94   0.052570
GCP_000000100000000000000000      -1.88       0.37        -5.13   2.92e-07
GCP_000011000000011100011100       2.37       0.29         8.04   9.15e-16
GCP_000000000000001000011011       0.73       0.27         2.77   0.005695
GCP_000000000000001000011000      -0.57       0.30        -1.90   0.058078
GCP_000011000000011000011000       0.60       0.32         1.84   0.065270
GCP_000011000001011100011100       2.18       0.33         6.58   4.64e-11
GCP_000011000000001000011011       0.82       0.33         2.46   0.013758
GCP_000011000000001100011100       1.12       0.37         3.00   0.002713
GCP_000000000000011000011011       0.61       0.60         1.02   0.308164
GCP_000000000000001000000011      -0.08       0.52        -0.16   0.874456
cell_id220412                      0.09       0.22         0.43   0.665636
cell_id220413                      0.21       0.18         1.13   0.257968
cell_id220414                      1.06       0.17         6.23   4.64e-10
cell_id220415                      1.85       0.18        10.55   < 2e-16
cell_id220416                      0.49       0.32         1.55   0.121757
cell_id220511                     -0.03       0.21        -0.12   0.901457
cell_id220512                      0.74       0.16         4.52   6.10e-06
cell_id220513                      0.20       0.24         0.85   0.396193
cell_id220514                      0.25       0.32         0.80   0.425370
cell_id220517                     -0.53       0.23        -2.28   0.022355
cell_id220518                     -0.29       0.21        -1.37   0.171236
cell_id220519                      0.26       0.24         1.07   0.282202
cell_id220521                      1.36       0.16         8.55   < 2e-16
cell_id220523                      1.01       0.18         5.66   1.52e-08
cell_id220524                      1.81       0.16        11.03   < 2e-16
cell_id220526                      1.11       0.19         5.79   6.89e-09
cell_id250511                      2.14       0.37         5.78   7.41e-09
RSCP_avg.1                         1.13       0.58         1.95   0.051235
RSCP_avg.2                        -0.76       0.16        -4.71   2.49e-06
X_interrathandoverinfo.orig       -0.32       0.25        -1.24   0.214177
X_tx.orig                          0.02       0.10         0.21   0.833289
cpichAvg.1                        -0.11       0.13        -0.82   0.414159
cpichAvg.2                         0.03       0.15         0.20   0.845180
t_proc1                           -0.27       0.18        -1.52   0.128285
t_proc14                          -0.22       0.13        -1.75   0.079899
t_proc15                          -0.72       0.33        -2.21   0.026835
t_proc16                          -0.17       0.24        -0.69   0.488020
t_proc18                          -0.29       0.28        -1.04   0.298706
t_proc2                            0.42       0.36         1.16   0.246736
t_proc21                          -0.46       0.25        -1.85   0.064493
t_proc22                           0.42       0.52         0.80   0.424858
t_proc23                          -0.35       0.32        -1.09   0.277684
t_proc29                          -0.08       0.22        -0.34   0.730663
t_proc3                            0.13       0.16         0.77   0.442721
t_proc32                           0.05       0.19         0.28   0.778440
t_proc33                           0.20       0.23         0.84   0.398736
t_proc34                          -0.02       0.47        -0.04   0.967561
t_proc37                          -0.24       0.21        -1.16   0.246237
t_proc10                           0.02       0.09         0.17   0.868475
t_proc11                          -0.19       0.09        -2.08   0.037290
t_proc12                           0.45       0.14         3.16   0.001571
t_proc13                          -0.03       0.09        -0.32   0.752861
t_proc25                           0.34       0.10         3.23   0.001231
t_proc28                           0.35       0.12         2.86   0.004256
t_proc31                          -0.05       0.10        -0.45   0.653183
t_proc4                           -0.29       0.14        -2.04   0.041486
t_proc6                            0.08       0.18         0.45   0.653796
t_proc9                            0.10       0.09         1.11   0.265443
sirAvg.1                          -0.00       0.41        -0.01   0.994379
sirAvg.2                           0.21       0.10         2.18   0.029493
radiolinksetupfailurefdd           6.92       0.44        15.55   < 2e-16
radiolinkadditionrequestfdd       -0.57       0.10        -5.60   2.15e-08
radiolinkfailureindication         2.14       0.09        22.90   < 2e-16
radiobearerrelease                -1.22       0.23        -5.34   9.52e-08
radiobearerreconfiguration        -0.72       0.11        -6.46   1.05e-10
Interact                           0.43       0.16         2.70   0.006918
SRB                               -0.42       0.19        -2.21   0.026755
Other                              2.93       0.27        10.80   < 2e-16
e4a                               -0.24       0.24        -0.98   0.328102
evid_not_measured                  1.12       0.20         5.73   9.83e-09
e1d                                0.92       0.15         6.21   5.32e-10
e2d                                0.00       0.12         0.00   0.998481
e2f                               -1.14       0.14        -7.91   2.56e-15
locationreport                     0.29       0.19         1.56   0.118088
location                          -0.69       0.14        -5.05   4.41e-07
locationreportingcontrol           0.57       0.17         3.38   0.000720
X_rab.orig                         1.59       0.22         7.35   2.05e-13
securitymodereject                 5.74       0.48        11.94   < 2e-16
securitymodecommand               -1.47       0.21        -7.05   1.76e-12

Table B.5.: Coefficient estimates for a Static Logistic Regression model trained on the full data set
Bibliography
Agrawal, R., Imieliński, T., and Swami, A. (1993). Mining association rules between
sets of items in large databases. In ACM SIGMOD Record, volume 22, pages 207–
216. ACM.
Anagnostopoulos, C. and Gramacy, R. B. (2012). Dynamic trees for streaming and
massive data contexts. arXiv preprint arXiv:1201.5568.
Barbieri, M. M. and Berger, J. O. (2004). Optimal predictive model selection. Annals
of Statistics, pages 870–897.
Brauckhoff, D., Dimitropoulos, X., Wagner, A., and Salamatian, K. (2012).
Anomaly extraction in backbone networks using association rules. IEEE/ACM
Transactions on Networking (TON), 20(6):1788–1799.
Breaugh, J. A. (2003). Effect size estimation: Factors to consider and mistakes to
avoid. Journal of Management, 29(1):79–97.
Cheung, B., Kumar, G., and Rao, S. A. (2005). Statistical algorithms in fault detection and prediction: Toward a healthier network. Bell Labs Technical Journal,
9(4):171–185.
Chipman, H. A., George, E. I., and McCulloch, R. E. (1998). Bayesian CART model
search. Journal of the American Statistical Association, 93(443):935–948.
Cox, D. R. (1958). The regression analysis of binary sequences. Journal of the Royal
Statistical Society. Series B (Methodological), pages 215–242.
Ericsson (2014). Ericsson Mobility Report, June 2014. http://www.
ericsson.com/res/docs/2014/ericsson-mobility-report-june-2014.pdf.
Accessed: 2015-05-27.
Gramacy, R. B., Taddy, M., Wild, S. M., et al. (2013). Variable selection and
sensitivity analysis using dynamic trees, with an application to computer code
performance tuning. The Annals of Applied Statistics, 7(1):51–80.
Haddock, C. K., Rindskopf, D., and Shadish, W. R. (1998). Using odds ratios
as effect sizes for meta-analysis of dichotomous data: a primer on methods and
issues. Psychological Methods, 3(3):339.
Hanley, J. A. and McNeil, B. J. (1982). The meaning and use of the area under a
receiver operating characteristic (ROC) curve. Radiology, 143(1):29–36.
He, H. and Garcia, E. A. (2009). Learning from imbalanced data. Knowledge and
Data Engineering, IEEE Transactions on, 21(9):1263–1284.
Hoeting, J. A., Madigan, D., Raftery, A. E., and Volinsky, C. T. (1999). Bayesian
model averaging: a tutorial. Statistical Science, pages 382–401.
Japkowicz, N. et al. (2000). Learning from imbalanced data sets: a comparison
of various strategies. In AAAI workshop on learning from imbalanced data sets,
volume 68, pages 10–15. Menlo Park, CA.
Khanafer, R., Moltsen, L., Dubreil, H., Altman, Z., and Barco, R. (2006). A Bayesian
approach for automated troubleshooting for UMTS networks. In Personal, Indoor
and Mobile Radio Communications, 2006 IEEE 17th International Symposium
on, pages 1–5. IEEE.
Koop, G. and Korobilis, D. (2012). Forecasting inflation using dynamic model averaging. International Economic Review, 53(3):867–886.
Kurgan, L. A. and Cios, K. J. (2004). CAIM discretization algorithm. Knowledge
and Data Engineering, IEEE Transactions on, 16(2):145–153.
Lewis, S. M. and Raftery, A. E. (1997). Estimating Bayes factors via posterior simulation with the Laplace-Metropolis estimator. Journal of the American Statistical
Association, 92(438):648–655.
McCormick, T. H., Raftery, A. E., Madigan, D., and Burd, R. S. (2012). Dynamic
logistic regression and dynamic model averaging for binary classification. Biometrics, 68(1):23–30.
Nguyen, H. M., Cooper, E. W., and Kamei, K. (2011). Online learning from imbalanced data streams. In Soft Computing and Pattern Recognition (SoCPaR), 2011
International Conference of, pages 347–352. IEEE.
Obuchowski, N. A. (2003). Receiver operating characteristic curves and their use in
radiology. Radiology, 229(1):3–8.
Onorante, L. and Raftery, A. E. (2014). Dynamic model averaging in large model
spaces.
Penny, W. D. and Roberts, S. J. (1999). Dynamic logistic regression. In Neural
Networks, 1999. IJCNN’99. International Joint Conference on, volume 3, pages
1562–1567. IEEE.
Powers, D. M. (2011). Evaluation: from precision, recall and f-measure to roc,
informedness, markedness and correlation.
Raftery, A. E., Kárnỳ, M., and Ettler, P. (2010). Online prediction under model
uncertainty via dynamic model averaging: Application to a cold rolling mill. Technometrics, 52(1):52–66.
Rao, S. (2006). Operational fault detection in cellular wireless base-stations. Network
and Service Management, IEEE Transactions on, 3(2):1–11.
Smith, J. (1992). A comparison of the characteristics of some Bayesian forecasting models. International Statistical Review/Revue Internationale de Statistique,
pages 75–87.
Taddy, M. A., Gramacy, R. B., and Polson, N. G. (2011). Dynamic trees for learning
and design. Journal of the American Statistical Association, 106(493).
Theera-Ampornpunt, N., Bagchi, S., Joshi, K. R., and Panta, R. K. (2013). Using
big data for more dependability: a cellular network tale. In Proceedings of the 9th
Workshop on Hot Topics in Dependable Systems, page 2. ACM.
Wang, S., Minku, L. L., and Yao, X. (2013). A learning framework for online
class imbalance learning. In Computational Intelligence and Ensemble Learning
(CIEL), 2013 IEEE Symposium on, pages 36–45. IEEE.
Watanabe, Y., Matsunaga, Y., Kobayashi, K., Tonouchi, T., Igakura, T., Nakadai,
S., and Kamachi, K. (2008). UTRAN O&M support system with statistical fault
identification and customizable rule sets. In Network Operations and Management
Symposium, 2008. NOMS 2008. IEEE, pages 560–573. IEEE.
Zhou, S., Yang, J., Xu, D., Li, G., Jin, Y., Ge, Z., Kosseifi, M. B., Doverspike, R.,
Chen, Y., and Ying, L. (2013). Proactive call drop avoidance in UMTS networks.
In INFOCOM, 2013 Proceedings IEEE, pages 425–429. IEEE.
LIU-IDA/STAT-A–15/007–SE