Master Thesis in Statistics and Data Mining

Dynamic Call Drop Analysis

Martin Arvidsson

Division of Statistics
Department of Computer and Information Science
Linköping University

Supervisor: Patrik Waldmann
Examiner: Mattias Villani

"It is of the highest importance in the art of detection to be able to recognize, out of a number of facts, which are incidental and which vital. Otherwise your energy and attention must be dissipated instead of being concentrated." (Sherlock Holmes - Arthur Conan Doyle)

Contents

Abstract
Acknowledgments
1. Introduction
   1.1. Background
   1.2. Objective
   1.3. Definitions
2. Data
   2.1. Data sources
   2.2. Raw data
        2.2.1. Data variables
3. Methods
   3.1. Text Mining and Variable creation
   3.2. Sampling strategies
   3.3. Evaluation techniques
   3.4. Online drop analysis through classification
        3.4.1. Dynamic Logistic Regression
        3.4.2. Dynamic Model Averaging
        3.4.3. Dynamic Trees
   3.5. Drop description
        3.5.1. Association Rule Mining
   3.6. Technical aspects
4. Results
   4.1. Exploratory analysis
   4.2. Online classification
        4.2.1. Sampling strategies
        4.2.2. Dynamic Trees
        4.2.3. Dynamic Logistic Regression
        4.2.4. Summary of results
        4.2.5. Static Logistic Regression vs. Dynamic Logistic Regression
   4.3. Online drop analysis
        4.3.1. DMA posterior inclusion probabilities
        4.3.2. Evolution of odds-ratios and reduction in entropy
        4.3.3. Static Logistic Regression vs. Dynamic Logistic Regression
5. Discussion
6. Conclusions
A. Figures
   A.1. Results: Online classification
   A.2. Results: Online drop analysis
        A.2.1. Single dynamic logistic regression vs. Univariate DMA
        A.2.2. Significant covariates in interesting period
        A.2.3. Static vs. Dynamic Logistic Regression: covariate effects
B. Tables
   B.1. Results: Online classification
        B.1.1. Dynamic Trees
        B.1.2. Dynamic Logistic Regression
Bibliography

Abstract

This thesis sets out to analyze the complex and dynamic relationship between mobile phone call connections that terminate unexpectedly (dropped calls) and those that terminate naturally (normal calls). The main objective is to identify temporally discriminative features, so as to assist domain experts in their troubleshooting of mobile networks. For this purpose, dynamic extensions of logistic regression and partition trees are considered.
The data consist of information recorded in real-time from mobile phone call connections, and each call is labeled by its category of termination. Characterizing features of the data that pose considerable challenges are: (i) class imbalance, (ii) high dimensionality, (iii) non-stationarity, and (iv) sequential arrival in a stream. To address the issue of class imbalance, two sampling techniques are considered. Specifically, an online adaptation of the random undersampling technique is implemented, as well as an extension (proposed in this thesis) that accounts for the possibility of a changing degree of imbalance. The results suggest that the former is preferable for this data, but that both improve the identification of the minority class (dropped calls). Another characterizing feature of this dataset is that several of the covariates are temporally sparse. This is shown to cause problems in the recursive estimation step of the dynamic logistic regression model. Consequently, this thesis presents an extension that accounts for temporal sparsity, and it is shown that this extension allows for the inclusion of temporally sparse attributes and improves the predictive capability. A thorough evaluation of the considered models is performed, and it is found that the best model is the single dynamic logistic regression, achieving an area under the curve (AUC) of 99.96%. Based on odds ratios, posterior inclusion probabilities, and posterior model probabilities from the dynamic logistic regression, and reduction in entropy from the dynamic trees, analysis of temporally discriminative features is performed. Specifically, two sub-periods of abnormally high call drop rate are analyzed in closer detail, and several interesting findings are made, demonstrating the potential of the proposed approach.

Acknowledgments

Several people deserve and have my deepest appreciation for their aid and support in making this thesis possible.
First, I would like to thank Ericsson for giving me the opportunity to work with them, as well as for providing the data for this thesis. Special thanks to my co-supervisors Paolo Elena and Henrik Schüller for, on the one hand, defining a really interesting problem, and on the other, providing good support. Thanks also to Leif Jonsson, who oversaw the thesis projects and provided valuable input. Another person who cannot be left out is domain expert Håkan Bäcks, who provided very useful insights about the data and the functionality of the network. I would also like to thank my supervisor at Linköping University, Patrik Waldmann, who provided good advice and participated in many fruitful discussions. Finally, I would also like to thank my opponent, Andreea Bocancea, for her improvement suggestions. These undoubtedly strengthened the subsequent versions of the thesis.

1. Introduction

1.1. Background

Besides selling hardware and software, network equipment providers (NEPs) also provide support to mobile network operators (MNOs). One imperative support-related task is that of troubleshooting, which consists of detecting problems in the network and understanding their causes. This task poses considerable challenges, not just because of the complexity of the systems, but also because of the enormous quantities of information that are collected from the networks every day. In this thesis, troubleshooting will be considered from a statistics and data analysis point of view. More specifically, this thesis sets out to analyze the complex and dynamic relationship between dropped calls and normal calls, where a dropped call may be defined as a call that ends without the intention of either participant. While a certain number of dropped calls are expected, inevitable, and not interesting, there are also dropped calls of the sort that are unexpected, and may be caused by system malfunctions.
Hence, from the perspective of the NEPs, it is of great interest to quickly identify and understand the causes of dropped calls, so that any underlying problems can be correctly addressed. In periods of abnormally high call drop rates (percentage of calls that are dropped), the identification of drop causes is especially important. System degradation can have a wide range of different causes and explanations, making the problem quite complex. Two examples of high-level causes are: (i) system updates in the network, and (ii) new phones or software updates in already existing phones. In this thesis, statistical and machine learning methods are applied to identify low-level indicators of dropped calls, which can later be interpreted by domain experts to put potential problems into context. The issue of detecting problems in mobile networks has been considered with a range of approaches in the literature, in particular within the subdisciplines of anomaly detection, fault detection, and fault diagnosis. A substantial amount of research has been done in these areas, and there are quite a few papers that consider these problems within the context of mobile networks, for example (Brauckhoff et al., 2012; Watanabe et al., 2008; Cheung et al., 2005; Rao, 2006). The bulk of these papers are concerned with identifying problems at the level of defined geographical regions, and the data describe the characteristics of particular regions (cells/radio base stations or radio network controllers), not individual calls, as is the case in this thesis. A common approach is to work within the unsupervised framework, where the detection of a fault or anomaly often results from a setup whereby one tracks and/or models a selected number of features, so as to gain an idea of the normal behavior; then, when large deviations from this normal behavior are observed, through, for instance, threshold violations, as in Cheung et al.
(2005) and Rao (2006), an anomaly or fault is flagged. Various techniques have been explored to extract and describe anomalies and faults: one approach is to apply association rule mining, as in Brauckhoff et al. (2012). There are relatively few papers that, within the context of mobile networks, consider the problem of fault detection or fault diagnosis in a supervised setting. In one of the exceptions, Khanafer et al. (2006), a Naive Bayes classifier is considered for predicting a set of labeled faults. Zhou et al. (2013) and Theera-Ampornpunt et al. (2013) also work within the supervised framework, with similar data (equal response and similar input) to that of this thesis, but with a slightly different objective: to perform early classification, such that proactive management can be implemented to deter certain types of calls from dropping. The classification methods considered in these two papers are Adaboost and Support Vector Machines. A limitation of the aforementioned approaches is that they, to a varying degree, implicitly assume a stationary and static environment, and mobile networks are in general not static systems: as previously mentioned, internal and external modifications and updates occur irregularly. This motivates a dynamic approach, rather than a static one. An additional limitation of the aforementioned approaches is that they also, to some extent, assume that the data can be stored. In the context of processing data from mobile networks, however, this assumption is problematic, since the volume of the data that is processed every day is astronomical: in 2014, Ericsson, the company at which this work was carried out, had 6 billion mobile subscribers, with a global monthly traffic of ∼ 2400 Petabytes (Ericsson, 2014).
While it may not be feasible to thoroughly analyze the whole data, it does appear intuitively appealing to be able to analyze more data for the same cost, and thus, approaches with such characteristics ought to be preferable. A research discipline that has gained a lot of attention recently, and which deals with limitations of the sort described above, is online learning. In this thesis, a framework centered around online learning is proposed for the problem of predicting dropped calls and explaining their causes. In addition to being non-stationary, the data is also greatly imbalanced with respect to the response variable. To address the challenges that come with imbalanced data, sampling techniques are explored. In particular, an adaptive undersampling scheme is developed, where less data is sampled during periods with few dropped calls, and more data is sampled during periods with an increased number of dropped calls. Another challenge is that several attributes in the data are temporally sparse. This presents a limitation for one of the selected methods. Subsequently, in this thesis, an extension of the forgetting factor framework originally proposed by McCormick et al. (2012) is developed and evaluated.

1.2. Objective

The aim of this master thesis is to develop a framework that can identify temporally discriminative features for explaining dropped calls. A key challenge is that the underlying distribution of the data is non-stationary and changes are expected to occur irregularly and unpredictably over time. Subsequently, this thesis sets out to tackle this problem by using an online learning approach, wherein dynamic extensions of logistic regression and partition trees are explored. Another (not completely orthogonal) aim of this thesis is to predict dropped calls with high precision.
This latter objective is motivated by the fact that only information recorded up to a certain time before call termination is used, and as such, it may be thought of as a first step in exploring the possibilities of early classification for this type of data. Finally, to evaluate the decision to use the dynamic approach, a set of scenarios is simulated in which the best dynamic classifier is compared to its static equivalent, in terms of both predictive performance and exploratory insights.

1.3. Definitions

The following definitions are needed to fully understand the context of the problem.

Troubleshooting
Troubleshooting is an approach to problem solving. Specifically, it is the systematic search for the source of a problem, such that it can be solved.

User Equipment (UE)
User equipment (UE) consists of phones, computers, tablets, and other devices that connect to the network.

Network Equipment Provider (NEP)
Companies that sell products and services to communication service providers, such as mobile network operators, are referred to as network equipment providers (NEPs).

Mobile Network Operator (MNO)
Companies that provide wireless communication services and either own or control the necessary elements to sell and deliver services to end users are referred to as mobile network operators (MNOs). Examples of such companies are Telia, Tele2, and Telenor.

Normal calls
Normal calls refer to connections between user equipment (UE) and the network where the connection terminates as expected.

Dropped calls
Dropped calls refer to connections between user equipment (UE) and the network that terminate unexpectedly.

UMTS Network
A Universal Mobile Telecommunications System (UMTS), also referred to as 3G, is a third-generation mobile cellular system for telecommunication networks. The system supports standard voice calls, mobile internet access, as well as simultaneous use of both voice calls and internet access.
Although 4G has been introduced, 3G remains the most widely used standard for mobile networks.

Radio Network Controller (RNC)
The Radio Network Controller (RNC) is the governing element in the UMTS network and is responsible for controlling the radio base stations that are connected to it.

Radio Base Station (RBS)
Radio base stations (RBS) constitute the elements of a network that provide the connection between UE and the RNC.

2. Data

2.1. Data sources

The data were supplied by Ericsson AB and consist of machine-produced trace logs. These so-called trace logs were originally collected from a lab environment at the Ericsson offices in Kista. As such, the information contained in the data does not reflect the behavior of any real people, but rather programmed systems. However, these systems are programmed such that they should reflect human behavior: a simulated call may for instance consist of texting, browsing the internet, physical movements, and others. Moreover, even though it is a lab environment, the implemented system technology is equivalent to that which is used in most live networks: the so-called Universal Mobile Telecommunications System (UMTS), also known as 3G. Introduced in 2001, 3G is the third generation of mobile systems for telecommunication networks, and supports standard voice calls, mobile internet access, as well as simultaneous use of both voice calls and internet access. Although 4G has been introduced, 3G still remains the most widely used system for mobile networks. The 3G network is structured hierarchically and by geographical region. More specifically, the network consists of three primary, interacting, elements: the user equipment (UE), radio base stations (RBS), and radio network controllers (RNC). At the bottom of the hierarchy are the cells, which define the smallest geographical regions in the network. RBSs are deployed such that they may be responsible for multiple cells, and as described in sec.
1.3, an RBS acts similarly to a router: it provides the connection between the UE and the RNC. The RNC is the governing element of the 3G network and is responsible for managing the division of resources at lower levels, for example which RBS a particular UE should use.

2.2. Raw data

For every call that is initiated, a trace log is produced. The contents of these logs are recorded in real-time and contain information that corresponds to signals sent between the user equipment (UE), radio base stations (RBS), and the radio network controller (RNC). These signals may contain connection details, configuration information, measurement reports, failure indications, and others. This information, originally formatted as text, was first transformed into a suitable format (as described in sec. 3.1), and later used as the input to the statistical models evaluated in this thesis. Finally, for each call, there is a recorded outcome, {normal, dropped}, which defines the response variable. More details about specific variables follow in sec. 2.2.1. The period for which the data were collected is January 26, 2015 - April 10, 2015, corresponding to approximately two and a half months' worth of data. During this period, a total of 7,200 dropped calls were recorded. The total number of normal calls in the same period was much greater: 670,000. That is, approximately 99% of the calls terminated as expected (normal), and only 1% terminated unexpectedly (dropped). Datasets with this characteristic are often referred to as imbalanced in the machine learning and statistics literature. For classifiers that seek to separate two or more classes, imbalance can be problematic. In sec. 3.2, techniques for addressing challenges accompanying imbalanced datasets are described. In Figure 2.1, a time-series plot is presented, displaying the number of dropped calls over the period of interest.
Note that the time-scale of the plot is not in minutes, hours or days: instead the data were divided into 100 equally large subsets, and the sum was then calculated within each subset. The rationale for presenting the data like this, rather than in relation to actual time, is twofold: (i) this lab data does not have any periodic dependencies, and (ii) an unequal number of calls were traced during different periods, and during some days, no calls were recorded.

Figure 2.1.: Number of dropped calls as divided into 100 (ordered) equally large subsets.

As one may observe in Figure 2.1, the number of drops is approximately constant for most of the period, with no apparent trend. There are however multiple time periods in which the number of drops increases quite drastically. Intuitively, these periods represent some form of degradation in the system with systematic errors. One of the goals of this thesis is to identify which factors were important during such periods. In an online implementation, such a framework could potentially be used to detect causes of problems early on.

2.2.1. Data variables

From the original trace logs, a total of 188 attributes were initially extracted. Exploratory analysis revealed that quite a large proportion of them were redundant, which resulted in a final input-space of 122 attributes. To reduce the degree of distortion from events occurring a long time prior to the termination of the calls, it was decided (together with domain experts at Ericsson) that only the last 20 seconds of each call should be kept for analysis. Note that the main contributing factor for including a particular variable was not the known significance of the variable, but rather its intrinsic and potential relevance in terms of future events (such that observing changes in its degree of relevance is useful for troubleshooting the network).
In this section, a brief summary and explanation of each category of variables is presented.

2.2.1.1. Cell Ids

As described in the previous section, cells are defined geographical areas of the network, and hence, in a model context, these variables contain information about the location of the call events. From the considered period, 17 cell ids were recorded, resulting in 17 binary dummy variables. In a real setting with live networks, the number of cells would increase. In such a situation, clustering methods could potentially be used to merge cells that are (i) close to each other geographically, and (ii) similar by some relevant metric, so as to reduce the dimension of variables to include and evaluate in the model.

2.2.1.2. tGCP

GCP, short for Generic Connection Properties, describes the range of possible connection properties that a call may possess. tGCP, or target-GCP, are the connection properties that are targeted or requested by a particular device at a particular timepoint. A maximum of 31 connection properties can be possessed for a particular call. The presence or absence of a particular connection property is registered as 1 and 0 respectively. In this work, the last set of registered connection properties for each call is used as input to the model, so as to capture the connection properties that were requested at the time of the drop. The 31 connection properties are treated as binary dummy-variables.

2.2.1.3. Trace 4 Procedures

Trace, in the context of this dataset, refers to the process of monitoring the execution of RNC functions that are relevant to a particular call. Traces are grouped such that similar events (execution of RNC functions) are traced by the same trace group. For the considered STP, over the relevant time-period, three trace groups were observed: trace1, trace3, and trace4, the latter being (by far) the most frequent one.
Trace4 describes events such as importation and deportation of processes and program procedures. More specifically, trace4 can be divided into 37 different events, referred to as procedures. For example, procedure 10 describes the importation or deportation of a "soft handover" event. In this thesis, these procedures are treated as binary dummy-variables.

2.2.1.4. UeRcid

UeRcid, short for UE Radio Connection Id, defines, as the name suggests, the type of radio connection that a particular UE has activated. For the considered data, approximately 150 different such ids exist. In this work, we group these by their inherent properties. Specifically, we differentiate between PS (Packet Switched), CS (Circuit Switched), SRB (Signaling Radio Bearer) and Mixed (a combination of the aforementioned). PS is the connection concerned with data traffic, whilst CS is the one concerned with conversation/speech. SRB is the result of the initial connection establishment, as well as the release of the connection. The presence or absence of a particular radio connection is registered as 1 and 0 respectively.

2.2.1.5. evID

EvID, short for Event Id, is found in the measurement reports of the trace logs, and consists of reports related to radio quality, signal strength and others. As such, a specific evID defines a specific type of such a report or event. Consider for instance "evID=e2d", which defines "Quality of the currently used frequency is below a certain threshold". In this work, these evIDs are treated as binary dummy-variables.

3. Methods

In this chapter, the framework and subsequent methods used in this thesis are explained. The framework is divided into four parts: the first step, text mining and variable creation, is the step in which the data is transformed from machine-generated text to structured matrices apt for statistical methods. The second step, sampling, addresses the challenges of imbalanced data.
The third step of the framework is the main part and consists of dynamic classification of streaming data. The fourth and final part of the framework seeks to derive intuitive descriptions of the results obtained from step 3, through the application of association rules.

3.1. Text Mining and Variable creation

As previously mentioned, the original format of the data was raw text, which cannot be used directly as input to statistical methods. To address this, techniques commonly associated with the area of text mining were applied. More specifically, text variables were created and defined as binary dummy-variables. For instance, if "configuration request" appears in a particular call, then the value of that variable is 1. Initially, the count of specific words was also considered, but it was found that it did not add any discriminative value, and it was consequently dismissed. Some numerical measurements were also found in the logs; these do however (i) not occur in all of the logs and (ii) are not missing at random: some measurements are only triggered under certain circumstances. To cope with this type of missing data, discretization techniques were applied such that categorical variables could be derived from the original numerical ones (including a category 'missing'). Specifically, the CAIM discretization algorithm, proposed by Kurgan and Cios (2004), was used. For a continuous-valued attribute, the CAIM algorithm seeks to divide its range of values into a minimal number of discrete intervals, whilst at the same time minimizing the loss of class-attribute interdependency. It is out of the scope of this thesis to cover the details of this algorithm, and hence we refer to Kurgan and Cios (2004) for more details.

3.2. Sampling strategies

Sampling, in the most general sense of the word, is concerned with selecting a subset of observations from a particular population.
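Before turning to sampling, the variable-creation step of sec. 3.1 can be made concrete with a minimal sketch. This is an illustration only, not the thesis's actual pipeline: the vocabulary and log line below are hypothetical (the real input-space has 122 attributes), and the CAIM discretization of numerical measurements is omitted.

```python
def extract_binary_features(log_text, vocabulary):
    """Turn one raw trace log into binary dummy-variables:
    1 if the phrase occurs anywhere in the log, 0 otherwise."""
    text = log_text.lower()
    return {term: int(term in text) for term in vocabulary}

# Hypothetical vocabulary; the actual attributes in the thesis differ.
VOCAB = ["configuration request", "measurement report", "failure indication"]

features = extract_binary_features(
    "12:03:41 RNC received CONFIGURATION REQUEST from UE", VOCAB)
# features["configuration request"] == 1; the other two entries are 0
```

Applying this extraction to every trace log yields the structured binary matrix that the statistical models in this chapter take as input.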
In the context of classification, sampling techniques are popular for dealing with the issue of class imbalance. An imbalanced dataset is defined as one in which the distribution of the response variable is skewed towards one of the classes (He and Garcia, 2009). The motivation for considering sampling techniques in this thesis is three-fold: (i) due to limitations in memory and computational power, and the overwhelming size of the unformatted source (txt) files, only a limited number of logs could feasibly be extracted from these source files, (ii) for imbalanced datasets, classifiers tend to learn the response classes unequally well, where the minority class often is ignored, such that the separation capability becomes poor (Wang et al., 2013), and (iii) sampling techniques have been shown to be effective for addressing class imbalance in other works (He and Garcia, 2009). Sampling is a well-researched subject, and a wide range of techniques have been proposed over the years. The great bulk of these techniques are however limited to environments where the data is assumed to be fixed and static. One example is the random undersampling technique, which has a simple and intuitive appeal: observations from the majority class are selected at random and removed, until the ratio between the response classes has reached a satisfactory level. Japkowicz et al. (2000) evaluated this simpler technique and compared it to more sophisticated ones, and concluded that random undersampling held up well. The issue of online class imbalance learning has so far attracted relatively little attention (Wang et al., 2013). Most of the proposed methods for addressing non-static environments assume that the data arrives in batches (Nguyen et al., 2011), and are thus not directly applicable to online learning. One of the first papers to address the issue of imbalanced data in an online learning context was Nguyen et al. (2011).
In it, a technique, here referred to as ORUS, was proposed that allows the analyst to choose a fixed rate at which undersampling should occur: observations from the minority class are always accepted for inclusion, whilst observations from the majority class are included only with a fixed probability. In other words, random undersampling in an online context. This simple implementation is described more formally in equation (3.1), where q is the parameter determining the fixed sampling rate. Nguyen et al. (2011) show that this approach is able to provide good results for an online implementation of the naive Bayes classifier:

ORUS:  p(inclusion of x_t) = 1 if y_t = 1,  and q if y_t = 0    (3.1)

This technique does however not account for the possibility of changing levels of imbalance over time; it assumes a fixed rate, to be known a priori. In Wang et al. (2013), an extension was proposed in which the degree of imbalance is continuously estimated, using a decay factor, such that the inclusion probability is allowed to change over time. In this thesis, a simple adaptive sampling scheme, sharing traits with both Wang et al. (2013) and Nguyen et al. (2011), is developed. Specifically, a sliding window is used to estimate the local imbalance at different time points, such that the undersampling rate (inclusion probability) of the majority class is allowed to change over time. If the proportion of dropped calls during a particular period is relatively high, the inclusion probability for normal calls is increased. If, on the other hand, the proportion of dropped calls is relatively low, the inclusion probability for normal calls is decreased. More formally, as in Nguyen et al. (2011), we let the analyst select a constant (q): it should be the baseline expectation of the class imbalance prior to observing any data. In the case of mobile networks, the call-drop rate is well understood, such that this "pseudo prior" can be set with confidence.
The idea is then to use this baseline expectation to construct the sliding window: w = 1/q. This sliding window moves incrementally, one observation at a time, and estimates the local imbalance-rate at every time point from the number of minority observations found in that particular time-window. This is described mathematically in equation (3.2), where q is the constant describing the baseline expectation:

O-ARUS:  p(inclusion of x_t) =
    1                                 if y_t = 1
    (1/w) * sum_{i=t-w}^{t-1} y_i     if y_t = 0 and sum_{i=t-w}^{t-1} y_i > 1
    q                                 if y_t = 0 and sum_{i=t-w}^{t-1} y_i <= 1    (3.2)

For instance, let us consider a scenario in which the analyst has set a baseline expectation of 1%: the sliding window would then become w = 1/0.01 = 100. Consider further that we stand at time point t, and in the past 100 observations, 3 observations from the minority class have been encountered. The inclusion probability for a majority class observation would then, at time point t, be equal to 3/100 = 3%.

3.3. Evaluation techniques

The question of how one should evaluate a classifier depends on the data, and the objective of the classification. What is of particular interest in this thesis is to identify and discriminate positive occurrences from negative ones, i.e. identifying and separating 'dropped calls' from 'normal calls', largely because the main objective of this thesis is to explore which factors contribute towards the classification of 'dropped calls'. The most commonly used metric for evaluating classifiers is the Accuracy measure:

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (3.3)

where TP = True Positives, TN = True Negatives, FP = False Positives, and FN = False Negatives. It simply describes the total number of correct predictions as a ratio of the total number of predictions. In cases where the number of positive and negative instances differ greatly (imbalanced data), the accuracy measure can be misleading.
For instance, with an imbalance ratio of 99:1, it would be possible to achieve an accuracy of 99% simply by classifying all observations as negative instances. To avoid such pitfalls, a variety of evaluation metrics have been proposed: one being AUC, which represents the area under the ROC curve (Hanley and McNeil, 1982). The ROC curve displays the relationship between the true positive rate, TPR (Sensitivity), and the false positive rate, FPR (1 − Specificity). More specifically, the ROC curve is constructed by considering a range of operating points, or decision thresholds, and for each such threshold, the true positive rate and false positive rate are calculated. The intersection of these two scores, at each threshold, produces a dot in a two-dimensional display. Between the plotted dots, a line is drawn: this constitutes the ROC curve (Obuchowski, 2003).

Sensitivity = TPR = \frac{TP}{TP + FN} \qquad (3.4)

1 - Specificity = FPR = \frac{FP}{FP + TN} = 1 - \frac{TN}{TN + FP} \qquad (3.5)

AUC can be interpreted as the probability that a randomly selected observation from the positive class is ranked higher than a randomly selected observation from the negative class - in terms of belonging to the positive class. It should be emphasized that, in the context of online learning, and hence for the methods considered in this thesis, there is no training or test dataset: as the models are constructed and updated sequentially, we instead evaluate the one-step-ahead predictions of the models. The first papers to address the issue of imbalanced online learning, Nguyen et al. (2011) and Wang et al. (2013), proposed the use of the G-mean as an evaluation metric. G-mean is short for geometric mean and is constructed as follows (Powers, 2011):

G\text{-}mean = \sqrt{precision \times recall} \qquad (3.6)

where

precision = \frac{TP}{TP + FP} \qquad recall = \frac{TP}{TP + FN} \qquad (3.7)

AUC and G-mean will constitute the main measurements upon which comparisons and evaluations are founded in this thesis.
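Equations (3.3)-(3.7) translate directly into count arithmetic; the following is a minimal sketch (the function name is our own):

```python
from math import sqrt

def confusion_metrics(tp, tn, fp, fn):
    """Compute the metrics of equations (3.3)-(3.7) from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)   # eq. (3.3)
    tpr = tp / (tp + fn)                         # sensitivity / recall, eq. (3.4)
    fpr = fp / (fp + tn)                         # 1 - specificity, eq. (3.5)
    precision = tp / (tp + fp)
    g_mean = sqrt(precision * tpr)               # eq. (3.6)
    return {"accuracy": accuracy, "tpr": tpr, "fpr": fpr,
            "precision": precision, "g_mean": g_mean}
```

In the online setting, the counts would be accumulated from the models' one-step-ahead predictions rather than from a held-out test set.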
3.4. Online drop analysis through classification

Classification, as a statistical framework, defines the process of modeling the relationship between a set of input variables, X, and an outcome variable, y, where the outcome variable is discrete. As the main objective of this thesis is to study the relationship between ’dropped calls’ and ’normal calls’, it is naturally framed as a classification problem - with the response: {Dropped call, Normal call}. An extensive number of classification techniques have been proposed over the years, and what amounts to the “best one” is often data and task specific. In the case of this thesis, there are four fundamental criteria that a classifier must meet: (i) it must be transparent, in the sense that insight into which variables contribute to a certain outcome is required, (ii) it must be able to cope with a high-dimensional input, as there is a great deal of interesting information recorded for each call, (iii) it must be able to handle the sequential nature of the data, i.e. that data arrives continuously in a stream, and finally (iv) it should be adaptive and able to capture local behaviors, since - as explained before - the causes of drops are expected to change over time. These criteria drastically reduce the space of apt classifiers: popular techniques such as Support Vector Machines and Artificial Neural Networks are good alternatives for dealing with complex high-dimensional input (and may be extended to deal with streaming data), but they fail on the important issue of transparency with regard to variable importance. The sequential and adaptive aspects described above are naturally addressed in the field of online learning, which assumes that data is continuously arriving and may not be stationary. Hence, an ideal intersection would be an online learning classifier that is transparent and can handle higher dimensions.
Two such techniques were identified: Dynamic Logistic Regression and Dynamic Trees. The static versions of these two, logistic regression and partition trees, are known for their transparency with regard to variable contribution, and hence the dynamic extensions are appealing for this work.

3.4.1. Dynamic Logistic Regression

This technique, originally proposed by Penny and Roberts (1999), extends the standard logistic regression by considering an additional dimension: time. Through a Bayesian sequential framework, the parameter estimates are recursively estimated, and hence allowed to change over time. The particular version of the dynamic logistic regression that is applied in this work follows McCormick et al. (2012), and it is described below. But first, let us consider what in this thesis is referred to as the static logistic regression.

3.4.1.1. Static Logistic Regression

The static logistic regression, or just logistic regression, is a technique for predicting discrete outcomes. It was originally developed by Cox (1958), and it remains one of the most popular classification techniques. Logistic regression has several attractive characteristics, in particular its relative transparency, and the way in which one is able to evaluate the contribution of the covariates to the predictions. Logistic regression is a special case of generalized linear models, and may be seen as an extension of the linear regression model. Since the dependent variable is discrete, or more specifically Bernoulli distributed, it is not possible to model the linear relationship between the response and the predictors directly, and hence a transformation is needed:

y \sim Bernoulli(p)

In the case of the logistic regression, a logit link is used for the purpose of transformation. Consider the logistic function in equation (3.8):

F(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \ldots)}} \qquad (3.8)

where the exponent is a linear combination of the independent variables.
The logit link is derived through the inverse of the logistic function, as in equation (3.9):

logit(p) = g(F(x)) = \ln \frac{F(x)}{1 - F(x)} = \beta_0 + \beta_1 x_1 + \ldots = x^T \theta \qquad (3.9)

3.4.1.2. State-space representation

Given the objective of exploring the temporal significance of independent variables, a natural extension of the static logistic regression model is to add a time dimension. As in McCormick et al. (2012), we do so by defining the logistic regression through the Bayesian paradigm, and by applying the concept of recursive estimation: this allows sequential modeling of the data, and what in the literature commonly is referred to as online learning. Equation (3.9) is hence updated to:

logit(p_t) = x_t^T \theta_t \qquad (3.10)

Notice the added subscript t. The recursive estimation is computed in two steps: the prediction step and the updating step.

Prediction step: At a given point in time, (t), the posterior mode of the previous time step (t − 1) is used to form the prior for time (t). The parameter estimates at time (t) are hence based on the data observed up until time (t − 1). Using these estimates, a prediction of the outcome at time (t) is made. More formally, we let the regression parameters θ_t evolve according to the state equation θ_t = θ_{t−1} + δ_t, where δ_t ∼ N(0, W_t) is a state innovation. That is, the parameter estimates at time (t) are based on the parameter estimates at time (t − 1) plus a delta term. Inference is then performed recursively using Kalman filter updating. Suppose that, for the set of past outcomes Y^{t−1} = {y_1, ..., y_{t−1}}:

\theta_{t-1} | Y^{t-1} \sim N(\hat{\theta}_{t-1}, \hat{\Sigma}_{t-1})

The prediction equation is then formed as:

\theta_t | Y^{t-1} \sim N(\hat{\theta}_{t-1}, R_t) \qquad (3.11)

where

R_t = \hat{\Sigma}_{t-1} / \lambda_t \qquad (3.12)

λ_t is a forgetting factor, and is typically set slightly below 1. The forgetting factor acts as a scaling factor on the covariance matrix from the previous time point, calibrating the influence of past observations.
The concept of using forgetting factors for this particular purpose is quite common in the area of dynamic modeling, and a range of forgetting strategies has been proposed; for a review, see Smith (1992). In this work, we apply the adaptive forgetting scheme proposed by McCormick et al. (2012), which allows the amount of change in the model parameters to vary over time - an attractive feature, considering the complex dynamics of mobile network systems. More about the specifics of the forgetting factor later in this section.

Updating step: The prediction equation in (3.11) is, together with the observation arriving at time (t), used to construct the updated estimates. More specifically, having observed y_t, the posterior distribution of the updated estimate θ_t is:

p(\theta_t | Y^t) \propto p(y_t | \theta_t) p(\theta_t | Y^{t-1}) \qquad (3.13)

where p(y_t | θ_t) is the likelihood at time (t), and the second term is the prediction equation (which now acts as a prior). Since the Gaussian distribution is not the conjugate prior of the likelihood function in logistic regression, the posterior is non-standard, and there is no closed-form solution to equation (3.13). Consequently, McCormick et al. (2012) approximate the right-hand side of equation (3.13) with a normal distribution, as is common practice. More formally, θ̂_{t−1} is used as a starting value, and the mean of the approximating normal distribution at time point (t) is:

\hat{\theta}_t = \hat{\theta}_{t-1} - D^2 l(\hat{\theta}_{t-1})^{-1} D l(\hat{\theta}_{t-1}) \qquad (3.14)

where the second and third terms on the right-hand side involve the second and first derivatives of l(θ) = log p(y_t | θ)p(θ | Y^{t−1}), i.e. the logarithm of the likelihood times the prior. The variance of the approximating normal distribution, which is used to update the state variance, is estimated by:

\hat{\Sigma}_t = \{-D^2 l(\hat{\theta}_{t-1})\}^{-1} \qquad (3.15)

In McCormick et al.
(2012), a static (frequentist) logistic regression is used in a training period to obtain reasonable starting points for the coefficient estimates. Since the data used in this thesis is sparse with regard to several of the input variables, this approach cannot be straightforwardly implemented: for some of the covariates, none or very few occurrences are recorded during the first part of the data. Consequently, we here apply a pseudo-Bayesian framework, introducing two pseudo priors (mean, variance), θ_0 and σ_0^2, for every coefficient. If no observations are observed during the training period, these priors are simply not updated.

The forgetting factor, λ

In Raftery et al. (2010), the predecessor to McCormick et al. (2012), a forgetting scheme where λ is a fixed constant was introduced; more specifically, they set λ = 0.99. It is noted that this constant ought to be determined based on the belief in the stability of the system. If the process is believed to be more volatile and nonstationary, a smaller λ is preferable, since the posterior update at each time point then weighs the likelihood - relative to the prior - higher, and hence the parameter estimates are more locally fitted and updated more rapidly. More formally, this forgetting specification implies that an observation encountered j time points in the past is weighted by λ^j (Koop and Korobilis, 2012). For instance, with λ = 0.99, an observation encountered 100 time points in the past receives approximately 37% as much weight as the current observation. McCormick et al. (2012), in addition to extending Raftery et al. (2010) from dynamic linear regression to dynamic binary classification, also proposed a new adaptive forgetting scheme. The forgetting factor, λ_t (now defined with a subscript t), is extended such that it is allowed to assume different values at different time points.
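Pulling together the prediction step (3.11)-(3.12) and the Newton-style update (3.14)-(3.15), one step of the recursion can be sketched with NumPy. This is a minimal single-observation illustration under the logistic likelihood; the function name and argument layout are our own, and only one Newton step from the prior mean is taken, as in McCormick et al. (2012):

```python
import numpy as np

def dlr_step(theta, Sigma, x, y, lam=0.99):
    """One prediction + update step of dynamic logistic regression, a sketch.

    theta, Sigma : posterior mean and covariance from time t-1
    x, y         : covariate vector and binary outcome at time t
    lam          : forgetting factor (slightly below 1)
    """
    # Prediction step, eqs. (3.11)-(3.12): inflate the state covariance.
    R = Sigma / lam
    p = 1.0 / (1.0 + np.exp(-x @ theta))       # one-step-ahead probability

    # Update step, eq. (3.14): one Newton step on l = log(likelihood * prior),
    # started at the prior mean, where the prior's gradient vanishes.
    R_inv = np.linalg.inv(R)
    grad = (y - p) * x                          # D l at theta_{t-1}
    hess = -np.outer(x, x) * p * (1 - p) - R_inv  # D^2 l at theta_{t-1}
    theta_new = theta - np.linalg.inv(hess) @ grad
    Sigma_new = np.linalg.inv(-hess)            # eq. (3.15)
    return theta_new, Sigma_new, p
```

Observing y_t = 1 pulls the coefficients towards larger predicted probabilities, with the step size governed by the inflated covariance R_t.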
This has the effect of allowing the rate of change in the parameters to vary over time. The predictive likelihood is used to determine the λ to be used at each time point. More specifically, the λ_t that maximizes the following expression is selected:

\hat{\lambda}_t = \arg\max_{\lambda_t} \int_{\theta_t} p(y_t | \theta_t, Y^{t-1}) p(\theta_t | Y^{t-1}) d\theta_t \qquad (3.16)

However, since this integral is not available in closed form, McCormick et al. (2012) use a Laplace approximation:

f(y_t | Y^{t-1}) \approx (2\pi)^{d/2} |\{D^2 l(\hat{\theta}_t)\}^{-1}|^{1/2} p(y_t | Y^{t-1}, \hat{\theta}_t) p(\hat{\theta}_t | Y^{t-1}) \qquad (3.17)

which, according to Lewis and Raftery (1997), should be quite accurate. Instead of evaluating a whole range of different λ_t's to maximize the expression in equation (3.16), McCormick et al. (2012) use a simpler approach that only considers two possible states: some forgetting (λ_t = c < 1) and no forgetting (λ_t = 1). Different parameters are allowed to have different forgetting factors, and hence it would be computationally difficult to evaluate multiple λ's for models consisting of more than just a few variables, because the number of combinations grows exponentially. In their experiments, they conclude that the results were not sensitive to the chosen constant. In this thesis, both single and multiple λ's will be evaluated. In the case of multiple λ's, the model will share a common forgetting factor. Quite early on, it was empirically found that the forgetting schemes described above encountered problems with temporally sparse covariates, and that the smaller the λ, the bigger the trouble. In an attempt to remedy this issue, we propose a simple, yet intuitively reasonable, modification.
The basic idea is that c, the constant selected by the analyst, is - for each observation, and each attribute - scaled based on an estimate of the local sparsity, such that, during periods of mostly zeros for a particular covariate, λ is scaled towards 1:

\lambda_t^{(2)} = \lambda_t^{(1)} + \frac{1 - \lambda_t^{(1)}}{1 + \left(\sum_{i=t-w}^{t} x_i\right)^3 / w} \qquad (3.18)

where w is a constant to be selected by the user: it is the window upon which the local sparsity is estimated. The summation in the denominator reflects the number of non-zero occurrences in the past w observations. The more occurrences that are observed, the larger the number that (1 − λ_t^{(1)}) is divided by, and consequently the less λ_t^{(1)} is scaled. For instance, consider a fictive scenario in which an analyst has selected c = 0.95 and w = 10, and for a particular covariate, at a particular time point, 9 out of the last 10 observations are zero for this attribute, i.e. sparse. Equation (3.18) would then have the effect of modifying λ_t^{(1)} = 0.95 to λ_t^{(2)} ≈ 0.995. If, at another time point, say, 8 out of the 10 occurrences in w are non-zero values, λ_t^{(1)} = 0.95 is only marginally changed, to λ_t^{(2)} ≈ 0.951. The effect of this modification is further analyzed in sec. 4.2.

Evolution of the odds-ratios

In McCormick et al. (2012) and Koop and Korobilis (2012), two approaches were considered for studying the temporal significance of covariates and how the conditional relationships change over time; one being through the evolution of odds-ratios for specific covariates. Just as in the static logistic regression, odds-ratios are obtained by exponentiating the logit coefficients. Odds-ratios may be interpreted as the effect of a one-unit change in X on the predicted odds, with all other independent variables held constant (Breaugh, 2003). An odds-ratio > 1.0 implies that a particular covariate potentially has a positive effect, while an odds-ratio < 1.0 implies a potential negative effect.
The farther the odds-ratio is from 1.0, the stronger the association. In Haddock et al. (1998), guidelines for interpreting the magnitude of an odds-ratio are provided, in particular a rule of thumb which states that odds-ratios close to 1.0 represent a ’weak relationship’, whereas odds-ratios over 3.0 indicate ’strong (positive) relationships’. In McCormick et al. (2012), ±2 standard errors are computed, and if the resulting confidence interval does not overlap 1.0, a covariate is concluded to have a significant effect. In this thesis, both of the aforementioned approaches are considered when reflecting upon the temporal significance of covariates.

3.4.2. Dynamic Model Averaging

Dynamic model averaging (DMA), originally proposed by Raftery et al. (2010), is an extension of Bayesian Model Averaging (BMA) that introduces the extra dimension of time through state-space modeling. In this thesis, DMA is used together with the dynamic logistic regression, as in McCormick et al. (2012). This combination is attractive, considering the objectives of this work, in that the dynamic logistic regression allows the marginal effects of the predictors to change over time, whilst the dynamic model averaging allows the set of predictors to change over time. BMA, first introduced by Hoeting et al.
(1999), addresses the issue of model uncertainty by considering multiple models (M_1, ..., M_K) simultaneously, and computes the posterior distribution of a quantity of interest, say θ, by averaging the posterior distribution of θ over every considered model - weighting each model’s contribution by its posterior model probability (Hoeting et al., 1999), as in equation (3.19):

p(\theta | X) = \sum_{k=1}^{K} p(\theta | M_k, X) p(M_k | X) \qquad (3.19)

The posterior model probability for model M_k can be written as follows:

p(M_k | X) = \frac{p(X | M_k) p(M_k)}{\sum_{l=1}^{K} p(X | M_l) p(M_l)} \qquad (3.20)

where p(X | M_k) = \int p(X | \theta_k, M_k) p(\theta_k | M_k) d\theta_k is the integrated likelihood of model M_k, and θ_k is the vector of parameters of model M_k.

3.4.2.1. State-space representation

By introducing a state-space representation of the BMA, leading to DMA, the posterior model probabilities become dynamic, and are hence allowed to change over time. Just as in regular BMA, one considers K candidate models {M_1, ..., M_K}. Considering the specific combination of DMA and dynamic logistic regression, we re-define equation (3.10) as follows:

logit(p_t^{(k)}) = x_t^{(k)T} \theta_t^{(k)} \qquad (3.21)

Notice the superscript (k) that is present for both x_t^{(k)T} and θ_t^{(k)}, implying that candidate models may have different sets of covariates, and their parameter estimates may also differ. Estimation with DMA, following McCormick et al. (2012), is computed using the same framework as in the (single-model) dynamic logistic regression, i.e. the two steps of prediction and updating. Different from the single-model case, however, is the definition of the state space, which here consists of the pair (L_t, Θ_t), where L_t is a model indicator - such that if L_t = k, the process is governed by model M_k at time (t) - and Θ_t = {θ_t^{(1)}, ..., θ_t^{(K)}}.
Recursive estimation is performed on the pair (L_t, Θ_t):

p(\Theta_t | Y^{t-1}) = \sum_{l=1}^{K} p(\theta_t^{(l)} | L_t = l, Y^{t-1}) p(L_t = l | Y^{t-1}) \qquad (3.22)

Equation (3.22) may be compared to (3.19), which is the corresponding equation for BMA. An important aspect of (3.22) is that θ_t^{(l)} is only present conditionally, when L_t = l. Before we consider the prediction and updating steps, it is worth noting that, as in McCormick et al. (2012), a uniform prior is specified across the candidate models: p(L_t = l) = 1/K.

Prediction step

We here consider the second term of equation (3.22), which is the prediction equation for the model indicator L_t: in other words, the probability that the considered model is the governing model at time (t), given data up until (t − 1). The prediction equation is defined as follows:

P(L_t = k | Y^{t-1}) = \sum_{l=1}^{K} p(L_{t-1} = l | Y^{t-1}) p(L_t = k | L_{t-1} = l) \qquad (3.23)

The term p(L_t = k | L_{t−1} = l) implies that a K × K transition matrix needs to be specified. To avoid this, Raftery et al. (2010) redefine equation (3.23) and introduce another forgetting factor, α_t:

P(L_t = k | Y^{t-1}) = \frac{P(L_{t-1} = k | Y^{t-1})^{\alpha_t}}{\sum_{l=1}^{K} P(L_{t-1} = l | Y^{t-1})^{\alpha_t}} \qquad (3.24)

where α_t has the effect of flattening the distribution of L_t, and hence increasing the uncertainty. Just as with λ_t, α_t is adjusted over time using the predictive likelihood (but here across candidate models).

Updating step

The (model-) updating step is defined through equation (3.25):

P(L_t = k | Y^t) = \frac{\omega_t^{(k)}}{\sum_{l=1}^{K} \omega_t^{(l)}} \qquad (3.25)

where

\omega_t^{(l)} = P(L_t = l | Y^{t-1}) f^{(l)}(y_t | Y^{t-1}) \qquad (3.26)

Notice that the first term on the right-hand side of equation (3.26) is the prediction equation and the second term is the predictive likelihood for model (l). An important feature here is that this latter term (the predictive likelihood) has already been calculated (recall that it was used to determine the model-specific forgetting factor λ_t).
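The prediction and updating steps above amount to a flatten-reweight loop over the K candidate models, followed by a probability-weighted average as in equation (3.19). A minimal NumPy sketch (function and argument names are our own):

```python
import numpy as np

def dma_step(post_prev, pred_lik, preds, alpha=0.99):
    """One DMA step over K candidate models, a sketch.

    post_prev : posterior model probabilities P(L_{t-1}=k | Y^{t-1}), shape (K,)
    pred_lik  : predictive likelihoods f^(k)(y_t | Y^{t-1}), shape (K,)
    preds     : each model's one-step-ahead prediction, shape (K,)
    alpha     : forgetting factor for the model indicator
    """
    # Prediction step, eq. (3.24): flatten via the forgetting factor.
    pred_probs = post_prev ** alpha
    pred_probs /= pred_probs.sum()

    # Probability-weighted model-averaged prediction, cf. eq. (3.19).
    y_hat = pred_probs @ preds

    # Updating step, eqs. (3.25)-(3.26).
    w = pred_probs * pred_lik
    post = w / w.sum()
    return y_hat, post
```

With α < 1, the predicted model probabilities are pulled towards the uniform distribution, so recently well-predicting models can overtake the leader more quickly.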
Just as λ_t is allowed to take different values at different time points, so is the forgetting factor for the model indicator, α_t. To determine which α_t to use at time t, McCormick et al. (2012) suggest maximizing:

\arg\max_{\alpha_t} \sum_{k=1}^{K} f^{(k)}(y_t | Y^{t-1}) P(L_t = k | Y^{t-1}) \qquad (3.27)

That is, maximizing the predictive likelihood across the candidate models. The first term in equation (3.27) is the model-specific predictive likelihood (which has already been computed), and the second term is (3.24). As such, this adds minimal additional computation. In practice, McCormick et al. (2012) take the approach of evaluating two α values at each time point: {some forgetting, no forgetting}. Finally, to predict y_t at time t, equation (3.28) is applied:

\hat{y}_t^{DMA} = \sum_{l=1}^{K} P(L_t = l | Y^{t-1}) \hat{y}_t^{(l)} \qquad (3.28)

where ŷ_t^{(l)} is the predicted response for model l at time t. That is, to form the DMA prediction, each candidate model’s individual prediction is weighted by its posterior model probability.

Evolution of the inclusion probabilities

The second approach considered by McCormick et al. (2012) and Koop and Korobilis (2012) for studying the temporal significance of covariates is centered around posterior inclusion probabilities. They are derived by summing the posterior model probabilities for those models that include a particular variable at a particular time. To do so, all 2^p combinations of the input variables first need to be computed so as to construct 2^p candidate models - where p is the number of predictors. More formally, the posterior inclusion probability for variable i at time t is (Barbieri and Berger, 2004):

p_{i,t} \equiv \sum_{l: \, x_i \in M_l} P(M_l | Y^t) \qquad (3.29)

This approach is feasible in both McCormick et al. (2012) and Koop and Korobilis (2012) since the number of covariates is - in comparison to this thesis - relatively small, and the time series are also relatively short.
For this thesis, the ideal would have been to set up candidate models representing all possible combinations of variables, but since the number of covariates is quite large (> 100), and the time series is long, this is not computationally feasible. Consequently, we do not consider all possible combinations of all covariates, but rather use the “interesting variable groups” (as defined in sec. 4.2), and consider all possible combinations of these. Although limiting, this approach is reasonable since many of the covariates have quite a clear group structure, and the motivation for exploring this approach is that it may give some (high-level) insights into which variable groups are important at different time points.

The univariate scanner

An additional approach considered in this thesis for exploring the temporal significance of covariates is one in which the candidate models are constructed to be univariate. This approach is explored because (i) it allows for covariate-specific updating of the forgetting factor in a computationally feasible way, and (ii) it avoids potential issues of multicollinearity that the first approach of McCormick et al. (2012) may suffer from. To determine the significance of a particular variable at a particular time, the odds-ratios may be interpreted as described in the last section, or through the posterior model probabilities (> 0.5), as recommended by Barbieri and Berger (2004).

3.4.3. Dynamic Trees

Dynamic trees, first proposed by Taddy et al. (2011), are an extension of the popular non-parametric technique of partition trees. This thesis follows the particular version developed by Anagnostopoulos and Gramacy (2012), which extends the former by introducing a retiring scheme that allows the model complexity of the tree to not grow monotonically over time, but rather change in accordance with local structures of the data.
We first outline some basic concepts of partition trees and the relevant notation, and then the dynamic extension is introduced.

3.4.3.1. Static partition trees

The basic idea of (static) partition trees is to hierarchically partition a given input space X into hyper-rectangles (leaves) by applying nested logical rules. The standard approach is to use binary recursive partitioning. A tree, here denoted by T, consists of a set of hierarchically ordered nodes η ∈ T, each of which is associated with a subset of the input covariates x^t = {x_s}^t. These subsets are the result of a series of splitting rules. Considering the tree structure in a bit more detail, one may differentiate between different types of nodes: (i) at the top of every tree one finds the root node, R_T, which includes all of x^t; (ii) using binary splitting rules, a node η may be split into two new nodes placed lower in the hierarchy - these are referred to as η’s child nodes, or more specifically η’s left and right children, C_l(η) and C_r(η) respectively, and they are disjoint subsets of η such that C_l(η) ∪ C_r(η) = η; (iii) the parent node, P(η), on the other hand, is placed above η in the hierarchy, and contains both η and its sibling node S(η), such that P(η) = η ∪ S(η). A node that has children is called an internal node, whilst nodes that do not are referred to as leaf nodes. The sets of internal nodes and leaf nodes in T are denoted by I_T and L_T respectively. At every leaf node, a decision rule is deployed, parametrized by θ_η. Independence across tree partitions leads to the likelihood

p(y^t | x^t, T, \theta) = \prod_{\eta \in L_T} p(y^\eta | x^\eta, \theta_\eta)

where [x^η, y^η] is the subset of the data allocated to η. This way of treating the leaf nodes is in the literature often referred to as a Bayesian treed model. Whilst flexible, this approach poses challenges in terms of selecting a suitable tree structure. To address this problem, Chipman et al.
(1998) designed a prior distribution, π(T) (often referred to as the CGM tree prior), over the range of possible partition structures, which allows for a Bayesian approach with inference via the posterior: p(T | [x, y]^t) ∝ p(y^t | T, x^t) π(T), where [x, y]^t is the complete data set. The CGM prior specifies a tree probability by placing a prior on each partition rule:

\pi(T) \propto \prod_{\eta \in I_T} p_{split}(T, \eta) \prod_{\eta \in L_T} [1 - p_{split}(T, \eta)] \qquad (3.30)

where p_{split}(T, η) = α(1 + D_η)^{−β} is the depth-dependent split probability (α, β > 0, and D_η is the depth of η in the tree). Equation (3.30) states that the tree prior is the probability that the internal nodes have split and the leaves have not. In Chipman et al. (1998), a Metropolis-Hastings MCMC approach is developed for sampling from the posterior distribution of partition trees. Specifically, stochastic modifications of T, referred to as “moves” (grow, prune, change, and swap), are proposed incrementally and accepted according to the Metropolis-Hastings ratio. It is upon this framework that Taddy et al. (2011) base their dynamic extension.

3.4.3.2. Dynamic Trees

The extension from static partition trees (or more specifically, Bayesian static treed models) to dynamic trees is the result of defining the tree as a latent state which is allowed to evolve according to a state transition probability, P(T_t | T_{t−1}, x_t), referred to as the evolution equation, where T_{t−1} represents the set of recursive partitioning rules observed up to time t − 1. A key insight here is that the transition probability depends on x_t, which implies that only moves (grow, prune, etc.) that are local to the current observation (i.e. the leaf η(x_t)) are considered. This makes the approach computationally feasible. Following Anagnostopoulos and Gramacy (2012), we let:

P(T_t | T_{t-1}, x_t) = \begin{cases} 0, & \text{if } T_t \text{ is not reachable from } T_{t-1} \text{ via moves local to } x_t \\ p_m \pi(T_t), & \text{otherwise} \end{cases} \qquad (3.31)

where p_m is the probability of a particular move, and π(T_t) is the tree prior.
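The depth-dependent split probability and the prior in equation (3.30) are straightforward to compute; a minimal sketch, where representing a tree by the depths of its internal and leaf nodes is our own simplification:

```python
def p_split(depth, a=0.95, b=2.0):
    """Depth-dependent split probability: p_split = a * (1 + depth)^(-b)."""
    return a * (1.0 + depth) ** (-b)

def cgm_prior(internal_depths, leaf_depths, a=0.95, b=2.0):
    """Unnormalized CGM tree prior, eq. (3.30): internal nodes have
    split, leaf nodes have not."""
    prior = 1.0
    for d in internal_depths:
        prior *= p_split(d, a, b)
    for d in leaf_depths:
        prior *= 1.0 - p_split(d, a, b)
    return prior
```

Because p_split decays with depth, deep and bushy trees receive exponentially small prior mass, which is what regularizes the tree size.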
The moves considered in this sequential approach are {grow, prune, stay}. Taddy et al. (2011) argue that the exclusion of the change and swap moves allows considerably more efficient processing. The three considered moves are equally probable, and are defined as follows:

• Stay: The tree remains the same: T_t = T_{t−1}.

• Prune: The tree is pruned such that η(x_t) and all of the nodes below it in the hierarchy are removed, including η(x_t)’s sibling node S(η(x_t)). This implies that η(x_t)’s parent node P(η(x_t)) becomes a leaf node after the prune.

• Grow: A new partition is created within the hyper-rectangle defined for η(x_t). More specifically, this move first uniformly chooses a split dimension (covariate dimension) j and a split point x_j^{grow}. The observations of η(x_t) are then divided according to the defined split rule.

3.4.3.2.1 Prediction and the Leaf Classification Model

For posterior inference with dynamic trees, two quantities are imperative: (i) the marginal likelihood for a given tree, and (ii) the posterior predictive distribution for new data. The marginal likelihood is obtained by marginalizing over the regression model parameters, which in this case are those of the leaves η ∈ L_T, each parametrized by θ_η ∼ π(θ):

p(y^t | T_t, x^t) = \prod_{\eta \in L_{T_t}} p(y^\eta | x^\eta) = \prod_{\eta \in L_{T_t}} \int p(y^\eta | x^\eta, \theta_\eta) d\pi(\theta_\eta) \qquad (3.32)

That is, conditional on a given tree, the marginal likelihood is simply the product of independent leaf likelihoods. Combining (3.32) with the prior described earlier, we obtain the posterior p(T_t | [x, y]^t, T_{t−1}). Consider next the predictive distribution for y_{t+1}, given x_{t+1}, T_t, and data [x, y]^t:

p(y_{t+1} | x_{t+1}, T_t, [x, y]^t) = p(y_{t+1} | x_{t+1}, T_t, [x, y]^{\eta(x_{t+1})}) = \int p(y_{t+1} | x_{t+1}, \theta) dP(\theta | [x, y]^{\eta(x_{t+1})}) \qquad (3.33)

Notice in the second step of the derivation that [x, y]^t is re-written as [x, y]^{η(x_{t+1})}: this is because we only consider the leaf partition which contains x_{t+1}.
The second term in (3.33), dP(θ | [x, y]^{η(x_{t+1})}), is the posterior distribution over the leaf parameters (classification rules), given the data in η(x_{t+1}). As such, the predictive distribution is simply the classification function at the leaf containing x_{t+1}, integrated over the conditional posterior for the leaves (model parameters). The model defined at each of the leaves may be linear, constant or multinomial. Since the response variable in this work is binary, the approach of binomial leaves is applied. As such, each leaf response y_s^η is equal to one of C = 2 alternative factors. The set of outcomes for a particular leaf is summarized by a count vector, z_η = [z_1^η, z_2^η]', such that the total count for each class is z_c^η = \sum_{s=1}^{|\eta|} 1(y_s^η = c). Following Taddy et al. (2011), we then model the summary counts for each leaf as:

z_\eta \sim Bin(p_\eta, |\eta|) \qquad (3.34)

where Bin(p, n) is a binomial with expected count p_c n for each category. A Dirichlet Dir(1_C / C) prior is assumed for each leaf probability vector, and as such, the posterior information about p_η is given by:

\hat{p}_\eta = \frac{z_\eta + 1/C}{|\eta| + 1} = \frac{z_\eta + 1/2}{|\eta| + 1}

The marginal likelihood for leaf node η is then defined by equation (3.35):

p(y^\eta | x^\eta) = p(z_\eta) = \prod_{c=1}^{C} \frac{\Gamma(z_c^\eta + 1/C)}{z_c^\eta! \times \Gamma(1/C)} = \prod_{c=1}^{2} \frac{\Gamma(z_c^\eta + 1/2)}{z_c^\eta! \times \Gamma(1/2)} \qquad (3.35)

Finally, the predictive response probability for a leaf node η containing covariates x is:

p(y = c | x, \eta, [x, y]^\eta) = p(y = c | z_\eta) = \hat{p}_c^\eta \quad \text{for } c = 1, 2 \qquad (3.36)

3.4.3.2.2 Particle Learning for Posterior Simulation

As in the static version of Chipman et al. (1998), a sampling scheme is applied to approximate the posterior distribution of the tree. More specifically, Taddy et al.
(2011) use a Sequential Monte Carlo (SMC) approach: at time t − 1, the posterior distribution over the trees is characterized by N equally weighted particles, each of which includes a tree T_{t−1}^{(i)} as well as sufficient statistics S_{t−1}^{(i)} for each of its leaf classification models. This tree posterior, {T_{t−1}^{(i)}}_{i=1}^N, is updated to {T_t^{(i)}}_{i=1}^N through a two-step procedure of (i) resampling and (ii) propagating. In the first step, particles are resampled, with replacement, according to their predictive probability for the next (x, y) pair: w_i = p(y_t | T_{t−1}^{(i)}, x_t). In the second step, each tree particle is updated by proposing local changes, T_{t−1}^{(i)} → T_t^{(i)}, via the moves {stay, prune, grow}, resulting in three candidate trees: {T^{stay}, T^{prune}, T^{grow}}. As the candidate trees are equivalent above the parent node of x_t, P(η(x_t)), one only needs to calculate the posterior probabilities for the subtrees rooted at this particular node. Denoting the subtrees by T_t^{move}, the new T_t is sampled with probabilities proportional to π(T_t^{move}) p(y^t | x^t, T_t^{move}), where the first term, the prior, is given by (3.31) and the second term, the likelihood, is (3.32) with leaf marginal (3.35). As noted in Taddy et al. (2011) and Anagnostopoulos and Gramacy (2012), this sequential filtering approach enables the model to inherit a natural division of labor that mimics the behavior of an ensemble method - without explicitly maintaining one.

3.4.3.2.3 Data retirement

What has been considered so far is the original approach developed by Taddy et al. (2011). Whilst sequential, this approach is not strictly online, because the tree moves may require access to the full data history. Furthermore, the complexity of the original dynamic trees model grows with log t, and for classification in non-stationary environments this is not ideal, as we suspect that the data-generating mechanism may change over time.
In Anagnostopoulos and Gramacy (2012), an extension is proposed where data is sequentially discarded and down-weighted. Specifically, an approach referred to as data point retirement is developed, where only a constant number, w, of observations are active in the trees (referred to as the 'active data pool'). Whilst data points are sequentially discarded, they are still 'remembered' in this approach. This is achieved by retaining the discarded information in the form of informative leaf priors. More specifically, suppose we have a single leaf η of T_t, for which we have already discarded some data, (x_s, y_s)_{s}, that was in η at some time t′ ≤ t in the past. Anagnostopoulos and Gramacy (2012) suggests that this information can be "remembered" by taking the leaf-specific prior, π(θ_η), to be the posterior of θ_η, given only the retired data. If we generalize this to trees of more than one leaf, we may take:

π(θ) := P(θ | (x_s, y_s)_{s}) ∝ L(θ; (x_s, y_s)_{s}) π_0(θ)    (3.37)

where π_0(θ) is a baseline non-informative prior applied to all of the leaves. Following Anagnostopoulos and Gramacy (2012), we update the retired information through the recursive updating equation:

π^(new)(θ) := P(θ | (x_s, y_s)_{s}, r) ∝ L(θ; x_r, y_r) P(θ | (x_s, y_s)_{s})    (3.38)

where (x_r, y_r) is the new data point that is retired. Anagnostopoulos and Gramacy (2012) shows that equation (3.38) is tractable whenever conjugate priors are employed. In our case, with the binomial model, the discarded response values y_s are represented as indicator vectors z_s, where z_{js} = 1(y_s = j). The natural conjugate is the Dirichlet D(a), where a is a hyperparameter vector that may be interpreted as counts. It is updated through a^(new) = a + z_r, where z_{jr} = 1(y_r = j).
Anagnostopoulos and Gramacy (2012) shows that through this approach, retirement preserves the posterior distribution, and as such, the posterior predictive distributions and marginal likelihoods required for the SMC updates are also unchanged. A dynamic tree with retirement manages two types of information: (i) a non-parametric memory in the form of an active data pool of (constant) size w < t, as well as (ii) a parametric memory of possibly informative leaf priors. The algorithm proposed by Anagnostopoulos and Gramacy (2012) may be summarized by the following steps:

1. At time t, add the tth data point to the active data pool.
2. Update the model through the Sequential Monte Carlo scheme described in 3.4.3.2.2.
3. If t exceeds w, select some data point (x_r, y_r) and remove it from the active data pool. But before doing so, update the associated leaf prior for η(x_r)^{(i)} for each particle i = 1, ..., N, so as to 'remember' the information present in (x_r, y_r).

More details are found in (Anagnostopoulos and Gramacy, 2012).

3.4.3.2.4 Temporal adaptivity using forgetting factors

To address the possibility of a changing data generating mechanism in a streaming context, Anagnostopoulos and Gramacy (2012) further introduced a modification of the retiring scheme described in the previous section. Specifically, the retired data history, s, is exponentially down-weighted when a new point y_m arrives:

π_λ^(new)(θ) ∝ L(θ; y_m) L^λ(θ; (y_s, x_s)_{s}) π_0(θ)    (3.39)

where λ is a forgetting factor. At the two extremes, when λ = 1, the standard conjugate Bayesian updating is applied, as in the previous section, and when λ = 0, the retired history is disregarded completely. A λ in between these two extremes has the effect of placing more weight on recently retired data points. More specifically, in the context of the binomial model, the conjugate update is modified from a^(new) = a + z_m to a^(new) = λa + z_m.
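The conjugate retirement update with forgetting, a^(new) = λa + z_m, is a one-liner in practice. A minimal sketch for the two-class leaf model (function and variable names are our own, not from the dynaTree package):

```python
def retire_point(a, y_new, lam=1.0, n_classes=2):
    """Fold a newly retired response y_new into a leaf's Dirichlet counts a.

    lam = 1.0 gives plain conjugate updating, a + z; lam < 1.0 exponentially
    down-weights the previously retired history before adding the new count.
    """
    z = [1.0 if y_new == j else 0.0 for j in range(n_classes)]
    return [lam * a_j + z_j for a_j, z_j in zip(a, z)]

# Starting from the baseline Dir(1/C) prior and retiring three points:
a = [0.5, 0.5]
for y in (1, 1, 0):
    a = retire_point(a, y, lam=0.9)
```

With lam below 1, older retired points contribute geometrically less to the counts, which is exactly the "parametric memory with forgetting" described above.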
In the algorithm described in 3.4.3.2.3, one of the steps noted "select some data point (x_r, y_r) and remove it". We may here specify that, in the context of this thesis, and following Anagnostopoulos and Gramacy (2012), this data point is the oldest data point in the active data pool.

3.4.3.2.5 Variable Importance

To measure the importance of predictors for dynamic trees, where the response variable is discrete, Gramacy et al. (2013) proposed the use of predictive entropy based on the posterior predictive probability (p̂) of each class c in node η. This leads to the entropy reduction:

Δ(η) = n_η H_η − n_l H_l − n_r H_r    (3.40)

where H_η = −Σ_c p̂_c log p̂_c and n_η is the number of data points in η. The second and third terms on the right-hand side of equation (3.40) describe the entropy of node η's left and right children respectively. In Gramacy et al. (2013), however, variable importance is not considered in an online setting: each covariate's predictive entropy is calculated based on results from the full dataset. In this thesis, we are interested in the temporal variable importance, and as such, we instead consider the mean entropy reduction for a particular covariate at each time point, obtained by averaging over the N particles. This allows us to display the variable importance as a time-series; a simple and intuitive way to study its relative importance over time.

3.5. Drop description

3.5.1. Association Rule Mining

From the analysis of sec. 3.4, one may gain insights about which variables are relevant at different time-points. As an additional layer to this aforementioned analysis, we further consider the application of association rule mining, originally proposed by Agrawal et al. (1993), with the objective of gaining intuitive descriptions that are easy to interpret for domain experts. This approach is convenient since the data has been formatted such that it consists of binary variables.
Knowing which variables are relevant at different time-points (inherited from sec. 3.4), and hence which variables to consider when deriving association rules at those time-points, has the positive effect of reducing the search space that needs to be explored to obtain the rules. Specifically, the Apriori algorithm is used to generate the association rules. The Apriori algorithm is designed to operate on transaction databases, and hence the first step consists of transforming the original data into a transaction database format. Following the transformation, the data consist of a set of transactions, where each transaction (T) is a set of items (I), and is identified by its unique TID (= transaction identifier). An association rule is an implication of the form X → Y, where X ⊆ I, Y ⊆ I, and X ∩ Y = ∅. The Apriori algorithm works bottom-up: it first identifies frequent individual items in the database and then extends them into larger item sets, as long as those item sets appear sufficiently often in the database (Agrawal et al., 1993). Given a set of transactions, the problem of mining association rules is to generate all association rules that have support and confidence greater than the user-specified minimum support (minsup) and minimum confidence (minconf) respectively. Support is simply the number of transactions in the dataset that contain the association rule, divided by the total number of transactions, whilst confidence measures how many of the transactions containing a particular item (say X) also contain another item (say Y). More formally, the support for association rule X → Y is defined by equation (3.41):

Support(X → Y) = count(X ∪ Y) / N    (3.41)

where N is the number of transactions (observations).
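Both measures are straightforward to compute directly from a transaction database; a small self-contained sketch with toy transactions (item names hypothetical):

```python
def support(transactions, itemset):
    """Fraction of transactions containing every item in itemset, as in (3.41).
    Assumes transactions is a non-empty list of sets."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, lhs, rhs):
    """Support of lhs-and-rhs divided by support of lhs."""
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)

# Toy database of four calls described by binary attributes:
db = [{"PS", "Drop"}, {"PS", "Drop"}, {"PS"}, {"CS"}]
support(db, {"PS"})               # 3 of 4 transactions -> 0.75
confidence(db, {"PS"}, {"Drop"})  # (2/4) / (3/4) -> 2/3
```

In practice the Apriori implementation in the arules package performs this counting over all candidate item sets; the sketch is only meant to pin down the two definitions.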
The confidence for association rule X → Y is obtained through equation (3.42):

Confidence(X → Y) = count(X ∪ Y) / count(X)    (3.42)

In this thesis we are interested in those association rules which have {Drop = 1} on the right-hand side of the rule, and hence we introduce such a constraint into the process - in addition to minsup and minconf.

3.6. Technical aspects

For the purpose of data cleaning, pre-processing, and sampling, the Python programming language was used. The analysis part of this thesis was carried out using the R programming language. For the dynamic logistic regression and dynamic model averaging, code was first extracted from the dma library - and on the basis of this code, various extensions and modifications were implemented. For the dynamic trees model, the dynaTree package was used. Finally, association rule mining relied on the arules package.

4. Results

This section presents the results of this thesis and is divided into three parts: the first contains a brief exploratory analysis of the data. In the second part, the task of deriving an online classifier with high prediction capability is tackled. In the third and final part, the temporal significance of the covariates is explored, and interesting periods are analyzed in more detail - with the objective of identifying potential causes for drops at those particular periods.

4.1. Exploratory analysis

In Figure 4.1 the number of dropped calls over the relevant period is displayed.

Figure 4.1.: Number of dropped calls over the period January 26 - April 11 for STP9596, as divided by 100 equally large time-ordered subsets.

As one may observe, there are at least ~5 time-periods in which the call drop rate increases considerably. Upon exploring the temporal significance of covariates in sec. 4.3, one of these periods will receive special attention.
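A series like the one in Figure 4.1 is produced by splitting the time-ordered drop indicator into equal contiguous subsets and counting the drops in each; a minimal sketch (toy data, not the thesis records):

```python
def drop_counts(dropped, n_bins=100):
    """Split a time-ordered 0/1 drop indicator into n_bins contiguous,
    equally sized subsets and count the drops in each subset.
    Assumes len(dropped) is a multiple of n_bins for simplicity."""
    size = len(dropped) // n_bins
    return [sum(dropped[i * size:(i + 1) * size]) for i in range(n_bins)]

# Example: 1000 calls where every 10th call dropped -> 10 drops per bin of 100.
counts = drop_counts([1 if i % 10 == 0 else 0 for i in range(1000)], n_bins=10)
```

Plotting such counts against the subset index gives the drop-rate time-series used throughout this chapter.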
Worth noting is that no effort has been made to account for periodicity; this is because the data originates from programmed systems that do not have any periodicity dependencies. Even if there had been any, the sequential Bayesian framework would naturally have incorporated that aspect by updating the parameters accordingly. Initially, 188 covariates were extracted. Having considered the aspects of redundancy and multicollinearity, several covariates could be removed. For instance, many signal types have both a 'request' and a 'response' signal, which hence almost always occur together. In such cases, one of them was removed. The resulting dataset of 122 covariates is one in which the degree of multicollinearity is low, as one may observe from the heatmap plot of the correlation matrix in Figure 4.2:

Figure 4.2.: Heatmap of correlation matrix

To demonstrate the concept of temporal significance, let us consider two of the 122 covariates, starting with PS. As mentioned in sec. 2.2.1, this covariate describes the "type of radio connection that a particular UE has", or more specifically, that it has a data connection activated. In Figure 4.3, the percentage of such calls that terminate unexpectedly is displayed, divided into four equally sized (ordered) time periods. Considered as a univariate classifier, the red bars in this plot represent the true positive rate at four different time periods. It can be observed that the proportion of calls with PS that drop is not constant over the considered time period. In the first time period, the percentage of normal outcomes outweighs the dropped ones. This changes quite drastically in the second period, where more than 75% of the calls with PS terminate unexpectedly. In period three, the percentage of dropped calls still outweighs the normal ones considerably. In the fourth and final period, the proportions are almost equal.
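Per-period proportions like those in Figure 4.3 amount to restricting attention to the calls carrying a given attribute, splitting them into contiguous time periods, and computing the drop rate within each; a sketch with hypothetical records:

```python
def drop_rate_by_period(records, attr, n_periods=4):
    """records: time-ordered (attributes, dropped) pairs. Returns the fraction
    of dropped calls among calls carrying `attr`, per contiguous period.
    Assumes the number of matching calls divides evenly by n_periods."""
    hits = [dropped for attrs, dropped in records if attr in attrs]
    size = len(hits) // n_periods
    return [sum(hits[i * size:(i + 1) * size]) / size for i in range(n_periods)]

# Eight calls with PS: all of the first four dropped, none of the last four.
records = [({"PS"}, 1)] * 4 + [({"PS"}, 0)] * 4 + [({"CS"}, 1)]
drop_rate_by_period(records, "PS")  # [1.0, 1.0, 0.0, 0.0]
```

The returned fractions correspond to the red bar heights, i.e. the per-period true positive rate of the attribute viewed as a univariate classifier.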
Figure 4.3.: Percentage of PS that terminates unexpectedly, as divided by 4 equally sized time periods

Next, we consider one of the GCP covariates, more specifically the GCP combination "000011000000011000011011". In Figure 4.4, the proportion of calls that have this specific combination of generic connection properties and terminate unexpectedly is displayed - again, divided into four equally sized time periods.

Figure 4.4.: Percentage of calls with GCP=000011000000011000011011 that terminates unexpectedly, as divided by 4 equally sized time periods

One may observe that the proportion of calls having this particular GCP that drop changes over time. Specifically, during the first quarter of the time-series, close to 70% of the calls that attain this GCP terminate unexpectedly. For the following three periods, however, this relationship shifts such that calls attaining this property instead tend to correlate with normal calls.

4.2. Online classification

In this section, the sampling and classification techniques described in chapter 3 are evaluated so as to derive a model that can discriminate between dropped calls and normal calls with high precision. To conclude the section, the best online classifier is compared to its static equivalent in a few fictive scenarios.

4.2.1. Sampling strategies

As previously mentioned, the number of normal calls far outweighs the number of dropped calls. This part of the results is concerned with studying the effects of this imbalance on the capability of the classifiers. The first question one may reasonably pose is whether sampling is needed at all. If yes, what sampling technique and what sampling rate are suitable?
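The idea behind online random undersampling is simply to pass every minority (dropped) call through while keeping a majority (normal) call only with some probability; a generic sketch of the idea (the exact ORUS and A-ORUS procedures are those of sec. 3.2, which may differ in detail):

```python
import random

def orus_stream(stream, p_keep_majority, majority_label=0, seed=0):
    """Online random undersampling of a labelled stream of (x, y) pairs:
    minority examples always survive, majority examples survive with
    probability p_keep_majority. Sketch only; sec. 3.2 has the real scheme."""
    rng = random.Random(seed)
    for x, y in stream:
        if y != majority_label or rng.random() < p_keep_majority:
            yield x, y
```

With p_keep_majority set near the minority/majority ratio, the surviving stream is approximately 50/50 between the classes.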
To answer these questions, the online random undersampling (ORUS) technique, as well as the proposed extension, adaptive online random undersampling (A-ORUS), described in sec. 3.2, are evaluated using the same evaluation metric as in Nguyen et al. (2011) and Wang et al. (2013): the geometric mean. Datasets with different rates of imbalance were created via these sampling techniques, and then the dynamic logistic regression model and the dynamic trees model (with fixed parameter settings) were applied to these datasets - i.e. holding everything except the sampling rate constant. For ORUS, the considered imbalance rates are: (i) 10%/90%, (ii) 30%/70%, and (iii) 50%/50%. For A-ORUS, the considered imbalance rate is 50%/50%. The original imbalance rate of 1.2%/98.8% is also considered. Let us first evaluate the results of the dynamic logistic regression, presented in Table 4.1.

Sampling Strategy   G-mean
ORIG 1/99           0.487
ORUS 10/90          0.810
ORUS 30/70          0.875
ORUS 50/50          0.913
A-ORUS              0.890
Table 4.1.: Evaluation of Sampling strategies using Dynamic Logistic Regression

It can be seen that the original imbalance rate (∼1% dropped calls) results in a G-mean score that is considerably worse than the other four; an indication that sampling may be justified. A general tendency one may observe is that, as the undersampling rate increases for ORUS - and the distribution over the classes becomes more uniform - the G-mean score increases, reflecting the increased capability of the model to predict positive instances (dropped calls) correctly. Considering the proposed adaptive technique, it can be observed that it does not affect the G-mean as positively as the 50/50 sampling rate of ORUS. Let us next consider the corresponding results for the dynamic trees.
Sampling Strategy   TPR     TNR     G-mean
ORIG 1/99           0.370   0.994   0.545
ORUS 10/90          0.610   0.984   0.719
ORUS 30/70          0.718   0.946   0.789
ORUS 50/50          0.779   0.876   0.827
A-ORUS              0.743   0.911   0.798
Table 4.2.: Evaluation of Sampling strategies using Dynamic Trees

The results in Table 4.2 align well with those of Table 4.1, in that the G-mean score steadily increases as the distribution between the classes becomes more even. To confirm the conclusions based on the G-mean, one may further consider the TPR and TNR values, and in particular the general trend that the TPR increases as the undersampling rate increases. This, however, comes at the cost of reductions in the TNR. Even so, the overall performance of the classifier is improved. Since what is of particular interest in this thesis is to discriminate dropped calls from normal calls, i.e. the positive cases, the G-mean and TPR are of particular importance. Considering the proposed adaptive technique, it can again be observed that it does not affect the G-mean or TPR as positively as the 50/50 sampling rate for ORUS. Based on the results presented in Tables 4.1 and 4.2, the decision was made to use the data resulting from 50/50 ORUS (i.e. the dataset with a 50%/50% distribution between the classes) for the remainder of the analysis.

4.2.2. Dynamic Trees

The first online classification technique to be considered is the dynamic trees. As previously described, the tree prior (affecting the split probability) is specified by two parameters: α and β. A sensitivity analysis of these was performed, and it was found that the results were only marginally affected by their specification. Based on this analysis, the parameters were set to α = 0.99 and β = 2. These settings align well with what is usually applied in the literature. Tables displaying the sensitivity analysis are found in Tables A.1 and A.2 in the Appendix.
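The G-mean reported in Tables 4.1 and 4.2 is the geometric mean of the true positive and true negative rates, which is why it rewards balanced performance on both classes; a minimal sketch:

```python
from math import sqrt

def g_mean(tp, fn, tn, fp):
    """Geometric mean of sensitivity (TPR) and specificity (TNR)."""
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    return sqrt(tpr * tnr)

# The ORUS 50/50 row of Table 4.2 (TPR 0.779, TNR 0.876) gives, up to the
# rounding of the reported rates, its tabulated G-mean of roughly 0.83:
g_mean(779, 221, 876, 124)
```

Because the geometric mean collapses to zero if either rate is zero, a classifier cannot score well by simply predicting the majority class, which is exactly the property wanted under class imbalance.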
In addition to the tree prior, there are also the forgetting factor (λ), the active data pool size (w), and the number of particles (N). The latter was set in accordance with the literature: N = 1000. For the former two, an empirical evaluation is performed so as to derive the best DT. Let us first consider the forgetting factor λ, holding w constant. In Table 4.3, the result of this evaluation is presented:

Lambda   w      TPR    TNR    AUC    G-mean
1.00     1000   0.747  0.858  0.869  0.799
0.99     1000   0.839  0.841  0.921  0.850
0.95     1000   0.831  0.871  0.926  0.856
0.90     1000   0.850  0.875  0.933  0.864
0.85     1000   0.823  0.877  0.927  0.857
0.80     1000   0.839  0.889  0.934  0.864
0.70     1000   0.802  0.900  0.924  0.849
0.60     1000   0.819  0.882  0.927  0.855
0.50     1000   0.813  0.894  0.927  0.852
Table 4.3.: Evaluation of forgetting factors for the Dynamic Trees

It can be observed that, between λ = 1 and λ = 0.90, the prediction capability - as measured by AUC and G-mean - monotonically improves. At λ = 0.80, the best score is obtained. Lowering λ further does not present any improvement. One important takeaway here is that a λ < 1 is rewarding, which implies that weighting more recently observed (retired) observations higher improves the result. This reflects the time-dependence of the system. Let us next consider the active data pool size, w, holding λ constant.

w       TPR    TNR    AUC    G-mean
50      0.727  0.886  0.874  0.799
100     0.777  0.875  0.894  0.823
250     0.808  0.874  0.913  0.843
500     0.797  0.884  0.917  0.839
750     0.815  0.885  0.926  0.851
1000    0.823  0.889  0.928  0.856
1500    0.817  0.894  0.931  0.859
2000    0.837  0.883  0.930  0.866
3000    0.827  0.886  0.931  0.871
4000    0.800  0.903  0.929  0.874
Table 4.4.: Evaluation of active data pool size (w) for the Dynamic Trees

From Table 4.4, one can observe that, between w = 50 and w = 1000, the performance of the classifier monotonically increases. After this point, the AUC and G-mean still increase (up until w = 3000), but only marginally relative to the increase in w.
Considering the notable increase in computational cost (see Table A.3), the marginal gains in performance are not enough to displace w = 1000 as the best alternative. The observation that performance improves as w is increased is not that surprising, because, as previously described, the size of the active data pool determines the total number of observations stored in the tree (at any given time point); hence, a lower w has the effect of forcing the tree to be smaller, whilst a larger w allows the tree to grow larger. A larger tree has the advantage of being able to capture more complex structures in the data, but it comes at the cost of potentially not being as flexible as the smaller tree.

Based on the results and analysis of Tables 4.3 and 4.4, it is concluded that the best DT is the one with parameter settings λ = 0.80 and w = 1000: it achieved an AUC of 93.4% and a G-mean of 86.4%. To gain insight into how well this model performed over time, we consider, in Figure 4.5, a rolling window displaying the accuracy at different time-points.

Figure 4.5.: Rolling window measuring the Accuracy for the best Dynamic Trees model over the considered period

One can observe that at approximately 7 time-points the performance of the classifier degrades to an accuracy of less than 60%. The time-points of the degradations may be compared to the call drop time-series for the undersampled data in Figure 4.7. In doing so, one finds that four of the degradations happen during the second period of abnormally high drop-rate (corresponding to subsets 13-22 in Figure 4.7). The sixth and seventh degradations are related to the third and fifth periods of abnormally high drop-rate respectively. As described in sec. 3.4.3, the latent state of the dynamic tree consists of the tree-structure.
The degradations in this plot reflect the inability of the model to update its structure fast enough. A general observation one might make is that there is no clear trend, which can imply one of two things: either (i) no structural changes occurred over this time-period, or (ii) the classifier is able to adapt to changing circumstances, and thus avoids any longer period of degradation. Seeing as there clearly are reductions in the performance, but the classifier recovers, the second alternative appears more likely. In sec. 4.3, we will consider the period 4200-6400 in more detail, to explore what might have caused these degradations.

4.2.3. Dynamic Logistic Regression

In this subsection, the dynamic logistic regression, as well as the extension with dynamic model averaging, are evaluated.

4.2.3.1. Allowing for the inclusion of sparse attributes

As previously mentioned, in addition to being imbalanced, the data is also sparse with regard to several of the input variables, and this proved to pose another challenge. This part of the results is concerned with shedding light on this problem, as well as with evaluating the proposed forgetting factor modification and comparing it to the original proposed in McCormick et al. (2012). To explore how many, and which, variables the original forgetting framework has trouble with, an experiment was set up in which 122 univariate dynamic regression models were fitted, one for each covariate. If the model-fitting failed to execute correctly during the recursive updating step, a "1" was recorded for that attribute. If no problem occurred, a "0" was registered. The same experiment was then performed for the modified forgetting framework. The forgetting factor λ was set to 0.95 for both versions. For the modified version, the additional parameter w was set to 10.
In Table 4.5, the outcome of these runs is presented:

Forgetting scheme   Success   Failed
Original            92        30
Modified            122       0
Table 4.5.: Evaluation of forgetting frameworks

One may observe that approximately 1/4 of the covariates could not be used with the original forgetting framework. By applying the modified version, however, all covariates could be included. The full list of variables for which the original forgetting framework failed is found in Table A.4. One characteristic that they all share is that they are temporally sparse. Let us consider one of these covariates, cell_526, in a bit more detail. The updating step fails to converge at time-point 3441, and in Figure 4.6, the log-odds, the values of the covariate, and the values of the response variable are presented between time-points 3350-3450. From Figure 4.6 it can be observed that during this sub-period, 6 calls were made from cell_526, of which 3 were dropped. At time-point 22 (in this plot), a match occurs (cell_526 = 1 and y = 1), and the model reacts by updating the parameter estimate to ∼ +500. At time-point 36, the next call from cell_526 is made; however, this time it is not a match (cell_526 = 1, y = 0), and consequently the model updates the parameter estimate to ∼ −200. Finally, at time-point 91, we observe two consecutive matches, and this is what causes the model updating to crash: the model updates the parameter to such an extent that when the logit prediction is made, exponentiation of the log-odds produces an infinite value. As previously mentioned, by lowering the λ value, we assign larger weights to more recent observations (according to weight_j = λ^j for an observation j time-points back), and what we observe here is that, during periods of sparsity, this has the potential effect of causing extreme inflation of parameter estimates, which crashes the algorithm.
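This failure mode is easy to reproduce in isolation; a minimal illustration (not the dma code) of how an inflated log-odds estimate crashes a naive logit prediction:

```python
import math

def predict_prob(log_odds):
    """Naive logistic prediction p = exp(eta) / (1 + exp(eta)).
    math.exp overflows double precision for eta above roughly 709."""
    return math.exp(log_odds) / (1.0 + math.exp(log_odds))

predict_prob(500)      # still representable; effectively 1.0
try:
    predict_prob(800)  # exp(800) raises OverflowError
except OverflowError:
    pass
```

A standard numerical guard is to compute 1 / (1 + exp(-eta)) instead: for very large eta, exp(-eta) underflows harmlessly to 0.0 and the prediction becomes 1.0 without overflow. That, however, does not fix the underlying statistical problem of inflated estimates under sparsity, which is what the modified forgetting framework targets.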
The proposed modified forgetting framework addresses this problem by scaling λ closer to 1 during periods of sparsity, and hence basing its parameter updates on longer spans of observations.

Figure 4.6.: Example of breakdown for original forgetting framework: cell_526

4.2.3.2. Evaluation of forgetting factors

Given the central importance of the concept of forgetting, this section is dedicated to evaluating the different forgetting strategies described in sec. 3.4.1, as well as different forgetting constants (λ < 1), so as to obtain the best fit and prediction capability. Note that, when using the modified forgetting framework, the additional parameter w (defining the window upon which local sparsity is estimated) is set to 10 throughout this work: it was empirically found to be a suitable value. First we consider the simplest strategy, that of a fixed λ, using the original forgetting framework.

Lambda   AUC       G-mean
1.000    0.94746   0.88126
0.999    0.97700   0.92302
Table 4.6.: Evaluation of Simple Forgetting Strategy

Lambda   AUC       G-mean
0.99     0.97009   0.91377
0.95     0.99656   0.98119
0.90     0.99930   0.99508
0.85     0.99954   0.99649
0.80     0.99963   0.99688
0.75     0.99956   0.99657
Table 4.7.: Evaluation of Adaptive Forgetting Strategy

Lambda                                       AUC       G-mean
Multiple: 1, 0.99, 0.95, 0.90, 0.85, 0.80    0.99972   0.99867
Table 4.8.: Evaluation of Multiple Adaptive Forgetting Strategy

From Table 4.6, it can be observed that, out of the two considered λ values, the best result is obtained by using λ = 0.999.
As this strategy was implemented via the original forgetting framework (because a fixed λ is wanted), the process collapses for λ values lower than 0.999, and hence lower values were not explored for this approach. The results from Table 4.6 nonetheless give a hint that a more local fit is probably preferable - giving relatively higher weights to more recently observed data points. From Table 4.7, one can observe that, using the extended forgetting strategy proposed by McCormick et al. (2012) coupled with the modification proposed in this thesis, we are able to improve considerably on the results of the fixed λ. The forgetting factor which results in the best prediction capability is λ = 0.80, obtaining an AUC of 99.963% and a G-mean of 99.688%. This again implies that a local fit is preferable to a global one. A λ value of 0.80 implies that an observation occurring 10 time-points back is assigned approximately 1/10th of the weight of the most recent observation. In Table 4.8, results for the third strategy of multiple λ's are displayed. It is found that extending the number of λ's to evaluate at each iteration leads to a marginal improvement in this case (AUC = 99.972% and G-mean = 99.867%). This improvement comes at the cost of slower computation, such that a trade-off has to be made. Since the improvement is only marginal in this case, the extension may not be worth the computational cost. However, if the degree of change itself varies over time, multiple λ's may be worthwhile. A final comment is that, using the original (adaptive) forgetting framework proposed by McCormick et al. (2012), the lowest λ value that could be used was λ = 0.98, which resulted in AUC = 99.688% and G-mean = 98.234%. Hence, the modified forgetting framework, in addition to being able to include more covariates, is able to outperform the original approach on this data.

4.2.3.3.
Extension with Dynamic Model Averaging

Two approaches for the construction of candidate models are considered: (i) one candidate model per "variable group", and (ii) one candidate model for each possible combination of "the most interesting variable groups". It should be emphasized that the sets of variables and variable groups differ between (i) and (ii): the former comprises all 122 variables (22 variable groups), whilst the latter contains only 92 variables and 6 variable groups. Let us begin by considering the former:

Strategy 1

First, the model forgetting factor, α, is considered (holding the within-model forgetting factor, λ, constant):

Lambda  Alpha  AUC    G-mean
0.99    0.99   0.881  0.798
0.99    0.95   0.917  0.839
0.99    0.90   0.931  0.855
0.99    0.85   0.935  0.860
0.99    0.80   0.937  0.862
0.99    0.75   0.937  0.863
0.99    0.70   0.937  0.863
Table 4.9.: Evaluation of alpha for DMA

From Table 4.9, one can observe that as the α value is lowered, the predictive capability of the model steadily increases. This reflects, on the one hand, that we have many small models that by themselves may not be very predictive, and on the other hand, that these models discriminate the data with varying quality over the span of the time-series: i.e. one variable group (candidate model) may explain the data relatively well at one point in time, but not at another. The gains in predictive capability from a lowered α take a decaying form, and stop around 0.75 (the AUC is marginally lower at α = 0.70).
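The role of α can be seen directly in the DMA prediction step of McCormick et al. (2012), where each candidate model's posterior probability is raised to the power α and renormalized; a minimal sketch:

```python
def forget_model_probs(pi, alpha):
    """DMA prediction step: pi_k <- pi_k^alpha / sum_l pi_l^alpha.
    alpha = 1 leaves the model probabilities unchanged; lowering alpha
    flattens them, letting weight shift faster between candidate models."""
    powered = [p ** alpha for p in pi]
    total = sum(powered)
    return [p / total for p in powered]

forget_model_probs([0.9, 0.1], alpha=1.0)  # unchanged
forget_model_probs([0.9, 0.1], alpha=0.1)  # much flatter distribution
```

This is why a low α helps when different variable groups dominate in different periods: the flattened prior lets the data quickly re-elect whichever candidate model currently predicts best.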
As such, we move on to the second parameter, the within-model forgetting factor, λ (holding the model forgetting factor, α, constant):

Lambda  Alpha  AUC    G-mean
1.00    0.75   0.928  0.847
0.99    0.75   0.936  0.863
0.98    0.75   0.945  0.875
0.97    0.75   0.950  0.883
0.96    0.75   0.955  0.890
0.95    0.75   0.958  0.895
0.94    0.75   0.960  0.897
0.93    0.75   0.962  0.899
0.92    0.75   0.963  0.902
0.91    0.75   0.965  0.905
0.90    0.75   0.966  0.907
0.89    0.75   0.967  0.909
0.88    0.75   0.968  0.911
0.87    0.75   0.969  0.913
0.86    0.75   0.970  0.913
0.85    0.75   0.970  0.913
Table 4.10.: Evaluation of lambda for DMA

In Table 4.10 it can be seen that the comments made for α also apply to λ: as the forgetting factor is lowered, the overall performance increases. As previously described, what this translates into, in practice, is that the candidate models adapt to local behaviors through rapid and local updating of the coefficient estimates. It should be noted that some trouble was encountered here when lowering the λ value (even with the modified forgetting factor). These problems were limited to specific candidate models, and as such, a specific (higher) λ value was set for those.

Strategy 2

In the second strategy, we construct candidate models by considering all possible model-combinations of "the most interesting variable groups". Selecting six variable groups translates into 2^6 = 64 candidate models. In Table 4.11 these variable groups are displayed. Given the encountered limitations of the previous strategy, we here set a rather conservative λ of 0.95, to ensure stability.
Variable Groups   Variables
tProc             t_proc1, ..., t_proc37
Cell ID           cell_1, ..., cell_17
GCP               X1, ..., X23
Radiolink         Radiolinkrequest, ..., Radiolinkfailure
UeRcid            PS, CS, SRB, Mixed
EvID              e1a, e1d, ..., e1f
Table 4.11.: Variable groups

Model       Alpha  AUC    G-mean
DMA         0.99   0.983  0.939
DMA         0.90   0.990  0.954
DMA         0.80   0.993  0.961
DMA         0.70   0.994  0.965
DMA         0.60   0.995  0.968
DMA         0.50   0.995  0.970
DMA         0.40   0.996  0.971
DMA         0.30   0.996  0.972
DMA         0.20   0.996  0.973
DMA         0.10   0.997  0.974
DMA         0.01   0.997  0.974
Full Model  –      0.990  0.954
Table 4.12.: Strategy 2: Evaluation of alpha for DMA

From Table 4.12 it can be observed that as the model forgetting factor α is lowered, the classification capability of DMA monotonically increases; it matches the single (full) model at α = 0.90 and outperforms it for lower α values. The gains in AUC from lowering α gradually decay, and at α = 0.10 only the 5th decimal changes, hence we stop there. These results imply that the variable groups have a non-constant, varying degree of importance over the time-period, and that being able to shift more weight to models excluding the less relevant variables is rewarding. Decreasing α as low as 0.10 has the effect of flattening the distribution of the model indicator quite extensively, and as such, all candidate models are assigned relatively low weights and the prediction of DMA becomes more of an averaged prediction over many candidate models, rather than a few. Whilst presenting promising results, this approach has the downside of having to maintain 64 candidate models rather than 1, implying a hefty decrease in computational speed. However, as the candidate models are updated independently of one another, it is possible to parallelize this process, so as to reduce the computational constraints.

4.2.4. Summary of results

In Table 4.13, the best results for each of the considered approaches are presented.
Model          AUC     G-mean
Single DLR     0.9996  0.9969
Group DMA      0.9965  0.9737
Dynamic Trees  0.9341  0.8635

Table 4.13.: Summary of results

It can be observed that the single dynamic logistic regression is the clear winner: it obtained the highest AUC and G-mean scores. Recall that, in contrast to the single dynamic logistic regression and the dynamic trees model, the Group DMA model consists of only 92 variables, divided into 6 variable groups. This is worth underscoring, since we concluded in sec. 4.2.3.3 that the Group DMA outperformed the single model. From Table 4.13, one can also observe the rather large difference in predictive capability between the models centered around the dynamic logistic regression and the dynamic trees. One possible explanation for this is that the process of updating the tree structure may not be as rapid as the process of updating the parameters in the dynamic logistic regression. To demonstrate this point, consider Figure A.1: the covariate cell_id220524 was included in 100% of the N particles (trees) between time-points 4200 and 9800. If we next consider Figure A.2, where the reduction in entropy for the same covariate is displayed, one may observe that the period for which the reduction in entropy is large (4200-6000) is considerably shorter than the period for which the covariate is included in the trees (4200-9800).

4.2.5. Static Logistic Regression vs. Dynamic Logistic Regression

The two classifiers selected for this thesis have in common that they are dynamic and updated online. In this subsection, the question of whether a dynamic model is preferable to a static one is evaluated in terms of predictive capability over the considered period. Seven scenarios are considered; what differentiates them is the size of the training batch relative to the test set: from {10% training, 90% test} to {90% training, 10% test}.
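The AUC and G-mean figures used throughout these comparisons can be computed with short helpers. Below is a minimal pure-NumPy sketch (rank-based AUC, assuming no tied scores); the function names are ours, not from the thesis:

```python
import numpy as np

def auc(y_true, scores):
    """Rank-based AUC: the probability that a randomly chosen dropped
    call (y=1) is scored above a randomly chosen normal call (y=0).
    Assumes no tied scores."""
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = y_true.sum()
    n_neg = len(y_true) - n_pos
    # Mann-Whitney U statistic, normalized
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def gmean(y_true, y_pred):
    """Geometric mean of sensitivity (TPR) and specificity (TNR),
    a balance-aware score suited to the class-imbalanced drop data."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tpr = np.mean(y_pred[y_true == 1] == 1)
    tnr = np.mean(y_pred[y_true == 0] == 0)
    return np.sqrt(tpr * tnr)
```

A perfectly separating classifier attains an AUC and G-mean of 1.0 under both helpers.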
For all of the scenarios, a static logistic regression is fitted on the training period, and then used for predicting the incoming calls of the next period. The results are presented in Table 4.14.

Training set proportion  AUC
0.100                    0.846
0.200                    0.855
0.333                    0.878
0.500                    0.881
0.667                    0.900
0.800                    0.911
0.900                    0.880

Table 4.14.: AUC Static Logistic Regression

It can be observed that the static classifier performs gradually better as it is fed more data: the highest AUC score is obtained when 80% of the data is in the training set (AUC = 91.11%). In comparison to the dynamic logistic regression, the performance is considerably worse: recall that the best dynamic logistic regression obtained an AUC of 99.96%. In addition to the scenarios described above, in which data is accumulated sequentially, let us consider a scenario in which all the data is available. First, we randomly assign 70% of the observations to a training set (without accounting for the sequential order of the data), and use this data to fit a logistic regression model. Secondly, we use this fitted model to predict the outcomes of the remaining 30% of the data (the test set). Following steps 1 and 2, one obtains predictions that result in an AUC of 93.8%. In other words, an improvement over the sequential modeling scheme of the static classifier, but still far worse than the dynamic logistic regression.

4.3. Online drop analysis

Besides resulting in actual predictions, the chosen classifiers have - as previously mentioned - the benefit of leaving varying degrees of "traces" (not to be confused with the covariate) as to how these predictions were made. The idea of this part of the results, which is termed online drop analysis, is to explore these "traces", and in particular what they can tell us about periods with an abnormal number of dropped calls.
Having randomly undersampled the data as if it arrived online (according to the concluded best sampling rate: 50/50), the call-drop time series from Figure 4.1 is transformed so as to appear like Figure 4.7.

Figure 4.7.: Number of dropped calls, as divided into 50 equally large time-ordered subsets - based on the undersampled data (ORUS 50/50)

In the previous section, it was shown that the dynamic logistic regression clearly outperformed the dynamic trees model, and hence this section will be centered around the use of the dynamic logistic regression, although the results of the DT are supplied for comparative and confirmatory purposes. As previously mentioned, two approaches were considered in McCormick et al. (2012) and Koop and Korobilis (2012) for studying the temporal significance of covariates and how the conditional relationships change over time: the first considers posterior inclusion probabilities, and the second considers odds-ratios. Both of these approaches are considered in this section.

4.3.1. DMA posterior inclusion probabilities

As mentioned in sec. 3.4.2, candidate models are - in this work - not constructed on the basis of individual variables, but from variable groups, such that we do not consider all possible combinations of variables, but rather all possible combinations of variable groups: 2^6 = 64. A DMA model was set up with the parameter settings that were found to produce the best results in sec. 4.2.3.3. In Figure 4.8 and Figure 4.9, posterior inclusion probabilities for these variable groups over the considered period are displayed. A first broad observation is that the inclusion probabilities are quite volatile for all of the variable groups. This is a byproduct of setting a low α value, since this parameter controls how rapid the dynamic updating of the model probabilities should be.
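The recursion behind these quantities can be sketched compactly: the model probabilities are flattened by the forgetting factor α and re-weighted by each candidate model's predictive likelihood, and the inclusion probability of a variable group is the total probability of the models that contain it. The following is a minimal illustration in this spirit; the likelihood values and names are toy assumptions, not results from this data:

```python
import numpy as np
from itertools import product

# 6 variable groups => 2^6 = 64 candidate models, one per inclusion vector
MODELS = np.array(list(product([0, 1], repeat=6)))

def dma_step(model_probs, pred_liks, alpha):
    """One DMA step: forgetting flattens the model probabilities
    (prediction), which are then re-weighted by each candidate model's
    predictive likelihood (update)."""
    pi = model_probs ** alpha
    pi /= pi.sum()
    post = pi * pred_liks
    return post / post.sum()

def inclusion_probs(model_probs):
    """Posterior inclusion probability of group g = total posterior
    probability of the candidate models that include group g."""
    return MODELS.T @ model_probs

# Toy run: models containing group 5 predict best, so its inclusion
# probability should rise well above the uniform starting value of 0.5
probs = np.full(64, 1 / 64)
liks = np.where(MODELS[:, 5] == 1, 0.9, 0.3)
for _ in range(30):
    probs = dma_step(probs, liks, alpha=0.5)
incl = inclusion_probs(probs)
```

A lower α flattens `pi` more aggressively at each step, which is exactly the source of the volatility noted above.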
It may further be noted that none of the individual candidate models assumes a posterior model probability higher than 0.5 for any notable length of time. Nor does any of the variable groups assume a posterior inclusion probability of 1 or 0 for the whole period. The two variable groups with the overall lowest inclusion probabilities are Variable Group 1 (Trace 4 Procedures) and Variable Group 4 (UeRcid). The two variable groups with the highest overall inclusion probabilities are Variable Group 5 (Radiolink) and Variable Group 2 (GCP). A more specific, and possibly more interesting, observation can be made regarding Variable Group 6 (Cell IDs). In Figure 4.9 one can observe that for 2-3 periods (one around 4500, one around 9000, and one around 11500) the inclusion probabilities remain close to 1, in a way that is not observed for the rest of the time series.

Figure 4.8.: Posterior inclusion probabilities: Variable Group 1: Trace 4 Procedures || Variable Group 2: GCP || Variable Group 3: evID

Figure 4.9.: Posterior inclusion probabilities: Variable Group 4: UeRcid || Variable Group 5: Radiolink || Variable Group 6: Cell ID

4.3.2.
Evolution of odds-ratios and reduction in entropy

In this subsection, the temporal significance of covariates is analyzed by considering the evolution of the odds-ratios, as well as the posterior model probabilities, from the dynamic logistic regression and the univariate scanner. The reduction in entropy from the DT model is also considered, as a way to confirm the main results. To explore all of the covariates individually is beyond the scope of this thesis. As such, we limit the analysis to one interesting period, or more specifically, a period with an abnormal call-drop ratio. It is worth emphasizing that whilst we here first select a period, and only then evaluate which variables were important, this order of events is not required. That is, one could just as well have assumed that no knowledge of the actual call-drop ratio was possessed, and instead monitored the actual evolution of the output from the considered models. Before considering the "interesting period", we shall first compare the degree of insight from the (full) single dynamic logistic regression to that of the univariate DMA (also referred to as the "univariate scanner"), so as to determine which to use for analyzing the "interesting period".

4.3.2.1. Single Dynamic Logistic Regression vs. Univariate DMA

As previously described, in the single dynamic logistic regression all of the covariates are included in the same model, whilst in the univariate DMA (the "univariate scanner") there are as many candidate models as there are covariates, with one covariate in each. An important difference between these two approaches concerns the forgetting factor: in the full model, we set a common forgetting factor for all of the covariates, whilst in the univariate case, we allow each covariate to have its own forgetting factor. A priori, this suggests that the latter allows for a more precise recursive estimation for each of the covariates.
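The per-covariate recursion that such a forgetting factor controls can be sketched in a few lines. The following is a minimal single-covariate illustration in the spirit of the Laplace-approximation updating of McCormick et al. (2012); the function, variable names, and toy data are ours, and the exact updating equations of the thesis may differ:

```python
import numpy as np

def dlr_step(theta, sigma2, x, y, lam):
    """One recursive update of a univariate dynamic logistic regression
    with forgetting factor lam. In the 'univariate scanner', each
    covariate runs its own copy of this recursion with its own lam.

    Prediction: inflate the prior variance to sigma2/lam, so a smaller
    lam forgets old data faster. Update: one Newton/Laplace correction
    for the new observation (x, y)."""
    r = sigma2 / lam
    p = 1.0 / (1.0 + np.exp(-theta * x))          # predicted drop probability
    sigma2_new = 1.0 / (1.0 / r + x * x * p * (1.0 - p))
    theta_new = theta + sigma2_new * x * (y - p)  # gradient step on log-odds
    return theta_new, sigma2_new

# Toy stream where the covariate's effect switches sign halfway through,
# mimicking a change in drop causes
rng = np.random.default_rng(2)
theta, sigma2 = 0.0, 1.0
for t in range(400):
    x = float(rng.integers(0, 2))
    b = 2.0 if t < 200 else -2.0
    y = float(rng.random() < 1.0 / (1.0 + np.exp(-b * x)))
    theta, sigma2 = dlr_step(theta, sigma2, x, y, lam=0.95)
```

With λ = 0.95, the effective memory is on the order of 20 observations, so the log-odds estimate `theta` tracks the sign switch instead of averaging over the whole stream.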
To evaluate whether this is the case, we compare the recursive estimates from the single dynamic logistic regression that was found to obtain the best results in sec. 4.2 (λt = 0.80) to those of a univariate DMA (with λt = 0.95). One such comparison is displayed in Figure 4.10 and Figure 4.11.

Figure 4.10.: Log-odds from the single dynamic logistic regression: "GCP 000011000000001100001000"

Figure 4.11.: Log-odds from the "univariate scanner": "GCP 000011000000001100001000"

Figure 4.10 and Figure 4.11 demonstrate a general finding: the univariate DMA is indeed able to update the coefficient estimates more precisely. For instance, note that the confidence band in Figure 4.11 is narrower than that in Figure 4.10, reflecting the ability of the former not to update unnecessarily. The degree of similarity between the recursive estimates of the two approaches is linked to the posterior model probabilities in the univariate DMA, such that if the posterior model probability - for a particular covariate - is large at a particular time-point, the update (at that time-point) of the single dynamic logistic regression is likely to be similar to that of the univariate DMA. For instance, consider Figure A.3 and Figure A.4 in the Appendix. This is reasonable, since what determines whether forgetting should be applied or not at a particular time-point is the predictive likelihood, and hence, if there is a covariate that is predominant at this time-point, it will also have a great impact on the predictive likelihood in the full model. The increased precision of the univariate DMA comes at the cost of computational speed. Given the analysis of the previous paragraph, the univariate DMA is used as the tool for obtaining log-odds and odds-ratios for the remainder of this section.

4.3.2.2.
Interesting period

The period under consideration is that which occurs between time-points 4200 and 6400, corresponding to 14-23 in Figure 4.7 (i.e. the second period of high drop rate). In Table 4.15, the covariates that were found to attain a significant positive effect during this period are listed. Time-series plots of the recursive coefficient estimates for these covariates are found in sec. A.2.2.

Coefficient                   Time-period
cell_id220524                 4200-6400
X14                           4200-5200
X15                           4200-6400
X16                           4200-5200
X21                           4200-6400
X22                           4200-5200
X23                           5200-6400
GCP 000011000000001100001000  4200-5200
GCP 000011000000011100011100  4200-5200
GCP 000000000000011000011011  5200-6400
GCP 000000000000001000011011  5200-6400
PS                            4200-6400
radiolinksetupfailurefdd      5200-6400

Table 4.15.: Significant coefficients during the period of consideration: 4200-6400

As one may observe from Table 4.15 (or from sec. A.2.2), some of the covariates are only relevant for a certain part of this period. More specifically, time-point ~5200 appears to be a division point. Hence, this period can be thought of as consisting of two sub-periods. If we take a look at Figure 4.7, this appears reasonable: at time-point 18 there is a noticeable increase (and this corresponds to time-point ~5200 in the figures displayed in sec. A.2.2).

Sub-period 1

Considering the evolution of the odds-ratios for the first sub-period, there is one covariate that presents a particularly interesting behavior, and that is cell_id220524. Let us therefore consider it in a bit more detail. Recall that cells define the geographical area in which a call is made. For most of the (full) period, i.e. 1-14200, this covariate is insignificant, with an odds-ratio hovering around 1, but for the particular period under concern, the odds-ratio shoots up significantly, easily passing the previously mentioned rule-of-thumb of > 3. See Figure 4.12.
Exploring this covariate in this particular sub-period more closely, one finds that 52.7% of the calls were made from cell_id220524, and that out of these, 98.5% were dropped. This can be compared to the prior period (1-4200), in which 5% of the calls were recorded for this cell, and out of these, only 15% were dropped. Considering the posterior model probabilities for the (univariate) candidate model consisting of this covariate, one may observe from Figure 4.13 that DMA has assigned 100% posterior model probability to this candidate model during this period. Finally, if we consider the measure of reduction in entropy obtained from the dynamic trees, as displayed in Figure 4.14, one can observe that it replicates Figure 4.12 and Figure 4.13 quite well.

Figure 4.12.: Odds-ratios: Cell 220524

Figure 4.13.: Posterior Model Probabilities: Cell 220524

Deriving association rules for the first sub-period (4200-5200), one obtains rules that confirm the significance of the covariates. One finding, for instance, is that ~20% of the calls during this period were made from phones with the particular GCP combination "000011000000011100011100", and out of these, 99.58% were terminated unnaturally. Another finding concerns the covariate PS: 52.3% of the calls originated from phones transmitting data, and out of these, 86.7% dropped.

Sub-period 2

Considering the evolution of the odds-ratios for the second sub-period, there is again one covariate that presents a particularly interesting behavior; in this case it is radiolinksetupfailurefdd. For the greater part of the full period, this covariate mostly registers 0-values, but during the particular sub-period of concern, it registers a lot of 1's, indicating its presence in the calls.
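The percentages reported in these sub-period analyses are support/confidence-style statistics for rules of the form antecedent => dropped. A minimal sketch of how such statistics can be computed (the data layout and names are illustrative, not the thesis's association-rule implementation):

```python
def rule_stats(calls, antecedent):
    """Support and confidence of the rule `antecedent => dropped`.
    `calls` is a list of (features: dict, dropped: bool) pairs;
    `antecedent` is a dict of feature values that must all match."""
    matching = [dropped for feats, dropped in calls
                if all(feats.get(k) == v for k, v in antecedent.items())]
    support = len(matching) / len(calls)                      # share of calls matching the antecedent
    confidence = sum(matching) / len(matching) if matching else 0.0  # drop rate among matches
    return support, confidence

# Toy example: 3 of 4 calls match cell "A", and 2 of those 3 dropped
calls = [({"cell": "A"}, True), ({"cell": "A"}, True),
         ({"cell": "B"}, False), ({"cell": "A"}, False)]
support, confidence = rule_stats(calls, {"cell": "A"})
```

Applied per sub-period, this yields exactly the kind of "X% of calls matched, Y% of those dropped" summaries quoted above.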
In Figure 4.15, the evolution of the odds-ratio for this covariate is presented. Furthermore, in Figure 4.16 the posterior model probabilities for the candidate model representing this covariate are displayed. From Figure 4.15, we can observe that the odds-ratio shoots up significantly around ~5200. One may further notice that the estimate then stabilizes around ~70 for the remainder of the period. This is because very few subsequent observations having this attribute are encountered, and hence the coefficient is not updated. From Figure 4.16, it can be seen that DMA has assigned a posterior model probability of 100% for the greater part of this sub-period.

Figure 4.14.: Reduction in Entropy: Cell 220524

Figure 4.15.: Odds-ratios: Radiolinksetupfailurefdd

Furthermore, in Figure A.12 the reduction in entropy for this covariate is displayed, and it can be observed that it replicates the two aforementioned plots quite well. Exploring this covariate more closely, one finds that 74.6% of the calls during this sub-period had a radiolink setup failure (radiolinksetupfailurefdd = 1), of which 98.8% dropped. Deriving association rules for the second sub-period (5200-6400), one again finds rules that confirm the significance of the covariates. During this sub-period, we find that two particular GCP combinations, "000000000000011000011011" and "000000000000001000011011", are relevant and correlate with dropped calls to a great degree. They represent 18.1% and 27.8%, respectively, of the calls during this sub-period, and out of these, 99.5% and 98.5%, respectively, were dropped.

4.3.3. Static Logistic Regression vs. Dynamic Logistic Regression

In sec. 4.2.5, the question of whether a dynamic approach is preferable to a static one was evaluated in terms of predictive capability.
In this section, this question is considered in terms of the degree of insight about variable effects.

Figure 4.16.: Posterior Model Probabilities: Radiolinksetupfailurefdd

In Table B.5 in the Appendix, one finds the coefficient estimates obtained from the static logistic regression model fitted on 70% of the dataset, without accounting for the order dependence. Below, a few examples where the results of the two approaches do not align are presented.

Cell_id220517

As one may observe from Table B.5, using the static logistic regression, the coefficient for this covariate has been estimated to −0.528371 (with a p-value of 0.022355), implying a significant negative effect. If we instead consider the recursive estimates obtained from the dynamic logistic regression, displayed in Figure A.17, one can observe that whilst this covariate presents a significant negative effect for some sub-periods, it also presents periods of significant positive effect.

Cell_id220518

From Table B.5, it can be seen that the coefficient for this covariate has been estimated to −0.285032 (with the static logistic regression), however with a p-value of 0.17, and hence, by most standards, it would be considered insignificant. If we take a look at the recursive estimates displayed in Figure A.18, one can see that the coefficient is indeed insignificant for large sections of the period, but that for a few sub-periods it attains (both positive and negative) estimates that are at least ±2 standard errors away from 0.

X23

In Table B.5, one can observe a coefficient estimate of −0.506021 (with a p-value of 0.000606), implying a significant negative effect.
Considering the recursive estimates from the dynamic logistic regression, displayed in Figure A.10, one may indeed observe that for large parts of the period the coefficient is estimated to a negative value, but that for a sub-period of approximately 1000 time-points, the coefficient is estimated to a significant positive value (where it reaches an odds-ratio of ~15).

4.3.3.1. Summary

For the three examples described above, the common theme is that the static framework does not capture temporal behaviors, and as a consequence, the resulting estimated effects may be misleading to interpret. For instance, to interpret the effect of "X23" as "significantly negative over the considered period", although methodologically correct, may be misleading in practical applications. There are dozens of additional cases like those described above. These results may be seen as a partial explanation of why the dynamic logistic regression also proved to perform more strongly in terms of predictability.

5. Discussion

To our knowledge, this thesis is the first to approach the problem of analyzing drop causes in mobile networks using online learning classification techniques. This approach was motivated by, on the one hand, the availability of class labels, and on the other, the assumption that data is non-stationary and that what correlates with a particular class may change over time. A natural approach would otherwise have been to consider this problem as an anomaly detection problem, in which abnormal periods could be identified by large increases in the number of dropped calls. By instead framing the problem as an online classification problem, one arguably addresses the core of the problem more directly, in that one does not necessarily have to first detect a suspicious period in order to detect changes in drop causes.
In its original format, the data consisted of very large .txt files, such that no direct application of statistical or machine learning methods was possible. Consequently, quite a lot of time was initially spent on processing the data using various text mining techniques. The collective size of the .txt files was enormous, such that pre-processing on a regular computer ran into memory issues. The decision was made to limit the analysis to one STP, and further to apply sampling techniques in the parsing step, such that only a limited number of observations (calls) had to be pre-processed. Making the parsing scripts more efficient, and possibly running them on a more powerful computer, would increase both the possible scope of the analysis and the practical applicability of the proposed approach; this is left for future work. A lot of effort was invested in trying to understand the data, as well as in exploring which methods had been used previously to analyze similar data. One characteristic of the data that was of particular focus initially is that every observation has a sequential structure: every call has a beginning and an end, and in between these two events, data are successively recorded - denoted by time-stamps. As such, an initial idea was to use sequential pattern-mining methods for the purpose of detecting new behavior in the data. However, after having explored simpler (static) classification techniques on unordered data from the logs, it became clear that the sequential structure may not be as important as initially expected (for discriminating between normal and dropped calls). As such, the problem was instead defined as a classification problem, in which another sequential aspect was emphasized: that between logs, rather than within logs.
On the basis of the characteristics of the data (high-dimensional, sequentially arriving, and non-stationary) and the objective of the thesis (to explore discriminative features), four criteria were set up to determine the specific classification techniques to be used. From the literature review that was performed, it was found that these criteria quite drastically narrowed the space of apt classifiers. In the end, (dynamic) extensions of the logistic regression and of partition trees were selected, both satisfying the four criteria relatively well. More specifically, considering the former first, two dynamic extensions of the logistic regression were considered: the (single) dynamic logistic regression, and a further extension of it that also accounts for model uncertainty through an extension of BMA (DMA). The initial expectation was that the latter would pose a stronger alternative in terms of performance. It was however found that the standard approach of considering all possible variable combinations was not computationally feasible for this data, due to the high dimensionality as well as the long time series. An interesting extension of DMA was proposed in Onorante and Raftery (2014) to address the issue of large model spaces, using the concept of Occam's window to reduce the number of possible models considered at every time-point. This approach was however ultimately disregarded, as (i) it is suited to shorter time series, and (ii) it only occasionally tests for the inclusion of candidate models from the larger model space, and as such does not align well with the concept of detecting change in drop causes. Instead, two alternative strategies for constructing candidate models were considered.
Whilst showing strong performance, neither could outperform the single dynamic logistic regression, which was shown to perform excellently on this data: the best one resulted in an AUC of 99.96% and a G-mean of 99.7%. The other (online learning) classification technique considered in this thesis was the Dynamic Trees. A careful evaluation of the model parameters was performed to derive the best DT. Just as with the dynamic logistic regression, it was found that λ < 1 improved the results, implying that a local fit is preferable to a global one. In terms of predictive capability, this technique did not perform as well as the dynamic logistic regression: the best DT was shown to obtain an AUC and G-mean of 93.41% and 86.35%, respectively. Figure 4.5 displays the performance of the DT over the considered period, and as previously noted, several degradations in performance are present, implying that the tree structure was not able to adapt as quickly as needed. Such degradations were not found for the dynamic logistic regression. The performance of the best dynamic logistic regression classifier was further compared to that of the standard (static) logistic regression in two experimental setups, in which (i) data were gradually observed, and (ii) all data were available. In both cases, the dynamic logistic regression was shown to outperform the static logistic regression by comfortable margins, supporting the hypothesis of non-stationarity as well as motivating the dynamic extension. In addition, it may also be worth noting that ANN and SVM - known for their ability to classify complex and high-dimensional data - were also applied to this data (the full dataset); it was found that neither could beat the performance of the dynamic logistic regression. These are promising results, as the covariates extracted from each call are of such a type that extending the proposed approach to early prediction seems feasible.
This could be an interesting approach to explore in the future. To the three main characteristics of the data listed above, one may add a fourth: temporal sparsity. This turned out to cause problems for the dynamic logistic regression, where the original method failed to converge for about a quarter of the covariates. To address this problem, this work presented a modification of the forgetting framework originally proposed by McCormick et al. (2012), which in addition to the predictive likelihood also considers local sparsity. This modification was shown to allow the inclusion of multiple variables that could not be used with the original forgetting framework. An evaluation in terms of predictive capacity was also performed, which showed that the modified framework achieved a slightly stronger performance. The basic idea of this modification - that during periods of sparsity a lower degree of forgetting is applied, so that the parameter update is based on a longer span of data - appears intuitive. However, whilst allowing more attributes to be included and a lower λ to be set, this modification did not completely solve the issue: it was found that under some circumstances, the framework still had problems with convergence when λ was set very low. As such, when stability rather than maximum predictability is the objective, a more conservative selection of λ is likely preferable. By scaling the forgetting factor closer to 1 during periods of sparsity - for a particular covariate - one also runs the risk of not capturing some interesting local behavior during such periods. The modification further introduced an additional parameter, defining the width of the window in which sparsity should be considered - and everything else equal, more parameters are not to be preferred. Further refinement and evaluation of this modification is needed, and is left for future work.
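The core idea - scaling λ toward 1 when a covariate has been locally sparse - can be sketched roughly as follows. The linear interpolation, window handling, and names here are our illustrative assumptions, not the exact form of the modification proposed in the thesis:

```python
def effective_lambda(recent_values, lam=0.9, window=50):
    """Sparsity-aware forgetting factor (illustrative sketch).

    activity = share of non-zero observations of this covariate in the
    last `window` time-points. Fully active (activity=1) => use the base
    lam; fully sparse (activity=0) => no forgetting (lambda=1), so the
    update draws on a longer span of data."""
    recent = recent_values[-window:]
    activity = sum(1 for v in recent if v != 0) / max(len(recent), 1)
    return 1.0 - activity * (1.0 - lam)
```

The extra `window` parameter is exactly the additional tuning burden noted above: too short a window makes the effective λ noisy, too long a window delays the return to aggressive forgetting once the covariate becomes active again.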
Regarding the part of the analysis concerned with the temporal significance of covariates, which we termed online drop analysis, it was demonstrated that the selected online learning classification techniques were able to provide good insights into which variables are important at different time-points. Two levels of granularity were considered: one centered around 'variable groups', and the other focusing (as in the standard case) on the 'actual variables'. For the former, the posterior inclusion probabilities obtained from DMA were analyzed, and for the latter, the evolution of the log-odds or odds-ratios. The scope of insights from the former was quite limited. Concerning the latter, an evaluation was performed to determine whether the 'univariate scanner' is more suitable than the single dynamic logistic regression. It was found that the 'univariate scanner', through its variable-specific forgetting, could describe the evolution of the coefficient estimates more precisely. In the case of the 'univariate scanner', in addition to evaluating the log-odds or odds-ratios, we also analyzed the posterior (univariate) model probabilities. Finally, for the dynamic trees, the reduction in entropy was studied. Using these approaches, two sub-periods of abnormal call-drop rates were analyzed, and several interesting findings were made. One, for instance, is that the geographical area from which calls were made (the cells) played a key role in the first sub-period. Furthermore, it was shown that the aforementioned approaches to analyzing temporal significance resulted in similar conclusions. Reflecting briefly on the effect of the forgetting factor in the context of online drop analysis, it may again be underscored that a lower λ has the effect of giving greater weight to more recently observed data points, and hence produces more volatile updates of the odds-ratios.
This increases the degree to which one is able to detect local temporal significance of covariates. However, to ensure stability of the system and convergence of the algorithm, λ values below 0.9 were not considered for the univariate scanner. To further refine the modified forgetting factor, and hence potentially allow for greater granularity in identifying local behavior, is left for future work. Since the objective of this thesis was to develop and demonstrate a framework rather than to extract specific covariates, collaboration with domain experts could improve the selection of variables to extract, so as to better serve the specific objectives of the troubleshooting team. Another extension of this approach could be to automate the "detection step", such that alerts would be triggered if, for instance, a covariate reaches a certain odds-ratio. In addition to reducing the computational burden of pre-processing, sampling techniques were also used for the purpose of addressing the issue of class imbalance. Sampling techniques generally have the positive effect of helping classifiers learn the minority class better. This was shown to be the case in this thesis as well. Specifically, two sampling techniques were evaluated: (i) online random undersampling, and (ii) adaptive online random undersampling, the latter developed in this thesis. Both were shown to increase the capability of the classifiers to correctly identify minority instances. This improvement does however come at the cost of potential information loss: by undersampling to such a great extent as done in this thesis, one runs the risk of missing out on potentially useful information. To evaluate to what degree this was a problem, sensitivity analyses were performed to ensure that different sampling rates resulted in approximately the same conclusions. This analysis did not reveal any noteworthy issues.
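One plausible way an online random undersampler toward a 50/50 balance can operate as a one-pass filter is sketched below. This is an illustration of the general idea only; the exact ORUS scheme used in the thesis may differ, and all names are ours:

```python
import random

def orus_stream(stream, seed=1):
    """Online random undersampling sketch toward a 50/50 class balance.

    Dropped calls (y=1, the minority) are always kept. A normal call is
    kept with probability n_min/n_maj, the running minority/majority
    ratio, so the number of kept majority examples tracks the minority
    count. Only two counters are maintained; no data is stored."""
    rng = random.Random(seed)
    n_min = n_maj = 0
    for x, y in stream:
        if y == 1:
            n_min += 1
            yield x, y
        else:
            n_maj += 1
            if rng.random() < n_min / n_maj:
                yield x, y

# Toy stream with ~10% dropped calls
rng = random.Random(0)
raw = [(t, int(rng.random() < 0.1)) for t in range(20000)]
kept = list(orus_stream(raw))
balance = sum(y for _, y in kept) / len(kept)
```

Under a roughly constant drop rate r, the expected number of kept majority examples is close to the minority count (each of the ~(1 − r)N majority arrivals is kept with probability ≈ r/(1 − r)), which is what yields the approximate 50/50 output balance.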
Additional evaluation of this aspect would nonetheless be rewarding and is left for future work. In terms of implementation, online learning has the advantage of not requiring the storage of any data, which is valuable given that the amount of data Ericsson accumulates every day is enormous. The DMA approach can furthermore be parallelized, since each candidate model is updated independently of the others, which speeds up computation considerably.

6. Conclusions

This work presents an approach to analyzing dropped calls in mobile networks that is, to our knowledge, new. Compared to the static state-of-the-art approaches, the developed framework enables the detection of changes in drop causes without first determining a suspicious period and, secondly, does not require any storage of data. To address the issue of class imbalance, this thesis applied an online adaptation of the random undersampling technique, as well as an extension of it developed here. Whilst the developed extension did not succeed in improving on the results of the online random undersampler, both techniques were shown to significantly improve the discrimination of dropped calls compared to using the original data. Two online learning classification techniques, dynamic logistic regression and dynamic trees, were explored in this thesis. The former was shown to have considerable problems with temporally sparse covariates. To remedy this problem, this work proposed a modification to the forgetting framework originally developed by McCormick et al. (2012). The modification was shown both to allow for the inclusion of sparse covariates and to improve the overall classification capability. Having carefully evaluated the parameters of both models, the best dynamic logistic regression model was shown to achieve excellent results, with an AUC of 99.96% and a G-mean of 99.7%, whilst the best dynamic trees model achieved an AUC of 93.4% and a G-mean of 86.4%.
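For reference, the two headline metrics quoted above combine the true positive and true negative rates. A minimal self-contained sketch of how they can be computed (illustrative, not the evaluation code used in the thesis):

```python
import numpy as np

def g_mean(y_true, y_pred):
    """Geometric mean of the true positive and true negative rates."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tpr = np.mean(y_pred[y_true == 1] == 1)  # sensitivity on dropped calls
    tnr = np.mean(y_pred[y_true == 0] == 0)  # specificity on normal calls
    return np.sqrt(tpr * tnr)

def auc(y_true, scores):
    """AUC via the rank (Mann-Whitney) formulation: the probability that a
    randomly chosen positive is scored higher than a random negative."""
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    # Compare every positive score with every negative score; ties count half.
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return greater + 0.5 * ties
```

Because the G-mean multiplies the two class-wise rates, a classifier that ignores the rare dropped-call class scores near zero even when its raw accuracy is high, which is why it is reported alongside the AUC throughout this thesis.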
That is, the dynamic logistic regression was shown to achieve considerably stronger performance than the dynamic trees. To evaluate the choice of online learning, a comparison was also made to static logistic regression, which was found to achieve an AUC of between 80% and 92%, depending on the amount of data fed to the training set; this provides strong support for the online learning approach. In addition to showing that the online learning approach was able to predict mobile phone call data with great precision, this thesis also shows that the selected online learning techniques are able to provide useful insights regarding temporally important variables for discriminating dropped calls from normal calls. A comparison to static models was also made in terms of variable-importance insights. It was found that considering the dataset as "a whole", rather than sequentially, led to misleading effect interpretations due to the temporal nature of the system, further supporting the online learning approach. Whilst showing a lot of potential, the proposed approach needs to undergo further refinement and testing to ensure stability and confirm its practical use. There are several dimensions along which this work could be extended. A natural next step would be to consult domain experts and more carefully select which variables ought to be monitored. Another possibility is to evaluate whether the framework can be extended to address the task of early classification of dropped calls, as in Zhou et al. (2013).

A. Figures

[All figures in this appendix are plots over Time (x-axis) against the log-odds, proportion, or variable importance (y-axis); only the captions are reproduced here.]

A.1. Results: Online classification

Figure A.1.: Proportion of the N particles that include the covariate "Cell_id220524"
Figure A.2.: Reduction in entropy for covariate "Cell_id220524" from Dynamic Trees

A.2. Results: Online drop analysis

A.2.1. Single dynamic logistic regression vs. Univariate DMA

Figure A.3.: Log-odds for "Cell_id220524" from the single dynamic logistic regression
Figure A.4.: Log-odds for "Cell_id220524" from the "univariate scanner"

A.2.2. Significant covariates in interesting period

A.2.2.1. Log odds from the Dynamic Logistic Regression

Figure A.5.: Recursively estimated coefficient for "X14" from the dynamic logistic regression
Figure A.6.: Recursively estimated coefficient for "X15" from the dynamic logistic regression
Figure A.7.: Recursively estimated coefficient for "X16" from the dynamic logistic regression
Figure A.8.: Recursively estimated coefficient for "X21" from the dynamic logistic regression
Figure A.9.: Recursively estimated coefficient for "X22" from the dynamic logistic regression
Figure A.10.: Recursively estimated coefficient for "X23" from the dynamic logistic regression
Figure A.11.: Recursively estimated coefficient for "GCP 000011000000001100001000" from the dynamic logistic regression
Figure A.12.: Recursively estimated coefficient for "GCP 000011000000011100011100" from the dynamic logistic regression
Figure A.13.: Recursively estimated coefficient for "Cell_id220524" from the dynamic logistic regression
Figure A.14.: Recursively estimated coefficient for "radiolinkfailurefdd" from the dynamic logistic regression
Figure A.15.: Recursively estimated coefficient for "PS" from the dynamic logistic regression

A.2.2.2. Reduction in entropy from the Dynamic Trees

Figure A.16.: Reduction in entropy for "radiolinkfailurefdd" from the dynamic trees model

A.2.3. Static vs. Dynamic Logistic Regression: covariate effects

A.2.3.1. Dynamic Logistic Regression

Figure A.17.: Recursively estimated coefficient for "Cell_id220517" from the dynamic logistic regression
Figure A.18.: Recursively estimated coefficient for "Cell_id220518" from the dynamic logistic regression
Figure A.19.: Recursively estimated coefficient for "Cell_id220519" from the dynamic logistic regression
Figure A.20.: Recursively estimated coefficient for "Cell_id220521" from the dynamic logistic regression
Figure A.21.: Recursively estimated coefficient for "Cell_id220523" from the dynamic logistic regression

B. Tables

B.1. Results: Online classification

B.1.1.
Dynamic Trees

Table B.1.: Evaluation of tree prior alpha for Dynamic Trees

Alpha  Beta  TPR    TNR    AUC    G-mean
0.99   2     0.820  0.855  0.908  0.845
0.90   2     0.814  0.852  0.906  0.841
0.80   2     0.810  0.850  0.904  0.838
0.70   2     0.815  0.842  0.904  0.839

Table B.2.: Evaluation of tree prior beta for Dynamic Trees

Beta   Alpha  TPR    TNR    AUC    G-mean
1.75   0.99   0.811  0.854  0.908  0.840
2.00   0.99   0.820  0.855  0.908  0.845
2.25   0.99   0.815  0.855  0.905  0.843
2.50   0.99   0.811  0.848  0.901  0.836

Table B.3.: Evaluation of the effect of active pool size (w) on computational time (seconds) for Dynamic Trees

w     user.self  sys.self  elapsed
50     342.300    0.150     342.420
100    388.100    0.300     388.330
250    520.190    0.190     520.370
500    768.580    0.110     768.660
1000  1057.950    0.170    1058.130
2000  1796.790    5.360    1802.110
4000  2489.360   27.440    2516.830

B.1.2. Dynamic Logistic Regression

Table B.4.: Covariates for which the original forgetting framework failed to converge

X10, X12, imsi_factor1, imsi_factor2, imsi_factor3, imsi_factor4, imsi_factor5, imsi_factor6, imsi_factor7, X_uraupdate.orig, last_tGCP000011000000001100001000, last_tGCP000011100000000000000011, last_tGCP000000100000000000000011, last_tGCP000000100000000000000000, last_tGCP000011000000011100011100, last_tGCP000000000000001000011011, last_tGCP000000000000001000011000, last_tGCP000011000001011100011100, last_tGCP000011000000001100011100, cell_id220412, cell_id220511, cell_id220521, cell_id220524, cell_id220526, RSCP_avg.1, X_interrathandoverinfo.orig, t_proc22, radiobearerrelease, locationreport, securitymodereject

Table B.5.: Coefficient estimates for a static logistic regression model trained on the full data set

Coefficient                   Estimate  Standard.Error  z.value  p.value
Intercept                     -2.59     0.31            -8.40    < 2e-16
X6                             0.16     0.23             0.72    0.473546
X10                            0.47     0.35             1.33    0.183486
X12                            1.13     0.23             4.88    1.08e-06
X13                           -0.82     0.19            -4.37    1.23e-05
X14                           -0.03     0.14            -0.22    0.822596
X15                            1.41     0.18             7.95    1.86e-15
X16                           -0.96     0.17            -5.76    8.42e-09
X19                           -1.37     0.48            -2.86    0.004237
X22                            1.09     0.14             7.55    4.39e-14
X23                           -0.51     0.15            -3.43    0.000606
imsi_factor1                  -0.19     0.20            -0.95    0.343919
imsi_factor2                  -0.74     0.21            -3.46    0.000545
imsi_factor3                  -1.19     0.22            -5.36    8.20e-08
imsi_factor4                  -0.73     0.22            -3.30    0.000969
imsi_factor5                  -0.47     0.23            -2.01    0.044828
imsi_factor6                  -0.76     0.22            -3.42    0.000618
imsi_factor7                  -1.09     0.23            -4.81    1.54e-06
imsi_factor8                   0.01     0.23             0.05    0.962124
imsi_factor9                  -0.06     0.23            -0.25    0.801791
imsi_factor10                 -0.84     0.26            -3.26    0.001127
imsi_factor11                  1.52     0.22             6.80    1.04e-11
imsi_factor12                  0.55     0.21             2.60    0.009346
imsi_factor13                  0.52     0.22             2.32    0.020193
imsi_factor14                  0.69     0.22             3.11    0.001844
imsi_factor15                  0.64     0.24             2.67    0.007568
X_cellupdate.orig             -0.69     0.09            -7.35    1.99e-13
X_uraupdate.orig              -2.39     0.34            -6.95    3.61e-12
activate                       0.22     0.15             1.43    0.152776
activesetupdatecomplete       -0.37     0.15            -2.44    0.014902
X_physicalchannel.orig         0.85     0.11             7.42    1.13e-13
X_compressed.orig             -0.02     0.13            -0.19    0.852600
X_dlpower.orig                 0.15     0.10             1.39    0.165634
GCP_000011000000000000000000   1.21     0.23             5.24    1.60e-07
GCP_000011100000000000000000  -1.02     0.28            -3.60    0.000314
GCP_000011000000001100001000   1.57     0.36             4.39    1.11e-05
GCP_000011000000001000011000   0.52     0.26             1.97    0.049023
GCP_000011100000000000000011  -0.86     0.31            -2.80    0.005092
GCP_000000100000000000000011  -2.19     0.37            -5.89    3.86e-09
GCP_000011000000011000011011   0.88     0.30             2.95    0.003212
GCP_000011000000011000011100  -0.64     0.33            -1.94    0.052570
GCP_000000100000000000000000  -1.88     0.37            -5.13    2.92e-07
GCP_000011000000011100011100   2.37     0.29             8.04    9.15e-16
GCP_000000000000001000011011   0.73     0.27             2.77    0.005695
GCP_000000000000001000011000  -0.57     0.30            -1.90    0.058078
GCP_000011000000011000011000   0.60     0.32             1.84    0.065270
GCP_000011000001011100011100   2.18     0.33             6.58    4.64e-11
GCP_000011000000001000011011   0.82     0.33             2.46    0.013758
GCP_000011000000001100011100   1.12     0.37             3.00    0.002713
GCP_000000000000011000011011   0.61     0.60             1.02    0.308164
GCP_000000000000001000000011  -0.08     0.52            -0.16    0.874456
cell_id220412                  0.09     0.22             0.43    0.665636
cell_id220413                  0.21     0.18             1.13    0.257968
cell_id220414                  1.06     0.17             6.23    4.64e-10
cell_id220415                  1.85     0.18            10.55    < 2e-16
cell_id220416                  0.49     0.32             1.55    0.121757
cell_id220511                 -0.03     0.21            -0.12    0.901457
cell_id220512                  0.74     0.16             4.52    6.10e-06
cell_id220513                  0.20     0.24             0.85    0.396193
cell_id220514                  0.25     0.32             0.80    0.425370
cell_id220517                 -0.53     0.23            -2.28    0.022355
cell_id220518                 -0.29     0.21            -1.37    0.171236
cell_id220519                  0.26     0.24             1.07    0.282202
cell_id220521                  1.36     0.16             8.55    < 2e-16
cell_id220523                  1.01     0.18             5.66    1.52e-08
cell_id220524                  1.81     0.16            11.03    < 2e-16
cell_id220526                  1.11     0.19             5.79    6.89e-09
cell_id250511                  2.14     0.37             5.78    7.41e-09
RSCP_avg.1                     1.13     0.58             1.95    0.051235
RSCP_avg.2                    -0.76     0.16            -4.71    2.49e-06
X_interrathandoverinfo.orig   -0.32     0.25            -1.24    0.214177
X_tx.orig                      0.02     0.10             0.21    0.833289
cpichAvg.1                    -0.11     0.13            -0.82    0.414159
cpichAvg.2                     0.03     0.15             0.20    0.845180
t_proc1                       -0.27     0.18            -1.52    0.128285
t_proc14                      -0.22     0.13            -1.75    0.079899
t_proc15                      -0.72     0.33            -2.21    0.026835
t_proc16                      -0.17     0.24            -0.69    0.488020
t_proc18                      -0.29     0.28            -1.04    0.298706
t_proc2                        0.42     0.36             1.16    0.246736
t_proc21                      -0.46     0.25            -1.85    0.064493
t_proc22                       0.42     0.52             0.80    0.424858
t_proc23                      -0.35     0.32            -1.09    0.277684
t_proc29                      -0.08     0.22            -0.34    0.730663
t_proc3                        0.13     0.16             0.77    0.442721
t_proc32                       0.05     0.19             0.28    0.778440
t_proc33                       0.20     0.23             0.84    0.398736
t_proc34                      -0.02     0.47            -0.04    0.967561
t_proc37                      -0.24     0.21            -1.16    0.246237
t_proc10                       0.02     0.09             0.17    0.868475
t_proc11                      -0.19     0.09            -2.08    0.037290
t_proc12                       0.45     0.14             3.16    0.001571
t_proc13                      -0.03     0.09            -0.32    0.752861
t_proc25                       0.34     0.10             3.23    0.001231
t_proc28                       0.35     0.12             2.86    0.004256
t_proc31                      -0.05     0.10            -0.45    0.653183
t_proc4                       -0.29     0.14            -2.04    0.041486
t_proc6                        0.08     0.18             0.45    0.653796
t_proc9                        0.10     0.09             1.11    0.265443
sirAvg.1                      -0.00     0.41            -0.01    0.994379
sirAvg.2                       0.21     0.10             2.18    0.029493
radiolinksetupfailurefdd       6.92     0.44            15.55    < 2e-16
radiolinkadditionrequestfdd   -0.57     0.10            -5.60    2.15e-08
radiolinkfailureindication     2.14     0.09            22.90    < 2e-16
radiobearerrelease            -1.22     0.23            -5.34    9.52e-08
radiobearerreconfiguration    -0.72     0.11            -6.46    1.05e-10
Interact                       0.43     0.16             2.70    0.006918
SRB                           -0.42     0.19            -2.21    0.026755
Other                          2.93     0.27            10.80    < 2e-16
e4a                           -0.24     0.24            -0.98    0.328102
evid_not_measured              1.12     0.20             5.73    9.83e-09
e1d                            0.92     0.15             6.21    5.32e-10
e2d                            0.00     0.12             0.00    0.998481
e2f                           -1.14     0.14            -7.91    2.56e-15
locationreport                 0.29     0.19             1.56    0.118088
location                      -0.69     0.14            -5.05    4.41e-07
locationreportingcontrol       0.57     0.17             3.38    0.000720
X_rab.orig                     1.59     0.22             7.35    2.05e-13
securitymodereject             5.74     0.48            11.94    < 2e-16
securitymodecommand           -1.47     0.21            -7.05    1.76e-12

Bibliography

Agrawal, R., Imieliński, T., and Swami, A. (1993). Mining association rules between sets of items in large databases. In ACM SIGMOD Record, volume 22, pages 207–216. ACM.

Anagnostopoulos, C. and Gramacy, R. B. (2012). Dynamic trees for streaming and massive data contexts. arXiv preprint arXiv:1201.5568.

Barbieri, M. M. and Berger, J. O. (2004). Optimal predictive model selection. Annals of Statistics, pages 870–897.

Brauckhoff, D., Dimitropoulos, X., Wagner, A., and Salamatian, K. (2012). Anomaly extraction in backbone networks using association rules. IEEE/ACM Transactions on Networking (TON), 20(6):1788–1799.

Breaugh, J. A. (2003). Effect size estimation: Factors to consider and mistakes to avoid. Journal of Management, 29(1):79–97.

Cheung, B., Kumar, G., and Rao, S. A. (2005). Statistical algorithms in fault detection and prediction: Toward a healthier network. Bell Labs Technical Journal, 9(4):171–185.

Chipman, H. A., George, E. I., and McCulloch, R. E. (1998). Bayesian CART model search. Journal of the American Statistical Association, 93(443):935–948.

Cox, D. R. (1958). The regression analysis of binary sequences. Journal of the Royal Statistical Society, Series B (Methodological), pages 215–242.

Ericsson (2014). Ericsson Mobility Report, June 2014. http://www.ericsson.com/res/docs/2014/ericsson-mobility-report-june-2014.pdf. Accessed: 2015-05-27.

Gramacy, R. B., Taddy, M., Wild, S. M., et al. (2013).
Variable selection and sensitivity analysis using dynamic trees, with an application to computer code performance tuning. The Annals of Applied Statistics, 7(1):51–80.

Haddock, C. K., Rindskopf, D., and Shadish, W. R. (1998). Using odds ratios as effect sizes for meta-analysis of dichotomous data: a primer on methods and issues. Psychological Methods, 3(3):339.

Hanley, J. A. and McNeil, B. J. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1):29–36.

He, H. and Garcia, E. A. (2009). Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9):1263–1284.

Hoeting, J. A., Madigan, D., Raftery, A. E., and Volinsky, C. T. (1999). Bayesian model averaging: a tutorial. Statistical Science, pages 382–401.

Japkowicz, N. et al. (2000). Learning from imbalanced data sets: a comparison of various strategies. In AAAI Workshop on Learning from Imbalanced Data Sets, volume 68, pages 10–15. Menlo Park, CA.

Khanafer, R., Moltsen, L., Dubreil, H., Altman, Z., and Barco, R. (2006). A Bayesian approach for automated troubleshooting for UMTS networks. In Personal, Indoor and Mobile Radio Communications, 2006 IEEE 17th International Symposium on, pages 1–5. IEEE.

Koop, G. and Korobilis, D. (2012). Forecasting inflation using dynamic model averaging. International Economic Review, 53(3):867–886.

Kurgan, L. A. and Cios, K. J. (2004). CAIM discretization algorithm. IEEE Transactions on Knowledge and Data Engineering, 16(2):145–153.

Lewis, S. M. and Raftery, A. E. (1997). Estimating Bayes factors via posterior simulation with the Laplace-Metropolis estimator. Journal of the American Statistical Association, 92(438):648–655.

McCormick, T. H., Raftery, A. E., Madigan, D., and Burd, R. S. (2012). Dynamic logistic regression and dynamic model averaging for binary classification. Biometrics, 68(1):23–30.

Nguyen, H. M., Cooper, E. W., and Kamei, K. (2011). Online learning from imbalanced data streams. In Soft Computing and Pattern Recognition (SoCPaR), 2011 International Conference of, pages 347–352. IEEE.

Obuchowski, N. A. (2003). Receiver operating characteristic curves and their use in radiology. Radiology, 229(1):3–8.

Onorante, L. and Raftery, A. E. (2014). Dynamic model averaging in large model spaces.

Penny, W. D. and Roberts, S. J. (1999). Dynamic logistic regression. In Neural Networks, 1999. IJCNN'99. International Joint Conference on, volume 3, pages 1562–1567. IEEE.

Powers, D. M. (2011). Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation.

Raftery, A. E., Kárný, M., and Ettler, P. (2010). Online prediction under model uncertainty via dynamic model averaging: Application to a cold rolling mill. Technometrics, 52(1):52–66.

Rao, S. (2006). Operational fault detection in cellular wireless base-stations. IEEE Transactions on Network and Service Management, 3(2):1–11.

Smith, J. (1992). A comparison of the characteristics of some Bayesian forecasting models. International Statistical Review / Revue Internationale de Statistique, pages 75–87.

Taddy, M. A., Gramacy, R. B., and Polson, N. G. (2011). Dynamic trees for learning and design. Journal of the American Statistical Association, 106(493).

Theera-Ampornpunt, N., Bagchi, S., Joshi, K. R., and Panta, R. K. (2013). Using big data for more dependability: a cellular network tale. In Proceedings of the 9th Workshop on Hot Topics in Dependable Systems, page 2. ACM.

Wang, S., Minku, L. L., and Yao, X. (2013). A learning framework for online class imbalance learning. In Computational Intelligence and Ensemble Learning (CIEL), 2013 IEEE Symposium on, pages 36–45. IEEE.

Watanabe, Y., Matsunaga, Y., Kobayashi, K., Tonouchi, T., Igakura, T., Nakadai, S., and Kamachi, K. (2008). UTRAN O&M support system with statistical fault identification and customizable rule sets. In Network Operations and Management Symposium, 2008 (NOMS 2008), pages 560–573. IEEE.

Zhou, S., Yang, J., Xu, D., Li, G., Jin, Y., Ge, Z., Kosseifi, M. B., Doverspike, R., Chen, Y., and Ying, L. (2013). Proactive call drop avoidance in UMTS networks. In INFOCOM, 2013 Proceedings IEEE, pages 425–429. IEEE.

LIU-IDA/STAT-A–15/007–SE