1162 IEEE TRANSACTIONS ON POWER DELIVERY, VOL. 25, NO. 2, APRIL 2010

Nontechnical Loss Detection for Metered Customers in Power Utility Using Support Vector Machines

Jawad Nagi, Keem Siah Yap, Sieh Kiong Tiong, Member, IEEE, Syed Khaleel Ahmed, Member, IEEE, and Malik Mohamad

Abstract—Electricity consumer dishonesty is a problem faced by all power utilities. Finding efficient measures for detecting fraudulent electricity consumption has been an active research area in recent years. This paper presents a new approach towards nontechnical loss (NTL) detection in power utilities using an artificial intelligence based technique, the support vector machine (SVM). The main motivation of this study is to assist Tenaga Nasional Berhad (TNB) Sdn. Bhd. in peninsular Malaysia to reduce its NTLs in the distribution sector due to abnormalities and fraud activities, i.e., electricity theft. The fraud detection model (FDM) developed in this research study preselects suspected customers to be inspected onsite for fraud based on irregularities in consumption behavior. This approach provides a method of data mining, which involves feature extraction from historical customer consumption data. This SVM based approach uses customer load profile information and additional attributes to expose abnormal behavior that is known to be highly correlated with NTL activities. The result yields customer classes which are used to shortlist potential suspects for onsite inspection based on significant behavior that emerges due to fraud activities. Model testing is performed using historical kWh consumption data for three towns within peninsular Malaysia. Feedback from TNB Distribution (TNBD) Sdn. Bhd. for onsite inspection indicates that the proposed method is more effective compared to the current actions taken by them. With the implementation of this new fraud detection system, TNBD’s detection hitrate will increase from 3% to 60%.
Index Terms—Electricity theft, intelligent system, load profiling, nontechnical loss, pattern classification.

Manuscript received December 16, 2008; revised June 11, 2009. First published October 13, 2009; current version published March 24, 2010. This work was supported in part by Tenaga Nasional Berhad Distribution (TNBD) Sdn. Bhd. and in part by Tenaga Nasional Berhad Research (TNBR) Sdn. Bhd. under Grant RJO 10061948. Paper no. TPWRD-00920-2008.
J. Nagi is with the Power Engineering Centre (PEC) of Universiti Tenaga Nasional, Kajang 43009, Selangor, Malaysia (e-mail: jawad@uniten.edu.my; awesomeawyeah@yahoo.com).
K. S. Yap and S. K. Ahmed are with the Department of Electronics and Communication Engineering, Universiti Tenaga Nasional, Kajang 43009, Selangor, Malaysia (e-mail: yapkeem@uniten.edu.my; keemsiahyap@yahoo.com; syedkhaleel@uniten.edu.my).
S. K. Tiong is with the Power Engineering Centre (PEC), Universiti Tenaga Nasional, Kajang 43009, Selangor, Malaysia. He is also with the Department of Electronics and Communication Engineering, Universiti Tenaga Nasional, Kajang 43009, Selangor, Malaysia (e-mail: siehkiong@uniten.edu.my).
M. Mohamad is with Tenaga Nasional Berhad Research (TNBR) Sdn. Bhd., Kajang 43000, Selangor, Malaysia (e-mail: m.malik@tnbr.com.my).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TPWRD.2009.2030890

I. INTRODUCTION

POWER utilities lose large amounts of money each year due to fraud by electricity consumers. Electricity fraud can be defined as a dishonest or illegal use of electricity equipment or service with the intention to avoid billing charge. It is difficult to distinguish between honest and fraudulent customers. Realistically, electric utilities will never be able to eliminate fraud. It is possible, however, to take measures to detect, prevent and reduce fraud [1].
Investigations are undertaken by electric utility companies to assess the impact of technical losses in generation, transmission and distribution networks, and the overall performance of power networks [2]–[5]. Nontechnical losses (NTLs) comprise one of the most important concerns for electricity distribution utilities worldwide. In 2004, Tenaga Nasional Berhad (TNB) Sdn. Bhd., the sole electricity provider in peninsular Malaysia, recorded revenue losses as high as U.S.$229 million a year as a result of electricity theft, faulty metering, and billing errors [6]. NTLs faced by electric utility companies in the United States were estimated at between 0.5% and 3.5% of the gross annual revenue [7], which is relatively low when compared to losses faced by electric utilities in developing countries such as Bangladesh [8], India [9], and Pakistan [10]. Nevertheless, the loss amounts to between U.S.$1 billion and U.S.$10 billion, given that utility companies in the U.S. had revenues of around U.S.$280 billion in 1998 [7]. Due to the problems associated with NTLs in electric utilities [11], methods for efficient management of NTLs [12], protecting revenue in the distribution industry [13], [14] and detecting fraudulent electricity consumers [15] have been proposed. The most effective method to date for reducing NTLs and commercial losses is the use of intelligent and smart electronic meters, which make fraudulent activities more difficult and easier to detect [14].

In recent years, several data mining and research studies on fraud identification and prediction techniques have been carried out in the electricity distribution sector. These include statistical methods [16]–[18]; decision trees [19], [20]; artificial neural networks (ANNs) [21]; knowledge discovery in databases (KDD) [22], [23]; and multiple classifiers using cross identification and voting schemes [1].
Among these methods, load profiling is one of the most widely used approaches [24]. A load profile is defined as the pattern of electricity consumption of a customer or group of customers over a period of time [25].

NTLs appear to have never been adequately studied, and to date there is no published evidence of research on NTLs in the Malaysian electricity supply industry. TNB Sdn. Bhd. is currently focusing on reducing its NTLs, which are estimated at around 20% throughout peninsular Malaysia. At present, customer installation inspections by TNB Distribution (TNBD) Sdn. Bhd. are carried out without any specific focus due to the unavailability of a system for shortlisting possible fraud suspects. TNBD’s current detection hitrate for manual onsite inspection is 3%. The approach proposed in this paper provides an intelligent system for assisting TNBD inspection teams to increase the effectiveness of their onsite operation for reducing NTLs (detecting fraud customers) based on load profiles retrieved from the customer database. The proposed system will increase TNBD’s detection hitrate from 3% to 60% and will complement their on-going practices. The proposed system will also reduce the operational costs of onsite inspection in monitoring NTL activities.

0885-8977/$26.00 © 2010 IEEE

This paper presents a framework to detect NTL activities in electric utilities, which is achieved by detecting customers with irregular consumption patterns. An automatic feature extraction method using load profiles, in combination with a support vector machine (SVM), is used to identify customers with abnormalities and fraud activities. This study uses historical customer consumption data collected from different towns within peninsular Malaysia.
Customer consumption patterns, which represent load profiles, are extracted using data mining and statistical techniques. Based on the assumption that load profiles contain irregularities when a fraud event occurs, SVM classifies load profiles of customers into two categories: normal and fraud. There are several different types of fraud that can occur, but our research only concentrates on scenarios where abrupt changes appear in customer load profiles, which indicate possible fraud events.

The rest of this paper is organized as follows. Section II presents a brief review of NTLs. Section III provides an overview of the theoretical concept of SVM. Section IV presents the framework used for development of the fraud detection model (FDM), which includes: data preprocessing, feature extraction, SVM training, parameter optimization, SVM classification and data postprocessing. In Section V, the pilot testing results obtained are used to fine tune the FDM developed. Finally, conclusions are presented in Section VI.

II. NONTECHNICAL LOSSES

NTLs are mainly related to electricity theft and customer management processes in which there exist a number of means of consciously defrauding the utility concerned [11]. In most developing countries, NTLs account for a large portion of transmission and distribution losses, which implies that electric utilities have to concentrate on reducing NTLs prior to reducing technical losses [26]. NTLs include the following activities [23]:
1) tampering with meters so that meters record lower rates of consumption;
2) stealing by bypassing the meter or otherwise making illegal connections;
3) arranging false readings by bribing meter readers;
4) arranging billing irregularities with the help of internal employees by means of such subterfuges as making out lower bills, adjusting the decimal point position on bills, or just ignoring unpaid bills.

Ideally, the amount of electrical energy generated should equal the amount of energy registered as consumed. In reality, however, the situation is different because losses are an integral result of energy transmission and distribution [20]. In [12], Davidson presented a method for NTL estimation, stating that the total system loss in a power system is given by the difference between the energy generated or delivered and the energy sold.

As some power loss is inevitable, steps can be taken to ensure that it is minimized. Several measures have been applied to this end, including those based on technology and those that rely on human effort and ingenuity. Reduction of NTLs is crucial for distribution companies. As these losses are concentrated in the low-voltage network, their origins are spread along the whole system and are most critical at lower levels in residential and small commercial sectors [11]. The current method of dealing with NTLs imposes high operational costs due to onsite inspection and requires extensive use of human resources [23]; therefore, this study aims to reduce operational costs in monitoring NTL activities.

III. SUPPORT VECTOR MACHINE

SVMs were introduced by Vapnik in the late 1960s. The SVM, based on the foundation of statistical learning theory, is a general classification method and its theoretical foundation is described in [27] and [28]. SVMs have recently been applied to several applications ranging from face identification [29] and text categorization [30] to bioinformatics and database mining [31].

The main purpose of the (binary) SVM algorithm used for classification is to construct an optimal decision function, $f(x)$, that accurately predicts unseen data into two classes and minimizes the classification error using

$$f(x) = \operatorname{sign}\big(g(x)\big) \tag{1}$$

where $g(x)$ is the decision boundary between the two classes. This is achieved by following the structural risk minimization (SRM) principle, given by [27]:

$$R < \frac{t}{n} + \sqrt{\frac{h\left(\ln\frac{2n}{h} + 1\right) - \ln\frac{\eta}{4}}{n}} \tag{2}$$

where $R$ is the classification error expectation, $t$ is the number of training errors, $n$ is the number of training samples, $h$ is the Vapnik–Chervonenkis (VC) dimension of the classifier, and $\eta$ is a confidence measure. In the case of separable data, the first term in (2) is zero and the second term is minimized, resulting in a good generalization performance of the SVM. The function $g(x)$ in (1) is the decision boundary, which is derived from a set of training samples

$$X = \{x_1, x_2, \ldots, x_n\}, \qquad x_i \in \mathbb{R}^m \tag{3}$$

where each training sample $x_i$ has $m$ features describing a particular signature and belongs to one of two classes

$$Y = \{y_1, y_2, \ldots, y_n\}, \qquad y_i \in \{-1, +1\}. \tag{4}$$

The decision boundary between the two classes is a hyperplane described by the equation

$$g(x) = \langle w, x \rangle + b = 0 \tag{5}$$

where $w$ and $b$ are derived in such a way that the unseen data is classified correctly. This is achieved by maximizing the margin of separation between the two classes. According to [28], this can be formulated as a quadratic programming (QP) optimization problem

$$\min_{w,\, b,\, \xi} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \tag{6}$$

subject to the constraint that all training samples are correctly classified (i.e., all training samples are placed on the margin or outside the margin), that is

$$y_i\big(\langle w, x_i \rangle + b\big) \ge 1 - \xi_i, \qquad i = 1, \ldots, n \tag{7}$$

where $\xi_i \ge 0$ for $i = 1, \ldots, n$ are nonnegative slack variables. By minimizing the first term of (6) the complexity of the SVM is reduced, and by minimizing the second term the number of training errors is reduced. The parameter $C$ in (6) is a regularization parameter and is preselected to be the tradeoff between the two terms in (6). The constrained QP problem defined in (6) and (7) is solved by introducing Lagrange multipliers $\alpha_i \ge 0$ and $\mu_i \ge 0$ (the latter introduced to enforce the positivity of $\xi_i$ [32]–[34]) and the Lagrange functional

$$L(w, b, \xi, \alpha, \mu) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i - \sum_{i=1}^{n} \alpha_i \Big[ y_i\big(\langle w, x_i \rangle + b\big) - 1 + \xi_i \Big] - \sum_{i=1}^{n} \mu_i \xi_i. \tag{8}$$

According to the theory of QP optimization, it is better to solve (8) by introducing the dual formulation of the problem

$$\max_{\alpha,\, \mu} \ \min_{w,\, b,\, \xi} \ L(w, b, \xi, \alpha, \mu) \tag{9}$$

where $\alpha$ and $\mu$ are the Lagrange multipliers. The optimal solution is therefore given by first minimizing with respect to $w$, $b$, and $\xi$ and thereafter maximizing with respect to $\alpha$ and $\mu$. By substituting (8) into (9), the problem is transformed into its dual formulation, given by

$$\max_{\alpha} \ W(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle \tag{10}$$

which is maximized under the constraints

$$\sum_{i=1}^{n} \alpha_i y_i = 0, \qquad 0 \le \alpha_i \le C, \quad i = 1, \ldots, n. \tag{11}$$

Furthermore, the vector $w$ has an expansion in terms of a subset of the training samples, namely those for which the Lagrange multipliers are nonzero. These training samples meet the Karush–Kuhn–Tucker (KKT) condition

$$\alpha_i \Big[ y_i\big(\langle w, x_i \rangle + b\big) - 1 + \xi_i \Big] = 0, \qquad i = 1, \ldots, n. \tag{12}$$

Equation (12) states that only the training vectors corresponding to nonzero Lagrange multipliers, the support vectors (SVs), are needed to describe the hyperplane. In the case of linearly separable data, all SVs lie on the margin and hence the number of SVs is small. Consequently, the decision boundary, $g(x)$, is determined by using only a subset of the training samples

$$g(x) = \sum_{i=1}^{N_{sv}} \alpha_i y_i \langle x_i, x \rangle + b \tag{13}$$

where $\langle \cdot, \cdot \rangle$ is the inner product, $x$ is the input test vector, $N_{sv}$ is the number of SVs, and $b$ is the bias term.

In cases where a linear decision boundary is inappropriate, the SVM maps the input vector, $x$, to a higher dimensional feature space [27], [28]. This is achieved by introducing a kernel function $K$ to obtain the following substitution in (10):

$$\langle x_i, x_j \rangle \rightarrow K(x_i, x_j). \tag{14}$$

This yields

$$W(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \tag{15}$$

and (15) is maximized under the constraints in (11), where the solution is provided by using a software package for solving optimization problems. The decision boundary $g(x)$ in (13) is then modified by substituting the inner product with the kernel function

$$g(x) = \sum_{i=1}^{N_{sv}} \alpha_i y_i K(x_i, x) + b. \tag{16}$$

In general, it is difficult to determine the type of kernel functions to use for specific data patterns [35], [36]. However, any function that satisfies Mercer’s condition by Vapnik [37] can be used as a kernel function. The conventionally used kernel functions in SVMs fall into two categories: kernels based on Euclidean distance and kernels based on Euclidean inner products [38]. Kernels are selected based on the data structure and type of the boundaries between the classes.
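As an illustration of how the kernelized decision boundary in (16) is evaluated, the following Python sketch computes $g(x)$ and its sign for a toy model. It is only a sketch under stated assumptions: the two support vectors, the multipliers, the bias, and the width parameter `gamma` are made-up values chosen for readability, and the Gaussian radial basis function is used as an example of a kernel based on Euclidean distance. None of these numbers come from the trained FDM.

```python
import math

def rbf_kernel(u, v, gamma):
    """Gaussian (RBF) kernel: exp(-gamma * ||u - v||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)

def decision_value(x, support_vectors, alphas, labels, bias, gamma):
    """Kernel expansion g(x) = sum_i alpha_i * y_i * K(x_i, x) + b."""
    total = 0.0
    for sv, alpha, y in zip(support_vectors, alphas, labels):
        total += alpha * y * rbf_kernel(sv, x, gamma)
    return total + bias

def classify(x, support_vectors, alphas, labels, bias, gamma):
    """Predicted class as the sign of g(x): +1 or -1."""
    g = decision_value(x, support_vectors, alphas, labels, bias, gamma)
    return 1 if g >= 0 else -1

# Hypothetical toy model: one support vector per class.
# The multipliers satisfy sum_i alpha_i * y_i = 0, as the dual constraints require.
SVS = [(0.0, 0.0), (1.0, 1.0)]
ALPHAS = [1.0, 1.0]
LABELS = [+1, -1]
BIAS = 0.0
GAMMA = 1.0

if __name__ == "__main__":
    for point in [(0.1, 0.0), (0.9, 1.0)]:
        print(point, classify(point, SVS, ALPHAS, LABELS, BIAS, GAMMA))
```

Only the support vectors enter the sum, which is why the cost of classifying one customer grows with the number of SVs rather than with the size of the training set.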
In this work, a kernel function based on Euclidean distance, the radial basis function (RBF) kernel, is used [39]

$$K(x_i, x_j) = \exp\left(-\gamma \|x_i - x_j\|^2\right), \qquad \gamma > 0 \tag{17}$$

where the parameter $\gamma$ controls the width of the RBF kernel function. The RBF kernel induces an infinite-dimensional kernel space in which all image vectors have the same norm [40]. Generally, the RBF kernel is suggested for use in unknown applications [39]. More details about SVMs are available in [41] and [42].

The selection of the two C-SVM model parameters $(C, \gamma)$ is important to the accuracy of classification. For example, if $C$ is too large (approaching infinity), then the objective is to minimize the empirical risk only, without regard to model flatness in the optimization formulation. The parameter $\gamma$ controls the Gaussian function width, which reflects the distribution range of the $x$ values of the training data. Therefore, both parameters affect model construction in different ways. There are many existing practical approaches for the selection of the parameters $(C, \gamma)$, such as: user-defined values based on prior knowledge and experience, asymptotical optimization [43], cross-validation (CV), and grid search [44].

Besides conducting classifications, SVMs also compute the probabilities for each class [45]. This supports the analytic concept of generalization and certainty. Given that $r_{ij}$ is an estimate for the probability of the output of a pairwise classifier between class $i$ and class $j$ (i.e., $r_{ij} \approx P(y = i \mid y = i \text{ or } j,\ x)$, $r_{ij} + r_{ji} = 1$), and that $p_i$ is the probability of the $i$th class, the probability of a class can be derived for $k$ classes via a QP problem [39]

$$\min_{p} \ \frac{1}{2} \sum_{i=1}^{k} \sum_{j:\, j \ne i} \big(r_{ji}\, p_i - r_{ij}\, p_j\big)^2 \quad \text{subject to} \quad \sum_{i=1}^{k} p_i = 1, \quad p_i \ge 0. \tag{18}$$

This paper employs LIBSVM [46], a library for SVMs, as the core of a C-SVM classifier and conducts two-class (binary) classifications using the RBF kernel. The C-SVM model parameters $(C, \gamma)$ are optimized using the grid search method proposed by Hsu et al. in [44].

IV. METHODOLOGY

This section presents the methods applied for data mining, model development and optimization. It comprises ten subsections. The proposed fraud detection framework for detection of customers with abnormalities and fraud activities is shown in Fig. 1. The fraud detection system presented in this paper is developed as GUI software using Microsoft Visual Basic 6.0. The LIBSVM v2.86 [46] software is used in this research study for SVM training and classification. The computer used for training and testing is a Dell PowerEdge workstation with a 2.40-GHz quad-core processor. The time elapsed for obtaining the detection results from the testing data is approximately 1.8 s per customer and varies based on the configuration of the computer used.

Fig. 1. Proposed fraud detection framework for the detection of customers with abnormalities and fraud activities.

A. Data Acquisition

Historical customer data from TNBD’s electronic-Customer Information Billing System (e-CIBS) was obtained for the Kuala Lumpur (KL) Barat station. The e-CIBS data consists of 265 870 customers for a period of 25 months, i.e., from July 2006 through July 2008. The information in the e-CIBS data useful for the development of the FDM included customer billing information along with the monthly kWh consumption, meter reading type, meter reading date, Theft Of Electricity (TOE) information, Credit Worthiness Rating (CWR) information, High Risk Customer (HRC) information, and Irregularity Report (IR) information. Additionally, the High Risk data was provided by TNBD for the KL Barat station.
This data was additionally requested to further improve the detection hitrate of the FDM. The High Risk data contains information on the fraud customers previously detected by TNBD, and lists the detection dates of all the fraud customers detected. Both the e-CIBS and High Risk data were obtained from TNBD in the Microsoft Office Access database format. The High Risk data contained 105 525 fraud cases detected by TNBD from onsite inspection in the KL Barat area, from December 2000 through July 2008. Inspection of the High Risk data revealed that there were cases where customers were detected more than once for fraud. The maximum number of times a single customer was detected for fraud was 35. This customer was operating a small industry with multiple meters. However, the majority of customers in the High Risk data were detected for fraud only once or twice.

B. Customer Filtering and Selection

The e-CIBS data obtained from TNBD was first filtered to select only customers with complete and useful data. Hence, data mining techniques using structured query language (SQL) were applied to:
1) remove repeating customers in the monthly data;
2) remove customers having no consumption (0 kWh) throughout the entire 25 month period;
3) remove customers not present within the entire 25 month period (missing data);
4) remove new customers registered after the first month in the data, i.e., customers registered after July 2006.

After customer filtering and selection, only 186 968 customer records remained from the initial 265 870 customer population. Even though a large number of customers were removed after applying the four filtering conditions, the number of customers (samples) remaining was more than sufficient for SVM training and testing (validation). The proposed framework for processing e-CIBS data for C-SVM training and validation is shown in Fig. 2.

C. Data Preprocessing

Real world datasets tend to be noisy and inconsistent.
Therefore, to overcome these problems, data mining techniques using statistical methods were applied to the e-CIBS data. Customer data with “estimated” monthly kWh consumption (cases where meter readers are unable to record meter readings due to customers not being present at their premises) were transformed into “normal” consumption values to smooth out inconsistencies. The transformation was accomplished by statistically averaging the previous normal consumption of a customer over the number of consecutive months of estimated consumption that follow it. By applying this technique, estimated consumption values were discarded and replaced with their respective normal consumption values.

Fig. 2. Proposed framework for processing e-CIBS data for C-SVM training and validation.

As the e-CIBS data alone is insufficient to build the fraud detection model (FDM), the High Risk data was preprocessed for extraction of useful information. SQL was applied to the High Risk data to group all individual fraud cases of customers as single records, which resulted in 32 972 fraud customers from the 105 525 fraud cases detected. Thus, each customer was detected for fraud 3.2 times on average, which indicates a high rate of repetitive fraud. To utilize this beneficial information, data mining techniques using SQL were applied to preprocess the High Risk data for transformation into Table I. The new data attributes obtained in Table I, detection count and last detection date, provide useful information for the development of the FDM.

TABLE I
CUSTOMER INFORMATION PREPROCESSED IN HIGH RISK DATA

D. Feature Selection and Extraction

Features were selected from the preprocessed e-CIBS data in order to build the C-SVM classifier. From the 25 month kWh consumption data, daily average kWh consumption values, corresponding to features, were computed for each customer. These features were calculated using the following expression:

$$x_h = \frac{M_{h+1}}{D_{h+1} - D_h}, \qquad h = 1, 2, \ldots, 24 \tag{19}$$

where $M_{h+1}$ represents the monthly kWh consumption of the following month and $D_{h+1} - D_h$ represents the difference in days between the meter reading dates of the following and current months. Using (19), 24 features (i.e., 24 daily average kWh consumption values) were calculated for each customer.

It is known that meter readings for each customer are recorded on different dates of the month and are not always the same for all customers, i.e., meters are not read exactly every 30/31 days and the number of days between readings can be longer or shorter. As meter reading dates affect the monthly kWh consumption recorded for each customer, the 24 daily average kWh consumption values computed using (19) reveal an accurate consumption history of the customers.

The 24 daily average kWh consumption values computed for each customer correspond to customer load profiles. For a selected group of customers, each customer load profile is characterized by a vector $x^{(c)} = \{x^{(c)}_h,\ h = 1, \ldots, 24\}$, where $h$ corresponds to 24 time domain intervals based on the daily average kWh consumption values. Therefore, the whole set of customer load profiles is represented by $X = \{x^{(c)},\ c = 1, \ldots, n\}$.

The credit worthiness rating (CWR) was the other feature selected for the C-SVM classifier. Based on the data analysis of fraud customers previously detected by TNBD, it was observed that CWR contributed significantly towards customers committing fraud activities. CWR data is automatically generated from TNBD’s billing system and is targeted to identify customers intentionally avoiding paying bills and delaying payments. In the e-CIBS data, CWR is based on six integers ranging from 0 to 5, where 0 represents the minimum CWR and 5 represents the maximum CWR. Since CWR changes based on the monthly payment status of customers, the averaged CWR for each customer over a period of 25 months was computed and used as the additional feature. Therefore, 25 features were selected to build the C-SVM classifier (i.e., 24 daily average kWh consumption features and 1 CWR feature).

E. Data Normalization

The feature data need to be represented in a normalized scale for SVM training and validation. Therefore, all 25 modeling features were normalized by using

$$NL = \frac{L - \min(L)}{\max(L) - \min(L)} \tag{20}$$

where $L$ represents the current consumption or CWR of the customer, and $\min(L)$ and $\max(L)$ represent the minimum and maximum consumption in the load profile of the customer or the minimum and maximum CWR throughout all customers.

Fig. 3. Normalized load profiles of two typical fraud customers over a period of two years.

Fig. 4. Normalized load profiles of two normal customers over a period of two years.

F. Feature Adjustment

The LIBSVM software [46] requires the C-SVM training and testing (validation) data to be in a standard format with all feature values having labels. Feature labels are used by LIBSVM in order to identify respective feature values during C-SVM training and testing. Thus, all normalized features in the classifier were labeled, where labels were represented by integers. Normalized feature values with respective labels were represented as a LIBSVM feature file [46], denoted by the matrix in the form

$$\begin{bmatrix} l_1{:}v_{1,1} & l_2{:}v_{1,2} & \cdots & l_m{:}v_{1,m} \\ l_1{:}v_{2,1} & l_2{:}v_{2,2} & \cdots & l_m{:}v_{2,m} \\ \vdots & \vdots & \ddots & \vdots \\ l_1{:}v_{n,1} & l_2{:}v_{n,2} & \cdots & l_m{:}v_{n,m} \end{bmatrix} \tag{21}$$

where $l$ represents the feature label, $v$ represents the normalized feature value, $m$ indicates the last feature, and $n$ indicates the number of customers.

G. Load Profile Inspection

This study constructs a two-class (binary) C-SVM classifier in order to categorize two different types of customer load profiles. First, manual inspection was performed on all TOE cases listed in the KL Barat e-CIBS data to identify load profiles in which abrupt changes appear (indicating irregularities in consumption characteristics). From all the TOE cases inspected, only 53 cases (samples) were identified with the presence of abnormalities and fraud activities. These 53 samples were selected and labeled as fraud suspects (Class 1). Fig. 3 shows the load profiles of two typical fraud customers from the 53 fraud cases identified. Second, inspection was performed on a set of 500 load profiles with no TOE cases. From the load profiles inspected, 330 load profiles in which no abrupt changes or fraud activities appear were selected and labeled as normal customers (Class 2). Load profiles of two normal customers are shown in Fig. 4. Thus, in total, 383 customer samples from both classes were used to build the C-SVM classifier.

H. SVM Classifier and Optimization

As the ratio between the two classes is unbalanced (Class 1 having 53 samples and Class 2 having 330 samples), the C-SVM classifier was weighted in order to balance the sample ratio. Weights were adjusted by calculating the sample ratio for each class. This was achieved by dividing the total number of classifier samples by the number of samples in the individual class. In addition, class weights were multiplied by a factor of 100 to achieve satisfactory weight ratios for C-SVM training.

The optimum classification accuracy of the C-SVM classifier was estimated by optimizing the RBF kernel parameter, $\gamma$, and the error penalty parameter, $C$. For this study, the grid search method proposed by Hsu et al. in [44] was used. The training engine proposed for C-SVM parameter optimization is shown in Fig. 5. In the grid search method, exponentially growing sequences of the parameters $(C, \gamma)$ were used to identify the C-SVM parameters obtaining the best cross-validation (CV) accuracy. Sequences of values for $C$ and for $\gamma$ were combined to form the grid of $(C, \gamma)$ pairs. For each pair of $(C, \gamma)$, validation performance was measured by training on 67% of the classifier data and testing on the remaining 33%. This procedure was repeated 100 times consecutively for tenfold CV, where every time the training and testing data were selected in a random order. The reason for using tenfold CV was to ensure that the classifier does not overfit the training data. Experimentally, it was found that the optimal pair $(C, \gamma)$ obtained the highest tenfold CV training accuracy of 86.43%. The detection hitrate at this CV accuracy was theoretically calculated to be 77.41%.

Fig. 5. Flowchart of the training engine proposed for C-SVM parameter optimization.

The accuracy of the C-SVM classifier is calculated using the following expression:

$$\text{Accuracy} = \frac{N_c}{N_t} \times 100\% \tag{22}$$

where $N_c$ represents the number of samples correctly classified by the C-SVM and $N_t$ represents the total number of samples used for testing. The detection hitrate of the FDM is theoretically defined by the following expression:

$$\text{Hitrate} = \frac{N_{fd}}{N_{fs}} \times 100\% \tag{23}$$

where $N_{fd}$ represents the number of samples correctly classified as fraud cases by the C-SVM and labeled as fraud cases by TNBD, and $N_{fs}$ represents the total number of samples classified as fraud cases by the C-SVM.

During classifier training, the C-SVM pairwise probability information defined in (18) was calculated additionally in order to estimate the probabilities of the classified customers. The probability estimates (decision values) of the classified data provide additional information for the selection of suspected customers from the classification results. The total number of SVs in the C-SVM classifier after training was 160, with Class 1 and Class 2 having 42 and 118 SVs respectively. The maximum value of the optimal solution of the dual SVM problem in (10) was calculated by LIBSVM to be 133.5426. The parameter $\rho$, defined as $-b$ in the decision function in (8) [44], was computed to be 0.6185 on the last (464th) training iteration.

I. Pilot Testing and SVM Classification

Pilot testing for the developed FDM was carried out using the e-CIBS and High Risk data for three towns in the state of Kelantan in Malaysia, which are listed in Table II. As seen from Table II, the percentage of fraud customers detected by TNBD in the past eight years is less than 1% of the total number of customers in each town. These towns have a high rate of fraud activities (indicated by TNBD informers), estimated at around 35%; thus, pilot testing was conducted for these towns using the developed FDM. The FDM validation engine implemented for the detection of suspected customers is shown in Fig. 6.

TABLE II
TEST BED USED FOR THE FRAUD DETECTION MODEL

J. Data Postprocessing

Data postprocessing involved integrating (correlating) the classification results with the e-CIBS and High Risk data as indicated in Fig. 6. The classification results include class labels and probability estimates of the tested customers, which are correlated with the customer data using SQL techniques. After integration of the classification results, a detection report is generated. This detection report shortlists suspected customers from the testing data based on the abnormalities and fraud activities detected by the FDM.

V. EXPERIMENTAL RESULTS

Feedback on the pilot testing results obtained from TNBD for onsite inspection of the three towns indicated that an average detection hitrate (percentage of customers detected with abnormalities and fraud activities from the shortlisted suspects) of 26% was achieved. The detection hitrate of 26% obtained was inclusive of 7% abnormalities and 19% fraud activities.
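To make the accuracy in (22) and the hitrate in (23) concrete, the short Python sketch below computes both from a list of predicted classes and a list of classes confirmed by inspection. The label encoding (1 for fraud and 2 for normal, mirroring Class 1 and Class 2) and the ten sample labels are illustrative assumptions, not data from the pilot test.

```python
FRAUD, NORMAL = 1, 2  # assumed encoding, mirroring Class 1 and Class 2

def accuracy(predicted, actual):
    """Percentage of test samples whose predicted class matches the true class."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("label lists must be non-empty and equally long")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return 100.0 * correct / len(actual)

def hitrate(predicted, actual):
    """Percentage of samples shortlisted as fraud that are confirmed as fraud."""
    flagged = [a for p, a in zip(predicted, actual) if p == FRAUD]
    if not flagged:
        return 0.0  # nothing was shortlisted, so there is nothing to confirm
    confirmed = sum(1 for a in flagged if a == FRAUD)
    return 100.0 * confirmed / len(flagged)

# Made-up example: 10 customers, 4 shortlisted as fraud, 3 of those confirmed.
predicted = [1, 1, 1, 1, 2, 2, 2, 2, 2, 2]
actual = [1, 1, 1, 2, 2, 2, 2, 2, 1, 2]

print(accuracy(predicted, actual))  # 80.0
print(hitrate(predicted, actual))   # 75.0
```

The two measures answer different questions: accuracy counts every correct decision, while the hitrate looks only at the shortlisted customers, which is the quantity that determines how many onsite inspections are productive.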
Abnormalities include the following:
1) replaced meters;
2) abandoned houses;
3) change of tenants;
4) faulty meter wiring.

In order to improve the detection hitrate of the FDM, a decision making system utilizing structured query language (SQL) was implemented in the data postprocessing stage of the FDM. The decision making system is employed in the FDM to select only customers with high possibilities of fraud from the correlated data. It is based on parameter values from the load profiles of customers, the preprocessed e-CIBS and High Risk data, and the C-SVM classification results (probability estimates). Parameter values of the decision making system were determined by inspecting load profiles of previously identified fraud customers using TNBD’s back billing data. This was achieved by determining the common characteristics differentiating the normal cases (see Fig. 4) from the cases with fraud activities (see Fig. 3).

The decision making system improved the detection hitrate of the FDM significantly. With its use, the detection hitrate on the pilot testing results improved from 26% to 64%. This increase of 38 percentage points resulted from the inclusion of human knowledge and expertise in the FDM. Therefore, the desired detection hitrate of 60% was achieved.

The only limitation of the developed fraud detection system is that customers who committed fraud before the two-year period covered by the data will not be detected by the FDM, since the C-SVM is not trained for such instances. Only two years of customer data were provided by TNBD because of problems associated with retrieving archived data older than two years. This is the reason a marginally low detection hitrate was obtained, as the customer data provided was not sufficient to backtrack most of the customer consumption history.

Fig. 6. Flowchart of the FDM validation engine for the detection of suspected customers (customers with abnormalities and fraud activities).

VI. CONCLUSION

This paper presents a new approach towards NTL detection in power utilities using an artificial intelligence based technique. A range of NTL sources such as fraud activities (meter tampering, meter bypassing, etc.) and abnormalities have been considered. The present study applies a pattern classification technique in order to detect and identify load consumption patterns of fraud customers. The framework proposed for NTL detection uses SVM for classification of historical customer consumption data. The experimental results indicate that the proposed FDM can be used for reliable detection of abnormalities and fraud activities within electricity supply utilities.

The method of using SVM for detection of fraud customers has proven to be very promising. Firstly, SVM has nonlinear dividing hypersurfaces, which give it high discrimination. Secondly, SVM provides good generalization ability for unseen data classification. These properties enable the SVM to handle complex classification problems with ease and good accuracy.

The current actions implemented by TNBD for NTL detection achieve a detection hitrate of 3%. Our developed fraud detection system is expected to raise TNBD’s detection hitrate to 60%. This will benefit TNB not only by improving its handling of NTLs but also by complementing its existing practices, and it is envisaged that tremendous savings will result from the use of this system.

VII. FUTURE WORK

Our future work will implement fuzzy logic as the backbone for intelligent decision making in selecting suspicious customers with high possibilities of fraud.
With the inclusion of a fuzzy inference system (FIS), SQL filtering will be replaced with human knowledge and intelligence. In addition, a genetic algorithm (GA) will be used as an optimization tool in order to determine the most suitable C-SVM parameters for the dual Lagrangian optimization problem. The GA will be combined with the SVM in order to implement a hybrid SVM-GA FDM. These improvements are expected to further improve the accuracy of the fraud detection system.

ACKNOWLEDGMENT

The authors would like to thank TNB Distribution (TNBD) Sdn. Bhd. for providing the customer data.

REFERENCES

[1] R. Jiang, H. Tagaris, A. Lachsz, and M. Jeffrey, “Wavelet based feature extraction and multiple classifiers for electricity fraud detection,” in Proc. IEEE/Power Eng. Soc. Transmission and Distribution Conf. Exhibit. Asia Pacific, Oct. 6–10, 2002, vol. 3, pp. 2251–2256.
[2] C. R. Paul, “System loss in a metropolitan utility network,” Power Eng. J., vol. 1, no. 5, pp. 305–307, Sep. 1987.
[3] N. Tobin and N. Sheil, “Managing to reduce power transmission system losses,” in Transmission Performance. Dublin, Ireland: Publ. Electricity Supply Board Int., 1987.
[4] R. L. Sellick and C. T. Gaunt, “Load data preparation for losses estimation,” in Proc. 7th Southern African Universities Power Engineering Conf., Stellenbosch, South Africa, 1998, vol. 7, pp. 117–120.
[5] I. E. Davidson, A. Odubiyi, M. O. Kachienga, and B. Manhire, “Technical loss computation and economic dispatch model in T&D systems in a deregulated ESI,” Power Eng. J., vol. 16, no. 2, pp. 55–60, Apr. 2002.
[6] Annual Report Tenaga Nasional Berhad 2004, TNB, 2004.
[7] T. B. Smith, “Electricity theft: A comparative analysis,” Energy Policy, vol. 32, pp. 2067–2076, 2004.
[8] M. S. Alam, E. Kabir, M. M. Rahman, and M. A. K.
Chowdhury, “Power sector reform in Bangladesh: Electricity distribution system,” Energy, vol. 29, pp. 1773–1783, 2004.
[9] A. Kumar and D. D. Saxena, “Decision priorities and scenarios for minimizing electrical power loss in an India power system network,” Elect. Power Compon. Syst., vol. 31, pp. 717–727, 2003.
[10] M. A. Ram and M. Shrestha, “Environmental and utility planning implications of electricity loss reduction in a developing country: A comparative study of technical options,” Int. J. Energy Res., vol. 22, pp. 47–59, 1998.
[11] A. H. Nizar, Z. Y. Dong, and Y. Wang, “Power utility nontechnical loss analysis with extreme learning machine model,” IEEE Trans. Power Syst., vol. 23, no. 3, pp. 946–955, Aug. 2008.
[12] I. E. Davidson, “Evaluation and effective management of nontechnical losses in electrical power networks,” in Proc. 6th Africon Conf. Africa, Oct. 2–4, 2002, vol. 1, pp. 473–477.
[13] R. Mano, R. Cespedes, and D. Maia, “Protecting revenue in the distribution industry: A new approach with the revenue assurance and audit process,” in Proc. IEEE/PES Transmission & Distribution Conf. Expo.: Latin America, Nov. 2004, pp. 218–223.
[14] M. V. K. Rao and S. H. Miller, “Revenue improvement from intelligent metering systems,” in Proc. 9th Int. Conf. Metering and Tariffs for Energy Supply, Birmingham, U.K., Aug. 1999, pp. 218–222.
[15] A. J. Dick, “Theft of electricity: How UK electricity companies detect and deter,” in European Convention on Security and Detection, Brighton, U.K., May 16–18, 1995, pp. 90–95.
[16] J. W. Fourie and J. E. Calmeyer, “A statistical method to minimize electrical energy losses in a local electricity distribution network,” in Proc. 7th IEEE AFRICON Conf. Africa: Technology Innovation, Gaborone, Botswana, Sep. 15–17, 2004, vol. 2, pp. 667–673.
[17] J. E. Cabral, J. O. P. Pinto, E. M. Gontijo, and J. R. Filho, “Fraud detection in electrical energy consumers using rough sets,” in Proc. IEEE Int. Conf. Systems, Man and Cybernetics, Oct.
10–13, 2004, vol. 4, pp. 3625–3629.
[18] J. Bilbao, E. Torres, P. Eguía, J. L. Berasategui, and J. R. Saenz, “Determination of energy losses,” in Proc. 16th Int. Conf. Exhibition on Electricity Distribution (CIRED) 2001, Amsterdam, The Netherlands, vol. 5, 4 pp.
[19] J. R. Filho, E. M. Gontijo, A. C. Delaiba, E. Mazina, J. E. Cabral, and J. O. P. Pinto, “Fraud identification in electricity company customers using decision trees,” in Proc. 2004 IEEE Int. Conf. Systems, Man and Cybernetics, Oct. 10–13, 2004, vol. 4, pp. 3730–3734.
[20] A. H. Nizar, Z. Y. Dong, J. H. Zhao, and P. Zhang, “A data mining based NTL analysis method,” in Proc. IEEE Power Eng. Soc. General Meeting, Tampa, FL, Jun. 24–28, 2007, pp. 1–8.
[21] J. R. Galvan, A. Elices, A. Munoz, T. Czernichow, and M. A. Sanz-Bobi, “System for detection of abnormalities and fraud in customer consumption,” in Proc. 12th Conf. Electric Power Supply Industry, Pattaya, Thailand, Nov. 1998.
[22] A. H. Nizar, Z. Y. Dong, and J. H. Zhao, “Load profiling and data mining techniques in electricity deregulated market,” presented at the IEEE Power Eng. Soc. General Meeting, Montreal, QC, Canada, Jun. 18–22, 2006, paper 06GM0828.
[23] A. H. Nizar, Z. Y. Dong, M. Jalaluddin, and M. J. Raffles, “Load profiling method in detecting non-technical loss activities in a power utility,” in Proc. IEEE 1st Int. Power and Energy Conf., Putrajaya, Malaysia, Nov. 28–29, 2006, pp. 82–87.
[24] D. Gerbec, S. Gasperic, I. Smon, and F. Gubina, “Allocation of the load profiles to consumers using probabilistic neural networks,” IEEE Trans. Power Syst., vol. 20, no. 2, pp. 548–555, May 2005.
[25] S. V. Allera and A. G. Horsburgh, “Load profiling for the energy trading settlements in the UK electricity markets,” in Presentation Report From Distribu-TECH Europe DA/DSM Conf., London, U.K., Oct. 27–29, 1998.
[26] Hydro and Thermal Power Business Team, Transmission and Distribution Loss Reduction Services, #167 Samsung-dong, Kangnamgu, Seoul, 135-791, Korea.
[27] V. N. Vapnik, Statistical Learning Theory. New York: Wiley, 1998.
[28] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines. Cambridge, U.K.: Cambridge Univ. Press, 2000.
[29] E. Osuna, “Applying SVMs to face detection,” IEEE Intell. Syst. Mag., Support Vector Machines, vol. 13, no. 4, pp. 23–26, Jul./Aug. 1998, M. A. Hearst, ed.
[30] S. Dumais, “Using SVMs for text categorization,” IEEE Intell. Syst. Mag., Support Vector Machines, vol. 13, no. 4, pp. 21–23, Jul./Aug. 1998, M. A. Hearst, ed.
[31] G. Dror, R. Sorek, and S. Shamir, “Accurate identification of alternatively spliced exons using support vector machine,” Bioinformatics, vol. 21, no. 7, pp. 897–901, Apr. 2005.
[32] C. J. C. Burges, “A tutorial on support vector machines for pattern recognition,” Data Mining Knowl. Discovery, vol. 2, no. 2, pp. 121–167, 1998.
[33] M. E. Mavroforakis and S. Theodoridis, “A geometric approach to Support Vector Machine (SVM) classification,” IEEE Trans. Neural Netw., vol. 17, no. 3, pp. 671–682, May 2006.
[34] K. Veropoulos, “Machine learning approaches to medical decision making,” Ph.D. dissertation, Faculty Eng., Dept. Eng. Math., Univ. Bristol, Bristol, U.K., 2001.
[35] V. Kecman, Learning and Soft Computing: Support Vector Machines, Neural Networks and Fuzzy Logic Models. Cambridge, MA: MIT Press, 2001.
[36] S. Amari and S. Wu, “Improving support vector machine classifiers by modifying kernel functions,” Neural Netw., vol. 12, no. 6, pp. 783–789, 1999.
[37] V. Vapnik, The Nature of Statistical Learning Theory. New York: Springer-Verlag, 1995.
[38] B. Schölkopf and A. J. Smola, Learning With Kernels: Support Vector Machines, Regularization, Optimization and Beyond. Cambridge, MA: MIT Press, 2001.
[39] M. A. Oskoei and H.
Hu, “Support vector machine-based classification scheme for myoelectric control applied to upper limb,” IEEE Trans. Biomed. Eng., vol. 55, no. 8, pp. 1956–1965, Aug. 2008.
[40] D. Wang, D. S. Yeung, and E. C. Tsang, “Weighted Mahalanobis distance kernels for support vector machines,” IEEE Trans. Neural Netw., vol. 18, no. 5, pp. 1453–1462, Sep. 2007.
[41] P.-H. Chen, C.-J. Lin, and B. Schölkopf, “A tutorial on nu-support vector machines,” Appl. Stochast. Models in Bus. Ind., vol. 21, pp. 111–136, 2005.
[42] C. Cortes and V. Vapnik, “Support-vector networks,” Mach. Learn., vol. 20, no. 3, pp. 273–297, Sep. 1995.
[43] V. Cherkassky and Y. Ma, “Practical selection of SVM parameters and noise estimation for SVM regression,” Neural Netw., vol. 17, no. 1, pp. 113–126, 2004.
[44] C.-W. Hsu, C.-C. Chang, and C.-J. Lin, “A practical guide to support vector classification,” Dept. Comput. Sci., National Taiwan Univ., Taipei, Taiwan, Tech. Rep., 2003.
[45] T. F. Wu, C.-J. Lin, and R. C. Weng, “Probability estimates for multiclass classification by pairwise coupling,” J. Mach. Learn. Res., vol. 5, pp. 975–1005, 2004.
[46] C.-C. Chang and C.-J. Lin, LIBSVM: A Library for Support Vector Machines. [Online]. Available: http://www.csie.ntu.edu.tw/~cjlin/libsvm

Jawad Nagi was born in Karachi, Pakistan, on March 23, 1985. He received the Bachelor’s and Master’s degrees (Hons.) in electrical and electronics, and electrical engineering from Universiti Tenaga Nasional (UNITEN), Malaysia, in 2007 and 2009, respectively. Currently, he is a Project Engineer in the Power Engineering Centre (PEC) at UNITEN. His research interests include pattern recognition, image processing, load forecasting, fuzzy logic, neural networks, and support vector machines.

Keem Siah Yap received the Bachelor’s and Master’s degrees from Universiti Teknologi Malaysia (UTM) in 1998 and 2000, respectively, and is currently pursuing the Ph.D. degree in electronic engineering at the Universiti Sains Malaysia (USM).
He is a Senior Lecturer in the Department of Electronic and Communication Engineering, Universiti Tenaga Nasional, Selangor, Malaysia. His research interests include theory and application of artificial intelligence, digital image processing, computer vision systems, and robotics.

Sieh Kiong Tiong (M’02) received the Bachelor’s and Master’s degrees in electrical, electronic, and system engineering from the Universiti Kebangsaan Malaysia (UKM) in 1997 and 2000, respectively, and the Ph.D. degree in mobile communication from the UKM in 2006. He is currently the Head of Electronic and Information Technology Unit of Power Engineering Centre (PEC) in the Universiti Tenaga Nasional (UNITEN), Selangor, Malaysia. He is also an Associate Professor in the Electronic and Communication Engineering Department of UNITEN. His research interests include digital electronics, microprocessor systems, artificial intelligence, and mobile cellular systems.

Malik Mohamad received the Bachelor’s degree in electrical engineering (Hons.) from the University of Brighton, Brighton, U.K. He is a Senior Manager with TNB Research (TNBR) Sdn. Bhd. He has led numerous distribution projects in TNB, particularly in Information Systems and Metering. Currently he is a Project Director of seven research groups in TNBR Sdn. Bhd. dealing with clients in Sabah Electricity Sdn. Bhd. and TNB Transmission and Distribution division. He has more than 25 years of field and management experience in the electricity industry. Mr. Mohamad is a professional engineer registered with the Board of Engineers Malaysia (BEM), a corporate member of the Institute of Engineers Malaysia (IEM), and a certified competent engineer (33 kV) with Energy Commission Malaysia.

Syed Khaleel Ahmed (M’98) received the Bachelor’s degree in electrical and electronics engineering from Anna University, Chennai, India, in 1988, and the Master’s degree in electrical and computer engineering from the University of Massachusetts, Amherst, in 1994. He is currently a Senior Lecturer in the Electronic and Communication Engineering Department, Universiti Tenaga Nasional, Selangor, Malaysia. His research interests include robust control, fuzzy logic and control, neural networks, robotics, signal processing, and numerical analysis.