Nontechnical Loss Detection for Metered Customers in Power Utility

IEEE TRANSACTIONS ON POWER DELIVERY, VOL. 25, NO. 2, APRIL 2010
Nontechnical Loss Detection for Metered Customers
in Power Utility Using Support Vector Machines
Jawad Nagi, Keem Siah Yap, Sieh Kiong Tiong, Member, IEEE, Syed Khaleel Ahmed, Member, IEEE, and
Malik Mohamad
Abstract—Electricity consumer dishonesty is a problem faced
by all power utilities. Finding efficient measurements for detecting
fraudulent electricity consumption has been an active research
area in recent years. This paper presents a new approach towards
nontechnical loss (NTL) detection in power utilities using an artificial intelligence based technique, support vector machine (SVM).
The main motivation of this study is to assist Tenaga Nasional
Berhad (TNB) Sdn. Bhd. in peninsular Malaysia to reduce its
NTLs in the distribution sector due to abnormalities and fraud
activities, i.e., electricity theft. The fraud detection model (FDM)
developed in this research study preselects suspected customers to
be inspected onsite for fraud based on irregularities in consumption
behavior. This approach provides a method of data mining, which
involves feature extraction from historical customer consumption
data. This SVM based approach uses customer load profile information and additional attributes to expose abnormal behavior
that is known to be highly correlated with NTL activities. The
result yields customer classes which are used to shortlist potential
suspects for onsite inspection based on significant behavior that
emerges due to fraud activities. Model testing is performed using
historical kWh consumption data for three towns within peninsular Malaysia. Feedback from TNB Distribution (TNBD) Sdn.
Bhd. for onsite inspection indicates that the proposed method is
more effective than the current actions taken by TNBD.
With the implementation of this new fraud detection system,
TNBD’s detection hitrate will increase from 3% to 60%.
Index Terms—Electricity theft, intelligent system, load profiling,
nontechnical loss, pattern classification.
I. INTRODUCTION
POWER utilities lose large amounts of money each year
due to fraud by electricity consumers. Electricity fraud can
be defined as a dishonest or illegal use of electricity equipment
Manuscript received December 16, 2008; revised June 11, 2009. First published October 13, 2009; current version published March 24, 2010. This work
was supported in part by Tenaga Nasional Berhad Distribution (TNBD) Sdn.
Bhd. and in part by Tenaga Nasional Berhad Research (TNBR) Sdn. Bhd. under
Grant RJO 10061948. Paper no. TPWRD-00920-2008.
J. Nagi is with the Power Engineering Centre (PEC) of Universiti Tenaga
Nasional, Kajang 43009, Selangor, Malaysia (e-mail: jawad@uniten.edu.my;
awesomeawyeah@yahoo.com).
K. S. Yap and S. K. Ahmed are with Department of Electronics and Communication Engineering, Universiti Tenaga Nasional, Kajang 43009, Selangor,
Malaysia (e-mail: yapkeem@uniten.edu.my; keemsiahyap@yahoo.com;
syedkhaleel@uniten.edu.my).
S. K. Tiong is with the Power Engineering Centre (PEC), Universiti Tenaga
Nasional, Kajang 43009, Selangor, Malaysia. He is also with the Department
of Electronics and Communication Engineering, Universiti Tenaga Nasional,
Kajang 43009, Selangor, Malaysia (e-mail: siehkiong@uniten.edu.my).
M. Mohamad is with Tenaga Nasional Berhad Research (TNBR) Sdn. Bhd.,
Kajang 43000, Selangor, Malaysia (e-mail: m.malik@tnbr.com.my).
Color versions of one or more of the figures in this paper are available online
at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TPWRD.2009.2030890
or service with the intention of avoiding billing charges. It is difficult
to distinguish between honest and fraudulent customers. Realistically, electric utilities will never be able to eliminate fraud.
It is possible, however, to take measures to detect, prevent and
reduce fraud [1].
Investigations are undertaken by electric utility companies to
assess the impact of technical losses in generation, transmission
and distribution networks, and the overall performance of power
networks [2]–[5]. Nontechnical losses (NTLs) comprise one of
the most important concerns for electricity distribution utilities
worldwide. In 2004, Tenaga Nasional Berhad (TNB) Sdn. Bhd.,
the sole electricity provider in peninsular Malaysia recorded
revenue losses as high as U.S.$229 million a year as a result
of electricity theft, faulty metering, and billing errors [6]. NTLs
faced by electric utility companies in the United States were estimated between 0.5% and 3.5% of the gross annual revenue
[7], which is relatively low when compared to losses faced by
electric utilities in developing countries such as Bangladesh [8],
India [9], and Pakistan [10]. Nevertheless, the loss amounts to
between U.S.$1 billion and U.S.$10 billion, given that utility
companies in the U.S. had revenues of around U.S.$280 billion in
1998 [7].
Due to the problems associated with NTLs in electric utilities [11], methods for efficient management of NTLs [12], protecting revenue in the distribution industry [13], [14], and detecting fraudulent electricity consumers [15] have been proposed.
The most effective method to date for reducing NTLs and commercial
losses is the use of intelligent and smart electronic meters that make fraudulent activities more difficult to carry out and easier to
detect [14]. In recent years, several data mining and research
studies on fraud identification and prediction techniques have
been carried out in the electricity distribution sector. These include statistical methods [16]–[18]; decision trees [19], [20]; artificial neural networks (ANNs) [21]; knowledge discovery in
databases (KDDs) [22], [23]; and multiple classifiers using cross
identification and voting schemes [1]. Among these methods,
load profiling is one of the most widely used [24] approaches,
which is defined as the pattern of electricity consumption of a
customer or group of customers over a period of time [25].
NTLs appear to have never been adequately studied and to
date there is no published evidence of research on NTLs in
the Malaysian electricity supply industry. TNB Sdn. Bhd. is
currently focusing on reducing its NTLs, which are estimated
around 20% throughout peninsular Malaysia. At present, customer installation inspections by TNB Distribution (TNBD)
Sdn. Bhd. are carried out without any specific focus due
to unavailability of a system for shortlisting possible fraud
suspects. TNBD’s current detection hitrate for manual onsite
inspection is 3%. The approach proposed in this paper provides
0885-8977/$26.00 © 2010 IEEE
Authorized licensed use limited to: UNIVERSITY TENAGA NASIONAL. Downloaded on May 31,2010 at 05:14:09 UTC from IEEE Xplore. Restrictions apply.
NAGI et al.: NTL DETECTION FOR METERED CUSTOMERS IN POWER UTILITY USING SVMS
an intelligent system for assisting TNBD inspection teams to
increase effectiveness of their onsite operation for reducing
NTLs (detecting fraud customers) based on load profiles retrieved from the customer database. The proposed system will
increase TNBD’s detection hitrate from 3% to 60% and will
complement their on-going practices. The proposed system
will also reduce operational costs due to onsite inspection in
monitoring NTL activities.
This paper presents a framework to detect NTL activities in
electric utilities, which is achieved by detecting customers with
irregular consumption patterns. An automatic feature extraction method using load profiles with the combination of support
vector machine (SVM) is used to identify customers with abnormalities and fraud activities. This study uses historical customer
consumption data collected from different towns within peninsular Malaysia. Customer consumption patterns are extracted
using data mining and statistical techniques, which represent
load profiles. Based on the assumption that load profiles contain irregularities when a fraud event occurs, SVM classifies
load profiles of customers into two categories: normal and fraud.
There are several different types of fraud that can occur, but our
research only concentrates on scenarios where abrupt changes
appear in customer load profiles, which indicate possible fraud
events.
The rest of this paper is organized as follows. Section II
presents a brief review of NTLs. Section III provides an
overview of the theoretical concept of SVM. Section IV
presents the framework used for development of the fraud
detection model (FDM), which includes: data preprocessing,
feature extraction, SVM training, parameter optimization,
SVM classification and data postprocessing. In Section V, the
pilot testing results obtained are used to fine tune the FDM
developed. Finally, conclusions are presented in Section VI.
II. NONTECHNICAL LOSSES
NTLs are mainly related to electricity theft and customer
management processes in which there exist a number of means
of consciously defrauding the utility concerned [11]. In most developing countries, NTLs account
for a large portion of transmission and distribution losses, which implies that electric utilities
have to concentrate on reducing NTLs prior to reducing technical losses [26]. NTLs include the following activities [23]:
1) tampering with meters so that meters record lower rates of
consumption;
2) stealing by bypassing the meter or otherwise making illegal
connections;
3) arranging false readings by bribing meter readers;
4) arranging billing irregularities with the help of internal employees by means of such subterfuges as making out lower
bills, adjusting the decimal point position on bills, or just
ignoring unpaid bills.
In principle, the amount of electrical energy generated should
equal the amount of energy registered as consumed. However,
in reality, the situation is different because losses are an integral result of energy transmission and distribution [20]. In [12],
Davidson presented a method for NTL estimation, stating that
the total system loss in a power system is given by the difference
between the energy generated or delivered and the energy sold.
As some power loss is inevitable, steps can be taken to ensure that it is minimized. Several measures have been applied
to this end, including those based on technology and those that
rely on human effort and ingenuity. Reduction of NTLs is crucial for distribution companies. As these losses are concentrated
in the low-voltage network, their origins are spread along the
whole system and are most critical at lower levels in residential
and small commercial sectors [11]. The current method of
dealing with NTLs imposes high operational costs due to onsite
inspection and requires extensive use of human resources [23];
therefore, this study aims to reduce operational costs in monitoring NTL activities.
III. SUPPORT VECTOR MACHINE
SVMs were introduced by Vapnik in the late 1960s. The
SVM, based on the foundation of statistical learning theory, is
a general classification method and its theoretical foundation is
described in [27] and [28]. SVMs have recently been applied
to several applications ranging from face identification [29] and
text categorization [30] to bioinformatics and database mining
[31].
The main purpose of the (binary) SVM algorithm used for
classification is to construct an optimal decision function, $f(x)$,
that accurately predicts unseen data into two classes and minimizes the classification error using
$$f(x) = \operatorname{sgn}(g(x)) \tag{1}$$
where $g(x)$ is the decision boundary between the two classes.
This is achieved by following the structural risk
minimization (SRM) principle, given by [27]:
$$R < \frac{t}{n} + \sqrt{\frac{h\left(\ln\frac{2n}{h} + 1\right) - \ln\frac{\eta}{4}}{n}} \tag{2}$$
where $R$ is the classification error expectation, $t$ is the number
of training errors, $n$ is the number of training samples, $h$ is the Vapnik-Chervonenkis (VC) dimension, and $\eta$ is
a confidence measure.
In the case of separable data, the first term in (2) is zero and
the second term is minimized, resulting in a good generalization
performance of the SVM. The function $g(x)$ in (1) is the decision
boundary, which is derived from a set of training samples
$$\{(x_i, y_i)\}, \quad i = 1, 2, \ldots, n \tag{3}$$
where each training sample $x_i \in \mathbb{R}^{d}$ has
$d$ features describing a particular signature and belongs to one of two classes
$$y_i \in \{-1, +1\} \tag{4}$$
The decision boundary between the two classes is a hyperplane
described by the equation
$$g(x) = \langle w, x \rangle + b = 0 \tag{5}$$
where $w$ and $b$ are derived in such a way that the unseen data is
classified correctly. This is achieved by maximizing the margin
of separation between the two classes. According to [28], this
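can be formulated as a quadratic programming (QP) optimization problem
$$\min_{w,\,b,\,\xi}\ \frac{1}{2}\|w\|^{2} + C\sum_{i=1}^{n}\xi_i \tag{6}$$
subject to the constraint that all training samples are correctly
classified (i.e., all training samples are placed on the margin or
outside the margin), that is
$$y_i\left(\langle w, x_i \rangle + b\right) \ge 1 - \xi_i, \quad i = 1, \ldots, n \tag{7}$$
where $\xi_i \ge 0$ for $i = 1, \ldots, n$ are nonnegative slack variables.
By minimizing the first term of (6) the complexity of the SVM
is reduced, and by minimizing the second term the number of
training errors is reduced. The parameter $C$ in (6) is a regularization parameter and is preselected to be the tradeoff between
the two terms in (6). The constrained QP problem defined in (6)
and (7) is solved by introducing Lagrange multipliers $\alpha_i \ge 0$
and $\beta_i \ge 0$ (introduced to enforce the positivity of $\xi_i$ [32]–[34])
and the Lagrange functional
$$L(w, b, \xi, \alpha, \beta) = \frac{1}{2}\|w\|^{2} + C\sum_{i=1}^{n}\xi_i - \sum_{i=1}^{n}\alpha_i\left[y_i\left(\langle w, x_i \rangle + b\right) - 1 + \xi_i\right] - \sum_{i=1}^{n}\beta_i\xi_i \tag{8}$$
According to the theory of QP optimization, it is better to
solve (8) by introducing the dual formulation of the problem
$$\max_{\alpha,\,\beta}\ W(\alpha, \beta) = \max_{\alpha,\,\beta}\ \min_{w,\,b,\,\xi}\ L(w, b, \xi, \alpha, \beta) \tag{9}$$
where $\alpha_i$ and $\beta_i$ are the Lagrange multipliers. Therefore, the
optimal solution is given by first minimizing with respect to $w$, $b$, and $\xi$ and
thereafter maximizing with respect to $\alpha$ and $\beta$. By
substituting (8) into (9), the problem is transformed into its dual
formulation, given by
$$\max_{\alpha}\ W(\alpha) = \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j \langle x_i, x_j \rangle \tag{10}$$
and is maximized under the constraints
$$\sum_{i=1}^{n}\alpha_i y_i = 0, \quad 0 \le \alpha_i \le C, \quad i = 1, \ldots, n \tag{11}$$
Furthermore, the vector $w$ has an expansion in terms of a
subset of the training samples, where the Lagrange multipliers
$\alpha_i$ are nonzero. These training samples meet the Karush-Kuhn-Tucker (KKT) condition
$$\alpha_i\left[y_i\left(\langle w, x_i \rangle + b\right) - 1 + \xi_i\right] = 0, \quad i = 1, \ldots, n \tag{12}$$
Equation (12) states that only the training vectors corresponding to nonzero Lagrange multipliers, the support vectors
(SVs), are needed to describe the hyperplane. In the case of
linearly separable data, all SVs lie on the margin and hence the
number of SVs is small. Consequently, the decision boundary
$g(x)$ is determined by only using a subset of the training
samples
$$g(x) = \sum_{i=1}^{N_{SV}}\alpha_i y_i \langle x_i, x \rangle + b \tag{13}$$
where $x$ is the input test vector, $\langle x_i, x \rangle$ is the inner product,
$N_{SV}$ is the number of SVs, and $b$ is the bias term.
In cases where a linear decision boundary is inappropriate,
the SVM maps the input vector, $x$, to a higher dimensional feature space [27], [28]. This is achieved by introducing a kernel
function $K(x_i, x_j)$ to obtain the following substitution in (10):
$$\langle x_i, x_j \rangle \rightarrow K(x_i, x_j) \tag{14}$$
This yields
$$W(\alpha) = \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j K(x_i, x_j) \tag{15}$$
and (15) is maximized under the constraints in (11), where the
solution is provided by using a software package for solving
optimization problems. The decision boundary $g(x)$ in (13) is
then modified by substituting $\langle x_i, x \rangle$ with $K(x_i, x)$
$$g(x) = \sum_{i=1}^{N_{SV}}\alpha_i y_i K(x_i, x) + b \tag{16}$$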
In general, it is difficult to determine the type of kernel functions to use for specific data patterns [35], [36]. However, any
function that satisfies Mercer’s condition by Vapnik [37] can be
used as a kernel function. The conventionally used kernel functions in SVMs fall into two categories: kernels based on Euclidean distance and kernels based on Euclidean inner products
[38]. Kernels are selected based on the data structure and type
of the boundaries between the classes. In this work, a kernel
function based on Euclidean distance, the radial basis function
(RBF) kernel, is used [39]
$$K(x_i, x_j) = \exp\left(-\gamma\|x_i - x_j\|^{2}\right) \tag{17}$$
where the parameter $\gamma$ controls the width of the RBF kernel function. The RBF kernel induces an infinite-dimensional kernel
space in which all image vectors have the same norm [40]. Generally, the RBF kernel is suggested for use in unknown applications [39]. More details about SVMs are available in [41] and
[42].
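As an illustration of how the kernel decision boundary in (16) is evaluated with the RBF kernel in (17), the following minimal Python sketch may help. It assumes the common parameterization $\exp(-\gamma\|u - v\|^2)$; the function names and the two-support-vector toy model are hypothetical and are not taken from the paper or from LIBSVM.

```python
import math

def rbf_kernel(u, v, gamma):
    """RBF kernel K(u, v) = exp(-gamma * ||u - v||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)

def decision_value(x, support_vectors, labels, alphas, bias, gamma):
    """Kernel decision boundary g(x) = sum_i alpha_i * y_i * K(x_i, x) + b."""
    total = sum(
        alpha * y * rbf_kernel(sv, x, gamma)
        for sv, y, alpha in zip(support_vectors, labels, alphas)
    )
    return total + bias

def classify(x, support_vectors, labels, alphas, bias, gamma):
    """Decision function f(x) = sgn(g(x)), returned as +1 or -1."""
    g = decision_value(x, support_vectors, labels, alphas, bias, gamma)
    return 1 if g >= 0 else -1

# Hypothetical toy model: one support vector per class (illustrative values only).
svs = [[0.0, 0.0], [1.0, 1.0]]
ys = [-1, +1]
alphas = [1.0, 1.0]

print(classify([0.9, 0.8], svs, ys, alphas, bias=0.0, gamma=1.0))  # closer to the +1 SV
print(classify([0.1, 0.2], svs, ys, alphas, bias=0.0, gamma=1.0))  # closer to the -1 SV
```

A larger `gamma` makes each support vector influence a narrower neighborhood, which is why the kernel width has to be tuned together with the penalty parameter.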
The selection of the two C-SVM model parameters $(C, \gamma)$ is
important to the accuracy of classification. For example, if $C$ is
too large (approaching infinity), then the objective is to minimize the empirical risk, without regard to model flatness in the optimization formulation. The parameter $\gamma$ controls the Gaussian
function width, which reflects the distribution range of values
of the training data. Therefore, both parameters affect model
construction in different ways. There are many existing practical
approaches for the selection of the parameters $(C, \gamma)$, such as:
user-defined based on prior knowledge and experience, asymptotical optimization [43], cross-validation (CV), and grid search
[44].
Fig. 1. Proposed fraud detection framework for the detection of customers with abnormalities and fraud activities.
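Besides conducting classifications, SVMs also compute the
probabilities for each class [45]. This supports the analytic concept
of generalization and certainty. Given that $r_{ij}$ is an estimate
for the probability of the output of a pairwise classifier between
class $i$ and class $j$ (i.e., $r_{ij} \approx P(y = i \mid y = i \text{ or } j,\ x)$,
$r_{ij} + r_{ji} = 1$),
and that $p_i$ is the probability of the $i$th class, the probability
of a class can be derived via a QP problem
[39]
$$\min_{p}\ \frac{1}{2}\sum_{i=1}^{k}\sum_{j:\,j \ne i}\left(r_{ji}\,p_i - r_{ij}\,p_j\right)^{2} \quad \text{subject to} \quad \sum_{i=1}^{k}p_i = 1,\ \ p_i \ge 0 \tag{18}$$
where $k$ is the number of classes.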
This paper employs LIBSVM [46], a library for SVMs, as the
core of a C-SVM classifier and conducts two-class (binary) classifications using the RBF kernel. The C-SVM model parameters
$(C, \gamma)$ are optimized using the grid search method proposed by
Hsu et al. in [44].
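The grid search idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the exponent ranges below are the coarse ranges suggested in the practical guide by Hsu et al., which may differ from those used in this study, and `toy_score` is a hypothetical stand-in for a real cross-validation run.

```python
def exponential_grid(start_exp, stop_exp, step=2):
    """Return [2**start_exp, 2**(start_exp + step), ..., 2**stop_exp]."""
    return [2.0 ** e for e in range(start_exp, stop_exp + 1, step)]

def grid_search(evaluate, c_values, gamma_values):
    """Try every (C, gamma) pair and return (best_score, best_C, best_gamma).

    `evaluate` is a caller-supplied function that returns a validation
    score (for example, cross-validation accuracy) for one parameter pair.
    """
    best = None
    for c in c_values:
        for gamma in gamma_values:
            score = evaluate(c, gamma)
            if best is None or score > best[0]:
                best = (score, c, gamma)
    return best

# Assumed coarse ranges (from the guide by Hsu et al., not confirmed for this paper).
C_GRID = exponential_grid(-5, 15)      # 2^-5, 2^-3, ..., 2^15
GAMMA_GRID = exponential_grid(-15, 3)  # 2^-15, 2^-13, ..., 2^3

def toy_score(c, gamma):
    """Hypothetical score surface that peaks at C = 2 and gamma = 0.5."""
    return -abs(c - 2.0) - abs(gamma - 0.5)

best_score, best_c, best_gamma = grid_search(toy_score, C_GRID, GAMMA_GRID)
print(best_c, best_gamma)
```

In practice `evaluate` would train a C-SVM for the given pair and return its cross-validation accuracy; an exhaustive search is affordable here because only two parameters are tuned.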
IV. METHODOLOGY
This section presents methods applied for data mining, model
development and optimization. It comprises ten subsections.
The proposed fraud detection framework for detection of customers with abnormalities and fraud activities is shown in Fig. 1.
The fraud detection system presented in this paper is developed as GUI software using Microsoft Visual Basic 6.0. The
LIBSVM v2.86 [46] software is used in this research study
for SVM training and classification. The computer used for
training and testing is a Dell PowerEdge workstation with a
2.40-GHz quad-core processor. The time elapsed for obtaining
the detection results from the testing data is approximately 1.8 s
per customer and varies based on the configuration of the computer used.
A. Data Acquisition
Historical customer data from TNBD’s electronic-Customer
Information Billing System (e-CIBS) was obtained for the
Kuala Lumpur (KL) Barat station. The e-CIBS data consists
of 265 870 customers for a period of 25 months, i.e., from
July 2006 through July 2008. The information in the e-CIBS
data useful for the development of the FDM included customer
billing information along with the: monthly kWh consumption,
meter reading type, meter reading date, Theft Of Electricity
(TOE) information, Credit Worthiness Rating (CWR) information, High Risk Customer (HRC) information, and Irregularity
Report (IR) information.
Additionally, the High Risk data was provided by TNBD for
the KL Barat station. This data was additionally requested to further improve the detection hitrate of the FDM. The High Risk
data contains information of the fraud customers previously detected by TNBD, which lists the detection dates of all the fraud
customers detected. Both the e-CIBS and High Risk data were
obtained from TNBD in the Microsoft Office Access database
format.
The high-risk data contained 105 525 fraud cases detected by
TNBD from onsite inspection in the KL Barat area, from December 2000 through July 2008. Inspection on the High Risk
data revealed that there were cases where customers were detected more than one time for fraud. The maximum number
of times a customer was detected for fraud was 35 times. This
customer was operating a small industry with multiple meters.
However, the majority of customers in the High Risk data were
detected for fraud only one or two times.
B. Customer Filtering and Selection
The e-CIBS data obtained from TNBD was first filtered to
select only customers with complete and useful data. Hence,
data mining techniques using structured query language (SQL)
were applied to:
1) remove repeating customers in the monthly data;
2) remove customers having no consumption (0 kWh)
throughout the entire 25 month period;
3) remove customers not present within the entire 25 month
period (missing data);
4) remove new customers registered after the first month in
the data i.e., customers registered after July 2006.
After customer filtering and selection, only 186 968 customer
records remained from the initial 265 870 customer population.
Even though a large number of customers were removed after
applying the four filtering conditions, the number of customers
(samples) remaining was more than sufficient for SVM training
and testing (validation). The proposed framework for processing
e-CIBS data for C-SVM training and validation is shown in
Fig. 2.
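The four filtering conditions listed above can be illustrated with a short sketch. The paper applied SQL to the e-CIBS database; the version below uses plain Python on a hypothetical list of `(customer_id, month_index, kwh)` tuples, so the record layout and function name are assumptions made for illustration only.

```python
def filter_customers(records, n_months=25):
    """Keep customers that have a reading for every month and nonzero consumption.

    `records` is an iterable of (customer_id, month_index, kwh) tuples, where
    month_index runs from 0 (first month in the data) to n_months - 1.
    """
    profiles = {}
    for customer_id, month, kwh in records:
        # Condition 1: a repeated (customer, month) entry is kept only once.
        profiles.setdefault(customer_id, {}).setdefault(month, kwh)

    selected = {}
    for customer_id, by_month in profiles.items():
        # Conditions 3 and 4: every month must be present, including the first,
        # so customers with gaps or registered after month 0 are dropped.
        if any(month not in by_month for month in range(n_months)):
            continue
        series = [by_month[month] for month in range(n_months)]
        # Condition 2: drop customers with 0 kWh throughout the whole period.
        if all(value == 0 for value in series):
            continue
        selected[customer_id] = series
    return selected

# Tiny hypothetical data set with a 3-month window instead of 25 months.
records = [
    ("A", 0, 10), ("A", 1, 0), ("A", 2, 5), ("A", 2, 5),  # duplicate month entry
    ("B", 0, 0), ("B", 1, 0), ("B", 2, 0),                # no consumption at all
    ("C", 1, 7), ("C", 2, 9),                             # registered after month 0
    ("D", 0, 3), ("D", 2, 4),                             # missing month 1
]
selected = filter_customers(records, n_months=3)
print(selected)
```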
Fig. 2. Proposed framework for processing e-CIBS data for C-SVM training and validation.
C. Data Preprocessing
Real world datasets tend to be noisy and inconsistent. Therefore, to overcome these problems, data mining techniques using
statistical methods were applied to the e-CIBS data. Customer
data with “estimated” monthly kWh consumption (cases where
meter readers are unable to record meter readings due to customers not being present at their premises) were transformed
into “normal” consumption values to smooth out inconsistencies. The transformation was accomplished by statistically
averaging the previous normal consumption of a customer with
the duration in months of the consecutive following estimated
consumption. By applying this technique, estimated consumption values were discarded and replaced with their respective
normal consumption values.
As the e-CIBS data alone is insufficient to build the fraud detection model (FDM), the High Risk data was preprocessed for extraction of useful information. SQL was applied to
the High Risk data to group all individual fraud cases of each customer into a single record, which resulted in 32 972 fraud customers from the 105 525 fraud cases detected. Thus, on
average each fraud customer was detected 3.2 times, which indicates a
high rate of repetitive fraud. To utilize this beneficial information, data mining techniques using SQL were applied to preprocess the High Risk data for transformation into Table I. The new
data attributes obtained in Table I, detection count and last detection date, provide useful information for the development of
the FDM.
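The grouping of individual fraud cases into one record per customer, with a detection count and a last detection date, can be sketched as follows. The paper performed this step in SQL; the Python below is an illustrative equivalent, and the input layout and sample values are hypothetical.

```python
from datetime import date

def summarize_fraud_cases(cases):
    """Collapse (customer_id, detection_date) cases into one record per customer.

    Returns {customer_id: (detection_count, last_detection_date)}.
    """
    summary = {}
    for customer_id, detected_on in cases:
        count, last = summary.get(customer_id, (0, None))
        if last is None or detected_on > last:
            last = detected_on
        summary[customer_id] = (count + 1, last)
    return summary

# Hypothetical sample cases (not taken from the High Risk data).
cases = [
    ("A", date(2005, 3, 1)),
    ("B", date(2007, 6, 15)),
    ("A", date(2008, 1, 20)),
]
summary = summarize_fraud_cases(cases)
print(summary["A"])  # two detections, the latest on 2008-01-20

# Average detections per fraud customer reported in the text.
average_detections = 105525 / 32972
print(round(average_detections, 1))
```

Note that the average is a mean over a skewed distribution: a few customers with many detections raise it even when most customers were detected only once or twice.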
TABLE I
CUSTOMER INFORMATION PREPROCESSED IN HIGH RISK DATA
D. Feature Selection and Extraction
Features were selected from the preprocessed e-CIBS data in
order to build the C-SVM classifier. From the 25-month kWh
consumption data, daily average kWh consumption values, corresponding to features, were computed for each customer. These
features were calculated using the following expression:
$$x_h = \frac{P_{h+1}}{D_{h+1} - D_h}, \quad h = 1, 2, \ldots, 24 \tag{19}$$
where $P_{h+1}$ represents the monthly kWh consumption of the following
month and $D_{h+1} - D_h$ represents the difference in days
between the meter reading dates of the following and current months.
Using (19), 24 features (i.e., 24 daily average kilowatt-hour
(kWh) consumption values) were calculated for each customer.
It is known that meter readings for each customer are recorded
on different dates of the month and are not always the same for
all customers, i.e., meters are not read exactly every 30/31 days
and there are longer or shorter durations in the number of days.
As meter reading dates affect the monthly kWh consumption
recorded for each customer, the 24 daily average kWh consumption values computed using (19) reveal an accurate consumption history of the customers.
The 24 daily average kWh consumption values computed for
each customer correspond to customer load profiles. For a selected group of $M$ customers, each customer load profile is characterized by a vector
$x^{(m)} = \{x_h^{(m)},\ h = 1, \ldots, H\}$, where
$H = 24$ corresponds to 24 time domain intervals based on the
daily average kWh consumption values. Therefore, the whole
set of customer load profiles is represented by
$X = \{x^{(m)},\ m = 1, \ldots, M\}$.
The credit worthiness rating (CWR) was the other feature selected for the C-SVM classifier. Based on the data analysis of
fraud customers previously detected by TNBD, it was observed
that CWR contributed significantly towards customers committing fraud activities. CWR data is automatically generated from
TNBD’s billing system and is targeted to identify customers intentionally avoiding paying bills and delaying payments. In the
e-CIBS data, CWR is based on six integers ranging from 0 to
5, where 0 represents the minimum CWR and 5 represents the
maximum CWR. Since CWR changes based on the monthly
payment status of customers, the averaged CWR for each
customer over a period of 25 months was computed and used
as the additional feature. Therefore, 25 features were selected
to build the C-SVM classifier (i.e., 24 daily average kWh consumption features and 1 CWR feature).
E. Data Normalization
The feature data need to be represented on a normalized scale
for SVM training and validation. Therefore, all 25 modeling
features were normalized by using
$$NL = \frac{L - \min(L)}{\max(L) - \min(L)} \tag{20}$$
where $L$ represents the current consumption or CWR of the customer
and $\min(L)$ and $\max(L)$ represent the minimum and
maximum consumption in the load profile of the customer or
the minimum and maximum CWR throughout all customers.
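The feature computation in (19) and the min-max normalization in (20) can be sketched together as below. This is a minimal illustration, assuming that each monthly reading is stored with its meter reading date; the function names and the four-reading sample are hypothetical, and the handling of a flat profile (all values equal) is an assumption, since the paper does not discuss that case.

```python
from datetime import date

def daily_average_features(monthly_kwh, reading_dates):
    """Return x_h = P_{h+1} / (D_{h+1} - D_h) for each pair of consecutive readings.

    With 25 monthly readings this yields the 24 daily average kWh features.
    """
    features = []
    for h in range(len(monthly_kwh) - 1):
        days = (reading_dates[h + 1] - reading_dates[h]).days
        if days <= 0:
            raise ValueError("meter reading dates must be strictly increasing")
        features.append(monthly_kwh[h + 1] / days)
    return features

def min_max_normalize(values):
    """Scale values to [0, 1] using (v - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a flat profile is mapped to zeros to avoid division by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical readings: four months, each read 30 days after the previous one.
kwh = [300, 310, 620, 150]
dates = [date(2007, 1, 1), date(2007, 1, 31), date(2007, 3, 2), date(2007, 4, 1)]

features = daily_average_features(kwh, dates)
normalized = min_max_normalize(features)
print(features)
print(normalized)
```

For the load profile, the minimum and maximum are taken within each customer's own profile, whereas for the CWR feature they are taken across all customers, so `min_max_normalize` would be called on a different list in each case.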
Fig. 3. Normalized load profiles of two typical fraud customers over a period of two years.
Fig. 4. Normalized load profiles of two normal customers over a period of two years.
F. Feature Adjustment
The LIBSVM software [46] requires the C-SVM training and
testing (validation) data to be in a standard format with all feature values having labels. Feature labels are used by LIBSVM
in order to identify respective feature values during C-SVM
training and testing. Thus, all normalized features in the classifier were labeled, where labels were represented by integers.
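Normalized feature values with respective labels were represented as a LIBSVM feature file [46], denoted by a matrix
of the form
$$\begin{bmatrix} l_1{:}x_{1,1} & l_2{:}x_{1,2} & \cdots & l_J{:}x_{1,J} \\ l_1{:}x_{2,1} & l_2{:}x_{2,2} & \cdots & l_J{:}x_{2,J} \\ \vdots & \vdots & \ddots & \vdots \\ l_1{:}x_{M,1} & l_2{:}x_{M,2} & \cdots & l_J{:}x_{M,J} \end{bmatrix} \tag{21}$$
where $l_j$ represents the feature label, $x_{i,j}$ represents
the normalized feature value, $J$ indicates the last
feature, and $M$ indicates the number of customers.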
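For reference, one row of such a feature file can be produced as in the following sketch. It follows the LIBSVM text format, in which each line holds the class label followed by `index:value` pairs with integer indices starting at 1; the helper name and the three-feature sample are hypothetical.

```python
def to_libsvm_line(class_label, features):
    """Format one sample as '<class> 1:<v1> 2:<v2> ...' with 1-based feature labels."""
    # The 'g' format keeps up to six significant digits and drops trailing zeros.
    pairs = " ".join(f"{j}:{value:g}" for j, value in enumerate(features, start=1))
    return f"{class_label} {pairs}"

# Hypothetical sample: a Class 1 customer with three normalized feature values.
line = to_libsvm_line(1, [0.5, 1.0, 0.0])
print(line)  # 1 1:0.5 2:1 3:0
```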
G. Load Profile Inspection
This study constructs a two-class (binary) C-SVM classifier
in order to categorize two different types of customer load profiles. Firstly, manual inspection was performed on all TOE cases
listed in the KL Barat e-CIBS data to identify load profiles in
which abrupt changes appear (indicating irregularities in consumption characteristics). From all the TOE cases inspected,
only 53 cases (samples) were identified with the presence of abnormalities and fraud activities. These 53 samples were selected
and labeled as fraud suspects (Class 1). Fig. 3 shows the load
profiles of two typical fraud customers from the 53 fraud cases
identified.
Secondly, inspection was performed on a set of 500 load profiles with no TOE cases. From the load profiles inspected, 330
load profiles in which no abrupt changes or fraud activities appear were selected and labeled as normal customers (Class 2).
Load profiles of two normal customers are shown in Fig. 4.
Thus in total, 383 customer samples from both classes were used
to build the C-SVM classifier.
H. SVM Classifier and Optimization
Because the ratio between the two classes is unbalanced (Class 1
having 53 samples and Class 2 having 330 samples),
the C-SVM classifier was weighted in order to balance the
sample ratio. Weights were adjusted by calculating the sample
ratio for each class. This was achieved by dividing the total
number of classifier samples by the number of samples in each class.
In addition, class weights were multiplied by a factor of 100 to
achieve satisfactory weight ratios for C-SVM training.
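The weighting rule described above reduces to simple arithmetic, sketched here in Python. The function name is hypothetical; the sample counts are those given in the text.

```python
def class_weights(class_counts, scale=100):
    """Weight each class by scale * (total samples / samples in that class)."""
    total = sum(class_counts.values())
    return {label: scale * total / count for label, count in class_counts.items()}

# Class 1 (fraud) has 53 samples and Class 2 (normal) has 330 samples.
weights = class_weights({1: 53, 2: 330})
print(round(weights[1], 2), round(weights[2], 2))
```

The minority (fraud) class receives a weight about 6.2 times larger than the normal class, which is the inverse of the 53:330 sample ratio. In LIBSVM, per-class weights of this kind are supplied through the `-wi` training option, which scales the penalty parameter $C$ for class $i$.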
The optimum classification accuracy of the C-SVM classifier was estimated by optimizing the RBF kernel parameter,
$\gamma$, and the error penalty parameter, $C$. For this study, the grid
search method proposed by Hsu et al. in [44] was used. The
training engine proposed for C-SVM parameter optimization
is shown in Fig. 5. In the grid search method, exponentially
growing sequences of parameters $(C, \gamma)$ were used to identify
C-SVM parameters obtaining the best cross-validation (CV) accuracy.
Sequences of values of $C$ and $\gamma$ were combined to form the grid of parameter pairs. For each pair of
$(C, \gamma)$, validation
performance was measured by training 67% of the classifier data
and testing the remaining 33%. This procedure was repeated 100
times consecutively for tenfold CV, where every time training
and testing data was selected in a random order. The reason for
using tenfold CV was to ensure that the classifier does not overfit
the training data. Experimentally, it was found that the optimal
pair $(C, \gamma)$ obtained the highest tenfold CV
training accuracy of 86.43%. The detection hitrate at this CV
accuracy was theoretically calculated to be 77.41%. The accuracy of the C-SVM classifier is calculated using the following
expression:
$$\text{Accuracy} = \frac{N_c}{N_t} \times 100\% \tag{22}$$
where $N_c$ represents the number of samples correctly classified
by the C-SVM and $N_t$ represents the total number of samples
used for testing. The detection hitrate of the FDM is theoretically defined by the following expression:
$$\text{Hitrate} = \frac{N_f}{N_s} \times 100\% \tag{23}$$
where $N_f$ represents the number of samples correctly classified as fraud cases by the C-SVM and labeled as fraud cases by
TNBD and $N_s$ represents the total number of samples classified
as fraud cases by the C-SVM.
Fig. 5. Flowchart of the training engine proposed for C-SVM parameter
optimization.
During classifier training, C-SVM pairwise probability information defined in (18) was calculated additionally in order to
estimate the probabilities of the classified customers. The probability estimates (decision values) of the classified data provide
additional information for the selection of suspected customers
from the classification results.
The total number of SVs in the C-SVM classifier after
training was 160, with Class 1 and Class 2 having 42 and 118
SVs respectively. The maximum value of the optimal solution
of the dual SVM problem in (10) was calculated by LIBSVM
to be 133.5426. The offset parameter of the
decision function in (8) [44] was computed to be
0.6185
on the last (464th) training iteration.
I. Pilot Testing and SVM Classification
TABLE II
TEST BED USED FOR THE FRAUD DETECTION MODEL
Pilot testing for the developed FDM was carried out using the
e-CIBS and High Risk data for three towns in the state of Kelantan in Malaysia, which are listed in Table II. As seen from
Table II, the percentage of fraud customers detected by TNBD
in the past eight years is less than 1% of the total number of customers in each town. These towns have a high rate of fraud activities (indicated by TNBD informers) estimated around 35%;
thus, pilot testing was conducted for these towns using the developed FDM. The FDM validation engine implemented for the
detection of suspected customers is shown in Fig. 6.
J. Data Postprocessing
Data postprocessing involved integrating (correlating) the
classification results with the e-CIBS and High Risk data as
indicated in Fig. 6. The classification results include class labels
and probability estimates of the tested customers, which are
correlated with the customer data using SQL techniques. After
integration of the classification results, a detection report is
generated. This detection report shortlists suspected customers
from the testing data based on the abnormalities and fraud
activities detected by the FDM.
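The correlation step described above can be illustrated with a small SQL join. The sketch below uses Python's built-in sqlite3 module with an in-memory database; the table names, column names, and sample rows are hypothetical and do not reflect the actual e-CIBS or High Risk schemas.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        town TEXT,
        high_risk INTEGER
    );
    CREATE TABLE svm_results (
        customer_id INTEGER,
        predicted_class INTEGER,      -- assumed coding: 1 = suspected fraud
        fraud_probability REAL
    );
""")

# Hypothetical customer records and classification results.
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(101, "Town A", 0), (102, "Town A", 1), (103, "Town B", 0)],
)
conn.executemany(
    "INSERT INTO svm_results VALUES (?, ?, ?)",
    [(101, 0, 0.12), (102, 1, 0.91), (103, 1, 0.67)],
)


def detection_report(connection, fraud_class=1):
    """Join classification results to customer data and list the suspects."""
    query = """
        SELECT c.customer_id, c.town, c.high_risk, r.fraud_probability
        FROM customers AS c
        JOIN svm_results AS r ON r.customer_id = c.customer_id
        WHERE r.predicted_class = ?
        ORDER BY r.fraud_probability DESC
    """
    return connection.execute(query, (fraud_class,)).fetchall()


report = detection_report(conn)
for row in report:
    print(row)
```

Ordering the report by probability estimate is a design choice of this sketch; it simply puts the most likely suspects first for onsite inspection.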
V. EXPERIMENTAL RESULTS
Feedback of the pilot testing results obtained from TNBD for
onsite inspection of the three towns indicated that an average
detection hitrate (percentage of customers detected with abnormalities and fraud activities from the shortlisted suspects) of
26% was achieved. The detection hitrate of 26% obtained was
inclusive of 7% abnormalities and 19% fraud activities. Abnormalities include the following.
1) replaced meters;
2) abandoned houses;
3) change of tenants;
4) faulty meter wiring.
In order to improve the detection hitrate of the FDM, a decision making system utilizing structured query language (SQL)
was implemented in the data postprocessing stage of the FDM.
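An SQL-based decision making step of this kind might look like the following sketch, again using Python's sqlite3 module. The thresholds, the `consumption_drop` attribute, and the rule itself are hypothetical placeholders; the source does not list the actual parameter values or filtering rules.

```python
import sqlite3

# Hypothetical thresholds, chosen only for illustration.
MIN_FRAUD_PROBABILITY = 0.7
MIN_CONSUMPTION_DROP = 0.4


def select_suspects(rows, min_prob=MIN_FRAUD_PROBABILITY,
                    min_drop=MIN_CONSUMPTION_DROP):
    """Keep customers whose probability estimate and profile pass the rule."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE correlated ("
        "customer_id INTEGER, fraud_probability REAL, "
        "consumption_drop REAL, high_risk INTEGER)"
    )
    conn.executemany("INSERT INTO correlated VALUES (?, ?, ?, ?)", rows)
    query = """
        SELECT customer_id
        FROM correlated
        WHERE fraud_probability >= ?
          AND (consumption_drop >= ? OR high_risk = 1)
        ORDER BY customer_id
    """
    suspects = [r[0] for r in conn.execute(query, (min_prob, min_drop))]
    conn.close()
    return suspects


# Hypothetical correlated records:
# (customer_id, fraud_probability, consumption_drop, high_risk)
rows = [
    (201, 0.95, 0.60, 0),  # high probability and large consumption drop
    (202, 0.55, 0.70, 1),  # probability estimate below the threshold
    (203, 0.80, 0.10, 0),  # small drop and not on the high-risk list
    (204, 0.75, 0.05, 1),  # small drop but on the high-risk list
]
suspects = select_suspects(rows)
print(suspects)
```

Passing the thresholds as query parameters keeps the rule easy to retune when new back billing data suggests different cut-off values.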
The decision making system is employed in the FDM to select only customers with a high possibility of fraud from the correlated data. It is based on parameter values from the load profiles of customers, the preprocessed e-CIBS and High Risk data, and the C-SVM classification results (probability estimates). Parameter values of the decision making system were determined by inspecting load profiles of previously identified fraud customers using TNBD's back billing data. This was achieved by determining the common characteristics differentiating the normal cases (see Fig. 4) from the cases with fraud activities (see Fig. 3).
The decision making system improved the detection hitrate of the FDM significantly. With its use, the detection hitrate on the pilot testing results improved from 26% to 64%. This increase of 38 percentage points resulted from the inclusion of human knowledge and expertise in the FDM. Therefore, the desired detection hitrate of 60% was achieved.
The only limitation of the developed fraud detection system is that customers who committed fraud activities before the two-year period will not be detected by the FDM, since the C-SVM is not trained for such instances. Only two years of customer data were provided by TNBD because of problems associated with retrieving archived data older than two years. This is why a marginally low detection hitrate was obtained: the customer data provided was not sufficient to backtrack most of the customer consumption history.
Fig. 6. Flowchart of the FDM validation engine for the detection of suspected
customers (customers with abnormalities and fraud activities).
VI. CONCLUSION
This paper presents a new approach towards NTL detection in power utilities using an artificial intelligence based technique. A range of NTL sources such as fraud activities (meter tampering, meter bypassing, etc.) and abnormalities have been considered. The present study applies a pattern classification technique in order to detect and identify load consumption patterns of fraud customers. The framework proposed for NTL detection uses SVM for classification based on historical customer consumption data.
Experimental results obtained indicate that the proposed FDM can be used for reliable detection of abnormalities and fraud activities within electricity supply utilities. The method of using SVM for detection of fraud customers has proven to be very promising. Firstly, SVM has nonlinear dividing hypersurfaces, which give it high discrimination. Secondly, SVM provides good generalization ability for unseen data classification. These properties enable the SVM to handle complex classification problems with ease and good accuracy.
The current actions implemented by TNBD for NTL detection achieve a detection hitrate of 3%. Our developed fraud detection system will guarantee TNBD a detection hitrate of 60%. This will not only improve TNB's handling of NTLs but also complement its existing practices, and it is envisaged that tremendous savings will result from the use of this system.
VII. FUTURE WORK
Our future work will implement fuzzy logic as the backbone for intelligent decision making in selecting suspicious customers with a high possibility of fraud. With the inclusion of a fuzzy inference system (FIS), SQL filtering will be replaced with human knowledge and intelligence. In addition, a genetic algorithm (GA) will be used as an optimization tool to determine the most suitable C-SVM parameters for the dual Lagrangian optimization problem. The GA will be combined with the SVM in order to implement a hybrid SVM-GA FDM. These changes will further improve the accuracy of the fraud detection system.
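As a rough illustration of the GA-based parameter search proposed above, the sketch below evolves two SVM parameters with a very small genetic algorithm. Everything in it is an assumption made for illustration: the two parameters are taken to be C and an RBF kernel width gamma searched on a log2 scale, the search ranges are borrowed from common grid-search practice, and `toy_cv_accuracy` is a hypothetical stand-in for a real tenfold CV run, which would train and test a C-SVM for each candidate.

```python
import random

# Assumed search ranges for log2(C) and log2(gamma).
BOUNDS = [(-5.0, 15.0), (-15.0, 3.0)]


def toy_cv_accuracy(log2_c, log2_gamma):
    """Hypothetical surrogate for CV accuracy; its peak is placed at (3, -4)."""
    return 100.0 - (log2_c - 3.0) ** 2 - (log2_gamma + 4.0) ** 2


def clip(value, low, high):
    """Keep a gene inside its search range."""
    return max(low, min(high, value))


def run_ga(fitness, pop_size=20, generations=30, mutation_scale=0.5, seed=0):
    """Evolve (log2 C, log2 gamma) pairs; return the best pair and its history."""
    rng = random.Random(seed)
    population = [
        tuple(rng.uniform(lo, hi) for lo, hi in BOUNDS) for _ in range(pop_size)
    ]
    history = []
    for _ in range(generations):
        elite = max(population, key=lambda ind: fitness(*ind))
        history.append(fitness(*elite))
        next_population = [elite]  # elitism: the best individual always survives
        while len(next_population) < pop_size:
            # Binary tournament selection for each parent.
            parent_a = max(rng.sample(population, 2), key=lambda ind: fitness(*ind))
            parent_b = max(rng.sample(population, 2), key=lambda ind: fitness(*ind))
            child = []
            for (lo, hi), a, b in zip(BOUNDS, parent_a, parent_b):
                weight = rng.random()
                # Blend crossover followed by Gaussian mutation.
                gene = weight * a + (1.0 - weight) * b + rng.gauss(0.0, mutation_scale)
                child.append(clip(gene, lo, hi))
            next_population.append(tuple(child))
        population = next_population
    best = max(population, key=lambda ind: fitness(*ind))
    history.append(fitness(*best))
    return best, history


best, history = run_ga(toy_cv_accuracy)
best_c, best_gamma = 2.0 ** best[0], 2.0 ** best[1]
print(best_c, best_gamma, history[-1])
```

Because the best individual is carried over unchanged, the recorded best fitness never decreases from one generation to the next. In a real hybrid SVM-GA setup the fitness call would dominate the run time, so caching fitness values per individual would be worthwhile.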
ACKNOWLEDGMENT
The authors would like to thank TNB Distribution (TNBD)
Sdn. Bhd. for providing the customer data.
REFERENCES
[1] R. Jiang, H. Tagaris, A. Lachsz, and M. Jeffrey, “Wavelet based feature extraction and multiple classifiers for electricity fraud detection,”
in Proc. IEEE/Power Eng. Soc. Transmission and Distribution Conf.
Exhibit. Asia Pacific, Oct. 6–10, 2002, vol. 3, pp. 2251–2256.
[2] C. R. Paul, “System loss in a metropolitan utility network,” Power Eng.
J., vol. 1, no. 5, pp. 305–307, Sep. 1987.
Authorized licensed use limited to: UNIVERSITY TENAGA NASIONAL. Downloaded on May 31,2010 at 05:14:09 UTC from IEEE Xplore. Restrictions apply.
1170
[3] N. Tobin and N. Sheil, “Managing to reduce power transmission system
losses,” in Transmission Performance. Dublin, Ireland: Publ. Electricity Supply Board Int., 1987.
[4] R. L. Sellick and C. T. Gaunt, “Load data preparation for losses estimation,” in Proc. 7th Southern African Universities Power Engineering
Conf. , Stellenbosch, South Africa, 1998, vol. 7, pp. 117–120.
[5] I. E. Davidson, A. Odubiyi, M. O. Kachienga, and B. Manhire, “Technical loss computation and economic dispatch model in T&D systems
in a deregulated ESI,” Power Eng. J., vol. 16, no. 2, pp. 55–60, Apr.
2002.
[6] Annual Report Tenaga Nasional Berhad 2004 TNB, 2004.
[7] T. B. Smith, “Electricity theft: A comparative analysis,” Energy Policy,
vol. 32, pp. 2067–2076, 2004.
[8] M. S. Alam, E. Kabir, M. M. Rahman, and M. A. K. Chowdhury,
“Power sector reform in Bangladesh: Electricity distribution system,”
Energy, vol. 29, pp. 1773–1783, 2004.
[9] A. Kumar and D. D. Saxena, “Decision priorities and scenarios for
minimizing electrical power loss in an India power system network,”
Elect. Power Compon. Syst., vol. 31, pp. 717–727, 2003.
[10] M. A. Ram and M. Shrestha, “Environmental and utility planning implications of electricity loss reduction in a developing country: A comparative study of technical options,” Int. J. Energy Res., vol. 22, pp.
47–59, 1998.
[11] A. H. Nizar, Z. Y. Dong, and Y. Wang, “Power utility nontechnical loss
analysis with extreme learning machine model,” IEEE Trans. Power
Syst., vol. 23, no. 3, pp. 946–955, Aug. 2008.
[12] I. E. Davidson, “Evaluation and effective management of nontechnical
losses in electrical power networks,” in Proc. 6th Africon Conf. Africa,
Oct. 2–4, 2002, vol. 1, pp. 473–477.
[13] R. Mano, R. Cespedes, and D. Maia, “Protecting revenue in the distribution industry: A new approach with the revenue assurance and audit
process,” in Proc. IEEWPES Transmission & Distribution Conf. Expo.:
Latin America, Nov. 2004, pp. 218–223.
[14] M. V. K. Rao and S. H. Miller, “Revenue improvement from intelligent
metering systems,” in Proc. 9th Int. Conf. Metering and Tariffs for Energy Supply, Birmingham, U.K., Aug. 1999, pp. 218–222.
[15] A. J. Dick, “Theft of electricity—How UK electricity companies detect and deter,” in European Convention on Security and Detection,
Brighton, U.K., May 16–18, 1995, pp. 90–95.
[16] J. W. Fourie and J. E. Calmeyer, “A statistical method to minimize electrical energy losses in a local electricity distribution network,” in Proc.
7th IEEE AFRICON Conf. Africa: Technology Innovation, Gaborone,
Botswana, Sep. 15–17, 2004, vol. 2, pp. 667–673.
[17] J. E. Cabral, J. O. P. Pinto, E. M. Gontijo, and J. R. Filho, “Fraud detection in electrical energy consumers using rough sets,” in Proc. IEEE
Int. Conf. Systems, Man and Cybernetics, Oct. 10–13, 2004, vol. 4, pp.
3625–3629.
[18] J. Bilbao, E. Torres, P. Egufa, J. L. Berasategui, and J. R. Saenz, “Determination of energy losses,” in Proc. 16th Int. Conf. Exhibition on
Electricity Distribution (CIRED) 2001, Amsterdam, The Netherlands,
vol. 5, 4 pp. vol. 5-.
[19] J. R. Filho, E. M. Gontijo, A. C. Delaiba, E. Mazina, J. E. Cabral, and
J. O. P. Pinto, “Fraud identification in electricity company customers
using decision trees,” in Proc. 2004 IEEE Int. Conf. Systems, Man and
Cybernetics, Oct. 10–13, 2004, vol. 4, pp. 3730–3734.
[20] A. H. Nizar, Z. Y. Dong, J. H. Zhao, and P. Zhang, “A data mining
based NTL analysis method,” in Proc. IEEE Power Eng. Soc. General
Meeting, Tampa, FL, Jun. 24–28, 2007, pp. 1–8.
[21] J. R. Galvan, A. Elices, A. Munoz, T. Czernichow, and M. A. SanzBobi, “System for detection of abnormalities and fraud in customer
consumption,” in Proc. 12th Conf. Electric Power Supply Industry, Pattaya, Thailand, Nov. 1998.
[22] A. H. Nizar, Z. Y. Dong, and J. H. Zhao, “Load profiling and data
mining techniques in electricity deregulated market,” presented at the
IEEE Power Eng. Soc. General Meeting, Montreal, QC, Canada, Jun.
18–22, 2006, paper 06GM0828.
[23] A. H. Nizar, Z. Y. Dong, M. Jalaluddin, and M. J. Raffles, “Load
profiling method in detecting non-technical loss activities in a power
utility,” in Proc. IEEE 1st Int. Power and Energy Conf., Putrajaya,
Malaysia, Nov. 28–29, 2006, pp. 82–87.
[24] D. Gerbec, S. Gasperic, I. Smon, and F. Gubina, “Allocation of the
load profiles to consumers using probabilistic neural networks,” IEEE
Trans. Power Syst., vol. 20, no. 2, pp. 548–555, May 2005.
[25] S. V. Allera and A. G. Horsburgh, “Load profiling for the energy
trading settlements in the UK electricity markets,” in Presentation
Report From Distribu-TECH Europe DA/DSM Conf., London, U.K.,
Oct. 27–29, 1998.
[26] Hydro and Thermal Power Business Team, Transmission and Distribution Loss Reduction Services #167 Samsung-dong, Kangnamgu,
Seoul, 135-791, Korea.
[27] V. N. Vapnik, Statistical Learning Theory. New York: Wiley, 1998.
IEEE TRANSACTIONS ON POWER DELIVERY, VOL. 25, NO. 2, APRIL 2010
[28] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector
Machines. Cambridge, MA: Cambridge Univ. Press, 2000.
[29] E. Osuna, “Applying SVMs to face detection,” IEEE Intell. Syst. Mag.,
Support Vector Machines, vol. 13, no. 4, pp. 23–26, Jul./Aug. 1998, M.
A. Hearst, ed.
[30] S. Dumais, “Using SVMs for text categorization,” IEEE Intell. Syst.
Mag., Support Vector Machines, vol. 13, no. 4, pp. 21–23, Jul./Aug.
1998, M. A. Hearst, ed.
[31] G. Dror, R. Sorek, and S. Shamir, “Accurate identification of alternatively spliced exons using support vector machine,” Bioinformatics,
vol. 21, no. 7, pp. 897–901, Apr. 2005.
[32] C. J. C. Burges, “A tutorial on support vector machines for pattern
recognition,” Data Mining Knowl. Discovery, vol. 2, no. 2, pp.
121–167, 1998.
[33] M. E. Mavroforakis and S. Theodoridis, “A geometric approach to Support Vector Machine (SVM) classification,” IEEE Trans. Neural Netw.,
vol. 17, no. 3, pp. 671–682, May 2006.
[34] K. Veropoulos, “Machine learning approaches to medical decision
making,” Ph.D. dissertation, Faculty Eng., Dept. Eng. Math., Univ.
Bristol, Bristol, U.K., 2001.
[35] K. Vojislav, Learning and Soft Computing—Support Vector Machines,
Neural Networks and Fuzzy Logic Models. Cambridge, MA: MIT
Press, 2001.
[36] S. Amari and S. Wu, “Improving support vector machine classifiers by
modifying kernel functions,” Neural Netw., vol. 12, pp. 783–789, 1999,
vol. 6.
[37] V. Vapnik, The Nature of Statistic Learning Theory. New York:
Springer-Verlag, 1995.
[38] B. Schölkopf and A. J. Smola, Learning With Kernels: Support Vector
Machines, Regularization, Optimization and Beyond. Cambridge,
MA: MIT Press, 2001.
[39] M. A. Oskoei and H. Hu, “Support vector machine-based classification
scheme for myoelectric control applied to upper limb,” IEEE Trans.
Biomed. Eng., vol. 55, no. 8, pp. 1956–1965, Aug. 2008.
[40] D. Wang, D. S. Yeung, and E. C. Tsang, “Weighted mahalanobis distance kernels for support vector machines,” IEEE Trans. Neural Netw.,
vol. 18, no. 5, pp. 1453–1462, Sep. 2007.
[41] P.-H. Chen, C.-J. Lin, and B. Schölkopf, “A tutorial on nu-support
vector machines,” Appl. Stochast. Models in Bus. Ind., vol. 21, pp.
111–136, 2005.
[42] C. Cortes and V. Vapnik, “Support-vector networks,” Mach. Learn.,
vol. 20, pp. 273–297, Sep. 1995, vol. 3.
[43] V. Cherkassky and Y. Ma, “Practical selection of SVM parameters
and noise estimation for SVM regression,” Neural Netw., vol. 17, pp.
113–126, 2004, vol. 1.
[44] C.-W. Hsu, C.-C. Chang, and C.-J. Lin, “A practical guide to support vector classification,” Dept. Comput. Sci., National Taiwan Univ.,
Taipei, Taiwan, 2003, Tech. Rep..
[45] T. F. Wu, C.-J. Lin, and R. C. Weng, “Probability estimates for multiclass classification by pairwise coupling,” J. Mach. Learn. Res., vol. 5,
pp. 975–1005, 2004.
[46] C.-C. Chang and C.-J. Lin, LIBSVM: A Library for Support Vector Machines. [Online]. Available: http://www.csie.ntu.edu.tw/~cjlin/libsvm
Jawad Nagi was born in Karachi, Pakistan, on
March 23, 1985. He received the Bachelor’s and
Master’s degrees (Hons.) in electrical and electronics, and electrical engineering from Universiti
Tenaga Nasional (UNITEN), Malaysia, in 2007 and
2009, respectively.
Currently, he is a Project Engineer in the Power Engineering Centre (PEC) at UNITEN. His research interests include pattern recognition, image processing,
load forecasting, fuzzy logic, neural networks, and
support vector machines.
Keem Siah Yap received the Bachelor’s and
Master’s degrees from Universiti Teknologi
Malaysia (UTM) in 1998 and 2000, respectively, and
is currently pursuing the Ph.D. degree in electronic
engineering at the Universiti Sains Malaysia (USM).
He is a Senior Lecturer in the Department of Electronic and Communication Engineering, Universiti
Tenaga Nasional, Selangor, Malaysia. His research
interests include theory and application of artificial
intelligence, digital image processing, computer
vision systems, and robotics.
Sieh Kiong Tiong (M’02) received the Bachelor’s
and Master’s degrees in electrical, electronic, and
system engineering from the Universiti Kebangsaan
Malaysia (UKM) in 1997 and 2000, respectively,
and the Ph.D. degree in mobile communication from
the UKM in 2006.
He is currently the Head of Electronic and Information Technology Unit of Power Engineering
Centre (PEC) in the Universiti Tenaga Nasional
(UNITEN), Selangor, Malaysia. He is also an Associate Professor in the Electronic and Communication
Engineering Department of UNITEN. His research interests include digital
electronics, microprocessor systems, artificial intelligence, and mobile cellular
systems.
Malik Mohamad received the Bachelor’s degree in
electrical engineering (Hons.) from the University of
Brighton, Brighton, U.K.
He is a Senior Manager with TNB Research
(TNBR) Sdn. Bhd. He has led numerous distribution
projects in TNB, particularly in Information Systems
and Metering. Currently he is a Project Director of
seven research groups in TNBR Sdn. Bhd. dealing
with clients in Sabah Electricity Sdn. Bhd. and TNB
Transmission and Distribution division. He has more
than 25 years of field and management experience
in the electricity industry.
Mr. Mohamad is a professional engineer registered with the Board of
Engineers Malaysia (BEM), a corporate member of the Institute of Engineers
Malaysia (IEM), and a certified competent engineer (33 kV) with Energy
Commission Malaysia.
Syed Khaleel Ahmed (M’98) received the Bachelor’s degree in electrical and electronics engineering
from Anna University, Chennai, India, in 1988, and
the Master’s degree in electrical and computer
engineering from the University of Massachusetts,
Amherst, in 1994.
He is currently a Senior Lecturer in the Electronic
and Communication Engineering Department,
Universiti Tenaga Nasional, Selangor, Malaysia.
His research interests include robust control, fuzzy
logic and control, neural networks, robotics, signal
processing, and numerical analysis.