
A Distributed Data Mining Framework for Just-In-Time Electricity Services
Aldo Dagnino and Lan Lin
US Corporate Research Center, ABB Inc.
{aldo.dagnino, lan.lin}@us.abb.com
Abstract
We propose a distributed data mining framework for
supporting real-time decision making in electric power
systems. The framework consists of computational grid-based
Web services that provide data mining functionality on
geographically distributed computers in power grids. We
intend to apply this framework to problems in power
systems such as blackout prevention, power quality
monitoring, and asset management.
1. Introduction
As a paradigm shift takes place in the electric
power industry, actions and decisions made in micro-level
processes contribute more and more to significant
changes at the macro level [3]. This may yield
unconventional gains in value, such as the value of prediction,
the value of look-ahead decision making, and the value of
IT-enabled cooperation. The new opportunities brought by this
change, however, are not supported by current practices
in operation, design, and planning in the industry. The
interactions among and within electric utilities have
become so complex that they are beyond human ability
to control. Because economic factors prohibit it,
robust operation can no longer be ensured through
design alone; it needs to be supported by just-in-time (JIT) and
just-in-place (JIP) decision making, which in turn is
assisted by more and more on-line sensing, monitoring, and
software-based operations.
There is substantial demand for value-added information
acquisition and exchange through distributed JIT and JIP
services at various spatial, temporal, and contextual layers
of electric power systems, such as demand, generation,
and transmission and distribution. Intelligent mechanisms
embedded in different layers of the system through sensors,
actuation, and interactive communications are used to
collect and analyze data and to incorporate the acquired
knowledge into real-time decision-support procedures.
Useful information extracted from raw data is exchanged
interactively throughout the system to help
fulfill system-wide performance objectives.
Electric power systems are complex distributed
systems. Data collected by sensors and measurement
devices in these systems provide great opportunities for
large-scale, data-driven knowledge acquisition, with the
potential for a better understanding of the systems. But because
the data sources of interest are often geographically
dispersed and usually large, gathering all of the
data into a central repository is generally undesirable and
infeasible because of the bandwidth and storage requirements.
Thus, there is a need for knowledge acquisition systems
that can perform the necessary data analysis at distributed
locations and transmit the knowledge acquired from the
data to the locations where it is needed.
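As a minimal sketch of this idea, and not the implementation proposed in this paper, each site could reduce its raw measurements to a small summary that can be merged exactly elsewhere, so that only the summary crosses the network. The names `SiteSummary`, `summarize`, and `merge`, and the readings, are hypothetical; the merge step uses the standard pairwise formula for combining means and sums of squared deviations.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class SiteSummary:
    """Compact statistics a site transmits instead of raw readings."""
    count: int
    mean: float
    m2: float  # sum of squared deviations from the mean

    @property
    def variance(self) -> float:
        """Population variance; 0.0 for an empty summary."""
        return self.m2 / self.count if self.count else 0.0


def summarize(readings: Iterable[float]) -> SiteSummary:
    """Reduce one site's raw readings to a summary, locally."""
    values = list(readings)
    if not values:
        return SiteSummary(0, 0.0, 0.0)
    mean = sum(values) / len(values)
    m2 = sum((v - mean) ** 2 for v in values)
    return SiteSummary(len(values), mean, m2)


def merge(a: SiteSummary, b: SiteSummary) -> SiteSummary:
    """Combine two summaries without access to the raw data."""
    if a.count == 0:
        return b
    if b.count == 0:
        return a
    count = a.count + b.count
    delta = b.mean - a.mean
    mean = a.mean + delta * b.count / count
    m2 = a.m2 + b.m2 + delta * delta * a.count * b.count / count
    return SiteSummary(count, mean, m2)


# Hypothetical readings held at two separate sites.
site_a = summarize([1.0, 2.0, 3.0])
site_b = summarize([4.0, 5.0])
combined = merge(site_a, site_b)
print(combined.count, combined.mean, combined.variance)
```

Each site sends three numbers regardless of how many readings it holds, which is the bandwidth argument made above in its simplest form.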
We propose a framework for distributed data mining
and knowledge discovery with the aim of supporting JIT
and JIP decision-making processes. The computational
grid-based infrastructure provides Web services for
conducting data mining tasks on geographically distributed
data sources so as to discover useful information for real-time decision support. By leveraging the power of
distributed computing resources to improve on-line data
processing time and reduce network costs, this framework
is suitable for providing JIT and JIP electricity services.
2. Distributed Data Mining Issues
Data mining [2] is the process of extracting valid,
previously unknown, comprehensible, and useful
information from large data sets. It can be applied to
discovering interesting patterns and structures in the large
amounts of data collected in power systems. In power
systems, there are three main sources of data [4]: (1) field
data collected by various devices distributed throughout
the system, (2) centralized data archives, such as those
maintained by SCADA systems, and (3) simulation data
generated in planning or operation environments. The
following characteristics of data in power systems make
data mining techniques applicable: (1) the large scale of
power systems where state variables can be in the
thousands, (2) statistical and temporal data measures from
milliseconds to minutes, hours, weeks, and years, (3) a
mixture of discrete and continuous features, such as
discrete topology changes and continuous analog state
variables, (4) the need for data visualization for interaction
with domain experts, (5) time constraints on fast on-line
decision making, and (6) the existence of uncertainties, such as
noise and outliers.
Because the cost of transmitting data across networks
increases with the size of the data, distributed data mining
becomes advantageous: it analyzes data in a distributed
manner rather than in a centralized collection. The shift
toward intrinsically distributed, complex computing
environments has raised a range of challenges in data
mining. In addition to the distributed locations of data, data
types are becoming increasingly complex because of the
variety of data collection mechanisms in embedded intelligent
devices. To complicate matters further, incremental
or on-line mining methods are needed to reduce
the workload, because the data being collected are fast-changing,
massive, and potentially infinite in volume. Providing
these features in distributed data mining systems calls for
novel solutions. It is also crucial to ensure scalability and
interactivity as data mining infrastructure continues to
grow substantially in size and complexity.
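One way to picture an incremental method is a statistic that is updated one reading at a time, so that no history is stored or rescanned. The sketch below uses Welford's online algorithm for the running mean and variance, combined with a simple z-score test for outliers. It is an illustration only: the class name `OnlineDetector`, the threshold, the warm-up length, and the readings are assumptions, not part of the proposed framework.

```python
import math


class OnlineDetector:
    """Flags readings far from the running mean, using O(1) memory."""

    def __init__(self, threshold: float = 3.0, warmup: int = 5) -> None:
        self.threshold = threshold  # z-score above which a reading is flagged
        self.warmup = warmup        # readings to absorb before flagging
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0               # running sum of squared deviations

    def update(self, x: float) -> bool:
        """Score x against the statistics so far, then absorb it."""
        is_outlier = False
        if self.count >= max(self.warmup, 2):
            std = math.sqrt(self.m2 / (self.count - 1))
            if std > 0 and abs(x - self.mean) / std > self.threshold:
                is_outlier = True
        # Welford's update: one pass, no stored history.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)
        return is_outlier


# Hypothetical stream of readings with one disturbance.
stream = [1.00, 1.01, 0.99, 1.02, 0.98, 1.00, 1.50, 1.01]
detector = OnlineDetector()
flags = [detector.update(x) for x in stream]
print(flags)
```

In this sketch a flagged reading is still absorbed into the running statistics; a deployed detector might exclude it or use a decaying window, which is a design choice this example leaves open.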
The design of data mining algorithms for distributed
systems therefore needs to pay more attention to the
distributed nature of the data as well as to the distributed
resources for computation and communication.
Approaches need to scale with respect to the distributed processing
of data, the integration of many data sources (including real-time data from power grids and various utility databases),
the features of the processed data, and human factors.
The algorithm design for distributed data mining
involves the design of efficient, scalable, disk-based,
parallel and distributed algorithms for large-scale data sets,
with the challenge of scalability to thousands of attributes
and millions of transactions. The techniques of interest
cover all major categories of data mining methods, such as
association rules, sequence patterns, classification,
clustering, and anomaly detection, as well as various data pre-processing and post-processing tasks, such as sampling,
feature selection, data reduction and transformation, and
interactive visualization. Although distributed algorithms
for classifiers, frequent patterns, and clustering already
exist, efficient methods for mining complex data types
such as stream data still need to be developed. In particular,
we are interested in designing scalable data analysis
methods, such as genetic algorithms and graph-based
classification, for distributed data.
3. Grid-Based Web Service Architecture
Data-intensive and computation-intensive tasks in distributed
data mining have two main requirements: synthesizing useful knowledge from
large amounts of data, and performing large-scale, complex
computations that leverage distributed platforms. We
select computational grids as the underlying
infrastructure to support the integration of the two. Grid
computing is a form of distributed computing that uses
open standards and protocols to share heterogeneous
resources, such as hardware and software architectures
and computer languages, located in geographically
different locations. Computational grids have been
designed to support applications that can benefit from
distribution, collaboration, data sharing, high performance,
and complex interaction of autonomous and geographically
dispersed resources. In the past decade, a few
computational grid-based data mining systems have been
proposed, including abstract models, general frameworks,
and domain-specific implementations [1].
We choose a technology based on the Service-Oriented Architecture (SOA) for our framework because the
next generation of grid computing and the Web is being
designed and implemented based on the SOA model. The
SOA is a programming model for building flexible,
modular, and interoperable software applications by
distributing application and system functions among
multiple computers or domains. The Open Grid Services
Architecture (OGSA) is an implementation of the SOA
model within the grid context. It defines an architectural
model for grid systems in which distributed resources and
applications are modeled as interactive Web services. A
Web service is a software component that can be accessed
by remote entities using standard Internet protocols. An
important characteristic of Web services is the independence of
the service interface from the implementation of the
functions. Web services are used in grid computing as
uniform interfaces for accessing remote resources and
composing distributed applications independently from
their geographical location and domain-specific
implementation. Data-intensive knowledge discovery
applications can take advantage of grid computing
technologies for managing distributed data and knowledge
and achieving high performance. Our focus is on
distributed data mining services that allow data analysts at
distributed locations to conduct high-level data mining in
a standard and reliable manner.
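The independence of a service interface from its implementation can be illustrated with a small in-process sketch. It does not use OGSA or any real Web service stack; the interface `MiningService`, the class `InMemorySite`, the task name `count_above`, and the helper `run_everywhere` are all hypothetical names introduced for illustration. The point is only that the caller depends on the interface, so each site is free to implement it differently.

```python
from typing import Dict, List, Protocol, Sequence


class MiningService(Protocol):
    """Hypothetical service contract: callers see only this interface."""

    def mine(self, task: str, params: Dict[str, float]) -> Dict[str, object]:
        ...


class InMemorySite:
    """One possible implementation, backed by a local list of readings."""

    def __init__(self, name: str, readings: Sequence[float]) -> None:
        self.name = name
        self._readings = list(readings)

    def mine(self, task: str, params: Dict[str, float]) -> Dict[str, object]:
        if task == "count_above":
            limit = params["limit"]
            hits = sum(1 for r in self._readings if r > limit)
            return {"site": self.name, "count": hits}
        raise ValueError(f"unsupported task: {task}")


def run_everywhere(
    sites: Sequence[MiningService], task: str, params: Dict[str, float]
) -> List[Dict[str, object]]:
    """Send the same request to every site and collect the results."""
    return [site.mine(task, params) for site in sites]


# Hypothetical sites; a real deployment would call remote endpoints.
sites = [InMemorySite("a", [1.0, 2.0, 3.0]), InMemorySite("b", [5.0])]
results = run_everywhere(sites, "count_above", {"limit": 1.5})
print(results)
```

Replacing `InMemorySite` with a class that queries a database or a remote endpoint would not require any change to `run_everywhere`.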
The systems issues to be considered are the actual
implementations of data mining algorithms on a variety of
hardware platforms, including parallel hardware platforms,
networks of workstations, and geographically distributed
systems. The key challenges are to integrate
heterogeneous data sources, find suitable data layouts,
improve load balancing and locality, and minimize
synchronization and communication.
4. Problems to Be Studied in Power Systems
The functions of a power grid with embedded
intelligence can be classified into two categories: real-time
functions and non-real-time functions [5]. In the real-time
category, distributed sensing improves the observability of
the power grid for grid device status and health monitoring,
failure detection and localization, power quality and
reliability monitoring, and safety and security monitoring.
In the non-real-time category, functions include the
integration of new and existing utility databases so as to
fuse operational data with financial and other data to
support operational optimization, asset utilization
maximization and replacement optimization, strategic
planning and capital expenditure planning, maximization
of customer satisfaction, optimization of system
performance metrics, and regulatory reporting. Although
electric utilities already have many of the data sources
needed to support these functions, the data sources are
usually distributed across locations and thus hard to combine into one
central repository. Also, the operational data are usually
stored in separate SCADA systems that are not ready to
support data analysis or business intelligence tools. By
providing a common integration point and an enterprise
service bus, an intelligent power grid can enable data
mining for strategic planning support and real-time
displays such as daily asset profitability and asset failure
system risk for utility executives.
Some of the problems in the above categories to which we
want to apply distributed data mining techniques are as
follows. First, we want to use classification methods to
build distributed prediction models for unstable
components of power grids. Second, we want to apply
supervised and unsupervised learning strategies in data
mining to diagnose power quality disturbance problems.
Third, we want to investigate efficient methods for
processing huge amounts of data and generating real-time
recommendations to operators to ensure Quality of
Service as system conditions change. Fourth, we want to
tackle the asset management problem in the electric utility
industry by building parametric asset models in order to
select an optimal set of assets for replacement or upgrade
under budget constraints. Fifth, we want to discover spatial
and temporal patterns of variation at key load centers so
as to better facilitate the deployment of large-scale
intermittent energy sources and price-responsive
demand. Sixth, we want to study patterns of unexpected
network response to utilities trade that may cause widespread backbone effects of equipment failures, which can lead
to cascading system-wide failures. Seventh, we want to
design distributed performance metrics associated with
new technologies for the various layers of power grids, and
identify IT and software capable of accommodating these
performance metrics.
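The fourth problem, selecting assets for replacement or upgrade under a budget, resembles a 0/1 knapsack problem once each asset has an estimated cost and benefit. The sketch below is one possible formulation and not a method prescribed in this paper. It assumes integer costs and a single benefit score per asset, which a parametric asset model would have to supply; the function name `select_assets`, the asset names, and all numbers are invented for illustration.

```python
from typing import List, Tuple

Asset = Tuple[str, int, float]  # (name, cost, estimated benefit)


def select_assets(assets: List[Asset], budget: int) -> Tuple[float, List[str]]:
    """Pick the asset subset with maximum total benefit within the budget."""
    # best[b] holds (benefit, chosen names) for a spending limit of b.
    best: List[Tuple[float, List[str]]] = [(0.0, []) for _ in range(budget + 1)]
    for name, cost, benefit in assets:
        # Iterate budgets downward so each asset is used at most once.
        for b in range(budget, cost - 1, -1):
            candidate = best[b - cost][0] + benefit
            if candidate > best[b][0]:
                best[b] = (candidate, best[b - cost][1] + [name])
    return best[budget]


# Hypothetical assets: (name, replacement cost, benefit score).
assets = [
    ("transformer_T1", 4, 10.0),
    ("breaker_B7", 3, 8.0),
    ("line_L2", 2, 4.0),
    ("relay_R5", 1, 3.0),
]
total_benefit, chosen = select_assets(assets, budget=6)
print(total_benefit, chosen)
```

The dynamic program takes time proportional to the number of assets times the budget, so costs would need to be expressed in coarse units for large budgets.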
References
[1] M. Cannataro and D. Talia. The Knowledge Grid.
Communications of the ACM. 46(1):89–93. 2003.
[2] U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and
R. Uthurusamy. Advances in Knowledge Discovery and Data
Mining. AAAI Press / MIT Press. 1996.
[3] M. Jelinek and M. Ilic. Strategic Framework for Electric
Technologies: Technology and Institutional Factors and IT in a
Deregulated Industry. NSF Workshop, 2000.
[4] C. Olaru, P. Geurts, and L. Wehenkel. Data Mining Tools and
Applications in Power System Engineering. Proc. of the 13th
Power Systems Computation Conference, pp. 324–330. 1999.
[5] J. Taft. The Intelligent Power Grid. IBM. 2006.