A Distributed Data Mining Framework for Just-In-Time Electricity Services

Aldo Dagnino and Lan Lin
US Corporate Research Center, ABB Inc.
{aldo.dagnino, lan.lin}@us.abb.com

Abstract

We propose a distributed data mining framework for supporting real-time decision making in electric power systems. The framework consists of computational grid-based Web services that provide data mining functionality on geographically distributed computers on power grids. We intend to apply this framework to problems in power systems such as blackout prevention, power quality monitoring, and asset management.

1. Introduction

As a paradigm shift takes place in the electric power industry, actions and decisions made in micro-level processes contribute more and more to significant changes at the macro level [3]. These changes may yield unconventional gains in value, such as the value of prediction, the value of look-ahead decision making, and the value of IT-enabled cooperation. The new opportunities brought by this change, however, are not supported by current practices in operation, design, and planning in the industry. The interactions among and within electric utilities have become so complex that they are beyond the human ability to control. Because economic factors prohibit it, robust operation can no longer be ensured solely through design; it needs to be supported by just-in-time (JIT) and just-in-place (JIP) decision making, which in turn is assisted by more and more on-line sensing, monitoring, and software-based operations. There is substantial demand for value-added information acquisition and exchange through distributed JIT and JIP services at various spatial, temporal, and contextual layers of electric power systems, such as demand, generation, and transmission and distribution.
Intelligent mechanisms embedded in different layers of the system through sensors, actuation, and interactive communications are used to collect and analyze data and to incorporate the acquired knowledge into real-time decision-support procedures. Useful information extracted from raw data is exchanged interactively throughout the system to help fulfill system-wide performance objectives.

Electric power systems are complex distributed systems. Data collected by sensors and measurement devices in these systems provides great opportunities for large-scale, data-driven knowledge acquisition, with the potential for a better understanding of the systems. However, because the data sources of interest are often geographically dispersed and usually large, gathering all of the data into a central repository is in general undesirable and infeasible given the bandwidth and storage requirements. Thus, there is a need for knowledge acquisition systems that can perform the necessary data analysis at distributed locations and transmit the knowledge acquired from the data to the locations where it is needed.

We propose a framework for distributed data mining and knowledge discovery with the aim of supporting JIT and JIP decision-making processes. The computational grid-based infrastructure provides Web services for conducting data mining tasks on geographically distributed data sources so as to discover useful information for real-time decision support. By leveraging distributed computing resources to reduce on-line data processing time and network costs, this framework is suitable for providing JIT and JIP electricity services.

2. Distributed Data Mining Issues

Data mining [2] is the process of extracting valid, previously unknown, comprehensible, and useful information from large data sets. It can be applied to discover interesting patterns and structures in the large amounts of data collected in power systems.
In power systems, there are three main sources of data [4]: (1) field data collected by various devices distributed throughout the system, (2) centralized data archives, such as those maintained by SCADA systems, and (3) data from simulations carried out in planning or operation environments. The following characteristics of data in power systems make data mining techniques applicable: (1) the large scale of power systems, where state variables can number in the thousands, (2) statistical and temporal data measures spanning milliseconds to minutes, hours, weeks, and years, (3) a mixture of discrete and continuous features, such as discrete topology changes and continuous analog state variables, (4) the need for data visualization for interaction with domain experts, (5) time constraints on fast on-line decision making, and (6) the existence of uncertainties, such as noise and outliers.

Because the cost of transmitting data across networks increases with the size of the data, distributed data mining becomes advantageous: it analyzes data where they reside rather than in a centralized collection. The shift toward intrinsically distributed, complex computing environments has raised a range of challenges in data mining. In addition to the distributed locations of data, data types are becoming increasingly complex due to the variety of data collection mechanisms of embedded intelligent devices. To complicate matters further, incremental or on-line mining methods are needed so as to reduce the workloads, because the data being collected are fast changing, massive, and potentially infinite in volume. Providing these features in distributed data mining systems calls for novel solutions. It is also crucial to ensure scalability and interactivity as the data mining infrastructure continues to grow substantially in size and complexity.
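The combination of on-line processing and distributed analysis described above can be illustrated with a minimal sketch. It is not part of the proposed framework; it assumes each site keeps a running summary of a measurement stream (Welford's single-pass update) and that summaries are combined with the standard pairwise merge formula, so that only three numbers per site cross the network. The class name `RunningStats`, the helper `summarize`, and the voltage readings are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Count, mean, and sum of squared deviations (m2) for one data stream."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0

    def update(self, x: float) -> None:
        # Welford's single-pass update: no raw samples are stored.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def merge(self, other: "RunningStats") -> "RunningStats":
        # Pairwise combination of two summaries; raw data never moves.
        n = self.n + other.n
        if n == 0:
            return RunningStats()
        delta = other.mean - self.mean
        mean = self.mean + delta * other.n / n
        m2 = self.m2 + other.m2 + delta * delta * self.n * other.n / n
        return RunningStats(n, mean, m2)

    @property
    def variance(self) -> float:
        # Population variance; 0.0 when no data has been seen.
        return self.m2 / self.n if self.n else 0.0


def summarize(readings):
    """Build a local summary by consuming readings one at a time."""
    stats = RunningStats()
    for x in readings:
        stats.update(x)
    return stats


# Hypothetical per-unit voltage readings at two substations.
site_a = summarize([1.00, 1.02, 0.98])
site_b = summarize([1.01, 0.99, 1.03, 0.97])
combined = site_a.merge(site_b)
```

Here `combined` matches the statistics that a central repository holding all seven readings would compute, although each site transmits only its count, mean, and `m2`.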
The design of data mining algorithms for distributed systems, therefore, needs to pay more attention to the distributed features of data as well as to the distributed resources for computation and communication. Approaches need to scale with respect to the distributed processing of data; the integration of many data sources, including real-time data from power grids and various utility databases; the features of the processed data; and human factors. Algorithm design for distributed data mining involves the design of efficient, scalable, disk-based, parallel, and distributed algorithms for large-scale data sets, with the challenge of scaling to thousands of attributes and millions of transactions. The techniques of interest cover all major categories of data mining methods, such as association rules, sequence patterns, classification, clustering, and anomaly detection, as well as various data preprocessing and post-processing tasks, such as sampling, feature selection, data reduction and transformation, and interactive visualization. Although distributed algorithms for classifiers, frequent patterns, and clustering already exist, efficient methods for mining complex data types such as stream data still need to be developed. In particular, we are interested in designing scalable data analysis methods, such as genetic algorithms and graph-based classification, for distributed data.

3. Grid-Based Web Service Architecture

The two most important requirements for data-intensive and computation-intensive tasks in distributed data mining are the synthesis of useful knowledge from large amounts of data and the execution of large-scale, complex computations that leverage distributed platforms. We select computational grids as the underlying infrastructure to support the integration of the two.
Grid computing is a form of distributed computing that uses open standards and protocols to share heterogeneous resources, such as hardware and software architectures and computer languages, located in geographically different locations. Computational grids have been designed to support applications that can benefit from distribution, collaboration, data sharing, high performance, and complex interaction of autonomous and geographically dispersed resources. In the past decade, a few computational grid-based data mining systems have been proposed, including abstract models, general frameworks, and domain-specific implementations [1].

We choose a technology based on the Service Oriented Architecture (SOA) for our framework because the next generation of grid computing and the Web is being designed and implemented based on the SOA model. The SOA is a programming model for building flexible, modular, and interoperable software applications by distributing application and system functions among multiple computers or domains. The Open Grid Services Architecture (OGSA) is an implementation of the SOA model within the grid context. It defines an architectural model for grid systems in which distributed resources and applications are modeled as interactive Web services. A Web service is a software component that can be accessed by remote entities using standard Internet protocols. An important characteristic of Web services is the independence of the service interface from the implementation of the functions. Web services are used in grid computing as uniform interfaces for accessing remote resources and composing distributed applications independently of their geographical location and domain-specific implementation. Data-intensive knowledge discovery applications can take advantage of grid computing technologies for managing distributed data and knowledge and for achieving high performance.
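The independence of a service interface from its implementation can be sketched as follows. This is an in-process illustration only, not an OGSA or Web service implementation: the contract `EventCountService`, the two site classes, the event names, and the "timestamp,event_type" line layout are all hypothetical assumptions. The point is that the composing function depends only on the interface.

```python
from collections import Counter
from typing import Iterable, List, Protocol


class EventCountService(Protocol):
    """Hypothetical service contract; callers depend only on this signature."""

    def count_events(self) -> Counter:
        ...


class InMemorySite:
    """One possible implementation: events held in a Python list."""

    def __init__(self, events: List[str]) -> None:
        self._events = events

    def count_events(self) -> Counter:
        return Counter(self._events)


class CsvLineSite:
    """Another implementation: events parsed from comma-separated lines."""

    def __init__(self, lines: Iterable[str]) -> None:
        self._lines = list(lines)

    def count_events(self) -> Counter:
        counts: Counter = Counter()
        for line in self._lines:
            # Assumed layout: "timestamp,event_type"
            _, event_type = line.strip().split(",", 1)
            counts[event_type] += 1
        return counts


def aggregate(services: Iterable[EventCountService]) -> Counter:
    """Compose results from any services that satisfy the contract."""
    total: Counter = Counter()
    for service in services:
        total.update(service.count_events())
    return total


# Two sites with different internal storage, accessed uniformly.
sites = [
    InMemorySite(["sag", "swell", "sag"]),
    CsvLineSite(["2006-01-01T00:00,sag", "2006-01-01T00:05,interruption"]),
]
totals = aggregate(sites)
```

In a real deployment the contract would be expressed in a service description and invoked over standard Internet protocols; the sketch only shows why a client written against the interface need not change when a site changes how it stores or processes its data.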
Our focus is on distributed data mining services that allow data analysts at distributed locations to conduct high-level data mining in a standard and reliable manner. The systems issues to be considered are the actual implementations of data mining algorithms on a variety of hardware platforms, including parallel hardware platforms, networks of workstations, and geographically distributed systems. The key challenges are to integrate heterogeneous data sources, find suitable data layouts, improve load balancing and locality, and minimize synchronization and communication.

4. Problems to Be Studied in Power Systems

The functions of a power grid with embedded intelligence can be classified into two categories: real-time functions and non-real-time functions [5]. In the real-time category, distributed sensing improves the observability of the power grid for grid device status and health monitoring, failure detection and localization, power quality and reliability monitoring, and safety and security monitoring. In the non-real-time category, functions include the integration of new and existing utility databases so as to fuse operational data with financial and other data in support of operational optimization, asset utilization maximization and replacement optimization, strategic planning and capital expenditure planning, maximization of customer satisfaction, optimization of system performance metrics, and regulatory reporting. Although electric utilities already have many of the data sources needed to support these functions, the data sources are usually geographically distributed and thus hard to combine into one central repository. In addition, the operational data are usually stored in separate SCADA systems that are not ready to support data analysis or business intelligence tools.
By providing a common integration point and an enterprise service bus, an intelligent power grid can enable data mining for strategic planning support and real-time displays, such as daily asset profitability and asset failure system risk, for utility executives.

Some of the problems in the above categories to which we want to apply distributed data mining techniques are as follows. First, we want to use classification methods to build distributed prediction models for unstable components of power grids. Second, we want to apply supervised and unsupervised learning strategies in data mining to diagnose power quality disturbance problems. Third, we want to investigate efficient methods for processing huge amounts of data and generating real-time recommendations to operators for ensuring Quality of Service as system conditions change. Fourth, we want to tackle the asset management problem in the electric utility industry by building asset parametric models in order to select an optimal set of assets for replacement or upgrade under budget constraints. Fifth, we want to discover spatial and temporal patterns of variation at key load centers so as to better facilitate the deployment of large-scale intermittent energy sources and price-responsive demand. Sixth, we want to study patterns of unexpected network response to trade among utilities that may cause widespread backbone effects of equipment failures and lead to cascading system-wide failures. Seventh, we want to design distributed performance metrics associated with new technologies for various layers of power grids, and identify IT and software capable of accommodating these performance metrics.

References

[1] M. Cannataro and D. Talia. The Knowledge Grid. Communications of the ACM, 46(1):89–93. 2003.
[2] U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI Press / MIT Press. 1996.
[3] M. Jelinek and M. Ilic. Strategic Framework for Electric Technologies: Technology and Institutional Factors and IT in a Deregulated Industry. NSF Workshop. 2000.
[4] C. Olaru, P. Geurts, and L. Wehenkel. Data Mining Tools and Applications in Power System Engineering. Proc. of the 13th Power Systems Computation Conference, pp. 324–330. 1999.
[5] J. Taft. The Intelligent Power Grid. IBM. 2006.