Multi-Layered Framework for Distributed Data Mining
Masum Serazi, Amal Perera, Qiang Ding, Vasiliy Malakhov, William Perrizo
North Dakota State University, Computer Science Department, Fargo, ND 58105, USA
{Md.Serazi, Amal.Perera, Qiang.Ding, Vasiliy.Malakhov, William.Perrizo}@ndsu.nodak.edu
Tel: 1.701.231.6404
Topic Areas: Distributed Intelligent Systems, Software Architecture

Abstract
Demand for data mining applications on the web is increasing. As data sets grow, there is also a demand for scalable, generic solutions. Distributed computing can provide both scalability and generic data mining models. In this paper we propose a multi-layered framework for a distributed data mining system. A multi-layered architecture can take advantage of the latest advances in hardware to provide efficient solutions, and it allows new data mining and data capture components to be added easily to the basic system. A multi-layered architecture also facilitates an iterative development process. We show the use of a high-end data mining engine with generic data mining models on the server side and a client that captures the user's requirements over the web.

1. Introduction and Background:
In an increasing number of disciplines, large data collections are emerging as important resources. In domains as diverse as global climate change, high energy physics, and computational genomics, the volume of interesting data is already measured in terabytes and will soon reach petabytes. The communities of users that need to access and mine this data (often using sophisticated and computationally expensive techniques) are often large and are almost always geographically distributed. The computing and storage resources that these communities rely upon to store this data are also diverse and distributed [MBM99].
This combination of large dataset size, diversity of data, geographic distribution of users and resources, and computationally intensive result generation creates demands that existing stand-alone data mining approaches do not satisfy. There is an emerging trend towards delivering data mining services over the web [SN00]. The focus is on providing data mining models as services over the internet. The important issues raised in this context are standards for describing generic data mining models, integrating models, personalizing models, capture and integration of data, exchange of messages, description of task requests, estimation of compute time, efficient computing on massive data sets, and data transfer versus process transfer [SN00, MBM99, CDG+98, GKM+99, RWL+00, KPH99, CGG+02].
There have been several attempts at large-scale distributed data mining. The Kensington project [CDG+98] mines enterprise data distributed across the internet. The Papyrus project [GKM+99] is a distributed data mining system developed for clusters and super-clusters of workstations. It is composed of four software layers: data management, data mining, predictive modeling, and agents. Papyrus is based on mobile agents implemented in Java. Another Java-based distributed data mining suite is PaDDMAS [RWL+00], a component-based tool set that integrates pre-developed or custom packages (which can be sequential or parallel) using a dataflow approach. JAM [SPT97] is an agent-based distributed data mining system developed to mine data stored at different sites; it builds so-called meta-models as a combination of several models learned at the sites where the data are stored. JAM uses Java applets to move data mining agents to remote sites. BODHI [KPH99] is a project for collective data mining with an emphasis on learning from vertically partitioned data. Discovery Net [CGG+02] provides an architecture for building and managing KDD processes on a Grid.
Most of these projects are implemented as prototypes. In this paper we attempt to address some of the major issues related to distributed data mining. The solutions suggested were implemented as part of DataMIME™.
From a user's point of view, the basic requirement for providing data mining models as services over the internet is the ability to efficiently use generic, customizable data mining tools on a wide array of data sets over the internet. Three types of architectural models for distributed data mining are described in the literature: client-server, agent-based, and hybrid. Each approach has its own advantages and disadvantages. In this paper we describe a client-server model. A client-server model is characterized by the presence of one or more data mining servers. Clients at different locations submit data and data mining requests, which are brought to the server for execution. Once execution is done, the results are presented to the client. There are several reasons to choose a client-server model for this approach. One major reason is the ability to use high performance computing on the server side to do the data mining.
In this paper we describe the use of Ptrees¹, a distributed, compressed, vertical, data-mining-ready data structure. Ptrees have been shown to be scalable on a wide array of data mining applications [DDP02, KDP02, PSP02, PDD+03, PWR+03]. The use of Ptrees on the server side implies converting the data to a uniform data structure. Having a uniform data structure promotes optimization and eases the development of generic data mining components.
With massive data sets and computationally demanding algorithms, data mining demands efficient computing. With the rapid increase in hardware performance, a generic mining engine that can efficiently serve multiple mining demands is useful. Most current data mining applications are developed for a particular data structure and optimized for a particular platform.
It is an advantage to be able to develop applications independently of the underlying data structure and still expect them to execute efficiently. The data mining applications and the data structure are separated by an application programming interface (API). This allows the Ptree data structure to be optimized with respect to its functionality without any dependency on the data mining algorithms. The API also allows the independent development of data mining algorithms.
To mine a diverse set of data sets, the system must be able to capture any type of data. There are two options for the user. The user can convert the data to a format that the system can already capture, or the user can write a data feeder by implementing the required interface for the new data file type.
In many situations data mining requires “grayware”: human input is needed to tune the data and the algorithms to suit the task. The client architecture allows the user to specify the mining request with a wide array of flexible parameters. The generic modules on the server side are invoked according to the user request. Client-server communication facilitates the transfer of information.
The above discussion shows the importance of having distinct layers in providing the basic requirements for a client-server based distributed data mining system. The layers identified are the uniform data structure, the data capture and integration layer, the data mining interface, the data mining algorithms, client-server communication, and the client interface. In the next section we describe these layers in detail and how they are organized in the system. In Section 3 we describe the characteristics of the prototype system that was developed to validate the proposed layered architecture.
¹ Patents are pending on the P-tree technology. This work is partially supported by GSA Grant ACT#: K96130308.
2. System Architecture
In this section we explain the proposed layered architecture using an example system, DataMIME™, that was developed as a proof-of-concept. DataMIME™ is an efficient and scalable data mining system that provides the flexibility of plugging in new data mining applications when needed. Clients interact with the DataMIME™ system to capture their data and convert it into the Ptree format, after which they can apply different data mining applications. The data converter and all the data mining applications execute on the server side. Figure 1 depicts a conceptual view of the system.
Figure 1: The conceptual design-level view of the system. (Client side: capture dataset to DataMIME™, integrate data to DataMIME™, mine on DataMIME™, system performance analysis. Server side, reached over the Internet: a master server and a number of slave servers.)

2.1 Server Architecture:
The multi-threaded, concurrent, and distributed DataMIME™ server has a layered architecture: DCI/DII, DMI, DMA, DPMI, and Ptree data. Figure 2 describes the organization of the layers.
Figure 2: Server side architecture of the system. (DCI/DII layer with room for new feeders; DMA layer with already plugged-in algorithms and plugs for new algorithms; DMI layer; DPMI: Distributed Ptree Management Interface; distributed Ptree data.)
Ptree (Data Structure): In this system, we use the Ptree as the internal representation of the data. Ptrees are tree-like data structures that store relational data in a column-wise, bit-compressed format by splitting each attribute into bits (i.e., representing each attribute value by its binary equivalent), grouping together all bits in each bit position for all tuples, and representing each bit group by a Ptree. Ptrees are lossless and are structured to facilitate efficient data mining processes. Once the data is represented using Ptrees, no scans of the database are needed to perform the data mining task at hand.
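The vertical decomposition described above can be illustrated with a minimal Python sketch. It only shows the bit-slicing step and the fact that it is lossless; it stores each bit group as an uncompressed bit vector (a Python integer) and does not attempt the tree compression of an actual Ptree. The function names `to_bit_slices` and `from_bit_slices` are hypothetical and are not part of the DataMIME™ API.

```python
from typing import List


def to_bit_slices(values: List[int], width: int) -> List[int]:
    """Split one attribute column into `width` vertical bit slices.

    Slice j (j = 0 is the most significant bit) is a Python int used as a
    bit vector: bit i of that int is 1 when tuple i has a 1 in position j.
    This is an uncompressed stand-in for one Ptree per bit position.
    """
    slices = []
    for j in range(width):
        shift = width - 1 - j
        vector = 0
        for i, value in enumerate(values):
            if (value >> shift) & 1:
                vector |= 1 << i
        slices.append(vector)
    return slices


def from_bit_slices(slices: List[int], n_tuples: int) -> List[int]:
    """Rebuild the original column, showing the decomposition is lossless."""
    values = []
    for i in range(n_tuples):
        value = 0
        for vector in slices:  # most significant slice first
            value = (value << 1) | ((vector >> i) & 1)
        values.append(value)
    return values


# A 3-bit attribute for four tuples: 101, 010, 111, 101
column = [5, 2, 7, 5]
slices = to_bit_slices(column, width=3)
restored = from_bit_slices(slices, n_tuples=len(column))
```

Each slice covers all tuples for a single bit position, which is what allows the logical operations discussed next to work on whole columns at once.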
Data querying can be achieved through logical operations such as AND, OR, and NOT, referred to as the Ptree algebra in the literature. Various aspects of Ptrees, including representation, querying, algebra, speed, and compression, are discussed in greater detail in [DKR+02].
DCI/DII (Data Capture and Data Integration Interfaces): This layer allows the user to capture data and integrate it into the required format (the Ptree format). The main component of this layer is the feeder. An individual feeder processes one particular format of incoming data. If no existing feeder can process a particular format, the user can write a new feeder and plug it into the architecture.
DMI (Data Mining Interface): DMI provides counting, the most important data mining operation offered by Ptrees, over basic Ptrees, value Ptrees, tuple Ptrees, interval Ptrees, and cube Ptrees. DMI also provides the Ptree algebra, which has four operations, AND, OR, NOT (complement), and XOR, to implement point-wise logical operations on Ptrees for the Data Mining Algorithms (DMA) layer.
DPMI (Distributed Ptree Management Interface): The DPMI layer provides access, location, and concurrency transparency. It hides the facts that data representations may differ and resource access protocols may vary, that resources may be located in different places, and that resources may be shared by several competing users.
DMA (Data Mining Algorithm) Layer: This layer is a collection of data mining tools (algorithms). Upon receiving a request from the client side, the corresponding algorithm is invoked for mining. This layer depends on DMI for accessing meta-information and the required counts. Ptree-based K Nearest Neighbor (PKNN) [KDP02], Podium Incremental Neighbor Evaluator (PINE) [PDD+03], and P-BAYESIAN [PSP02] are available as built-in algorithms in the current DataMIME™ system. The architecture has the flexibility to plug in a new algorithm on this layer.
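The role of the DMI layer, algebra plus counting, can be sketched in Python under simplifying assumptions. The `BitVector` class below is a hypothetical, uncompressed stand-in for a Ptree, and the names `root_count` and `value_ptree` are illustrative rather than taken from the DataMIME™ API. The sketch shows how the number of tuples with a given attribute value can be obtained purely from AND and NOT on the bit slices, without scanning the tuples.

```python
from typing import List


class BitVector:
    """Uncompressed stand-in for a Ptree over n tuples (assumption)."""

    def __init__(self, bits: int, n: int):
        self.n = n
        self.bits = bits & ((1 << n) - 1)  # keep only the n valid positions

    def __and__(self, other: "BitVector") -> "BitVector":
        return BitVector(self.bits & other.bits, self.n)

    def __or__(self, other: "BitVector") -> "BitVector":
        return BitVector(self.bits | other.bits, self.n)

    def __xor__(self, other: "BitVector") -> "BitVector":
        return BitVector(self.bits ^ other.bits, self.n)

    def __invert__(self) -> "BitVector":
        # NOT (complement); the constructor masks off bits beyond n
        return BitVector(~self.bits, self.n)

    def root_count(self) -> int:
        """Number of 1 bits, i.e. the count of matching tuples."""
        return bin(self.bits).count("1")


def value_ptree(slices: List[BitVector], value: int) -> BitVector:
    """AND together slice j where bit j of `value` is 1, NOT slice j otherwise.

    `slices` is ordered most significant bit first.
    """
    width = len(slices)
    n = slices[0].n
    result = BitVector((1 << n) - 1, n)  # start with all tuples selected
    for j, s in enumerate(slices):
        bit = (value >> (width - 1 - j)) & 1
        result = result & (s if bit else ~s)
    return result


# Bit slices of the 3-bit column [5, 2, 7, 5], most significant bit first
slices = [BitVector(0b1101, 4), BitVector(0b0110, 4), BitVector(0b1101, 4)]
count_of_5 = value_ptree(slices, 5).root_count()
count_of_3 = value_ptree(slices, 3).root_count()
```

A classifier in the DMA layer would request counts of this kind through the DMI rather than touching the data directly, which is the separation the layered design aims for.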
2.2 Communication
In both the DCI and DMA protocols, a client creates a connection, sends a request, receives an answer, and closes the connection. A client sends only one request per single-threaded connection. In the DCI protocol, a request contains a text-based header, which may be followed by a set of binary files with a checksum for each file. The header contains a command to the server, the number of files, and, if the request contains files, information about each file (name and length). The response to a request is a single line with a message indicating the outcome of the request. A DMA protocol request has a similar structure: a header and an optional set of binary files with checksums. The header in the DMA protocol is a set of key/value pairs (properties), similar to a Java properties file, followed by a terminator. The response to a DMA protocol request also contains key/value pairs. Each request contains the property 'cmd' with the name of a command. Other properties may represent arguments of the requested command. Depending on the command name and its parameters, the server calls different data mining algorithms to handle the request.

2.3 Client Structure:
On the client side, DataMIME™ has a graphical user interface (GUI) for visual interaction with the user. The two main functionalities are:
Capturing: sending a set of data along with its meta-information to the DCI/DII layer of the server.
Mining: sending a request to the DMA layer to apply a data mining technique to a dataset that has already been captured, and displaying the result.
Figure 3: Client side architecture of the system: (a) Capturing (b) Mining. (Labels recoverable from the original figure: data, meta-data, meta-data generator, client-side DCI, unclassified data, prediction model, client-side DMA, visualization tool.)

2.4 GUI for client:
A client side graphical user interface has been developed and implemented to facilitate user interaction with the system.
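As an illustration of the DMA request header described in Section 2.2, the following Python sketch encodes and decodes a set of key/value properties. The paper does not specify the wire format, so several details here are assumptions: one `key=value` pair per line, UTF-8 text, and an empty line as the terminator. The command name and parameter names in the example are hypothetical, and the optional binary files and checksums are omitted.

```python
from typing import Dict

TERMINATOR = ""  # assumption: an empty line marks the end of the header


def encode_header(props: Dict[str, str]) -> bytes:
    """Serialize request properties as key=value lines plus a terminator."""
    if "cmd" not in props:
        raise ValueError("every request must carry a 'cmd' property")
    lines = []
    for key, value in props.items():
        if "=" in key or "\n" in key or "\n" in value:
            raise ValueError(f"unsupported characters in property {key!r}")
        lines.append(f"{key}={value}")
    lines.append(TERMINATOR)
    return ("\n".join(lines) + "\n").encode("utf-8")


def decode_header(raw: bytes) -> Dict[str, str]:
    """Parse key=value lines up to the terminator line."""
    props: Dict[str, str] = {}
    for line in raw.decode("utf-8").split("\n"):
        if line == TERMINATOR:
            break
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


# Hypothetical mining request: command name and parameters are illustrative
request = {"cmd": "classify", "dataset": "example_set", "k": "5"}
wire = encode_header(request)
parsed = decode_header(wire)
```

On the server side, the value of `cmd` in the parsed properties would select which data mining algorithm handles the request, with the remaining properties passed as its arguments.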
The following figures depict a capture instance and a classification instance of the system [DM03].
Figure 4: Client side graphical user interface: (a) Capturing (b) Mining

3. System Characteristics
Initially we raised certain issues related to providing scalable data mining services on the web. In the previous section we described a layered architecture that can address most of these issues. In this section we describe the characteristics of DataMIME™, a prototype system implemented as a proof-of-concept. To increase usability, we designed and implemented the system with an emphasis on extensibility and flexibility. We have developed a wide variety of functions and algorithms. Most algorithms have turned out to be superior to other well-known methods in terms of speed and/or accuracy. We summarize the characteristics as follows.
The system can handle formatted, record-based, relational-like data with numerical and/or categorical attributes. The data can be in text format, relational format, or TIFF image format. In addition, conversion from any other machine-readable format can be provided through customized feeders.
System users can perform data analysis and mining on data sets already in the system, or on any new data they capture or integrate into the system.
The system is capable of handling large quantities of data and mines them in scalable time.
Clients of the system can run on UNIX and Microsoft Windows (including 95, 98, NT, 2000, XP, and Server 2003) platforms, with the server designed to be a UNIX-only system.
The system supports major RDBMS platforms.
The system has an N-tier architecture that provides high flexibility.
The server engine can run on a single machine or be distributed across multiple computers for better scalability and efficiency.
The system can automate data ETL (extraction, transformation, and load) processes or let the users handle everything manually.
The system has an open architecture that provides a high degree of software extensibility and integration capability. Users can use the approaches provided by the system for association rule mining, classification, prediction, and similarity search. They can also write their own data mining algorithms using the Ptree API, then compile and deploy them in the DataMIME™ environment to observe the performance of their own algorithms.
With large amounts of data, data operations require time to process. The system provides a high level of asynchronous background operation, performing most data-intensive operations in the background or offline and allowing users to continue their work.
The system minimizes the flow of data across the network.

4. Conclusion
From a user's point of view, the basic requirement for providing data mining models as services over the internet is the ability to efficiently use generic, customizable data mining tools on a wide array of data sets over the internet. From a developer's point of view, the architecture should facilitate an iterative development process. This enables the integration of new components into the system. It also allows developers to take advantage of the latest developments in hardware. We identified a uniform, efficient, vertical data structure at the lowest layer that can take advantage of the latest hardware. We identified a data management layer that facilitates data distribution. We also identified a data mining layer and a data capture layer, each defined in the form of an API. The generic data mining models are developed on top of the data mining interface. The client is built on top of the communication layer to capture the user requirements for a particular job. In this paper we have shown the importance of a layered architecture for a distributed data mining system. Key requirements were identified in deciding on the different layers.
A prototype system was developed as a proof-of-concept to show the feasibility of the approach.

References:
[CDG+98] J. Chattratichat, J. Darlington, Y. Guo, S. Hedvall, M. Köhler, J. S. A. Saleem, and D. Yang, Deploying enterprise data mining on the internet, In PAKDD, 1998.
[CGG+02] V. Curcin, M. Ghanem, Y. Guo, M. Kohler, A. Rowe, J. Syed, and P. Wendel, Discovery Net: Towards a Grid of Knowledge Discovery, ACM KDD 2002.
[DDP02] Qin Ding, Qiang Ding, and W. Perrizo, Association Rule Mining on Remotely Sensed Images Using P-trees, Proceedings of PAKDD 2002, Taipei, Taiwan, May 6-8, 2002.
[DKR+02] Q. Ding, M. Khan, A. Roy, and W. Perrizo, The P-tree algebra, Proceedings of the ACM SAC, Symposium on Applied Computing, 2002.
[DM03] DataMIME™ Homepage, “http://midas.cs.ndsu.nodak.edu/~datasurg/datamime/”
[GKM+99] R. L. Grossman, S. Kasif, D. Mon, A. Ramu, and B. Malhi, The preliminary design of papyrus: A system for high performance, distributed data mining over clusters, meta-clusters and super-clusters, 1999.
[KDP02] M. Khan, Q. Ding, and W. Perrizo, K-nearest Neighbor Classification on Spatial Data Stream Using P-trees, Proceedings of PAKDD 2002, Springer-Verlag, Lecture Notes in Artificial Intelligence 2336, May 2002, pp. 517-528.
[KPH99] H. Kargupta, B. Park, D. Hershberger, and E. Johnson, Collective data mining: a new perspective toward distributed data mining, In H. Kargupta and P. Chan (eds.), Advances in Distributed and Parallel Knowledge Discovery, AAAI Press, 1999.
[MBM99] R. Moore, C. Baru, R. Marciano, A. Rajasekar, and M. Wan, Data-intensive computing, In I. Foster and C. Kesselman (eds.), The Grid: Blueprint for a Future Computing Infrastructure, Morgan Kaufmann Publishers, 1999, pp. 105-129.
[PDD+03] W. Perrizo, Qin Ding, A. Denton, K. Scott, Qiang Ding, and M. Khan, PINE - Podium Incremental Neighbor Evaluator for Spatial Data using Ptrees, ACM SAC 2003.
[PSP02] A. Perera, M. Serazi, and W. Perrizo, Performance Improvement for Bayesian Classification with Ptrees, CAINE'02, San Diego, Nov. 2002.
[PWR+03] F. Pan, B. Wang, D. Ren, X. Hu, and W. Perrizo, Proximal Support Vector Machine for Spatial Data Using Peano Trees, CAINE 2003.
[RWL+00] O. F. Rana, D. W. Walker, M. Li, S. Lynden, and M. Ward, PaDDMAS: parallel and distributed data mining application suite, Proc. International Parallel and Distributed Processing Symposium (IPDPS/SPDP), IEEE Computer Society Press, 2000, pp. 387-392.
[SN00] S. Sarawagi and S. H. Nagaralu, Data Mining Models as Services on the Internet, SIGKDD Explorations, Volume 2, Issue 1, June 2000.
[SPT97] S. J. Stolfo, A. L. Prodromidis, S. Tselepis, W. Lee, D. W. Fan, and P. K. Chan, JAM: Java agents for meta-learning over distributed databases, International KDD'97 Conference, 1997, pp. 74-81.