Distributed Information Systems - Project Final Report
Distributed Data Mining Systems in Java
D91725001 王春笙、D92725002 林俊甫、D92725001 王慧芬
January, 2004

Project Requirements

- Must involve multithreaded programming.
- Available middleware: CORBA, DCOM, or Java RMI.
- Must have components spread over different machines.
- Each team member must independently be in charge of some of the components and be fully responsible for their functionality.
- Must have a clear interface definition between components.
- Must handle failures. Failures to be considered: broken links and stop failures.

Grading criteria:

- System integration and collaboration: 10%
- Degree of fault tolerance (and scalability, if it applies): 10%
- Completeness: 10%

I. Introduction

Motivation

A weblog is a comprehensive web access log. It keeps track of activity on the site by month, week, day, and hour. A weblog holds much meaningful information, such as total hits, bytes transferred, page views, and the most popular pages. Data mining is a set of search techniques used to extract useful information hidden within a large set of data, such as a weblog. However, performing multi-layer data mining on a large data file such as a weblog is time-consuming. The sharing of resources, such as sharing CPUs for computation and search, is a main motivation for constructing distributed systems.

Goal

The goal of this project is to build a distributed data mining system in which components located at networked computers work collaboratively to mine meaningful information from a weblog. We expect the performance of weblog mining to improve with the combined computing resources. Besides, through fault-tolerance techniques, we believe the mining process will continue despite the crash of any component in the distributed system.

II. System Architecture

Figure 1 depicts the system architecture.

Prediction engine

A prediction engine that deduces future references on the basis of predictive prefetching can be implemented. After processing the past references, the prediction engine derives the probability of future access for the documents accessed so far. The prediction engine resides on the server side, and the mining of frequent web access patterns mainly takes place inside it.

Web Server

The server stores the web pages and records the web access log.

Request service module

This is usually a proxy server that requests mining results from the prediction engine. According to the most recent access patterns, this module prefetches the pages that the prediction engine predicts.

Web Client

The web client is the web browser.

Figure 1: System architecture (the prediction engine runs distributed mining across several nodes; the web server keeps the log files; the request service module sits between the prediction engine and the web clients)

III. Technological Infrastructure

Figure 2: System diagram (clients connected over a LAN to the server/coordinator, which dispatches mining data chunks to them)

The system consists of a server/coordinator and clients. The complete weblog is kept on the server. The server divides the weblog into data chunks and dispatches the chunks to the connected clients for mining.

The system design follows both transparency and scalability criteria. The design uses tools provided by Java: multithreading, RMI, multicast, and sockets. Besides, the system provides redundancy to cope with server or client failures. The rationale of the system covers three aspects. First, data mining is a computing-intensive task. Second, weblog data may be generated too quickly for a single computer to handle. Third, implementing the prediction engine in a distributed fashion has a fault-tolerance advantage. The two sketches below illustrate, under our own naming assumptions, how the component interface and the chunk split could look.
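The requirement of a clear interface definition between components could be met with Java RMI, which the design lists among its tools. The following is a minimal sketch of what such a remote interface might look like; the names MiningService, enroll, fetchChunk, and reportResult are our own illustrative choices, not taken from the report, and the actual system may exchange the same messages over plain sockets instead.

import java.rmi.Remote;
import java.rmi.RemoteException;
import java.util.Hashtable;
import java.util.List;

// Hypothetical remote interface between the server/coordinator and the
// mining clients (J2SE 1.4 style, hence the raw collection types).
public interface MiningService extends Remote {

    // A client registers itself with the coordinator and receives a
    // node identifier used in the shared connection table.
    int enroll(String clientHost) throws RemoteException;

    // The coordinator hands out the next undispatched weblog chunk,
    // or null when no work is left.
    List fetchChunk(int nodeId) throws RemoteException;

    // The client reports the patterns (with support counts) it mined
    // from its chunk; the coordinator merges these tables later.
    void reportResult(int nodeId, Hashtable patterns) throws RemoteException;
}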
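The report does not say how the weblog is cut into chunks; a simple line-based split is one plausible reading. The sketch below assumes it, with WeblogSplitter, splitWeblog, and chunkSize as invented names.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class WeblogSplitter {

    // Reads the weblog and groups its lines into fixed-size chunks that
    // the coordinator can dispatch to the connected clients one by one.
    public static List splitWeblog(String path, int chunkSize)
            throws IOException {
        List chunks = new ArrayList();
        BufferedReader in = new BufferedReader(new FileReader(path));
        try {
            List current = new ArrayList();
            String line;
            while ((line = in.readLine()) != null) {
                current.add(line);
                if (current.size() == chunkSize) {
                    chunks.add(current);
                    current = new ArrayList();
                }
            }
            if (!current.isEmpty()) {
                chunks.add(current);   // last, possibly smaller chunk
            }
        } finally {
            in.close();
        }
        return chunks;
    }
}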
Two alternatives could be considered: a fully distributed data mining system and a client/server distributed data mining system. In a fully distributed data mining system, each participant acts as a peer-to-peer autonomous node. In the client/server distributed data mining system, the data server acts as a fixed coordinator.

IV. Implementation Phase

The hardware requirements consist of two or three PCs running Microsoft Windows; one acts as the server and the others act as clients as well as redundant servers. As for software, the system requires J2SE SDK 1.4.1, Eclipse 2.1, and NetBeans 3.5.1 for implementation, and Java Web Start for execution.

The server design logic covers the following parts. First, the server opens a well-known port and waits for connections from clients, handling each with a thread. The server keeps track of all connection information in a hash table. Once connected with the clients, the server dispatches the designated mining data chunks to them. The server maintains the hash table and periodically multicasts it to all clients/redundant servers. Finally, the server merges the mined results returned by the clients. The server monitors the connection status at all times; if a client fails, the server performs the backup mechanism and orders a backup client to take over the failed client's job.

The client design logic is described below. Once activated, the client enrolls with the server. On receiving the hash table multicast from the server, the client periodically updates its local hash table, along with the mining data sent from the server. On receiving a data chunk from the server, the client performs the data mining and returns the result to the server. The client also monitors the connection status at all times; if the server is not alive, the clients perform the backup mechanism to elect a client to act as the backup server.

When a failure occurs, the system launches the backup mechanism. If a client fails, the server is informed of the broken connection with that client. The server then updates the connection information in the hash table, finds a client without any designated job in the hash table, and dispatches the unfinished job to that client. If the server fails, all clients are informed of the broken connection with the server. Because the clients keep all connection information in the hash table periodically updated from the server, after the server fails they can elect a new server through a common election mechanism. The new server then broadcasts the election result to all clients and enters the server listening state.

The data mining algorithm is an Apriori-like sequential pattern mining algorithm. Web pages in the log are viewed as individual items, and the log is separated into transactions of web page sequences. The algorithm takes minimum support and minimum confidence parameters; frequent patterns are the sequences whose transaction counts are larger than the minimum support, and sequential association rules are derived from them. The left-hand-side items of an association rule are the pages the web client has accessed so far, and the rule's confidence is the probability that the user will access the right-hand-side pages next, so the request service module can prefetch those pages in advance. After a client has mined its data partition, the result is sent back to the coordinator. The coordinator receives the clients' mining results, unions them, and validates them by scanning all the data again. After this step the complete set of frequent patterns is obtained, and the sequential association rules can be derived directly from the frequent patterns. The coordinator presents the association rules on screen as the mining result. The sketches below illustrate these steps one by one, under our own naming assumptions.
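First, a minimal sketch of the server logic described above: a well-known port served by one thread per client, a hash table of connection information, and a second thread that periodically multicasts that table to the group. The port numbers, the multicast address, the refresh interval, and all class names are our assumptions.

import java.io.ObjectOutputStream;
import java.net.DatagramPacket;
import java.net.InetAddress;
import java.net.MulticastSocket;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.Hashtable;

public class CoordinatorServer {

    private static final int LISTEN_PORT = 5000;      // assumed well-known port
    private static final String GROUP = "230.0.0.1";  // assumed multicast group
    private static final int GROUP_PORT = 5001;

    // Connection information: node id -> client host name.
    private final Hashtable connections = new Hashtable();
    private int nextId = 0;

    public static void main(String[] args) throws Exception {
        new CoordinatorServer().serve();
    }

    public void serve() throws Exception {
        startMulticastThread();
        ServerSocket listener = new ServerSocket(LISTEN_PORT);
        while (true) {
            final Socket client = listener.accept();  // block for next client
            final int id = nextId++;
            connections.put(new Integer(id),
                            client.getInetAddress().getHostName());
            new Thread(new Runnable() {
                public void run() {
                    handle(id, client);  // dispatch chunks, collect results
                }
            }).start();
        }
    }

    // Periodically multicast the connection table so that clients and
    // redundant servers share an up-to-date view of the membership.
    private void startMulticastThread() {
        new Thread(new Runnable() {
            public void run() {
                try {
                    MulticastSocket sock = new MulticastSocket();
                    InetAddress group = InetAddress.getByName(GROUP);
                    while (true) {
                        byte[] data = serialize(connections);
                        sock.send(new DatagramPacket(data, data.length,
                                                     group, GROUP_PORT));
                        Thread.sleep(5000);  // assumed refresh interval
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }).start();
    }

    private void handle(int id, Socket client) { /* chunk dispatch omitted */ }

    private byte[] serialize(Hashtable table) throws java.io.IOException {
        java.io.ByteArrayOutputStream bytes = new java.io.ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(bytes);
        out.writeObject(table);
        out.flush();
        return bytes.toByteArray();
    }
}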
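The matching client-side sketch, under the same assumptions: enroll with the server, receive a chunk, mine it, return the result, and treat a broken connection as a server failure that triggers the election mechanism.

import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.net.Socket;
import java.util.Hashtable;
import java.util.List;

public class MiningClient {

    public static void main(String[] args) throws Exception {
        // Enroll with the coordinator on its well-known port (assumed 5000).
        Socket server = new Socket(args[0], 5000);
        ObjectOutputStream out = new ObjectOutputStream(server.getOutputStream());
        ObjectInputStream in = new ObjectInputStream(server.getInputStream());

        try {
            while (true) {
                // Receive the designated data chunk from the server.
                List chunk = (List) in.readObject();
                if (chunk == null) break;  // no work left

                // Mine the chunk and return the local frequent patterns.
                Hashtable patterns = mine(chunk);
                out.writeObject(patterns);
                out.flush();
            }
        } catch (java.io.IOException e) {
            // Broken connection: the server is presumed dead, so start
            // the election mechanism to choose a backup server.
            electBackupServer();
        }
    }

    private static Hashtable mine(List chunk) {
        /* Apriori-like pass, see the SequenceMiner sketch below */
        return new Hashtable();
    }

    private static void electBackupServer() { /* see the election sketch below */ }
}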
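The report does not name its election algorithm. Because every client holds the same multicast hash table, one simple possibility is a deterministic rule, for instance that the lowest surviving node id becomes the new server, so all clients reach the same decision without exchanging further messages. A sketch under that assumption:

import java.util.Enumeration;
import java.util.Hashtable;

public class Election {

    // Every client holds a copy of the connection table the server
    // multicast before it failed: node id -> host name. Applying the
    // same rule to the same table, all clients elect the same node.
    public static int electNewServer(Hashtable connections, int deadServerId) {
        int winner = Integer.MAX_VALUE;
        for (Enumeration e = connections.keys(); e.hasMoreElements();) {
            int id = ((Integer) e.nextElement()).intValue();
            if (id != deadServerId && id < winner) {
                winner = id;  // lowest surviving id wins
            }
        }
        return winner;
    }
}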
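The core of the Apriori-like pass is counting, for each candidate page sequence, the transactions that contain it in order, and keeping the candidates that reach the minimum support; a rule's confidence is then the support of the whole pattern divided by the support of its left-hand side. A sketch with invented names (SequenceMiner, containsInOrder):

import java.util.Hashtable;
import java.util.Iterator;
import java.util.List;

public class SequenceMiner {

    // Counts, for each candidate page sequence, the number of
    // transactions containing it as an in-order subsequence, and keeps
    // the candidates whose count reaches the minimum support.
    public static Hashtable frequent(List transactions, List candidates,
                                     int minSupport) {
        Hashtable counts = new Hashtable();
        for (Iterator c = candidates.iterator(); c.hasNext();) {
            List candidate = (List) c.next();
            int support = 0;
            for (Iterator t = transactions.iterator(); t.hasNext();) {
                if (containsInOrder((List) t.next(), candidate)) {
                    support++;
                }
            }
            if (support >= minSupport) {
                counts.put(candidate, new Integer(support));
            }
        }
        return counts;
    }

    // True when 'pattern' occurs in 'transaction' with its order
    // preserved (other pages may appear in between).
    private static boolean containsInOrder(List transaction, List pattern) {
        int next = 0;
        for (Iterator i = transaction.iterator();
             i.hasNext() && next < pattern.size();) {
            if (i.next().equals(pattern.get(next))) {
                next++;
            }
        }
        return next == pattern.size();
    }

    // Confidence of the rule prefix => suffix, from the support counts.
    public static double confidence(int supportWhole, int supportPrefix) {
        return supportPrefix == 0 ? 0.0 : (double) supportWhole / supportPrefix;
    }
}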
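Finally, the coordinator's merge step: it unions the locally frequent patterns reported by the clients into one candidate set, then validates the candidates with a single scan over the whole log, since a pattern that is frequent in one chunk need not be frequent globally. A sketch reusing the SequenceMiner helper above:

import java.util.ArrayList;
import java.util.Enumeration;
import java.util.Hashtable;
import java.util.Iterator;
import java.util.List;

public class ResultMerger {

    // Unions the pattern tables returned by the clients into one
    // candidate list, then re-counts each candidate against the whole
    // log to obtain the exact global supports.
    public static Hashtable merge(List clientResults, List wholeLog,
                                  int minSupport) {
        List candidates = new ArrayList();
        for (Iterator r = clientResults.iterator(); r.hasNext();) {
            Hashtable local = (Hashtable) r.next();
            for (Enumeration e = local.keys(); e.hasMoreElements();) {
                Object pattern = e.nextElement();
                if (!candidates.contains(pattern)) {
                    candidates.add(pattern);  // union without duplicates
                }
            }
        }
        // One validation scan over all data gives the global supports.
        return SequenceMiner.frequent(wholeLog, candidates, minSupport);
    }
}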
Installation on the server includes the weblog file, the server module, and the client module. For a client, the installation includes the client module and the server module. The role of a node in the mining process may thus switch between server and client.

The test covers three parts: unit tests of the components, a system integration test, and a fault tolerance test. The system is divided into three components: server, client, and user interface. The unit tests cover these three components. Faults could happen in a component or in transmission; both kinds of errors will be handled and tested.

V. Conclusion

This project was planned to build a distributed data mining system in which the components at networked computers work collaboratively to mine meaningful information from a weblog. The implementation began in mid-November 2003, and the system integration test was planned to run from mid-December 2003 through mid-January 2004. A fully distributed data mining system in which each participant acts as a peer-to-peer autonomous node might be considered in further research.