Distributed Information Systems - Project Proposal

Distributed Information Systems - Project Final Report
Distributed Data Mining Systems in Java
D91725001 王春笙、D92725002 林俊甫、D92725001 王慧芬
January, 2004
Installation Guide
Project requirements

Must involve multithread programming.

Available middleware: CORBA, DCOM, or Java RMI.
Must have components spread over different machines. Each team member
must independently be in charge of some of the components, and be fully
responsible for their functionality.
Must have a clear interface definition between components.
Must handle failures. Failures to be considered: link broken, stop failures.
System integration and Collaboration: 10 %
Degree of fault tolerance (and scalability, if applies): 10 %
Completeness: 10 %
I. Introduction
Motivation
A weblog is a comprehensive web access log. It keeps track of activity on the
site by month, week, day and hour. A weblog contains much meaningful
information, such as total hits, bytes transferred, page views, and the most
popular pages. Data mining is a set of search techniques used to extract useful
information hidden within a large set of data, such as a weblog. However, it is
time-consuming to perform multi-layer data mining on a large data file such as a
weblog. The sharing of resources, such as sharing CPUs for computation and
search, is a main motivation for constructing distributed systems.
Goal
The goal of this project is to build a distributed data mining system in which
components located at networked computers work collaboratively to mine
meaningful information from a weblog. We expect the performance of weblog
mining to increase through the joint computing resources. In addition, with
fault-tolerance techniques, we believe the mining process will continue despite
the crash of any component in the distributed system.
II. System Architecture
Figure 1 depicts the system architecture.
Prediction engine
A prediction engine deduces future references, on the basis of which predictive
prefetching can be implemented. After processing the past references, the
prediction engine derives the probability of future access for the documents
accessed so far. The prediction engine resides on the server side. The mining of
frequent web patterns takes place mainly in the prediction engine.
Web Server
The server stores web pages and records the web access log.
Request service module
Usually a proxy server that uses mining results requested from the prediction
engine. According to the most recent access patterns, this module prefetches the
pages that the prediction engine predicts.
Web Client
The web client is a web browser.
Figure 1: System architecture (components shown: prediction engine with distributed mining nodes, web server, log files, request service module)
III. Technological Infrastructure
Figure 2: System diagram (components shown: server/coordinator and clients connected over a LAN, with mining data chunks sent between them)
The system consists of a server/coordinator and clients. A complete weblog is
kept on the server. The server divides the weblog into data chunks and
dispatches the chunks to the connected clients for mining.
The system design follows both transparency and scalability criteria. The design
uses tools provided by Java: multithreading, RMI, multicast, and sockets. In
addition, the system provides redundancy to cope with server or client failure.
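The division of the weblog into chunks could look roughly like the sketch below. It is only an illustration under assumptions: the class name `WeblogChunker`, the fixed `chunkSize`, and the representation of the log as a list of lines are not taken from the project, and the code uses present-day Java collections rather than J2SE 1.4.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not from the project): splits weblog lines into
// fixed-size chunks that a coordinator could dispatch to clients.
class WeblogChunker {

    /** Returns consecutive chunks of at most chunkSize lines each. */
    static List<List<String>> split(List<String> logLines, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<List<String>> chunks = new ArrayList<>();
        for (int start = 0; start < logLines.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, logLines.size());
            // Copy the slice so each chunk can be sent on its own.
            chunks.add(new ArrayList<>(logLines.subList(start, end)));
        }
        return chunks;
    }
}
```

A real splitter would probably cut on transaction (user session) boundaries rather than at arbitrary line counts, so that no page sequence is divided between two clients.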
The rationale of the system covers three aspects. First, data mining is a
computing-intensive task. Second, weblog data may be generated too quickly for
a single computer to handle. Third, implementing a distributed prediction
engine has a fault-tolerance advantage.
Two alternatives could be considered: a fully distributed data mining system and
a client/server distributed data mining system. In a fully distributed data
mining system, each participant acts as an autonomous peer-to-peer node. In the
client/server distributed data mining system, the data server acts as a fixed
coordinator.
IV. Implementation Phase
The hardware requirements are 2 or 3 PCs running Microsoft Windows; one acts as
the server and the others act as clients as well as redundant servers. For
software, the system requires J2SE SDK 1.4.1, Eclipse 2.1, and NetBeans 3.5.1
for implementation, and Java Web Start for execution.
The server design logic covers the following parts. First, the server opens a
well-known port and waits for connections from clients, handling them with
threads. The server keeps track of all connection information in a hash table.
Once clients are connected, the server dispatches the designated mining data
chunks to them. The server maintains the hash table and periodically multicasts
it to all clients/redundant servers. Finally, the server merges the mined
results returned by the clients. The server monitors connection status at all
times. If a client fails, the server performs the backup mechanism and orders a
backup client to take over the failed client's job.
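The server-side bookkeeping described above might be sketched as follows. This is a minimal in-memory sketch, not the project's code: the class name `ConnectionTable`, the `IDLE` marker, and the use of integer chunk indices are assumptions, and socket handling and multicast are left out.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Hashtable;
import java.util.Map;

// Hypothetical coordinator-side bookkeeping; all names are illustrative.
class ConnectionTable {
    static final int IDLE = -1; // marks a client with no assigned chunk

    // clientId -> index of the chunk the client is mining, or IDLE.
    private final Map<String, Integer> assignments = new Hashtable<>();
    private final Deque<Integer> pendingChunks = new ArrayDeque<>();

    ConnectionTable(int chunkCount) {
        for (int i = 0; i < chunkCount; i++) {
            pendingChunks.add(i);
        }
    }

    /** Called by the per-connection thread when a client enrolls. */
    synchronized void register(String clientId) {
        assignments.put(clientId, IDLE);
    }

    /** Hands the next pending chunk to the client; returns IDLE if none is left. */
    synchronized int dispatch(String clientId) {
        if (!assignments.containsKey(clientId)) {
            throw new IllegalStateException("unknown client: " + clientId);
        }
        Integer next = pendingChunks.poll();
        int chunk = (next == null) ? IDLE : next;
        assignments.put(clientId, chunk);
        return chunk;
    }

    /** Copy of the table, as the server might periodically multicast it. */
    synchronized Map<String, Integer> snapshot() {
        return new Hashtable<>(assignments);
    }
}
```

The methods are synchronized because several connection threads would touch the table at once; `Hashtable` alone only protects single calls, not the check-then-update sequence in `dispatch`.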
The client design logic is described below. Once activated, the client enrolls
with the server. Whenever it receives the hash table that the server
periodically broadcasts, the client updates its local hash table and the mining
data sent from the server. On receiving a data chunk from the server, the client
performs the data mining and returns the result to the server. The client also
monitors connection status at all times. If the server is not alive, the clients
perform the backup mechanism to elect one client to act as backup server.
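The report does not say how a client decides that the server is no longer alive. One common approach, shown here purely as an assumption, is to treat the periodic hash table multicast as a heartbeat and to declare the server failed when no update has arrived within a timeout. The class `ServerMonitor` and its parameters are hypothetical.

```java
// Hypothetical liveness check (an assumption, not the project's mechanism):
// the server counts as failed when no hash table update has arrived within
// timeoutMillis. Timestamps are passed in so the logic is easy to test.
class ServerMonitor {
    private final long timeoutMillis;
    private long lastUpdateMillis;

    ServerMonitor(long timeoutMillis, long nowMillis) {
        this.timeoutMillis = timeoutMillis;
        this.lastUpdateMillis = nowMillis;
    }

    /** Called whenever a hash table update from the server is received. */
    synchronized void onTableReceived(long nowMillis) {
        lastUpdateMillis = nowMillis;
    }

    /** True while the last update is no older than the timeout. */
    synchronized boolean serverAlive(long nowMillis) {
        return nowMillis - lastUpdateMillis <= timeoutMillis;
    }
}
```

In a running client the timestamps would come from `System.currentTimeMillis()`, and a background thread would call `serverAlive` at intervals before starting the election.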
When a failure occurs, the system launches the backup mechanism. If a client
fails, the server is informed of the connection failure with that client. The
server then modifies the connection information in the hash table, finds a
client without any designated job in the hash table, and dispatches the
unfinished job to that client. If the server fails, all clients are informed of
the connection failure with the server. Clients keep all connection information
in a hash table that is periodically updated from the server. After the server
has failed, all clients elect a new server through the same election mechanism.
The new server then broadcasts the result to all clients and enters the server
listening state.
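The two recovery paths above can be illustrated with a small sketch. The reassignment step follows the description directly; the election rule does not, since the report does not specify it. The rule used here (the smallest client id wins) is an assumption, chosen because every client holds the same table and can therefore reach the same answer independently. The class `Failover` and the `IDLE` marker are hypothetical names.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.Deque;
import java.util.Map;

// Hypothetical recovery helpers; names and the election rule are assumptions.
class Failover {
    static final int IDLE = -1; // marks a client with no assigned job

    /**
     * Removes the failed client from the table and gives its unfinished job
     * to an idle client. Returns the id of the client that took over, or null
     * if the failed client had no job or no idle client exists (in the latter
     * case the job is put back at the head of the pending queue).
     */
    static String reassign(Map<String, Integer> table, String failedClient,
                           Deque<Integer> pending) {
        Integer job = table.remove(failedClient);
        if (job == null || job == IDLE) {
            return null; // nothing to hand over
        }
        for (Map.Entry<String, Integer> e : table.entrySet()) {
            if (e.getValue() == IDLE) {
                e.setValue(job);
                return e.getKey();
            }
        }
        pending.addFirst(job);
        return null;
    }

    /** Assumed election rule: the lexicographically smallest live id wins. */
    static String electServer(Collection<String> liveClients) {
        return Collections.min(liveClients);
    }
}
```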
The data mining algorithm is an Apriori-like sequential pattern mining
algorithm. Web log pages can be viewed as individual items, and the log is
separated into transactions of web page sequences. The sequential pattern mining
algorithm takes minimum support and minimum confidence parameters. Sequential
association rules can be derived from frequent patterns, that is, patterns whose
transaction counts are larger than the minimum support. The left-hand items of
an association rule are the pages the web client has accessed so far. The
confidence of a rule is the probability that this user will access the
right-hand pages, so the request service module can prefetch in advance the
pages of rules that meet the minimum confidence. After a client mines its data
partition, the result is sent back to the coordinator. The coordinator receives
the clients' mining results, takes their union, and validates the results by
scanning all data again. After this, the total frequent patterns are obtained.
Sequential association rules can then be derived directly from the frequent
patterns. The coordinator presents the association rules as the mining result on
screen.
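The partition-then-validate flow can be made concrete with a simplified sketch. It is not the project's algorithm: brute-force enumeration of contiguous page runs up to `maxLen` stands in for level-wise Apriori candidate generation, support is expressed as a fraction of transactions, and the class `SequenceMiner` and all method names are hypothetical. The rescan by the coordinator is needed because a pattern that is frequent in one partition may not be frequent over the whole log.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Simplified, hypothetical sketch: patterns are contiguous page runs.
class SequenceMiner {

    /** Counts, for each pattern, the transactions that contain it. */
    static Map<List<String>, Integer> countSupport(List<List<String>> transactions, int maxLen) {
        Map<List<String>, Integer> counts = new HashMap<>();
        for (List<String> t : transactions) {
            Set<List<String>> seen = new HashSet<>(); // once per transaction
            for (int i = 0; i < t.size(); i++) {
                for (int len = 1; len <= maxLen && i + len <= t.size(); len++) {
                    seen.add(new ArrayList<>(t.subList(i, i + len)));
                }
            }
            for (List<String> p : seen) {
                counts.merge(p, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Client side: patterns whose local support fraction reaches minSupport. */
    static Set<List<String>> mineLocal(List<List<String>> partition, int maxLen, double minSupport) {
        Set<List<String>> frequent = new HashSet<>();
        for (Map.Entry<List<String>, Integer> e : countSupport(partition, maxLen).entrySet()) {
            if (e.getValue() >= minSupport * partition.size()) {
                frequent.add(e.getKey());
            }
        }
        return frequent;
    }

    /** Coordinator side: union the local results, then rescan all data. */
    static Map<List<String>, Integer> validate(List<Set<List<String>>> localResults,
                                               List<List<String>> allData,
                                               int maxLen, double minSupport) {
        Set<List<String>> candidates = new HashSet<>();
        for (Set<List<String>> r : localResults) {
            candidates.addAll(r);
        }
        Map<List<String>, Integer> global = countSupport(allData, maxLen);
        Map<List<String>, Integer> frequent = new HashMap<>();
        for (List<String> c : candidates) {
            int n = global.getOrDefault(c, 0);
            if (n >= minSupport * allData.size()) {
                frequent.put(c, n);
            }
        }
        return frequent;
    }

    /**
     * Confidence of the rule (all pages but the last) -> (last page), i.e.
     * support(pattern) / support(prefix). The pattern needs at least two
     * pages, and both pattern and prefix must be present in the map.
     */
    static double confidence(Map<List<String>, Integer> frequent, List<String> pattern) {
        List<String> prefix = pattern.subList(0, pattern.size() - 1);
        return (double) frequent.get(pattern) / frequent.get(prefix);
    }
}
```

With this shape, each client would call `mineLocal` on its chunk, and the coordinator would call `validate` on the collected results before deriving rules whose `confidence` meets the minimum confidence.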
Installation for the server includes the weblog file, the server module, and the
client module. For a client, the installation includes the client module and the
server module. The role of a node in the mining process may switch between
server and client.
The test covers three parts: component unit tests, a system integration test,
and a fault tolerance test. The system is divided into three components: server,
client, and user interface. The unit tests cover these three components. Faults
could occur in a component or in transmission. Both kinds of error will be
handled and tested.
V. Conclusion
This project is planned to build a distributed data mining system in which
components at networked computers work collaboratively to mine meaningful
information from a weblog. The implementation begins in mid-November 2003. The
system integration test is planned to run from mid-December 2003 through
mid-January 2004. A fully distributed data mining system, in which each
participant acts as an autonomous peer-to-peer node, might be considered in
further research.
VI. Reference
A. Nanopoulos, D. Katsaros, and Y. Manolopoulos, "A Data Mining Algorithm for
Generalized Web Prefetching," IEEE Trans. Knowledge and Data Eng., vol. 15,
no. 5, pp. 1155-1169, Sept./Oct. 2003.