DataMIME: Component Based Data mining System Architecture

Multi-Layered Framework for Distributed Data Mining
Masum Serazi, Amal Perera, Qiang Ding,
Vasiliy Malakhov, William Perrizo
North Dakota State University
Computer Science Department
Fargo, ND 58105, USA
{Md.Serazi, Amal.Perera, Qiang.Ding,
Vasiliy.Malakhov, William.Perrizo}@ndsu.nodak.edu
Tel: 1.701.231.6404
Topic Areas: Distributed Intelligent Systems, Software Architecture
Abstract
There is an increase in the demand for data mining applications on the web. With the
increase in the size of data sets there is also a demand for scalable generic solutions.
Scalability and generic data mining models can be provided with the use of distributed
computing. In this paper we propose a multi-layered framework for a
distributed data mining system. A multi-layered architecture can take advantage of the
latest technological advances in hardware to provide efficient solutions and also allow the
easy addition of new data mining and data capture components to the basic system. A
multi-layered architecture also facilitates an iterative development process. In this paper
we show the use of a high-end data mining engine with generic data mining models on
the server side and a client that can capture the client requirements over the web.
1. Introduction and Background:
In an increasing number of disciplines, large data collections are emerging as important
resources. In domains as diverse as global climate change, high energy physics, and
computational genomics, the volume of interesting data is already measured in terabytes
and will soon reach petabytes. The communities of users that need to access and mine
this data (often using sophisticated and computationally expensive techniques) are often
large and are almost always geographically distributed. The computing and storage
resources that these communities rely upon to store this data are also diverse and distributed
[MBM99]. This combination of large dataset size, diversity of data, geographic
distribution of users and resources, and computationally intensive result generation creates
demands that are not satisfied by existing stand-alone data mining approaches.
There is an emerging trend towards delivering data mining services over the web
[SN00]. The focus is on providing data mining models as services over the internet. The
important issues raised in this context are standards for describing generic data mining
models, integrating models, personalizing models, capture and integration of data,
exchange of messages, description of task requests, estimation of compute time, efficient
computing on massive data sets, and data transfer versus process transfer [SN00,
MBM99, CDG+98, GKM+99, RWL+00, KPH99, and CGG+02].
There have been several attempts at large-scale distributed data mining. The Kensington
project [CDG+98] is for mining enterprise data distributed across the internet. The
Papyrus project [GKM+99] is a distributed data mining system developed for clusters and
super clusters of workstations. It is composed of four software layers: data management,
data mining, predictive modeling, and agents. Papyrus is based on mobile agents
implemented using Java. Another distributed data mining suite based on Java is
PaDDMAS [RWL+00], a component-based tool set that integrates pre-developed or
custom packages (that can be sequential or parallel) using a dataflow approach. JAM
[SPT97] is an agent-based distributed data mining system that has been developed to
mine data stored in different sites for building so called meta-models as a combination of
several models learned at the different sites where data are stored. JAM uses Java applets
to move data mining agents to remote sites. BODHI is a project [KPH99] for doing
collective data mining with stress on learning from vertically partitioned data. Discovery
Net [CGG+02] provides an architecture for building and managing KDD processes on a
Grid. Most of the projects are implemented as prototypes. In this paper we attempt to
address some of the major issues related to distributed data mining. The solutions
suggested were implemented as part of DataMIME™.
From a user's point of view, the basic requirement for providing data mining models as
services over the internet is the ability to efficiently use generic, customizable data mining
tools on a wide array of data sets over the internet. Three types of architectural
models are described in the literature for distributed data mining: client-server,
agent-based, and hybrid. Each approach has its own advantages and disadvantages. In
this paper we describe a client-server model. A client-server model is characterized by
the presence of one or more data mining servers. Data and data mining requests are fed
by the client from different locations and are brought to the server for execution. Once
the execution is done the results are presented to the client. There are several reasons to
choose a client-server model in this approach. One major reason is the ability to use high
performance computing on the server side to do the data mining. In this paper we
describe the use of Ptrees¹, a distributed, compressed, vertical, data-mining-ready data
structure. Ptrees have been shown to be scalable on a wide array of data mining
applications [DDP02, KDP02, PSP02, PDD+03, and PWR+03]. The use of Ptrees on the
server side implies converting the data to a uniform data structure. Having a uniform data
structure promotes optimization and ease of developing generic data mining components.
With massive data sets and computationally demanding algorithms, data mining demands
efficient computing. With the rapid increase in hardware performance capabilities, having
a generic mining engine that can efficiently serve multiple mining demands is useful.
Most of the current data mining applications are developed for a particular data structure
and optimized for a particular platform. It is an advantage to be able to develop
applications independent of the underlying data structure and expect them to execute in an
efficient manner. The data mining applications and the data structure are separated by an
application programming interface (API). This allows the Ptree
data structure to be optimized with respect to its functionalities without the dependency
on the data mining algorithms. The API also allows the independent development of data
mining algorithms. To mine a diverse collection of data
sets, the system must be able to capture any type of data. There are two options
for the user. The user can change the data to a format that can be captured by the system
or the user can write a data feeder by implementing the required interface for the new
data file type.
In many situations data mining requires “Grayware”: human input is required to tune the
data and the algorithms to suit the need. The client architecture will allow the user to
specify the mining request with a wide array of flexible parameters. The generic modules
on the server side will be fired up according to the user request. Client-server
communication will facilitate the transfer of information.
¹ Patents are pending on the P-tree technology.
This work is partially supported by GSA Grant ACT#: K96130308.
In the above discussion we can clearly see the importance of having distinct layers in
providing the basic requirements for a client-server based distributed data mining system.
The layers identified are the uniform data structure, the data capture and integration layer, the data
mining interface, the data mining algorithms, client-server communication, and the client
interface. In the next section we describe the architecture of these layers in detail and
how they are organized in the system. In section 3 we describe the characteristics of the
prototype system that was developed to validate the proposed layered architecture.
2. System Architecture
In this section we explain the proposed layered architecture with the use of an example
system, DataMIME™, which was developed as a proof of concept. DataMIME™ is an
efficient and scalable data mining system providing the flexibility of plugging in new
data mining applications when needed. Clients can interact with the DataMIME™ system
to capture their data and convert it into the Ptree format after which they can apply
different data mining applications. The actual data converter along with all the data
mining applications execute on the server side. Figure 1 depicts a conceptual view of the
system.
[Figure 1 diagram. Client side: capture dataset to DataMIME™, integrate data to DataMIME™, mine on DataMIME™, system performance analysis. Connected over the Internet to the server side: a master server and slave servers.]
Figure 1: The Conceptual Design level view of the System
2.1 Server Architecture:
The multi-threaded, concurrent, and distributed DataMIME™ server has a layered architecture:
DCI/DII, DMI, DMA, DPMI, and Ptree Data. Figure 2 describes the organization of the
layers.
[Figure 2 diagram. Layers shown: DMA layer (already plugged-in algorithms and plugs for new algorithms), DCI/DII layer (with room for a new feeder), DMI layer, DPMI: Distributed Ptree Management Interface, and distributed Ptree data.]
Figure 2: Server Side Architecture of the system
Ptree (Data Structure):
In this system, we use the Ptree as the internal representation of the data. Ptrees are tree-like
data structures that store relational data in a column-wise, bit-compressed format by
splitting each attribute into bits (i.e. representing each attribute value by its binary
equivalent), grouping together all bits in each bit position for all tuples, and representing
each bit group by a Ptree. Ptrees are lossless and are structured to facilitate efficient
data mining processes. Once we have represented our data using Ptrees, no scans of the
database are needed to perform the data mining task at hand. Data querying can be
achieved through logical operations such as AND, OR, and NOT, referred to as the Ptree
algebra in the literature. Various aspects of Ptrees, including representation, querying,
algebra, speed, and compression, have been discussed in greater detail in [DKR+02].
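The vertical decomposition described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are hypothetical, the halving tree is a simplified stand-in for the compressed structure defined in [DKR+02], and the sample column is invented for the example.

```python
from typing import Dict, List


def to_bit_slices(values: List[int], width: int) -> List[List[int]]:
    """Split one attribute column into `width` vertical bit slices.

    slices[0] holds the most significant bit of every tuple and
    slices[width - 1] the least significant bit.
    """
    return [[(v >> (width - 1 - pos)) & 1 for v in values] for pos in range(width)]


def from_bit_slices(slices: List[List[int]]) -> List[int]:
    """Rebuild the original column, showing that the split is lossless."""
    values = []
    for row_bits in zip(*slices):
        v = 0
        for bit in row_bits:
            v = (v << 1) | bit
        values.append(v)
    return values


def build_tree(bits: List[int]) -> Dict:
    """Compress one bit slice into a simple halving tree (assumed layout).

    A run that is all 0s or all 1s becomes a single leaf; a mixed run is
    split in half. Every node stores its count of 1 bits.
    """
    ones = sum(bits)
    node = {"count": ones, "size": len(bits), "children": None}
    if 0 < ones < len(bits):
        mid = len(bits) // 2
        node["children"] = [build_tree(bits[:mid]), build_tree(bits[mid:])]
    return node


column = [5, 7, 4, 4, 0, 1, 1, 1]      # one 3-bit attribute, eight tuples
slices = to_bit_slices(column, width=3)
trees = [build_tree(s) for s in slices]
```

In this sample the most significant slice is four 1s followed by four 0s, so its tree needs only a root and two pure leaves, which is where the compression comes from.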
DCI/DII (Data Capture and Integration Interface):
This layer allows the user to capture data and integrate it into the required format (Ptree
format). The main component of this layer is the feeder. An individual feeder can process
a particular format of incoming data. If there is no feeder that can process a particular
format, the user can write a new feeder and plug it into this architecture.
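A feeder plug-in mechanism of this kind might look like the following sketch. The paper does not give the actual feeder interface, so the class names, method names, and format labels here are all hypothetical.

```python
import csv
import io
from abc import ABC, abstractmethod
from typing import Iterator, List


class Feeder(ABC):
    """Hypothetical feeder interface: one feeder per incoming data format."""

    @abstractmethod
    def can_handle(self, format_name: str) -> bool:
        """Return True if this feeder understands the named format."""

    @abstractmethod
    def rows(self, raw: bytes) -> Iterator[List[int]]:
        """Yield one list of integer attribute values per record."""


class CsvFeeder(Feeder):
    def can_handle(self, format_name: str) -> bool:
        return format_name == "csv"

    def rows(self, raw: bytes) -> Iterator[List[int]]:
        for record in csv.reader(io.StringIO(raw.decode("utf-8"))):
            if record:  # skip blank lines
                yield [int(field) for field in record]


class TsvFeeder(Feeder):
    """Example of a user-written feeder for a format not yet supported."""

    def can_handle(self, format_name: str) -> bool:
        return format_name == "tsv"

    def rows(self, raw: bytes) -> Iterator[List[int]]:
        for line in raw.decode("utf-8").splitlines():
            if line:
                yield [int(field) for field in line.split("\t")]


class FeederRegistry:
    """Picks the first plugged-in feeder that accepts the format."""

    def __init__(self) -> None:
        self._feeders: List[Feeder] = []

    def plug_in(self, feeder: Feeder) -> None:
        self._feeders.append(feeder)

    def capture(self, format_name: str, raw: bytes) -> List[List[int]]:
        for feeder in self._feeders:
            if feeder.can_handle(format_name):
                return list(feeder.rows(raw))
        raise LookupError(f"no feeder for format {format_name!r}")


registry = FeederRegistry()
registry.plug_in(CsvFeeder())
captured = registry.capture("csv", b"1,2\n3,4\n")
```

Adding support for a new format then amounts to one more `plug_in` call, for example `registry.plug_in(TsvFeeder())`, without touching the other layers.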
DMI (Data Mining Interface):
The DMI provides counting, the most important data mining operation supported by P-trees,
for basic P-trees, value P-trees, tuple P-trees, interval P-trees, and cube P-trees.
The DMI also provides the P-tree algebra, which has four operations, AND, OR, NOT
(complement), and XOR, to implement the pointwise logical operations on P-trees for
the Data Mining Algorithms (DMA) layer.
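To make the counting and algebra operations concrete, here is a minimal sketch that uses uncompressed bit vectors as a stand-in for P-trees. The `BitVector` class and `value_vector` function are assumptions made for illustration and are not the DMI's actual API; a real P-tree would perform the same logical operations on its compressed form.

```python
from typing import List


class BitVector:
    """Uncompressed stand-in for a P-tree: one bit per tuple."""

    def __init__(self, bits: int, size: int) -> None:
        self.size = size
        self.bits = bits & ((1 << size) - 1)  # keep only `size` bits

    @classmethod
    def from_list(cls, flags: List[int]) -> "BitVector":
        bits = 0
        for i, flag in enumerate(flags):
            if flag:
                bits |= 1 << i
        return cls(bits, len(flags))

    def __and__(self, other: "BitVector") -> "BitVector":
        return BitVector(self.bits & other.bits, self.size)

    def __or__(self, other: "BitVector") -> "BitVector":
        return BitVector(self.bits | other.bits, self.size)

    def __xor__(self, other: "BitVector") -> "BitVector":
        return BitVector(self.bits ^ other.bits, self.size)

    def __invert__(self) -> "BitVector":
        # The constructor masks the result back to `size` bits.
        return BitVector(~self.bits, self.size)

    def count(self) -> int:
        """Number of tuples whose bit is 1 (the root count)."""
        return bin(self.bits).count("1")


def value_vector(basic: List[BitVector], value: int) -> BitVector:
    """AND each basic slice (or its complement) to select tuples equal to `value`.

    basic[0] is the most significant bit position.
    """
    width = len(basic)
    size = basic[0].size
    result = BitVector((1 << size) - 1, size)  # start with all tuples selected
    for pos, bit_slice in enumerate(basic):
        wanted = (value >> (width - 1 - pos)) & 1
        result = result & (bit_slice if wanted else ~bit_slice)
    return result


# Basic slices of the 3-bit column [5, 7, 4, 4, 0, 1, 1, 1], MSB first.
basic = [
    BitVector.from_list([1, 1, 1, 1, 0, 0, 0, 0]),
    BitVector.from_list([0, 1, 0, 0, 0, 0, 0, 0]),
    BitVector.from_list([1, 1, 0, 0, 0, 1, 1, 1]),
]
fours = value_vector(basic, 4)
ones = value_vector(basic, 1)
```

Here `fours.count()` answers "how many tuples have the value 4" without scanning the original records, and `(fours | ones).count()` answers the same question for a set of values.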
Distributed P-tree Management Interface (DPMI):
The DPMI layer provides access, location, and concurrency transparency: it hides the
facts that data representations and resource access protocols may differ, that resources
may be located in different places, and that resources may be shared by several competing
users.
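One way to picture location and concurrency transparency is a manager that callers address only by Ptree identifier. The sketch below is an assumption-laden illustration: `StorageNode`, `PtreeManager`, the round-robin placement, and the in-memory "nodes" are all hypothetical and stand in for real remote servers.

```python
import threading
from typing import Dict, List


class StorageNode:
    """In-memory stand-in for one server that holds Ptree data."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._slices: Dict[str, List[int]] = {}

    def put(self, key: str, bits: List[int]) -> None:
        self._slices[key] = list(bits)

    def get(self, key: str) -> List[int]:
        return list(self._slices[key])

    def keys(self) -> List[str]:
        return sorted(self._slices)


class PtreeManager:
    """Callers ask for data by identifier and never name a node."""

    def __init__(self, nodes: List[StorageNode]) -> None:
        self._nodes = nodes
        self._location: Dict[str, StorageNode] = {}
        self._lock = threading.Lock()  # serializes catalog updates

    def store(self, key: str, bits: List[int]) -> None:
        with self._lock:
            node = self._location.get(key)
            if node is None:
                # Assumed placement policy: simple round robin over nodes.
                node = self._nodes[len(self._location) % len(self._nodes)]
                self._location[key] = node
            node.put(key, bits)

    def fetch(self, key: str) -> List[int]:
        with self._lock:
            node = self._location[key]
        return node.get(key)


node_a, node_b = StorageNode("a"), StorageNode("b")
manager = PtreeManager([node_a, node_b])
manager.store("age.bit0", [1, 0, 1])
manager.store("age.bit1", [0, 0, 1])
```

The two slices end up on different nodes, yet `manager.fetch("age.bit1")` looks the same to the caller as a local read.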
DMA (Data Mining Algorithm) Layer:
This layer is a collection of data mining tools
(algorithms). Upon receiving a request from the client side, an algorithm will be fired for
mining. This layer depends on the DMI for accessing meta-information and required counts. Ptree-based
K Nearest Neighbor (PKNN) [KDP02], Podium Incremental Neighbor Evaluator
(PINE) [PDD+03], and P-BAYESIAN [PSP02] are available as built-in algorithms in
the current DataMIME™ system. The architecture has the flexibility to plug in a new
algorithm on this layer.
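The plug-in idea for this layer can be sketched as a name-to-function registry. Everything here is hypothetical: the `plug_in` decorator, the `fire` function, and the two toy algorithms are illustrations only and are not PKNN, PINE, or P-BAYESIAN. For brevity the toy algorithms read raw tuples, whereas the real layer would obtain counts through the DMI.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

Training = List[Tuple[int, str]]                 # (attribute value, class label)
Algorithm = Callable[[Training, Dict[str, str]], str]

_ALGORITHMS: Dict[str, Algorithm] = {}


def plug_in(name: str) -> Callable[[Algorithm], Algorithm]:
    """Register a function as a mining algorithm under `name`."""
    def decorator(func: Algorithm) -> Algorithm:
        _ALGORITHMS[name] = func
        return func
    return decorator


@plug_in("majority")
def majority_class(training: Training, params: Dict[str, str]) -> str:
    """Toy classifier: always predicts the most frequent label."""
    labels = Counter(label for _, label in training)
    return labels.most_common(1)[0][0]


@plug_in("nearest")
def nearest_neighbor(training: Training, params: Dict[str, str]) -> str:
    """Toy 1-nearest-neighbor on a single integer attribute."""
    target = int(params["value"])
    return min(training, key=lambda pair: abs(pair[0] - target))[1]


def fire(name: str, training: Training, params: Dict[str, str]) -> str:
    """Look up the requested algorithm and run it."""
    try:
        algorithm = _ALGORITHMS[name]
    except KeyError:
        raise LookupError(f"no algorithm plugged in under {name!r}") from None
    return algorithm(training, params)


training = [(1, "low"), (2, "low"), (9, "high")]
majority_result = fire("majority", training, {})
nearest_result = fire("nearest", training, {"value": "8"})
```

A new algorithm becomes available to clients as soon as it is decorated with `@plug_in`, which mirrors the "plugs for new algorithms" idea without changing the dispatch code.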
2.2 Communication
In the DCI and DMA protocols, a client creates a connection, sends a request, receives an
answer, and closes the connection. A client sends only one request per single-threaded
connection. In the DCI protocol, a request contains a text-based header, which may be
followed by a set of binary files with a checksum for each file. The header contains a
command to the server, the number of files, and, if the request contains files, information about
each file (name and length). The response to a request is a single line with a message
indicating the outcome of the request.
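The paper does not specify the wire format, so the following is only a sketch of how such a request could be framed. The newline-delimited header layout, the use of CRC-32, the 4-byte big-endian checksum placed after each file, and the function names are all assumptions.

```python
import zlib
from typing import Dict, List, Tuple


def encode_request(command: str, files: List[Tuple[str, bytes]]) -> bytes:
    """Build a request: text header, then each file followed by its checksum."""
    header = [command, str(len(files))]
    header += [f"{name} {len(data)}" for name, data in files]
    out = ("\n".join(header) + "\n").encode("ascii")
    for _, data in files:
        out += data + zlib.crc32(data).to_bytes(4, "big")
    return out


def decode_request(raw: bytes) -> Tuple[str, Dict[str, bytes]]:
    """Parse a request and verify every file against its checksum."""
    def read_line(pos: int) -> Tuple[str, int]:
        end = raw.index(b"\n", pos)
        return raw[pos:end].decode("ascii"), end + 1

    command, pos = read_line(0)
    count_text, pos = read_line(pos)
    entries = []
    for _ in range(int(count_text)):
        line, pos = read_line(pos)
        name, length = line.rsplit(" ", 1)
        entries.append((name, int(length)))

    files: Dict[str, bytes] = {}
    for name, length in entries:
        data = raw[pos:pos + length]
        checksum = int.from_bytes(raw[pos + length:pos + length + 4], "big")
        if len(data) != length or zlib.crc32(data) != checksum:
            raise ValueError(f"checksum mismatch for {name}")
        files[name] = data
        pos += length + 4
    return command, files


request = encode_request("capture", [("a.bin", b"abc"), ("b.bin", b"\x00\x01\n\xff")])
command, files = decode_request(request)
```

Because the header carries each file's length, the binary payload may contain any byte values, including newlines, without confusing the parser.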
A DMA protocol request has a similar structure: a header and an optional set of binary
files with checksums. The header in the DMA protocol is a set of key/value pairs
(properties), similar to a Java properties file, followed by a terminator. The response to a
DMA protocol request also contains key/value pairs. Each request contains the property
'cmd' with the name of a command. Other properties may represent arguments of the
requested command. Depending on the command name and its parameters, the server
calls different data mining algorithms to handle the request.
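A key/value header of this kind could be encoded and parsed as sketched below. Only the 'cmd' property comes from the text; the `key=value` line syntax, the empty-line terminator, the function names, and the sample property names are assumptions.

```python
from typing import Dict

TERMINATOR = ""  # assumed: an empty line ends the header


def encode_properties(props: Dict[str, str]) -> str:
    """Serialize properties as key=value lines followed by the terminator."""
    lines = [f"{key}={value}" for key, value in props.items()]
    return "\n".join(lines + [TERMINATOR]) + "\n"


def decode_properties(text: str) -> Dict[str, str]:
    """Read key=value lines until the terminator line is reached."""
    props: Dict[str, str] = {}
    for line in text.split("\n"):
        if line == TERMINATOR:
            break
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def parse_request(text: str) -> Dict[str, str]:
    """Decode a request header and insist on the mandatory 'cmd' property."""
    props = decode_properties(text)
    if "cmd" not in props:
        raise ValueError("request header has no 'cmd' property")
    return props


wire = encode_properties({"cmd": "classify", "dataset": "iris", "k": "5"})
request_props = parse_request(wire)
response_wire = encode_properties({"status": "ok", "label": "setosa"})
response_props = decode_properties(response_wire)
```

The same two functions serve both directions, since the response is also a set of key/value pairs; only requests are checked for 'cmd'.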
2.3 Client Structure:
On the client side, DataMIME™ has a graphical user interface (GUI) to visually interact
with a user. The two main functionalities are:
• Capturing: sending a set of data along with its meta-information to the DII/DCI
layer of the server.
• Mining: sending a request to the DMA layer for a data mining technique on a
dataset that has already been captured, and displaying the result.
[Figure 3 diagram. Labeled components: data, meta-data, meta-data generator, unclassified data, prediction model, client-side DCI, client-side DMA, and visualization tool.]
Figure 3: Client side architecture of the system: (a) Capturing (b) Mining
2.4 GUI for client:
A client side Graphical User Interface has been developed and implemented to facilitate
user interaction with the system. The following figures depict a capture instance and a
classification instance of the system [DM03].
Figure 4: Client side Graphical User Interface: (a) Capturing (b) Mining
3. System Characteristics
Initially we raised certain issues related to providing scalable data mining
services on the web. In the previous section we described a layered architecture that can
address most of these issues. In this section we describe the characteristics of DataMIME™,
a prototype system implemented as a proof of concept. To increase usability, we have
designed and implemented the system with an emphasis on extensibility and
flexibility. We have developed a wide variety of functions and algorithms. Most
algorithms have turned out to be superior to other well-known methods in terms of speed
and/or accuracy. We summarize the characteristics as follows.
• The system can handle formatted, record-based, relational-like data
with numerical and/or categorical attributes. The data can be in text format,
relational format, or TIFF image format. In addition, easy conversion from any
other machine-readable format can be provided through customized feeders.
• System users can perform data analysis and mining on data sets already in the system, or
on any new data they capture or integrate into the system.
• The system is capable of handling large quantities of data and mines them in
scalable time.
• Clients of the system can run on UNIX and Microsoft Windows (including 95, 98,
NT, 2000, XP, and Server 2003) platforms, with the server designed to be a UNIX-only system.
• The system supports major RDBMS platforms.
• The system has an N-tier architecture providing high flexibility. The server
engine can be run on a single machine or distributed across multiple computers for
better scalability and efficiency. The system can automate data ETL (extraction,
transformation, and load) processes or let the users handle everything
manually.
• The system has an open architecture that provides a high degree of software
extensibility and integration capabilities. Users can not only use the system-provided
approaches in association rule mining, classification, prediction, and
similarity search, but can also write their own data mining algorithms using the
Ptree API and compile and deploy them in the DataMIME™ environment to
see the performance of their own algorithms.
• With large amounts of data, data operations require time to process. The system
provides a high level of asynchronous background operation, performing most data-intensive
operations in the background or offline and allowing users to continue
their work.
• The system minimizes the flow of data across the network.
4. Conclusion
From a user's point of view, the basic requirement for providing data mining models as
services over the internet is the ability to efficiently use generic, customizable data mining
tools on a wide array of data sets over the internet. From a developer's point of view, the
architecture should facilitate an iterative development process. This will enable the
integration of new components into the system. It will also allow developers to take
advantage of the latest developments in hardware. We were able to identify a uniform,
efficient vertical data structure at the lowest layer that can take advantage of the latest
hardware. We were able to identify a data management layer that facilitates data
distribution. We also identified a data mining and data capture layer that is defined in the
form of an API. The generic data mining models are developed on top of the data mining
interface. The client is built on top of the communication layer to capture the user
requirements for a particular job.
In this paper we have shown the importance of having a layered architecture for a
distributed data mining system. Key requirements were identified in deciding on the
different layers. A prototype system was developed as a proof-of-concept to show the
feasibility of the approach.
References:
[CDG+98]
J. Chattratichat, J. Darlington, Y. Guo, S. Hedvall, M. Köhler, J. S. A.
Saleem, and D. Yang. Deploying enterprise data mining on the internet. In
PAKDD, 1998.
[CGG+02]
V. Curcin, M. Ghanem, Y. Guo, M. Kohler, A. Rowe, J. Syed, P. Wendel.
Discovery Net: Towards a Grid of Knowledge Discovery. ACM KDD
2002.
[DDP02]
Qin Ding, Qiang Ding and W. Perrizo, Association Rule Mining on
Remotely Sensed Images Using P-trees, Proceedings of PAKDD2002,
Taipei, Taiwan, May 6-8, 2002.
[DKR+02]
Q. Ding, M. Khan, A. Roy, and W. Perrizo, The P-tree algebra.
Proceedings of the ACM SAC, Symposium on Applied Computing, 2002.
[DM03]
DataMIME™ Homepage,
“http://midas.cs.ndsu.nodak.edu/~datasurg/datamime/”
[GKM+99]
R. L. Grossman, S. Kasif, D. Mon, A. Ramu, and B. Malhi. The
preliminary design of papyrus: A system for high performance, distributed
data mining over clusters, meta-clusters and super-clusters, 1999.
[KDP02]
M. Khan, Q. Ding and W. Perrizo, K-nearest Neighbor Classification on
Spatial Data Stream Using P-trees, Proceedings of PAKDD 2002,
Springer-Verlag, Lecture Notes in Artificial Intelligence 2336, May 2002,
pp. 517-528.
[KPH99]
H. Kargupta, B. Park, D. Hershberger and E. Johnson, Collective data
mining: a new perspective toward distributed data mining, In H. Kargupta
and P. Chan (eds.) Advances in Distributed and Parallel Knowledge
Discovery, AAAI Press 1999.
[MBM99]
R. Moore, C. Baru, R. Marciano, A. Rajasekar, and M. Wan. Data-intensive computing. In The Grid: Blueprint for a Future Computing
Infrastructure, edited by I. Foster and C. Kesselman, Morgan Kaufmann
Publishers, 1999, pages 105-129.
[PDD+03]
W. Perrizo, Qin Ding, A. Denton, K. Scott, Qiang Ding, and M. Khan,
PINE - Podium Incremental Neighbor Evaluator for Spatial Data using
Ptrees, ACM SAC 2003.
[PSP02]
A. Perera, M. Serazi, W. Perrizo, Performance Improvement for Bayesian
Classification with Ptrees, CAINE'02, San Diego, Nov. 2002.
[RWL+00]
O.F. Rana, D.W. Walker, M. Li, S. Lynden and M. Ward, PaDDMAS:
parallel and distributed data mining application suite, Proc. International
Parallel and Distributed Processing Symposium (IPDPS/SPDP), IEEE
Computer Society Press, 2000, pp. 387-392.
[PWR+03]
F. Pan, B. Wang, D. Ren, X. Hu, and W. Perrizo, Proximal Support Vector
Machine for Spatial Data Using Peano Trees, CAINE 2003.
[SN00]
S. Sarawagi and S.H. Nagaralu, Data Mining Models as Services on the
Internet, SIGKDD Explorations, June 2000, Volume 2, Issue 1.
[SPT97]
S.J. Stolfo, A.L. Prodromidis, S. Tselepis, W. Lee, D.W. Fan, and P.K.
Chan, JAM: Java agents for meta-learning over distributed databases,
International KDD’97 Conference, 1997, pp. 74-81.