Weka4WS: A WSRF-Enabled Weka Toolkit
for Distributed Data Mining on Grids
Domenico Talia, Paolo Trunfio, Oreste Verta
DEIS
University of Calabria
CoreGRID Virtual Institute on Knowledge and Data Management
TG5 Meeting, Edinburgh, January 2006
Outline
• Goal of this work
  ◦ Weka4WS, WSRF and GT4
• Weka4WS architecture
  ◦ Software components
  ◦ Graphical user interface
• Web Services operations
• An example
• Performance analysis
• Conclusions
Goal of this work
• By exploiting a service-oriented approach, knowledge discovery applications can be developed on Grids to deliver high performance and manage data and knowledge distribution
• The goal of this work is to extend the Weka toolkit to support distributed data mining through the use of standard Grid technologies, such as Globus Toolkit 4 and the emerging Web Services Resource Framework (WSRF)
The Weka4WS framework
• Weka is one of the most widely used open source suites; it provides a large collection of data mining algorithms for data preprocessing, classification, clustering, association rules, and visualization, all used through a common GUI
• In Weka, the overall data mining process takes place on a single machine, since the algorithms can be executed only locally
• The goal of Weka4WS is to extend Weka to support remote execution of Weka data mining algorithms
The Weka4WS framework
• In Weka4WS the data mining algorithms for classification, clustering and association rules can be executed on remote Grid resources to implement distributed data mining and speed up applications
• To enable remote invocation, all data mining algorithms provided by the Weka library are exposed as Grid Services
• To achieve integration and interoperability with standard Grid environments, Weka4WS has been designed and developed using the emerging Web Services Resource Framework (WSRF) as its enabling technology
Weka4WS architecture
• In Weka4WS all nodes use GT4 services for standard Grid functionalities, such as security and data management
• We divide these nodes into two categories:
  ◦ user nodes, which are the local machines of the users and provide the Weka4WS client software
  ◦ computing nodes, which provide the Weka4WS Web Services that allow the execution of remote data mining tasks
• Data can be located on computing nodes, user nodes, or third-party nodes
• If the dataset to be mined is not locally available on a computing node, it can be downloaded or replicated by means of the GT4 data management services
Software components
[Figure: a user node (Graphical User Interface, Client Module, Weka Library, GT4 Services) and a computing node (Web Service, Weka Library, GT4 Services)]
• User nodes include three software components:
  ◦ Graphical User Interface (GUI)
  ◦ Client Module (CM)
  ◦ Weka Library (WL)
Software components
[Figure: a user node (Graphical User Interface, Client Module, Weka Library, GT4 Services) and a computing node (Web Service, Weka Library, GT4 Services)]
• Computing nodes include two software components:
  ◦ Web Service (WS)
  ◦ Weka Library (WL)
Software components
[Figure: user node and computing node components, with arrows labelled "local task" and "remote task" leaving the Graphical User Interface]
• The GUI extends the Weka Explorer environment to allow the execution of both local and remote data mining tasks:
  ◦ local tasks are executed by directly invoking the local WL
  ◦ remote tasks are executed through the CM, which operates as an intermediary between the GUI and Web Services on remote computing nodes
Software components
[Figure: user node and computing node components, with an arrow labelled "Algorithm invocation" on the computing node]
• The WS is a WSRF-compliant Web Service that exposes the data mining algorithms provided by the underlying Weka Library
• Therefore, requests to the WS are executed by invoking the corresponding WL algorithms
Graphical user interface
• A “Remote pane” has been added to the original Weka Explorer environment
Graphical user interface
• This pane provides a list of the available remote Grid nodes and two buttons to start and stop the data mining task on the selected Grid node
Graphical user interface
• Through the GUI a user can either:
  ◦ start the execution locally by using the standard Local pane
  ◦ start the execution remotely by using the Remote pane
• Each task in the GUI is managed asynchronously by an independent thread
• A user can start multiple data mining tasks in parallel on different Web Services, thereby taking full advantage of the distributed Grid environment
• Whenever the output of a data mining task has been received from a remote computing node, it is visualized in the standard Output pane
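The one-thread-per-task behaviour described above can be sketched roughly as follows. This is a minimal illustration, not Weka4WS code: `run_remote_task`, `run_in_parallel`, the host names and the delays are all hypothetical, and the remote call is simulated with a sleep instead of a Web Service invocation.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import time


def run_remote_task(node, algorithm, delay):
    """Hypothetical stand-in for a remote data mining call (no network I/O)."""
    time.sleep(delay)  # simulate the remote computation
    return f"{algorithm} finished on {node}"


def run_in_parallel(tasks):
    """Start each task in its own worker thread and collect outputs as they arrive."""
    outputs = []
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        futures = [pool.submit(run_remote_task, *task) for task in tasks]
        for future in as_completed(futures):
            # In the GUI, this is the point where output would reach the Output pane.
            outputs.append(future.result())
    return outputs


# Two tasks on two different (made-up) computing nodes.
tasks = [
    ("node-a.example.org", "EM", 0.2),
    ("node-b.example.org", "J48", 0.1),
]
results = run_in_parallel(tasks)
print(results)
```

Because the tasks run concurrently, the elapsed time is close to that of the slowest task rather than the sum of all tasks.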
Web Services operations
Category        Operation          Description
WSRF-specific   createResource     Creates a new WS-Resource.
WSRF-specific   subscribe          Subscribes to notifications about resource property changes.
WSRF-specific   destroy            Explicitly requests the destruction of a WS-Resource.
Data mining     classification     Submits the execution of a classification task.
Data mining     clustering         Submits the execution of a clustering task.
Data mining     associationRules   Submits the execution of an association rules task.
• The first three operations are related to WSRF-specific invocation mechanisms
• The last three operations are used to request the execution of a specific data mining task
Web Services operations
• The classification operation provides access to the complete set of classifiers in the Weka Library (currently, 71 algorithms)
• The clustering and associationRules operations expose all the clustering and association rules algorithms provided by the Weka Library (5 and 2 algorithms, respectively)
• To improve concurrency the data mining operations are invoked in an asynchronous way:
  ◦ the client submits the execution in a non-blocking mode, and results are notified to the client whenever they have been computed
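The submit-then-notify pattern can be illustrated with a small in-process simulation. This is only a sketch under stated assumptions: `FakeMiningService` and its method names are hypothetical and merely mirror the operation names listed on the slides; it does not use GT4 or any real WSRF API, and the "model" it returns is a placeholder string.

```python
import threading


class FakeMiningService:
    """Hypothetical in-process stand-in for a Weka4WS Web Service."""

    def __init__(self):
        self._resources = {}  # resource id -> list of subscriber callbacks
        self._next_id = 0

    def create_resource(self):
        self._next_id += 1
        self._resources[self._next_id] = []
        return self._next_id

    def subscribe(self, rid, callback):
        self._resources[rid].append(callback)

    def clustering(self, rid, dataset, num_clusters):
        """Return immediately; the task runs in a background thread."""
        def work():
            model = f"{num_clusters} clusters over {len(dataset)} instances"
            for callback in self._resources[rid]:
                callback(model)  # notify subscribers once the result exists
        worker = threading.Thread(target=work)
        worker.start()
        return worker

    def destroy(self, rid):
        del self._resources[rid]


service = FakeMiningService()
done = threading.Event()
received = []


def on_result(model):
    received.append(model)
    done.set()


rid = service.create_resource()           # resource creation
service.subscribe(rid, on_result)         # notification subscription
worker = service.clustering(rid, [[0.1], [0.9], [0.5]], num_clusters=2)  # non-blocking submission
done.wait(timeout=5)                      # the client is free to do other work before this point
worker.join()
service.destroy(rid)                      # resource destruction
print(received)
```

The client thread never blocks inside `clustering`; it learns about the result only through the callback it registered with `subscribe`.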
Example (sequence of GUI screenshots):
• file selection
• task selection
• algorithm selection
• host selection
• execution start
• task execution (callout: status of the remote execution)
• output visualization (callout: output)
Performance analysis
• Goals:
  ◦ evaluating the execution times of the different steps needed to perform a typical data mining task in different network scenarios
  ◦ evaluating the efficiency of the WSRF mechanisms and Weka4WS as methods to execute distributed data mining services
• We used 10 datasets extracted from the census dataset available at the UCI repository:
  ◦ number of instances: from 1700 to 17000
  ◦ dataset size: from 0.5 to 5 MB
• Weka4WS has been used to perform a clustering analysis on each of these datasets:
  ◦ algorithm used: Expectation Maximization (EM)
  ◦ number of clusters to be identified: 10
Performance analysis
• The clustering analysis on each dataset was executed in two network scenarios:
  ◦ Local Area Grid (LAG): the computing node and the user node are connected by a local area network (Bw = 94.4 Mbps, RTT = 1.4 ms)
  ◦ Wide Area Grid (WAG): the computing node and the user node are connected by a wide area network (Bw = 213 kbps, RTT = 19 ms)
• For each dataset size and network scenario we ran 20 independent executions:
  ◦ the values reported in the following graphs are computed as an average of the values measured in the 20 executions
[Figure: Execution times - LAG scenario. x-axis: dataset size (MB), 0.5 to 5.0; y-axis: execution time (ms), logarithmic scale from 1.0E+02 to 1.0E+07. Series: resource creation, notification subscription, task submission, dataset download, data mining, results notification, resource destruction, total. Callouts: data mining, dataset download. Data labels (ms): 1.12E+05, 2.12E+05, 3.11E+05, 4.16E+05, 5.19E+05, 6.25E+05, 7.26E+05, 8.30E+05, 9.24E+05, 1.03E+06]
• The data mining step accounts for most of the total time, so the WSRF invocation and execution mechanisms take only a very small portion of it
[Figure: Execution times - LAG scenario, same axes and series as the previous chart. Callout: total time]
• The total execution time ranges from 111798 ms for the dataset of 0.5 MB to 1031209 ms for the dataset of 5 MB
• The lines representing the total execution time and the data mining execution time appear coincident, because the data mining step takes from 96% to 99% of the total execution time
[Figure: Execution times - WAG scenario. x-axis: dataset size (MB), 0.5 to 5.0; y-axis: execution time (ms), logarithmic scale from 1.0E+02 to 1.0E+07. Series: resource creation, notification subscription, task submission, dataset download, data mining, results notification, resource destruction, total. Data labels (ms): 1.30E+05, 2.46E+05, 3.61E+05, 4.86E+05, 6.16E+05, 7.25E+05, 8.49E+05, 9.64E+05, 1.06E+06, 1.18E+06]
• The execution times of the WSRF-specific steps are similar to those measured in the LAG scenario
• The only significant difference is the execution time of the results notification step (2790 ms), due to the additional time needed to transfer the clustering model over a low-speed network
[Figure: Execution times - WAG scenario, same axes and series as the previous chart]
• For the same reason, transferring the dataset to be mined takes significantly longer than in the LAG scenario:
  ◦ Dataset download: 14638 ms (0.5 MB) ... 132463 ms (5 MB)
[Figure: Execution times percentage - LAG scenario. x-axis: dataset size (MB), 0.5 to 5.0; y-axis: % of the total execution time, 0 to 100. Series: data mining, dataset download, other steps. Data labels: 96.13%, 98.02%, 98.71%, 99.02%, 99.21%, 99.32%, 99.40%, 99.48%, 99.52%, 99.55%]
• This graph shows the percentage of the total execution time taken by data mining, dataset download, and the other steps (i.e., resource creation, notification subscription, task submission, results notification, resource destruction) in the LAG scenario
[Figure: Execution times percentage - LAG scenario, same chart as on the previous slide]
• In the LAG scenario the data mining step represents from 96.13% to 99.55% of the total execution time, the dataset download ranges from 0.19% to 0.06%, and the other steps range from 3.67% to 0.38%
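As a worked check, the LAG percentages can be converted back into approximate absolute times using the total execution times quoted earlier (111798 ms for 0.5 MB and 1031209 ms for 5 MB). The sketch below uses only figures from the slides; the variable and function names are illustrative, and the results are approximate because the published percentages are rounded.

```python
# Figures quoted on the slides for the LAG scenario.
total_ms = {"0.5 MB": 111798, "5 MB": 1031209}
share_pct = {
    "0.5 MB": {"data mining": 96.13, "dataset download": 0.19, "other steps": 3.67},
    "5 MB": {"data mining": 99.55, "dataset download": 0.06, "other steps": 0.38},
}


def absolute_times(total, shares):
    """Convert percentage shares of a total into rounded milliseconds."""
    return {step: round(total * pct / 100) for step, pct in shares.items()}


lag_ms = {size: absolute_times(total_ms[size], share_pct[size]) for size in total_ms}
for size, steps in lag_ms.items():
    print(size, steps)
```

Under these figures the "other steps" (the WSRF-specific ones) come out at roughly 4 seconds for both the smallest and the largest dataset, which is why their share shrinks as the data mining time grows.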
[Figure: Execution times percentage - WAG scenario. x-axis: dataset size (MB), 0.5 to 5.0; y-axis: % of the total execution time, 0 to 100. Series: data mining, dataset download, other steps. Data labels: 84.62%, 86.40%, 87.29%, 87.56%, 88.10%, 88.19%, 88.31%, 88.32%, 88.25%, 88.32%]
• In the WAG scenario the data mining step represents from 84.62% to 88.32% of the total execution time, the dataset download ranges from 11.22% to 11.20%, while the other steps range from 4.16% to 0.48%
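The WAG figures can be cross-checked against each other: dividing the measured dataset download time by its share of the total should give back the total execution time. The sketch below uses the download times (14638 ms and 132463 ms) and shares (11.22% and 11.20%) from the slides, and assumes that the smallest and largest chart labels (1.30E+05 and 1.18E+06 ms) are the totals for 0.5 MB and 5 MB; that mapping is an inference, not something the slides state.

```python
# Dataset download time (ms) and its share of the total (%), WAG scenario.
download_ms = {"0.5 MB": 14638, "5 MB": 132463}
download_share_pct = {"0.5 MB": 11.22, "5 MB": 11.20}

# Assumed mapping of the smallest and largest chart labels to dataset sizes.
chart_total_ms = {"0.5 MB": 1.30e5, "5 MB": 1.18e6}

# Total implied by "download time = share * total".
implied_total_ms = {
    size: download_ms[size] / (download_share_pct[size] / 100)
    for size in download_ms
}

for size, implied in implied_total_ms.items():
    relative_gap = abs(implied / chart_total_ms[size] - 1)
    print(f"{size}: implied total {implied:.0f} ms, gap vs chart label {relative_gap:.2%}")
```

Both implied totals fall within about half a percent of the chart labels, so the percentages and the absolute times on the slides are mutually consistent under that assumption.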
Conclusions
• Weka4WS adopts the emerging Web Services Resource Framework (WSRF) for accessing remote data mining algorithms and executing remote computations
• The experimental results demonstrate the efficiency of the WSRF mechanisms as a means to execute data mining tasks on remote resources
• The Weka4WS software (including source code) is available for download at the following URL: grid.deis.unical.it/weka4ws
Thanks!