Weka4WS: A WSRF-Enabled Weka Toolkit for Distributed Data Mining on Grids
Domenico Talia, Paolo Trunfio, Oreste Verta
DEIS, University of Calabria
CoreGRID Virtual Institute on Knowledge and Data Management
TG5 Meeting, Edinburgh, January 2006

Outline
- Goal of this work
- Weka4WS, WSRF and GT4
- Weka4WS architecture
- Software components
- Graphical user interface
- Web Services operations
- An example
- Performance analysis
- Conclusions

Goal of this work
By exploiting a service-oriented approach, knowledge discovery applications can be developed on Grids to deliver high performance and to manage data and knowledge distribution.
The goal of this work is to extend the Weka toolkit to support distributed data mining through standard Grid technologies, such as Globus Toolkit 4 (GT4) and the emerging Web Services Resource Framework (WSRF).

The Weka4WS framework
Weka is the most widely used open source suite providing a large collection of data mining algorithms for data preprocessing, classification, clustering, association rules, and visualization, all accessible through a common GUI.
In Weka, the overall data mining process takes place on a single machine, since the algorithms can be executed only locally.
The goal of Weka4WS is to extend Weka to support remote execution of its data mining algorithms.
In Weka4WS, the data mining algorithms for classification, clustering, and association rules can be executed on remote Grid resources to implement distributed data mining and to speed up applications.
To enable remote invocation, all data mining algorithms provided by the Weka library are exposed as Grid Services.
To achieve integration and interoperability with standard Grid environments, Weka4WS has been designed and developed using the emerging Web Services Resource Framework (WSRF) as its enabling technology.
Weka4WS architecture
In Weka4WS, all nodes use GT4 services for standard Grid functionality, such as security and data management.
Nodes fall into two categories:
- user nodes: the local machines of the users, providing the Weka4WS client software
- computing nodes: nodes that provide the Weka4WS Web Services, allowing the execution of remote data mining tasks
Data can be located on computing nodes, user nodes, or third-party nodes.
If the dataset to be mined is not locally available on a computing node, it can be downloaded or replicated by means of the GT4 data management services.

Software components
User nodes include three software components:
- Graphical User Interface (GUI)
- Client Module (CM)
- Weka Library (WL)
Computing nodes include two software components:
- Web Service (WS)
- Weka Library (WL)
The GUI extends the Weka Explorer environment to allow the execution of both local and remote data mining tasks:
- local tasks are executed by directly invoking the local WL
- remote tasks are executed through the CM, which acts as an intermediary between the GUI and the Web Services on remote computing nodes
The WS is a WSRF-compliant Web Service that exposes the data mining algorithms provided by the underlying Weka Library; requests to the WS are therefore executed by invoking the corresponding WL algorithms.

Graphical user interface
A "Remote" pane has been added to the original Weka Explorer environment.
This pane provides a list of the available remote Grid nodes and two buttons to start and stop the data mining task on the selected Grid node.
Through the GUI a user can:
- start the execution locally, using the standard Local pane
- start the execution remotely, using the Remote pane
Each task in the GUI is managed asynchronously by an independent thread. A user can therefore start multiple data mining tasks in parallel on different Web Services, taking full advantage of the distributed Grid environment.
Whenever the output of a data mining task is received from a remote computing node, it is visualized in the standard Output pane.

Web Services operations
WSRF-specific operations:
- createResource: creates a new WS-Resource
- subscribe: subscribes to notifications about resource property changes
- destroy: explicitly requests the destruction of a WS-Resource
Data mining operations:
- classification: submits the execution of a classification task
- clustering: submits the execution of a clustering task
- associationRules: submits the execution of an association rules task
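The lifecycle behind these operations (create a WS-Resource, subscribe for notifications, submit a task without blocking, receive the result as a notification, destroy the resource) can be sketched with a small stand-alone client. All class and method names below are hypothetical illustrations, not the actual Weka4WS API; the real remote call goes through a WSRF-compliant Web Service backed by the Weka Library.

```java
import java.util.concurrent.CompletableFuture;

/** Sketch of the non-blocking invocation pattern used by the data mining operations. */
public class AsyncMiningClient {

    // Stand-in for the remote execution of a Weka algorithm on a computing node.
    static String runRemoteTask(String algorithm, String dataset) {
        return "model(" + algorithm + "," + dataset + ")";
    }

    /** Non-blocking submission; completion plays the role of a WSRF notification. */
    public static CompletableFuture<String> submit(String algorithm, String dataset) {
        return CompletableFuture.supplyAsync(() -> runRemoteTask(algorithm, dataset));
    }

    public static void main(String[] args) {
        // Two tasks submitted in parallel to (simulated) different services,
        // as the GUI allows; neither call blocks the caller.
        CompletableFuture<String> clustering = submit("EM", "census.arff");
        CompletableFuture<String> rules = submit("Apriori", "census.arff");
        CompletableFuture<Void> c1 = clustering.thenAccept(m -> System.out.println("notified: " + m));
        CompletableFuture<Void> c2 = rules.thenAccept(m -> System.out.println("notified: " + m));
        CompletableFuture.allOf(c1, c2).join();  // wait only at shutdown
    }
}
```

The key design point is that the client thread returns immediately after submission, which is what lets the GUI run several remote tasks concurrently.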
The first three operations are related to WSRF-specific invocation mechanisms; the last three are used to request the execution of a specific data mining task.
The classification operation provides access to the complete set of classifiers in the Weka Library (currently, 71 algorithms).
The clustering and associationRules operations expose all the clustering and association rules algorithms provided by the Weka Library (5 and 2 algorithms, respectively).
To improve concurrency, the data mining operations are invoked asynchronously: the client submits the execution in non-blocking mode, and results are notified to the client once they have been computed.

An example
The GUI walk-through shows the main steps of a remote task: file selection, task selection, algorithm selection, host selection, execution start, monitoring of the remote execution status, and output visualization.

Performance analysis
Goals:
- evaluating the execution times of the different steps needed to perform a typical data mining task in different network scenarios
- evaluating the efficiency of the WSRF mechanisms and Weka4WS as methods to execute distributed data mining services
We used 10 datasets extracted from the census dataset available at the UCI repository:
- number of instances: from 1700 to 17000
- dataset size: from 0.5 to 5 MB
Weka4WS has been used to perform a clustering analysis on each of these datasets:
- algorithm used: Expectation Maximization (EM)
- number of clusters to be identified: 10

The clustering analysis on each dataset was executed in two network scenarios:
- Local Area Grid (LAG): the computing node and the user node are connected by a local area network (Bw = 94.4 Mbps, RTT = 1.4 ms)
- Wide Area Grid (WAG): the computing node and the user node are connected by a wide area network (Bw = 213 kbps, RTT = 19 ms)
For each dataset size and network scenario we ran 20 independent executions; the values reported in the following graphs are averages over the 20 executions.

[Chart: Execution times - LAG scenario; log-scale execution time (ms) vs. dataset size (MB); series: resource creation, notification subscription, task submission, dataset download, data mining, results notification, resource destruction, total]
The data mining step takes most of the total time, so the WSRF invocation and execution mechanisms account for only a very small portion of it.
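For reference, the Expectation Maximization algorithm that dominates these measured times can be illustrated with a minimal, self-contained one-dimensional Gaussian-mixture version. This is a didactic sketch under simplifying assumptions (1-D data, fixed iteration count), not Weka's EM implementation, and the class name is invented.

```java
import java.util.Arrays;

/** Minimal 1-D Gaussian-mixture EM: alternate E-step (responsibilities) and M-step (re-estimation). */
public class Em1D {

    /** Fits k Gaussian components to x and returns the component means, sorted. */
    public static double[] fit(double[] x, int k, int iters) {
        int n = x.length;
        double[] mu = new double[k], var = new double[k], w = new double[k];
        double min = Arrays.stream(x).min().getAsDouble();
        double max = Arrays.stream(x).max().getAsDouble();
        for (int j = 0; j < k; j++) {          // spread initial means over the data range
            mu[j] = min + (j + 0.5) * (max - min) / k;
            var[j] = 1.0;
            w[j] = 1.0 / k;
        }
        double[][] r = new double[n][k];       // responsibilities
        for (int it = 0; it < iters; it++) {
            // E-step: posterior probability of each component for each point
            for (int i = 0; i < n; i++) {
                double sum = 0;
                for (int j = 0; j < k; j++) {
                    double d = x[i] - mu[j];
                    r[i][j] = w[j] * Math.exp(-d * d / (2 * var[j])) / Math.sqrt(2 * Math.PI * var[j]);
                    sum += r[i][j];
                }
                for (int j = 0; j < k; j++) r[i][j] /= sum;
            }
            // M-step: re-estimate weight, mean, and variance of each component
            for (int j = 0; j < k; j++) {
                double nj = 0, m = 0, v = 0;
                for (int i = 0; i < n; i++) { nj += r[i][j]; m += r[i][j] * x[i]; }
                m /= nj;
                for (int i = 0; i < n; i++) { double d = x[i] - m; v += r[i][j] * d * d; }
                w[j] = nj / n; mu[j] = m; var[j] = Math.max(v / nj, 1e-6);
            }
        }
        Arrays.sort(mu);
        return mu;
    }

    public static void main(String[] args) {
        // Two well-separated clusters around 0 and 10
        double[] x = {0.1, -0.2, 0.3, 0.0, 9.8, 10.1, 10.3, 9.9};
        System.out.println(Arrays.toString(fit(x, 2, 50)));
    }
}
```

The per-iteration cost grows with the number of instances and attributes, which is why the data mining step scales with dataset size in the charts.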
The total execution time ranges from 111798 ms for the 0.5 MB dataset to 1031209 ms for the 5 MB dataset.
The lines representing the total execution time and the data mining execution time appear coincident, because the data mining step takes from 96% to 99% of the total execution time.

[Chart: Execution times - WAG scenario; log-scale execution time (ms) vs. dataset size (MB); same series as above]
The execution times of the WSRF-specific steps are similar to those measured in the LAG scenario.
The only significant difference is the execution time of the results notification step (2790 ms), due to the additional time needed to transfer the clustering model through a low-speed network.
For the same reason, the transfer of the dataset to be mined requires an execution time significantly greater than in the LAG scenario: the dataset download takes from 14638 ms (0.5 MB) to 132463 ms (5 MB).

[Chart: Execution time percentages - LAG scenario; percentage of the total execution time vs. dataset size (MB); stacked: data mining, dataset download, other steps]
This graph shows the percentages of the execution times of the data mining step, the dataset download, and the other steps (i.e., resource creation, notification subscription, task submission, results notification, resource destruction), with respect to the total execution time in the LAG scenario.
In the LAG scenario the data mining step represents from 96.13% to 99.55% of the total execution time, the dataset download ranges from 0.19% to 0.06%, and the other steps range from 3.67% to 0.38%.

[Chart: Execution time percentages - WAG scenario; percentage of the total execution time vs. dataset size (MB); stacked: data mining, dataset download, other steps]
In the WAG scenario the data mining step represents from 84.62% to 88.32% of the total execution time, the dataset download ranges from 11.22% to 11.20%, while the other steps range from 4.16% to 0.48%.
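As a quick sanity check, the download share in the WAG scenario can be recomputed from the step times quoted above. The totals are read off the log-scale charts (1.30E+05 ms and 1.18E+06 ms), so they are rounded and the result only approximates the quoted figures.

```java
/** Recomputes the dataset-download percentage from the measured step times. */
public class DownloadShare {

    /** Percentage of the total execution time taken by one step. */
    static double pct(double stepMs, double totalMs) {
        return 100.0 * stepMs / totalMs;
    }

    public static void main(String[] args) {
        // WAG, 0.5 MB dataset: download 14638 ms of a total ~1.30E+05 ms
        System.out.printf("0.5 MB: %.1f%%%n", pct(14_638, 1.30e5));
        // WAG, 5 MB dataset: download 132463 ms of a total ~1.18E+06 ms
        System.out.printf("5 MB:   %.1f%%%n", pct(132_463, 1.18e6));
        // Both come out near the quoted ~11.2% download share.
    }
}
```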
Conclusions
Weka4WS adopts the emerging Web Services Resource Framework (WSRF) for accessing remote data mining algorithms and executing remote computations.
The experimental results demonstrate the efficiency of the WSRF mechanisms as a means to execute data mining tasks on remote resources.
The Weka4WS software (including source code) is available for download at: grid.deis.unical.it/weka4ws

Thanks!