WEEKLY STATUS REPORT
Juan Bernal
7/17/2008

ACTIVITIES:
This week's work centered on Weka-parallel. There were some initial problems getting the system running, but once these were resolved, Weka-parallel could be used and tested. Weka-parallel was executed from the GUI rather than from the command line, because command-line execution consistently produced errors. From the GUI, tests could be run with the SMO classification algorithm, but RandomForest and MultiLayerPerceptron could not be tested because Weka-parallel is based on an old version of Weka that does not include RF or MLP. So only SMO was tested under Weka-parallel, in both localhost and distributed/parallel mode.

Besides Weka-parallel, other tests were done with Weka-Grid. These tests focused on the MLP algorithm, using the dataset that excludes the NUMFAULTS attribute. This was done because, in the earlier SMO runs, the time to build the model on that reduced dataset was lower than in the test with the complete dataset including NUMFAULTS; however, the error rates were very high compared to those of the test with the NUMFAULTS dataset.

The next batch of tests, started this week, combines the CCCS-Fit and CCCS-Test datasets into a single set of 282 instances and runs it on weka4ws, Weka-Grid, and Weka-parallel. Only the Weka-Grid tests were completed by the end of the week; the others were left for the last week of research activities.

ACCOMPLISHMENTS:
Tested Weka-parallel with the one available algorithm, SMO. Also tested the dataset without the NUMFAULTS attribute using the MLP algorithm. Additionally, started tests on a bigger dataset of 282 instances with Weka-Grid.

ISSUES/PROBLEMS:
Weka-parallel could only run the tests with the SMO algorithm; since it is based on an old version of Weka, it does not include the Random Forest and MultiLayerPerceptron algorithms. Regarding Weka-Grid and the tests with the bigger dataset, Weka-Grid never yielded results for distributed SMO and MLP; results could only be obtained on localhost. Random Forest was tested again, with the same outcome as before: Weka-Grid simply froze and gave no results at either the localhost or the distributed level. The system was also unavailable for one day.

PLANS:
For the coming week, work with weka4ws to run the tests on the CCCS dataset of 282 instances, as well as with Weka-Grid and the algorithms left undone, such as MLP. Also gather results and create tables and result charts.
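For reference, the intended single-machine test can also be reproduced directly against stock Weka's Java API, independently of the Weka-parallel and Weka-Grid front-ends. The sketch below is a minimal baseline, not the Weka-parallel launcher: the ARFF path and the NUMFAULTS attribute index are placeholders, and the class attribute is assumed to be the last one remaining.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.functions.SMO;
    import weka.core.Instances;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Remove;

    public class SmoBaseline {
        public static void main(String[] args) throws Exception {
            // Load the ARFF file; "cccs.arff" is a placeholder path.
            Instances data = new Instances(new BufferedReader(new FileReader("cccs.arff")));

            // Drop the NUMFAULTS attribute; the 1-based index "4" is a
            // placeholder and must be replaced with its real position.
            Remove remove = new Remove();
            remove.setAttributeIndices("4");
            remove.setInputFormat(data);
            Instances filtered = Filter.useFilter(data, remove);

            // Assumption: the class attribute is the last one remaining.
            filtered.setClassIndex(filtered.numAttributes() - 1);

            // 10-fold cross-validation with SMO, timing the whole run.
            long start = System.currentTimeMillis();
            Evaluation eval = new Evaluation(filtered);
            eval.crossValidateModel(new SMO(), filtered, 10, new Random(1));
            System.out.println(eval.toSummaryString());
            System.out.println("Elapsed ms: " + (System.currentTimeMillis() - start));
        }
    }

Comparing the summary output of a baseline like this against the grid runs would make it easier to tell whether a discrepancy comes from the dataset variant or from the distributed setup.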
SUMMARIES/CRITIQUES OF PAPERS:
Mohammed J. Zaki, Ching-Tien Ho, Rakesh Agrawal, "Parallel Classification for Data Mining on Shared-Memory Multiprocessors", IBM Almaden Research Center, San Jose, CA.

Classification is a central task in the data mining domain. It consists of using a training dataset, with ordinary attributes and a designated class attribute, to predict the class of new records that carry no class label. Among the various classification algorithms, decision trees are best suited for data mining applications: they can be built faster than other methods, they are easily interpreted, they can be converted into SQL statements that allow easy and efficient access to databases, and they often obtain similar or better accuracy than other classification algorithms. Building classification models from larger training sets can enable higher-accuracy models. Recent examples of classification systems that manage disk-resident data include SLIQ and SPRINT.

With the amount of collected data constantly growing, high-performance data mining tools must be built on parallel computing. Previous work focused on distributed-memory parallel machines, where each processor has its own private memory and local disks. However, shared-memory multiprocessor (SMP) systems also deliver high performance, with a low to medium degree of parallelism and at economical prices. In SMP systems, processors communicate through shared variables and any processor can access any disk in the system, so the programming architecture differs from that of distributed-memory systems.

The paper presents parallel algorithms for building decision-tree classifiers on shared-memory systems. Data parallelism is obtained by scheduling attributes among processors, combined with task pipelining and dynamic subtree partitioning. The algorithms were evaluated on two SMP configurations: one where the data was larger than the available memory and had to be paged from local disk, and one where memory was large enough to hold all the data. For the first configuration, speedups ranged from 2.97 to 3.86 for the build phase and from 2.20 to 3.67 for total time on a 4-processor SMP. For the second, speedups ranged from 5.36 to 6.67 for the build phase and from 3.07 to 5.98 for total time on an 8-processor SMP.

Serial classifier construction is described in detail; it basically has two phases, a growth phase and a prune phase. The SMP classifier in the paper performs the same basic computations as serial SPRINT, but the order of execution is different. An earlier parallel SPRINT implementation on a distributed-memory machine was based on record data parallelism: each of the P processors managed a 1/P fraction of each attribute list, and all processors synchronously built one global decision tree, one node at a time. The reason for not mapping distributed-memory SPRINT directly onto an SMP system is performance constraints. The paper therefore develops parallel tree building for SMP systems around the three steps of the growth phase: evaluate split points per attribute; find the winning split point and construct a probe structure; and split each attribute list in two, one list per child, using the probe structure. The parallelism in the algorithm is based on attribute data parallelism (attributes are divided equally among processors) and also makes use of inter-node task parallelism (different parts of the decision tree are built separately by different processors). The authors then present the BASIC scheme, the Fixed-Window-K scheme, and the Moving-Window-K (MWK) algorithm. A performance evaluation showed that the MWK and SUBTREE algorithms yield good speedups when building the classifier on a 4-processor SMP with the disk configuration and on an 8-processor SMP with the memory configuration. The experiments showed that the classification data mining task can be effectively parallelized on SMP machines.
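To make the attribute-scheduling idea concrete, here is a minimal sketch of attribute data parallelism in Java. It only illustrates the pattern described above, not the paper's implementation: evaluateAttribute() is a hypothetical stand-in for the per-attribute split-point evaluation, which in SPRINT-style classifiers scans sorted attribute lists and computes a gini index.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class AttributeParallelSplit {
        // Hypothetical stand-in for the per-attribute split evaluation; a real
        // classifier would scan the attribute list and return a gini-based score.
        static double evaluateAttribute(int attribute) {
            return Math.random();
        }

        public static void main(String[] args) throws Exception {
            int numAttributes = 16;
            int numThreads = 4;
            ExecutorService pool = Executors.newFixedThreadPool(numThreads);
            List<Future<double[]>> results = new ArrayList<>();

            // Attribute data parallelism: each attribute's evaluation is a task,
            // scheduled across the worker threads ("processors").
            for (int a = 0; a < numAttributes; a++) {
                final int attr = a;
                results.add(pool.submit(() -> new double[] { attr, evaluateAttribute(attr) }));
            }

            // Reduction: pick the winning split point across all attributes.
            int bestAttr = -1;
            double bestScore = Double.NEGATIVE_INFINITY;
            for (Future<double[]> f : results) {
                double[] r = f.get();
                if (r[1] > bestScore) {
                    bestScore = r[1];
                    bestAttr = (int) r[0];
                }
            }
            pool.shutdown();
            System.out.println("Winning attribute: " + bestAttr + ", score: " + bestScore);
        }
    }

The FWK and MWK schemes extend this basic pattern by overlapping work across multiple tree nodes rather than synchronizing after each one.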
This was a long paper, but it was very detailed and not complicated to understand. In some areas it assumes that the reader already has deep background knowledge of the classification process and, in particular, of the proposed algorithms. Overall, it covers a very interesting topic and is well written for a general audience. The paper is important because it yields insight into what is needed to parallelize data mining algorithms, in particular classification with decision trees. It helped me understand decision trees in depth and the processes involved in building them, and it showed me that parallelizing even a single data mining algorithm is not an easy task; what confronts me in the future is a big one.