
WEEKLY STATUS REPORT
Juan Bernal
7/17/2008
ACTIVITIES:
This week, work continued on Weka-parallel. After some initial problems getting the system
running, Weka-parallel could be used and tested. It was executed from the GUI rather than from
the command line, because command-line execution consistently produced errors. From the GUI,
tests could be run on the SMO classification algorithm, but RandomForest and
MultiLayerPerceptron could not be tested, because Weka-parallel is based on an old version of
Weka that does not have RF or MLP implemented. So only SMO was tested under Weka-parallel,
in both localhost and distributed/parallel mode.
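For reference, the kind of SMO run involved looks roughly like the following sketch against the
stock Weka Java API (the class names are standard Weka; the dataset path, class index, and
cross-validation setup are assumptions, not the exact Weka-parallel invocation):

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.functions.SMO;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class SmoLocalTest {
        public static void main(String[] args) throws Exception {
            // load the ARFF dataset (path is hypothetical)
            Instances data = new DataSource("cccs.arff").getDataSet();
            data.setClassIndex(data.numAttributes() - 1); // assume class is last
            SMO smo = new SMO();
            // 10-fold cross-validation, timing the whole evaluation
            long start = System.currentTimeMillis();
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(smo, data, 10, new Random(1));
            System.out.println(eval.toSummaryString());
            System.out.println("evaluation time: "
                    + (System.currentTimeMillis() - start) + " ms");
        }
    }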
Besides Weka-parallel, other tests were done with Weka-Grid. These tests focused on the MLP
algorithm, but with the dataset that excludes the NUMFAULTS attribute. This was done because,
when running SMO on that reduced dataset, the times to build the model were lower than in the
test done with the complete dataset including the NUMFAULTS attribute; however, the error
rates were very high compared to those of the test with the NUMFAULTS attribute included.
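For illustration, dropping an attribute such as NUMFAULTS can be done with Weka's standard
Remove filter; this is only a sketch, and the file name and attribute position are assumptions:

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Remove;

    public class DropNumFaults {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("cccs.arff").getDataSet();
            Remove remove = new Remove();
            // "-R" takes a 1-based attribute index; NUMFAULTS is assumed
            // to sit just before the class attribute here
            remove.setOptions(new String[] {
                    "-R", String.valueOf(data.numAttributes() - 1)});
            remove.setInputFormat(data);
            Instances reduced = Filter.useFilter(data, remove);
            System.out.println("attributes before: " + data.numAttributes()
                    + ", after: " + reduced.numAttributes());
        }
    }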
The next batch of tests initiated were those with the combination of the CCCS-Fit and
CCCS-Test datasets, comprising 282 instances, run on weka4ws, Weka-Grid, and Weka-parallel.
Only the tests with Weka-Grid were completed by the end of the week; the other tests were left
for the last week of research activities.
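A sketch of how the two files can be combined into one 282-instance set with the Weka API
(file names follow the report; it assumes both ARFF headers are identical):

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class CombineCccs {
        public static void main(String[] args) throws Exception {
            Instances fit = new DataSource("CCCS-Fit.arff").getDataSet();
            Instances test = new DataSource("CCCS-Test.arff").getDataSet();
            // copy the fit set, then append every test instance to it
            Instances combined = new Instances(fit);
            for (int i = 0; i < test.numInstances(); i++) {
                combined.add(test.instance(i));
            }
            System.out.println("combined instances: " + combined.numInstances());
        }
    }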
ACCOMPLISHMENTS:
Tested Weka-parallel with the one available algorithm, SMO. Also tested the dataset without the
NUMFAULTS attribute, using the MLP algorithm. Additionally, tests on a bigger dataset of 282
instances with Weka-Grid were initiated.
ISSUES/PROBLEMS:
Weka-parallel was only able to perform the tests using the SMO algorithm: since it is based on
an old version of the Weka software, it does not have the Random Forest and MultiLayer
Perceptron algorithms implemented.
Regarding Weka-Grid and the test with the bigger dataset, Weka-Grid could never yield results
for distributed SMO and MLP; results could only be obtained on localhost. Random Forest was
tested once again, with the same outcome as before: Weka-Grid would simply freeze and give no
results at either the localhost or the distributed level. The system was also unavailable for
one day.
PLANS:
For the coming week: work with weka4ws to run the tests on the CCCS dataset with 282 instances,
and likewise with Weka-Grid for the algorithms still pending, such as MLP. Also gather the
results and create tables and result charts.
SUMMARIES/CRITIQUES OF PAPERS:
Mohammed J. Zaki, Ching-Tien Ho, Rakesh Agrawal, “Parallel Classification for Data Mining on
Shared-Memory Multiprocessors”, IBM Almaden Research Center, San Jose, CA.
Classification is a main task in the data mining domain; it consists of using a training
dataset with attributes and a particular class attribute to predict the class of a new dataset
that has no class labels. Among the different classification algorithms, decision trees
are best suited for data mining applications. Decision trees can be built faster than
other methods and are easily interpreted. They can also be converted into SQL
statements that allow easy and efficient access to databases. Finally, they often obtain
similar or better accuracy compared to other classification algorithms.
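To make the interpretability point concrete with a Weka-based sketch (J48 is Weka's
C4.5-style learner, not one of the paper's algorithms; the file path is hypothetical), a
trained tree can simply be printed as human-readable rules:

    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class PrintTree {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("train.arff").getDataSet();
            data.setClassIndex(data.numAttributes() - 1);
            J48 tree = new J48();      // C4.5-style decision tree learner
            tree.buildClassifier(data);
            System.out.println(tree);  // prints the tree's if-then structure
        }
    }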
Developing classification models from larger training sets can enable more accurate
models. Recent examples of classification systems that manage disk-resident data include
SLIQ and SPRINT.
With the amount of collected data growing ever larger, the creation of high-performance
data mining tools must be based on parallel computing. Previous work focused on
distributed-memory parallel machines, where each machine has its own private memory and
local disks. However, shared-memory multiprocessor (SMP) systems also deliver high
performance with a low to medium degree of parallelism and at economical prices. In SMP
systems, messages are passed through shared variables and any processor can access any
disk in the system. The programming architecture for SMP systems is different from that
of distributed-memory systems.
The paper presents parallel algorithms for building decision-tree classifiers on
shared-memory systems. The data parallelism is based on scheduling attributes among
processors, combined with task pipelining and dynamic subtree partitioning. The algorithms
were evaluated with two SMP configurations: one where the data was larger than available
memory and had to be paged to local disk, and another where memory was large enough to
hold all the data. For the first configuration, speedup ranged from 2.97 to 3.86 for the
build phase and 2.20 to 3.67 for the total time on a 4-processor SMP. For the second
configuration, speedup was 5.36 to 6.67 for the build phase and 3.07 to 5.98 for the total
time on an 8-processor SMP.
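Reading these figures as parallel efficiency E = S/p (a standard interpretation, not one the
paper states explicitly): a build-phase speedup of 3.86 on 4 processors gives
E = 3.86/4 ≈ 0.97, while 6.67 on 8 processors gives E = 6.67/8 ≈ 0.83.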
Serial classifier building is described in detail. Basically, it has two phases: a growth
phase and a prune phase. The SMP classifier in the paper does the same basic computations
as the serial SPRINT, but the execution procedure is different.
A parallel SPRINT implementation on a distributed-memory machine was done based on
record data parallelism, where each processor manages a 1/P fraction of each attribute
list and all processors synchronously build one global decision tree, one node at a time.
The reason for not mapping the distributed-memory SPRINT onto an SMP system is performance
constraints. That is why the paper develops the building of classification trees in
parallel on SMP systems by explaining the three growth phases: evaluating split points for
each attribute, finding the winning split point and constructing a probe structure, and
finally splitting each attribute list in two, one for each child, using the probe structure.
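A minimal sketch of the first growth phase, evaluating candidate split points along one
sorted numeric attribute list (the gini computation is simplified and this is not the
paper's exact SPRINT data structure):

    public class SplitEval {
        // gini impurity of a class-count histogram
        static double gini(int[] counts, int total) {
            if (total == 0) return 0.0;
            double g = 1.0;
            for (int c : counts) {
                double p = (double) c / total;
                g -= p * p;
            }
            return g;
        }

        // best threshold for sorted values[] with class labels[]
        static double bestSplit(double[] values, int[] labels, int numClasses) {
            int n = values.length;
            int[] below = new int[numClasses];
            int[] above = new int[numClasses];
            for (int l : labels) above[l]++;
            double bestGini = Double.MAX_VALUE, bestThreshold = values[0];
            // sweep candidate splits between consecutive distinct values
            for (int i = 0; i < n - 1; i++) {
                below[labels[i]]++;
                above[labels[i]]--;
                if (values[i] == values[i + 1]) continue;
                int nb = i + 1, na = n - nb;
                double weighted = (nb * gini(below, nb) + na * gini(above, na)) / n;
                if (weighted < bestGini) {
                    bestGini = weighted;
                    bestThreshold = (values[i] + values[i + 1]) / 2;
                }
            }
            return bestThreshold;
        }

        public static void main(String[] args) {
            double[] v = {1.0, 2.0, 3.0, 10.0, 11.0, 12.0};
            int[] y = {0, 0, 0, 1, 1, 1};
            System.out.println("best threshold: " + bestSplit(v, y, 2)); // 6.5
        }
    }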
The parallel part of the algorithm is based on attribute data parallelism (attributes are
divided equally among processors) and, on the task side, makes use of inter-node
parallelism (different parts of the decision tree are built separately among the processors).
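A minimal sketch of the attribute-data-parallelism idea, where worker threads each score a
disjoint set of attributes and the winning attribute is then picked (the scoring function
is a placeholder, and this is not the paper's BASIC/FWK/MWK scheduling):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class AttributeParallel {
        // placeholder: a real classifier would scan the attribute's
        // sorted list and return its best weighted gini value
        static double evaluateAttribute(int attr) {
            return Math.abs(Math.sin(attr)); // dummy score for the sketch
        }

        public static void main(String[] args) throws Exception {
            int numAttributes = 8;
            ExecutorService pool = Executors.newFixedThreadPool(4);
            List<Future<Double>> results = new ArrayList<>();
            // each task evaluates split points for one attribute
            for (int a = 0; a < numAttributes; a++) {
                final int attr = a;
                results.add(pool.submit(() -> evaluateAttribute(attr)));
            }
            double bestScore = Double.MAX_VALUE;
            int bestAttr = -1;
            for (int a = 0; a < numAttributes; a++) {
                double s = results.get(a).get();
                if (s < bestScore) { bestScore = s; bestAttr = a; }
            }
            pool.shutdown();
            System.out.println("winning attribute: " + bestAttr);
        }
    }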
After this, the authors describe the BASIC scheme, the Fixed-Window-K scheme, and the
Moving-Window-K (MWK) algorithm.
A performance evaluation showed that the MWK and SUBTREE algorithms yield good speedups
when building the classifier on a 4-processor SMP with the disk configuration and on an
8-processor SMP with the memory configuration. The experiments showed that the
classification data mining task can be effectively parallelized on SMP machines.
This was a long paper, but it was very detailed and not complicated to understand. In
some areas it assumes the reader has deep prior knowledge of the different processes of
classification and, in particular, of the algorithms proposed. Overall, it covers a very
interesting topic and is well written for a general audience.
This paper is important because it yields insight into what is needed to parallelize data
mining algorithms, in particular classification and decision trees. It helped me
understand decision trees in depth and the processes involved in building them. This
paper showed me that parallelizing even a single data mining algorithm is not an easy
task, and that what confronts me in the future is a big task.