De La Salle University • College of Computer Studies
INTROAI / Introduction to Artificial Intelligence
AY 2010-11 Term 1

Assignment 3: Empirical Analysis of Machine Learning Algorithms

Instructions:

Your overall task is to compare the learning curves [1] of a decision tree learner and a multilayer neural network learner on a published data set. There is no need to implement the learning algorithms; you can use Weka, a suite of machine learning algorithms implemented in Java, available at http://www.cs.waikato.ac.nz/ml/weka/. Extensive documentation is also available from that site. For this assignment, use Weka's J48 (which implements C4.5, a more robust version of ID3) and MultilayerPerceptron (which implements backpropagation for multilayer neural networks).

Get the data set from the UC Irvine Machine Learning Repository at http://archive.ics.uci.edu/ml/datasets.html. Many datasets in the UCI repository are in C4.5 data format, a brief description of which is available at http://www.cs.washington.edu/dm/vfml/appendixes/c45.htm. Weka accepts C4.5 format as well as ARFF, Weka's own format, explained in detail at http://www.cs.waikato.ac.nz/~ml/weka/arff.html. [2]

No two groups may work on the same dataset, so obtain dataset approval as soon as possible through raymund.sison@delasalle.ph. When you propose a dataset, specify the following: [3]

- Classification? (Y/N)
- Attribute characteristics (C/I/R)
- Data set characteristics (M/U/S/T/R)
- Number of instances
- Number of attributes
- Missing values? (Y/N)

Report contents and grading:

- Description of the experiments, i.e., how the learning curves were produced (1 point)
- Decision tree and neural network models, with nodes and weights (2 points)
- Learning curves for J48 and the multilayer network, with the source data in Excel tables and .csv files in the appendix (2 points)
- Analysis of the learning curves (5 points)

[1] Assuming you are using cross-validation with N=10 folds, for each of the 10 folds you normally use 9/10 of the data for training (TRAINi, i=1..10) and 1/10 for testing (TESTi, i=1..10). To plot a learning curve, you also consider subsets of the training data: for each fold, you repeat the experiment using 10%, 20%, ..., 100% of TRAINi for training, while still using the entire test set (TESTi) for testing. The x-axis of the learning curve is the number of training instances; the y-axis is the percentage of test items classified correctly. Sample learning curves can be found in (Russell & Norvig, 2003, p. 747).

[2] Therefore, you might have to convert a dataset from C4.5 format to ARFF. These formats are described in the links above. If you still have problems after your conversion, let me know; I might have the ARFF file for the dataset you have chosen. Here are some common problems when converting a dataset to ARFF format:

- A missing attribute (a major error). The number of columns in the data must match the number of attributes declared in the ARFF file.
- Unusual symbols in, or following, the names of attributes, e.g., a colon (Var1:Sub1) or a descriptor (Var1 (Hz)). Stick to alphanumeric characters and the dash (e.g., Var1-Sub1).
- Attributes declared as string. Weka, or more specifically Weka's classifiers, cannot handle string attributes; replace these with integers or enumerations.
- Weka takes the last attribute as the class by default. Override this by specifying which attribute Weka should treat as the class.
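The cross-validation procedure in note [1] can be sketched in code. The following is a minimal Python illustration (not Weka) of how the learning-curve points are computed: a majority-class predictor stands in for the real learner, whereas in the assignment itself J48 or MultilayerPerceptron fills that role, and the toy data at the end is hypothetical.

```python
import random
from collections import Counter

def learning_curve(labels, folds=10, seed=1):
    """Mean test accuracy vs. training fraction, via k-fold cross-validation.

    A majority-class predictor is used as a placeholder learner; in the
    assignment, J48 or MultilayerPerceptron plays this role instead.
    Returns a dict mapping each fraction (0.1, 0.2, ..., 1.0) to the
    accuracy averaged over the folds.
    """
    rng = random.Random(seed)
    idx = list(range(len(labels)))
    rng.shuffle(idx)
    fold_of = {i: pos % folds for pos, i in enumerate(idx)}

    fractions = [f / 10 for f in range(1, 11)]
    acc_sum = {f: 0.0 for f in fractions}
    for k in range(folds):
        train = [i for i in idx if fold_of[i] != k]   # TRAINk: 9/10 of the data
        test = [i for i in idx if fold_of[i] == k]    # TESTk: the held-out fold
        for f in fractions:
            # Use 10%, 20%, ..., 100% of TRAINk, but always the full TESTk.
            subset = train[: max(1, round(f * len(train)))]
            majority = Counter(labels[i] for i in subset).most_common(1)[0][0]
            correct = sum(labels[i] == majority for i in test)
            acc_sum[f] += correct / len(test)
    return {f: s / folds for f, s in acc_sum.items()}

# Hypothetical data: 100 instances, 70% in class "a". With a majority-class
# predictor, the curve settles near 0.7 at the full training-set size.
curve = learning_curve(["a"] * 70 + ["b"] * 30)
```

Plotting the returned fractions (converted to instance counts) against the accuracies yields the learning curve described above.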
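To make the conversion pitfalls in note [2] concrete, here is a minimal Python sketch that emits an ARFF file from an attribute list and data rows. The relation name, attributes, and rows are toy examples, not from any particular UCI dataset; note that nominal attributes are declared in braces, "?" marks a missing value, and the class is placed last.

```python
def to_arff(relation, attributes, rows):
    """Render a dataset in Weka's ARFF format.

    attributes: list of (name, type) pairs; type is the string "numeric"
    or a list of nominal values. rows: tuples of values, "?" for missing.
    """
    lines = [f"@relation {relation}", ""]
    for name, atype in attributes:
        if isinstance(atype, list):
            # Nominal attribute: values enumerated in braces.
            lines.append("@attribute {} {{{}}}".format(name, ",".join(atype)))
        else:
            lines.append(f"@attribute {name} {atype}")
    lines += ["", "@data"]
    lines += [",".join(str(v) for v in row) for row in rows]
    return "\n".join(lines)

# Toy dataset (hypothetical attribute names, dash-only identifiers):
arff = to_arff(
    "weather",
    [("outlook", ["sunny", "overcast", "rainy"]),
     ("humidity", "numeric"),
     ("play", ["yes", "no"])],           # last attribute = class by default
    [("sunny", 85, "no"), ("overcast", 78, "yes"), ("rainy", "?", "yes")],
)
print(arff)
```

Each data row here has exactly three values, matching the three declared attributes, which is precisely the column-count check mentioned above.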
[3] Here are some guidelines when choosing a dataset:

- Choose only datasets whose associated task is classification, because decision trees are classifiers.
- Do not choose sequential or time-series datasets, because decision trees cannot handle them.
- Do not choose relational datasets; these require predicate-logic learners, while decision trees and backprop networks handle only propositional representations.
- Do not choose datasets with too few instances, or your learning curves will not be meaningful. It is best to choose a dataset with more than 100 instances.
- However, do not choose datasets with too many instances (e.g., thousands), because the multilayer network learner will then take a long time to build a model.
- It is best not to choose a dataset with missing values. If you insist on choosing one, you must be careful about how you treat them: study the literature on the dataset, see how missing values were handled, and include those papers and their results in your report. The papers that cite a dataset are listed at the bottom of its page.