Assignment4

Introduction to Artificial Intelligence (236501)
Home assignment #4 – Machine learning
Due date: Monday, January 31, 2005, at 12:00 noon.
Submit your work in pairs. Only typed (i.e., not handwritten) submissions will be accepted.
Warning: Absolutely no extensions will be granted! Late submissions will not be accepted.
This assignment deals with the utility of feature selection for machine learning algorithms. You
will work with two algorithms:
- ID3, which learns decision trees and inherently performs feature selection by computing
information gain at each node of the tree being built.
- K Nearest Neighbors (KNN), which classifies each instance based on the labels of its k
nearest neighbors and does not perform feature selection internally. KNN usually suffers in
the presence of too many features, hence explicit feature selection is often performed prior to
invoking KNN.
Your task is to compare the performance of ID3 to that of KNN, while the latter is used with and
without feature selection, and to draw conclusions on their relative merits.
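As a rough illustration of the KNN idea described above, the following sketch classifies one instance by majority vote among its k nearest training instances. It is written in Python purely for illustration (the assignment itself must be implemented in Lisp). The Euclidean metric, the function names `euclidean` and `knn_classify`, and the toy data are assumptions made for this example; choosing and justifying a metric is part of your task.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors (an assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(train, query, k, distance=euclidean):
    """Return the majority label among the k training instances closest to query.

    train is a list of (feature_vector, label) pairs.
    A tied vote goes to the tied label that appears first among the sorted neighbors.
    """
    neighbors = sorted(train, key=lambda pair: distance(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two small clusters in two dimensions.
train = [((0.0, 0.0), "ham"), ((0.1, 0.2), "ham"),
         ((1.0, 1.0), "spam"), ((0.9, 1.1), "spam")]
print(knn_classify(train, (0.2, 0.1), k=3))
```

Restricting each feature vector to a subset of its coordinates before calling `knn_classify` turns this into the feature-selected variant.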
Algorithms to compare
1) ID3 – regular ID3 with internal feature selection
2) KNN.all – KNN that uses all the features available
3) KNN.FS – KNN with a priori feature selection
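One possible way to perform a priori feature selection for KNN.FS is to score every feature by information gain and keep the best-scoring fraction. The Python sketch below is illustrative only (the assignment must be implemented in Lisp, where you may reuse the ID3 code). It assumes continuous features scored by their best binary threshold split, which is one common choice and not necessarily what the course's ID3 code does; the names `entropy`, `info_gain`, and `select_features` are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def info_gain(values, labels):
    """Best information gain over binary splits 'value <= t' of one continuous feature.

    Candidate thresholds are midpoints between consecutive distinct values
    (an assumed discretization). Quadratic in the number of examples; fine for a sketch.
    """
    base = entropy(labels)
    distinct = sorted(set(values))
    best = 0.0
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        best = max(best, base - remainder)
    return best

def select_features(vectors, labels, fraction):
    """Indices of the top `fraction` of features ranked by information gain (at least one)."""
    n_features = len(vectors[0])
    gains = [info_gain([v[j] for v in vectors], labels) for j in range(n_features)]
    keep = max(1, round(fraction * n_features))
    ranked = sorted(range(n_features), key=lambda j: gains[j], reverse=True)
    return ranked[:keep]
```

In a cross-validation setting, computing the gains on the training portion only keeps test labels from leaking into the feature ranking.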
Datasets
You will use the following two datasets (obtained from the UCI repository of machine learning
datasets):
1) Spambase Database
- The "spam" concept is diverse: advertisements for products/web sites, make-money-fast
schemes, chain letters, pornography... These are useful when constructing a personalized
spam filter.
- Number of Instances: 4601 (1813 Spam = 39.4%)
- Number of Attributes: 58 (57 continuous, 1 nominal class label)
2) Multiple Features Database
- This dataset consists of features of handwritten numerals (`0'--`9') extracted from a
collection of Dutch utility maps. 200 patterns per class (for a total of 2,000 patterns)
have been digitized in binary images. Digits are represented in terms of Fourier
coefficients, profile correlations, Karhunen-Loeve coefficients, pixel averages, Zernike
moments, and morphological features.
- Number of Instances: 2000 (200 per class)
- Number of Attributes: 649
- Number of Classes: 10 (= digits)
- Note: you don’t need to understand the physical meaning of the coefficients in order to
use this dataset.
The Spam dataset is in the so-called C4.5 data format, which is very common in machine
learning. Here is a brief description of this format (adapted from
http://www.cs.washington.edu/dm/vfml/appendixes/c45.htm). You can learn more about the C4.5
data format at http://sdmc.lit.org.sg/GEDatasets/Format/UsedFormat.html and
http://www2.cs.uregina.ca/~dbd/cs831/notes/ml/dtrees/c4.5/c4.5.html, or you may want to go to
the source: J. Ross Quinlan, “C4.5: Programs for Machine Learning”, Morgan Kaufmann, 1993.
The dataset has two files – spambase.data and spambase.names. The .names file
describes the dataset, while the .data file contains the examples which make up the
dataset. The files contain series of identifiers and numbers with some surrounding syntax.
A | (vertical bar) means that the remainder of the line should be ignored as a comment.
Each identifier consists of a string of characters that does not include comma, question
mark or colon (unless escaped by a backslash). Periods may be embedded provided they
are not followed by a space. Embedded whitespace is also permitted but multiple
whitespace is replaced by a single space.
The .names file contains a series of entries that describe the classes, attributes and values
of the dataset. Each entry is terminated with a period, but the period can be omitted if it
would have been the last thing on a line. The first entry in the file lists the names of the
classes, separated by commas (and terminated by a period). Each successive line then
defines an attribute, in the order in which they will appear in the .data file, with the
following format: attribute-name : attribute-type . The attribute-name is an identifier as
above, followed by a colon, then the attribute-type (always “continuous” for this dataset).
The .data file contains labeled examples in the following format: one example per line,
attribute values separated by commas, class label is at the last position. In your
experiments, you’ll need to split the .data file into two, which will contain the training
and the testing data. Conventionally, these files have extensions .train and .test,
respectively.
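The format described above can be read with a few lines of code. The Python sketch below is illustrative only (the assignment itself must be implemented in Lisp) and deliberately simplified: it assumes one entry per line and ignores backslash escapes. The function names `strip_comment`, `parse_names`, and `parse_data` are hypothetical.

```python
def strip_comment(line):
    """Drop everything from the first '|' onward and trim surrounding whitespace."""
    return line.split("|", 1)[0].strip()

def parse_names(text):
    """Parse the content of a .names file into (classes, attributes).

    Simplifying assumptions: one entry per line, no backslash-escaped characters.
    attributes is a list of (attribute_name, attribute_type) pairs in file order.
    """
    entries = [s for s in (strip_comment(line) for line in text.splitlines()) if s]
    # The terminating period of an entry is optional at the end of a line.
    entries = [e[:-1].rstrip() if e.endswith(".") else e for e in entries]
    classes = [c.strip() for c in entries[0].split(",")]
    attributes = []
    for entry in entries[1:]:
        name, attr_type = entry.split(":", 1)
        attributes.append((name.strip(), attr_type.strip()))
    return classes, attributes

def parse_data(text):
    """Parse the content of a .data file into (feature_vector, label) pairs.

    Assumes every attribute is continuous, as in the Spam dataset.
    """
    examples = []
    for line in text.splitlines():
        line = strip_comment(line)
        if not line:
            continue
        fields = [f.strip() for f in line.split(",")]
        label = fields[-1].rstrip(".")  # tolerate an optional trailing period
        examples.append(([float(f) for f in fields[:-1]], label))
    return examples
```

Both functions take the file content as a string, so they can be tried on a short hand-written excerpt before being run on the full dataset.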
The format of the second dataset (Multiple Features) is even simpler – the data is distributed as 6
files (named ‘mfeat-???’), each containing 2000 lines with one feature vector per line. Each
file corresponds to one set of coefficients (Fourier coefficients, profile correlations,
Karhunen-Loeve coefficients, pixel averages, Zernike moments, and morphological features). In each file
the 2000 patterns are stored in ASCII on 2000 lines. The first 200 patterns are of class `0',
followed by sets of 200 patterns for each of the classes `1' - `9'. Corresponding patterns in
different feature sets (files) correspond to the same original character.
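Because corresponding lines of the six files describe the same character, a full feature vector can be obtained by concatenating line i of every file, and the class label follows from the line position. The Python sketch below is illustrative only (the assignment itself must be implemented in Lisp). It assumes the values on each line are separated by whitespace; check the .DOCUMENTATION file for the actual layout. The names `parse_feature_file` and `combine_feature_sets` are hypothetical.

```python
def parse_feature_file(text):
    """Parse one feature file: one numeric vector per non-empty line.

    Assumes whitespace-separated values.
    """
    return [[float(tok) for tok in line.split()]
            for line in text.splitlines() if line.strip()]

def combine_feature_sets(file_texts, patterns_per_class=200):
    """Concatenate corresponding rows of several feature files and attach class labels.

    Row i (0-based) is labeled i // patterns_per_class, since the patterns are
    stored in blocks of equal size, one block per class, in class order.
    """
    feature_sets = [parse_feature_file(t) for t in file_texts]
    n_rows = len(feature_sets[0])
    if any(len(fs) != n_rows for fs in feature_sets):
        raise ValueError("all feature files must contain the same number of patterns")
    examples = []
    for i in range(n_rows):
        vector = [x for fs in feature_sets for x in fs[i]]
        examples.append((vector, i // patterns_per_class))
    return examples
```

Since the rows are sorted by class, the examples should be shuffled before being split into folds; otherwise some folds would be missing entire classes in their training data.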
Note: the two datasets are available for download at the course Web site. Each dataset contains a
file with extension “.DOCUMENTATION” that gives further information about the dataset and
its format.
Your tasks
1) Get acquainted with the datasets and their format
2) Study the code of ID3 (we also review it in the recitation)
3) Implement KNN. Note that you need to develop a metric that is appropriate for comparing
feature vectors of instances in each dataset. Theoretically, you may use the same metric for
each dataset, but you’d better have very good (and properly documented) reasons for doing
so.
4) Write a function that computes Information Gain, in order to use it as an external module
prior to the invocation of KNN. To achieve this aim, you may use the appropriate
functionality of the Lisp code for ID3.
5) For each dataset D, perform the following experiments using the cross-validation
scheme with N=10 folds
- Classify D using ID3
* Count the number of attributes that were used for splitting instances anywhere
in the tree; denote this number as M.
- For each k = 1, 5, 10, 25, 50
* Classify D using KNN.all
* For each feature selection level l = 0.01, 0.05, 0.1, 0.2, …, 0.9, 1.0
- classify D using KNN.FS with the corresponding fraction of best-scoring features
(judged by their IG), i.e., 1%, 5%, etc.
- classify D using KNN.FS with M best-scoring features (where M was determined
by using ID3 above)
- classify D using KNN.FS with exactly those M features that were used by ID3
(see above)
- For each experiment, measure the accuracy obtained in each cross-validation fold,
then average the accuracy over the 10 folds. This is the accuracy value that represents
the outcome of each experiment.
- Summarize your experimental results in tables and graphs.
6) Compare the performance of ID3, KNN.all and KNN.FS based on the accuracy they yielded
in the above experiments. Draw conclusions about the utility of feature selection for different
classification algorithms and with different parameter settings (values of k, levels of feature
selection etc.).
7) Bonus (9¾ points): repeat the above experiments to build learning curves, i.e., perform each
experiment for increasingly large fractions of the training set, and plot the accuracy as a
function of the cardinality of the training set. Draw conclusions about the importance of
feature selection at the various points of the learning curve.
Plotting the learning curve: suppose you’re using cross-validation with N=10 folds.
Normally, for each of the 10 folds you use 9/10 of the data for training (TRAIN_i, i=1..10),
and 1/10 for testing (TEST_i, i=1..10). To build a learning curve, you also consider subsets of
the training data. That is, for each fold you repeat the experiment by using 10%, 20%, …,
100% of TRAIN_i for training, while still using the entire test set (TEST_i) for testing.
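The cross-validation and learning-curve procedure described in the tasks above could be organized along the following lines. The Python sketch is illustrative only (the assignment itself must be implemented in Lisp). `train_and_classify` is a placeholder for whichever algorithm is being evaluated (ID3, KNN.all, or KNN.FS), and the names `make_folds`, `cross_validate`, and `learning_curve` are hypothetical. It assumes there are at least as many examples as folds.

```python
import random

def make_folds(examples, n_folds=10, seed=0):
    """Shuffle the examples and deal them into n_folds test sets of nearly equal size."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_folds] for i in range(n_folds)]

def cross_validate(examples, train_and_classify, n_folds=10, train_fraction=1.0, seed=0):
    """Mean accuracy over the folds, training on a fraction of each fold's training set.

    train_and_classify(train, query) must return a predicted label, where train
    is a list of (feature_vector, label) pairs. The whole test fold is always used.
    """
    folds = make_folds(examples, n_folds, seed)
    accuracies = []
    for i, test in enumerate(folds):
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        # The training examples are already shuffled, so a prefix is a random subset.
        train = train[:max(1, round(train_fraction * len(train)))]
        correct = sum(train_and_classify(train, x) == y for x, y in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)

def learning_curve(examples, train_and_classify,
                   fractions=(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0)):
    """List of (fraction, mean accuracy) pairs, using the same folds for every fraction."""
    return [(f, cross_validate(examples, train_and_classify, train_fraction=f))
            for f in fractions]
```

Fixing the seed keeps the folds identical across algorithms and parameter settings, so differences in accuracy reflect the algorithms and not the split.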
Notes
1. The analysis of experiments constitutes the main point of this assignment. Therefore, you
should discuss your experimental results in detail, explaining the nature of your results and
observations.
2. Submit all your code, which should be accompanied by both internal and external
documentation. The maturity level of your documentation should be appropriate for an
advanced course in Computer Science. Naturally, the entire assignment should be
implemented in the Lisp programming language.
3. Your entire submission should be in hard copy (no electronic submission is necessary for this
assignment).
Submit your work in pairs. In order to find a partner, you may take advantage of the “Find a
partner” mechanism available on the course Web page at http://webcourse.technion.ac.il/236501
(see buttons on the left).
Good luck!
Questions? Evgeniy Gabrilovich (gabr@cs.technion.ac.il)
Flames? /dev/null