SI 650 Information Retrieval
HW 2 - Text Classification – due Friday, February 26
Instructor: Dragomir Radev and Qiaozhu Mei
In this project, you will experiment with different machine learning methods for text
classification.
Here’s an overview of what you’ll do:
1. Extract feature vectors from a corpus of labeled documents.
2. Implement feature selection based on Chi-square.
3. Use SVM light to train a model using your feature vectors.
4. Implement a simple text classifier: either a Naïve Bayes, or a perceptron classifier.
(Ask the instructors if you would rather write a different one or implement an SVM
from scratch).
5. Run experiments using the two different models and different numbers of features and
training set size. Report and analyze the results.
Documents as Feature Vectors
Each document can be represented as in the following table:

DOC   WORD    COUNT
123   base    2
123   bat     5
123   ball    1
123   score   2
234   puck    2
234   score   1
...   ...     ...
or as the following set of vectors:
[base bat ball score puck goal]
f(DOC123) = [ 2 5 1 2 0 0 ], class = baseball
f(DOC234) = [ 0 0 0 1 2 3 ], class = hockey
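The mapping from the count table to fixed-length vectors can be sketched as follows. This is an illustrative example only (all names are hypothetical); note that the vocabulary here is sorted alphabetically, so the dimension order differs from the [base bat ball score puck goal] ordering shown above.

```python
# Per-document word counts, as in the table above.
docs = {
    123: {"base": 2, "bat": 5, "ball": 1, "score": 2},   # baseball
    234: {"puck": 2, "score": 1, "goal": 3},             # hockey
}

# Shared vocabulary across all documents (sorted alphabetically).
vocab = sorted({word for counts in docs.values() for word in counts})

def to_vector(counts, vocab):
    """Project a document's word-count dict onto the shared vocabulary order."""
    return [counts.get(word, 0) for word in vocab]

print(to_vector(docs[123], vocab))  # -> [1, 2, 5, 0, 0, 2]
```

Every document is thereby mapped to a vector of the same length, with zeros for vocabulary words it does not contain.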
Perceptron Model
You need to learn the weights wi that correspond to each of the dimensions of the vector. Let
wt be the transpose of the weight vector. Then the decision boundary will be wt.f(x) = 0, or in
other words:
if wt.f(x) > 0: classify x as baseball
if wt.f(x) < 0: classify x as hockey
For example, if the weights w were: w = [1 1 1 0 -1 -1]
Then: wt.f(DOC123) = 1*2 + 1*5 + 1*1 + 0*2 + (-1)*0 + (-1)*0 = 8. Since 8 > 0, DOC123 is
about baseball.
wt.f(DOC234) = 1*0 + 1*0 + 1*0 + 0*1 + (-1)*2 + (-1)*3 = -5. Since -5 < 0, DOC234 is
about hockey.
Note that you can also have an intercept (bias) w0 which doesn't correspond to a particular
word in the input document. In that case, the decision boundary is going to be wt.f(x) + w0 = 0.
Also note that the Perceptron model is a binary classifier. To build a multiclass classifier, you
should use the “one-vs-all” scheme. To do this, you should train N different binary classifiers,
each one trained to distinguish the examples in a single class from the examples in all
remaining classes. To classify a new example, run all N classifiers and choose the class whose
classifier outputs the largest (most positive) value.
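A minimal binary perceptron along these lines can be sketched as below. This is a hedged sketch, not a reference implementation: the learning rate eta, the epoch cap, and the zero initialization are all illustrative choices, and shuffling/random restarts (discussed later in the handout) are omitted for brevity.

```python
def train_perceptron(X, y, eta=1.0, epochs=100):
    """X: list of feature vectors; y: labels in {+1, -1}.
    Returns (weights, bias) after mistake-driven updates."""
    n = len(X[0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        errors = 0
        for x, label in zip(X, y):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            pred = 1 if score > 0 else -1
            if pred != label:            # mistake: nudge w toward the example
                w = [wi + eta * label * xi for wi, xi in zip(w, x)]
                b += eta * label
                errors += 1
        if errors == 0:                  # converged: training set is separated
            break
    return w, b
```

For the one-vs-all scheme, you would train one such classifier per class (with that class's examples labeled +1 and all others -1) and, at test time, pick the class whose classifier returns the largest score.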
Naïve Bayes
The Naïve Bayes model is covered in detail in the book and lecture notes. Use add-one
(Laplace) smoothing or a more advanced method to avoid zero probabilities. Make sure to
use log probabilities instead of actual probabilities in order to avoid floating-point underflow
problems.
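The two points above (add-one smoothing, summing log probabilities) can be sketched as follows. This is an illustrative multinomial Naïve Bayes only; the function names and the training-data layout are assumptions, not part of the assignment spec.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (class_label, token_list) pairs.
    Returns per-class (log prior, smoothed log likelihoods, unseen-word default)."""
    class_counts = Counter(label for label, _ in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, tokens in docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    V = len(vocab)
    model = {}
    for label in class_counts:
        total = sum(word_counts[label].values())
        log_prior = math.log(class_counts[label] / len(docs))
        # Add-one (Laplace) smoothing: (count + 1) / (total + |V|)
        log_lik = {w: math.log((word_counts[label][w] + 1) / (total + V))
                   for w in vocab}
        default = math.log(1 / (total + V))  # word never seen in this class
        model[label] = (log_prior, log_lik, default)
    return model

def classify_nb(model, tokens):
    """Pick the class maximizing log P(c) + sum of log P(w|c) -- sums of logs,
    never products of raw probabilities, to avoid underflow."""
    best, best_score = None, float("-inf")
    for label, (log_prior, log_lik, default) in model.items():
        score = log_prior + sum(log_lik.get(w, default) for w in tokens)
        if score > best_score:
            best, best_score = label, score
    return best
```

Note the `default` term: without smoothing, a single word unseen in a class would zero out the whole product (negative infinity in log space).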
Feature Selection
By default, you should use all words (after proper tokenization) that appear in the training
documents as features for your classifier. Then you will implement feature selection using the
Chi-square method and pick the top M most discriminative features (details below).
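One common way to score a (term, class) pair is the 2x2 contingency-table form of the chi-square statistic; a sketch is below. This is an assumption about the intended method (the handout leaves the details to you), and the variable names A-D are the standard contingency-cell labels, not names from the assignment.

```python
def chi_square(A, B, C, D):
    """Chi-square score for one (term, class) pair from a 2x2 contingency table.
    A: docs in the class that contain the term
    B: docs outside the class that contain the term
    C: docs in the class that lack the term
    D: docs outside the class that lack the term"""
    N = A + B + C + D
    denom = (A + B) * (C + D) * (A + C) * (B + D)
    if denom == 0:          # term appears in all docs or no docs, etc.
        return 0.0
    return N * (A * D - B * C) ** 2 / denom
```

For a multiclass setting you might, for example, take each word's maximum score over the ten classes and keep the top M words; a word occurring independently of the class (A/B proportional to C/D) scores zero.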
Resources
• Training/Testing data – We will use a subset of the 20 Newsgroups data set. This dataset
is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly
across 20 different newsgroups. The 20 newsgroups collection has become a popular data set
for experiments in text applications of machine learning techniques, such as text
classification and text clustering. We will only use data in 10 of the newsgroups, each
corresponding to a different topic.
The documents are divided into training and test sets for you. In the training/testing set you
will see ten directories '0' to '9'. These are the labels for the text files within them. You can
find the data at /data0/class/ir-w10/data/10newsgroup.
data/train/0    data/test/0
data/train/1    data/test/1
....            ....
data/train/9    data/test/9
• SVM implementation – We will be using the multiclass version of SVMLight. It is
installed at /data0/class/ir-w10/tools/svm_light. It’s very straightforward to use. To train a
model, type:
svm_multiclass_learn -c 1.0 example_file model_file
C is a parameter that trades off margin size and training error. Try
different values and pick the one that results in the best performance.
To classify test data based on a trained model, type:
svm_multiclass_classify example_file model_file output_file
(Of course, there are additional options that you should experiment with until you get the
best performance.)
• Feature extraction – You can use your code from HW1 or Clairlib to extract the word
frequencies from each document, or write your own extractors. Stemming and lowercasing
are reasonable things to do. For simplicity, you should output your feature vectors in the
SVM light file format, and write your classifier to accept them in the same format.
Each line represents one training example and has the following format:
<line> .=. <target> <feature>:<value> <feature>:<value> ...
<feature>:<value>
<target> .=. <integer>
<feature> .=. <integer>
<value> .=. <float>
The target value and each of the feature/value pairs are separated by a space character.
Feature/value pairs MUST be ordered by increasing feature number. Features with value zero
can be skipped. The target value denotes the class of the example via a positive (non-zero)
integer. So, for example, the line
3 1:0.43 3:0.12 9284:0.2 # abcdef
specifies an example of class 3 for which feature number 1 has the value 0.43, feature
number 3 has the value 0.12, feature number 9284 has the value 0.2, and all the other features
have value 0.
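Writing a feature vector in this format can be sketched with a small helper like the one below (the function name is illustrative). It enforces the two rules stated above: feature/value pairs sorted by increasing feature number, and zero-valued features skipped.

```python
def to_svmlight_line(target, feature_values):
    """target: positive integer class label.
    feature_values: dict mapping feature id (int) -> value (float).
    Returns one SVM-light-format line: '<target> <f>:<v> <f>:<v> ...'"""
    pairs = " ".join(f"{f}:{v}"
                     for f, v in sorted(feature_values.items())  # ascending ids
                     if v != 0)                                  # skip zeros
    return f"{target} {pairs}".rstrip()

print(to_svmlight_line(3, {9284: 0.2, 1: 0.43, 3: 0.12}))
# -> 3 1:0.43 3:0.12 9284:0.2
```

One such line per document, written to output.train and output.test, gives files both SVMLight and your own classifier can read.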
Deliverables
You can work in languages like Perl or Java. It’s probably easiest for you if you continue
with your code from project 1. (Third-party libraries like Clairlib and Lucene would also be
okay to use.)
(Optional) Install script – If you use any third-party libraries that need special installation,
please include an ./install script. Note that you are NOT allowed to use third party
implementations of NB, perceptron or decision tree.

Feature extraction – Your feature extractor reads documents from a corpus and creates
feature vectors in the SVM light file format. It optionally performs Chi-square feature
selection.
./extract_features train_dir test_dir output [M feature_file]
- train_dir – Path to documents in the training set.
- test_dir – Path to documents in the testing set.
- output – You should produce labeled feature vectors in output.train and output.test for
the training and test sets, respectively.
- M – (optional) Number of features to keep with chi-square feature selection.
- feature_file – (optional) When feature selection is used, output the chi-square scores for
all words to this file (e.g., each line should have: word <tab> score).

Use SVM light to train a classifier based on your extracted features and test it on your
testing data. Include the best trained model (should be named svm_model) in your
deliverables.

Classifier – Implement a Naïve Bayes, perceptron, or decision tree classifier. You must
provide the following programs for us to call:
- To learn the model:
./learn examples.train model_file
The model_file should contain parameters for your model (for instance, for the
perceptron this is a list of words and their weights).
- To classify examples in the test set, we will call:
./classify examples.test model_file output_file
The output_file should contain a tab-separated line for each document as follows:
docid score computed_class correct_class y|n
The last column outputs y if the computed_class matches the correct_class and n
otherwise.
Your classify program should also write, to STDOUT, the combined accuracy (over all
classes) on the testing set, in other words:
Accuracy = Number of correctly labeled test docs * 100/Total number of test docs
- Best Model – Submit the model that gave you the best performance (you should call it
best_model).

Run experiments with different numbers of features and different numbers of labeled
examples. The following table should be used to present a summary of your experiments.
You should run two sets of experiments: one with your classifier and one with SVM
Light.
Size of Training |                  Number of Features
Set per Class    |  10   |  100  | 1000  | 10000 | 100000 |  All
-----------------+-------+-------+-------+-------+--------+------
       20        |       |       |       |       |        |
      100        |       |       |       |       |        |
      200        |       |       |       |       |        |
      300        |       |       |       |       |        |
      500        |       |       |       |       |        |
Note that if you use a perceptron, it may take a long time to converge, or not converge at all.
When this happens, you should restart with different random initial weights.

Discussion – Deliver a README file with a description of your classifier, experimental
results (the table above), and a discussion of your results. This can be in txt, ps or pdf. A
significant portion of this assignment is a discussion of your observations. Some points
that you could consider are:
- Comparing the accuracy and efficiency of your classifier with that of SVM Light
- Explaining your choice of classifier (Naïve Bayes/perceptron)
- Discussing the effect and implications of varying the number of features and number of
training examples on accuracy
- For the perceptron, how does the value of eta affect the convergence rate? What about
the number of features/training size?
- For Naïve Bayes, which smoothing method did you use, and why?
These are just suggestions – feel free to add or change topics in your discussion as you
see fit.
Grading
- Experimental write-up: 40 pts
 Tables
 Discussion
- Working code with documentation: 60 pts
 Classifier / classifier performance evaluated on unseen test data
 Chi-square feature selection
 Feature extraction
Extra credit
You can also attempt to incorporate one of the following techniques for up to 10 extra points.
Extra credit will not be considered unless the technique used is well documented and shown
to yield better results.
 Other feature selection algorithms
 Different methods for smoothing
 Your own implementation of a kernel to use with SVM
 Extraction of additional useful features (e.g., phrases)
Note: We encourage you to write code which is Clairlib compliant.
Submission
- Create a directory with your uniquename (mkdir XYZ).
- Copy the source code files and documentation into the directory and include all the other
files that are necessary for your program to run.
- Go back one level and tar the directory. Give your uniquename as the tar file name (tar cvzf XYZ.tar.gz XYZ).
- Send the tar file to . The subject of the mail should be "IR HW2 <uniquename>".
- Please write any necessary information to run your code into a Readme file and not in the
email.
Due Date: Friday, February 26, 2010 at noon.
Late Policy: Each day (or fraction of a day) that your homework is late, your grade will
decrease by 10 points.
If we cannot run the code as specified, we will not grade it. We will return it to you and
penalize you by 10 points for each time we have to go back to you for a new version.
If you have questions, email us at radev@umich.edu or qmei@umich.edu