SI 650 Information Retrieval
HW 2 - Text Classification – due Friday, February 26
Instructors: Dragomir Radev and Qiaozhu Mei

In this project, you will experiment with different machine learning methods for text classification. Here is an overview of what you will do:

1. Extract feature vectors from a corpus of labeled documents.
2. Implement feature selection based on chi-square.
3. Use SVM light to train a model using your feature vectors.
4. Implement a simple text classifier: either a Naïve Bayes or a perceptron classifier. (Ask the instructors if you would rather write a different one or implement an SVM from scratch.)
5. Run experiments using the two different models with different numbers of features and different training set sizes. Report and analyze the results.

Documents as Feature Vectors

Each document can be represented as in the following table:

DOC   WORD    COUNT
123   base    2
123   bat     5
123   ball    1
123   score   2
234   puck    2
234   score   1
...   ...     ...

or as the following set of vectors:

             [base bat ball score puck goal]
f(DOC123) =  [  2    5    1     2    0    0 ],  class = baseball
f(DOC234) =  [  0    0    0     1    2    3 ],  class = hockey

Perceptron Model

You need to learn the weights wi that correspond to each of the dimensions of the vector. Let wt be the transpose of the weight vector. The decision boundary is then wt.f(x) = 0; in other words:

if wt.f(x) > 0: classify x as baseball
if wt.f(x) < 0: classify x as hockey

For example, if the weights w were

w = [1 1 1 0 -1 -1]

then:

wt.f(DOC123) = 1*2 + 1*5 + 1*1 + 0*2 + (-1)*0 + (-1)*0 = 8
Since 8 > 0, DOC123 is about baseball.

wt.f(DOC234) = 1*0 + 1*0 + 1*0 + 0*1 + (-1)*2 + (-1)*3 = -5
Since -5 < 0, DOC234 is about hockey.

Note that you can also have an intercept (bias) w0 that does not correspond to any particular word in the input document. In that case, the decision boundary is wt.f(x) + w0 = 0.

Also note that the perceptron is a binary classifier. To build a multiclass classifier, you should use the "one-vs-all" scheme: train N different binary classifiers, each one trained to distinguish the examples in a single class from the examples in all remaining classes. To classify a new example, run all N classifiers and choose the class whose classifier outputs the largest (most positive) value.
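To make the update rule concrete, here is a minimal sketch of a binary perceptron in Java (one of the suggested languages). It assumes dense double[] feature vectors and labels of +1 (baseball) / -1 (hockey); the class name, the learning rate eta, and the epoch-based training loop are illustrative choices, not requirements of the assignment:

// Minimal binary perceptron sketch; all names here are illustrative.
public class Perceptron {
    private final double[] w;   // one weight per feature
    private double w0;          // intercept (bias)
    private final double eta;   // learning rate

    public Perceptron(int numFeatures, double eta) {
        this.w = new double[numFeatures];
        this.eta = eta;
    }

    // wt.f(x) + w0
    public double score(double[] f) {
        double s = w0;
        for (int i = 0; i < w.length; i++) s += w[i] * f[i];
        return s;
    }

    // Returns +1 or -1 according to the decision boundary wt.f(x) + w0 = 0.
    public int classify(double[] f) {
        return score(f) > 0 ? 1 : -1;
    }

    // One pass over the training data; returns the number of mistakes so
    // the caller can stop once an epoch is mistake-free (converged).
    public int trainEpoch(double[][] X, int[] y) {
        int mistakes = 0;
        for (int n = 0; n < X.length; n++) {
            if (classify(X[n]) != y[n]) {         // misclassified example
                for (int i = 0; i < w.length; i++)
                    w[i] += eta * y[n] * X[n][i]; // move boundary toward it
                w0 += eta * y[n];
                mistakes++;
            }
        }
        return mistakes;
    }
}

For the ten-newsgroup task you would train ten such perceptrons one-vs-all and label each test document with the class whose perceptron returns the largest score.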
Naïve Bayes

The Naïve Bayes model is covered in detail in the book and lecture notes. Use add-one (Laplace) smoothing or a more advanced method to avoid zero probabilities. Make sure to use log probabilities instead of actual probabilities in order to avoid floating-point underflow problems.
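As a concrete illustration, below is a minimal sketch of multinomial Naïve Bayes classification with add-one smoothing, computed entirely in log space. The count arrays (classDocCount, wordCount, classTotal) are an assumed storage layout for the training statistics, not a required design:

// Minimal multinomial Naive Bayes scorer; all names are illustrative.
public class NaiveBayes {
    // classDocCount[c] = number of training documents labeled c
    // wordCount[c][w]  = occurrences of word w across class-c documents
    // classTotal[c]    = total word occurrences in class-c documents
    // V = vocabulary size, N = total number of training documents

    public static int classify(int[] docWords, int[] classDocCount,
                               int[][] wordCount, int[] classTotal,
                               int V, int N) {
        int bestClass = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int c = 0; c < classDocCount.length; c++) {
            // log prior: log P(c)
            double score = Math.log((double) classDocCount[c] / N);
            // add log P(w|c) for every token in the document, using
            // add-one smoothing: P(w|c) = (count(w,c) + 1) / (total(c) + V)
            for (int w : docWords) {
                score += Math.log((wordCount[c][w] + 1.0)
                                  / (classTotal[c] + (double) V));
            }
            if (score > bestScore) { bestScore = score; bestClass = c; }
        }
        return bestClass;   // class with the highest log probability
    }
}

Because each score is a sum of logs rather than a product of probabilities, nothing ever underflows to zero, and the class with the largest sum is the class with the largest posterior.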
Feature Selection

By default, you should use all words (after proper tokenization) that appear in the training documents as features for your classifier. You will then implement feature selection using the chi-square method and pick the top M most discriminative features (details below).
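For reference, a standard way to compute the chi-square score of a word w with respect to a class c uses a 2x2 contingency table of document counts over the training set. The sketch below assumes the four cells have already been counted; the class name, method name, and signature are illustrative:

// Chi-square score for one (word, class) pair; names are illustrative.
public class ChiSquare {
    // A = # of class-c training docs containing w
    // B = # of other-class training docs containing w
    // C = # of class-c training docs not containing w
    // D = # of other-class training docs not containing w
    public static double score(long A, long B, long C, long D) {
        double N = A + B + C + D;                      // total training docs
        double diff = (double) A * D - (double) C * B; // AD - CB
        double denom = (double) (A + C) * (B + D) * (A + B) * (C + D);
        // chi-square = N * (AD - CB)^2 / ((A+C)(B+D)(A+B)(C+D))
        return denom == 0.0 ? 0.0 : N * diff * diff / denom;
    }
}

Since this assignment has ten classes, one common convention is to score each word against each class and rank words by the maximum (or the class-prior-weighted average) of the ten scores, then keep the top M.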
Resources

• Training/Testing data – We will use a subset of the 20 Newsgroups data set. This dataset is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. The 20 Newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering. We will use data from only 10 of the newsgroups, each corresponding to a different topic. The documents are divided into training and test sets for you. In the training and testing sets you will see ten directories, '0' to '9'; these are the labels for the text files within them.

You can find the data at /data0/class/ir-w10/data/10newsgroup:

data/train/0    data/test/0
data/train/1    data/test/1
....            ....
data/train/9    data/test/9

• SVM implementation – We will be using the multiclass version of SVM light. It is installed at /data0/class/ir-w10/tools/svm_light and is straightforward to use. To train a model, type:

svm_multiclass_learn -c 1.0 example_file model_file

The -c parameter trades off margin size against training error. Try different values and pick the one that results in the best performance.

To classify test data based on a trained model, type:

svm_multiclass_classify example_file model_file output_file

(Of course, there are additional options that you should experiment with until you get the best performance.)

• Feature extraction – You can use your code from HW1 or Clairlib to extract the word frequencies from each document, or write your own extractors. Stemming and lowercasing are reasonable things to do. For simplicity, you should output your feature vectors in the SVM light file format, and write your classifier to accept input in the same format. Each line represents one training example and has the following format:

<line> .=. <target> <feature>:<value> <feature>:<value> ... <feature>:<value>
<target> .=. <integer>
<feature> .=. <integer>
<value> .=. <float>

The target value and each of the feature/value pairs are separated by a space character. Feature/value pairs MUST be ordered by increasing feature number. Features with value zero can be skipped. The target value denotes the class of the example via a positive (non-zero) integer. So, for example, the line

3 1:0.43 3:0.12 9284:0.2 # abcdef

specifies an example of class 3 for which feature number 1 has the value 0.43, feature number 3 has the value 0.12, feature number 9284 has the value 0.2, and all the other features have value 0.

Deliverables

You can work in languages like Perl or Java. It is probably easiest for you to continue with your code from project 1. (Third-party libraries like Clairlib and Lucene are also okay to use.)

(Optional) Install script – If you use any third-party libraries that need special installation, please include an ./install script. Note that you are NOT allowed to use third-party implementations of Naïve Bayes, perceptron, or decision tree.

Feature extraction – Your feature extractor reads documents from a corpus and creates feature vectors in the SVM light file format. It optionally performs chi-square feature selection.

./extract_features train_dir test_dir output [M feature_file]

- train_dir – path to the documents in the training set.
- test_dir – path to the documents in the testing set.
- output – produce labeled feature vectors in output.train and output.test for the training and test sets, respectively.
- M – (optional) number of features to keep with chi-square feature selection.
- feature_file – (optional) when feature selection is used, output the chi-square scores for all words to this file (each line should have: word <tab> score).

Use SVM light to train a classifier based on your extracted features and test it on your testing data. Include the best trained model (it should be named svm_model) in your deliverables.

Classifier – Implement a Naïve Bayes, perceptron, or decision tree classifier. You need to provide the following programs for us to call:

- To learn the model, we will call:

./learn examples.train model_file

The model_file should contain the parameters of your model (for instance, for the perceptron this is a list of words and their weights).

- To classify the examples in the test set, we will call:

./classify examples.test model_file output_file

The output_file should contain a tab-separated line for each document, as follows:

docid score computed_class correct_class y|n

The last column is y if the computed_class matches the correct_class and n otherwise. Your classify program should also write the combined accuracy (over all classes) on the testing set to STDOUT; in other words:

Accuracy = number of correctly labeled test docs * 100 / total number of test docs

- Best model – Submit the model that gave you the best performance (call it best_model).

Run experiments with different numbers of features and different numbers of labeled examples. You should run two sets of experiments: one with your classifier and one with SVM light. Use the following table to present a summary of your experiments:

                                Number of Features
                          10    100    1000    10000    100000    All
Size of Training     20
Set per Class       100
                    200
                    300
                    500

Note that if you use a perceptron, it may take a long time to converge, or it may not converge at all. When this happens, you should restart with different random initial weights.

Discussion – Deliver a README file with a description of your classifier, your experimental results (the table above), and a discussion of your results. This can be in txt, ps, or pdf. A significant portion of this assignment is the discussion of your observations. Some points that you could consider are:

- Comparing the accuracy and efficiency of your classifier with those of SVM light
- Explaining your choice of classifier (Naïve Bayes/perceptron)
- Discussing the effect and implications of varying the number of features and the number of training examples on accuracy
- For the perceptron, how does the value of eta affect the convergence rate? What about the number of features and the training set size?
- For Naïve Bayes, which smoothing method did you use, and why?
- These are just suggestions – feel free to add or change topics in your discussion as you see fit.

Grading

- Experimental write-up: 40 pts
  - Tables
  - Discussion
- Working code with documentation: 60 pts
  - Classifier, with its performance evaluated on unseen test data
  - Chi-square feature selection
  - Feature extraction

Extra credit

You can also attempt to incorporate one of the following techniques for up to 10 extra points. Extra credit will not be considered unless the technique used is well documented and shown to yield better results.

- Other feature selection algorithms
- Different methods for smoothing
- Your own implementation of a kernel to use with SVM
- Extraction of additional useful features (e.g., phrases)

Note: We encourage you to write code that is Clairlib compliant.

Submission

- Create a directory with your uniquename (mkdir XYZ).
- Copy the source code files and documentation into the directory, and include all the other files that are necessary for your program to run.
- Go back one level and tar the directory, using your uniquename as the tar file name (tar cvzf XYZ.tar.gz XYZ).
- Send the tar file to . The subject of the mail should be "IR HW2 <uniquename>".
- Please put any information necessary to run your code in the README file, not in the email.

Due Date: Friday, February 26, 2010 at noon.

Late Policy: For each day (or fraction of a day) that your homework is late, your grade will decrease by 10 points. If we cannot run the code as specified, we will not grade it; we will return it to you and penalize you 10 points each time we have to come back to you for a new version.

If you have questions, email us at radev@umich.edu or qmei@umich.edu.