Homework 2
CS 5950: Machine Learning
Thom Lake
Western Michigan University
July 2, 2013

Introduction

This homework assignment is designed to help you get started working with data. It is due next Tuesday (2013.07.09). For this assignment we will be using a preprocessed subset of the 20 Newsgroups data. You can find a description and a copy of the data we will be working with on the course page: http://homepages.wmich.edu/~tcp8889/ml/links.html

Part 1: Working with Text

1. Make a histogram showing the frequency of each topic. The frequencies should look similar (though not identical) to the image on the course page.
2. Find the frequency of each word (the number of times it appears). Make a histogram showing the frequencies of the 100 most frequent words. Make another histogram showing the frequencies of the 101st through 200th most frequent words. What do you notice? Using the terminology from chapter 2 of MLPP, how would you describe these distributions?

Part 2: Our First Classifier

The k-NN classifier is a simple classifier which can perform surprisingly well in many cases. Given $C$ classes, the k-NN classifier estimates the class posterior distribution as

$$p(y = c \mid x) = \frac{1}{k} \sum_{x' \in N_k(x)} \mathbb{I}\{y' = c\}, \qquad c \in \{1, \dots, C\},$$

where $\mathbb{I}\{a\}$ is the indicator function, which takes value 1 if $a$ is true and 0 otherwise, $N_k(x)$ is the set of $k$ nearest neighbors of $x$, and $y'$ is the label of $x'$.

1. Write code implementing the k-NN classifier for the 20 Newsgroups dataset using the distance function $d(x, x') = |W_x \cup W_{x'}| - |W_x \cap W_{x'}|$, where $W_x$ is the set of words appearing in document $x$.
2. Split your dataset into training, validation, and test sets. Use the validation set to find a good $k$, then estimate the accuracy using the test set.
3. Think of and implement possible improvements to our rather simple classification algorithm. Report your findings.
4. Imagine you used the same data for training and testing and set $k = 1$. What would the estimated accuracy be?
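The distance function and posterior estimate from Part 2 can be sketched as follows. This is only a minimal illustration, assuming each document has already been reduced to a Python set of words. The function names, the toy documents, and the labels `sports` and `computers` are made up for this example and are not part of the assignment data. Ties in distance are broken by training order, which is one arbitrary choice among several.

```python
from collections import Counter


def word_set_distance(words_a, words_b):
    """d(x, x') = |W_x union W_x'| - |W_x intersect W_x'| for two sets of words."""
    return len(words_a | words_b) - len(words_a & words_b)


def knn_posterior(query_words, train_word_sets, train_labels, k):
    """Estimate p(y = c | x) as the fraction of the k nearest neighbors with label c."""
    # Rank training documents by distance to the query. sorted() is stable,
    # so equally distant documents keep their training order (an assumption).
    order = sorted(
        range(len(train_word_sets)),
        key=lambda i: word_set_distance(query_words, train_word_sets[i]),
    )
    nearest = order[:k]
    counts = Counter(train_labels[i] for i in nearest)
    # Divide by the number of neighbors actually used, which is k unless
    # the training set has fewer than k documents.
    return {label: count / len(nearest) for label, count in counts.items()}


def knn_predict(query_words, train_word_sets, train_labels, k):
    """Return the label with the highest estimated posterior."""
    posterior = knn_posterior(query_words, train_word_sets, train_labels, k)
    return max(posterior, key=posterior.get)


if __name__ == "__main__":
    # Hypothetical toy data, not drawn from the 20 Newsgroups files.
    docs = [
        {"hockey", "game", "team"},
        {"hockey", "ice", "goal"},
        {"gpu", "driver", "card"},
        {"gpu", "monitor", "card"},
    ]
    labels = ["sports", "sports", "computers", "computers"]
    query = {"hockey", "team", "goal"}
    print(knn_posterior(query, docs, labels, k=3))
    print(knn_predict(query, docs, labels, k=3))
```

The distance above equals the size of the symmetric difference of the two word sets, so `len(words_a ^ words_b)` would give the same value.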
What to turn in

Turn in a printed copy of the code you’ve written (monospace font, READABILITY COUNTS!) along with the results you’ve generated and any other exposition you wish to include. Please include your name, the date, and the assignment number.