Homework 2 CS 5950: Machine Learning Thom Lake July 2, 2013

Homework 2
CS 5950: Machine Learning
Thom Lake
Western Michigan University
July 2, 2013
Introduction
This homework assignment is designed to help you get started working with data. It is due
next Tuesday (2013.07.09). For this assignment we will be using a preprocessed subset of the
20 Newsgroups data. You can find a description and a copy of the data we will be working with
on the course page: http://homepages.wmich.edu/~tcp8889/ml/links.html.
Part 1: Working with Text
Make a histogram showing the frequency of each topic. The frequencies should look similar
(though not identical) to the image on the course page.
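As a starting point, the topic counts can be computed before any plotting is done. The sketch below is only an illustration: it assumes the topic labels have already been loaded into a Python list, the label strings are toy placeholders rather than the actual course data, and the helper names `topic_frequencies` and `text_histogram` are hypothetical. The text rendering stands in for whatever plotting tool you prefer.

```python
from collections import Counter


def topic_frequencies(labels):
    """Count how many documents carry each topic label."""
    return Counter(labels)


def text_histogram(counts, width=40):
    """Render counts as rows of '#' bars; the largest count spans `width` characters."""
    if not counts:
        return ""
    largest = max(counts.values())
    rows = []
    # Most frequent topic first; ties are ordered alphabetically by name.
    for name, n in sorted(counts.items(), key=lambda kv: (-kv[1], str(kv[0]))):
        bar = "#" * max(1, round(width * n / largest))
        rows.append(f"{name:<24} {bar} {n}")
    return "\n".join(rows)


# Toy placeholder labels (an assumption), one entry per document.
labels = ["sci.space", "rec.autos", "sci.space", "sci.med", "sci.space", "rec.autos"]
counts = topic_frequencies(labels)
print(text_histogram(counts))
```

The same `counts` mapping can be handed to a plotting library to produce a graphical histogram.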
Find the frequency of each word (the number of times it appears). Make a histogram showing
the frequencies of the 100 most frequent words. Make another histogram showing the frequencies
of the 101st through 200th most frequent words. What do you notice? Using the terminology from chapter 2 of
MLPP, how would you describe these distributions?
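One possible way to get the two ranked slices of word counts is sketched below. It assumes each document is already available as a list of word tokens, which may not match the layout of the preprocessed files; the toy documents and the helper names `word_frequencies` and `rank_slice` are hypothetical.

```python
from collections import Counter


def word_frequencies(documents):
    """Total number of occurrences of each word across all documents."""
    counts = Counter()
    for tokens in documents:
        counts.update(tokens)
    return counts


def rank_slice(counts, start, stop):
    """Return (word, count) pairs for ranks start..stop (1-indexed, inclusive)."""
    ranked = counts.most_common()  # sorted from most to least frequent
    return ranked[start - 1:stop]


# Toy placeholder documents (an assumption): each is a list of tokens.
documents = [["the", "cat", "sat"], ["the", "dog", "sat", "the"], ["a", "cat"]]
counts = word_frequencies(documents)
top = rank_slice(counts, 1, 2)     # the two most frequent words
after = rank_slice(counts, 3, 4)   # the third and fourth most frequent words
print(top, after)
```

For the assignment, the two slices would be `rank_slice(counts, 1, 100)` and `rank_slice(counts, 101, 200)`.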
Part 2: Our First Classifier
The k-NN classifier is a simple classifier which can perform surprisingly well in many cases.
Given C classes, the k-NN classifier estimates the class posterior distribution as
$$p(y = c \mid x) = \frac{1}{k} \sum_{x' \in N_k(x)} \mathbb{I}\{y' = c\}, \qquad c \in \{1, \dots, C\},$$
where $\mathbb{I}\{a\}$ is the indicator function, which takes the value 1 if $a$ is true and 0 otherwise, $N_k(x)$ is
the set of $k$ nearest neighbors of $x$, and $y'$ is the label of $x'$.
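The estimate above is simply the fraction of the k neighbors that carry each label. A minimal sketch, assuming the labels of the k nearest neighbors have already been found (the function name `knn_posterior` and the toy labels are hypothetical):

```python
from collections import Counter


def knn_posterior(neighbor_labels, classes):
    """Estimate p(y = c | x) as the share of the k neighbors labeled c."""
    k = len(neighbor_labels)
    if k == 0:
        raise ValueError("need at least one neighbor")
    votes = Counter(neighbor_labels)
    # Classes with no votes get probability 0 (Counter returns 0 for missing keys).
    return {c: votes[c] / k for c in classes}


# Toy example with k = 3 neighbors and C = 3 classes.
posterior = knn_posterior(["a", "b", "a"], ["a", "b", "c"])
print(posterior)
```

Because every neighbor contributes exactly one vote, the estimated probabilities sum to 1 whenever all neighbor labels belong to `classes`.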
Write code implementing the k-NN classifier for the 20 Newsgroups dataset using the distance
function
$$d(x, x') = |W_x \cup W_{x'}| - |W_x \cap W_{x'}|,$$
where $W_x$ is the set of words appearing in document $x$. Split your dataset into training, validation,
and test sets. Use the validation set to find a good k, and estimate the accuracy using
the test set. Think of and implement possible improvements to our rather simple classification
algorithm. Report your findings.
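The pieces described above could fit together roughly as follows. This is a sketch under stated assumptions, not a reference solution: documents are assumed to be Python sets of words, the 60/20/20 split proportions and all function names (`distance`, `knn_predict`, `accuracy`, `split_indices`, `choose_k`) are hypothetical choices, and the toy documents are made up for illustration.

```python
import random
from collections import Counter


def distance(words_a, words_b):
    """d(x, x') = |W_x ∪ W_x'| - |W_x ∩ W_x'| for two sets of words."""
    return len(words_a | words_b) - len(words_a & words_b)


def knn_predict(query, train_docs, train_labels, k):
    """Return the majority label among the k training documents closest to `query`."""
    order = sorted(range(len(train_docs)),
                   key=lambda i: distance(query, train_docs[i]))
    votes = Counter(train_labels[i] for i in order[:k])
    # On a tied vote, the label counted first (from the closer neighbor) wins.
    return votes.most_common(1)[0][0]


def accuracy(docs, labels, train_docs, train_labels, k):
    """Fraction of `docs` whose predicted label matches the true label."""
    correct = sum(knn_predict(d, train_docs, train_labels, k) == y
                  for d, y in zip(docs, labels))
    return correct / len(docs)


def split_indices(n, seed=0, train_frac=0.6, val_frac=0.2):
    """Shuffle 0..n-1 and cut into train/validation/test index lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(train_frac * n)
    n_val = int(val_frac * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


def choose_k(candidates, val_docs, val_labels, train_docs, train_labels):
    """Pick the candidate k with the highest validation accuracy."""
    return max(candidates,
               key=lambda k: accuracy(val_docs, val_labels, train_docs, train_labels, k))


# Toy placeholder data (an assumption): each document is a set of words.
train_docs = [{"rocket", "orbit", "nasa"}, {"orbit", "moon"},
              {"engine", "car", "tires"}, {"car", "dealer"}]
train_labels = ["space", "space", "autos", "autos"]
val_docs = [{"moon", "rocket", "orbit"}, {"car", "tires"}]
val_labels = ["space", "autos"]

best_k = choose_k([1, 3], val_docs, val_labels, train_docs, train_labels)
print(best_k, accuracy(val_docs, val_labels, train_docs, train_labels, best_k))
```

Note that this distance equals the size of the symmetric difference of the two word sets, so `len(words_a ^ words_b)` would give the same value. Sorting all training documents for every query is the simplest approach but is slow on the full dataset, which is one place to look for improvements.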
Imagine you used the same data for training and testing and set k = 1. What would the
estimated accuracy be?
What to turn in
Turn in a printed copy of the code you’ve written (monospace font, READABILITY COUNTS!)
along with the results you’ve generated and any other exposition you wish to include. Please
include your name, the date, and the assignment number.