Machine Learning Homework
Gaining familiarity with Weka, ML tools and algorithms

Goals for this homework
1. Learn how to use Weka, a collection of ML algorithms implemented in Java
2. Apply a few ML techniques to some standard datasets to see what happens
3. Learn how to connect to Weka’s API, so you can include calls to ML techniques in your own code.

Step 1: Install Weka, get data
• Download and install Weka on some machine that you will be able to use for a while. Lab machines will not have this, but if you don’t have access to another machine, let me know, and I will try to get this installed for you on a lab machine.
  http://www.cs.waikato.ac.nz/ml/weka/
• Also, download a dataset. For this homework, I will ask you to use the University of California-Irvine’s repository of machine learning datasets, and I will focus on this dataset:
  http://archive.ics.uci.edu/ml/datasets/Communities+and+Crime
• You can browse all of the datasets on UCI’s machine learning repository here:
  http://archive.ics.uci.edu/ml/datasets.html
  You won’t need these for this assignment, but you may be interested to see what datasets people have used in the past for ML research.

Step 2: Format the data
For the Communities and Crime dataset, you should open the file “communities.names” and scroll down to the section that looks like:

@relation crimepredict
@attribute state numeric
…
@data

Copy this whole section, and paste it into the top of the file called “communities.data”. Rename the file “communities.data” to “communities.arff”. This puts the data into a format that Weka can easily understand.

Step 3: Load data into Weka
• Start Weka. You should see a menu like the one on the right.
• Click on “Explorer”.
• In the “Preprocess” tab, click on “Open file.”
• Find the “communities.arff” file that you created, and open it.

How it should look when you load the data into Weka:

Task 1: Get familiar with the data (and Weka)
• Click on the “visualize” tab.
• Use the scatterplots under this tab to try to get a feel for what this data contains.
• We will use ViolentCrimesPerPop as the Y variable that we will try to predict in this dataset. What other variables seem to make a difference for predicting this variable, based on the plots you see?

Question 1: Write down the names of three different features that you think each have a significant predictive relationship with ViolentCrimesPerPop. For each one, briefly (one sentence or less) describe the relationship.

Task 2: Determining relationships between variables
Focus on the variables PctFam2Par and PctNotHSGrad. Both seem to have some correlation with ViolentCrimesPerPop.

Question 2: Based on the plots in the visualize tab, determine whether PctFam2Par and PctNotHSGrad are correlated with each other. Explain what evidence you have found to support your conclusion.

Prepping for a regression
• First, you will need to remove non-numeric attributes from the data, since most of Weka’s regression algorithms can’t handle such attributes.
• Click on the “preprocess” tab.
• You should see a list of all 128 attributes (including ViolentCrimesPerPop) on the left.
• Click the check box next to “communityname”.
• Click the button called “remove” at the bottom of the screen.
• You should be all set.

Task 3: Running a regression experiment
• Click on the “classify” tab.
• At the top, click the button called “choose”. You will see a list of many classifiers that are built in to Weka. Many of these are greyed out, since they can’t do regression. The non-grey ones are available for our experiment.
• Under “functions”, select “LinearRegression”.
• Under “test options”, select “percentage split”, and set the percentage to 66.
• Make sure that “(Num) ViolentCrimesPerPop” shows up in the dropdown list below the test options.
• Click “start”.
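When the run finishes, the summary in the “Classifier output” box includes error measures such as the correlation coefficient, mean absolute error, and root mean squared error. As a rough sketch of what those three numbers measure, here are their textbook definitions in plain Java. This is not Weka code: the class name RegressionMetrics and its methods are made up for illustration, the sample numbers are invented, and I am assuming Weka's reported values follow these standard definitions.

```java
/** Hypothetical helper, not part of Weka: textbook regression metrics. */
final class RegressionMetrics {

    /** Mean absolute error: the average size of the prediction errors. */
    static double meanAbsoluteError(double[] actual, double[] predicted) {
        double sum = 0.0;
        for (int i = 0; i < actual.length; i++) {
            sum += Math.abs(predicted[i] - actual[i]);
        }
        return sum / actual.length;
    }

    /** Root mean squared error: like MAE, but large errors count for more. */
    static double rootMeanSquaredError(double[] actual, double[] predicted) {
        double sum = 0.0;
        for (int i = 0; i < actual.length; i++) {
            double err = predicted[i] - actual[i];
            sum += err * err;
        }
        return Math.sqrt(sum / actual.length);
    }

    /** Pearson correlation between actual and predicted values. */
    static double correlation(double[] actual, double[] predicted) {
        double meanA = mean(actual);
        double meanP = mean(predicted);
        double cov = 0.0;
        double varA = 0.0;
        double varP = 0.0;
        for (int i = 0; i < actual.length; i++) {
            double da = actual[i] - meanA;
            double dp = predicted[i] - meanP;
            cov += da * dp;
            varA += da * da;
            varP += dp * dp;
        }
        return cov / Math.sqrt(varA * varP);
    }

    private static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    public static void main(String[] args) {
        // Invented numbers purely for illustration; they are not from the dataset.
        double[] actual = {1.0, 2.0, 3.0};
        double[] predicted = {2.0, 2.0, 5.0};
        System.out.println("MAE  = " + meanAbsoluteError(actual, predicted));
        System.out.println("RMSE = " + rootMeanSquaredError(actual, predicted));
        System.out.println("Corr = " + correlation(actual, predicted));
    }
}
```

Lower error values and a correlation closer to 1 mean the predictions sit closer to the actual values.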
Question: When the classifier finishes, copy the results from the “Classifier output” box to a text file called “linear-regression-results.txt”. You will email this to your TA when you’re done.

Task 4: Running a more complicated regression model
• This time we’ll try a Support Vector Machine.
• Click the “choose” box again, and under functions, select “SMOreg”.
• Use the same “test options” as before.
• Click start. (It may take 20-30 seconds to finish training.)

Question: Which model performed better in this experiment? How can you tell? Cite two pieces of evidence that tell you why SMOreg was better than linear regression, or vice versa.

Task 5: Running a clustering experiment
• Click on the “cluster” tab at the top.
• Click the “choose” button, and select “SimpleKMeans”.
• To the right of the “choose” button is a textbox that says “SimpleKMeans -N 2 …” Click anywhere in the textbox. It should bring up a new window.
• In the new window, under “numClusters”, change it from 2 to 10.
• Click “Ok”.
• Set the “cluster mode” to “use training set”.
• Click “start”.
• When this finishes, in the “Result-list” text area, right-click the most recently appeared line of text.
• Select “Visualize cluster assignments” from the popup menu.
• In the new window, change the “X” variable to “Cluster (Nom)”.
• Change the “Y” variable to “ViolentCrimesPerPop (Num)”.

Question: Did the K-means clustering algorithm do a good job of separating the data into clusters that have different violent crime rates? What evidence from the chart you just created supports your conclusion? (2 sentences max.)

Task 6: Weka API
• Write a Java class that runs an SMO regression (SMOreg) on the communities dataset. DO NOT write the code to do the SMO regression yourself; instead, call Weka’s API to make this happen. Submit your Java class to your TA.

Extra Credit Task 1 (1 point): Principal Components Analysis
We’re going to run a PCA on this dataset, and save it as a new dataset.
• Click on the “select attributes” tab.
• Click “choose”, select “PrincipalComponents”.
• In the popup window, click “Yes” to let Weka automatically select the “Ranker”.
• Click “start”.
• In the “result-list” window, right-click the most recently appeared line, and select “save transformed data…”. Save this data as a file called “transformed-communities.arff”.

Question: How many of the eigenvectors from this PCA have an eigenvalue greater than 1? What percentage of the total variance of the data does this subset of the eigenvectors represent?

Extra Credit Task 2 (2 points): Linear regression over PCA-transformed data
• Load the new dataset (transformed by PCA) into Weka.
• Run a linear regression with this new dataset. Again, use 66% of the data for training.

Question: How do the results for this linear regression compare with SMO regression and the previous linear regression? Cite two evaluation metrics in your comparison.

• Analyze the results some more: right-click on the most recently produced line under “Result list”, and select “visualize classifier errors”.

Question: When the regression made errors, did it more often predict too high a value, or too low a value?

To turn in
You should turn in a single zip archive called <your-name>.zip. It should contain:
• a text file called “answers.txt” with the answers to all of the questions in this homework
• your Java source code for Task 6 (don’t include the Weka jar, just your own code that references Weka). Please include a comment in the code to explain to the TA how to get the code to read in data from the TA’s own file.
• “linear-regression-results.txt”
• If you did the extra credit, “transformed-communities.arff”

Email or otherwise transfer this zip file to your TA.
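A note on Step 2: if you would rather script the header splice than copy and paste by hand, it could look something like the sketch below. It uses only the Java standard library, not Weka. The class name ArffSplicer and its methods are hypothetical, and the sketch assumes that “communities.names” and “communities.data” sit in the current directory and that the header section starts at the first line beginning with “@relation” and ends at the first “@data” line. It writes “communities.arff” as a new file instead of renaming “communities.data”.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper: builds an ARFF file from the .names header and the .data rows. */
final class ArffSplicer {

    /** Returns the lines from the first "@relation" line through the first "@data" line, inclusive. */
    static List<String> extractHeader(List<String> namesLines) {
        List<String> header = new ArrayList<>();
        boolean inHeader = false;
        for (String line : namesLines) {
            String trimmed = line.trim().toLowerCase(Locale.ROOT);
            if (!inHeader && trimmed.startsWith("@relation")) {
                inHeader = true;
            }
            if (inHeader) {
                header.add(line);
                if (trimmed.startsWith("@data")) {
                    break; // everything after @data in the .names file is ignored
                }
            }
        }
        return header;
    }

    /** Header lines first, then every data row unchanged. */
    static List<String> splice(List<String> namesLines, List<String> dataLines) {
        List<String> out = new ArrayList<>(extractHeader(namesLines));
        out.addAll(dataLines);
        return out;
    }

    public static void main(String[] args) throws IOException {
        // Assumed file locations; adjust the paths to wherever you saved the downloads.
        Path names = Paths.get("communities.names");
        Path data = Paths.get("communities.data");
        if (!Files.exists(names) || !Files.exists(data)) {
            System.err.println("Put communities.names and communities.data in the current directory.");
            return;
        }
        // ISO-8859-1 is used so that any stray byte decodes without an exception.
        List<String> arff = splice(
                Files.readAllLines(names, StandardCharsets.ISO_8859_1),
                Files.readAllLines(data, StandardCharsets.ISO_8859_1));
        Files.write(Paths.get("communities.arff"), arff, StandardCharsets.ISO_8859_1);
    }
}
```

Open the resulting file in a text editor before loading it into Weka, to check that the first line is the “@relation” line and that the data rows follow directly after “@data”.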