ADDIS ABABA UNIVERSITY
BUSINESS INTELLIGENCE, DATA WAREHOUSING, AND DATA MINING
CIT 828: Summer 2009
HW Assignment 4
Due: July 13th 2009

Note: In the sections that follow, there are 6 things to submit.

PART 0: LOADING DATA FILES INTO WEKA

By now you have installed and run Weka in Assignments 1, 2, and 3. The data you have been using for the hands-on assignments (and will use for one of the data mining competition datasets) comes from the KDD-Cup 1998 Competition. Feel free to research what other researchers have done to solve this problem (not mandatory). [1]

I have provided two datasets for you on the course website, TRAIN4.arff and TEST4.arff. These data are sampled from a much larger dataset. When you are building your classifier to predict target_b, you should remove target_d from both the training and test set ARFF files. When you are building your numeric predictor to predict target_d, you should remove target_b from both the training and test set ARFF files.

PART I: CALCULATING RESULTING PROFIT

This dataset includes responses from people who donated to a fundraiser after receiving notification about the fundraiser in the mail. In this example, we will assume that each notification sent to a potential donor cost $0.85 to mail.

Build a decision tree (J48) classifier on your training set and apply it to your test set. Right-click on the run in the results section on the classification tab and select Visualize Classifier Errors. Click Save and save the new ARFF file as RESULTS4.arff. Remember that this new ARFF file will include your predictions (the second-to-last attribute, and the second-to-last column in the data portion of the file).

Question 1: Please copy and paste the text in the classifier output window (summary, detailed accuracy by class, confusion matrix) into your assignment.

Now open the results file in Excel to calculate profit.
You will need both the predictions in the RESULTS4.arff file and the dollar amounts (target_d) in the TEST4.arff file to calculate profit.

NOTE: Select ‘Delimited’ and then ‘Comma’ when opening the .arff file in Excel.

    profit = (- mailing cost * number of mailings) + sum (response amounts in dollars for true positives)

You can get the dollar amounts from target_d. A true positive is someone who was mailed to and actually responded.

Calculate the profit if you mailed to everyone in the test set. Now calculate the profit if you mailed only to the people that your decision tree model labeled positive.

Question 2: Compare the profit for the “decision tree” and “mail to everyone” models. Would you prefer to use your decision tree model?

PART II: GENERATING CLASS PROBABILITY ESTIMATES

The goal of this part of the assignment is to generate class probability estimates so that you can combine them with dollar amounts to calculate expected revenue. You will build a decision tree classifier again. This time, however, you will produce class probability estimates (CPEs) in addition to the predicted class labels. I have provided two options for generating CPEs. You only need to use one of them to get the probability estimates for this assignment.

Option 1: Command Line

NOTE: To get class probability estimates (CPEs) from a classifier, either to combine them with a numeric prediction to calculate expected response or to rank examples by the CPEs, you may want to use the “command line”. You will find an example command below. The example will produce a file called “filename.probs” that includes the probability estimate for each test example.

You can get to the command line in Windows by typing cmd in the “Run” dialogue box. In Windows Vista, you can get to the Run dialogue box by typing Run in the search field on the Start menu.
[1] http://www.kdnuggets.com/meetings/kdd98/kdd-cup-98.html#data

NOTE: THE PROBABILITIES WILL BE THE CPEs OF THE PREDICTED CLASS. ASSUMING YOUR POSITIVE CLASS IS 1 AND NEGATIVE CLASS IS 0: IF YOU WANT TO PUT EVERYTHING IN TERMS OF CPEs FOR THE CLASS = 1, YOU WILL HAVE TO TAKE 1 - CPE WHEN THE PREDICTED CLASS = 0.

If you want to use the command line option, please ask me about it earlier than the night before the assignment is due so that I can help you get started. You need to tell the command the exact location of the training and test files as well as the location of the Weka jar file. The easiest way is to copy your training and test .arff files to the Weka directory and then use the following command line:

    java -cp weka.jar weka.classifiers.trees.J48 -t TRAIN4b.arff -T TEST4b.arff -p 0 >filename.probs

You can open the resulting file in Excel, for example, using the ‘fixed width’ option in the import wizard. If you would rather use the user interface, follow the steps under Option 2 below.

Option 2: User Interface

You may use the user interface to generate class probability estimates for your problem. Specify "Output predictions" under Classify->More options. Then run your decision tree classifier again. Now, in addition to the results summary you are used to, you will see predictions listed in the following form:

    === Predictions on test set ===

    inst#, actual, predicted, error, probability distribution
        1      ?  2:tested_p     +   0.124 *0.876
        2      ?  1:tested_n     +  *0.845  0.155
      ...
       14      ?  1:tested_n     +  *0.95   0.05
       15      ?  2:tested_p     +   0.256 *0.744

The "+" indicates an error between the actual classification and the predicted classification. (Since all labels in the example above were ?, there is always an error.)

NOTE: In the data mining competition you will be given a test set without any class labels (every class label will be ?). It will be your job to label the test examples.
The "*" flag indicates the highest probability, i.e., the associated class value would be returned. The highest probability is the probability that the example belongs to the predicted class (not necessarily the “positive” class).

Right-click on the J48 run in the results buffer tab and then click on save results buffer. Give the results file a name. You can edit this file to keep only the probability estimates. For example, you can open the file in a text editor, remove the lines above and below the predictions on the test set, replace all “*” and “+” characters with spaces, save the file as a text file, and then open the resulting file in Excel using the import wizard with the space-delimited option. Alternatively, you can open the file in Excel directly and use, for example, the function ‘MID’.

You will have to manipulate the data a little to get it into a form that you can work with. Ultimately you want a column with the CPEs for the positive class. So, you need a column with the CPE score for Class = 1 for all of your test examples, in the same order that they appear in the classifier output.

PART III: GENERATING NUMERIC PREDICTIONS

Now build a numeric predictor for target_d using linear regression. Go back to the original TRAIN4.arff and TEST4.arff files. Remove the binary target (target_b) from both files, and make sure you select ‘supplied test set’ in the test set options. Store the numeric predictions in a file (you should know how to store your model predictions to a file by now).

Question 3: Copy and paste the text in your classifier output (the Summary under "Evaluation on test set").

PART IV: CALCULATING EXPECTED REVENUE

Combine the CPEs you found in PART II with the results ARFF file you created in PART III in an Excel spreadsheet. The idea is to make sure that the rows corresponding to the same test examples match up. Now, for each test example, you should have two predictions available: the CPE and the numeric prediction.
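As an alternative to the spreadsheet manipulation, the conversion from "CPE of the predicted class" to "CPE of the positive class" described in PART II can be sketched in Python. This is only a sketch under stated assumptions: it assumes each prediction line is whitespace-separated, that the third column has the form index:label (for example 2:1), that an optional "+" error column may be present, and that the last column is the probability of the predicted class. The exact layout of Weka's prediction output differs between versions, so check it against your own file. The function names and the sample lines are hypothetical.

```python
def positive_class_cpe(predicted_label, predicted_cpe, positive_label="1"):
    """Return P(class == positive_label) for a binary problem,
    given the probability of the class that was actually predicted."""
    if predicted_label == positive_label:
        return predicted_cpe
    # The predicted class is the negative one, so take 1 - CPE.
    return 1.0 - predicted_cpe


def parse_prediction_line(line):
    """Parse one line of the ASSUMED form: inst# actual predicted [+] probability."""
    fields = line.replace("*", "").split()
    predicted_label = fields[2].split(":", 1)[1]  # "2:1" -> "1"
    predicted_cpe = float(fields[-1])             # last column: CPE of predicted class
    return predicted_label, predicted_cpe


# Hypothetical sample lines, not real output from TEST4.arff.
sample_lines = [
    "1 1:0 2:1 + 0.876",  # predicted class 1 with CPE 0.876
    "2 1:0 1:0 0.845",    # predicted class 0 with CPE 0.845
]

cpes = [positive_class_cpe(*parse_prediction_line(line)) for line in sample_lines]
print(cpes)
```

The resulting list holds one positive-class CPE per test example, in the same order as the classifier output, which is the column needed for the expected revenue calculation.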
Multiply the two values corresponding to each test example to get the expected revenue for each test case.

Let’s say we want to use a new model based on an “expected revenue cutoff”. Our new model will classify everything with expected revenue ≥ $0.85 as 1 and everything with expected revenue < $0.85 as 0.

Recall:

    profit = (- mailing cost * number of mailings) + sum (response amounts in dollars for true positives)

Calculate profit with your new model.

Question 4: Compare the profit for the “expected revenue cutoff” and “mail to everyone” models. Which model would you prefer to use?

Question 5: Cut and paste the first 20 lines of your Excel spreadsheet into your assignment (so that I can see that you generated both predictions and calculated expected revenue).

Question 6: List one reason why your expected revenue model may not perform well.

NOTE: If we were performing this profit analysis in the real world, we would evaluate our methods using cross-validation. However, for this assignment, we just want you to get the mechanics down for generating class probability estimates.

KDD-CUP Data Set (from the website)

The data set for the KDD Cup 1998 was generously provided by the Paralyzed Veterans of America (PVA). PVA is a not-for-profit organization that provides programs and services for US veterans with spinal cord injuries or disease. With an in-house database of over 13 million donors, PVA is also one of the largest direct mail fund raisers in the country.

Participants in the CUP will demonstrate the performance of their tool by analyzing the results of one of PVA's recent fund raising appeals. This mailing was dropped in June 1997 to a total of 3.5 million PVA donors. It included a gift "premium" of personalized name & address labels plus an assortment of 10 note cards and envelopes. All of the donors who received this mailing were acquired by PVA through premium-oriented appeals like this.
The analysis data set will include:

- A subset of the 3.5 million donors sent this appeal
- A flag to indicate respondents to the appeal and the dollar amount of their donation
- PVA promotion and giving history
- Overlay demographics, including a mix of household and area level data
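For reference, the profit formula from PART I and the expected revenue cutoff from PART IV can be sketched in Python as follows. This is a minimal illustration, not the required method (the assignment asks for a spreadsheet). It assumes that target_d is 0 for people who did not respond, so summing target_d over everyone who was mailed equals the sum of response amounts for true positives. The function names and all numbers in the toy data are hypothetical and are not taken from TEST4.arff.

```python
MAILING_COST = 0.85  # cost per mailing in dollars, from PART I


def profit(mailed, donations, mailing_cost=MAILING_COST):
    """profit = -(mailing cost * number of mailings) + donations from mailed responders.

    mailed:    list of 0/1 flags, 1 if the person is mailed to
    donations: actual target_d values (ASSUMED to be 0 for non-responders)
    """
    n_mailed = sum(mailed)
    revenue = sum(d for m, d in zip(mailed, donations) if m)
    return revenue - mailing_cost * n_mailed


def expected_revenue_cutoff(cpes, predicted_amounts, cutoff=MAILING_COST):
    """Label 1 when CPE * predicted donation >= cutoff, otherwise 0."""
    return [1 if p * a >= cutoff else 0 for p, a in zip(cpes, predicted_amounts)]


# Hypothetical toy data for five test examples.
target_d = [0.0, 10.0, 0.0, 25.0, 0.0]    # actual donations
cpe_pos = [0.02, 0.10, 0.05, 0.03, 0.01]  # P(target_b = 1) from PART II
pred_d = [12.0, 15.0, 10.0, 40.0, 8.0]    # numeric predictions from PART III

mail_everyone = [1] * len(target_d)
mail_by_cutoff = expected_revenue_cutoff(cpe_pos, pred_d)

print("mail to everyone:        ", profit(mail_everyone, target_d))
print("expected revenue cutoff: ", profit(mail_by_cutoff, target_d))
```

The same profit function also covers Question 2: pass the decision tree's 0/1 predictions as the mailed flags instead of the cutoff labels.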