Assignment 4

ADDIS ABABA UNIVERSITY
BUSINESS INTELLIGENCE, DATA WAREHOUSING, AND
DATA MINING
CIT 828: Summer 2009
HW Assignment 4
Due: July 13th 2009
Note:
In the sections that follow, there are 6 things to submit.
PART 0: LOADING DATA FILES INTO WEKA
By now you have installed and run Weka in Assignments 1, 2, and 3. The data you have been using for the hands-on
assignments (and will use for one of the data mining competition datasets) comes from the KDD-Cup 1998 Competition.
Feel free to research what other researchers have done to solve this problem (not mandatory). [1]
I have provided two datasets for you on the course website, TRAIN4.arff and TEST4.arff. These data are sampled from a
much larger dataset.
When you build your classifier to predict target_b, remove target_d from both the training and test set
ARFF files.
When you build your numeric predictor to predict target_d, remove target_b from both the training and
test set ARFF files.
PART I: CALCULATING RESULTING PROFIT
This dataset includes responses from people who donated to a fundraiser after receiving notification about the fundraiser in the
mail. In this example, we will assume that each notification mailed to a potential donor cost $0.85.
Build a decision tree (J48) classifier on your training set and apply it to your test set. Right-click on the run in the results
section on the classification tab and select Visualize Classifier Errors. Click Save and save the new arff file as
RESULTS4.arff. Remember that this new arff file will include your predictions (the second-to-last attribute, and the
second-to-last column in the data portion of your arff file).
Question 1: Please copy and paste the text in the classifier output window (summary, detailed accuracy by class,
confusion matrix) into your assignment.
Now open the results file in Excel to calculate profit. You will need both the predictions in the RESULTS4.arff file and the
dollar amounts (target_d) in the TEST4.arff file to calculate profit. NOTE: select ‘Delimited’ and then ‘Comma’ when
opening the .arff file in Excel.
profit = (- mailing cost * number of mailings) + sum (response amounts in dollars for true positives)
You can get the dollar amounts from target_d. A true positive is someone who was mailed to and actually responded.
Calculate the profit if you mailed to everyone in the test set.
Now calculate the profit if you mailed only to the people that your decision tree model labeled positive.
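The two profit calculations above can also be checked outside Excel. The following is a minimal Python sketch of the profit formula, not part of the required workflow. The toy values are hypothetical (they are not taken from TEST4.arff), and the sketch assumes target_d is 0 for people who did not respond, so summing target_d over everyone mailed gives the response amounts for the true positives.

```python
MAILING_COST = 0.85  # cost of one mailing, as stated in the assignment


def profit(mailed, donations, mailing_cost=MAILING_COST):
    """profit = -(mailing cost * number of mailings) + sum(donations of mailed responders).

    mailed:    list of bools, True if that test example was mailed to
    donations: actual target_d amounts (assumed 0 for non-responders)
    """
    n_mailings = sum(mailed)
    revenue = sum(d for m, d in zip(mailed, donations) if m)
    return revenue - mailing_cost * n_mailings


# Hypothetical toy data standing in for TEST4.arff / RESULTS4.arff.
target_d = [0.0, 10.0, 0.0, 25.0, 0.0]   # actual donation amounts
predicted = [0, 1, 1, 0, 0]              # decision tree labels (1 = mail)

profit_everyone = profit([True] * len(target_d), target_d)
profit_tree = profit([p == 1 for p in predicted], target_d)

print(f"mail to everyone: {profit_everyone:.2f}")
print(f"decision tree:    {profit_tree:.2f}")
```

With real data, the two lists would be read from the matching rows of TEST4.arff and RESULTS4.arff instead of being typed in.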
Question 2: Compare the profit for the “decision tree” and “mail to everyone” models. Would you prefer to use your
decision tree model?
PART II: GENERATING CLASS PROBABILITY ESTIMATES
The goal of this part of the assignment is to generate class probability estimates so that you may combine them with dollar
amounts to calculate expected revenue. You will build a decision tree classifier again. However, this time, you will produce
class probability estimates (CPEs) in addition to the predicted class labels.
I have provided two options for generating CPEs for you to select from. It is only necessary for you to use one method to get
the probability estimates for this assignment.
Option 1: Command Line
NOTE: To get class probability estimates (CPEs) from a classifier, either to combine them with a numeric prediction to
calculate expected response or to rank examples by CPE, you may want to use the command line. You will find an
example command below. The example produces a file called “filename.probs” that includes the
probability estimate for each test example. You can get to the command line in Windows by typing cmd in the “Run”
dialogue box (shown below for Windows Vista). You can get to the Run dialogue box by typing Run in the search field on
the Start menu.
[1] http://www.kdnuggets.com/meetings/kdd98/kdd-cup-98.html#data
NOTE: THE PROBABILITIES WILL BE THE CPEs OF THE PREDICTED CLASS. ASSUMING YOUR POSITIVE
CLASS IS 1 AND NEGATIVE CLASS IS 0: IF YOU WANT TO PUT EVERYTHING IN TERMS OF CPES FOR THE
CLASS =1, YOU WILL HAVE TO TAKE 1-CPE WHEN THE PREDICTED CLASS =0.
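The conversion described in the note above can be sketched in a few lines of Python. This is only an illustration for a two-class problem. The function name and the sample rows are hypothetical, and the positive class is assumed to be labeled 1.

```python
def cpe_for_positive(predicted_class, cpe_of_predicted, positive_class=1):
    """Return the CPE for the positive class in a two-class problem.

    The command line output reports the probability of the PREDICTED class,
    so when the predicted class is the negative one we take 1 - CPE.
    """
    if predicted_class == positive_class:
        return cpe_of_predicted
    return 1.0 - cpe_of_predicted


# Hypothetical (predicted class, CPE of predicted class) pairs.
rows = [(1, 0.876), (0, 0.845), (0, 0.95), (1, 0.744)]
positive_cpes = [round(cpe_for_positive(c, p), 3) for c, p in rows]
print(positive_cpes)
```

The same rule can be written in Excel with an IF formula over the predicted-class and probability columns.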
If you want to use the command line option, please ask me about it earlier than the night before the assignment is
due so that I can help you get started. You need to tell the command the exact location of the training and test
files as well as the location of the Weka jar file. The easiest way is to copy your training and test .arff files to the Weka
directory, and then use the following command line:
java -cp weka.jar weka.classifiers.trees.J48 -t TRAIN4b.arff -T TEST4b.arff -p 0 >filename.probs
You can open the resulting file in Excel, for example, using the ‘fixed width’ option in the import wizard. If you would rather
use the user interface, follow the steps under Option 2 below.
Option 2: User Interface
You may use the user interface to generate class probability estimates for your problem. Specify "Output predictions" under
Classify->More options. Then run your decision tree classifier again. Now, in addition to the results summary you are used to,
you will see predictions listed in the following form:
=== Predictions on test set ===
inst#, actual, predicted, error, probability distribution
1 ? 2:tested_p + 0.124 *0.876
2 ? 1:tested_n + *0.845 0.155 ...
14 ? 1:tested_n + *0.95 0.05
15 ? 2:tested_p + 0.256 *0.744
The "+" indicates an error between actual classification and predicted classification. (Since in the example above all labels
were ?, there is always an error.) NOTE: in the data mining competition you will be given a test set without any class labels
(the class labels will all be ?). It will be your job to label the test examples.
The "*" flag indicates the highest probability, i.e., the associated class value would be returned. The highest probability is
the probability that the example belongs to the predicted class (not necessarily the “positive” class).
You should right-click on the J48 run in the results buffer tab and then click on save results buffer. Give the results file a name.
You can edit this file to keep only the probability estimates. For example, you can open the file in a text editor, remove the lines
above and below the predictions on the test set, replace all “*” and “+” characters with a space, save the file as a text file,
and then open the resulting file in Excel using the import wizard with the space delimited option. Alternatively, you can open
the file in Excel directly and use, for example, the function ‘MID’. You will have to manipulate the data a little to get it into a
form that you can work with. Ultimately you want to have a column with the CPEs for the positive class. So, you need a column
with the CPE score for Class=1 for all of your test examples in the same order that they appear in the classifier output.
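As an alternative to hand-editing the results buffer, the cleanup steps above can be sketched in Python. This is a minimal sketch that assumes the whitespace-separated layout shown in the example predictions (other Weka versions may format the output differently). It also assumes the positive class is the second entry of the probability distribution. The function name is hypothetical.

```python
def parse_prediction_line(line, positive_index=1):
    """Parse one line such as '1 ? 2:tested_p + 0.124 *0.876'.

    Returns (instance number, predicted label, CPE of the positive class).
    positive_index is the position of the positive class in the
    probability distribution (an assumption; check your ARFF header).
    """
    # Drop the error flag and any trailing ellipsis from the listing.
    tokens = [t for t in line.split() if t not in ("+", "...")]
    inst = int(tokens[0])
    predicted_label = tokens[2].split(":", 1)[1]
    # Remaining tokens are the distribution; '*' marks the predicted class.
    distribution = [float(t.lstrip("*")) for t in tokens[3:]]
    return inst, predicted_label, distribution[positive_index]


# The example lines from the predictions listing above.
lines = [
    "1 ? 2:tested_p + 0.124 *0.876",
    "2 ? 1:tested_n + *0.845 0.155 ...",
    "14 ? 1:tested_n + *0.95 0.05",
    "15 ? 2:tested_p + 0.256 *0.744",
]
parsed = [parse_prediction_line(line) for line in lines]
for inst, label, cpe in parsed:
    print(inst, label, cpe)
```

Because the full distribution is printed, reading the positive-class column directly avoids the 1-CPE step needed with the command line output.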
PART III: GENERATING NUMERIC PREDICTIONS
Now build a numeric predictor using linear regression for target_d. You should go back to the original TRAIN4.arff and
TEST4.arff files. You should remove the binary target (target_b) for both files; make sure you select ‘supplied test set’ in the
test set options. Store the numeric predictions in a file (You should know how to store your model predictions to a file now).
Question 3: Copy and paste the text in your classifier output (=== Evaluation on test set ===, === Summary ===).
PART IV: CALCULATING EXPECTED REVENUE
Combine the CPEs you found in PART II with the Results arff file you created in PART III in an Excel Spreadsheet. The idea
is to make sure the rows corresponding to the same test examples match up. Now, for each test example, you should have two
predictions available, the CPE and numeric prediction. Multiply the two values corresponding to each test example to get the
expected revenue for each test case.
Let’s say we want to use a new model based on an “expected revenue cutoff”. Our new model will classify everything with
expected revenue ≥ $0.85 as a 1 and everything with expected revenue < $0.85 as a 0.
Recall:
profit = - mailing cost * number of mailings + sum (response amounts for true positives)
Calculate profit with your new model.
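The expected revenue calculation and the cutoff model can be sketched in Python as a cross-check on the spreadsheet. This is an illustrative sketch only. The function names and the toy numbers are hypothetical, and it assumes the three lists are aligned row by row and that actual donation amounts are 0 for non-responders.

```python
MAILING_COST = 0.85  # also used as the expected revenue cutoff


def expected_revenue(cpes, predicted_amounts):
    """Expected revenue per test example: CPE (class 1) * predicted target_d."""
    return [p * a for p, a in zip(cpes, predicted_amounts)]


def cutoff_profit(cpes, predicted_amounts, actual_amounts, cutoff=MAILING_COST):
    """Mail only to examples with expected revenue >= cutoff, then apply
    profit = -(mailing cost * number of mailings) + sum(actual responses)."""
    mailed = [er >= cutoff for er in expected_revenue(cpes, predicted_amounts)]
    revenue = sum(a for m, a in zip(mailed, actual_amounts) if m)
    return revenue - MAILING_COST * sum(mailed)


# Hypothetical toy rows: CPE from PART II, prediction from PART III,
# actual target_d from TEST4.arff.
cpes = [0.10, 0.30, 0.20, 0.02]
predicted_amounts = [12.0, 10.0, 3.0, 15.0]
actual_amounts = [0.0, 20.0, 0.0, 0.0]

result = cutoff_profit(cpes, predicted_amounts, actual_amounts)
print(f"expected revenue cutoff profit: {result:.2f}")
```

In this toy data the first two rows clear the cutoff, so two mailings are paid for and one donation is collected.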
Question 4: Compare the profit for the “expected revenue cutoff” and “mail to everyone” models. Which model would
you prefer to use?
Question 5: Copy and paste the first 20 lines of your Excel spreadsheet into your assignment (so that I can see that you
generated both predictions and calculated expected revenue).
Question 6: List one reason why your expected revenue model may not perform well.
NOTE: If we were performing this profit analysis in the real world, we would evaluate our methods using cross-validation.
However, for this assignment, we just want you to get the mechanics down for generating class probability estimates.
KDD-CUP Data Set (from the website)
The data set for the KDD Cup 1998 was generously provided by the Paralyzed Veterans of America (PVA). PVA is a
not-for-profit organization that provides programs and services for US veterans with spinal cord injuries or disease. With an
in-house database of over 13 million donors, PVA is also one of the largest direct mail fund raisers in the country.
Participants in the CUP will demonstrate the performance of their tool by analyzing the results of one of PVA's recent fund
raising appeals. This mailing was dropped in June 1997 to a total of 3.5 million PVA donors. It included a gift "premium" of
personalized name & address labels plus an assortment of 10 note cards and envelopes. All of the donors who received this
mailing were acquired by PVA through premium-oriented appeals like this. The analysis data set will include:
A subset of the 3.5 million donors sent this appeal
A flag to indicate respondents to the appeal and the dollar amount of their donation
PVA promotion and giving history
Overlay demographics, including a mix of household and area level data.