Tucker Chambers
CS105 Project
12/10/2010
Data Mining for Insights Into Movie-Advertisement Placement
I. INTRODUCTION
For my project I applied data-mining techniques to a dataset from the perspective of a movie
producer in order to discover the best places to put movie ads. I analyzed several attributes regarding media
habits (radio, internet, TV, newspaper, etc.) and their relation to how frequently people go to the
movies. Analyzing this data could hypothetically enable a movie producer to discover the types of
media exposure that predict frequent movie-going, which would allow more effective and efficient ad
placement. To perform my analysis, I used Weka, a Python program, and multiple data graphics.
II. DATASET DESCRIPTION
The dataset I analyzed is the Fast Food dataset from the STARS website. It can be obtained at the
following URL: http://stars.ac.uk/showSubject1.php?subjectID=5. This dataset was originally designed
to analyze the factors that may affect the amount of fast food that people eat per month, such as
exposure to different kinds of media and TV watching habits. I didn’t find this data arrangement to be
particularly interesting, so I modified the data to analyze the factors which affect how often people
attend movies.
III. DATA PREPARATION
1. Manual Modifications: I began by making several manual modifications to the original Fast
Food dataset. These were high-level changes to the layout and structure of the dataset that were
not feasible to include in my automated data preparation process.
• Removed attributes that only applied to the original Fast Food dataset: All fast food / month,
First mentioned, Last bought, Peer influence, Brand importance.
• Removed Household income. Although this attribute may have been useful and predictive, I
removed it for two reasons:
   i. From the perspective of a movie producer, when ads are placed you do not necessarily know
      the income of the people who are exposed to them, so this attribute is not helpful to the goals
      of my project even if it is predictive.
   ii. This attribute had a very high number of missing values; including income would have forced
      me to sacrifice a large portion of my data by removing such rows.
• Removed Watch TV Cur Aff. I am not sure what type of show Cur Aff refers to. It might mean
"current affairs," but even so it is unclear what kind of show would qualify as a "current affairs"
show. There are no explanations of the attributes on the source website, so there is no way to
know. I didn't want to include anything ambiguous or confusing, as that would make it difficult
to properly analyze my results.
• Removed Watch morning TV for similar reasons. Does this refer to watching TV in the morning
hours, watching a morning news show, or watching morning talk shows? This is unclear and
could signify a number of different things. Again, I didn't want to include anything that is
ambiguous.
• Removed the ID attribute. This is simply a numeric identifier of the row and would be in no way
predictive of movie-going; it would only confuse my algorithms.
• Shifted Go to movies to the last column, since it is my output attribute.
• See Appendix A for a list of the remaining attributes.
2. Automated Data Preparation: I used a Python program to remove lines with missing or
problematic data. There were many problematic lines, and it would have been tedious and
time-consuming to remove them manually, so it made sense to automate this process.
• All lines with empty or unknown values were removed (the program searched for any instances
of "*", "Don't know", "Unknown", and zero-length strings).
• All apostrophes were removed, as they are problematic in Weka.
• All instances of "Not at all" were replaced with "Never" to make the values easier to handle.
• After this process was completed, 385 lines remained.
• See Appendix B for the Python data preparation program that was used.
3. Data-Mining Preparation: The final step was to prepare the dataset for data-mining in Weka.
After ensuring that my dataset was in a CSV file, I loaded the data into Weka and performed the
following:
• Randomized the data using Weka's randomize algorithm. This ensures a good balance across the
data so that the results are not lopsided once the test examples are split off.
• Split the data into training and test examples. I used a 75/25 split (i.e. about 75% of the data for
training examples and about 25% for test examples). This resulted in 290 training examples and
95 test examples (a scripted version of this step is sketched below).
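Outside of Weka, the same randomize-and-split step could also be scripted directly. The following is a minimal sketch of that idea; the 75/25 proportion and the GoToMovies.csv filename come from the steps above, while the output filenames (train.csv, test.csv) and the random seed are arbitrary choices of my own, so the exact 290/95 counts may differ by a row depending on rounding.

import random

# Read the prepared dataset: header row plus one example per line
with open("GoToMovies.csv") as f:
    lines = f.read().splitlines()
header, examples = lines[0], lines[1:]

# Shuffle the examples so the train/test split is balanced
random.seed(42)  # arbitrary seed, for repeatability only
random.shuffle(examples)

# 75/25 split into training and test examples
cut = int(round(len(examples) * 0.75))
splits = {"train.csv": examples[:cut], "test.csv": examples[cut:]}

for filename, subset in splits.items():
    with open(filename, "w") as out:
        out.write("\n".join([header] + subset) + "\n")

print("%d training and %d test examples" %
      (len(splits["train.csv"]), len(splits["test.csv"])))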
IV. DATA ANALYSIS
The goal of the analysis was to develop a model that accurately predicts the movie-going behavior of
people based on media exposure and several demographic attributes. This analysis could potentially be
used by movie producers to determine the most effective types of media, places, and demographic
groups to advertise their movies to. To find the best model, I explored a number of different
classification-learning algorithms. The following table summarizes the results of each algorithm on the
training and test data:
Algorithm    Training Data Accuracy    Test Data Accuracy
PART         87.93%                    54.74%
OneR         67.93%                    58.95%
NBTree       78.28%                    62.11%
JRip         68.97%                    64.21%
J48          77.24%                    65.26%

• PART
o Explanation of Algorithm: This algorithm creates a decision list using the separate-and-conquer
strategy. The decision list is often very long and complicated, which can lead to high accuracy on
training data but low accuracy on new examples.
o Analysis: On the training data, PART created a very long and complicated decision list based on
many attributes, which led to overfitting. The model created by PART was highly accurate on the
training data (87.93%), but due to overfitting it was the most inaccurate on the test examples (54.74%).
• OneR
o Explanation of Algorithm: This algorithm finds the one attribute that makes the fewest prediction
errors (i.e. the attribute that on its own is the most predictive) and uses only that one attribute to
predict the output.
o Analysis: On the training data, OneR found that the attribute with the fewest errors was
(interestingly enough) Pass billboards. This model predicted that if you pass billboards regularly or
occasionally, you will see movies occasionally, and if you never pass billboards, you will never see
movies. This model had 67.93% accuracy on the training data and 58.95% accuracy on the test data.
• NBTree
o Explanation of Algorithm: This algorithm generates a decision tree with a Naive Bayes classifier at
each of the leaves.
o Analysis: On the training data, the model generated by NBTree had a high accuracy of 78.28%, but
on the test data it performed less well (probably due to overfitting), with an accuracy of 62.11%.
• JRip
o Explanation of Algorithm: This algorithm implements a propositional rule learner using Repeated
Incremental Pruning to Produce Error Reduction (RIPPER). Basically, JRip creates a long and
complicated set of accurate rules, and then "prunes" a number of rules from the list to prevent
overfitting and improve accuracy on previously unseen examples.
o Analysis: The JRip algorithm produced a surprisingly simple model with two rules:
    if Age = 55-70 and Watch TV cooking = Yes then Go to movies = Never
    else Go to movies = Occasionally
This model was 68.97% accurate on the training data and 64.21% accurate on the test data. Of all of
the algorithms, JRip lost the least accuracy when moving from training to test examples. Unfortunately,
the test accuracy was still too low for this model to be chosen.
• J48
o Explanation of Algorithm: Like JRip, this algorithm builds a model and then prunes it to prevent
overfitting; in this case the model is a decision tree. However, compared with JRip, J48 seems to do
much less pruning (i.e. the final model is more complicated), since J48 ended up with 33 leaves while
JRip ended up with just 2 rules.
o Analysis: The pruned tree produced by J48 had 33 leaves, with an overall size of 48. I found the
attributes chosen at the top of the tree (i.e. those with the highest goodness score) very interesting.
The first two splits were on Internet and Education, followed by Watch TV music and Pass billboards.
This model had 77.24% accuracy on the training set and 65.26% accuracy on the test set, which was
the highest accuracy on the test examples. See Appendix C for the full decision tree.
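As a side note, this comparison does not have to be run through the Weka GUI. The short sketch below shows one way it could be scripted, assuming Weka's weka.jar is available locally and that the training and test splits have been exported to Weka's ARFF format (train.arff and test.arff are hypothetical filenames); it simply invokes Weka's standard command-line evaluation, where -t names the training file and -T names the test file, for each of the five classifiers.

import subprocess

# Paths are assumptions for illustration; adjust to the local Weka installation
WEKA_JAR = "weka.jar"
TRAIN, TEST = "train.arff", "test.arff"

# Fully qualified class names of the five Weka classifiers compared above
classifiers = [
    "weka.classifiers.rules.PART",
    "weka.classifiers.rules.OneR",
    "weka.classifiers.trees.NBTree",
    "weka.classifiers.rules.JRip",
    "weka.classifiers.trees.J48",
]

for clf in classifiers:
    print("=== " + clf + " ===")
    # -t: training file, -T: test file; Weka reports accuracy on both
    subprocess.call(["java", "-cp", WEKA_JAR, clf, "-t", TRAIN, "-T", TEST])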
V. RESULTS
In my analysis of the data I used models produced by the PART, OneR, NBTree, JRip, and J48
algorithms. The specific analyses of these algorithms are described above, so I have provided a brief,
general summary of all five algorithms, and then an in-depth discussion of the chosen algorithm, J48.
Algorithm Results
• The OneR algorithm produced a model that was far too simple (OneR is based on only one rule) to
generalize well to unseen examples. The Pass billboards attribute on its own is not able to accurately
predict unseen test examples.
• The PART and NBTree models both used overly complicated and precise decision lists/trees that
worked well on the training data but did not generalize well to the test data. Both of these models
seemed to overfit the training data, and thus were much less accurate on the test data.
• The JRip and J48 algorithms both produced decision lists/trees and then pruned them to prevent
overfitting. Of all the models, these two lost the least accuracy when moving from the training set to
the test set, most likely because this careful pruning prevents overcomplication. In the end, the J48
algorithm was the most accurate on the test set by a small margin, with an accuracy of 65.26%.
J48 Results and Implementation
Since the J48 algorithm had the highest accuracy on the test data, it is the best classification-learning
algorithm to use for the purposes of this project. According to this model, the best attributes (i.e. those
with the highest goodness score) for predicting movie-going include internet usage, followed by type of
education, and then Watch TV music, Pass billboards, and hours of TV watched on weekdays/weekends
(see the graphic below and Appendix C for the full tree).
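To make the chosen model more concrete, the top of the pruned tree in Appendix C can be read as a small set of hand-coded rules. The sketch below is only an illustration of how the first few splits (Internet, then Education) could be applied as a quick screening rule; it is not the full 33-leaf tree, and the function name predict_movie_going is simply my own label for it.

def predict_movie_going(internet, education):
    """Rough screening rule taken from the top of the J48 tree (Appendix C).

    internet:  "User" or "Non-user"
    education: "Secondary", "Tech college", "Graduate", or "Postgraduate"
    """
    if internet == "User":
        return "Occasionally"   # leaf: Occasionally (184.0/52.0)
    # Non-users are split further by education in the full tree
    if education == "Postgraduate":
        return "Never"          # leaf: Never (4.0)
    if education == "Graduate":
        return "Occasionally"   # leaf: Occasionally (1.0)
    # The Secondary and Tech college branches depend on further attributes
    # (Watch TV music, Pass billboards, TV hours); fall back to
    # "Occasionally" here as a simple placeholder default.
    return "Occasionally"

print(predict_movie_going("Non-user", "Postgraduate"))  # -> Never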
For the most part, the fact that these were chosen as the best attributes does make some intuitive
sense in the real world. For example, the data suggests that internet users are much more likely to see
movies than non-users. If you use the internet this may suggest that you are willing and able to expose
yourself to various types of media such as movies.
However, several of the chosen attributes are quite interesting and do not make much intuitive sense.
For example, the data suggests that exposure to music television and to billboards is a highly decisive
factor in whether you see movies. There is no clear explanation for this. Furthermore, it is surprising
that age and sex are not even present in the J48 tree; it would have made intuitive sense if these factors
predicted movie-going (it is a common conception that young people see more movies).
To illustrate the potential usefulness of this algorithm in predicting movie-going, I have created
several graphical representations for some of the most useful attributes according to the J48 model. With
these graphics, you can see how powerful some of these attributes are for movie-going predictions.
Internet Usage – Movie Frequency: This suggests that internet usage predicts at least occasional
movie-going, while non-users do not go to movies as often. This could potentially inform a movie
producer that the internet is an effective medium for advertising a movie.
Hours Watching TV – Movie Frequency: This suggests that the more TV that someone watches, the
more likely they are to see movies (at least occasionally), while those who watch less TV (or none at all)
are less likely to see movies. This could potentially inform a movie producer that television is an
effective medium for movie advertising, especially for those who watch TV frequently.
Education – Movie Frequency: This suggests that those who have only completed Secondary
education and/or Tech college are more likely to see movies at least occasionally, while those who have
completed graduate or post-graduate work may be less likely to see movies. This could inform a movie
producer that it may be more effective to direct advertising towards movie-goers with less education.
Pass Billboards – Movie Frequency: This suggests that those who pass billboards could be more likely
to see movies (at least occasionally) than those who do not pass billboards. This could inform a movie
producer that billboard ads may be effective to reach consumers who occasionally go to see movies.
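Charts like the ones described above can be regenerated from the prepared CSV with a short script. The following is a minimal sketch of one way to do it; it assumes the pandas and matplotlib libraries are installed, uses the GoToMovies.csv file produced by the Appendix B program, and assumes the column headers match the attribute names shown in the Weka output (Internet, Go to movies).

import pandas as pd
import matplotlib.pyplot as plt

# Load the prepared dataset (output of the Appendix B program)
data = pd.read_csv("GoToMovies.csv")

# Count movie-going frequency within each internet-usage group
counts = data.groupby(["Internet", "Go to movies"]).size().unstack(fill_value=0)

# Grouped bar chart: one cluster of bars per internet-usage value
counts.plot(kind="bar")
plt.xlabel("Internet usage")
plt.ylabel("Number of respondents")
plt.title("Internet Usage vs. Movie Frequency")
plt.tight_layout()
plt.savefig("internet_vs_movies.png")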
VI. CONCLUSION
Classification-learning algorithms can be a very useful tool in many applications, such as movie
advertising. In this project I analyzed a dataset of attributes regarding movie-goers and determined
the best algorithm for predicting movie-going behavior. In my analysis I determined that the pruned
decision-tree algorithm J48 creates the most accurate and effective model for predicting whether
someone goes to movies regularly, occasionally, or never. Further, through the results of the J48
algorithm and the use of visuals, I discovered that internet usage, hours watching TV, education, and
billboard exposure may be predictive factors for people who go to movies at least occasionally.
Data-mining projects such as this could be invaluable in applications such as movie advertising, as they
allow companies to use their limited resources to reach consumers in the most effective way possible.
The key point to remember is that no model is perfect, and these models and predictive attributes are
merely suggestions; in real life there is always variation beyond what is predicted. Yet in the expensive
world of movie-making, every dollar counts, and a suggestion is better than nothing.
APPENDIX A: Dataset Attributes
All attributes are nominal:
AGE (15-17, 18-24, 25-35, 36-54, 55-70)
SEX (Male/Female)
INTERNET (User/Non-user)
REGION (North/South/Midlands)
EDUCATION (Secondary/Tech college/Graduate/Post-graduate)
READ NEWSPAPER (Regularly/Occasionally/Never)
PASS BILLBOARDS (Regularly/Occasionally/Never)
LISTEN TO RADIO (Regularly/Occasionally/Never)
WATCH TV NEWS (Yes/No)
WATCH TV SOAPS (Yes/No)
WATCH TV SPORT (Yes/No)
WATCH TV CHAT (Yes/No)
WATCH TV QUIZ (Yes/No)
WATCH TV DRAMA (Yes/No)
WATCH TV GARDEN (Yes/No)
WATCH TV COOKING (Yes/No)
WATCH TV COMEDY (Yes/No)
WATCH TV DOCUMENTARY (Yes/No)
WATCH TV FILMS (Yes/No)
WATCH TV REALITY (Yes/No)
WATCH TV MUSIC (Yes/No)
TV HRS WEEKDAY (Under 1, 1 to 2, 3 to 5, 5 to 10, Over 10)
TV HRS WEEKEND (Under 1, 1 to 2, 3 to 5, 5 to 10, Over 10)
GO TO MOVIES (Regularly/Occasionally/Never)
APPENDIX B: Data Preparation Program
# prepareData.py
# Tucker Chambers (tuckerc@bu.edu)
# CS105 Final Project
# 12/10/2010
# Prepares a dataset for data-mining in Weka

import string

inFileName = "FastFood.csv"     # the original Fast Food dataset
outFileName = "GoToMovies.csv"  # the new movies dataset

food_file = open(inFileName, 'r')
movie_file = open(outFileName, 'w')

for line in food_file:
    # Normalized copy of the line used only for testing: lowercase, spaces removed
    testLine = string.replace(line.lower(), " ", "")
    # Exclude lines with missing data ("*", "Don't know", "Unknown", or empty)
    if not (len(testLine.strip()) == 0 or "*" in testLine
            or "don'tknow" in testLine or "unknown" in testLine):
        # Remove apostrophes that cause errors in Weka
        line = string.replace(line, "'", "")
        # Replace "Not at all" with "Never" -- easier to use
        line = string.replace(line, "Not at all", "Never")
        # Write the modified line to the movies file
        movie_file.write(line)

food_file.close()
movie_file.close()

print "Data preparation complete.", outFileName, "is ready for analysis."
APPENDIX C: J48 Pruned Tree
Internet = User: Occasionally (184.0/52.0)
Internet = Non-user
|   Education = Secondary
|   |   Watch TV music = No
|   |   |   TV hrs weekend = 3 to 5: Never (12.0/4.0)
|   |   |   TV hrs weekend = 5 to 10
|   |   |   |   Region = North
|   |   |   |   |   Watch TV comedy = Yes: Occasionally (4.0)
|   |   |   |   |   Watch TV comedy = No: Never (3.0)
|   |   |   |   Region = Midlands: Occasionally (4.0)
|   |   |   |   Region = South: Never (5.0/2.0)
|   |   |   TV hrs weekend = 1 to 2: Never (2.0)
|   |   |   TV hrs weekend = 2 to 3: Occasionally (5.0)
|   |   |   TV hrs weekend = Over 10: Never (3.0/1.0)
|   |   |   TV hrs weekend = Under 1: Never (1.0)
|   |   Watch TV music = Yes
|   |   |   Watch TV documentary = Yes
|   |   |   |   Watch TV garden = Yes: Occasionally (14.0)
|   |   |   |   Watch TV garden = No
|   |   |   |   |   Watch TV drama = Yes: Occasionally (3.0)
|   |   |   |   |   Watch TV drama = No: Regularly (3.0)
|   |   |   Watch TV documentary = No
|   |   |   |   Watch TV garden = Yes: Never (2.0)
|   |   |   |   Watch TV garden = No
|   |   |   |   |   Listen to radio = Regularly: Occasionally (3.0)
|   |   |   |   |   Listen to radio = Never: Never (2.0)
|   |   |   |   |   Listen to radio = Occasionally: Occasionally (6.0/1.0)
|   Education = Graduate: Occasionally (1.0)
|   Education = Tech college
|   |   Pass billboards = Regularly
|   |   |   Watch TV reality = No: Occasionally (12.0/2.0)
|   |   |   Watch TV reality = Yes: Regularly (6.0/2.0)
|   |   Pass billboards = Occasionally
|   |   |   TV hrs weekend = 3 to 5: Occasionally (3.0)
|   |   |   TV hrs weekend = 5 to 10
|   |   |   |   TV hrs weekday = 5 to 10: Never (4.0/1.0)
|   |   |   |   TV hrs weekday = 3 to 5: Occasionally (0.0)
|   |   |   |   TV hrs weekday = 2 to 3: Occasionally (0.0)
|   |   |   |   TV hrs weekday = 1 to 2: Occasionally (0.0)
|   |   |   |   TV hrs weekday = Over 10: Occasionally (2.0)
|   |   |   |   TV hrs weekday = Under 1: Occasionally (0.0)
|   |   |   TV hrs weekend = 1 to 2: Occasionally (0.0)
|   |   |   TV hrs weekend = 2 to 3: Occasionally (0.0)
|   |   |   TV hrs weekend = Over 10: Occasionally (0.0)
|   |   |   TV hrs weekend = Under 1: Occasionally (0.0)
|   |   Pass billboards = Never: Occasionally (2.0/1.0)
|   Education = Postgraduate: Never (4.0)

Number of Leaves: 33
Size of the tree: 48