Recommendation Engine and Data Analytics: Netflix
E. Lance
July 31, 2013
Abstract
Collaborative filtering techniques are heavily used in industry, generally in recommendation systems such
as Netflix's Cinematch or Amazon's e-basket. The objective of collaborative systems is to suggest possible
elements of interest to their users and to profit from accurate suggestions. A recommendation engine was
built using a variety of collaborative filtering techniques and tailored to a specific data structure. This
paper focuses on Netflix as our case study. Underlying customer behavior patterns are investigated with
the objective of increasing profitability margins.
Keywords: K-nearest neighbors, Recommendation engines, Cosine similarity.
1 Introduction
A collaborative filtering engine was built with the
objective of predicting individual user preferences
and suggesting multiple elements of interest to a
particular user. In theory, these techniques can
be applied to any type of market basket. For
our purpose we will demonstrate the use of these
techniques in the media industry, specifically with
movie ratings and movie recommendations. In this
paper we also investigate profitability as a function
of customer behavior.
The data set used consists of ratings from more than 500,000 users, producing a database of more than
100 million rows. Due to the size of the data set, a MySQL server was used to store the information.
The analysis was conducted in R using the RMySQL package. There are more than 17,000 titles available,
with ratings ranging from 1 to 5.

Figure 1 shows a simple example of Euclidean distance in 3-space.

[Figure 1: Euclidean distance example in 3-space. Calculation of distance from origin for an arbitrary
point in 3-space.]
2 Theory
The model uses three main techniques in order to derive a prediction: k-nearest neighbors, cosine
similarity, and Pearson's correlation coefficient. The first technique is used to cluster users by
distance to the target user. Closeness is measured by the Euclidean distance in N dimensions. Cosine
similarity treats users as vectors and is a measure of the orthogonality between two vectors; this
technique is used as a complement to the correlation coefficient. Pearson's correlation is used in
conjunction with k-nearest neighbors to assess similarity between users.
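To make the three measures concrete, the following minimal R sketch (ours, not the paper's code)
computes each one; the rating vectors are hypothetical and assume both users rated the same five titles:

    a <- c(5, 3, 4, 4, 1)   # ratings by the target user (hypothetical)
    b <- c(4, 3, 5, 3, 2)   # ratings by a candidate neighbor (hypothetical)

    euclidean <- sqrt(sum((a - b)^2))                             # distance used to rank neighbors
    cosine    <- sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))   # 1 = parallel, 0 = orthogonal
    pearson   <- cor(a, b, method = "pearson")                    # linear association of the two vectors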
2.1 Euclidean Metrics

The extension of Euclidean distance from 2 to N dimensions is straightforward. In its simplest form,
Euclidean distance can be expressed as:

    E_{2d} = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2},    (1)

and it follows that E_d in N dimensions can be expressed as:

    E_{Nd} = \left[ \sum_j^N \sum_i^N (x_{j,i+1} - x_{j,i})^2 \right]^{1/2}.    (2)

2.2 Correlation Metrics

After Euclidean distances have been established, and users have been sorted by distance, correlation
coefficients are calculated. Correlation coefficients and distances are used to weight each user's
contribution toward the prediction as follows:

    U_{wi} = \frac{R_{ji}}{\sum_j^N R_j} \frac{1}{E_{ij}},    (3)

where U_{wi} is the weight assigned to a particular user, R_{ij} is the correlation coefficient between
a particular user and the target user, \sum_j R_j is the sum of the correlation coefficients of the users
who are closest to the target user, and E_{ij} is the distance between a particular user and the target
user. U_{wi} represents the degree of contribution a specific user will have, and its strength depends on
both the correlation coefficient and the Euclidean distance. This technique is also known as k-nearest
neighbors.
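As an illustration, the sketch below applies equation (3) to hypothetical correlation coefficients and
distances for three neighbors. The paper does not spell out the final aggregation step, so the
weight-normalized average at the end is our assumption:

    rho   <- c(0.82, 0.75, 0.60)       # hypothetical Pearson coefficients R_j for 3 neighbors
    edist <- c(1.4, 2.0, 3.1)          # hypothetical Euclidean distances E_ij
    w     <- (rho / sum(rho)) / edist  # U_wi from equation (3)

    r_nbr <- c(4, 5, 3)                # the neighbors' ratings of the target title
    pred  <- sum(w * r_nbr) / sum(w)   # assumed aggregation: weight-normalized average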
3 Obstacles
The size of the data set imposes software and hardware constraints on the analysis. It is not possible,
and would be highly inefficient, to load the entire data set into memory and iterate over it. The methods
described in the theory section cannot be applied, as is, to large data sets. Modifications were made to
the data structure in order to facilitate the application of collaborative filtering techniques.
3.1 Large Data Set

Due to the size of our data set it was not possible to load all the data into R. In order to facilitate
the analysis, an instance of MySQL server was installed and the RMySQL package was used; this package
allows R to interact with a database.
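A minimal sketch of such a link, assuming a local server with placeholder credentials, database, table,
and column names (none of which are given in the paper):

    library(RMySQL)

    # Hypothetical connection details; the paper only states a local MySQL instance was used.
    con <- dbConnect(MySQL(), dbname = "netflix", host = "localhost",
                     user = "analyst", password = "secret")
    ratings <- dbGetQuery(con, "SELECT user_id, movie_id, rating, rate_date
                                FROM ratings")
    dbDisconnect(con)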
3.2 Data Migration

The original data consisted of about 18,000 text files, one file per movie title. Each text file
contained user identification numbers, ratings, the movie title, and rating dates. A script was coded in
R to migrate the data. Using the RMySQL package, the script migrated the data from the text files to a
locally hosted database. Three tables were created to store movie titles, ratings, and clustered groups,
respectively.
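A sketch of what such a migration script might look like; the directory name, file layout, and column
names are assumptions, since the paper does not publish its code:

    library(RMySQL)

    con   <- dbConnect(MySQL(), dbname = "netflix", host = "localhost")
    files <- list.files("training_set", full.names = TRUE)   # assumed: one file per title
    for (f in files) {
      one <- read.csv(f, header = FALSE,
                      col.names = c("user_id", "rating", "rate_date"))
      one$movie_id <- as.integer(gsub("\\D", "", basename(f)))  # assumed: title id in file name
      dbWriteTable(con, "ratings", one, append = TRUE, row.names = FALSE)
    }
    dbDisconnect(con)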
3.3 Natural Reference Frame

The main problem behind the recommendation engine is the absence of a unique frame of reference to which
each element, existing or new, can refer. It is not efficient to run iterations for each customer to find
its nearest neighbors. For this reason, the concept of a global reference frame was introduced, which
substantially reduces clustering and recommendation processing time.

A global reference frame was created out of the frequency distribution of ratings across all movies. The
result of the frequency collection was five natural reference frames. The procedure was then repeated and
sub-clusters were created. Figure 2 shows a visual implementation of the solution: users were clustered
according to their distance from the natural reference frame.

[Figure 2: Clustering by rating frequency. Users fall into five clusters (1 through 5), each divided into
sub-clusters (e.g., 1.1 through 1.5).]
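The paper does not give the exact construction. One possible reading, sketched below in R, anchors five
frames at the rating values 1 to 5 and assigns each user to the nearest frame by mean rating; the ratings
data frame is the hypothetical table pulled from the database in the earlier sketch:

    # Mean rating per user, from the hypothetical ratings table.
    user_mean <- tapply(ratings$rating, ratings$user_id, mean)

    frames  <- 1:5                                   # five natural reference frames
    cluster <- sapply(user_mean, function(m) frames[which.min(abs(frames - m))])

    # Repeating the procedure at a finer scale gives sub-clusters (1.1, 1.2, ...).
    sub_cluster <- round(user_mean, 1)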
4 Data Analysis

In this section we discuss the analysis done on the data set, along with the assumptions made. The
objective is to find useful patterns in the data that could result in profitability improvements.
Understanding customers' behavior is a key component of driving revenue; we will try to achieve this with
our data set. It is worth remarking that these assumptions are reasonable: a complete data set would make
it possible to obtain the information necessary to calculate the end result of our assumptions directly.

4.1 Basic Statistics
Some assumptions were made in the calculation of these statistics. First, due to the lack of descriptive
data, we took the difference between the earliest and the latest rating date of each customer and assumed
it to be the length of the subscription, in months, for each user.

The second assumption involves the revenue obtained from each user. We assumed each user was fully
subscribed to the service, paying $15.98 per month. The price was obtained directly from the service
provider.
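A sketch of these two assumptions in R, assuming the hypothetical ratings table from above and
approximating a month as 30.44 days:

    dates     <- as.Date(ratings$rate_date)
    span_days <- tapply(dates, ratings$user_id,
                        function(d) as.numeric(max(d) - min(d)))
    months  <- pmax(1, ceiling(span_days / 30.44))  # assumed: ~30.44 days per month, minimum 1
    revenue <- months * 15.98                       # assumed flat $15.98 monthly fee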
The third assumption involved the total number of movie titles viewed by each user. We assumed every user
only rated titles after viewing them. This assumption made it possible to calculate the total number of
titles seen by each user and the revenue per view. Tables 1 and 2 show the basic statistics of our data
set. Note that high ratings prevail in the data set. A rating greater than or equal to 3 means customer
satisfaction; we will use this concept later on to link individual customer satisfaction and revenue.

    Parameter             Mean (µ)   Total
    Subscription Length   14
    Revenue/Customer      $225.75    $108,403,095
    Titles Viewed         209        100,480,507
    Revenue Per View      $2.75

Table 1: Basic statistics. Note that the revenue is estimated by taking the difference between the
earliest and latest rating dates, multiplied by the service fee. Values are per customer; subscription
length is in months.

    Rating           % of Total
    1                5.20%
    2                12.19%
    3                30.24%
    4                32.28%
    5                20.09%
    Total Ratings    100,480,507
    Total Customers  213,046

Table 2: Frequency distribution of ratings.

    Parameter              Mean (µ)   Totals
    Subscription Length    27
    Revenue/Customer       $436.80    $87,644,979
    Titles Viewed          311        62,537,849
    Rpv                    $4.28
    Rpv 1st Qu.            $1.02
    Rpv 3rd Qu.            $4.79
    Days between Activity  8.21
    Average Rating         3.4
    Total Customers                   200,642

Table 3: Statistics for customers with subscription lengths greater than 12 months. Values are per
customer; subscription length is in months.

    Rating           % of Total
    1                6.25%
    2                13.85%
    3                29.52%
    4                32.61%
    5                22.18%
    Total Ratings    65,100,476

Table 4: Frequency distribution of ratings for the 12-month subset.
4.2 Customer Segmentation

We uncovered the existence of two types of customers: profitable and less profitable. We focus on the
latter to try to establish a correlation between the available data and customer profitability. For this
reason we selected customers who have been subscribed for at least 12 months. We believe 12 months is a
reasonable period of time for patterns to start to emerge in the activity of a user's account. The
resulting subset contains 200,642 customers. Tables 3 and 4 show basic statistics for this customer
subset.

5 Customer Profitability

5.1 Profitable vs Unprofitable

By looking at the statistics in Table 3, we can see that the mean revenue per view is $4.28. The revenue
per view is a measure of how profitable a customer is. Passive customers are far more profitable than
active customers, mainly because of the difference in their respective service consumption. From this
observation we have derived the metric revenue per view, Rpv, to be:
    R_{pv} = \frac{R_{it}}{V_{it}},    (4)
where R_{it} is the total revenue and V_{it} is the total views for user i. We categorize a customer as
unprofitable if revenue per view R_{pv} < $1. In Table 5 we display the basic statistics for this
customer subset. Using the same idea, we categorize customers as profitable if R_{pv} > $4.82. Table 6
shows the basic statistics for this group. The choice of R_{pv} limits came from the distribution seen in
Table 3.
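A sketch of equation (4) and the cut-offs used here, assuming the per-user revenue vector from the
earlier sketch and that one rating corresponds to one view:

    views <- tapply(ratings$rating, ratings$user_id, length)  # assumed: one rating = one view
    rpv   <- revenue / views                                  # equation (4), aligned by user id
    segment <- cut(rpv, breaks = c(-Inf, 1, 4.82, Inf),
                   labels = c("unprofitable", "middle", "profitable"))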
    Parameter              Mean (µ)   Totals
    Subscription Length    24
    Revenue/Customer       $390.70    $18,989,577
    Titles Viewed          754        36,645,498
    Revenue Per View       $0.60
    Days between Activity  1.15
    Average Rating         3.5
    Total Customers                   48,598

Table 5: Unprofitable customers. Values are per customer; subscription length is in months. Total revenue
is calculated from initial subscription date to latest rating date.

    Parameter              Mean (µ)   Totals
    Subscription Length    28
    Revenue/Customer       $449.50    $22,430,870
    Titles Viewed          52         2,619,939
    Revenue Per View       $11.73
    Days between Activity  22.4
    Average Rating         3.2
    Total Customers                   49,905

Table 6: Profitable customers. Values are per customer; subscription length is in months. Total revenue
is calculated from initial subscription date to latest rating date.

From Tables 5 and 6 we can see significant differences between the two subsets. First, there is the days
between activity, which is a measure of how often a customer is active. We also see a significant
difference in titles viewed: the unprofitable customer subset has a mean of 754 viewed titles, while the
profitable customer subset has a mean of only 52.

The first clue for correlating customer satisfaction came from the average ratings. As we can see from
Tables 5 and 6, these numbers seem to indicate unprofitable customers tend to be slightly more satisfied
than profitable customers. We will explore this in the next section.

5.2 Customer Satisfaction and Profitability

From Tables 5 and 6 we can see a clear difference between the customer sets. We believed customer
satisfaction had to be an important factor driving revenue. In order to follow this lead, we divided our
data set into two sets: happy and unhappy customers. Customers were classified as happy if their average
rating was ≥ 3, and unhappy if their average rating was < 3.

In Table 7 we can see a clear trend and difference between customers who are, on average, satisfied or
unsatisfied. The average revenue per customer is very close between the two groups. This is significant
considering that the unsatisfied customer subset represents 17% of the total customers.

Unsatisfied customers tend to watch movies less frequently than satisfied customers. This is important
because very active customers consume services that might represent costs to the service provider. The
revenue per view from unsatisfied customers is almost threefold that of satisfied customers. Figure 3
displays a plot of average rating against revenue per view; there is a clear pattern: customers who are
less satisfied are more profitable.

[Figure 3: Customer satisfaction and profitability (Revenue per View vs Average Rating). As customers
tend to a higher level of satisfaction, the revenue per view decreases significantly.]
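A sketch of this split, assuming the per-user rpv vector from the earlier sketch:

    user_mean <- tapply(ratings$rating, ratings$user_id, mean)
    happy     <- user_mean >= 3    # satisfied if mean rating >= 3
    tapply(rpv, happy, mean)       # mean revenue per view: unhappy (FALSE) vs happy (TRUE)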
    Mean                   Unsatisfied  Satisfied
    Total Views            220          331
    Rating                 2.64         3.59
    Subscription Months    26           28
    Revenue Per View       $8.31        $3.44
    Days Between Activity  16           7
    Revenue/Customer       $412.90      $441.84

    Other Statistics
    Number of Customers    34,814       165,828
    % of Total             17%          82%
    Total Revenue          $14,374,793  $73,270,186

Table 7: Customer segmentation by satisfaction.

5.3 Conclusion

The results seem to be counter-intuitive, but they are supported by the research. The assumptions made
about the data are valid, as any researcher with unrestricted access to the data can, in practice, derive
the metrics used in this paper. Focusing on the less active customers, to retain them as is, can be a
strategy to increase profits. Focusing on very active customers and dissuading them with subtle
suggestions could shift many of the active customers into passive customers.

6 Recommendation Engine

In order to train our engine, we created a subset consisting of 100 of the most popular movies and
100,000 users selected at random. Movies and users were selected at random and fed into the prediction
algorithm. The number of k-nearest neighbors was then increased from 5 to 25. Figure 4 displays the means
by which we calibrated our engine, and Table 8 displays the parameters used in this trial run. It is
worth noting that in this run the k-nearest neighbors had an important constraint: they all must have at
least 10 movies in common with the target user.

    Parameter         Value
    MAE               24%
    Titles in common  10

Table 8: Error measurement trial run.

The error was measured by

    \hat{e} = \left| \frac{\hat{y}}{y} - 1 \right|,    (5)

where \hat{y} is the predicted value and y is the actual value. The mean absolute error was calculated by

    MAE = \frac{1}{N} \sum_i^N \hat{e}_i.    (6)
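A minimal sketch of equations (5) and (6) with hypothetical predicted and actual ratings:

    y_hat <- c(3.8, 2.9, 4.4)     # predicted ratings (hypothetical)
    y     <- c(4, 3, 5)           # actual ratings (hypothetical)
    e_hat <- abs(y_hat / y - 1)   # equation (5)
    mae   <- mean(e_hat)          # equation (6); multiply by 100 for a percentage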
After reviewing the results in Figure 4, we followed a similar approach to investigate the error
associated with the number of titles the k-nearest neighbors must have in common with the target user.
From Figure 4 we can see the error stays stable, below 10%, when the k-nearest neighbors size is in the
vicinity of 12. For this trial run, we held the k-nearest neighbors size at 13 and varied the number of
titles in common. Figure 5 displays this trial run.

[Figure 4: Percent error against number of k-nearest neighbors (Neighbors vs % Error).]

[Figure 5: Titles in common, with the target user, against percent error (Titles in Common vs % Error).]

6.1 Validation: Results

Our validation set consisted of 100,000 random customers and 100 random movies. The parameters that made
the best prediction were selected according to the results of the previous section. For our validation
run we used 5 k-nearest neighbors with a constraint of at least 40 movies in common with the target user.
Titles and users were selected at random and passed as arguments to the algorithm. The mean absolute
error was MAE = 15%.

7 Conclusion

Large data sets impose challenges and constraints on the techniques a researcher may use. Many of the
solutions we found were ad hoc; different data sets may pose different problems that call for unique
solutions.