Recommendation Engine and Data Analytics: Netflix

E. Lance

July 31, 2013

Abstract

Collaborative filtering techniques are heavily used in industry, generally in recommendation systems such as Netflix's Cinematch or Amazon's e-basket. The objective of a collaborative system is to suggest elements of possible interest to its users and to profit from accurate suggestions. A recommendation engine was built using a variety of collaborative filtering techniques and tailored to a specific data structure. This paper focuses on Netflix as our case study. Underlying customer behavior patterns are investigated with the objective of increasing profitability margins.

Keywords: K-nearest neighbors, Recommendation engines, Cosine similarity.

1 Introduction

A collaborative filtering engine was built with the objective of predicting individual user preferences and suggesting multiple elements of interest to a particular user. In theory, these techniques can be applied to any type of market basket. For our purposes, we demonstrate their use in the media industry, specifically with movie ratings and movie recommendations. In this paper we also investigate profitability as a function of customer behavior.

The data set used consists of ratings from more than 500,000 users, producing a database of more than 100 million rows. Due to the size of the data set, a MySQL server was used to store the information, and the analysis was conducted in R using the RMySQL package. There are more than 17,000 titles available, with ratings ranging from 1 to 5. Figure 1 shows a simple example of Euclidean distance in 3-space.

Figure 1: Euclidean distance example in 3-space: calculation of the distance from the origin for an arbitrary point in 3-space.

2 Theory

The model uses three main techniques to derive a prediction: k-nearest neighbors, cosine similarity, and Pearson's correlation coefficient. The first technique is used to cluster users by distance to the target user, where closeness is measured by the Euclidean distance in N dimensions. Cosine similarity treats users as vectors and measures the orthogonality between two vectors; it is used as a complement to the correlation coefficient. Pearson's correlation is used in conjunction with k-nearest neighbors to assess similarity between users.

2.1 Euclidean Metrics

In its simplest form, the Euclidean distance can be expressed as

E_{2d} = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}.    (1)

The extension of Euclidean distance from 2 to N dimensions is straightforward; it follows that the Euclidean distance in N dimensions can be expressed as

E_{Nd} = \left[ \sum_{j}^{N} \left( x_j^{i+1} - x_j^{i} \right)^2 \right]^{1/2}.    (2)

2.2 Correlation Metrics

After Euclidean distances have been established and users have been sorted by distance, correlation coefficients are calculated. Correlation coefficients and distances are used to weight each user's contribution toward the prediction as follows:

U_{w_i} = \frac{1}{E_{ij}} \cdot \frac{R_{ji}}{\sum_{j}^{N} R_j},    (3)

where U_{w_i} is the weight assigned to a particular user, R_{ji} is the correlation coefficient between that user and the target user, \sum_{j}^{N} R_j is the sum of the correlation coefficients of the users closest to the target user, and E_{ij} is the Euclidean distance between that user and the target user. U_{w_i} represents the degree of contribution a specific user makes to the prediction, and its strength depends on both the correlation coefficient and the Euclidean distance. This technique is also known as k-nearest neighbors.
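As a minimal sketch of how equations (1)-(3) fit together, consider the following R function. It assumes the target user and the candidate users are held in memory as numeric rating vectors over a common set of titles; the function name, the damping constant, and the final weighted-mean combination are our illustration, since the paper defines only the weights.

```r
# Sketch of the k-nearest-neighbor weighting of equations (1)-(3).
# `target` is one user's ratings; each row of `neighbors` is a candidate
# user's ratings over the same titles.
predict_rating <- function(target, neighbors, k = 5) {
  # Equation (2): Euclidean distance from each candidate to the target user.
  d <- apply(neighbors, 1, function(u) sqrt(sum((u - target)^2)))

  # Keep the k nearest candidates.
  nn <- order(d)[seq_len(min(k, nrow(neighbors)))]

  # Pearson correlation between each nearest neighbor and the target user.
  r <- apply(neighbors[nn, , drop = FALSE], 1, function(u) cor(u, target))

  # Equation (3): weight = correlation share, damped by Euclidean distance
  # (a small constant guards against division by zero for identical users).
  w <- (r / sum(r)) / (d[nn] + 1e-9)

  # One possible prediction: weighted mean of the neighbors' average ratings.
  sum(w * rowMeans(neighbors[nn, , drop = FALSE])) / sum(w)
}
```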
3 Obstacles

The size of the data set imposes software and hardware constraints on the analysis. It is not feasible, and would be highly inefficient, to load the entire data set into memory and iterate over it. The methods described in the theory section cannot be applied, as is, to large data sets. Modifications were made to the data structure in order to facilitate the application of collaborative filtering techniques.

3.1 Large Data Set

Due to the size of our data set, it was not possible to load all the data into R. To facilitate the analysis, an instance of MySQL server was installed along with the RMySQL package, which allows R to interact with a database.

3.2 Data Migration

The original data consisted of about 18 thousand text files, one file per movie title. Each text file contained user identification numbers, ratings, the movie title, and rating dates. A script was coded in R to migrate the data: using the RMySQL package, it moved the data from the text files to a locally hosted database. Three tables were created to store movie titles, ratings, and clustered groups, respectively.
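The migration script itself is not reproduced in the paper; the following is a minimal sketch of the approach, assuming a per-title file whose first line carries the movie identifier and whose remaining lines are comma-separated user, rating, date triplets. The connection details, directory name, and table name are placeholders, not the original script.

```r
# Sketch of the text-file-to-MySQL migration (file layout and credentials
# are assumptions, not the original code).
library(RMySQL)

con <- dbConnect(MySQL(), dbname = "netflix", user = "user", password = "pass")

files <- list.files("training_set", pattern = "\\.txt$", full.names = TRUE)
for (f in files) {
  lines    <- readLines(f)
  movie_id <- as.integer(sub(":", "", lines[1]))   # assumed first line: "<movie_id>:"
  parts    <- strsplit(lines[-1], ",")             # assumed "user_id,rating,date"
  ratings  <- data.frame(
    movie_id = movie_id,
    user_id  = as.integer(vapply(parts, `[`, "", 1)),
    rating   = as.integer(vapply(parts, `[`, "", 2)),
    date     = vapply(parts, `[`, "", 3),
    stringsAsFactors = FALSE
  )
  dbWriteTable(con, "ratings", ratings, append = TRUE, row.names = FALSE)
}
dbDisconnect(con)
```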
3.3 Natural Reference Frame

The main problem behind the recommendation engine is the absence of a unique frame of reference to which each element, existing or new, can refer. It is not efficient to run iterations for each customer to find its nearest neighbors. For this reason the concept of a global reference frame was introduced, which substantially reduces clustering and recommendation processing time. A global reference frame was created from the frequency distribution of ratings across all movies; the frequency collection yielded five natural reference frames. The procedure was then repeated and sub-clusters were created. Figure 2 shows a visual implementation of the solution: users were clustered according to their distance from the natural reference frame.

Figure 2: Clustering by rating frequency. Users fall into five clusters (1 through 5), each of which is further divided into sub-clusters (1.1 through 1.5, and so on).

4 Data Analysis

In this section we discuss the analysis done on the data set, along with the assumptions made. The objective is to find useful patterns in the data that could result in profitability improvements. Understanding customer behavior is a key component of driving revenue, and we try to achieve this with our data set. It is worth remarking that these assumptions are reasonable: a complete data set would make it possible to obtain the information necessary to calculate the end result of our assumptions exactly.

4.1 Basic Statistics

Some assumptions were made in the calculation of these statistics. First, due to the lack of descriptive data, we took the subscription length of each customer to be the difference, in months, between that customer's earliest and latest rating dates. Second, we assumed each user was fully subscribed to the service, paying $15.98 per month; the price was obtained directly from the service provider. Third, we assumed every user only rated titles after viewing them; this makes it possible to calculate the total number of titles seen by each user and the revenue per view.

Tables 1 and 2 show the basic statistics of our data set. Note that high ratings prevail: a rating greater than or equal to 3 is taken to mean customer satisfaction, a concept we use later to link individual customer satisfaction and revenue.

Parameter            Mean (µ)   Total
Subscription Length  14
Revenue/Customer     $225.75    $108,403,095
Titles Viewed        209        100,480,507
Revenue Per View     $2.75

Table 1: Basic statistics. Revenue is estimated by taking the difference between the earliest and latest rating dates and multiplying by the service fee. Values are per customer; subscription length is in months.

Rating         % of Total
1              5.20%
2              12.19%
3              30.24%
4              32.28%
5              20.09%
Total Ratings  100,480,507

Table 2: Frequency distribution of ratings.

4.2 Customer Segmentation

We uncovered the existence of two types of customers: profitable and less profitable. We focus on the latter to try to establish a correlation between the available data and customer profitability. For this reason we selected customers who have been subscribed for at least 12 months; we believe 12 months is a reasonable period of time for patterns to start to emerge in the activity of a user's account. The resulting subset contains 200,642 customers. Tables 3 and 4 show basic statistics for this customer subset.

Parameter              Mean (µ)   Totals
Subscription Length    27
Revenue/Customer       $436.80    $87,644,979
Titles Viewed          311        62,537,849
Rpv                    $4.28
Rpv 1st Qu.            $1.02
Rpv 3rd Qu.            $4.79
Days between Activity  8.21
Average Rating         3.4
Total Customers                   200,642

Table 3: Statistics for customers with subscription lengths greater than 12 months. Values are per customer; subscription length is in months.

Rating           % of Total
1                6.25%
2                13.85%
3                29.52%
4                32.61%
5                22.18%
Total Ratings    65,100,476
Total Customers  213,046

Table 4: Frequency distribution of ratings for the 12-month subset.

5 Customer Profitability

5.1 Profitable vs Unprofitable

Looking at the statistics in Table 3, the mean revenue per view is $4.28. Revenue per view is a measure of how profitable a customer is: passive customers are far more profitable than active customers, mainly because of the difference in their respective service consumption. From this observation we define the revenue per view, Rpv, as

Rpv = \frac{R_{it}}{V_{it}},    (4)

where R_{it} is the total revenue and V_{it} is the total number of views for user i. We categorize a customer as unprofitable if Rpv < $1 and as profitable if Rpv > $4.82; the choice of Rpv limits came from the distribution seen in Table 3. Tables 5 and 6 display the basic statistics for these two customer subsets.

Parameter              Mean (µ)   Totals
Subscription Length    24
Revenue/Customer       $390.70    $18,989,577
Titles Viewed          754        36,645,498
Revenue Per View       $0.60
Days between Activity  1.15
Average Rating         3.5
Total Customers                   48,598

Table 5: Unprofitable customers. Values are per customer; subscription length is in months. Total revenue is calculated from the initial subscription date to the latest rating date.

Parameter              Mean (µ)   Totals
Subscription Length    28
Revenue/Customer       $449.50    $22,430,870
Titles Viewed          52         2,619,939
Revenue Per View       $11.73
Days Between Activity  22.4
Average Rating         3.2
Total Customers                   49,905

Table 6: Profitable customers. Values are per customer; subscription length is in months. Total revenue is calculated from the initial subscription date to the latest rating date.

From Tables 5 and 6 we can see significant differences between the two subsets. First, the days between activity, a measure of how often a customer is active, differ markedly. We also see a significant difference in titles viewed: the unprofitable customer subset has a mean of 754 viewed titles, while the profitable subset has a mean of only 52. The first clue for correlating customer satisfaction came from the average ratings: as Tables 5 and 6 show, unprofitable customers tend to be slightly more satisfied than profitable customers. We explore this in the next section.
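To make the segmentation concrete, the following sketch computes each customer's subscription length, assumed revenue, and Rpv (equation 4), and applies the $1 and $4.82 cutoffs. The SQL schema, column names, and connection details are illustrative assumptions, not the original code.

```r
# Sketch: per-customer aggregates, Rpv (equation 4), and the profitability
# split. Table/column names and credentials are assumed, not the original's.
library(RMySQL)

con <- dbConnect(MySQL(), dbname = "netflix", user = "user", password = "pass")

per_user <- dbGetQuery(con, "
  SELECT user_id,
         COUNT(*) AS titles_viewed,
         PERIOD_DIFF(DATE_FORMAT(MAX(date), '%Y%m'),
                     DATE_FORMAT(MIN(date), '%Y%m')) AS months_subscribed
  FROM ratings
  GROUP BY user_id
  HAVING months_subscribed >= 12")
dbDisconnect(con)

fee <- 15.98                                # assumed monthly fee (section 4.1)
per_user$revenue <- fee * per_user$months_subscribed
per_user$rpv     <- per_user$revenue / per_user$titles_viewed   # equation (4)

unprofitable <- subset(per_user, rpv < 1)      # Table 5 subset
profitable   <- subset(per_user, rpv > 4.82)   # Table 6 subset
```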
5.2 Customer Satisfaction and Profitability

From Tables 5 and 6 we can see a clear difference between the customer sets. We believed customer satisfaction had to be an important factor driving revenue. To follow this lead, we divided our data set into two sets: happy and unhappy customers. Customers were classified as happy if their average rating was ≥ 3, and unhappy if their average rating was < 3.

In Table 7 we can see a clear trend and difference between customers who are, on average, satisfied or unsatisfied. The average revenue per customer between the two groups is very close, which is significant considering that the unsatisfied customer subset represents only 17% of the total customers. Unsatisfied customers tend to watch movies less frequently than satisfied customers; this is important because very active customers consume services that may represent costs to the service provider. The revenue per view of unsatisfied customers is almost three times that of satisfied customers. Figure 3 plots average rating against revenue per view, and a clear pattern emerges: customers who are less satisfied are more profitable.

Mean                   Unsatisfied  Satisfied
Total Views            220          331
Rating                 2.64         3.59
Subscription Months    26           28
Revenue Per View       $8.31        $3.44
Days Between Activity  16           7
Revenue/Customer       $412.90      $441.84

Other Statistics       Unsatisfied  Satisfied
Number of Customers    34,814       165,828
% of Total             17%          82%
Total Revenue          $14,374,793  $73,270,186

Table 7: Customer segmentation by satisfaction.

Figure 3: Customer satisfaction and profitability (revenue per view vs average rating). As customers tend toward a higher level of satisfaction, the revenue per view decreases significantly.

5.3 Conclusion

The results seem counter-intuitive, but they are supported by the research. The assumptions made about the data are valid in the sense that any researcher with unrestricted access to the data can, in practice, derive the metrics used in this paper. Focusing on the less active customers, to retain them as they are, can be a strategy to increase profits; focusing on very active customers and dissuading them with subtle suggestions could shift many of the active customers into passive customers.

6 Recommendation Engine

In order to train our engine, we created a subset consisting of the 100 most popular movies and 100,000 users selected at random. Movies and users were selected at random and fed into the prediction algorithm, and the number of k-nearest neighbors was then increased from 5 to 25. Figure 4 displays the means by which we calibrated our engine, and Table 8 displays the parameters used in this trial run. It is worth noting that in this run the k-nearest neighbors had an important constraint: each must have at least 10 movies in common with the target user.

The error was measured by

\hat{e} = \left| \frac{\hat{y}}{y} - 1 \right|,    (5)

where \hat{y} is the predicted value and y is the actual value. The mean absolute error was calculated as

MAE = \frac{\sum_{i}^{N} \hat{e}_i}{N}.    (6)

Parameter         Value
MAE               24%
Titles in common  10

Table 8: Error measurement trial run.

After reviewing the results in Figure 4, we followed a similar approach to investigate the error associated with the number of titles the k-nearest neighbors must have in common with the target user. From Figure 4 we can see that the error stays stable, below 10%, when the k-nearest neighbor count is in the vicinity of 12. For this trial run, we therefore held the k-nearest neighbor count at 13 and varied the number of titles in common; Figure 5 displays this trial run.

Figure 4: Percent error against number of k-nearest neighbors.

Figure 5: Percent error against number of titles in common with the target user.
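For reference, equations (5) and (6) amount to the following two lines of R; `y_hat` and `y` stand for the vectors of predicted and observed ratings (the names are ours).

```r
# Equations (5) and (6): per-prediction relative error and its mean.
relative_error <- function(y_hat, y) abs(y_hat / y - 1)              # equation (5)
mae            <- function(y_hat, y) mean(relative_error(y_hat, y))  # equation (6)

# Example: mae(c(3.2, 4.1, 2.5), c(3, 5, 2)) averages the relative errors.
```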
6.1 Validation: Results

Our validation set consisted of 100,000 random customers and 100 random movies. The parameters that made the best predictions were selected according to the results of the previous section: for the validation run we used 5 k-nearest neighbors with a constraint of at least 40 movies in common with the target user. Titles and users were selected at random and passed as arguments to the algorithm. The resulting mean absolute error was MAE = 15%.

7 Conclusion

Large data sets impose challenges and constraints on the techniques a researcher may use. Many of the solutions we found were ad hoc; different data sets may pose different problems that call for unique solutions.