Movie Advisor
Tel Aviv University
Faculty of Engineering
M.Sc. Project
Doron Harlev
Supervisor: Dr. Dana Ron
February 12, 2016
Table of Contents
Introduction
Database Characteristics
Data Sets
Evaluation Criteria
    Mean Average Error (MAE)
    Coverage
Base Algorithms
    All Average
    Movie Average
    User Average
    User Movie Average
    Summary
Prediction Methods
    Pearson R Algorithm
        Base Algorithm
        Additions To The Base Pearson R Algorithm
    Mean Square Difference (MSD) Algorithm
    Genre Based Algorithms
        Genre Statistics
        Base Algorithm
        Genre Algorithm
        Hybrid Genre Algorithm
    Algorithm Comparison
        MAE Vs. Coverage For All Algorithms
        Database Alteration
Conclusions
References
Appendix I: Implementation
    Introduction
    Matlab Code
        General
        Database Statistics
        Base Algorithms
        Pearson r Algorithm
        MSD
        Genre
Introduction
This paper will analyze existing and proposed algorithms designed to predict a user's rating of a particular movie. The predictions will rely on the user's own ratings of other movies as well as the scores provided by neighboring users. The ability to predict a user's rating may prove useful in many contexts. One such context is enhancing a user's experience in an online store by recommending some items while helping avoid others. The
problem set described in this paper is studied in the field of Recommender Systems or
Collaborative Filtering.
This paper will start by examining the statistical properties of the chosen database. Once the
base and test data sets have been characterized, evaluation criteria such as Mean Average
Error (MAE) and coverage will be explained. These criteria will then be used to measure the
performance of trivial prediction methods applied to the base and test data sets. The
performance of trivial methods will provide a reference for more sophisticated algorithms.
The first of these algorithms makes use of the Pearson R coefficient. The Pearson R
coefficient provides a correlation measure between two vectors and can be used to provide a
distance metric between two users. Use of the Pearson R correlation coefficient is quite
common in the field of collaborative filtering, and results obtained with this method will be
used to gauge the performance of other algorithms. The Pearson R algorithm will be further
enhanced in accordance with improvements proposed by other papers as well as this one.
Another baseline algorithm to be examined is the Mean Square Difference (MSD) algorithm. Although the MSD algorithm is generally considered inferior to the Pearson R algorithm, elements of it will prove valuable in the newly proposed algorithms.
The database used in this paper also provides genre information for all movies. The statistics
of genre information in the database will be analyzed. Novel prediction algorithms relying
on genre information will then be proposed. As a reference, a new trivial prediction
algorithm will be presented and its performance compared to previous trivial methods.
Several genre-based prediction algorithms will then be proposed and analyzed with respect to
the test and base data sets.
Finally, all algorithms will be compared. This comparison will include a different
instantiation of the initial data set, as well as altered versions of the entire database.
Database Characteristics
The database used for the project is the GroupLens database available on the internet at
http://www.cs.umn.edu/Research/GroupLens/. The database contains 100,000 ratings on a
scale of 1-5. The ratings are made for 1682 movies by 943 users. If the database were to be
viewed as a matrix with users designating rows and movies designating columns, the matrix
would be extremely sparse with values in merely 6% of its entries. Despite its apparent
sparseness, the database is sufficient to allow the analysis of the prediction algorithms
discussed in this paper.
The mean score provided for all entries in the database is 3.53, the median is 4¹ and the standard deviation is 1.13. Figure 1 depicts a histogram of all user ratings:
Figure 1: Histogram of all movie scores
¹ The median is actually 4⁻; that is to say, a score of 4 is included in the upper half of the median.
From the histogram and average score we learn that users tend to rate movies they liked more
than movies they disliked.
The average number of scores given to each movie is around 60 and the median number of
scores is 57. Figure 2 depicts a histogram of the number of scores given to each movie:
Figure 2: Histogram of number of scores given to each movie
The histogram shows that while some movies have a significant number of ratings of more
than 500, many others have 10 ratings or fewer.
The average number of scores given by each user is 106 and the median is 65. The minimum
number of scores given by each user is 20. Figure 3 depicts a histogram of the number of
scores given by users:
Figure 3: Histogram of number of scores given by each user
This histogram shows a distribution similar to that of the movie ratings. While some users rated as many as 700 movies, many others rated the minimum allowed.
Figure 4 depicts a graphical representation of the database:
Figure 4: Graphical representation of rated movies
Each rated item in the database is designated by a blue dot. The parabolic decline in the number of rated movies as a function of user ID implies that as the database was established, each new user was presented with a more elaborate list of movies. It is also evident that movies with higher IDs receive fewer ratings on average than those with lower IDs.
Data Sets
To assess the prediction methods, the data set is divided into two parts: base and test. The
test data set contains 5 entries for 10% of the users yielding a total of 470 entries. The base
data set contains all remaining entries (99,530).
The test data set is used as a reference for the accuracy of the predictions. Since the division
into base and test data sets is made randomly, the performance of the tested algorithms will
vary depending on the specific division that was made. To stabilize the performance
throughout the paper, all results are calculated for a specific base/test division. This issue
will be further addressed in the final analysis.
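As an illustration, this division can be reproduced with the routines listed in Appendix I (list2mat and mkdata). The following is a hypothetical invocation, with the file name and the 10%/5-entry split taken from the text above:

[uid, mid, r]=textread('u.data','%u %u %u %*u'); % read the GroupLens ratings list
mat=list2mat([uid mid r]);                       % sparse user x movie rating matrix
[base_mat, test_list]=mkdata(mat,0.1,5);         % 5 test entries for 10% of the users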
Evaluation Criteria
Two types of evaluation criteria will be used in this paper.
Mean Average Error (MAE)
Once predictions for the test data set are obtained, the error E_i is defined as the difference between the prediction, S_i, and the actual score given by the user, R_i. The MAE is
$$MAE = \frac{1}{n}\sum_{i=1}^{n}\left|S_i - R_i\right|$$
where n is the length of the predicted data set. Other error measures, such as standard deviation and ROC, may be used. However, related papers show that MAE is consistent with other measures, and it is the most widely used.
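As a minimal sketch, with S a vector of predictions and R the corresponding actual test scores (assumed variable names), the MAE is computed as:

mae=mean(abs(S-R)); % mean average error over the predicted data set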
Coverage
As will later be shown, predictions cannot be made for all movies in the test data set.
Coverage is a measure of the percentage of movies in the test data set that can be predicted.
In general, reducing the MAE comes at the expense of reduced coverage.
The importance of coverage depends on the application using the recommender system.
Some applications may require predictions for most items in the database, while others may
choose to compromise coverage for improved accuracy.
Base Algorithms
Four base algorithms are analyzed in this section. These algorithms are “basic” in that they
rely on trivial aspects of the user’s entries to provide predictions.
All Average
In this method the average rating of all entries in the base data set is calculated. This
calculated value is then used as the predicted score for all values in the test data set. This
method yields an MAE of 1.046 and coverage of 100%.
Movie Average
In this method the average score given to each movie in the base data set is calculated. This
average score is then used to predict values for all occurrences of the movie in the test data
set. This method yields a vast improvement over the previous one. The MAE drops to 0.887.
Since some movies in the test data set have no scores in the base data set, they cannot be predicted. The coverage in this method is 99.8%.
User Average
In this method the average score given by each user in the base dataset is calculated. This
average score is then used to predict values for the user in the test data set. This method
yields an improvement over the previous one. The MAE drops to 0.849. The coverage in this method is 100%, since every user provided at least 20 ratings.
Figure 5 depicts a histogram of the error using the user average base algorithm.
Figure 5: Histogram of error using User Average prediction
User Movie Average
This method is the most sophisticated of the basic methods. The method combines user
average and movie average to produce a prediction. In order to produce a prediction for user
a and movie i we perform the following calculation:
$$S_{a,i} = \bar{r}_a + \frac{1}{n}\sum_{u}\left(r_{u,i} - \bar{r}_u\right)$$
where $\bar{r}_a$ is user a's average rating, u is an index running over all users who rated the movie, and n is the number of users who rated the movie.
This method yields yet another improvement over the previous one. The MAE drops to 0.83, while the coverage is 99.8%, identical to the movie average method.
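A minimal sketch of this calculation, consistent with the basepr routine in Appendix I (base_mat, a and i are assumed names for the sparse rating matrix, the user index and the movie index):

user_mean=sum(base_mat,2)./sum(base_mat~=0,2);  % average rating of every user
dev=(base_mat-repmat(user_mean,1,size(base_mat,2))).*(base_mat~=0);
                                                % ratings minus user means, 0 if unrated
u=find(base_mat(:,i));                          % users who rated movie i
S=user_mean(a)+mean(dev(u,i));                  % predicted score for user a, movie i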
Summary
It is clear that any desirable prediction method must improve upon the results obtained in this
section. Table 1 summarizes the results:
Method               MAE     Coverage [%]
All Average          1.046   100
Movie Average        0.887   99.8
User Average         0.849   100
User Movie Average   0.830   99.8

Table 1: Summary of base algorithms results
Prediction Methods
The following prediction methods attempt to provide further improvement over the base
algorithms. The improvement is possible through the use of additional information. One
form of additional information is to intelligently weigh other users’ ratings. The Pearson R
and MSD methods use this approach to improve prediction. Another form of additional
information is to make use of genre information provided with the database. Methods using
this approach will be presented in the ensuing discussion.
Pearson R Algorithm
The Pearson R algorithm relies on the Pearson R coefficient to produce a correlation metric
between users. This correlation is then used to weigh the score of each relevant user. The
Pearson R algorithm is used widely in the study of recommender systems, and is used as a
reference in this paper.
Base Algorithm
The Pearson R correlation between users a and u is defined as:
$$P_{a,u} = \frac{\sum_{i=1}^{m}\left(r_{a,i}-\bar{r}_a\right)\left(r_{u,i}-\bar{r}_u\right)}{\sigma_a \sigma_u}$$
where m is the number of movies that both users rated, $r_{a,i}$ is the score user a gave movie i, $\bar{r}_a$ is the average score user a gave all movies, and $\sigma_a$, $\sigma_u$ are the deviations of the two users' scores computed over the mutually rated movies.
Since the base data set is sparse, when calculating the Pearson R coefficient, most users have
only a few overlapping scored movies. In calculating the coefficient, only movies ranked by
both users are taken into account. This affects the sum in the numerator, and the variance in
the denominator.
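The following sketch shows this restriction for a single pair of users, reusing the mean-centred matrix dev from the sketch above (a and u are assumed user indices; at least one mutually rated movie is assumed):

idx=find(dev(a,:)~=0 & dev(u,:)~=0);  % mutually rated movies (MRM)
va=dev(a,idx); vu=dev(u,idx);
P=sum(va.*vu)/(sqrt(sum(va.^2))*sqrt(sum(vu.^2))); % Pearson R over the MRM only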
Figure 6 depicts a histogram of the average number of mutually rated movies (MRM) for all
users in the test data set:
Figure 6: Histogram of average number of MRM for all users in the test data set
The mean MRM is 17.6 and the median is 12. We note that many users have an average
MRM below 10, while some have a relatively high average MRM of 60. Users who provide
more entries will tend to have a higher average MRM.
Figure 7 depicts a histogram of the Pearson R coefficient between a particular user and all
other users:
Figure 7: Histogram of Pearson R coefficient for user 104
It is evident that most users are slightly positively correlated. While there is such a thing as
two people with similar tastes, it is highly unlikely to find two users with opposite tastes.
Opposite tastes would require users to consistently dislike what the other user likes and vice
versa. This observation is further discussed in Shardanand [1].
Once the Pearson R correlation between a user and all other users is obtained, the predicted
movie score is calculated as:
$$S_{a,i} = \bar{r}_a + \frac{\sum_{u=1}^{n}\left(r_{u,i}-\bar{r}_u\right)P_{a,u}}{\sum_{u=1}^{n}\left|P_{a,u}\right|}$$
This approach is very similar to that of the user movie average discussed earlier. The difference is that the Pearson R coefficient is used to weigh the score given by each user, whereas in the user movie average algorithm all users are weighted equally.
Applying the Pearson R base algorithm to the test data set yields an MAE of 0.79 with
coverage of 99.8%. The coverage is slightly limited, because not all movies in the test data
set have a score in the base data set. Since no limitations have been made on the predictions,
the coverage is identical to the movie average and user movie average base algorithms.
Figure 8 depicts a histogram of the prediction error using the Pearson R base algorithm.
Figure 8: Histogram of prediction error for Pearson R base algorithm
The MAE obtained is clearly an improvement over all base methods.
Note that in order to normalize the weights, all user scores are divided by a sum of the
absolute value of the correlations. Since the Pearson R correlation may be either positive or
negative, adding all correlations may produce low values in the denominator. These low
values may cause the prediction to be either too low or too high, at times exceeding a value of
5. Using the absolute value has a drawback in that it produces an unbalanced normalization
factor. Results obtained without the absolute value were extensively tested and shown to
produce substantially poorer results. In further discussion, only the absolute value approach
will be applied.
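A sketch of the prediction step with this normalization (P is assumed to hold the Pearson R coefficients between user a and all users, and dev and user_mean are as in the earlier sketches; at least one correlated user who rated movie i is assumed):

rated=dev(:,i)~=0;   % users who rated movie i
S=user_mean(a)+sum(P(rated).*dev(rated,i))/sum(abs(P(rated)));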
Additions To The Base Pearson R Algorithm
Several modifications can be made to the base Pearson R algorithm to improve its
performance. These modifications have to do with setting thresholds in order to reduce the
effects of ‘noisy’ data.
Pearson R Thresholding
One modification, initially suggested by Shardanand [1], is to avoid making use of users who
are not highly correlated. To implement this approach we modify Pa ,u as follows:
 Pa ,u  L
if Pa ,u  L
 1 L

P  L
|
Pa ,u   a ,u
if Pa ,u  L
 1 L
otherwise
0


where L is the Pearson R threshold. After calculating $P'_{a,u}$ we simply substitute it for $P_{a,u}$ and produce the expected score.
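This transform can be written in one vectorized line, matching the weighting used in the appendix code (P may be a scalar or a vector of coefficients):

Pt=(abs(P)>=L).*sign(P).*(abs(P)-L)/(1-L); % 0 below the threshold, rescaled above it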
Figure 9 shows the effect of increasing the Pearson R threshold on the MAE and coverage:
Figure 9: MAE and Coverage as a function of Pearson R threshold (Users TH=3, Herlock TH=1)
The results were obtained with two additional thresholds, users and Herlock. The users
threshold is the number of correlated users required to make a prediction. When the
threshold is not met, the algorithm will not make a prediction for the item in the test data set.
For consistency, all methods in this paper will be examined with a threshold of 3 users. A
Herlock threshold of 1 is identical to no threshold at all. This threshold will be explained
later.
The Pearson R threshold initially reduces the MAE (MAE=0.7867 at P TH=0.1) slightly
without compromising coverage significantly. Later on, the Pearson R threshold has the
effect of increasing the MAE and decreasing coverage; both effects are undesirable. The decreasing coverage is caused by an increased number of movies that cannot be predicted due to an insufficient number of correlated users.
The user average plot, present in the upper chart, is the MAE of the user average algorithm calculated on the predicted test data set. The predicted test data set changes as the coverage changes, and these changes cause slight variations in the results of the user average method. The user average plot thus adds information about the type of movies being omitted as coverage decreases. This measure will prove more significant in later discussion.
Significance Weighting
Herlocker et al. [2] suggested that it may be advisable to make use of the number of mutually rated movies (MRM). Recall Figure 6, which shows a histogram of the average MRM for all users in the test data set. Herlocker suggested that weighting should be applied to decrease the correlation coefficient for neighbors with a small MRM. The suggested weighting scheme alters the Pearson R coefficient as follows:
$$P'_{a,u} = \begin{cases} P_{a,u}\cdot\dfrac{MRM}{H} & \text{if } MRM \le H \\ P_{a,u} & \text{if } MRM > H \end{cases}$$
where H is a given threshold (designated the Herlock threshold in this paper).
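A sketch of this weighting as applied in the appendix code (srated is an assumed vector of MRM counts against each neighbor, P the corresponding Pearson R coefficients):

Pw=P.*min(srated/H,1); % neighbors with MRM<=H are scaled down by MRM/H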
Figure 10 depicts the effect of increasing the Herlocker threshold with a Pearson R threshold
of 0.1.
Figure 10: MAE and coverage as a function of Herlocker threshold, Pearson threshold=0.1
As we can see, an increase in the threshold reduces coverage, while also slightly improving
the MAE.
In an attempt to improve Herlocker's method, a slightly altered weighting scheme was
applied:
$$P'_{a,u} = \begin{cases} P_{a,u}\cdot\left(\dfrac{MRM}{H}\right)^{2} & \text{if } MRM \le H \\ P_{a,u} & \text{if } MRM > H \end{cases}$$
This slight alteration has the effect of truncating users with low MRMs faster, while still making use of 'borderline' users.
Figure 11 depicts the effect of increasing the Herlocker threshold with a Pearson R threshold
of 0.1. Both the squared and non-squared approaches are shown.
Figure 11: MAE Vs. Coverage with two types of Herlocker thresholding
As we can see, the performance of the squared and non-squared approaches is relatively similar. In future discussion, only the non-squared approach will be used.
Another alteration to Herlocker’s threshold was attempted. Instead of a rigid threshold, a
threshold based on each users’ average MRM was calculated for each user. The threshold is
a constant multiplied by the average MRM for each user. The logic behind this approach is
that some users tend to rate more movies than others, and would therefore tend to have higher
MRM’s. Using a constant threshold for all users’ MRM does not take into account this
variance.
Figure 12 depicts the results obtained with this method.
Figure 12: MAE and coverage as a function of multiplication threshold (Users TH=3, Pearson TH=0.1)
As we can see, this method performs worse than Herlocker’s initial proposal and will be
discarded.
Mean Square Difference (MSD) Algorithm
The MSD algorithm is very similar to the Pearson R algorithm, except that it relies on mean
square distance as a distance metric, instead of Pearson R correlation. The distance between
two users is calculated as:
$$D_{a,u} = \frac{\sum_{i=1}^{m}\left[\left(r_{a,i}-\bar{r}_a\right)-\left(r_{u,i}-\bar{r}_u\right)\right]^{2}}{m}$$
where m denotes the number of MRM between users a and u. Once the distance vector $D_a$ is obtained, we calculate $P_a$ as:
 L  Da ,u
if Da ,u  L

Pa ,u   L
0
otherwise
where L is a threshold for the distance measure. The larger the threshold, the more distant
the users we take into account to produce a prediction. Once $P_a$ is obtained, the score is
calculated identically to the Pearson R algorithm. Note that all distances are positive, so the
absolute value on the distance measure is not necessary.
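A sketch of the weighting step (D is an assumed vector of mean square distances between user a and all other users):

w=(D<L).*(L-D)/L; % weight in [0,1]; 0 for users more distant than L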
Figure 13 depicts the effect of decreasing the L threshold on the MAE and coverage:
Figure 13: MAE Vs. Coverage for MSD algorithm (Users TH=3)
We note that this algorithm does not perform as well as Pearson R at any coverage.
We also note a phenomenon of improved user average MAE with decreased coverage. This effect was not present with the Pearson R algorithm, where user average performance was relatively constant with decreasing coverage. This decrease implies that the MSD algorithm tends to make predictions for items that are closer to the user's average.
Figure 14 may provide help in understanding this effect:
Figure 14: MAE Vs. Coverage of discarded items for user average algorithm
The figure depicts the MAE of the user average algorithm for items that were omitted with
the MSD algorithm for a given coverage. To understand this curve we need to examine it in
three parts. At high coverage (>0.95) very few items are present in the calculation, yielding very noisy results. At mid to high coverage (0.8-0.95) we note that the MSD algorithm actually omits items whose user average MAE is higher than that of the items it keeps. At lower coverages the MSD algorithm omits items with better user average MAE, so the MAE begins to decrease.
It is possible that movies that are difficult to predict with MSD are esoteric movies that few people rated. Since few people rated these movies, the MSD algorithm eventually omits them, and since the movies are esoteric, a user's score for them will tend to vary greatly from the user's mean score.
Genre Based Algorithms
The GroupLens database provides additional information about each movie in the database.
The following is a description of the information provided²:
u.item -- Information about the items (movies); this is a tab separated list of
    movie id | movie title | release date | video release date | IMDb URL |
    unknown | Action | Adventure | Animation | Children's | Comedy | Crime |
    Documentary | Drama | Fantasy | Film-Noir | Horror | Musical | Mystery |
    Romance | Sci-Fi | Thriller | War | Western |
The last 19 fields are the genres, a 1 indicates the movie is of that genre,
a 0 indicates it is not; movies can be in several genres at once.
Since the genres will be referenced by number in the ensuing discussion, the following table provides a summary of genre names and numbers.
 #  Name         #  Name          #  Name        #  Name
 1  Unknown      6  Comedy       11  Film-Noir   16  Sci-Fi
 2  Action       7  Crime        12  Horror      17  Thriller
 3  Adventure    8  Documentary  13  Musical     18  War
 4  Animation    9  Drama        14  Mystery     19  Western
 5  Children's  10  Fantasy      15  Romance

Table 2: Summary of genre numbers
The genre algorithm presented in this paper attempts to make use of this additional
information to improve the prediction accuracy. Only genre information will be used – dates
are omitted.
² The description is copied verbatim from the README file downloaded with the database.
Genre Statistics
Average Number of User Ratings
Figure 15 depicts average scores and the number of user ratings for each genre:
Figure 15: Average scores and number of user ratings for all genres
In order to produce the average number of user ratings for each genre all entries in the
database are taken into account. For each entry, counters for all genres present in the movie
are increased. Once all entries have been searched, the counters are divided by the number of
users.
The average score is calculated in a similar manner. Initially, a new matrix of user ratings is
produced. This new matrix subtracts the users’ average score from all their ratings. Once the
new matrix is obtained, all entries are searched again. For each item a counter for each genre
present is increased, and the value from the calculated matrix is added to an accumulator.
Note that since user averages were subtracted from the matrix, the accumulators may contain
negative values. Once all entries have been searched, the accumulator for each genre is
divided by the counter for that genre.
Not surprisingly, the most popular genres are action, comedy and drama. The negative average scores received by both comedy and action imply that although these genres are popular, they are on average rated below users' mean scores. Drama is the only genre that is both widely rated and liked.
Base Algorithm
Once we obtain genre information, we can construct another base algorithm. We start by calculating how much a user likes or dislikes each genre. To do so we calculate a vector $R = r_a - \bar{r}_a$ over all movies the user rated. For each movie, we add the score in vector R to all genres associated with the movie. After going through all movies, we divide the sum obtained for each genre by the number of movies in which the genre appeared. This calculation is similar to the one performed to obtain average genre scores for the entire database, except that it is performed per user. We thus obtain, for every user, a vector with an average score for each genre. For future reference this vector is designated $G_a$, and the matrix for all users G. A positive score indicates that the user likes a particular genre.
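A sketch of this computation, following the basegen routine in Appendix I (dev is the mean-centred rating matrix from earlier sketches and g the movie-by-genre 0/1 matrix):

ugen=dev*g;                  % per user, sum of centred scores in each genre
ucnt=(dev~=0)*g;             % per user, number of rated movies in each genre
G=ugen./(ucnt+(ucnt==0));    % average per genre; guard against division by 0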
To produce a prediction for a particular item, the base algorithm averages the user's scores for the genres present in the movie and adds the result to the user's average. The following equation shows this calculation:
$$S_{a,i} = \bar{r}_a + \frac{1}{n}\sum_{g \in \text{genres}(i)} G_{a,g}$$
where n is the number of genres present in the movie and the sum runs over those genres.
This approach yields the following results for the base and test data sets:
MAE     Coverage [%]
0.836   98.72

Table 3: Results for the genre base algorithm
Recalling Table 1, it should be noted that this approach provides an improvement over the
user average base algorithm. This implies that genre information is useful in making
predictions.
Genre Algorithm
In this section a novel prediction algorithm based on genre information is proposed. This approach makes use of other users' scores for the movie as well as genre information. First, the matrix G is calculated for all users. This matrix is then used to calculate the mean square distance (MSD) between users, as shown previously. Once distances for a particular user are obtained, they are used to weigh the scores other users gave the movie.
Figure 16 depicts the performance of the genre algorithm. To trade coverage for a lower MAE, the distance threshold, L, is decreased.
Figure 16: MAE Vs. Coverage for genre algorithm using MSD (Users TH=3)
The performance of this algorithm is better than that of Pearson R at low coverages, and comparable at high coverages. Furthermore, the calculations necessary to produce predictions are much simpler. In order to produce each prediction with the Pearson R algorithm, it is necessary to perform calculations on the entire database containing
1682·943 ≈ 1.6·10⁶ items. If the algorithm used is efficient, the calculation complexity may be reduced to the number of entries, 100,000. To produce a single result with the genre algorithm, only 943·19 ≈ 18,000 values need to be taken into account. In practice, predicting with
the genre algorithm proved to be much faster.
As with the MSD algorithm, a phenomenon of decreased user average MAE with decreased coverage is encountered. The three areas observed in the MSD curve are present in the genre algorithm as well, and the explanation for this phenomenon is identical to that of MSD. An attempt to provide further explanation by analyzing average genre scores of omitted items proved inconclusive.
Figure 17 depicts the user average MAE for omitted items as a function of coverage.
Figure 17: MAE Vs. Coverage of discarded items for user average algorithm
Another implementation of the genre algorithm may rely on Pearson R as the distance
measure. The result obtained with this approach is depicted in Figure 18:
Figure 18: MAE Vs. Coverage for genre algorithm using Pearson correlation (Users TH=3, Pearson TH=0.1)
As we can see, the performance of this approach is generally worse than that of the MSD approach and does not improve with reduced coverage. The approach will therefore be discarded from the ensuing discussion.
Hybrid Genre Algorithm
While the genre algorithm makes use of user information to calculate the distance between
the user and other users, it does not make use of the genres the users themselves like. In this
section a hybrid algorithm that takes into account the weighted score by other users as well as
the weighted score of the genres in the movie itself is proposed. The following equation
shows how the prediction is calculated:
$$S_{a,i} = \bar{r}_a + \alpha\,\frac{\sum_{u=1}^{n}\left(r_{u,i}-\bar{r}_u\right)P_{a,u}}{\sum_{u=1}^{n}\left|P_{a,u}\right|} + \left(1-\alpha\right)\frac{1}{k}\sum_{g \in \text{genres}(i)} G_{a,g}$$
where α designates the weight ratio between other users' ratings of the movie and the user's own average ratings of the movie's genres, and k is the number of genres present in the movie. When α=0 the algorithm reduces to the base genre algorithm, and when α=1 the proposed genre algorithm is used alone.
This approach produces further improvement. Figure 19 shows the MAE as a function of α for two thresholds.
Figure 19: MAE Vs. α for hybrid genre algorithm (Users TH=3; L=0.9, coverage=98.1%; L=0.5, coverage=71.3%)
An examination of both curves shows that they have an absolute minimum at approximately α=0.65. At this α, other users' opinions carry most of the weight, but a strong weight is still given to the genres present in the movie.
To test the robustness of the optimal α, the same plot was obtained for a different instance of the base and test data sets.
Figure 20 depicts the result obtained:
Figure 20: MAE Vs. α for hybrid genre algorithm, different data set (Users TH=3; L=0.9, coverage=95.5%; L=0.5, coverage=78.1%)
As can be seen, the optimal α is still obtained around 0.65.
Figure 21 compares the genre algorithm to the hybrid genre algorithm with α=0.65.
Figure 21: MAE Vs. Coverage of genre and hybrid genre (α=0.65) algorithms
The figure shows that the hybrid genre algorithm performs better than the genre algorithm at all coverages. The additional complexity of the hybrid algorithm is negligible.
Algorithm Comparison
MAE Vs. Coverage For All Algorithms
Figure 22 shows a comparison of the main algorithms discussed in this paper.
Figure 22: MAE Vs. Coverage for hybrid genre, MSD, and Pearson R algorithms
It is clear that the hybrid genre algorithm performs better at low coverages, while Pearson R
performs better at high coverages. It should be noted that the hybrid genre algorithm is the
only algorithm able to significantly improve its performance with decreased coverage.
Database Alteration
In this section the robustness of the results will be tested against several alterations to the data
set.
Instantiation
As noted in earlier discussion, results may differ if a different instance of division into base
and test data sets is used. Table 4 shows results obtained with base algorithms for new
division of base/test data sets:
Method               MAE     Coverage [%]
All Average          0.948   100
Movie Average        0.832   99.8
User Average         0.85    100
User Movie Average   0.785   99.8
Genre Base           0.85    98.7

Table 4: Base algorithm results for new instance
As Table 4 shows, the user average method produces results similar to the genre base method. This suggests that it is more difficult for genre based algorithms to produce accurate results for this division of test and base data sets.
Figure 23 depicts a comparison of the main algorithms for the new instance of test and base
data sets:
Figure 23: MAE Vs. Coverage for Hybrid Genre, MSD, and Pearson R algorithms
In this instance, the Pearson R algorithm seems superior at most coverages. The hybrid genre algorithm still manages to improve accuracy with reduced coverage, yielding the lowest MAE at coverages below 0.65. The general behavior of the three algorithms is still maintained.
User Truncation
In this section 4/5 of the users are randomly removed from the database. The test data set comprises 5 scores for 20% of the remaining users, yielding 190 entries. Since the data
sets have been altered, the results for base algorithms are recalculated. As Table 5 shows, it
is generally simpler to predict for the new data sets:
Method               MAE     Coverage [%]
All Average          0.989   100
Movie Average        0.818   99.5
User Average         0.807   100
User Movie Average   0.750   99.5
Genre Base           0.807   99.5

Table 5: Base algorithm results for database with 1/5 of users
As Table 5 indicates, genre information is once more at a disadvantage since it fails to
improve the accuracy of the user average method.
Figure 24 shows a comparison of the main algorithms discussed in this paper with respect to
the new data sets.
Figure 24: MAE Vs. Coverage for tested algorithms with 1/5 of users in data sets
The results of this section are somewhat of an anomaly with respect to all other results. The MSD algorithm seems to perform best at almost all coverages, while Pearson R and hybrid genre switch roles with regard to performance at low and high coverages.
Movie Truncation
In this section 2/3 of the movies are removed from the database, leaving 336 movies. The test data set comprises 5 scores for 10% of the remaining users, producing 470 entries.
Since the data sets have been altered, the results for base algorithms are recalculated. As
Table 6 shows, it is generally simpler to predict for the new data sets:
Method               MAE     Coverage [%]
All Average          0.931   100
Movie Average        0.786   100
User Average         0.855   100
User Movie Average   0.771   100
Genre Base           0.884   98.9

Table 6: Base algorithm results for database with 1/3 of movies
Once more, the genre algorithm is at a disadvantage, doing more harm than good with respect to the user average base algorithm.
Figure 25 shows a comparison of the main algorithms discussed in this paper with respect to
the new data sets.
Figure 25: MAE Vs. Coverage for tested algorithms with 1/3 of movies in data sets
Again, the results are generally in line with previous data sets. The Pearson R algorithm seems to perform very well consistently. One of the reasons behind this improved performance may be the truncation of movies with generally fewer ratings. Recall Figure 4, which indicated that movies with higher IDs tended to have fewer ratings since they were introduced to the database later.
Figure 26 shows a histogram of the number of scores given to each movie:
Figure 26: Histogram of number of scores given to each movie
Comparing this figure to Figure 2, we learn that most movies in the new data set tend to have more ratings. The median number of scores for each movie has risen from 57 to 129. This reduced sparseness of the database may improve the performance of the Pearson R algorithm.
Database Reduction
The purpose of this section is to test the performance of the algorithms on a sparser database. First, 2/3 of the entries for each user are randomly removed. From the remaining entries, one score from 20% of the users is used as the test data set, yielding 189 entries. Since the data sets have been altered, the results for base algorithms are recalculated.
Table 7 summarizes the results obtained with the base algorithms:
Method               MAE     Coverage [%]
All Average          0.998   100
Movie Average        0.868   98.9
User Average         0.807   100
User Movie Average   0.809   98.9
Genre Base           0.778   98.4

Table 7: Base algorithm results for database with 1/3 of entries
Figure 27 shows a comparison of the main algorithms discussed in this paper with respect to
the new data sets.
Figure 27: MAE Vs. Coverage for tested algorithms with 1/3 of entries
Results obtained with the diluted database show the hybrid genre algorithm to be superior at most coverages. This superiority may arise from the fact that the hybrid genre algorithm uses as few as 19 features to represent each user. These 19 features may stabilize quickly, even with a very sparse data set.
Conclusions
Several prediction algorithms were analyzed using the GroupLens database. Existing
methods such as Pearson R and MSD were presented and compared to proposed algorithms
based on genre information. The Pearson R algorithm performs quite well under variations
of the database. It is always able to yield an improvement over trivial methods. The best of the proposed algorithms, hybrid genre, was shown to generally excel at low coverages and with increased sparseness. Besides performing better on specific problem sets, the hybrid genre
environments, the hybrid genre algorithm may prove to be very practical in real world
applications. One practical approach may be to intentionally dilute the database in order to
reduce computational complexity.
Further research may be conducted to improve the performance of the hybrid genre
algorithm. One direction may be to collapse several genres into one, or to omit unpopular
genres altogether. This direction was shown to improve results while working on this paper,
but was not presented. The logic behind this approach is that unpopular genres are given less of an opportunity to affect the results for popular genres, thereby reducing 'noise'.
Another direction may be the complete opposite of this one. Several genres may be
designated as new genres when appearing together. For example, people may have a specific
affection for action-comedies.
References
1. Upendra Shardanand, "Social Information Filtering for Music Recommendation", M.Eng. Thesis, MIT EECS; also TR-94-04, Learning and Common Sense Group, MIT Media Laboratory, 1994.
2. Herlocker, J.L., Konstan, J.A., Borchers, A., Riedl, J., "An Algorithmic Framework for Performing Collaborative Filtering", Proceedings of the 1999 Conference on Research and Development in Information Retrieval, 1999.
Appendix I: Implementation
Introduction
All results in this paper were calculated and plotted using Matlab 5.3. To improve efficiency, databases were defined as sparse matrices, and for loops were avoided as much as possible by using lookup tables and matrix multiplication.
The Pearson R algorithms were rather slow to produce results. An attempt to compile m-files into mex files showed no significant improvement and was discarded. The following lists the Matlab code used to generate the results in this paper.
Matlab Code
General
List2Mat.m
This m-file is used to convert a list of entries into a sparse matrix.
function mat=list2mat(list)
% USAGE: mat=list2mat(list)
% This function converts a list (UID, MID, R) into a sparse matrix U x M = R
mat=0;
mat=sparse(mat);
l=length(list);
h=waitbar(0,'Processing List2Mat');
for i=1:l,
if mod(i,2000)==0 waitbar(i/l,h); end
mat(list(i,1),list(i,2))=list(i,3);
end
close(h)
Mkdata.m
This m-file is used to generate base and test data sets.
function [base_mat, test_list]=mkdata(mat_base,varargin)
% This function resizes an input sparse matrix and divides
% the data into a base matrix (UxM<-r) and a test list (uid mid r).
% Base and test data are divided according to parameters r and tnum.
% USAGE: [base_mat, test_list]=mkdata(mat_base,r,tnum,mnum,unum)
% base_mat-  base matrix
% test_list- test list
% mat_base-  input sparse matrix containing all data
% r-         ratio of users tested (default = 0.1)
% tnum-      number of movies tested per user (default = 5)
% mnum-      number of movies in output data (default = size of mat_base)
% unum-      number of users in output data (default = size of mat_base)
r=0.1; % percent of users tested
[unum, mnum]=size(mat_base); % default matrix size, untruncated
tnum=5;
if nargin>=2
r=varargin{1};
end
if nargin>=3
tnum=varargin{2};
end
if nargin>=4
mnum=varargin{3};
end
if nargin==5
unum=varargin{4};
end
if (nargin<1 | nargin>5)
error('Improper number of input arguments');
end
base_mat=mat_base(1:unum,1:mnum); % truncate matrix if necessary
n_u_tst=round(r*unum); % number of users tested
p=randperm(unum);
% randomize user index
for i=1:n_u_tst,
idx=find(base_mat(p(i),:)~=0); % indices of rated movies of user
ns=length(idx);
% number of scored items
rs=randperm(ns); % randomize scored movie index
for j=1:tnum, % select tnum random movies for each user
test_list((i-1)*tnum+j,:)=[p(i) idx(rs(j)) mat_base(p(i),idx(rs(j)))];
base_mat(p(i),idx(rs(j)))=0;
end
end
[tmp,index]=sort(test_list,1); % sort _test_list_ according to user number
test_list=test_list(index(:,1),:);
Database Statistics
This m-file is used to generate general statistics about the GroupLens database.
Stats.m
% This code is used to generate statistics for the entire MovieLens database
[uid, mid, r]=textread('u.data','%u %u %u %*u');
list=[uid mid r];
clear uid mid r
rnum=length(list);
mat_base= list2mat(list);
% load data_load <- used to load file and avoid processing of above...
% calculate mean and median
fprintf('\nMean Value is %1.2f, median is %1.2f\n',mean(list(:,3)),median(list(:,3)));
% plot score histogram
hist(list(:,3),[1 2 3 4 5]);
xlabel('Score')
% number of movie ratings, histogram and mean
mr=full(sum(mat_base~=0));
fprintf('The average number of ratings per movie is %2.1f\n',mean(mr));
hist(mr,50)
xlabel('Number of Scores for Movie')
% number of user ratings, histogram and mean
mu=full(sum(mat_base~=0,2));
fprintf('The average number of ratings per user is %2.1f\n',mean(mu));
hist(mu,50)
xlabel('Number of Scores Per User')
% view score matrix
spy(mat_base)
xlabel('Movie ID')
ylabel('User ID')
Base Algorithms
Basepr.m
This m-file is used to calculate results for the base algorithms.
function [ave_pred, u_pred, m_pred, um_pred] =basepr(mat_base,list_test);
% This function takes a rating matrix (UxM,R) and a list of observations (uid, mid, r)
% and returns a list of predictions using several methods
% USAGE: [ave_pred, u_pred, m_pred, um_pred]=basepr(mat_base,list_test)
% ave_pred- average score in rating matrix
% u_pred-   average user score predicted for each movie
% m_pred-   average movie score predicted for each movie (0 if not available)
% um_pred-  average user and movie score predicted for each movie (0 if not available)
[unum,mnum]=size(mat_base);
user_mean=sum(mat_base,2)./sum(mat_base~=0,2);
% vector of mean scores of each user
rated=sum(mat_base~=0)~=0;
% is the movie rated at all?
mov_ave= rated.*sum(mat_base)./(sum(mat_base~=0)+~rated);
% mean score of each movie, 0 if not rated
mrated=rated(list_test(:,2)); % vector designating rated movies for list_test
mrated_i=find(mrated~=0); % indices of movies predicted with mean score of movie
ave_pred=full(sum(mov_ave)./sum(mov_ave~=0));
% average rating of all movies
u_pred=full(user_mean(list_test(:,1)));
% predict movie score based on user's average scores
m_pred=full(mov_ave(list_test(:,2)));
% predict movie score based on movie's average scores
mat_ave=(mat_base-repmat(user_mean,1,mnum)).*(mat_base~=0);
% rating matrix minus mean of each user, 0 for unrated movies
mat_ave=rated.*sum(mat_ave)./(sum(mat_ave~=0)+~rated);
um_pred=full(user_mean(list_test(:,1))+mat_ave(list_test(:,2))');
r=list_test(:,3);
% output mean and STD for each method...
fprintf('\nAll Average\tMAE=%1.3f\tstd=%1.3f\n',mean(abs(r-ave_pred)),std(r-ave_pred))
fprintf('User average\tMAE=%1.3f\tstd=%1.3f\n',mean(abs(u_pred-r)),std(u_pred-r))
fprintf('Movie average\tMAE=%1.3f\tstd=%1.3f\tPredicted %2.1f\n',...
full(sum(mrated.*abs(m_pred-r'))./sum(mrated)),full(std(m_pred(mrated_i)-r(mrated_i)')),...
(length(r)-length(find(m_pred==0)))/length(r)*100);
fprintf('UM average\tMAE=%1.3f\tstd=%1.3f\tPredicted %2.1f\n',...
full(sum(mrated'.*abs(um_pred-r))./sum(mrated)),full(std(um_pred(mrated_i)-r(mrated_i))),...
(length(r)-length(find(m_pred==0)))/length(r)*100);
Pearson r Algorithm
Pearsnn.m
This m-file is used to produce predictions using the Pearson R algorithms. The m-file also outputs several statistics to the standard output.
function list_pred=pearsnn(mat_base,list_test, varargin);
% This function takes a rating matrix (UxM,R) and a list of observations (uid, mid, r)
% and returns a list of predictions using the pearson r coefficient.
%
% USAGE: list_pred=pearsnn(mat_base,list_test,pears_th, users_th, herlck_th)
% list_pred- a list of [uid mid r pred u_ave]
%
% mat_base- base dataset in the form of a sparse matrix
% list_test- test list in the form of [uid mid r], not presumed to be sorted
% Optional Parameters:
% pears_th- pearson threshold (default=0.1)
% users_th- minimal number of correlated users to make a prediction (default=3)
% herlck_th- number of matching rated movies between users (default=25)
herlck_th=25;
pears_th=0.1; % default pearson r threshold
users_th=3;
% default minimum number of 'affecting users'
if nargin>=3
pears_th=varargin{1};
end
if nargin>=4
users_th=varargin{2};
end
if nargin==5
herlck_th=round(varargin{3});
end
if (nargin<2 | nargin>5)
error('Improper number of input arguments');
end
h=waitbar(0,'Processing Pears...');
[tmp,index]=sort(list_test,1); % sort _list_test_ according to user number
list_test=list_test(index(:,1),:);
[unum,mnum]=size(mat_base);
num_test=size(list_test,1);
prev_user=0;
count=1;
user_mean=sum(mat_base,2)./sum(mat_base~=0,2); % vector of mean score of each user
mat_ave=(mat_base-repmat(user_mean,1,mnum)).*(mat_base~=0);
% rating matrix minus mean of each user, 0 for unrated movies
for i=1:num_test, % calculate expected score for each test item
user=list_test(i,1);
mid=list_test(i,2);
if prev_user~=user; % the distance is recalculated whenever a new user is encountered
prev_user=user;
ref=repmat(mat_ave(user,:),unum,1);
% repeated matrix of tested user minus mean, 0 for unrated movies
rated=(mat_ave~=0).*(ref~=0); % mutually rated movies...
srated=sum(rated,2); % number of mutually rated movies
cv=sqrt(sum((mat_ave.^2).*rated,2).*sum((ref.^2).*rated,2))+(sum(rated,2)==0);
% cov between the _user_ and others, 1 if no match (to avoid division by 0)
pr=(sum(rated,2)~=0).*sum(ref.*mat_ave.*rated,2)./cv;
% pearson r coef. between _user_ and others, if there is no overlap pr=0
% should be [-1,1], higher absolute values signify stronger correlation
% herlck_th=3*full(round(sum(srated)./sum(srated~=0))); <- unused 'adaptive herlck'
pr=srated/herlck_th.*pr.*(srated<=herlck_th)+pr.*(srated>herlck_th);
%pr=(srated/herlck_th).^2.*pr.*(srated<=herlck_th)+pr.*(srated>herlck_th);
% add herlock, used to lessen the influence of users with small srated...
w=(abs(pr)>=pears_th).*sign(pr).*(abs(pr)-pears_th)/(1-pears_th);
% weight used instead of pearson r coef, used to incorporate thresholding
end
mscr=full(mat_ave(:,mid)); % vector of movie scores minus user average
sumw=abs(w)'*(mscr~=0); % sum of w for correlated users who scored movie
rel_users=full(sum((w~=0).*(mscr~=0)));
%
if sumw~=0 % predict only if other matching users rated the movie...
if rel_users>users_th
score= user_mean(user)+full(w'*mscr./sumw); % expected score
score=(score<=1)+score*((score<5)&(score>1))+5*(score>=5); % truncate exceptional values
list_pred(count,:)=[user mid list_test(i,3) score user_mean(user)];
count=count+1;
end
end
if mod(i,25)==0 waitbar(i/num_test,h); end
end % close the for loop over test items
close(h)
if exist('list_pred')
fprintf('\nPearson TH=%.3f, Herlock TH=%i, Users TH=%i\n',pears_th, herlck_th, users_th);
fprintf('Coverage=%.4f, MAE Average=%.4f, MAE Pearson=%.4f\n', ...
(count-1)/num_test, ...
mean(abs(list_pred(:,3)-list_pred(:,5))), ...
mean(abs(list_pred(:,3)-list_pred(:,4))));
else
fprintf('No predictions made');
end
MSD
Msd.m
This m-file is used to produce predictions using the MSD algorithm. The m-file also outputs several statistics to the standard output.
function [list_pred, list_npred]=msd(mat_base,list_test, varargin);
% This function takes a rating matrix (UxM,R) and a list of observations (uid, mid, r)
% and returns a list of predictions using the mean square difference.
% USAGE: [list_pred, list_npred]=msd(mat_base,list_test, L, users_th)
% list_pred-  a list of predicted values [uid mid r pred u_ave]
% list_npred- a list of values that were not predicted [uid mid r u_ave]
% mat_base-   base dataset in the form of a sparse matrix
% list_test-  test list in the form of [uid mid r], not presumed to be sorted
% Optional Parameters:
% L-          msd threshold (default = 0.8)
% users_th-   minimal number of related users to make a prediction (default=3)
users_th=3;
L=.8;
if nargin>=3
L=varargin{1};
end
if nargin>=4
users_th=varargin{2};
end
if (nargin<2 | nargin>4)
error('Improper number of input arguments');
end
h=waitbar(0,'Processing msd...');
[tmp,index]=sort(list_test,1); % sort _list_test_ according to user number
list_test=list_test(index(:,1),:);
[unum,mnum]=size(mat_base);
num_test=size(list_test,1);
prev_user=0;
count=1;
countn=1;
user_mean=sum(mat_base,2)./sum(mat_base~=0,2); % vector of mean score of each user
mat_ave=(mat_base-repmat(user_mean,1,mnum)).*(mat_base~=0);
% rating matrix minus mean of each user, 0 for unrated movies
for i=1:num_test, % calculate expected score for each test item
user=list_test(i,1);
mid=list_test(i,2);
if prev_user~=user; % the distance is recalculated whenever a new user is encountered
prev_user=user;
ref=repmat(mat_ave(user,:),unum,1);
% repeated matrix of tested user minus mean, 0 for unrated movies
rated=(mat_ave~=0).*(ref~=0); % mutually rated movies...
srated=sum(rated,2); % number of mutually rated movies
pr=sqrt(sum((mat_ave-ref).^2.*rated,2)./(srated+(srated==0)));
w=(srated~=0).*(pr<L).*(L-pr)/L;
% weight used instead of pearson r coef, used to incorporate thresholding
end
mscr=full(mat_ave(:,mid)); % vector of movie scores minus user average
sumw=w'*(mscr~=0); % sum of w for correlated users who scored movie
rel_users=full(sum((w~=0).*(mscr~=0)));
%
if sumw~=0 % predict only if other matching users rated the movie...
if rel_users>users_th
score= user_mean(user)+full(w'*mscr./sumw); % expected score
score=(score<=1)+score*((score<5)&(score>1))+5*(score>=5); % truncate exceptional values
list_pred(count,:)=[user mid list_test(i,3) score user_mean(user)];
count=count+1;
else
list_npred(countn,:)=[user mid list_test(i,3) user_mean(user)];
countn=countn+1;
end
end % close "if sumw~=0"
if mod(i,25)==0, waitbar(i/num_test,h); end
end
close(h)
if exist('list_pred','var')
fprintf('\nL TH=%.3f, Users TH=%i\n',L, users_th);
fprintf('Coverage=%.4f, MAE Average=%.4f, MAE msd=%.4f\n', ...
(count-1)/num_test, ...
mean(abs(list_pred(:,3)-list_pred(:,5))), ...
mean(abs(list_pred(:,3)-list_pred(:,4))));
else
fprintf('No predictions made\n');
end
if exist('list_npred','var')
fprintf('omitted values, MAE Average=%1.4f\n',mean(abs(list_npred(:,3)-list_npred(:,4))))
end
Genre
Basegen.m
This m-file is used to generate results for the baseline genre algorithm.
function g_pred=basegen(mat_base,list_test, g);
% base genre prediction algorithm
[tmp,index]=sort(list_test,1); % sort _list_test_ according to user number
list_test=list_test(index(:,1),:);
[unum,mnum]=size(mat_base);
num_test=size(list_test,1);
user_mean=sum(mat_base,2)./sum(mat_base~=0,2); % vector of mean score of each user
mat_ave=(mat_base-repmat(user_mean,1,mnum)).*(mat_base~=0);
% rating matrix minus mean of each user, 0 for unrated movies
ugen=mat_ave*g; % create a matrix of average user scores for each genre
ugenum=(mat_ave~=0)*g; % count the number of rated movies in genre for normalization
ugenum1=ugenum+(ugenum==0); % avoid divide by 0
ugen=ugen./ugenum1; % normalize
% modify the user average by adding an average of the user's average rating for
% each genre in the movie
gen_ave=ugen(list_test(:,1),:).*g(list_test(:,2),:);
% multiply users' average genre scores with genres in the movie
sgen=sum(gen_ave~=0,2)~=0;
% avoid divide by 0 for movies that can't be predicted
idx=find(sgen);
g_pred=full(user_mean(list_test(:,1))+sum(gen_ave,2)./(sum(gen_ave~=0,2)+~sgen));
% predict user's rating based solely on his average ratings and the movie's genres
r=list_test(:,3);
fprintf('Genre average\tMAE=%1.3f\tCoverage=%2.2f\n',...
mean(abs(g_pred(idx)-r(idx))),length(idx)./num_test*100);
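A minimal call sketch, assuming mat_base, list_test, and the movie-genre matrix g are already in the workspace:
g_pred=basegen(mat_base, list_test, g);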
Peargen.m
This m-file is used to produce predictions using the Pearson genre algorithm. The m-file also outputs several statistics to the standard output.
function list_pred=peargen(mat_base,list_test, g, varargin);
% This function takes a rating matrix (UxM,R), a list of observations (uid, mid, r),
% and a list of genres and returns a list of predictions using the pearson r coefficient.
%
% USAGE: list_pred=peargen(mat_base,list_test, g, pears_th, users_th, herlck_th)
% list_pred- a list of [uid mid r pred u_ave]
% mat_base- base dataset in the form of a sparse matrix
% list_test- test list in the form of [uid mid r], not presumed to be sorted
% g- list of genres
% Optional Parameters:
% pears_th- pearson threshold (default=0.1)
% users_th- minimal number of correlated users to make a prediction (default=3)
% herlck_th- number of matching rated movies between users (default=30)
herlck_th=30;
pears_th=0.1; % default pearson r threshold
users_th=3; % default minimum number of 'affecting users'
if nargin>=4
pears_th=varargin{1};
end
if nargin>=5
users_th=varargin{2};
end
if nargin==6
herlck_th=round(varargin{3});
end
if (nargin<3 || nargin>6)
error('Improper number of input arguments');
end
h=waitbar(0,'Processing Peargen...');
[tmp,index]=sort(list_test,1); % sort _list_test_ according to user number
list_test=list_test(index(:,1),:);
[unum,mnum]=size(mat_base);
num_test=size(list_test,1);
prev_user=0;
count=1;
user_mean=sum(mat_base,2)./sum(mat_base~=0,2); % vector of mean score of each user
mat_ave=(mat_base-repmat(user_mean,1,mnum)).*(mat_base~=0);
% rating matrix minus mean of each user, 0 for unrated movies
ugen=mat_ave*g; % create a matrix of average user scores for each genre
ugenum=(mat_ave~=0)*g; % count the number of rated movies in genre for normalization
ugenum1=ugenum+(ugenum==0); % avoid divide by 0
ugen=ugen./ugenum1; % normalize
for i=1:num_test, % calculate expected score for each test item
user=list_test(i,1);
mid=list_test(i,2);
if prev_user~=user; % the distance is recalculated whenever a new user is encountered
prev_user=user;
ref=repmat(ugen(user,:),unum,1);
% repeated matrix of the tested user's genre preferences
rated=(ugen~=0).*(ref~=0); % mutually rated genres...
srated=sum(rated,2); % number of mutually rated genres
cv=sqrt(sum((ugen.^2).*rated,2).*sum((ref.^2).*rated,2))+(sum(rated,2)==0);
% cov between the _user_ and others, 1 if no match (to avoid division by 0)
pr=(sum(rated,2)~=0).*sum(ref.*ugen.*rated,2)./cv;
% pearson r coef. between _user_ and others, if there is no overlap pr=0
% should be [-1,1], higher absolute values signify stronger correlation
pr=(srated/herlck_th).*pr.*(srated<=herlck_th)+pr.*(srated>herlck_th);
% add herlock, used to lessen the influence of users with small srated...
w=(abs(pr)>=pears_th).*sign(pr).*(abs(pr)-pears_th)/(1-pears_th);
% weight used instead of pearson r coef, used to incorporate thresholding
end
mscr=full(mat_ave(:,mid)); % vector of movie scores minus user average
sumw=abs(w)'*(mscr~=0); % sum of w for correlated users who scored movie
rel_users=full(sum((w~=0).*(mscr~=0)));
%
if sumw~=0 % predict only if other matching users rated the movie...
if rel_users>users_th
score= user_mean(user)+full(w'*mscr./sumw); % expected score
score=(score<=1)+score*((score<5)&(score>1))+5*(score>=5); % truncate exceptional values
list_pred(count,:)=[user mid list_test(i,3) score user_mean(user)];
count=count+1;
end
end % close "if sumw~=0"
if mod(i,25)==0, waitbar(i/num_test,h); end
end
close(h)
fprintf('\nPearson TH=%.3f, Herlock TH=%i, Users TH=%i\n',pears_th, herlck_th, users_th);
fprintf('Coverage=%.4f, MAE Average=%.4f, MAE Pearson=%.4f\n', ...
(count-1)/num_test, ...
mean(abs(list_pred(:,3)-list_pred(:,5))), ...
mean(abs(list_pred(:,3)-list_pred(:,4))));
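A minimal call sketch, assuming the same workspace variables as above; the values passed are the defaults:
list_pred=peargen(mat_base, list_test, g, 0.1, 3, 30); % pears_th, users_th, herlck_th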
Genmsd.m
This m-file is used to produce predictions using the genre MSD algorithm. The m-file also outputs several statistics to the standard output.
function [list_pred, list_npred]=genmsd(mat_base,list_test, g, varargin);
% This function takes a rating matrix (UxM,R), a list of observations (uid, mid, r),
% and a list of genres and returns a list of predictions using the mean square distance.
%
% USAGE: [list_pred, list_npred]=genmsd(mat_base,list_test, g, L, users_th)
% list_pred- a list of predicted values [uid mid r pred u_ave]
% list_npred- a list of values that were not predicted [uid mid r u_ave]
% mat_base- base dataset in the form of a sparse matrix
% list_test- test list in the form of [uid mid r], not presumed to be sorted
% g- list of genres
% Optional Parameters:
% L- rms threshold (default=0.4)
% users_th- minimal number of correlated users to make a prediction (default=10)
users_th=10;
L=.4;
if nargin>=4
L=varargin{1};
end
if nargin>=5
users_th=varargin{2};
end
if (nargin<3 || nargin>5)
error('Improper number of input arguments');
end
h=waitbar(0,'Processing genmsd...');
[tmp,index]=sort(list_test,1); % sort _list_test_ according to user number
list_test=list_test(index(:,1),:);
[unum,mnum]=size(mat_base);
num_test=size(list_test,1);
prev_user=0;
count=1;
countn=1;
user_mean=sum(mat_base,2)./sum(mat_base~=0,2); % vector of mean score of each user
mat_ave=(mat_base-repmat(user_mean,1,mnum)).*(mat_base~=0);
% rating matrix minus mean of each user, 0 for unrated movies
ugen=mat_ave*g; % create a matrix of average user scores for each genre
ugenum=(mat_ave~=0)*g; % count the number of rated movies in genre for normalization
ugenum1=ugenum+(ugenum==0); % avoid divide by 0
ugen=ugen./ugenum1; % normalize
for i=1:num_test, % calculate expected score for each test item
user=list_test(i,1);
mid=list_test(i,2);
if prev_user~=user; % the distance is recalculated whenever a new user is encountered
prev_user=user;
ref=repmat(ugen(user,:),unum,1);
% repeated matrix of the tested user's genre preferences
rated=(ugen~=0).*(ref~=0); % mutually rated genres...
srated=sum(rated,2); % number of mutually rated genres
pr=sqrt(sum((ugen-ref).^2.*rated,2)./(srated+(srated==0))); % avoid divide by 0 when srated==0
w=(srated~=0).*(pr<L).*(L-pr)/L;
% weight used instead of pearson r coef, used to incorporate thresholding
end
mscr=full(mat_ave(:,mid)); % vector of movie scores minus user average
sumw=w'*(mscr~=0); % sum of w for correlated users who scored movie
rel_users=full(sum((w~=0).*(mscr~=0)));
%
if sumw~=0 % predict only if other matching users rated the movie...
if rel_users>users_th
score= user_mean(user)+full(w'*mscr./sumw); % expected score
score=(score<=1)+score*((score<5)&(score>1))+5*(score>=5); % truncate exceptional values
list_pred(count,:)=[user mid list_test(i,3) score user_mean(user)];
count=count+1;
else
list_npred(countn,:)=[user mid list_test(i,3) user_mean(user)];
countn=countn+1;
end
end % close "if sumw~=0"
if mod(i,25)==0, waitbar(i/num_test,h); end
end
close(h)
if exist('list_pred','var')
fprintf('\nL TH=%.3f, Users TH=%i\n',L, users_th);
fprintf('Coverage=%.4f, MAE Average=%.4f, MAE Genre=%.4f\n', ...
(count-1)/num_test, ...
mean(abs(list_pred(:,3)-list_pred(:,5))), ...
mean(abs(list_pred(:,3)-list_pred(:,4))));
if exist('list_npred','var')
fprintf('omitted values, MAE Average=%1.4f\n',mean(abs(list_npred(:,3)-list_npred(:,4))))
end
else
fprintf('No predictions made\n');
end
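A minimal call sketch with the default parameter values, assuming mat_base, list_test, and g as above:
[list_pred, list_npred]=genmsd(mat_base, list_test, g, 0.4, 10); % L, users_th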
Hgenmsd.m
This m-file is used to produce predictions using the hybrid genre algorithm. The m-file also outputs several statistics to the standard output.
function [list_pred, list_npred]=hgenmsd(mat_base,list_test, g, varargin);
% This function takes a rating matrix (UxM,R), a list of observations (uid, mid, r),
% and a list of genres and returns a list of predictions using the mean square difference.
% The algorithm is a hybrid of the genre algorithm, weighting in the user's
% own genre preferences.
%
% USAGE: [list_pred, list_npred]=hgenmsd(mat_base,list_test, g, L, rat, users_th)
% list_pred- a list of predicted values [uid mid r pred u_ave]
% list_npred- a list of values that were not predicted [uid mid r u_ave]
% mat_base- base dataset in the form of a sparse matrix
% list_test- test list in the form of [uid mid r], not presumed to be sorted
% g- list of genres
% Optional Parameters:
% L- msd threshold (default=0.7)
% rat- ratio between the weight given to other users and to the user (default=0.65)
% users_th- minimal number of correlated users to make a prediction (default=3)
% obtain parameters
users_th=3;
L=.7;
rat=.65;
if nargin>=4
L=varargin{1};
end
if nargin>=5
rat=varargin{2};
end
if nargin>=6
users_th=varargin{3};
end
if (nargin<3 || nargin>6)
error('Improper number of input arguments');
end
h=waitbar(0,'Processing hgenmsd...');
[tmp,index]=sort(list_test,1); % sort _list_test_ according to user number
list_test=list_test(index(:,1),:);
[unum,mnum]=size(mat_base);
num_test=size(list_test,1);
prev_user=0;
count=1;
countn=1;
user_mean=sum(mat_base,2)./sum(mat_base~=0,2); % vector of mean score of each user
mat_ave=(mat_base-repmat(user_mean,1,mnum)).*(mat_base~=0);
% rating matrix minus mean of each user, 0 for unrated movies
ugen=mat_ave*g; % create a matrix of average user scores for each genre
ugenum=(mat_ave~=0)*g; % count the number of rated movies in genre for normalization
ugenum1=ugenum+(ugenum==0); % avoid divide by 0
ugen=ugen./ugenum1; % normalize
for i=1:num_test, % calculate expected score for each test item
user=list_test(i,1);
mid=list_test(i,2);
if prev_user~=user; % the distance is recalculated whenever a new user is encountered
prev_user=user;
ref=repmat(ugen(user,:),unum,1);
% repeated matrix of the tested user's genre preferences
Movie Advisor Project
51/51
February 12, 2016
rated=(ugen~=0).*(ref~=0); % mutually rated genres...
srated=sum(rated,2); % number of mutually rated genres
pr=sqrt(sum((ugen-ref).^2.*rated,2)./(srated+(srated==0)));
w=(srated~=0).*(pr<L).*(L-pr)/L;
% weight used instead of pearson r coef, used to incorporate thresholding
end
mscr=full(mat_ave(:,mid)); % vector of movie scores minus user average
sumw=w'*(mscr~=0); % sum of w for correlated users who scored movie
rel_users=full(sum((w~=0).*(mscr~=0)));
%
if sumw~=0 % predict only if other matching users rated the movie...
if rel_users>users_th
idx=find(g(mid,:)~=0);
u_gen_mid=sum(ugen(user,idx))./length(idx);
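% u_gen_mid is the user's average (mean-removed) score over the genres of movie mid;
% the next line blends the collaborative (MSD-weighted) term with this own-preference
% term in the ratio rat:(1-rat)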
score= user_mean(user)+rat*full(w'*mscr./sumw)+(1-rat)*u_gen_mid; % expected score
score=(score<=1)+score*((score<5)&(score>1))+5*(score>=5); % truncate exceptional values
list_pred(count,:)=[user mid list_test(i,3) score user_mean(user)];
count=count+1;
else
list_npred(countn,:)=[user mid list_test(i,3) user_mean(user)];
countn=countn+1;
end
end % close "if sumw~=0"
if mod(i,25)==0, waitbar(i/num_test,h); end
end
close(h)
if exist('list_pred','var')
fprintf('\nL TH=%.3f, Users TH=%i, ratio=%.2f\n',L, users_th,rat);
fprintf('Coverage=%.4f, MAE Average=%.4f, MAE Hybrid Genre=%.4f\n', ...
(count-1)/num_test, ...
mean(abs(list_pred(:,3)-list_pred(:,5))), ...
mean(abs(list_pred(:,3)-list_pred(:,4))));
if exist('list_npred','var')
fprintf('omitted values, MAE Average=%1.4f\n',mean(abs(list_npred(:,3)-list_npred(:,4))))
end
else
fprintf('No predictions made\n');
end