www.ijecs.in International Journal Of Engineering And Computer Science ISSN:2319-7242 Volume 4 Issue 2 February 2015, Page No. 10351-10355

Survey on Evaluation of Recommender Systems

Shraddha Shinde1, Mrs. M. A. Potey2
1 Savitribai Phule Pune University, Department of Computer Engg. DYP Akurdi, India. shinde.shraddha97@gmail.com
2 Savitribai Phule Pune University, Department of Computer Engg. DYP Akurdi, India. mapotey@gmail.com

Abstract: Recommender Systems (RSs) appear in many modern applications that expose the user to huge collections of items; they help users decide on appropriate items and ease the task of finding preferred items in the collection. Recommender systems are gaining popularity in both the commercial and research communities, and many algorithms have been proposed for generating recommendations. A wide range of evaluation metrics is used for comparing these recommendation algorithms, and the literature on recommender system evaluation offers guidance on choosing the most appropriate metric among them. This paper discusses error-based metrics such as MAE and RMSE and ranking-based metrics such as Precision and Recall, and describes how these metrics are used to evaluate the performance of recommender systems.

Keywords: Recommender Systems, Evaluation, Metrics.

1. Introduction

Recommender Systems (RSs) can be found in many modern applications that expose the user to huge collections of items. They typically provide the user with a list of recommended items he or she might prefer, or predict how much the user might prefer each item. Recommender systems thus help users decide on appropriate items and ease the task of finding preferred items in the collection; they are designed to suggest the items that best fit the user's needs and preferences. Recommendations are typically produced in one of two ways: through collaborative or content-based filtering. Among recommendation techniques, the two most popular categories are content-based filtering and collaborative filtering; their combination is called the hybrid approach, and knowledge-based recommendation forms a third category. These systems deal with two types of entities, users and items. Collaborative filtering approaches build a model from a user's past behavior as well as from similar decisions made by other users, and then use that model to predict items (or ratings for items) that the user may be interested in. Content-based filtering approaches utilize a series of discrete characteristics of an item in order to recommend additional items with similar properties. Hybrid recommender systems combine the content-based and collaborative approaches.

Gediminas Adomavicius et al. [1] have presented an overview of the field of recommender systems and describe the current generation of recommendation methods, which are usually classified into three main categories: content-based, collaborative, and hybrid recommendation approaches. They also describe various limitations of current recommendation methods and discuss possible extensions that can improve recommendation capabilities and make recommender systems applicable to an even broader range of applications.
These extensions include, among others, an improved understanding of users and items, incorporation of contextual information into the recommendation process, support for multi-criteria ratings, and provision of more flexible and less intrusive types of recommendations.

Recommender systems are gaining popularity in both the commercial and research communities, and many algorithms have been proposed for providing recommendations. These algorithms typically perform differently in different domains and tasks. Therefore, it is important from a research perspective, as well as from a practical point of view, to be able to decide on an algorithm that matches the domain and the task of interest. The standard way to make such decisions is to compare a number of algorithms offline using some evaluation metric. Indeed, many evaluation metrics have been suggested for comparing recommendation algorithms, and most researchers who propose new recommendation algorithms also compare the performance of the new algorithm to a set of existing approaches. Such evaluations are typically performed by applying some evaluation metric that provides a ranking of the candidate algorithms (usually via numeric scores). Many evaluation metrics have been used to rank recommendation algorithms; some measure similar features, while others measure drastically different quantities. For example, the Root Mean Squared Error (RMSE) measures the distance between predicted preferences and true preferences over items, while Recall computes the portion of favored items that were suggested.

1.1 Evaluation Metrics

Evaluation metrics for recommender systems can be divided into four major classes: 1) predictive accuracy metrics, 2) classification accuracy metrics, 3) rank accuracy metrics, and 4) non-accuracy metrics.

Predictive accuracy or rating prediction metrics are very popular for the evaluation of non-binary ratings. They are most appropriate for usage scenarios in which an accurate prediction of the ratings for all items is of high importance. The most important representatives of this class are the mean absolute error (MAE), mean squared error (MSE), root mean squared error (RMSE), and normalized mean absolute error (NMAE). MSE and RMSE use the squared deviations and thus emphasize larger errors in comparison to the MAE metric. MAE and RMSE describe the error in the same units as the computed values, while MSE yields squared units. NMAE normalizes the MAE metric to the range of the respective rating scale in order to make results comparable among recommenders with varying rating scales. These predictive accuracy error metrics are frequently used for the evaluation of recommender systems since they are easy to compute and understand. They are well studied and are also applied in many contexts other than recommender systems. However, they do not necessarily correspond directly to the most popular usage scenarios for recommender systems.
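To make these definitions concrete, the following minimal Python sketch computes MAE, MSE, RMSE, and NMAE from parallel lists of predicted and true ratings. The function name and the 1-5 rating bounds used for NMAE are illustrative assumptions, not taken from the surveyed papers.

from math import sqrt

def predictive_accuracy(predicted, actual, r_min=1.0, r_max=5.0):
    # Predictive accuracy metrics over parallel lists of ratings.
    errors = [p - a for p, a in zip(predicted, actual)]
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n        # mean absolute error
    mse = sum(e * e for e in errors) / n         # mean squared error (squared rating units)
    rmse = sqrt(mse)                             # same units as ratings, penalizes large errors more
    nmae = mae / (r_max - r_min)                 # MAE normalized to the rating scale
    return mae, mse, rmse, nmae

# Example: three predictions against held-out ratings on a 1-5 scale.
print(predictive_accuracy([4.2, 3.1, 5.0], [4.0, 2.0, 4.5]))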
Classification accuracy metrics try to assess the successful decision making capacity (SDMC) of recommendation algorithms. They measure the number of correct and incorrect classifications of items as relevant or irrelevant made by the recommender system and are therefore useful for user tasks such as finding good items. SDMC metrics ignore the exact rating or ranking of items, as only the correct or incorrect classification is measured. This type of metric is particularly suitable for applications in e-commerce that try to convince users to make certain decisions, such as purchasing products or services. The common basic information retrieval (IR) metrics, such as recall and precision, are calculated from the number of items that are either relevant or irrelevant and either contained in the recommendation set of a user or not.

A rank accuracy or ranking prediction metric measures the ability of a recommender to estimate the correct order of items with respect to the user's preference, which is called the measurement of rank correlation in statistics. This type of measure is therefore most adequate when the user is presented with a long ordered list of recommended items. A rank prediction metric uses only the relative ordering of preference values and is thus independent of the exact values estimated by a recommender. Most rank accuracy metrics, such as Kendall's tau and Spearman's rho, compare two total orderings.

Evaluating recommender systems requires a definition of what constitutes a good recommender system and of how this should be measured. There is largely a consensus on what makes a good recommender system and on the methods to evaluate recommender systems, and the more recommendation approaches are in use, the more important their evaluation becomes for determining the best performing approaches and their individual strengths and weaknesses. To determine the 'best' recommender approach, three main evaluation methods are used to measure recommender system quality: user studies, online evaluations, and offline evaluations. In user studies, users explicitly rate recommendations generated by different algorithms, and the algorithm with the highest average rating is considered the best. User studies typically ask their participants to quantify their overall satisfaction with the recommendations. It is important to note that user studies measure user satisfaction at the time of recommendation; users do not measure the accuracy of a recommender system, because at the time of rating they do not know whether a given recommendation really was the most relevant one. In online evaluations, recommendations are shown to real users of the system during their sessions. Users do not rate recommendations; instead, the recommender system observes how often a user accepts a recommendation. Acceptance is most commonly measured by the click-through rate (CTR), i.e. the ratio of clicked recommendations. Offline evaluations use pre-compiled offline datasets from which some information has been removed; the recommender algorithms are then analyzed on their ability to recommend the missing information. Three types of offline datasets are distinguished: (1) true-offline-datasets, (2) user-offline-datasets, and (3) expert-offline-datasets.

2. Literature Review

The literature on recommender system evaluation offers a large variety of evaluation metrics with which the performance of recommender systems can be evaluated. Error-based metrics such as MAE and RMSE and ranking-based metrics such as Precision and Recall are discussed in detail below.
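Since Precision and Recall recur throughout the reviewed work, a short illustrative Python sketch is given below. It assumes that the recommendation list and the set of relevant items for a user are already known; the function and item names are hypothetical.

def precision_recall(recommended, relevant):
    # Classification accuracy (SDMC) view: count correct classifications of items as relevant.
    rec, rel = set(recommended), set(relevant)
    hits = len(rec & rel)                          # relevant items that were recommended
    precision = hits / len(rec) if rec else 0.0    # share of recommendations that are relevant
    recall = hits / len(rel) if rel else 0.0       # share of relevant items that were recommended
    return precision, recall

# Example: 2 of 4 recommendations are relevant; 2 of 3 relevant items were found.
print(precision_recall(["i1", "i2", "i3", "i4"], ["i2", "i4", "i7"]))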
2.1 Accuracy Evaluation Metrics of Recommendation Tasks

Asela Gunawardana et al. [4] survey the many evaluation metrics, such as MAE, RMSE, Precision, and Recall, that have been suggested for comparing recommendation algorithms. The decision on the proper evaluation metric is often critical, as each metric may favor a different algorithm, and they review the proper construction of offline experiments for deciding on the most appropriate algorithm. They also discuss three important tasks of recommender systems, classify a set of appropriate, well-known evaluation metrics for each task, demonstrate how using an improper evaluation metric can lead to the selection of an improper algorithm for the task of interest, and discuss other important considerations when designing offline experiments.

2.2 Evaluating Recommendation Systems

Guy Shani and Asela Gunawardana [6] discuss how to compare recommenders based on a set of properties that are relevant for the application, focusing on comparative studies, in which a few algorithms are compared using some evaluation metric, rather than on absolute benchmarking of algorithms. They describe experimental settings appropriate for making choices between algorithms and review three types of experiments: an offline setting, where recommendation approaches are compared without user interaction; user studies, where a small group of subjects experiments with the system and reports on the experience; and large-scale online experiments, where real user populations interact with the system.

2.3 Comparative Recommender System Evaluation

Alan Said and Alejandro Bellogín [2] have compared common recommendation algorithms as implemented in three popular recommendation frameworks: Apache Mahout (AM) [21], LensKit (LK) [22], and MyMediaLite (MML) [23]. To provide a fair comparison, they take complete control of the evaluation dimensions being benchmarked: dataset, data splitting, evaluation strategies, and metrics. They also include results obtained using the internal evaluation mechanisms of these frameworks. Their analysis points to large differences in recommendation accuracy across frameworks and strategies, i.e. the same baselines may perform orders of magnitude better or worse across frameworks. These results show the necessity of clear guidelines when reporting the evaluation of recommender systems, in order to ensure reproducibility and comparability of results. Once the training and test splits are available, recommendations are produced using three common state-of-the-art methods from the recommender systems literature: user-based collaborative filtering (CF), item-based CF, and matrix factorization. These algorithms compute predictions using the following formulas.

Item-based recommenders look at each item rated by the target user and find items similar to that item. The rating prediction computed by item-based strategies is generally estimated as

$\tilde{r}(u,i) = C \sum_{j \in S_i} \mathrm{sim}(i,j)\, r(u,j)$

where C is a normalizing constant and $S_i$ is the set of items similar to $i$. $S_i$ is generally replaced by $I_u$, the set of all items rated by user $u$, since unrated items are assumed to be rated with a zero.
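As an illustration of the item-based prediction formula above, the sketch below computes the predicted rating for a single target item. It assumes the common choice of C as normalization by the sum of similarity magnitudes, which the formula itself leaves unspecified, and the input dictionaries are hypothetical.

def predict_item_based(user_ratings, sims_to_target):
    # user_ratings:    {item: rating} for the items rated by user u (plays the role of I_u)
    # sims_to_target:  {item: sim(i, item)} similarities between the target item i and rated items
    num = sum(sims_to_target.get(j, 0.0) * r for j, r in user_ratings.items())
    den = sum(abs(sims_to_target.get(j, 0.0)) for j in user_ratings)  # C = 1 / den (assumed here)
    return num / den if den > 0 else 0.0

# Example: the user rated i1 and i2; i1 is more similar to the target item than i2.
print(predict_item_based({"i1": 4.0, "i2": 2.0}, {"i1": 0.9, "i2": 0.3}))  # -> 3.5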
The user-based CF algorithm can be found in the literature under two different formulations:

$\tilde{r}(u,i) = \tilde{r}(u) + C \sum_{v \in N_k(u)} \mathrm{sim}(u,v)\,\big(r(v,i) - \tilde{r}(v)\big)$

and

$\tilde{r}(u,i) = C \sum_{v \in N_k(u)} \mathrm{sim}(u,v)\, r(v,i)$

where $N_k(u)$ is a neighbourhood of the k users most similar to u and $\tilde{r}(u)$ is the user's average rating. Note that both variations are referred to using the same name, making it difficult to know a priori which version is used. The user similarity $\mathrm{sim}(u,v)$ is often computed using the Pearson correlation coefficient or the cosine similarity between the user preference vectors. A shrunk version of these similarities is also considered:

$\mathrm{sim}_s(a,b) = \frac{n_{a,b}}{n_{a,b} + \lambda_s}\, \mathrm{sim}(a,b)$

where $n_{a,b}$ is the number of users who rated both items and $\lambda_s$ is the shrinking parameter.

Matrix factorization recommenders use dimensionality reduction in order to uncover latent factors between users and items, e.g. by Singular Value Decomposition (SVD) or probabilistic Latent Semantic Analysis. Generally, these systems minimize the regularized squared error on the set of known ratings in order to learn the factor vectors $p_u$ and $q_i$:

$\min_{q^*,\,p^*} \sum_{(u,i)} \big(r(u,i) - q_i^{T} p_u\big)^2 + \lambda\big(\lVert q_i \rVert^2 + \lVert p_u \rVert^2\big)$

2.4 Precision-Oriented Evaluation of Recommender Systems

There is considerable methodological divergence in the way precision-oriented metrics are applied in the Recommender Systems field, and as a consequence the results reported in different studies are difficult to put in context and compare. Alejandro Bellogín et al. [3] aim to identify the methodological design alternatives involved, and their effect on the resulting measurements, with a view to assessing their suitability, advantages, and potential shortcomings. They compare five experimental methodologies, broadly covering the variants reported in the literature. In their experiments with three state-of-the-art recommenders, four of the evaluation methodologies are consistent with each other and differ from error metrics in terms of the comparative recommenders' performance measurements. The remaining methodology considerably overestimates performance and suffers from a strong bias towards known relevant items. In this line of work, the accuracy of recommendations has traditionally been evaluated by measuring the error between predicted and known ratings, using metrics such as the Mean Absolute Error (MAE) and the Root Mean Squared Error (RMSE). Precision-oriented metrics, in contrast, measure the amount of relevant and non-relevant retrieved items. Bellogín et al. [3] present a general methodological framework for evaluating recommendation lists, covering different evaluation approaches; the considered approaches essentially differ from each other in the amount of unknown relevance that is added to the test set. They use Precision, Recall, and Normalized Discounted Cumulative Gain (NDCG) as representative precision-oriented metrics.
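Of these three metrics, NDCG is the least self-explanatory, so a short sketch of one common formulation is given below. It assumes binary relevance and logarithmic position discounting; Bellogín et al. [3] do not fix these choices, so they should be read as assumptions.

from math import log2

def ndcg_at_k(ranking, relevant, k):
    # Binary relevance: gain 1 if the item is relevant, 0 otherwise.
    gains = [1.0 if item in relevant else 0.0 for item in ranking[:k]]
    dcg = sum(g / log2(pos + 2) for pos, g in enumerate(gains))   # discount by rank position
    ideal = [1.0] * min(len(relevant), k)                         # best achievable top-k ranking
    idcg = sum(g / log2(pos + 2) for pos, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Example: two of the top-3 recommended items are relevant.
print(ndcg_at_k(["i5", "i1", "i9"], {"i1", "i9", "i2"}, k=3))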
2.5 Evaluating Performance of Recommender Systems

François Fouss et al. [7] focus specifically on the 'accuracy' of recommendation algorithms. Good recommendation (in terms of accuracy) has, however, to be coupled with other considerations, and they suggest measures aimed at evaluating aspects of recommendation algorithms other than accuracy. These other considerations include (1) coverage, which measures the percentage of a data set for which a recommender system is able to provide recommendations, (2) confidence metrics, which can help users make more effective decisions, (3) computing time, which measures how quickly an algorithm can produce good recommendations, (4) novelty/serendipity, which measures whether a recommendation is original, and (5) robustness, which measures the ability of the algorithm to make good predictions in the presence of noisy or sparse data. Six collaborative recommendation methods are investigated in their study.

2.6 Evaluating Recommender Systems

Recommender systems are considered an answer to the information overload in a web environment. Such systems recommend items (movies, music, books, news, web pages, etc.) that the user should be interested in. Collaborative filtering recommender systems have had huge success in commercial applications, where sales follow a power-law distribution. However, with the increase in the number of recommendation techniques and algorithms in the literature, there is no indication that the datasets used for their evaluation follow a real-world distribution. Zied Zaier et al. [5] introduce the long-tail theory and its impact on recommender systems. They also provide a comprehensive review of the different datasets used to evaluate collaborative filtering techniques and algorithms, and investigate which of these datasets exhibit a distribution that follows this power law and which distribution would be the most relevant.

3. Comparative Study

Traditional metrics like precision and recall have some limitations in this setting: (1) search results do not have to be very good overall; (2) recall is not important, as long as at least some good hits are returned; and (3) precision is not important, as long as at least some of the hits on the first page returned are good. A major problem with the frequently used metrics precision, recall, and F1-measure is that they suffer from severe biases: the outcome of these measures depends not only on the accuracy of the recommender (or classifier) but also, to a large degree, on the ratio of relevant items. To illustrate this, imagine an ideal recommender and its inverse and apply them to generate top-k lists of four items for varying ratios of relevant items. To address this problem, three new metrics that avoid these biases by integrating inverse precision and inverse recall have been introduced [8]: Informedness, Markedness, and the Matthews Correlation.

Markedness combines precision and inverse precision into a single measure and expresses how marked the classifications of a recommender are in comparison to chance:

markedness = precision + inversePrecision - 1

Informedness combines recall and inverse recall into a single measure and expresses how informed the classifications of a recommender are in comparison to chance:

informedness = recall + inverseRecall - 1

Both markedness and informedness return values in the range [-1, 1]. The Matthews Correlation combines the informedness and markedness measures into a single metric by calculating their geometric mean:

Correlation = ±√(Informedness × Markedness)

Table 1 contrasts the existing metrics with the newly proposed ones.
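A compact Python sketch of these three bias-aware metrics, computed from a binary confusion matrix, is given below. The determinant-based sign for the correlation follows the standard Matthews correlation definition and should be read as an assumption where the survey leaves it unstated; the sketch also assumes all four confusion-matrix margins are non-zero.

from math import sqrt

def bias_aware_metrics(tp, fp, fn, tn):
    # tp: relevant items recommended, fp: irrelevant items recommended,
    # fn: relevant items missed,      tn: irrelevant items correctly not recommended.
    precision = tp / (tp + fp)
    inverse_precision = tn / (tn + fn)
    recall = tp / (tp + fn)
    inverse_recall = tn / (tn + fp)
    markedness = precision + inverse_precision - 1
    informedness = recall + inverse_recall - 1
    sign = 1.0 if tp * tn - fp * fn >= 0 else -1.0        # direction of the association
    correlation = sign * sqrt(abs(informedness * markedness))
    return informedness, markedness, correlation

# Example: 8 of 10 recommendations are relevant, but 4 relevant items were missed.
print(bias_aware_metrics(tp=8, fp=2, fn=4, tn=16))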
Table 1: Comparison of existing and newly proposed evaluation metrics

Author                       | Evaluation Metrics                                               | Feature
Asela Gunawardana et al. [4] | Precision, Recall, F1-measure (existing)                         | Suffer from severe biases
Gunnar Schröder et al. [8]   | Informedness, Markedness, Matthews Correlation (newly proposed)  | Avoid these biases by integrating inverse precision and inverse recall

4. Conclusion

This paper surveys the evaluation of recommender systems. How recommendation algorithms can be evaluated in order to select the best algorithm for a specific task from a set of candidates is discussed in detail. Many evaluation metrics have been used for comparing recommendation algorithms; error-based metrics such as MAE and RMSE and ranking-based metrics such as Precision and Recall are discussed in this paper, and using these evaluation metrics the performance of recommender systems is evaluated.

Acknowledgement

We would like to thank the publishers, researchers and teachers for their guidance. We would also like to thank the college authority for providing the required infrastructure and support. Last but not least, we would like to extend heartfelt gratitude to our friends and family members for their support.

References

[1] Adomavicius, Gediminas, and Alexander Tuzhilin. "Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions." IEEE Transactions on Knowledge and Data Engineering 17.6 (2005): 734-749.
[2] Said, Alan, and Alejandro Bellogín. "Comparative recommender system evaluation: benchmarking recommendation frameworks." Proceedings of the 8th ACM Conference on Recommender Systems. ACM, 2014.
[3] Bellogín, Alejandro, Pablo Castells, and Iván Cantador. "Precision-oriented evaluation of recommender systems: an algorithmic comparison." Proceedings of the Fifth ACM Conference on Recommender Systems. ACM, 2011.
[4] Gunawardana, Asela, and Guy Shani. "A survey of accuracy evaluation metrics of recommendation tasks." The Journal of Machine Learning Research 10 (2009): 2935-2962.
[5] Zaier, Zied, Robert Godin, and Luc Faucher. "Evaluating recommender systems." International Conference on Automated Solutions for Cross Media Content and Multi-channel Distribution (AXMEDIS'08). IEEE, 2008.
[6] Shani, Guy, and Asela Gunawardana. "Evaluating recommendation systems." Recommender Systems Handbook. Springer US, 2011. 257-297.
[7] Fouss, François, and Marco Saerens. "Evaluating performance of recommender systems: An experimental comparison." IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT'08), Vol. 1. IEEE, 2008.
[8] Schröder, Gunnar, Maik Thiele, and Wolfgang Lehner. "Setting goals and choosing metrics for recommender system evaluations." UCERSTI2 Workshop at the 5th ACM Conference on Recommender Systems, Chicago, USA. Vol. 23. 2011.
[9] Cremonesi, Paolo, et al. "Comparative evaluation of recommender system quality." CHI'11 Extended Abstracts on Human Factors in Computing Systems. ACM, 2011.
[10] Ge, Mouzhi, Carla Delgado-Battenfeld, and Dietmar Jannach. "Beyond accuracy: evaluating recommender systems by coverage and serendipity." Proceedings of the Fourth ACM Conference on Recommender Systems. ACM, 2010.
[11] del Olmo, Félix Hernández, and Elena Gaudioso. "Evaluation of recommender systems: A new approach." Expert Systems with Applications 35.3 (2008): 790-804.
[12] Kluver, Daniel, and Joseph A. Konstan. "Evaluating recommender behavior for new users." Proceedings of the 8th ACM Conference on Recommender Systems. ACM, 2014.
[13] Shi, Lei. "Trading-off among accuracy, similarity, diversity, and long-tail: a graph-based recommendation approach." Proceedings of the 7th ACM Conference on Recommender Systems. ACM, 2013. 57-64.
[14] Zanker, Markus, Matthias Fuchs, Wolfram Höpken, Mario Tuta, and Nina Müller. "Evaluating recommender systems in tourism: a case study from Austria." Information and Communication Technologies in Tourism 2008: 24-34.
[15] Beel, Joeran, et al. "A comparative analysis of offline and online evaluations and discussion of research paper recommender system evaluation."
Proceedings of the International Workshop on Reproducibility and Replication in Recommender Systems Evaluation. ACM, 2013.
[16] Beel, Joeran, et al. "Research paper recommender system evaluation: A quantitative literature survey." Proceedings of the International Workshop on Reproducibility and Replication in Recommender Systems Evaluation. ACM, 2013.
[17] McDonald, David W. "Evaluating expertise recommendations." Proceedings of the 2001 International ACM SIGGROUP Conference on Supporting Group Work. ACM, 2001.
[18] Ruffo, Giancarlo, and Rossano Schifanella. "Evaluating peer-to-peer recommender systems that exploit spontaneous affinities." Proceedings of the 2007 ACM Symposium on Applied Computing. ACM, 2007.
[19] Doerfel, Stephan, and Robert Jäschke. "An analysis of tag-recommender evaluation procedures." Proceedings of the 7th ACM Conference on Recommender Systems. ACM, 2013. 343-346.
[20] Woerndl, Wolfgang, and Georg Groh. "Utilizing physical and social context to improve recommender systems." Proceedings of the 2007 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology - Workshops. IEEE Computer Society, 2007. 123-128.
[21] https://mahout.apache.org/general/downloads.html
[22] http://lenskit.org/download/
[23] http://mymedialite.net/download/index.html