Evaluation parameter study for recommendation algorithms

Chhavi Rana, Department of Computer Science Engineering, University Institute of Engineering and Technology, MD University, Rohtak, Haryana, India. e-mail: chhavi1jan@yahoo.com
Sanjay Kumar Jain, Department of Computer Engineering, National Institute of Technology, Kurukshetra, 136119, Haryana, India. e-mail: skjnith@yahoo.com

Abstract—The enormous amount of Web data has made its efficient use a challenge in the past few years. A range of techniques has been applied to tackle this problem; prominent among them are personalization and recommender systems. These are the tools that assist users in finding relevant information on the Web, and most e-commerce websites apply them in one way or another. In the past decade a large number of recommendation algorithms have been proposed, but there has been comparatively little research on the evaluation criteria for these algorithms. As a result, the traditional accuracy and classification metrics are still used for evaluation, and they provide only a static view. This paper studies how the evolution of user preferences over a period of time can be captured in a recommender system using a new evaluation methodology that explicitly uses the time dimension. We also present the different types of experimental setup that are generally used for recommender system evaluation. Furthermore, an overview of the major accuracy metrics, and of metrics that go beyond accuracy as researched in the past few years, is discussed in detail.

Keywords—Evolutionary; Clustering; Algorithm; Recommender Systems; Collaborative filtering; Data mining

I. INTRODUCTION

The focus of recommendation algorithm evaluation has been on accuracy. Most research in this field concentrates on algorithms that reduce the error between predicted ratings and true ratings. However, this way of testing recommendation algorithms fails to include an important parameter: time [23, 25]. Real-world recommender systems are iteratively updated and reorganized as items are added and users provide ratings. The underlying rating dataset therefore grows, and any recommendations derived from it are subject to change with the passage of time. Experiments carried out on an unchanging dataset cannot reflect the ability of a real-world recommender system, and the effect of these changes on user preferences cannot be explored with any static method.

In the last decade there has been a huge amount of research in the field of recommender systems, most of it focused on implementing new recommendation algorithms. Researchers who develop a new algorithm typically compare its performance to a set of existing approaches by applying some evaluation metric, and many evaluation metrics have been used to rank recommendation algorithms. It is obvious that no single algorithm outperforms all others in every possible setting; hence the evaluation needs to be carried out against particular parameters and goals. In this paper the performance will be evaluated over temporal parameters. The prime focus is to examine whether recommendation accuracy as well as scalability can be improved by adding the temporal dimension.
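To illustrate the temporal setting discussed above, the following minimal Python sketch contrasts a chronological train/test split with a conventional random split. It assumes a hypothetical ratings log with user, item, rating and timestamp columns (MovieLens-style); the file name and column names are illustrative only and not part of any specific system.

```python
import pandas as pd

# Hypothetical, header-less ratings log with user, item, rating and timestamp columns.
ratings = pd.read_csv("ratings.csv", names=["user", "item", "rating", "timestamp"])

# Temporal evaluation: order interactions in time and train on the earliest 80%,
# so the test set contains only ratings given after the training period.
ratings = ratings.sort_values("timestamp")
cutoff = int(len(ratings) * 0.8)
train, test = ratings.iloc[:cutoff], ratings.iloc[cutoff:]

# A static evaluation would instead sample the test set at random, ignoring time.
static_test = ratings.sample(frac=0.2, random_state=42)
```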
The general experimental approaches used for evaluating recommendation algorithms fall into three categories: online experiments, user studies and offline experiments [21]. Online experiments can measure the real performance of a system, but their subjective nature and cost limit their use on a large scale. Offline experiments are more widespread, and therefore the major part of this paper describes the offline experimental setting in detail. Several requirements must be met to ensure that the results of such experiments are statistically sound. These evaluation approaches are also applied in related areas such as machine learning and information retrieval, which highlights their relevance to evaluating recommender systems [12, 32].

II. LITERATURE REVIEW

A wide range of algorithms has been applied in the field of recommender systems to make predictions accurate and scalable. These algorithms have been developed and improved over the past decade to the point where a wide variety of algorithms exists for generating recommendations, and each algorithmic approach has disciples who assert it to be superior for some purpose. Identifying the performance parameters that single out the best algorithm for a given purpose has proven challenging, because researchers disagree on which metrics should be used and on the attributes to which these metrics should be applied. A dozen quantitative metrics and additional qualitative evaluation techniques are spread across the literature. Moreover, even if the metrics and their corresponding attributes are agreed upon, several other factors hamper the process of evaluating recommender systems and their algorithms.

First, different algorithms may be better or worse on different datasets. In addition, the properties of these datasets, such as the rating scale and rating density, also affect the evaluation process. Most recommendation algorithms have been tested on datasets that have many more users than items (e.g., the MovieLens dataset has 65,000 users and 5,000 movies). Such algorithms may be entirely unsuitable in a domain where there are many more items than users (e.g., an academic referencing recommender system, where there are many more research papers than users).

The second criterion that makes evaluation difficult relates to the objectives of the evaluation process. The majority of evaluations focus on the "accuracy" of algorithms in predicting ratings. However, even early researchers accepted that there are properties other than accuracy that have a bigger effect on user satisfaction and system performance. For example, Mobasher et al. [28] studied the extent to which recommendations cover the entire set of items, and McNee et al. [27] explored the novelty and serendipity of recommendations. Most researchers agree that the bottom-line measure of recommender system success should be user satisfaction, but the measurement of user satisfaction varies across systems. In the e-commerce domain user satisfaction is generally measured by the number of products purchased, yet users may turn out not to like those products after buying them. The best-case scenario would be to ask users directly how satisfied they are, but this is not always feasible because users are reluctant to fill in such details.
Thus, there is a significant challenge in maintaining the balance between taking explicit feedback from the user and generating implicit feedback from the user's navigation behavior, and it is very difficult to decide what combination of measures should be used in a comparative evaluation. The majority of published empirical evaluations of recommender algorithms to date have focused on accuracy, even though this has been challenged by a few researchers in the past few years. The development of new evaluation metrics and their implementation is growing into a full-fledged area of research [25, 30].

On one hand, the traditional choices for evaluation have been reviewed by a few researchers. Herlocker et al. [21] analysed the key parameters that should be taken into account in evaluating collaborative filtering recommender systems, such as the user tasks, datasets, prediction quality metrics, prediction attributes and the user-based evaluation of the system as a whole. In addition to reviewing the evaluation strategies implemented by prior researchers, they also presented empirical results from the analysis of various accuracy metrics on one content domain, where all the tested metrics collapsed roughly into three equivalence classes. Shani and Gunawardana [38] discussed how to compare recommenders on the basis of a set of properties that are relevant for a particular application; a comparative analysis is performed in which a few algorithms are compared using some evaluation metric, rather than against absolute benchmark algorithms. Cremonesi et al. [10] focused on analyzing the performance of a recommendation algorithm on new data rather than on past data. They tried to complement the work of Herlocker et al. [21] with appropriate methods for partitioning the dataset and an integrated methodology for evaluating different algorithms on the same data.

On the other hand, a few researchers have tried to go beyond accuracy metrics. McNee et al. [27] argued informally that researchers should move beyond the conventional accuracy metrics and their associated experimental methodologies to get the actual picture of real-world performance, and proposed new user-centric directions for evaluating recommender systems. Similarly, Ge et al. [14] focused on two crucial metrics in recommender system evaluation, coverage and serendipity, analyzing their role as indicators of recommendation quality and discussing novel ways of measuring and interpreting the obtained measurements. Olma et al. [29] proposed an evaluation framework that tries to extract the essential features of recommender systems. In this framework, the most essential feature is the objective of the recommender system, such as (i) choosing which of the items should be shown to the user, and (ii) deciding when and how the recommendations must be shown. Based on this framework they developed a new metric and compared it with other metrics. Krohn-Grimberghe et al. [18] focused on the requirements of business and research users and proposed a novel, extensible framework for the evaluation of recommender systems. Their framework uses a multidimensional approach and applies a user-driven approach through interactive visual analysis, which allows easy refining and reshaping of queries. Integrated actions such as drill-down or slice/dice enable the user to assess the performance of recommendations in terms of business criteria such as increase in revenue, accuracy, prediction error, coverage and more.
Schröder et al. [37] focused on the importance of clearly defining the objective of an evaluation and on how this affects the selection of an appropriate metric.

III. PERFORMANCE PARAMETERS

The majority of published empirical evaluations of recommender algorithms to date have focused on accuracy, even though this has been challenged by a few researchers in the past few years, and the development of new evaluation metrics and their implementation is growing into a full-fledged area of research [25, 30]. An accuracy metric empirically measures the difference between the ratings predicted by the recommendation algorithm and the user's true preference ratings. In selecting an accuracy metric, researchers have a number of choices, described below. The following list covers the most prominent metrics, including accuracy metrics as well as some newly proposed metrics, but it is certainly not exhaustive.

A. Accuracy metrics

The accuracy of recommendation algorithms has been evaluated in the research literature since 1994 [31]. A number of different metrics are used in these published evaluations. The commonly used accuracy metrics described here appear in most recommender system experimentation. They are broadly classified by Herlocker et al. [21] into three classes: predictive accuracy metrics, classification accuracy metrics, and rank accuracy metrics.

1) Predictive Accuracy Metrics. Predictive accuracy metrics measure how close the recommender system's predicted ratings are to the true user ratings. They are generally used in application domains where predicted ratings are displayed to the user. For example, the Last.fm music recommender [8] predicts the number of stars that a user will give each song and displays that prediction to the user. Researchers who want to measure predictive accuracy compute the difference between the predicted rating and the true rating using a metric such as mean absolute error.

a) Mean Absolute Error and Related Metrics. Mean absolute error (often referred to as MAE) measures the average absolute deviation between a predicted rating and the user's true rating. Mean absolute error (Eq. (1)) has been used to evaluate recommender systems in several cases [8, 21, 39]:

E = \frac{\sum_{i=1}^{N} |p_i - r_i|}{N}    (1)

where E is the MAE, p_i is the predicted rating, r_i is the true rating, and N is the total number of ratings.

The three variants of mean absolute error are mean squared error, root mean squared error, and normalized mean absolute error. Mean squared error and root mean squared error square the error before summing it, which places more emphasis on large errors. Normalized mean absolute error [17], on the other hand, is mean absolute error normalized with respect to the range of rating values; this is done in order to compare prediction runs on different datasets, although this use has not yet been validated experimentally. The drawback of these metrics is that they are less suitable for tasks such as Find Good Items, where a ranked result list is given to the user, who views only the top-ranking items. They are also less relevant when the granularity of the user's true preference is small.
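As a concrete illustration of Eq. (1) and its variants, the following minimal Python sketch computes MAE, RMSE and normalized MAE from paired lists of predicted and true ratings; the 1-to-5 rating range used for normalization is an assumption made only for this example.

```python
import numpy as np

def mae(predicted, true):
    """Mean absolute error, Eq. (1): average absolute deviation from the true ratings."""
    predicted, true = np.asarray(predicted, dtype=float), np.asarray(true, dtype=float)
    return np.mean(np.abs(predicted - true))

def rmse(predicted, true):
    """Root mean squared error: squares the errors first, emphasising large errors."""
    predicted, true = np.asarray(predicted, dtype=float), np.asarray(true, dtype=float)
    return np.sqrt(np.mean((predicted - true) ** 2))

def nmae(predicted, true, r_min=1.0, r_max=5.0):
    """Normalized MAE [17]: MAE divided by the rating range (a 1-5 scale is assumed here)."""
    return mae(predicted, true) / (r_max - r_min)

# Toy example on a 1-5 star scale.
predictions = [4.2, 3.1, 5.0, 2.4]
true_ratings = [4, 3, 4, 2]
print(mae(predictions, true_ratings), rmse(predictions, true_ratings), nmae(predictions, true_ratings))
```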
2) Classification Accuracy Metrics. Classification metrics measure the frequency with which a recommendation algorithm makes correct or incorrect decisions about whether an item is good. These metrics are more suitable for applications such as Find Good Items, where users face a binary choice. They cannot easily be applied to non-synthesized data in offline experiments, because data sparsity complicates the calculations: there may be recommendations that are not rated, and evaluating them can introduce certain biases. These metrics are also less relevant for domains with a non-binary granularity of true preference. The particular metrics commonly used for measuring classification accuracy include precision, recall and ROC.

a) Precision, Recall and Related Measures. Precision and recall are the most popular metrics for evaluating information retrieval systems [9], and they have also been used for the evaluation of recommender systems [3, 5, 33, 34]. Precision and recall are computed from a 2×2 table such as the one shown in Table I. To calculate precision and recall, the item set is first divided into two classes, relevant or not relevant. The set is then further divided into two segments: the items that were returned to the user (selected/recommended), and the items that were not. Precision is defined as the ratio of relevant items selected to the number of items selected, as shown in Eq. (2):

P = \frac{N_{rs}}{N_s}    (2)

where P is the precision, N_{rs} is the number of relevant items selected, and N_s is the total number of items selected. Precision represents the probability that a selected item is relevant. Recall, shown in Eq. (3), is defined as the ratio of relevant items selected to the total number of relevant items available, and represents the probability that a relevant item will be selected:

R = \frac{N_{rs}}{N_r}    (3)

where R is the recall, N_{rs} is the number of relevant items selected, and N_r is the total number of relevant items available.

Figure 1. Relevant and irrelevant items

b) F1 Metric. An approach that combines precision and recall into one metric is the F1 metric (Eq. (4)). F1 has been used to evaluate recommender systems by Sarwar et al. [33, 34]. Mean average precision, another approach taken by the TREC community, computes the average precision across several different levels of recall, or the average precision at the rank of each relevant document [19]. F1 and mean average precision may be appropriate if the underlying precision and recall measures on which they are based are determined to be appropriate.

F1 = \frac{2PR}{P + R}    (4)

where P is the precision and R is the recall.

c) ROC Curves, Swets' A Measure, and Related Metrics. The ROC metric [40] is an alternative method that measures the degree to which an information filtering system can successfully distinguish between signal (relevance) and noise. The ROC model assumes that every item is assigned a predicted level of relevance. Consequently, there are two distributions, shown in Figures 1 and 2: one is the probability distribution of the predicted relevance levels for items that are not relevant (plotted along the x-axis), and the other is the same distribution for items that are relevant (plotted along the y-axis). A system is better at differentiating relevant items from non-relevant items when these two probability distributions are clearly separable.

Figure 2. ROC curves
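To make Eqs. (2)-(4) concrete, the following minimal Python sketch computes precision, recall and F1 for a single user's top-N recommendation list; the item identifiers and the set-based inputs are illustrative assumptions, not tied to any particular dataset.

```python
def precision_recall_f1(recommended, relevant):
    """Compute precision (Eq. 2), recall (Eq. 3) and F1 (Eq. 4) for one user.

    recommended: the top-N item ids returned to the user
    relevant:    the item ids the user actually considers relevant
    """
    recommended, relevant = set(recommended), set(relevant)
    n_rs = len(recommended & relevant)               # relevant items that were selected
    precision = n_rs / len(recommended) if recommended else 0.0
    recall = n_rs / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1

# Toy example: 3 of the 5 recommended items are among the 6 relevant ones.
print(precision_recall_f1([1, 2, 3, 4, 5], [2, 3, 5, 8, 9, 10]))
# -> (0.6, 0.5, 0.545...)
```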
3) Rank Accuracy Metrics. Rank accuracy metrics measure the ability of a recommendation algorithm to produce a recommended ordering of items that matches how the user would have ordered the same items. They are most appropriate for evaluating algorithms that present ranked recommendation lists to the user, especially when user preferences are non-binary; when user preferences are binary (the user cares only about which items are good enough, not about their order), rank accuracy metrics are generally not preferred. One such metric is the half-life utility metric defined by Breese et al. [8], an evaluation metric designed for tasks where the user is presented with a ranked list of results and is unlikely to browse very deeply into that list. Heckerman et al. [20] give another description of this metric, based on the assumption that most Internet users will not browse very deeply into the results returned by search engines. Another metric, the NDPM measure (normalized distance-based performance measure), was used to evaluate the accuracy of the FAB recommender system [2]. It was originally proposed by Yao [41], who developed NDPM theoretically using an approach from decision and measurement theory. Although the literature contains many other similar metrics, our focus is on the widely used ones, namely the accuracy and classification metrics.
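As an illustration of such a rank-weighted metric, the following minimal Python sketch computes a half-life utility score for a single user's ranked list, following the general form described by Breese et al. [8]; the default rating d and half-life alpha used here are arbitrary example values, not prescribed by the paper.

```python
def half_life_utility(ranked_true_ratings, d=3.0, alpha=5.0):
    """Half-life utility of one ranked list, in the general form of Breese et al. [8].

    ranked_true_ratings: the user's true ratings of the recommended items, in list order
    d:      the 'default' (neutral) rating, assumed here to be 3 on a 1-5 scale
    alpha:  the half-life, i.e. the rank assumed to have a 50% chance of being viewed
    """
    return sum(max(r - d, 0.0) / 2 ** ((j - 1) / (alpha - 1))
               for j, r in enumerate(ranked_true_ratings, start=1))

# Toy example: a liked item at rank 1 contributes fully; the same item at rank 5
# (the assumed half-life) would contribute only half as much.
print(half_life_utility([5, 3, 4, 2, 5]))
```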
IV. CONCLUSION

In this paper we presented the different types of experimental setup that are generally used for recommender system evaluation. We also reviewed the literature and listed the efforts of various researchers in surveying such approaches and in developing new ones. An overview of the major accuracy metrics, and of metrics that go beyond accuracy as researched in the past few years, was also discussed in detail. We also proposed a new methodology that is particularly suited to the temporal evaluation of recommender systems. This paper therefore provides guidelines that help researchers choose an appropriate evaluation metric for a temporal evaluation scenario. In future work, we will consider applying our evaluation framework to a set of benchmark recommendation algorithms in order to justify its usage.

REFERENCES
[1] Agrawal, R., Imielinski, T., and Swami, A. 1993. Mining association rules between sets of items in large databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data, ACM, New York, USA, pp. 207-216.
[2] Balabanovic, M. and Shoham, Y. 1997. Fab: Content-based, collaborative recommendation. Communications of the ACM, 40(3): 66-72.
[3] Basu, C., Hirsh, H., and Cohen, W. 1998. Recommendation as classification: Using social and content-based information in recommendation. In Proceedings of the 1998 National Conference on Artificial Intelligence (AAAI-98), pp. 714-720.
[4] Bellogín, A., Castells, P., and Cantador, I. 2011. Precision-based evaluation of recommender systems: An algorithmic comparison. In Proceedings of RecSys 2011, pp. 333-336.
[5] Billsus, D. and Pazzani, M. 1998. Learning collaborative information filters. In Shavlik, J. (ed.), Machine Learning: Proceedings of the Fifteenth International Conference, Morgan Kaufmann, San Francisco, CA.
[6] Box, G. E. P., Hunter, W. G., and Hunter, J. S. 1978. Statistics for Experimenters: An Introduction to Design, Data Analysis, and Model Building. John Wiley & Sons, New York.
[7] Boutilier, C. and Zemel, R. 2002. Online queries for collaborative filtering. In Ninth International Workshop on Artificial Intelligence and Statistics.
[8] Breese, J., Heckerman, D., and Kadie, C. 1998. Empirical analysis of predictive algorithms for collaborative filtering. In Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence (UAI-98), pp. 43-52.
[9] Cleverdon, C. and Kean, M. 1968. Factors Determining the Performance of Indexing Systems. Aslib Cranfield Research Project, Cranfield, England.
[10] Cremonesi, P., Koren, Y., and Turrin, R. 2010. Performance of recommender algorithms on top-N recommendation tasks. In Proceedings of RecSys 2010.
[11] Das, A., Datar, M., Garg, A., and Rajaram, S. 2007. Google News personalization: Scalable online collaborative filtering. In Proceedings of the 16th International Conference on World Wide Web (WWW '07).
[12] Demsar, J. 2006. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7: 1-30.
[13] Fu, C. and Silver, D. 2004. Time-sensitive sampling for spam filtering. In Tawfik, A. and Goodwin, S. (eds.), Advances in Artificial Intelligence. Springer, Berlin/Heidelberg.
[14] Ge, M., Delgado-Battenfeld, C., and Jannach, D. 2010. Beyond accuracy: Evaluating recommender systems by coverage and serendipity. In Proceedings of the Fourth ACM Conference on Recommender Systems (RecSys '10), ACM, New York, NY, USA, pp. 257-260.
[15] Karypis, G. 2001. Evaluation of item-based top-N recommendation algorithms. In Proceedings of CIKM, pp. 247-254.
[16] Good, N., Schafer, J. B., Konstan, J. A., Borchers, A., Sarwar, B., Herlocker, J., and Riedl, J. 1999. Combining collaborative filtering with personal agents for better recommendations. In Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI-99), AAAI Press, pp. 439-446.
[17] Goldberg, D., Nichols, D., Oki, B. M., and Terry, D. 1992. Using collaborative filtering to weave an information tapestry. Communications of the ACM, 35(12): 61-70.
[18] Krohn-Grimberghe, A., Nanopoulos, A., and Schmidt-Thieme, L. 2010. In Proceedings of LWA 2010 - Workshop-Woche: Lernen, Wissen & Adaptivität, Kassel, Germany.
[19] Harman, D. 1995. Overview of the Fourth Text REtrieval Conference (TREC-4).
[20] Heckerman, D., Chickering, D. M., Meek, C., Rounthwaite, R., and Kadie, C. 2000. Dependency networks for inference, collaborative filtering, and data visualization. Journal of Machine Learning Research, 1: 49-75.
[21] Herlocker, J., Konstan, J., Terveen, L., and Riedl, J. 2004. Evaluating collaborative filtering recommender systems. ACM Transactions on Information Systems, 22(1): 5-53.
[22] Kohavi, R. and Provost, F. Applications of data mining to electronic commerce. Data Mining and Knowledge Discovery, 5(1-2): 5-10.
[23] Koren, Y. 2009. Collaborative filtering with temporal dynamics. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '09), pp. 447-456.
[24] Koychev, I. and Schwab, I. 2000. Adaptation to drifting user's interests. In Proceedings of the ECML 2000/MLnet Workshop: Machine Learning in the New Information Age, Barcelona, Spain, pp. 39-45.
[25] Lathia, N., Hailes, S., and Capra, L. 2009. Temporal collaborative filtering with adaptive neighbourhoods. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '09), pp. 796-797.
[26] Mahmood, T. and Ricci, F. 2007. Learning and adaptivity in interactive recommender systems. In Dellarocas, C. and Dignum, F. (eds.), Proceedings of the ICEC '07 Conference, Minneapolis, USA, pp. 75-84.
[27] McNee, S. M., Riedl, J., and Konstan, J. 2006. Being accurate is not enough: How accuracy metrics have hurt recommender systems. In Extended Abstracts of the 2006 ACM Conference on Human Factors in Computing Systems (CHI 2006).
[28] Mobasher, B., Dai, H., Luo, T., and Nakagawa, M. 2001. Effective personalization based on association rule discovery from web usage data. In Proceedings of WIDM, pp. 9-15.
[29] Olma, H. F. and Gaudioso, E. 2008. Evaluation of recommender systems: A new approach. Expert Systems with Applications, 35(3): 790-804.
[30] Campos, P. G., Díez, F., and Bellogín, A. 2011. Temporal rating habits: A valuable tool for rater differentiation. In Proceedings of the 2nd Challenge on Context-Aware Movie Recommendation (CAMRa 2011), at the 5th ACM Conference on Recommender Systems (RecSys 2011), Chicago, IL, USA, October 2011.
[31] Resnick, P. and Varian, H. R. 1997. Recommender systems. Communications of the ACM, 40(3): 56-58.
[32] Salzberg, S. L. 1997. On comparing classifiers: Pitfalls to avoid and a recommended approach. Data Mining and Knowledge Discovery, 1(3): 317-327.
[33] Sarwar, B. M., Karypis, G., Konstan, J. A., and Riedl, J. 2000. Analysis of recommender algorithms for e-commerce. In Proceedings of the 2nd ACM E-Commerce Conference (EC '00).
[34] Sarwar, B. M., Karypis, G., Konstan, J. A., and Riedl, J. 2000. Application of dimensionality reduction in recommender system - a case study. In ACM WebKDD 2000 Web Mining for E-Commerce Workshop.