Search Personalization using Machine Learning

Hema Yoganarasimhan*
University of Washington

Abstract

Query-based search is commonly used by many businesses to help consumers find information/products on their websites. Examples include search engines (Google, Bing), online retailers (Amazon, Macy's), and entertainment sites (Hulu, YouTube). Nevertheless, a significant portion of search sessions are unsuccessful, i.e., do not provide the information that the user was looking for. We present a machine learning framework that improves the quality of search results through automated personalization based on a user's search history. Our framework consists of three modules – (a) feature generation, (b) an NDCG-based LambdaMART algorithm, and (c) a feature selection wrapper. We estimate our framework on large-scale data from a leading search engine using Amazon EC2 servers. We show that our framework offers a significant improvement in search quality compared to non-personalized results. We also show that the returns to personalization increase monotonically, but concavely, with the length of user history. Next, we find that personalization based on short-term history or "within-session" behavior is less valuable than long-term or "across-session" personalization. We also derive the value of different feature sets – user-specific features contribute over 50% of the improvement and click-specific features over 28%. Finally, we demonstrate scalability to big data and derive the set of optimal features that maximizes accuracy while minimizing computing costs.

Keywords: marketing, online search, personalization, big data, machine learning

* The author is grateful to Yandex for providing the data used in this paper. She thanks Brett Gordon for detailed comments on the paper. Thanks are also due to the participants of the 2014 UT Dallas FORMS, Marketing Science, and Big Data and Marketing Analytics conferences for feedback. Please address all correspondence to: hemay@uw.edu.

1 Introduction

1.1 Online Search and Personalization

As the Internet has matured, the amount of information and products available on individual websites has grown exponentially. Today, Google indexes over 30 trillion webpages (Google, 2014), Amazon sells 232 million products (Grey, 2013), Netflix streams more than 10,000 titles (InstantWatcher, 2014), and over 100 hours of video are uploaded to YouTube every minute (YouTube, 2014). While this unprecedented growth in product assortments gives an abundance of choice to consumers, it also makes it hard for them to locate the exact product or information they are looking for. To address this issue, most businesses today use a query-based search model to help consumers locate the product/information that best fits their needs. Consumers enter queries (or keywords) in a search box, and the firm presents a set of products/information deemed most relevant to the query.

We focus on the most common class of search – Internet search engines, which play an important role in today's digital economy. In the U.S. alone, over 18 billion search queries are made every month, and search advertising generates over $4 billion in quarterly revenues (Silverman, 2013; comScore, 2014). Consumers use search engines for a variety of reasons; e.g., to gather information on a topic from authoritative sources, to remain up-to-date with breaking news, and to aid purchase decisions.
Search results and their relative ranking can therefore have significant implications for consumers, businesses, governments, and for the search engines themselves. However, a significant portion of searches are unsuccessful, i.e., they do not provide the information that the user was looking for. According to some experts, this is because even when two users search using the same keywords, they are often looking for different information. Thus, a standardized ranking of results across users may not achieve the best results. To address this concern, Google launched personalized search a few years ago, customizing results based on a user's IP address and cookie (individual-level search and click data), using up to 180 days of history (Google Official Blog, 2009). Since then, other search engines such as Bing and Yandex have also experimented with personalization (Sullivan, 2011; Yandex, 2013). See Figure 1 for an example of how personalization can influence search results.

Figure 1: An example of personalization. The right panel shows results with personalization turned off, by using the Chrome browser in a fresh incognito session. The left panel shows personalized results. The fifth result in the left panel, from the domain babycenter.com, does not appear in the right panel. This is because the user in the left panel had previously visited this URL and domain multiple times in the past few days.

However, the extent and effectiveness of search personalization has been a source of intense debate, partly fueled by the secrecy of search engines on how they implement personalization (Schwartz, 2012; Hannak et al., 2013).1 Indeed, some researchers have even questioned the value of personalizing search results, not only in the context of search engines, but also in other settings (Feuz et al., 2011; Ghose et al., 2014). In fact, this debate on the returns to search personalization is part of a larger one on the returns to personalization of any marketing mix variable (see §2 for details). Given the privacy implications of storing long user histories, both consumers and managers need to understand the value of personalization.

1 In Google's case, the only source of information on its personalization strategy is a post in its official blog, which suggests that whether a user previously clicked a page and the temporal order of a user's queries influence her personalized rankings (Singhal, 2011). In Yandex's case, all that we know comes from a quote by one of its executives, Mr. Bakunov – "Our testing and research shows that our users appreciate an appropriate level of personalization and increase their use of Yandex as a result, so it would be fair to say it can only have a positive impact on our users." (Atkins-Krüger, 2012).

1.2 Research Agenda

In this paper, we examine the returns to personalization in the context of Internet search. We address three managerially relevant research questions:

1. Can we provide a general framework to implement personalization in the field and evaluate the gains from it? In principle, if consumers' tastes and behavior show persistence over time in a given setting, personalization using data on their past behavior should improve outcomes. However, we do not know the degree of persistence in consumers' tastes/behavior, and the extent to which personalization can improve the search experience is an empirical question.

2. Gains from personalization depend on a large number of factors that are under the manager's control.
These include the length of user history, the different attributes of the data used in the empirical framework, and whether data across multiple sessions is used or not. We examine the impact of these factors on the effectiveness of personalization. Specifically:

(a) How does the improvement from personalization vary with the length of user history? For example, how much better can we do if we have data on the past 100 queries of a user compared to only her past 50 queries?

(b) Which attributes of the data, or features, lead to better results from personalization? For example, which of these metrics is more effective – a user's propensity to click on the first result or her propensity to click on previously clicked results? Feature design is an important aspect of any ranking problem and is therefore treated as a source of competitive advantage and kept under wraps by most search engines (Liu, 2011). However, all firms that use search would benefit from understanding the relative usefulness of different types of features.

(c) What is the relative value of personalizing based on short-run data (a single session) vs. long-run data (multiple sessions)? On the one hand, within-session personalization might be more informative of a consumer's current quest. On the other hand, across-session data can help infer persistence in consumers' behavior. Moreover, within-session personalization has to happen in real time and is computationally intensive. So knowing the returns to it can help managers decide whether it is worth the computational cost.

3. What is the scalability of personalized search to big data? Search datasets tend to be extremely large, and personalizing the results for each consumer is computationally intensive. For a proposed framework to be implementable in the field, it has to be scalable and efficient. Can we specify sparser models that significantly reduce computational costs while minimizing the loss in accuracy?

1.3 Challenges and Research Framework

We now discuss the three key challenges involved in answering these questions and present an overview of our research framework.

• Out-of-sample predictive accuracy: First, we need a framework with extremely high out-of-sample predictive accuracy. Note that the search engine's problem is one of prediction – which result or document will a given user find most relevant? More formally, what is the user-specific rank-ordering of the documents in decreasing order of relevance, so that the most relevant documents are at the top for each user?2

2 Search engines don't aim to provide the most objectively correct or useful information. Relevance is data-driven, and search engines aim to maximize the user's probability of clicking on and dwelling on the results shown. Hence, it is possible that non-personalized results may be perceived as more informative, and the personalized ones as more narrow, by an objective observer. However, this is not an issue when maximizing the probability of click for a specific user.

Unfortunately, the traditional econometric tools used in the marketing literature have been designed for causal inference (to derive the best unbiased estimates) and not prediction. So they perform poorly in out-of-sample predictions. Indeed, the well-known "bias-variance trade-off" suggests that unbiased estimators are not always the best predictors, because they tend to have high variance, and vice-versa. See Hastie et al. (2009) for an excellent discussion of this concept. The problems of prediction and causal inference are thus fundamentally different and often at odds with each other. Hence, we need a methodological framework that is optimized for predictive accuracy.
• Large number of attributes with complex interactions: The standard approach to modeling is to assume a fixed functional form for the outcome variable, include a small set of explanatory variables, allow for a few interactions between them, and then infer the parameters associated with the pre-specified functional form. However, this approach has poor predictive ability (as we later show). In our setting, we have a large number of features (close to 300), and we need to allow for non-linear interactions between them. This problem is exacerbated by the fact that we don't know, a priori, which of these variables and which interaction effects matter. Indeed, it is not possible for the researcher to come up with the best non-linear specification of the functional form either by manually eyeballing the data or by including all possible interactions. The latter would lead to over 45,000 variables, even if we restrict ourselves to simple pair-wise interactions. Thus, we are faced with the problem of not only inferring the parameters of the model, but also inferring the optimal functional form of the model itself. So we need a framework that can search over and identify the relevant set of features and all possible non-linear interactions between them.

• Scalability and efficiency in big data settings: Finally, as discussed earlier, our approach needs to be both efficient and scalable. Efficiency refers to the idea that the run-time and deployment cost of the model should be as small as possible, with minimal loss in accuracy. Thus, an efficient model may lose a very small amount of accuracy, but make huge gains on run-time. Scalability refers to the idea that an X-fold increase in the data leads to no more than a 2X (or linear) increase in deployment time. Both scalability and efficiency are necessary to run the model in real settings.

To address these challenges, we turn to machine learning methods pioneered by computer scientists. These are specifically designed to provide extremely high out-of-sample accuracy, work with a large number of data attributes, and perform efficiently at scale. Our framework employs a three-pronged approach – (a) feature generation, (b) an NDCG-based LambdaMART algorithm, and (c) feature selection using the Wrapper method. The first module consists of a program that uses a series of functions to generate a large set of features (or attributes) that serve as inputs to the subsequent modules. These features consist of both global and personalized metrics that characterize the aggregate as well as the user-level data on the relevance of search results. The second module consists of the LambdaMART ranking algorithm, which maximizes search quality as measured by Normalized Discounted Cumulative Gain (NDCG). This module employs the Lambda ranking algorithm in conjunction with the machine learning algorithm called Multiple Additive Regression Trees (MART) to maximize the predictive ability of the framework. MART performs automatic variable selection and infers both the optimal functional form (allowing for all types of interaction effects and non-linearities) as well as the parameters of the model.
It is not hampered by the large number of parameters and variables, since it performs optimization efficiently in gradient space using a greedy algorithm. The third module focuses on scalability and efficiency – it consists of a Wrapper algorithm that wraps around the LambdaMART step to derive the optimal set of features that minimizes computing costs with minimal loss in search quality. The Wrapper thus identifies the most efficient model that scales well.

Apart from the benefits discussed earlier, an additional advantage of our approach is its applicability to datasets that lack information on context, text, and links – factors that have been shown to play a big role in improving search quality (Qin et al., 2010). Our framework only requires an existing non-personalized ranking. As concerns about user privacy increase, this is valuable.

1.4 Findings and Contribution

We apply our framework to a large-scale dataset from Yandex. It is the fourth largest search engine in the world, with over 60% market share in Russia and Eastern Europe (Clayton, 2013; Pavliva, 2013) and over $320 million in quarterly revenues (Yandex, 2013). In October 2013, Yandex released a large-scale anonymized dataset for the 'Yandex Personalized Web Search Challenge', which was hosted by Kaggle (Kaggle, 2013). The data consists of information on anonymized user identifiers, queries, query terms, URLs, URL domains, and clicks. The data spans one month, and consists of 5.7 million users, 21 million unique queries, 34 million search sessions, and over 64 million clicks. Because all three modules in our framework are memory- and compute-intensive, we estimate the model on the fastest servers available on Amazon EC2, with 32 cores and 256 GB of memory.

We present five key results. First, we show that our personalized machine learning framework leads to significant improvements in search quality as measured by NDCG. In situations where the topmost link is not the most relevant (based on the current ranking), personalization improves search quality by approximately 15%. From a consumer's perspective, this leads to big changes in the ordering of results they see – over 55% of the documents are re-ordered, of which 30% are re-ordered by two or more positions. From the search engine's perspective, personalization leads to a substantial increase in click-through rates at the top position, which is a leading measure of user satisfaction. We also show that standard logit-style models used in the marketing literature lead to no improvement in search quality, even if we use the same set of personalized features generated by the machine learning framework. This re-affirms the value of using a large-scale machine learning framework that can provide high predictive accuracy.

Second, the quality of personalized results increases with the length of a user's search history in a monotonic but concave fashion (between 3% and 17% for the median user, who has made about 50 past queries). Since data storage costs are linear, firms can use our estimates to derive the optimal length of history that maximizes their profits.

Third, we examine the value of different types of features. User-specific features generate over 50% of the improvement from personalization. Click-related features (which are based on users' clicks and dwell times) are also valuable, contributing over 28% of the improvement in NDCG. Both domain- and URL-specific features are also important, contributing over 10% of the gain in NDCG.
Interestingly, session-specific features are not very useful – they contribute only 4% of the gain in NDCG. Thus, we find that "across-session" personalization based on user-level data is more valuable than "within-session" personalization based on session-specific data. Firms often struggle with implementing within-session personalization, since it has to happen in real time and is computationally expensive. Our findings can help firms decide whether the relatively low value from within-session personalization is sufficient to warrant the computing costs of real-time responses.

Fourth, we show that our method is scalable to big data. Going from 20% of the data to the full dataset leads to a 2.38 times increase in computing time for feature generation, a 20.8 times increase for training, and an 11.6 times increase for testing/deployment of the final model. Thus, training and testing costs increase super-linearly, whereas feature-generation costs increase sub-linearly. Crucially, none of the increases in computing time are exponential, and the additional data provides substantial improvements in NDCG.

Finally, to increase our framework's efficiency, we provide a forward-selection wrapper, which derives a sparse set of features that retains most of the gains in search quality at a fraction of the computational cost. Of the 293 features provided by the feature-generation module, we select 47 that provide 95% of the improvement in NDCG at significantly lower computing cost. We find that training the pruned model takes only 15% of the time taken to train the unpruned model. Similarly, deploying the pruned model is only 50% as expensive as deploying the full model.

In sum, our paper makes three main contributions to the marketing literature on personalization. First, it presents a comprehensive machine learning framework to help researchers and managers improve personalized marketing decisions using data on consumers' past behavior. Second, it presents empirical evidence in support of the returns to personalization in the online search context. Third, it demonstrates how large datasets or big data can be leveraged to improve marketing outcomes without compromising the speed or real-time performance of digital applications.

2 Related Literature

Our paper relates to many streams of literature in marketing and computer science.

First, it relates to the empirical literature on personalization in marketing. In an influential paper, Rossi et al. (1996) quantify the benefits of one-to-one pricing using purchase history data and find that personalization improves profits by 7.6%. Similarly, Ansari and Mela (2003) find that content-targeted emails can increase click-throughs by up to 62%, and Arora and Henderson (2007) find that customized promotions in the form of 'embedded premiums' or social causes associated with the product can improve profits. On the other hand, a series of recent papers question the returns to personalization in a variety of contexts, ranging from advertising to the ranking of hotels on travel sites (Zhang and Wedel, 2009; Lambrecht and Tucker, 2013; Ghose et al., 2014).3

3 A separate stream of literature examines how individual-level personalized measures can be used to improve multi-touch attribution in online channels (Li and Kannan, 2014).

Our paper differs from this literature on two important dimensions. Substantively, we focus on product personalization (search results), unlike the earlier papers that focus on the personalization of marketing mix variables such as price and advertising. Even papers that consider ranking outcomes, e.g., Ghose et al. (2014), use marketing mix variables such as advertising and pricing to make ranking recommendations.
However, in the information retrieval context, there are no such marketing mix variables. Moreover, with our anonymized data, we don't even have access to the topic/context of search. Second, from a methodological perspective, previous papers use computationally intensive Bayesian methods that employ Markov Chain Monte Carlo (MCMC) procedures, randomized experiments, or discrete choice models – all of which are not only difficult to scale, but also lack predictive power (Rossi and Allenby, 2003; Friedman et al., 2000). In contrast, we use machine learning methods that can handle big data and offer high predictive performance.

In a similar vein, our paper also relates to the computer science literature on the personalization of search engine rankings. Qiu and Cho (2006) infer user interests from browsing history using experimental data on ten subjects. Teevan et al. (2008) and Eickhoff et al. (2013) show that there could be returns to personalization for some specific types of searches involving ambiguous queries and atypical search sessions. Bennett et al. (2012) find that long-term history is useful early in the session, whereas short-term history is useful later in the session. There are two main differences between these papers and ours. First, in all these papers, the researchers had access to data with context – information on the URLs and domains of the webpages, the hyperlink structure between them, and the queries used in the search process. In contrast, we work with anonymized data – we don't have data on the URL identifiers, texts, or links to/from webpages; nor do we know the queries. This makes our task significantly more challenging. We show that personalization is not only feasible, but also beneficial, in such cases. Second, the previous papers work with much smaller datasets, which helps them avoid the scalability issues that we face.

More broadly, our paper is relevant to the growing literature on Learning-to-Rank (LETOR) methods in machine learning. Many information retrieval problems in business settings boil down to ranking. Hence, ranking methods have been studied in many contexts, such as collaborative filtering (Harrington, 2003), question answering (Surdeanu et al., 2008; Banerjee et al., 2009), text mining (Metzler and Kanungo, 2008), and digital advertising (Ciaramita et al., 2008). We refer interested readers to Liu (2011) for details.

Our paper also contributes to the literature on machine learning methods in marketing. Most of these applications have been in the context of online conjoint analysis and use statistical machine learning techniques. These methods are employed to alleviate issues such as noisy data and complexity control (similar to feature selection in our paper); they also help in model selection and improve robustness to response error (in participants). Some prominent works in this area include Toubia et al. (2003, 2004) and Evgeniou et al. (2005, 2007). Please see Toubia et al. (2007) for a detailed discussion of this literature, and Dzyabura and Hauser (2011); Tirunillai and Tellis (2014); Netzer et al. (2012); Huang and Luo (2015) for other applications of machine learning in marketing.

3 Data

3.1 Data Description

We use the publicly available anonymized dataset released by Yandex for its personalization challenge (Kaggle, 2013).
The dataset is from 2011 (exact dates are not disclosed) and represents 27 days of search activity. The queries and users are sampled from a large city in eastern Europe. The data is processed such that two types of sessions are removed – those containing queries with commercial intent (as detected by Yandex's proprietary classifier) and those containing one or more of the top-K most popular queries (where K is not disclosed). This minimizes the risk of reverse engineering of user identities and business interests. The current rankings shown in the data are based on Yandex's proprietary non-personalized algorithm. Table 1 shows an example of a session from the data.

20 M 6 5
20 0 Q 0 6716819 2635270,4603267 70100440,5165392 17705992,1877254 41393646,3685579 46243617,3959876 26213624,2597528 29601078,2867232 53702486,4362178 11708750,1163891 53485090,4357824 4643491,616472
20 13 C 0 70100440
20 495 C 0 29601078

Table 1: Example of a session. The data is in a log format describing a stream of user actions, with each line representing a session, query, or click. In all lines, the first column represents the session (SessionID = 20 in this example). The first row of a session contains the session-level metadata and is indicated by the letter M. It has information on the day the session happened (6) and the UserID (5). Rows with the letter Q denote a query and those with C denote a click. The user submitted the query with QueryID 6716819, which contains the terms with TermIDs 2635270 and 4603267. The URL with URLID 70100440 and DomainID 5165392 is the top result on the SERP. The rest of the nine results follow similarly. 13 units of time into the session, the user clicked on the result with URLID 70100440 (ranked first in the list), and 495 units of time into the session, s/he clicked on the result with URLID 29601078.

The data contains information on the following anonymized variables:

• User: A user, indexed by i, is an agent who uses the search engine for information retrieval. Users are tracked over time through a combination of cookies, IP addresses, and logged-in accounts. There are 5,659,229 unique users in the data.

• Query: A query is composed of one or more words, in response to which the search engine returns a page of results. Queries are indexed by q. There are a total of 65,172,853 queries in the data, of which 20,588,928 are unique.4 Figure 2 shows the distribution of queries in the data, which follows a long-tail pattern – 1% of queries account for 47% of the searches, and 5% for 60%.

4 Queries are not the same as keywords, which is another term used frequently in connection with search engines. Please see Gabbert (2011) for a simple explanation of the difference between the two phrases.

Figure 2: Cumulative distribution function of the number of unique queries – the fraction of overall searches corresponding to unique queries, sorted by popularity.

• Term: Each query consists of one or more terms, and terms are indexed by l. For example, the query 'pictures of richard armitage' has three terms – 'pictures', 'richard', and 'armitage'. There are 4,701,603 unique terms in the data, and we present the distribution of the number of terms across all the queries seen in the data in Table 2.5

5 Search engines typically ignore common keywords and prepositions used in searches, such as 'of', 'a', and 'in', because they tend to be uninformative. Such words are called 'Stop Words'.

Table 2: Distribution of the number of terms in a query. (Min, Max) = (1, 82).

  Number of terms per query    Percentage of queries
  1                            22.32
  2                            24.85
  3                            20.63
  4                            14.12
  5                            8.18
  6                            4.67
  7                            2.38
  8                            1.27
  9                            0.64
  10 or more                   1.27

• Session: A search session, indexed by j, is a continuous block of time during which a user is repeatedly issuing queries, browsing, and clicking on results to obtain information on a topic.
Search engines use proprietary algorithms to detect session boundaries (Göker and He, 2000), and the exact method used by Yandex is not publicly revealed. According to Yandex, there are 34,573,630 sessions in the data.

• URL: After a user issues a query, the search engine returns a set of ten results on the first page. These results are essentially URLs of webpages relevant to the query. There are 69,134,718 unique URLs in the data, and URLs are indexed by u. Yandex only provides data for the first Search Engine Results Page (SERP) for a given query, because a majority of users never go beyond the first page, and click data for subsequent pages is very sparse. So from an implementation perspective, it is difficult to personalize later SERPs; and from a managerial perspective, the returns to personalizing them are low, since most users never see them.

• URL domain: The domain to which a given URL belongs. For example, www.imdb.com is a domain, while http://www.imdb.com/name/nm0035514/ and http://www.imdb.com/name/nm0185819/ are URLs or webpages hosted at this domain. There are 5,199,927 unique domains in the data.

• Click: After issuing a query, users can click on one or all of the results on the SERP. Clicking a URL is an indication of user interest. We observe 64,693,054 clicks in the data; so, on average, a query receives 0.99 clicks (including repeat clicks to the same result). Table 3 shows the probability of click by position or rank of the result. Note the steep drop-off in click probability with position – the first document is clicked 44.51% of the time, whereas the fifth one is clicked a mere 5.56% of the time.

Table 3: Click probability based on position.

  Position    Probability of click
  1           0.4451
  2           0.1659
  3           0.1067
  4           0.0748
  5           0.0556
  6           0.0419
  7           0.0320
  8           0.0258
  9           0.0221
  10          0.0206

• Time: Yandex gives us the timeline of each session. However, to preserve anonymity, it does not disclose how many milliseconds are in one unit of time. Each action in a session comes with a time stamp that tells us how long into the session the action took place. Using these time stamps, Yandex calculates the 'dwell time' for each click, which is the time between a click and the next click or the next query. Dwell time is informative of how much time a user spent with a document.

• Relevance: A result is relevant to a user if she finds it useful, and irrelevant otherwise. Search engines attribute relevance scores to search results based on user behavior – clicks and dwell times. It is well known that dwell time is correlated with the probability that the user satisfied her information needs with the clicked document. The labeling is done automatically, and is hence different for each query issued by the user. Yandex uses three grades of relevance:
• Irrelevant, r = 0: A result gets an irrelevant grade if it doesn't get a click, or if it receives a click with a dwell time strictly less than 50 units of time.

• Relevant, r = 1: Results with dwell times between 50 and 399 time units receive a relevant grade.

• Highly relevant, r = 2: This grade is given to two types of documents – (a) those that received a click with a dwell time of 400 units or more, and (b) those whose click was the last action of the session (irrespective of dwell time). The last click of a session is considered to be highly relevant because a user is unlikely to have terminated her search if she is not satisfied.

In cases where a document is clicked multiple times after a query, the maximum dwell time is used to calculate the document's relevance (a sketch of this log format and grading logic follows below).

In sum, we have a total of 167,413,039 lines of data.
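To make the log format of Table 1 and the relevance grading concrete, here is a minimal sketch in Python. It is illustrative only – the field layout follows the description in Table 1's caption (assuming tab-separated records), the grading thresholds follow the rules above, and all function names (parse_session, relevance_grades) are ours, not Yandex's.

```python
# Illustrative sketch: parse one session's log rows (Table 1 format) and
# grade clicked URLs using the dwell-time rules above.

def parse_session(lines):
    """Split a session's tab-separated rows into metadata, queries, clicks."""
    day = user = None
    queries, clicks = [], []
    for line in lines:
        f = line.strip().split('\t')
        if f[1] == 'M':                    # SessionID M Day UserID
            day, user = int(f[2]), int(f[3])
        elif f[2] == 'Q':                  # SID Time Q SERPID QueryID Terms URL,Domain ...
            queries.append({'time': int(f[1]),
                            'query': int(f[4]),
                            'terms': [int(t) for t in f[5].split(',')],
                            'results': [tuple(map(int, p.split(',')))
                                        for p in f[6:]]})
        elif f[2] == 'C':                  # SID Time C SERPID URLID
            clicks.append({'time': int(f[1]), 'url': int(f[4])})
    return day, user, queries, clicks

def relevance_grades(clicks, query_times):
    """r = 0 if dwell < 50 units, r = 1 if dwell is 50-399, and r = 2 if
    dwell >= 400 or the click is the last action of the session. Dwell time
    is the gap to the next click or query; repeat clicks on a URL keep the
    maximum grade. Unclicked results are implicitly r = 0 (absent here)."""
    action_times = sorted(query_times + [c['time'] for c in clicks])
    grades = {}
    for c in clicks:
        later = [t for t in action_times if t > c['time']]
        if not later:                      # last action of the session
            r = 2
        else:
            dwell = later[0] - c['time']
            r = 2 if dwell >= 400 else (1 if dwell >= 50 else 0)
        grades[c['url']] = max(grades.get(c['url'], 0), r)
    return grades
```

Applied to the session in Table 1 (with query_times = [0]), both clicked URLs receive r = 2: the first click has a dwell time of 482 units (495 − 13), and the second click is the last action of the session.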
3.2 Model-Free Analysis

For personalization to be effective, we need to observe three types of patterns in the data. First, we need sufficient data on user history to provide personalized results. Second, there has to be significant heterogeneity in users' tastes. Third, users' tastes have to exhibit persistence over time. Below, we examine the extent to which search and click patterns satisfy these criteria.

First, consider user-history metrics. If we examine the number of queries per session (see Figure 3), we find that 60% of sessions have only one query, which implies that within-session personalization is not possible for them. However, the remaining 40% of sessions have two or more queries, making them amenable to within-session personalization. Next, we examine the feasibility of across-session personalization for users observed in days 26 and 27 by considering the number of queries they have issued in the past.6 We find that the median user has issued 16 queries, and over 25% of users have issued at least 32 queries (see Figure 4). Together, these patterns suggest that a reasonable degree of both short- and long-term personalization is feasible in this context.

6 As explained in §6.1, users observed early in the data have no/very little past history because we have a data snapshot for a fixed time period (27 days), and not a fixed history for each user. Hence, following Yandex's suggestion, we run the personalization algorithm for users observed in the last two days. In practice, of course, firms don't have to work with snapshots of data; rather, they are likely to have longer histories stored at the user level.

Figure 3: Cumulative distribution function of the number of queries in a session. Min, Max = 1, 280.

Figure 4: Cumulative distribution function of the number of queries per user (observed in days 26 and 27). Min, Max = 1, 1260.

Figure 5 presents the distribution of the average number of clicks per query for the users in the data. As we can see, there is considerable heterogeneity in users' propensity to click. For example, more than 15% of users don't click at all, whereas 20% of them click on 1.5 results or more for each query they issue. To ensure that these patterns are not driven by one-time or infrequent users, we present the same analysis for users who have issued at least 10 queries in Figure 6, which shows similar click patterns. Together, these two figures establish the presence of significant heterogeneity in users' clicking behavior.

Figure 5: Cumulative distribution function of the average number of clicks per query for a given user.

Figure 6: Cumulative distribution function of average clicks per query for users who have issued 10 or more queries.

Next, we examine whether there is variation, or entropy, in the URLs that are clicked for a given query. Figure 7 presents the cumulative distribution of the number of unique URLs clicked for a given query through the time period of the data. 37% of queries receive no clicks at all, and close to another 37% receive clicks to only one unique URL. These are likely to be navigational queries; e.g., almost everyone who searches for 'cnn' will click on www.cnn.com. However, close to 30% of queries receive clicks on two or more unique URLs, suggesting that different users click on different results for the same query. This could be either due to heterogeneity in users' preferences for different results, or because the results themselves change a lot for the query during the observation period (high churn in the results is likely for news-related queries, such as 'earthquake'). To examine whether the latter explanation drives all the variation in Figure 7, we plot the number of unique URLs and domains we see for a given query through the course of the data in Figure 8. We find that more than 90% of queries have no churn in the URLs shown on the results page. Hence, it is unlikely that the entropy in clicked URLs is driven by churn in search engine results. Rather, it is more likely to be driven by heterogeneity in users' tastes.

Figure 7: Cumulative distribution function of the number of unique URLs clicked for a given query in the 27-day period.

Figure 8: Cumulative distribution function of the number of unique URLs and domains shown for a given query in the 27-day period.

In sum, the basic patterns in the data point to significant heterogeneity in users' preferences and persistence of those preferences, which suggest the existence of positive returns to personalization.

4 Feature Generation

We now present the first module of our three-pronged empirical strategy – feature generation. In this module, we conceptualize and define an exhaustive feature set that captures both global and individual-level attributes that can help predict the most relevant documents for a given search by a user i during a session j. Feature engineering is an important aspect of all machine learning problems and is of both technical and substantive interest.

Before proceeding, we make note of two points. First, the term 'features', as used in the machine learning literature, may be interpreted as 'attributes' or 'explanatory variables' in the marketing context. Second, features based on the hyperlink structure of the web and the text/context of the URL, domain, and query are not visible to us, since we work with anonymized data. Such features play an important role in ranking algorithms, and not having information on them is a significant drawback (Qin et al., 2010). Nevertheless, we have one metric that can be interpreted as a discretized combination of such features – the current rank based on Yandex's algorithm (which does not take personalized search history into account). While this is not perfect (since there is information loss due to discretization, and because we cannot interact the underlying features with the user-specific ones generated here), it is the best data we have, and we use it as a measure of the features invisible to us.

Our framework consists of a series of functions that can be used to generate a large and varied set of features.
There are two main advantages to using such a framework instead of deriving and defining each feature individually. First, it allows us to succinctly define and obtain a large number of features. Second, the general framework ensures that the methods presented in this paper can be extrapolated to other personalized ranking contexts, e.g., online retailers or entertainment sites.

4.1 Functions for Feature Generation

We now present a series of functions that are used to compute sets of features that capture attributes associated with the past history of both the search result and the user.

1. Listings(query, url, usr, session, time): This function calculates the number of times a url appeared in a SERP for searches issued by a given usr for a given query in a specific session before a given instance of time. This function can also be invoked with certain input values left unspecified, in which case it is computed over all possible values of the unspecified inputs. For example, if the user and session are not specified, then it is computed over all users and across all sessions. That is, Listings(query, url, −, −, time) computes the number of occurrences of a url in any SERP for a specific query issued before a given instance of time. Another generalization of this function involves not providing an input value for query, usr, and session (as in Listings(−, url, −, −, time)), in which case the count is performed over all SERPs associated with all queries, users, and sessions; i.e., we count the number of times a given url appears in all SERPs seen in the data before a given point in time, irrespective of the query. We also define the following two variants of this function.

• If we supply a search term (term) as an input instead of a query, then the invocation Listings(term, url, usr, session, time) takes into account all queries that include term as one of the search terms.

• If we supply a domain as an input instead of a URL, then the invocation Listings(query, domain, usr, session, time) counts the number of times that a result from domain appeared in the SERPs of the searches associated with a given query.

2. AvgPosn(query, url, usr, session, time): This function calculates the average position/rank of a url within the SERPs that list it in response to searches for a given query by a specific usr within a given session and before an instance of time. It provides information over and above the Listings function, as it characterizes the average ranking of the result, as opposed to simply noting its presence in the SERP.

3. Clicks(query, url, usr, session, time): This function calculates the number of times that a specific usr clicked on a given url after seeing it on a SERP provided in response to searches for a given query within a specific session before a given instance of time. Clicks(·) can then be used to compute click-through rates (CTRs) as follows:

$$\text{CTR}(query, url, usr, session, time) = \frac{\text{Clicks}(query, url, usr, session, time)}{\text{Listings}(query, url, usr, session, time)} \quad (1)$$

4. HDClicks(query, url, usr, session, time): This function calculates the number of times that a usr made a high-duration click, i.e., a click of relevance r = 2, on a given url after seeing it on a SERP provided in response to searches for a given query within a specific session and before a given instance of time.
HDClicks(·) can then be used to compute high-duration click-through rates (HDCTRs) as follows:

$$\text{HDCTR}(query, url, usr, session, time) = \frac{\text{HDClicks}(query, url, usr, session, time)}{\text{Listings}(query, url, usr, session, time)} \quad (2)$$

Note that HDClicks and HDCTR contain information over and beyond Clicks and CTR, since they consider only clicks of high relevance.

5. ClickOrder(query, url, usr, session, time) is the order of the click (conditional on a click) for a given url by a specific usr in response to a given query within a specific session and before a given instance of time. Then:

$$\text{WClicks}(query, url, usr, session, time) = \begin{cases} \dfrac{1}{\text{ClickOrder}(query, url, usr, session, time)} & \text{if } url \text{ was clicked} \\ 0 & \text{if } url \text{ was not clicked} \end{cases} \quad (3)$$

WClicks calculates a weighted click metric that captures the idea that a URL that is clicked first is more relevant than one that is clicked second, and so on. This is in contrast to Clicks, which assigns the same weight to URLs irrespective of the order in which they are clicked. WClicks is used to compute weighted click-through rates (WCTRs) as follows:

$$\text{WCTR}(query, url, usr, session, time) = \frac{\text{WClicks}(query, url, usr, session, time)}{\text{Listings}(query, url, usr, session, time)} \quad (4)$$

For instance, a URL that is always clicked first when it is displayed has WCTR = CTR, whereas for a URL that is always clicked second, WCTR = (1/2) CTR, and for one that is always clicked third, WCTR = (1/3) CTR.

6. PosnClicks(query, url, usr, session, posn, time): The position of a URL in the SERP is denoted by posn. Then:

$$\text{PosnClicks}(query, url, usr, session, posn, time) = \frac{posn}{10} \times \text{Clicks}(query, url, usr, session, time) \quad (5)$$

If a URL in a lower position (further down the page) gets clicked, it is likely to be more relevant than a clicked URL in a higher position. This stems from the idea that users usually go through the list of URLs from top to bottom, and only click on URLs in lower positions if those URLs are especially relevant. Using this, we can calculate the position-weighted CTR as follows:

$$\text{PosnCTR}(query, url, usr, session, posn, time) = \frac{\text{PosnClicks}(query, url, usr, session, posn, time)}{\text{Listings}(query, url, usr, session, posn, time)} \quad (6)$$

7. Skips(query, url, usr, session, time): Previous research suggests that users usually go over the list of results in a SERP sequentially from top to bottom (Craswell et al., 2008). Thus, if the user skips a top URL and clicks on one in a lower position, it is significant, because the decision to skip the top URL is a conscious one. If a user clicks on position 2 after skipping position 1, the implications for the URLs in positions 1 and 3 are different. The URL in position 1 has been examined and deemed irrelevant, whereas the URL in position 3 may not have been clicked because the user never reached it. Thus, the implications of a lack of clicks are more damning for the skipped URL in position 1 than for the one in position 3. The function Skips captures the number of times a specific url has been 'skipped' by a given usr when it was shown in response to a specific query within a given session and before time. For example, for a given query, assume that the user clicks on the URL in position 5 followed by that in position 3. Then the URLs in positions 1 and 2 were skipped twice, the URLs in positions 3 and 4 were skipped once, and those in positions 6-10 were skipped zero times. We then compute SkipRate as follows:

$$\text{SkipRate}(query, url, usr, session, time) = \frac{\text{Skips}(query, url, usr, session, time)}{\text{Listings}(query, url, usr, session, time)} \quad (7)$$
8. Dwells(query, url, usr, session, time): This function calculates the total amount of time that a given usr dwells on a specific url after clicking on it in response to a given query within a given session and before time. Dwells can be used to compute the average dwell time (ADT), which has additional information on the relevance of a url over and above the click-through rate:

$$\text{ADT}(query, url, usr, session, time) = \frac{\text{Dwells}(query, url, usr, session, time)}{\text{Clicks}(query, url, usr, session, time)} \quad (8)$$

As in the case of Listings, all of the above functions can be invoked in their generalized forms without specifying a query, user, or session. Similarly, they can also be invoked for terms and domains, except SkipRate, which is not invoked for domains. The features generated from these functions are listed in Table A1 in the Appendix. Note that we use only the first four terms for features computed using search terms (e.g., Features 16 through 19). Given that 80% of queries have four or fewer search terms (see Table 2), the information loss from doing so is minimal, while keeping the number of features reasonable.

In addition to the above, we also use the following functions to generate additional features.

9. QueryTerms(query) is the total number of terms in a given query.

10. Day(usr, session, time) is the day of the week (Monday, Tuesday, etc.) on which the query was made by a given usr in a specific session before a given instance of time.

11. ClickProb(usr, posn, time) is the probability that a given usr clicks on a search result that appears at position posn within a SERP before a given instance of time. This function captures the heterogeneity in users' propensities to click on results appearing at different positions.

12. NumSearchResults(query) is the total number of search results appearing on any SERP associated with a given query over the observation period. It is an entropy metric that captures the extent to which the search results associated with a query vary over time. This number is likely to be high for news-related queries.

13. NumClickedURLs(query) is the total number of unique URLs clicked on any SERP associated with that query. It is a measure of the heterogeneity in users' tastes for information/results within a query. For instance, for navigational queries this number is likely to be very low, whereas for other queries, like 'Egypt', it is likely to be high, since some users may be seeking information on the recent political conflict in Egypt, others on tourism to Egypt, while yet others on the history/culture of Egypt.

14. SERPRank(url, query, usr, session): Finally, as mentioned earlier, we include the position/rank of a result for a given query by a given usr in a specific session. This is informative of many aspects of the query, page, and context that are not visible to us. By using SERPRank as one of the features, we are able to fold this information into a personalized ranking algorithm. Note that SERPRank is the same as the position posn of the URL on the results page. However, we define it as a separate feature, since the position can vary across observations.

As before, some of these functions can be invoked more generally, and all the features generated through them are listed in Table A1 in the Appendix. A sketch of the counting pattern behind these functions is shown below.
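The sketch below illustrates the counting pattern underlying these functions for three representative metrics – Listings, Clicks/CTR, and Skips – using the position-5-then-3 example from item 7. The in-memory counters, the `None`-as-wildcard convention, and the choice of aggregation levels are our illustrative assumptions; the paper's actual implementation is not shown.

```python
# Illustrative counters for Listings, Clicks, and Skips at several
# aggregation levels; None plays the role of an unspecified input.
from collections import defaultdict

listings = defaultdict(int)
clicks = defaultdict(int)
skips = defaultdict(int)

def keys(query, url, usr, session):
    """Aggregation levels for one observation, from fully specified
    (within-session, user-specific) down to global."""
    return [(query, url, usr, session),
            (query, url, usr, None),    # across this user's sessions
            (query, url, None, None),   # across all users
            (None, url, None, None)]    # any query, this url

def update(query, usr, session, serp_urls, clicked_positions):
    """Process one SERP. serp_urls: the 10 ranked URLs; clicked_positions:
    1-based positions clicked, in click order."""
    for posn, url in enumerate(serp_urls, start=1):
        for k in keys(query, url, usr, session):
            listings[k] += 1
            if posn in clicked_positions:
                clicks[k] += 1
    # Each click at position p adds one skip to every higher, not-yet-clicked
    # position: clicking 5 then 3 skips positions 1-2 twice and 3-4 once,
    # matching the example in item 7.
    done = set()
    for p in clicked_positions:
        for q in range(1, p):
            if q not in done:
                for k in keys(query, serp_urls[q - 1], usr, session):
                    skips[k] += 1
        done.add(p)

def ctr(query, url, usr=None, session=None):
    """Click-through rate (Equation 1) at any aggregation level."""
    k = (query, url, usr, session)
    return clicks[k] / listings[k] if listings[k] else 0.0
```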
4.2 Categorization of Feature Set

The total number of features that we can generate from these functions is very large (293, to be specific). To aid exposition, and to experiment with the impact of including or excluding certain sets of features, we group the above set of features into the following (partially overlapping) categories. This grouping also provides a general framework that can be used in other contexts.

1. FP includes all user- or person-specific features, except ones that are at session-level granularity. For example, FP includes Feature 76, which counts the number of times a specific url appeared in a SERP for searches issued by a given usr for a specific query.

2. FS includes all user-specific features at session-level granularity. For example, FS includes Feature 151, which measures the number of times a given url appeared in a SERP for searches issued by a given usr for a specific query in a given session.

3. FG includes features that provide global or system-wide statistics regarding searches (i.e., features that are not user-specific). For instance, FG includes Feature 1, which measures the number of times a specific url appeared in a SERP for searches involving a specific query, irrespective of the user issuing the search.

4. FU includes features associated with a given url. For example, FU includes Features 1, 76, and 151, but not Feature 9, which is a function of the domain name associated with a search result, not of the URL.

5. FD includes features that are attributes associated with the domain name of the search result. This set includes features such as Feature 9, which measures the number of times a given domain is the domain name of a result in a SERP provided for a specific query.

6. FQ includes features that are computed based on the query used for the search. For example, FQ includes Feature 1, but does not include Features 16-19 (which are based on the individual terms used in the search) or Feature 226 (which is not query-specific).

7. FT includes features that are computed based on the individual terms comprising a search query. For example, FT includes Features 16-19.

8. FC includes features that are based on the clicks received by the result. Examples include Features 2, 5, and 6.

9. FN includes features that are based on weighted clicks; e.g., it includes Features 36-39.

10. FW includes features that are computed based on the dwell times associated with user clicks; e.g., it includes Feature 3, which is the average dwell time for a given url in response to a specific query.

11. FR includes features that are computed based on the rankings of a search result on different SERPs. Examples include Features 7 and 15.

12. FK includes features that measure how often a specific search result was skipped over by a user across different search queries, e.g., Feature 8.

The third column of Table A1 in the Appendix provides a mapping from features to the feature sets they belong to. It is not necessary to include polynomial or other higher-order transformations of the features, because the optimizer we use (MART) is insensitive to monotone transformations of the input features (Murphy, 2012).

5 Ranking Methodology

We now present the second module of our empirical framework, which consists of a methodology for re-ranking the top 10 results for a given query, by a given user, at a given point in time. The goal of re-ranking is to promote URLs that are more likely to be relevant to the user and demote those that are less likely to be relevant. Relevance is data-driven and defined based on clicks and dwell times.
5.1 Evaluation Metric

To evaluate whether a specific re-ranking method improves overall search quality, we first need to define an evaluation metric. The most commonly used metric in the ranking literature is Normalized Discounted Cumulative Gain (NDCG). It is also used by Yandex to evaluate the quality of its search results. Hence, we use this metric for evaluation too. NDCG is a list-level graded relevance metric, i.e., it allows for multiple levels of relevance. It measures the usefulness, or gain, of a document based on its position in the result list, and rewards rank-orderings that place documents in decreasing order of relevance. It penalizes mistakes in top positions more than those in lower positions, because users are more likely to focus and click on the top documents.

We start by defining the simpler metric, Discounted Cumulative Gain, DCG (Järvelin and Kekäläinen, 2000; Järvelin and Kekäläinen, 2002). DCG is a list-level metric defined for a truncation position P. For example, if we are only concerned about the top 10 documents, then we would work with DCG_10. The DCG at truncation position P is defined as:

$$DCG_P = \sum_{p=1}^{P} \frac{2^{r_p} - 1}{\log_2(p+1)} \quad (9)$$

where $r_p$ is the relevance rating of the document (or URL) in position p. A problem with DCG is that the contribution of each query to the overall DCG score of the data varies. For example, a search that retrieves four ordered documents with relevance scores 2, 2, 1, 1 will have a higher DCG score than a different search that retrieves four ordered documents with relevance scores 1, 1, 0, 0. This is problematic, especially for re-ranking algorithms, where we only care about the relative ordering of documents. To address this issue, a normalized version of DCG is used:

$$NDCG_P = \frac{DCG_P}{IDCG_P} \quad (10)$$

where $IDCG_P$ is the ideal DCG at truncation position P. The $IDCG_P$ for a list is obtained by rank-ordering all the documents in the list in decreasing order of relevance. NDCG always takes values between 0 and 1. Since NDCG is computed for each search in the training data and then averaged over all queries in the training set, no matter how poorly the documents associated with a particular query are ranked, that query will not dominate the evaluation process, since each query contributes equally to the average measure.

NDCG has two main advantages. First, it allows a researcher to specify the number of positions she is interested in. For example, if we only care about the first page of search results (as in our case), then we can work with NDCG_10 (henceforth referred to as NDCG for notational simplicity). Instead, if we are interested in the first two pages, we can work with NDCG_20. Second, NDCG is a list-level metric; it allows us to evaluate the quality of the entire list of results, with built-in penalties for mistakes at each position.

Nevertheless, NDCG suffers from a tricky disadvantage. It is based on discrete positions, which makes its optimization very difficult. For instance, if we were to use an optimizer based on continuous scores, we would find that the relative positions of two documents (and hence NDCG) do not change unless there are significant changes in their rank scores. So we could obtain the same rank-orderings for multiple parameter values of the model. This is an important consideration that constrains the types of optimization methods that can be used to maximize NDCG.
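The following short sketch implements Equations (9) and (10) and reproduces the two four-document examples above; the function names are illustrative.

```python
import math

def dcg(relevances, P=10):
    """DCG at truncation position P (Equation 9)."""
    return sum((2 ** r - 1) / math.log2(p + 1)
               for p, r in enumerate(relevances[:P], start=1))

def ndcg(relevances, P=10):
    """NDCG = DCG / IDCG (Equation 10); IDCG re-sorts the list in decreasing
    order of relevance. Lists with IDCG = 0 are scored 0 here, though
    conventions for that edge case vary."""
    ideal = dcg(sorted(relevances, reverse=True), P)
    return dcg(relevances, P) / ideal if ideal > 0 else 0.0

# The two searches discussed above: their raw DCG scores differ (about
# 5.82 vs. 1.63), yet both lists are already ideally ordered, so both
# have NDCG = 1 and contribute equally to the average.
print(dcg([2, 2, 1, 1]), ndcg([2, 2, 1, 1]))
print(dcg([1, 1, 0, 0]), ndcg([1, 1, 0, 0]))
```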
5.2 General Structure of the Ranking Problem

We now present the general structure of the problem and some nomenclature to help with exposition. We use i to index users, j to index sessions, and k to index the search within a session. Further, let q denote the query associated with a search, and u denote the URL of a given search result.

Let $x^{qu}_{ijk} \in \mathbb{R}^F$ denote a set of F features for document/URL u in response to a query q for the k-th search in the j-th session of user i. The function $f(\cdot)$ maps the feature vector to a score $s^{qu}_{ijk}$ such that $s^{qu}_{ijk} = f(x^{qu}_{ijk}; w)$, where w is a set of parameters that define the function $f(\cdot)$. $s^{qu}_{ijk}$ can be interpreted as the gain or value associated with document u for query q for user i in the k-th search in her j-th session. Let $U^{qm}_{ijk} \succ U^{qn}_{ijk}$ denote the event that URL m should be ranked higher than n for query q for user i in the k-th search of her j-th session. For example, if m has been labeled highly relevant ($r_m = 2$) and n has been labeled irrelevant ($r_n = 0$), then $U^{qm}_{ijk} \succ U^{qn}_{ijk}$. Note that the labels for the same URLs can vary with queries and users, as well as across and within sessions; hence the super-scripting by q and the sub-scripting by i, j, and k.

Given two documents m and n associated with a search, we can write the probability that document m is more relevant than n using a logistic function as follows:

$$P^{qmn}_{ijk} = P(U^{qm}_{ijk} \succ U^{qn}_{ijk}) = \frac{1}{1 + e^{-\sigma(s^{qm}_{ijk} - s^{qn}_{ijk})}} \quad (11)$$

where $\{x^{qm}_{ijk}, x^{qn}_{ijk}\}$ refer to the input features of the two documents and $\sigma$ determines the shape of this sigmoid function. Using these probabilities, we can define the log-likelihood of observing the data, or the Cross Entropy Cost function, as follows. This is interpreted as the penalty imposed for deviations of model-predicted probabilities from the desired probabilities:

$$C^{qmn}_{ijk} = -\bar{P}^{qmn}_{ijk} \log(P^{qmn}_{ijk}) - (1 - \bar{P}^{qmn}_{ijk}) \log(1 - P^{qmn}_{ijk}) \quad (12)$$

$$C = \sum_{i} \sum_{j} \sum_{k} \sum_{\forall m,n} C^{qmn}_{ijk} \quad (13)$$

where $\bar{P}^{qmn}_{ijk}$ is the observed probability (in the data) that m is more relevant than n. The summation in Equation (13) is taken in this order – first, over all pairs of URLs ({m, n}) in the consideration set for the k-th search in session j of user i; next, over all the searches within session j; then, over all the sessions of user i; and finally, over all users.

If the function $f(\cdot)$ is known or specified by the researcher, the cost C can be minimized using commonly available optimizers such as Newton-Raphson or BFGS. Traditionally, this has been the approach adopted by marketing researchers, since they are usually interested in inferring the impact of some marketing activity (e.g., price, promotion) on an outcome variable (e.g., awareness, purchase) and therefore assume the functional form of $f(\cdot)$. In contrast, when the goal is to predict the best rank-ordering – one that maximizes consumers' click probability and dwell time – assuming the functional form of $f(\cdot)$ is detrimental to the predictive ability of the model. Moreover, manual experimentation to arrive at an optimal functional form becomes exponentially more difficult as the number of features expands. This is especially the case if we want the function to be sufficiently flexible (non-linear) to capture the myriad interaction patterns in the data.
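To make the pairwise objective concrete before turning to estimation, here is a minimal sketch of Equations (11) and (12) for a single document pair. The variables sigma, s_m, s_n, and p_bar correspond to $\sigma$, $s^{qm}_{ijk}$, $s^{qn}_{ijk}$, and $\bar{P}^{qmn}_{ijk}$ above; the full cost C in Equation (13) is simply this quantity summed over all pairs, searches, sessions, and users. The function names are ours.

```python
import math

def pair_prob(s_m, s_n, sigma=1.0):
    """P(m ranked above n), Equation (11): a logistic in the score gap."""
    return 1.0 / (1.0 + math.exp(-sigma * (s_m - s_n)))

def cross_entropy(p_bar, s_m, s_n, sigma=1.0):
    """Equation (12): penalty for deviating from the observed preference."""
    p = pair_prob(s_m, s_n, sigma)
    return -p_bar * math.log(p) - (1 - p_bar) * math.log(1 - p)

# If m is highly relevant and n irrelevant (observed p_bar = 1), a model
# scoring them s_m = 2.0 and s_n = 0.5 pays a small penalty of about 0.20;
# reversing the two scores would make the penalty much larger.
print(cross_entropy(1.0, 2.0, 0.5))
```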
If the function f(·) is known or specified by the researcher, the cost C can be minimized using commonly available optimizers such as Newton-Raphson or BFGS. Traditionally, this has been the approach adopted by marketing researchers, since they are usually interested in inferring the impact of some marketing activity (e.g., price, promotion) on an outcome variable (e.g., awareness, purchase) and therefore assume the functional form of f(·). In contrast, when the goal is to predict the best rank-ordering, that is, the one that maximizes consumers' click probability and dwell-time, assuming a functional form for f(·) is detrimental to the predictive ability of the model. Moreover, manual experimentation to arrive at an optimal functional form becomes exponentially more difficult as the number of features expands. This is especially the case if we want the function to be sufficiently flexible (non-linear) to capture the myriad interaction patterns in the data.

To address these challenges, we turn to solutions from the machine learning literature, such as MART, that infer both the optimal parameters w and the function f(·) from the data.7

7 In standard marketing models, the ws are treated as structural parameters that define the utility associated with an option, so their directionality and magnitude are of interest to researchers. In our case, however, there is no structural interpretation of w. In fact, with complex trainers such as MART, the number of parameters is large and difficult to interpret.

5.3 MART – Multiple Additive Regression Trees

MART is a machine learning algorithm that models a dependent or output variable as a linear combination of a set of shallow regression trees (a process known as boosting). In this section, we introduce the concepts of Classification and Regression Trees (CART) and boosted CART (referred to as MART). We present a high-level overview of these models here and refer interested readers to Murphy (2012) for details.

Figure 9: Example of a CART model, with internal nodes A and B and three leaves.

5.3.1 Classification and Regression Trees

CART methods are a popular class of prediction algorithms that recursively partition the input space corresponding to a set of explanatory variables into multiple regions and assign an output value to each region. This kind of partitioning can be represented by a tree structure, where each leaf of the tree represents an output region. Consider a dataset with two input variables {x_1, x_2}, which are used to predict or model an output variable y using a CART. An example tree with three leaves (or output regions) is shown in Figure 9. This tree first asks if x_1 is less than or equal to a threshold t_1. If yes, it assigns the value of 1 to the output y. If not (i.e., if x_1 > t_1), it then asks if x_2 is less than or equal to a threshold t_2. If yes, it assigns y = 2 to this region; if not, it assigns y = 3. The chosen y value for a region corresponds to the mean value of y in that region in the case of a continuous output, and the dominant y value in the case of discrete outputs.

Trees are trained or grown using a pre-defined number of leaves and by specifying a cost function that is minimized at each step of the tree using a greedy algorithm. The greedy algorithm implies that at each split, the previous splits are taken as given, and the cost function is minimized going forward. For instance, at node B in Figure 9, the algorithm does not revisit the split at node A. It does, however, consider all possible splits on all the variables at each node. Thus, the split points at each node can be arbitrary, the tree can be highly unbalanced, and variables can repeat at later child nodes. All of this flexibility in tree construction can be used to capture a complex set of interactions, which are not predefined but are learned from the data.

CART is popular in the machine learning literature because it is scalable, is easy to interpret, can handle a mixture of discrete and continuous inputs, is insensitive to monotone transformations, and performs automatic variable selection (Murphy, 2012). However, it has accuracy limitations because of its discontinuous nature and because it is trained using greedy algorithms. These drawbacks can be addressed (while preserving all the advantages) through boosting, which gives us MART.
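To make the partitioning in Figure 9 concrete, here is a minimal sketch of ours using scikit-learn (whose decision trees are a CART-style learner). It grows a three-leaf regression tree on invented data that follow the structure of Figure 9.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Toy data mimicking Figure 9: y = 1 when x1 is small; otherwise 2 or 3
# depending on x2. The thresholds t1 = 4 and t2 = 6 are arbitrary choices.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 2))
y = np.where(X[:, 0] <= 4, 1, np.where(X[:, 1] <= 6, 2, 3)).astype(float)

# A CART with at most three leaves, grown greedily by minimizing squared error.
tree = DecisionTreeRegressor(max_leaf_nodes=3).fit(X, y)
print(export_text(tree, feature_names=["x1", "x2"]))
# The printed tree recovers splits near t1 on x1 and t2 on x2, with each
# leaf predicting the mean of y in its region.
```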
5.3.2 Boosting

Boosting is a technique that can be applied to any classification or prediction algorithm to improve its accuracy (Schapire, 1990). Applying the additive boosting technique to CART produces MART, which has been shown empirically to be the best classifier available (Caruana and Niculescu-Mizil, 2006; Hastie et al., 2009). MART can be viewed as performing gradient descent in function space using shallow regression trees (with a small number of leaves). MART works well because it combines the positive aspects of CART with those of boosting. CART, and especially shallow regression trees, tend to have high bias but low variance. Boosting CART models addresses the bias problem while retaining the low variance. Thus, MART produces high-quality classifiers.

MART can be interpreted as a weighted linear combination of a series of regression trees, each trained sequentially to improve the final output using a greedy algorithm. MART's output L_N(x) can be written as:

L_N(x) = \sum_{n=1}^{N} \alpha_n l_n(x, \beta_n)    (14)

where l_n(x, \beta_n) is the function modeled by the n-th regression tree and \alpha_n is the weight associated with the n-th tree. Both the l_n(·)s and the \alpha_ns are learned during training or estimation. MARTs are also trained using greedy algorithms: l_n(x, \beta_n) is chosen so as to minimize a pre-specified cost function, usually the least-squares error in the case of regressions and an entropy or logit loss function in the case of classification or discrete choice models.

5.4 Learning Algorithms for Ranking

We now discuss two Learning to Rank algorithms that can be used to improve search quality – RankNet and LambdaMART. Both of these algorithms can use MART as the underlying classifier.

5.4.1 RankNet – Pairwise Learning with Gradient Descent

In an influential paper, Burges et al. (2005) proposed one of the earliest ranking algorithms – RankNet. Versions of RankNet are used by commercial search engines such as Bing (Liu, 2011). RankNet uses a gradient descent formulation to minimize the Cross Entropy Cost function defined in Equations (12) and (13). Recall that we use i to index users, j to index sessions, k to index the search within a session, q to denote the query associated with a search, and u to denote the URL of a given search result. The score and entropy functions are defined for two documents m and n.

Let S^{qmn}_{ijk} \in {0, 1, -1}, where S^{qmn}_{ijk} = 1 if m is more relevant than n, -1 if n is more relevant than m, and 0 if they are equally relevant. Then we have \bar{P}^{qmn}_{ijk} = \frac{1}{2}(1 + S^{qmn}_{ijk}), and we can rewrite C^{qmn}_{ijk} as:

C^{qmn}_{ijk} = \frac{1}{2}(1 - S^{qmn}_{ijk}) \sigma (s^{qm}_{ijk} - s^{qn}_{ijk}) + \log(1 + e^{-\sigma(s^{qm}_{ijk} - s^{qn}_{ijk})})    (15)

Note that the Cross Entropy function is symmetric; it is invariant to swapping m and n and flipping the sign of S^{qmn}_{ijk}.8 Further, we have:

C^{qmn}_{ijk} = \log(1 + e^{-\sigma(s^{qm}_{ijk} - s^{qn}_{ijk})})    if S^{qmn}_{ijk} = 1
C^{qmn}_{ijk} = \log(1 + e^{-\sigma(s^{qn}_{ijk} - s^{qm}_{ijk})})    if S^{qmn}_{ijk} = -1
C^{qmn}_{ijk} = \frac{1}{2} \sigma (s^{qm}_{ijk} - s^{qn}_{ijk}) + \log(1 + e^{-\sigma(s^{qm}_{ijk} - s^{qn}_{ijk})})    if S^{qmn}_{ijk} = 0    (16)

8 Asymptotically, C^{qmn}_{ijk} becomes linear if the scores give the wrong ranking and zero if they give the correct ranking.

The gradient of the Cross Entropy Cost function is:

\frac{\partial C^{qmn}_{ijk}}{\partial s^{qm}_{ijk}} = -\frac{\partial C^{qmn}_{ijk}}{\partial s^{qn}_{ijk}} = \sigma \left( \frac{1}{2}(1 - S^{qmn}_{ijk}) - \frac{1}{1 + e^{\sigma(s^{qm}_{ijk} - s^{qn}_{ijk})}} \right)    (17)

We can now employ a stochastic gradient descent algorithm to update the parameters w of the model as follows:

w \to w - \eta \frac{\partial C^{qmn}_{ijk}}{\partial w} = w - \eta \left( \frac{\partial C^{qmn}_{ijk}}{\partial s^{qm}_{ijk}} \frac{\partial s^{qm}_{ijk}}{\partial w} + \frac{\partial C^{qmn}_{ijk}}{\partial s^{qn}_{ijk}} \frac{\partial s^{qn}_{ijk}}{\partial w} \right)    (18)

where \eta is a learning rate specified by the researcher. The advantage of splitting the gradient as shown in Equation (18), instead of working with \partial C^{qmn}_{ijk} / \partial w directly, is that it speeds up the gradient calculation by allowing us to use the analytical formulations of the component partial derivatives.
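The following sketch of ours runs one stochastic gradient step of Equations (17) and (18) for a linear scoring function. With MART as the underlying learner, the same gradients drive tree construction instead of a direct parameter update.

```python
import numpy as np

def ranknet_sgd_step(w, x_m, x_n, S, eta=0.1, sigma=1.0):
    """One update of Equation (18) for a linear scorer s = w @ x.
    S in {1, -1, 0} encodes which document is more relevant (Equation 15)."""
    s_m, s_n = w @ x_m, w @ x_n
    # Equation (17): dC/ds_m (which equals -dC/ds_n).
    grad_s = sigma * (0.5 * (1 - S) - 1.0 / (1 + np.exp(sigma * (s_m - s_n))))
    # For a linear scorer, ds_m/dw = x_m and ds_n/dw = x_n.
    return w - eta * (grad_s * x_m - grad_s * x_n)

w = np.zeros(3)
x_m, x_n = np.array([1.0, 0.0, 2.0]), np.array([0.5, 1.0, 1.0])
w = ranknet_sgd_step(w, x_m, x_n, S=1)  # m is more relevant; its score rises
print(w @ x_m > w @ x_n)  # True
```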
Most recent implementations of RankNet use the following factorization to improve speed:

\frac{\partial C^{qmn}_{ijk}}{\partial w} = \lambda^{qmn}_{ijk} \left( \frac{\partial s^{qm}_{ijk}}{\partial w} - \frac{\partial s^{qn}_{ijk}}{\partial w} \right)    (19)

where

\lambda^{qmn}_{ijk} = \frac{\partial C^{qmn}_{ijk}}{\partial s^{qm}_{ijk}} = \sigma \left( \frac{1}{2}(1 - S^{qmn}_{ijk}) - \frac{1}{1 + e^{\sigma(s^{qm}_{ijk} - s^{qn}_{ijk})}} \right)    (20)

Given this cost and gradient formulation, MART can now be used to construct a sequence of regression trees that sequentially minimize the cost function defined by RankNet. The n-th step (or tree) of MART can be conceptualized as the n-th gradient descent step that refines the function f(·) in order to lower the cost function. At the end of the process, both the optimal f(·) and the ws are inferred.

While RankNet has been successfully implemented in many settings, it suffers from three significant disadvantages. First, because it is a pairwise metric, it does not use all the information available when there are multiple levels of relevance. For example, consider two sets of documents {m_1, n_1} and {m_2, n_2} with relevance values {1, 0} and {2, 0}. In both cases, RankNet only utilizes the information that m should be ranked higher than n, and not the actual relevance scores, which makes it inefficient. Second, it does not differentiate between mistakes at the top of the list and those at the bottom. This is problematic if the goal is to penalize mistakes at the top of the list more heavily. Third, it optimizes the Cross Entropy Cost function, and not NDCG, which is our true objective function. This mismatch between the true objective function and the optimized objective function implies that RankNet is not the best algorithm for maximizing NDCG.

5.4.2 LambdaMART: Optimizing for NDCG

Note that RankNet's speed and versatility come from its gradient descent formulation, which requires the loss or cost function to be twice differentiable in the model parameters. Ideally, we would like to preserve the gradient descent formulation because of its attractive convergence properties. However, we need a way to optimize for NDCG within this framework, even though NDCG is a discontinuous metric that does not lend itself easily to gradient descent. LambdaMART addresses this challenge (Burges et al., 2006; Wu et al., 2010).

The key insight of LambdaMART is that we can directly train a model without explicitly specifying a cost function, as long as we have gradients of the cost and these gradients are consistent with the cost function that we wish to minimize. The gradient functions, the \lambda^{qmn}_{ijk}s in RankNet, can be interpreted as directional forces that pull documents in the direction of increasing relevance. If a document m is more relevant than n, then m gets an upward pull of magnitude |\lambda^{qmn}_{ijk}| and n gets a downward pull of the same magnitude. LambdaMART expands this idea of gradients functioning as directional forces by weighting them with the change in NDCG caused by swapping the positions of m and n, as follows:

\lambda^{qmn}_{ijk} = \frac{-\sigma}{1 + e^{\sigma(s^{qm}_{ijk} - s^{qn}_{ijk})}} |\Delta NDCG^{qmn}_{ijk}|    (21)

Note that \Delta NDCG^{qmn}_{ijk} = \Delta NDCG^{qnm}_{ijk} = 0 when m and n have the same relevance, so \lambda^{qmn}_{ijk} = 0 in those cases. (This is the reason why the term \frac{\sigma}{2}(1 - S^{qmn}_{ijk}) drops out.) Thus, the algorithm does not spend effort swapping two documents of the same relevance.
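To illustrate Equation (21), the sketch below, our own simplified illustration rather than the RankLib implementation, computes the lambda forces for one query's result list, weighting each pairwise gradient by the NDCG change from swapping the two documents.

```python
import math

def lambda_forces(scores, ratings, sigma=1.0):
    """Equation (21): per-document lambda forces for one result list.
    Documents are indexed in their current display order (position = index + 1)."""
    discounts = [1.0 / math.log2(p + 2) for p in range(len(scores))]
    ideal = sum((2 ** r - 1) * d
                for r, d in zip(sorted(ratings, reverse=True), discounts)) or 1.0
    lam = [0.0] * len(scores)
    for m in range(len(scores)):
        for n in range(len(scores)):
            if ratings[m] <= ratings[n]:
                continue  # pairs of equal relevance contribute nothing
            # |Delta NDCG| from swapping the documents at positions m and n.
            delta = abs(((2 ** ratings[m] - 1) - (2 ** ratings[n] - 1))
                        * (discounts[m] - discounts[n])) / ideal
            # Magnitude of the force in Equation (21): the more relevant
            # document m is pulled up; n is pulled down by the same amount.
            pull = sigma / (1.0 + math.exp(sigma * (scores[m] - scores[n]))) * delta
            lam[m] += pull
            lam[n] -= pull
    return lam

# A highly relevant document scored below an irrelevant one receives a large pull.
print(lambda_forces(scores=[2.0, 1.0, 0.5], ratings=[0, 2, 1]))
```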
Given that we want to maximize NDCG, suppose that there exists a utility or gain function C^{qmn}_{ijk} that satisfies the above gradient equation. (We refer to C^{qmn}_{ijk} as a gain function in this context, as opposed to a cost function, since we are looking to maximize NDCG.) Then we can continue to use gradient descent to update our parameter vector w as follows:

w \to w + \eta \frac{\partial C^{qmn}_{ijk}}{\partial w}    (22)

Using Poincaré's Lemma and a series of empirical results, Yue and Burges (2007) and Donmez et al. (2009) have shown that the gradient functions defined in Equation (21) correspond to a gain function that maximizes NDCG. Thus, we can use these gradient formulations and follow the same set of steps as we did for RankNet to maximize NDCG.9 LambdaMART has been successfully used in many settings (Chapelle and Chang, 2011), and we refer interested readers to Burges (2010) for a detailed discussion of the implementation and relative merits of these algorithms. In our analysis, we use the RankLib implementation of LambdaMART (Dang, 2011), which implements the above algorithm and learns the functional form of f(·) and the parameters (the ws) using MART.

9 The original version of LambdaRank was trained using neural networks. To improve its speed and performance, Wu et al. (2010) proposed a boosted tree version of LambdaRank – LambdaMART, which is faster and more accurate.

6 Implementation

We now discuss the details of implementation and data preparation.

6.1 Datasets – Training, Validation, and Testing

Supervised machine learning algorithms (such as ours) typically require three datasets – training, validation, and testing. The training data is the set of pre-classified data used for learning, i.e., estimating the model and inferring its parameters. In marketing settings, the model is usually optimized on the training dataset and its performance is then tested on a hold-out sample. However, this leads the researcher to pick a model with great in-sample performance, without optimizing for out-of-sample performance. In contrast, in most machine learning settings (including ours), the goal is to pick the model with the best out-of-sample performance; hence the validation data. At each step of the optimization (which is done on the training data), the model's performance is evaluated on the validation data. In our analysis, we optimize the model on the training data and stop the optimization only when we see no improvement in its performance on the validation data for at least 100 steps. Then we go back and pick the model that offered the best performance on the validation data. Because both the training and validation datasets are used in model selection, we need a new dataset for testing the predictive accuracy of the model. This is done on a separate hold-out sample, labeled the test data. The results of the final model on the test data are considered the final predictive accuracy achieved by our framework.
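A minimal sketch of the early-stopping rule described above. The callables `train_one_step`, `validation_ndcg`, and `snapshot` are placeholders standing in for one boosting round, the NDCG evaluation on validation data, and copying the current model, respectively; they are not part of any package.

```python
def train_with_early_stopping(train_one_step, validation_ndcg, snapshot,
                              model, patience=100, max_steps=10_000):
    """Stop when validation NDCG has not improved for `patience` steps,
    then return the model snapshot that scored best on the validation data."""
    best_ndcg, best_model, since_best = -1.0, None, 0
    for step in range(max_steps):
        model = train_one_step(model)   # placeholder: e.g., add one tree
        ndcg = validation_ndcg(model)   # placeholder: NDCG@10 on validation set
        if ndcg > best_ndcg:
            best_ndcg, best_model, since_best = ndcg, snapshot(model), 0
        else:
            since_best += 1
        if since_best >= patience:
            break                       # no improvement for `patience` steps
    return best_model
```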
The features described in §4.1 can be broadly categorized into two sets – (a) global features, which are common to all the users in the data, and (b) user-specific features, which are specific to a user (these can be within-session or across-session features). In real business settings, managers typically have data on both global and user-level features at any given point in time. Global features are constructed by aggregating the behavior of all the users in the system. Because this requires aggregation over millions of users, with data stored across multiple servers and geographies, it is not feasible to keep these features up-to-date on a real-time basis. So firms usually update these data on a regular (weekly or biweekly) basis. Further, global data are useful only if they have been aggregated over a sufficiently long period of time. In our setting, we cannot generate reliable global features for the first few days of observation because of insufficient history. In contrast, user-specific features have to be generated on a real-time basis and need to include all the past activity of the user, including earlier activity in the same session. This is necessary to ensure that we can make personalized recommendations for each user. Given these constraints and considerations, we now explain how we partition the data and generate these features.

Figure 10: Data-generation framework. Sessions A1, A2, B1, B2, and C1 (days 1–25) constitute the global data; users observed in days 26–27 (Sessions A3, C2, and C3) are split into training, validation, and test data.

Recall that we have a snapshot of 27 days of data from Yandex. Figure 10 presents a schema of how the three datasets (training, validation, test) are generated from these 27 days. First, using the entire data from the first 25 days, we generate the global features for all three datasets. Next, we split all the users observed in the last two days (days 26–27) into three random samples (training, validation, and testing).10 Then, for each observation of a user, all the data up till that observation are used to generate the user/session-level features. For example, Sessions A1, A2, B1, B2, and C1 are used to generate global features (such as the listing frequency and CTRs at the query, URL, and domain level) for all three datasets. Thus, the global features for all three datasets do not include information from sessions in days 26–27 (Sessions A3, C2, and C3). Next, for user A in the training data, the user-level features at any given point in Session A3 are calculated based on her entire history, starting from Session A1 (when she is first observed) till her last observed action in Session A3. User B is not observed in the last two days, so her sessions are only used to generate aggregate global features; i.e., she does not appear at the user level in the training, validation, or test datasets. For user C, the user-level features at any given point in Sessions C2 or C3 are calculated based on her entire history starting from Session C1.

10 We observe a total of 1,174,096 users in days 26–27, who participate in 2,327,766 sessions. Thus, each dataset (training, validation, and testing) has approximately one third of these users and sessions in it.
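The sketch below illustrates this leakage-free construction on a toy log: user-level features for each day-26/27 search are computed only from that user's strictly earlier activity, while global features come only from days 1–25. The field names and log layout are invented for illustration and assume the log is sorted chronologically.

```python
from collections import defaultdict

# Toy search log: (user, day, query, clicked), sorted by time. Illustrative only.
log = [
    ("A", 3, "hotels", 1), ("A", 12, "hotels", 0),
    ("B", 20, "news", 1), ("A", 26, "hotels", 1), ("C", 27, "news", 0),
]

# Global features: aggregate over ALL users, but only over days 1-25.
global_clicks = defaultdict(int)
for user, day, query, clicked in log:
    if day <= 25:
        global_clicks[query] += clicked

# User features: for each day-26/27 search, scan only that user's earlier rows.
rows = []
for i, (user, day, query, clicked) in enumerate(log):
    if day < 26:
        continue
    past = [r for r in log[:i] if r[0] == user]  # strictly before this search
    rows.append({"user": user, "query": query,
                 "global_query_clicks": global_clicks[query],
                 "user_past_searches": len(past),
                 "user_past_clicks": sum(r[3] for r in past)})
print(rows)
```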
6.2 Data Preparation

Before constructing the three datasets (training, validation, and test) described above, we clean the data to accommodate the test objective. Testing whether personalization is effective is only possible if there is at least one click. So, following Yandex's recommendation, we filter the datasets in two ways. First, for each user observed in days 26–27, we only include her queries from the test period that have at least one click with a dwell time of 50 units or more (so that there is at least one relevant or highly relevant document). Second, if a query in this period does not have any short-term context (i.e., it is the first one in the session) and the user is new (i.e., she has no search sessions in the training period), then we drop the query, since it carries no information for personalization.

Additional analysis of the test data shows that about 66% of the observed clicks are already on the first position. Personalization cannot further improve user experience in these cases, because the most relevant document already seems to have been ranked at the highest position. Hence, to get a clearer picture of the value of personalization, we need to focus on situations where the user clicks on a document in a lower position (in the current ranking) and then examine whether our proposed personalized ranking improves the position of the clicked documents. We therefore consider two test datasets: 1) the entire test data, which we denote as Dataset 1, and 2) a subset consisting only of queries for which clicks are at position 2 or lower, which we denote as Dataset 2.

6.3 Model Training

We now comment on some issues related to training our model. The performance of LambdaMART depends on a set of tunable parameters. For instance, the researcher has control over the number of regression trees used in the estimation, the number of leaves in a given tree, the type of normalization employed on the features, and so on. It is important to ensure that the tuning parameters lead to a robust model. A non-robust model usually leads to overfitting; that is, it produces a large improvement on the training data (in-sample fit), but smaller improvements on the validation and testing datasets (out-of-sample fits). We take two key steps to ensure that the final model is robust. First, we normalize the features using their z-scores to make the feature scales comparable. Second, we experimented with a large number of specifications for the number of trees and the number of leaves per tree, and found that 1000 trees with 20 leaves per tree was the most robust combination.
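For concreteness, a RankLib training run with this configuration can be launched from Python roughly as below. The flag names follow the RankLib documentation as we understand it (ranker type 6 is LambdaMART); readers should check the help output of their RankLib version, and all file paths here are placeholders.

```python
import subprocess

# Assumed RankLib CLI flags; verify against `java -jar RankLib.jar` for your
# version. Input files use the LETOR-style feature format.
subprocess.run([
    "java", "-jar", "RankLib.jar",
    "-train", "train.txt",          # training data
    "-validate", "validation.txt",  # used for model selection
    "-test", "test.txt",            # hold-out evaluation
    "-ranker", "6",                 # 6 = LambdaMART in RankLib
    "-metric2t", "NDCG@10",         # optimize NDCG at truncation position 10
    "-tree", "1000",                # number of boosted regression trees
    "-leaf", "20",                  # leaves per tree
    "-norm", "zscore",              # z-score feature normalization
    "-save", "model.txt",
], check=True)
```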
7 Results

7.1 Gains from Personalization

After training and validating the model, we test it on the two datasets described earlier – Dataset 1 and Dataset 2. The results are presented in Table 4.

Table 4: Gains from Personalization

Dataset     NDCG Baseline   NDCG with Personalization   % Gain
Dataset 1   0.7937          0.8121                      2.32%
Dataset 2   0.5085          0.5837                      14.78%

First, for Dataset 1, we find that personalization improves the NDCG from a baseline of 0.7937 (with SERPRank as the only feature) to 0.8121, a 2.32% improvement over the baseline. Since 66% of the clicks in this dataset are already on the top-most position, it is hard to interpret the value of personalization from these numbers alone. So we focus on Dataset 2, which consists of queries that received clicks at position 2 or lower, i.e., only those queries where personalization can play a role in improving the search results. Here, the results are promising – compared to the baseline, our personalized ranking algorithm leads to a 14.78% increase in search quality, as measured by NDCG.

These improvements are significant by the standards of the search industry. Recall that the baseline ranking captures a large number of factors, including the relevance of page content, page rank, and the link-structure of the web-pages, which are not available to us. It is therefore notable that we are able to further improve the ranking accuracy by incorporating personalized user/session-specific features. Our overall improvement is higher than that achieved by the winning teams in the Kaggle contest (Kaggle, 2014). This is in spite of the fact that the contestants used a longer dataset for training their models.11 We conjecture that this is due to our use of a comprehensive set of personalized features, in conjunction with the parallelized LambdaMART algorithm, which allows us to exploit the complete scale of the data. The top three teams in the contest each released a technical report discussing their methodology; interested readers should refer to Masurel et al. (2014), Song (2014), and Volkovs (2014) for details on their approaches.

11 The contestants used all the data from days 1–27 for building their models and had their models tested by Yandex on an unmarked dataset (with no information on clicks made by users) from days 28–30. In contrast, we had to use the dataset from days 1–27 for both model building and testing. So the amount of past history we have for each user in the test data is smaller, putting us at a relative disadvantage (especially since we find that longer histories lead to higher NDCGs). Thus, it is noteworthy that our improvement is higher than that achieved by the winning teams.

7.2 Value of Personalization from the Firm's and Consumers' Perspectives

So far, we have presented the results purely in terms of NDCG. However, the NDCG metric is hard to interpret, and it is not obvious how changes in NDCG translate to differences in rank-orderings and click-through rates (CTRs) on a personalized page. So we now present some results on these metrics.

The first question we consider is this – from a consumer's perspective, how different does the SERP look when we deploy the personalized ranking algorithm, compared to what it looks like under the current non-personalized algorithm? Figures 11 and 12 summarize the shifts after personalization for Dataset 1 and Dataset 2, respectively.

Figure 11: Histogram of change in position for the results in Dataset 1.
Figure 12: Histogram of change in position for the results in Dataset 2.

For both datasets, over 55% of the documents are re-ordered, which is a staggering number of re-orderings. (The percentage of results (URLs) with zero change in position is 44.32% for Dataset 1 and 41.85% for Dataset 2.) Approximately 25% of the documents are re-ordered by one position (an increase or a decrease of one position). The remaining 30% are re-ordered by two or more positions; re-orderings of more than five positions are rare. Overall, these results suggest that the rank-ordering of results in the personalized SERP is likely to be quite different from that in the non-personalized SERP. These findings highlight the fact that even small improvements in NDCG can lead to significant changes in rank-orderings, as perceived by consumers.

Figure 13: Expected increase in CTR as a function of position for Dataset 1. CTRs are expressed in percentages; therefore, the changes in CTRs are also expressed in percentages.
Figure 14: Expected increase in CTR as a function of position for Dataset 2. CTRs are expressed in percentages; therefore, the changes in CTRs are also expressed in percentages.
Next, we consider the returns to personalization from the firm's perspective. User satisfaction depends on the search engine's ability to serve relevant results at the top of a SERP. So any algorithm that increases CTRs for top positions is valuable. We examine how position-based CTRs are expected to change with personalization, compared to the current non-personalized ranking. Figures 13 and 14 summarize our findings for Datasets 1 and 2. The y-axis depicts the difference in CTRs (in percentage) after versus before personalization, and the x-axis depicts the position. Since both datasets only include queries with one or more clicks, the total number of clicks before and after personalization is maintained to be the same. (The sum of the changes in CTRs across all positions is zero.) However, after personalization, the documents are re-ordered so that more relevant documents are displayed higher up. Thus, the CTR for the top-most position increases, while those for lower positions decrease. In both datasets, we see a significant increase in the CTR for position 1. For Dataset 1, the CTR for position 1 increases by 3.5%. For Dataset 2, which consists of queries receiving clicks at positions 2 or lower, the increase is even more substantial: the CTR for the first position is now higher by 18%. For Dataset 1, the increase in CTR for position 1 mainly comes from positions 3 and 10, whereas for Dataset 2, it comes from positions 2 and 3. Overall, we find that deploying a personalization algorithm would increase CTRs at higher positions, thereby reducing the need for consumers to scroll further down the SERP for relevant results. This in turn would increase user satisfaction and reduce switching – both valuable outcomes from the search engine's perspective.

7.3 Value of Search History

Next, we examine the returns to using longer search histories for personalization. Even though we have less than 26 days of history on average, the median query (by a user) in the test datasets has 49 prior queries, and the 75th percentile has 100 prior queries.12 We find that this accumulated history pays significant dividends in improving search quality, as shown in Figures 15 and 16.

Figure 15: NDCG improvements over the baseline for different lower-bounds on the number of prior queries by the user, for Dataset 1.
Figure 16: NDCG improvements over the baseline for different lower-bounds on the number of prior queries by the user, for Dataset 2.

When we set the lower bound on past queries to zero, we get the standard 2.32% improvement over the baseline for Dataset 1 and 14.78% for Dataset 2 (Table 4), as shown by the points on the y-axis in Figures 15 and 16.

12 The number of past queries for the median query is a different metric than the one discussed earlier, the median number of queries per user (shown in Figure 4 as 16), for two reasons. First, this number is calculated only for queries that have one or more prior queries (see the data preparation in §6.2). Second, not all users submit the same number of queries. Indeed, a small subset of users accounts for most queries. So the average query has a larger median history than the average user.
When we restrict the test data to queries for which the user has issued at least 50 past queries (close to the median), we find that the NDCG improves by 2.76% for Dataset 1 and 17% for Dataset 2. When we increase the lower bound to 100 queries (the 75th percentile in our data), NDCG improves by 3.03% for Dataset 1 and 18.51% for Dataset 2, both of which are quite significant when compared to the baseline improvements for the entire datasets.

These findings can be summarized into two main patterns. First, search history is more valuable when we focus on searches where the user has to go further down the SERP to find relevant results. Second, NDCG increases monotonically, albeit concavely, with the length of search history. These findings re-affirm the value of storing and using personalized browsing histories to improve users' search experience. Indeed, firms are currently struggling with the decision of how much history to store, especially since storing these massive personal datasets can be extremely costly. By quantifying the improvement in search results for each additional piece of past history, we provide firms with tools to decide whether the costs associated with storing longer histories are offset by the improvements in consumer experience.

7.4 Comparison of Features

We now examine the value of different types of features in improving search results using a comparative analysis. To perform this analysis, we start with the full model, remove a specific feature set (as described in §4.2), and evaluate the loss in NDCG prediction accuracy as a consequence of the removal. The percentage loss is defined as:

\% Loss = \frac{NDCG_{full} - NDCG_{current}}{NDCG_{full} - NDCG_{base}} \times 100    (23)

Note that all the models always include SERPRank as a feature. Table 5 provides the results from this experiment.

Table 5: Loss in prediction accuracy when different feature sets are removed from the full model.

Model                        NDCG     % Loss
Full set                     0.8121   –
− Global features (FG)       0.8060   33.15
− User features (FP)         0.8023   53.26
− Session features (FS)      0.8114   3.80
− URL features (FU)          0.8101   10.87
− Domain features (FD)       0.8098   12.50
− Query features (FQ)        0.8106   8.15
− Term features (FT)         0.8119   1.09
− Click features (FC)        0.8068   28.80
− Position features (FR)     0.8116   2.72
− Skip features (FK)         0.8118   1.63
− Posn-click features (FN)   0.8120   0.54
− Dwell features (FW)        0.8112   4.89
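Before discussing the results, note that each "% Loss" entry in Table 5 follows mechanically from Equation (23); a minimal check in Python:

```python
def pct_loss(ndcg_current, ndcg_full=0.8121, ndcg_base=0.7937):
    """Equation (23): share of the full model's NDCG improvement that is
    lost when a feature set is removed (baseline = SERPRank-only model)."""
    return (ndcg_full - ndcg_current) / (ndcg_full - ndcg_base) * 100

print(round(pct_loss(0.8060), 2))  # ~33.15, the "- Global features" row
```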
We observe that by eliminating the Global features, we lose 33.15% of the improvement gained from using the full feature set. This suggests that interaction effects between global and personalized features play an important role in improving the accuracy of our algorithm. Next, we find that eliminating user-level features results in a loss of a whopping 53.26%. Thus, over 50% of the improvement we provide comes from using personalized user history. In contrast, eliminating session-level features, while retaining across-session user-level features, results in a loss of only 3.80%. This implies that across-session features are more valuable than within-session features, and that user behavior is stable over longer time periods. Thus, recommendation engines can forgo real-time session-level data (which is costly to incorporate in terms of computing time and deployment) without much loss in prediction accuracy.

We also observe that eliminating domain-name-based features results in a significant loss in prediction accuracy, as does the elimination of URL-level features. This is because users prefer search results from domains and URLs that they trust. Search engines can thus track user preferences over domains and use this information to optimize search rankings. Query-specific features can also be quite informative (an 8.15% loss in accuracy), possibly because they capture differences in user behavior across queries. We also find that features based on users' past clicking and dwelling behavior can be informative, with click-specific features being the most valuable (a 28.8% loss in accuracy). This is likely because users tend to exhibit persistence in how they click on links – both across positions (some users may only click on the first two positions, while others may click on multiple positions) and across URLs and domains. (For a given query, some users may have persistent tastes for certain URLs; e.g., some users might be using a query for navigational purposes, while others might be using it to get information/news.) History on past clicking behavior can inform us of these proclivities and help us improve the results shown to different users.

In sum, these findings give us insight into the relative value of different types of features. They can help businesses decide which types of information to incorporate into their search algorithms and which types of information to store for longer periods of time.

7.5 Comparison with Pairwise Maximum Likelihood

We now compare the effectiveness of machine learning approaches to this problem with the standard technique used in the marketing literature – Maximum Likelihood. We estimate a pairwise Maximum Likelihood model with a linear combination of the set of 293 features used previously. Note that this is similar to a RankNet model where the functional form f(·) is specified as a linear combination of the features, and we simply infer the parameter vector w. Thus, it is a pairwise optimization that does not use any machine learning algorithms.

Table 6: Comparison with Pairwise Maximum Likelihood

                            Pairwise Maximum Likelihood       LambdaMART
                            Single feature   Full features    Single feature   Full features
NDCG on training data       0.7930           0.7932           0.7930           0.8141
NDCG on validation data     0.7933           0.7932           0.7933           0.8117
NDCG on test data           0.7937           0.7936           0.7937           0.8121

We present the results of this analysis in Table 6. For both Maximum Likelihood and LambdaMART, we report results for a single-feature model (with just SERPRank) and a model with all the features. In both methods, the single-feature model produces the baseline NDCG of 0.7937 on the test data, which shows that both methods perform as expected when used with no additional features. Also as expected, in the case of the full-feature models, both methods show improvement on the training data. However, the improvement is very marginal in the case of pairwise Maximum Likelihood, and it disappears completely when we take the model to the test data. While LambdaMART also suffers some loss of accuracy when moving from training to test data, much of its improvement is sustained in the test data and remains quite significant. Together, these results suggest that Maximum Likelihood is not an appropriate technique for large-scale predictive models.

The lackluster performance of the Maximum Likelihood model stems from two issues. First, because it is a pairwise, gradient-based method, it does not maximize NDCG. Second, it does not allow for the fully flexible functional form and the large number of non-linear interactions that machine learning techniques such as MART do.
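As a sketch of this benchmark, under our reading of the setup (a linear pairwise logistic model), the code below fits w by maximum likelihood on document pairs using scipy; the data are synthetic placeholders, not the Yandex features.

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic document pairs: each row is the feature difference (x_m - x_n)
# for a pair, with prefer_m = True when m was the preferred document.
rng = np.random.default_rng(1)
dX = rng.normal(size=(1000, 5))
true_w = np.array([1.0, -0.5, 0.0, 2.0, 0.3])
prefer_m = rng.random(1000) < 1 / (1 + np.exp(-dX @ true_w))

def neg_log_lik(w):
    """Pairwise logistic negative log-likelihood (Equations 11-13)
    with the linear specification f(x) = w @ x."""
    z = dX @ w
    # -log P(m > n) = log(1 + exp(-z));  -log P(n > m) = log(1 + exp(z))
    return np.sum(np.logaddexp(0.0, -z)[prefer_m]) + \
           np.sum(np.logaddexp(0.0, z)[~prefer_m])

w_hat = minimize(neg_log_lik, x0=np.zeros(5), method="BFGS").x
print(np.round(w_hat, 2))  # should land near true_w, up to sampling noise
```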
7.6 Scalability

We now examine the scalability of our approach to large datasets. There are three computing- and memory-intensive stages of the analysis – feature generation, training, and testing. In general, feature generation is more memory-intensive (requiring at least 140GB of memory for the full dataset), and hence it is not parallelized. The training module is more computing-intensive. The RankLib package that we use for training and testing is parallelized and can run on multi-core servers, which makes it scalable. All three stages were run on Amazon EC2 nodes, since their computational load and memory requirements cannot be handled by regular PCs. The compute server used was an Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz with 32 cores and 256GB of memory – the fastest server with the highest memory available at Amazon EC2.

We now discuss the scalability of each module – feature generation, training, and testing – one by one, using Figures 17, 18, and 19. The runtimes are presented as the total CPU time summed over all 32 cores. To construct the 20% dataset, we continue to use all the users and sessions from days 1–25 to generate the global features, but use only 20% of the users from days 26–27 to generate the user- and session-specific features.

Figure 17: Runtimes for generating the features on different percentages of the data.
Figure 18: Runtimes for training the model on different percentages of the data.
Figure 19: Runtimes for testing the model on different percentages of the data.
Figure 20: NDCG improvements from using different percentages of the data.

For the full dataset, the feature-generation module takes about 83 minutes on a single-core computer (see Figure 17), the training module takes over 11 days (see Figure 18), and the testing module takes only 35 minutes (see Figure 19). These numbers suggest that training is the primary bottleneck in implementing the algorithm, especially if the search engine wants to update the trained model on a regular basis. So we parallelize the training module over a 32-core server, which reduces the training time to about 8.7 hours. Overall, with sufficient parallelization, we find that the framework can be scaled to large datasets in reasonable time-frames.

Next, we compare the computing cost of going from 20% of the data to the full dataset. We find that the cost increases by a factor of 2.38 for feature generation, 20.8 for training, and 11.6 for testing. Thus, training and testing costs increase super-linearly, whereas feature-generation cost increases sub-linearly. These increases in computing costs, however, bring a significant improvement in model performance, as measured by NDCG (see Figure 20).
Overall, these measurements affirm the value of using large-scale datasets for personalized recommendations, especially since the cost of increasing the size of the data is paid mostly at the training stage, which is easily parallelizable.

8 Feature Selection

We now discuss the third and final step of our empirical approach. As we saw in the previous section, even when parallelized over 32-core servers, personalization algorithms can be very computing-intensive. Therefore, we now examine feature selection, where we choose a subset of features that retains most of the algorithm's predictive accuracy while substantially reducing computing costs. We start by describing the background literature and frameworks for feature selection, then describe its application to our context, followed by the results.

Feature selection is a standard step in machine learning settings that involve supervised learning, for two reasons (Guyon and Elisseeff, 2003). First, it can provide a faster, more cost-effective, and more efficient model by eliminating less relevant features with minimal loss in accuracy. Efficiency is particularly relevant when training on large datasets such as ours. Second, it provides more comprehensible models, which offer a better understanding of the underlying data-generating process.13

The goal of feature selection is to find the smallest set of features that can provide a fixed level of predictive accuracy. In principle, this is a straightforward problem, since it simply involves an exhaustive search of the feature space. In practice, however, even with a moderately large number of features, this is impossible: with F features, an exhaustive search requires 2^F runs of the algorithm on the training dataset, which is exponentially increasing in F. In fact, this problem is known to be NP-hard (Amaldi and Kann, 1998).

8.1 Wrapper Approach

The Wrapper method addresses this problem by using a greedy algorithm. We describe it briefly here and refer interested readers to Kohavi and John (1997) for a detailed review. Wrappers can be categorized into two types – forward selection and backward elimination. In forward selection, features are progressively added until a desired prediction accuracy is reached or until the incremental improvement becomes very small. In contrast, a backward-elimination Wrapper starts with all the features and sequentially eliminates the least valuable ones. Both of these Wrappers are greedy in the sense that they do not revisit former decisions to include (in forward selection) or exclude (in backward elimination) features. In our implementation, we employ a forward-selection Wrapper.

13 In cases where the dataset is small and the number of features large, feature selection can actually improve the predictive accuracy of the model by eliminating irrelevant features whose inclusion often results in overfitting. Many machine learning algorithms, including neural networks, decision trees, CART, and Naive Bayes learners, have been shown to have significantly worse accuracy when trained on small datasets with superfluous features (Duda and Hart, 1973; Aha et al., 1991; Breiman et al., 1984; Quinlan, 1993). In our setting, we have an extremely large dataset, which prevents the algorithm from making spurious inferences in the first place. So feature selection is not accuracy-improving in our case; however, it leads to significantly lower computing times.

To enable a Wrapper algorithm, we also need to specify a selection rule as well as a stopping rule.
We use the robust best-first selection rule (Ginsberg, 1993), wherein the most promising node is selected at every decision point. For example, in a forward-selection algorithm with 10 features, at the first node this rule considers 10 versions of the model (each with one of the features added) and then picks the feature whose addition offers the highest prediction accuracy. Further, we use the following, quite conservative, stopping rule. Let NDCG_{base} be the NDCG from the base model currently used by Yandex, which uses only SERPRank. Similarly, let NDCG_{current} be the improved NDCG at the current iteration, and NDCG_{full} the NDCG obtained from a model with all 293 features. Then our stopping rule is:

\frac{NDCG_{current} - NDCG_{base}}{NDCG_{full} - NDCG_{base}} > 0.95    (24)

According to this stopping rule, the Wrapper algorithm stops adding features when it achieves 95% of the improvement offered by the full model.

Wrappers offer many advantages. First, they are agnostic to both the underlying learning algorithm and the accuracy metric used for evaluating the predictor. Second, greedy Wrappers have been shown to be computationally advantageous and robust against overfitting (Reunanen, 2003).
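The following sketch of ours outlines a forward-selection Wrapper with the best-first rule and the stopping rule in Equation (24). The callable `evaluate` is a placeholder standing in for training LambdaMART on a candidate feature set (plus SERPRank) and returning its validation NDCG.

```python
def forward_wrapper(feature_classes, evaluate, ndcg_base, ndcg_full, target=0.95):
    """Greedy forward selection over feature classes (best-first rule).
    `evaluate(selected)` is a placeholder that trains the ranker on the
    selected feature classes plus SERPRank and returns its NDCG."""
    selected, ndcg_current = [], ndcg_base
    remaining = list(feature_classes)
    while remaining:
        # Best-first: try adding each remaining class; keep the best one.
        scores = {c: evaluate(selected + [c]) for c in remaining}
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
        ndcg_current = scores[best]
        # Stopping rule (Equation 24): stop at 95% of the full model's gain.
        if (ndcg_current - ndcg_base) / (ndcg_full - ndcg_base) > target:
            break
    return selected, ndcg_current
```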
8.2 Feature Classification for Wrapper

Wrappers are greedy algorithms, and this greediness offers computational advantages: they reduce the number of runs of the training algorithm from 2^F to F(F+1)/2. While this is significantly less than 2^F, it is still prohibitively expensive for large feature sets. In our case, it would amount to 42,778 runs of the training algorithm, which is not feasible. Hence, we use a coarse Wrapper that runs on sets or classes of features, as opposed to individual features. We use domain-specific knowledge to identify groups of features and perform the search process on these groups, which limits the number of iterations required for the search process to a reasonable number.

Recall that we have 293 features. We classify these into 39 classes. First, we explain how we classify the first 270 features listed in Table A1 in the Appendix. These features encompass the following group of eight features – Listings, AvgPosn, CTR, HDCTR, WCTR, PosnCTR, ADT, and SkipRate. We classify them into 36 classes, as shown in Figure 21. At the lowest denomination, this set of features can be obtained at the URL or domain level.14 This gives us two sets of features. At the next level, these features can be aggregated at the query level, at the term level (term1, term2, term3, or term4), or over all queries and terms (referred to as Null). This gives us six possible options. At a higher level, they can be further aggregated at the session level, over users (across all their sessions), or over all users and sessions (referred to as Global). Thus, we have 2 × 6 × 3 = 36 classes of features.

Figure 21: Classes of features used in feature selection. The eight URL-level features (Listings, AvgPosn, CTR, HDCTR, WCTR, PosnCTR, ADT, SkipRate) and the seven domain-level features can each be aggregated at the query, term1–term4, or Null level, and at the Global, User, or User-session level: 3 × 6 × 2 = 36 classes.

14 Note that SkipRate is not calculated at the domain level; hence, at the domain level, this set consists of seven features.

Next, we classify features 271–278, which consist of Listings, CTR, HDCTR, and ADT at the user and session levels, aggregated over all URLs and domains, into the 37th class. The 38th class consists of user-specific click probabilities for all ten positions, aggregated over the user history (features 279–288 in Table A1 in the Appendix). Finally, the remaining features, 289–292, form the 39th class. Note that SERPRank is not classified under any class, since it is included in all models by default.

8.3 Results: Feature Selection

We use a search process that starts with a minimal singleton feature set comprising just SERPRank, which produces the baseline NDCG of 0.7937. We then sequentially explore the feature classes using the best-first selection rule. Table 7 shows the results of this exploration process, along with the class of features added in each iteration. We stop at iteration six (by which time we have run the model 219 times), when the improvement in NDCG satisfies the condition laid out in Equation (24). The improvement in NDCG at each iteration is also shown in Figure 22, along with the maximum NDCG obtainable from a model that includes all the features. Though the stopping rule is satisfied at iteration 6, we show results till iteration 11 in Figure 22 to demonstrate that the improvement in NDCG after iteration 6 is minimal.

Table 7: Set of features added at each iteration of the Wrapper algorithm, based on Figure 21. The corresponding NDCGs are also shown.

Iteration no.   No. of features   Feature class added at iteration   NDCG
0               1                 SERPRank                           0.7937
1               9                 User–Query–URL                     0.8033
2               17                Global–Query–URL                   0.8070
3               24                User–Null–Domain                   0.8094
4               32                Global–Null–URL                    0.8104
5               40                User-session–Null–URL              0.8108
6               47                Global–Null–Domain                 0.8112

Figure 22: Improvement in NDCG at each iteration of the Wrapper algorithm. The tick marks on the red line refer to the iteration number; the blue line depicts the maximum NDCG obtained with all the features included.

Note that at iteration six, the pruned model with 47 features produces an NDCG of 0.8112, which equals 95% of the improvement offered by the full model. This small loss in accuracy brings a huge benefit in scalability, as shown in Figures 23 and 24. When using the entire dataset, the training time for the pruned model is ≈ 15% of the time it takes to train the full model. Moreover, the pruned model scales to larger datasets much better than the full model. For instance, the difference in training times between the full and pruned models is small when training on 20% of the dataset; however, as the data size grows, the training time of the full model increases much more steeply than that of the pruned model. While training time is the main time sink in this application, training is done only once. In contrast, testing runtimes have greater consequences for the model's applicability, since testing/prediction has to be done in real time in most applications. As with training times, we find that the pruned model runs significantly faster on the full dataset (≈ 50% of the full model's testing time) and also scales better with larger datasets. Together, these findings suggest that feature selection is a valuable and important step in reducing runtimes and making personalization implementable in real-time applications when we have big data.

Figure 23: Runtimes for training the full and pruned models on different percentages of the data.
Figure 24: Runtimes for testing the full and pruned models on different percentages of the data.

9 Conclusion

Personalization of digital services remains the holy grail of marketing. In this paper, we study an important personalization problem – that of search rankings in the context of online search or information retrieval. Query-based web search is now ubiquitous and is used by a large number of consumers for a variety of reasons. Personalization of search results can therefore have significant implications for consumers, businesses, governments, and the search engines themselves.

We present a three-pronged machine learning framework that improves search results through automated personalization using gradient-based ranking. We apply our framework to data from the premier Eastern European search engine, Yandex, and provide evidence of substantial improvements in search quality from personalization. In contrast, standard maximum-likelihood-based procedures provide zero improvement, even when supplied with the features generated by our framework. We quantify the value of the length of user history, the relative effectiveness of short-term and long-term context, and the impact of other factors on the returns to personalization. Finally, we show that our framework can perform efficiently at scale, making it well suited for real-time deployment.

Our paper makes four key contributions to the marketing literature. First, it presents a general machine learning framework that marketers can use to rank recommendations using personalized data in many different settings. Second, it presents empirical evidence in support of the returns to personalization in the online search context. Third, it demonstrates how big data can be leveraged to improve marketing outcomes without compromising the speed or real-time performance of digital applications. Finally, it provides insights into how machine learning methods can be adapted to solve important marketing problems that have so far been technically unsolvable (using traditional econometric or analytical models). We hope our work will spur more research on the application of machine learning methods, not only to personalization-related issues, but also to a broad array of marketing issues.

A Appendix

Table A1: List of features.
Feature No.   Feature Name                                        Feature Classes
1             Listings(query, url, −, −, −)                       FG, FQ, FU, FC
2             CTR(query, url, −, −, −)                            FG, FQ, FU, FC
3             ADT(query, url, −, −, −)                            FG, FQ, FU, FW
4             AvgPosn(query, url, −, −, −)                        FG, FQ, FU, FR
5             HDCTR(query, url, −, −, −)                          FG, FQ, FU, FC, FW
6             WCTR(query, url, −, −, −)                           FG, FQ, FU, FC, FN
7             PosnCTR(query, url, −, −, −)                        FG, FQ, FU, FC, FR
8             SkipRate(query, url, −, −, −)                       FG, FQ, FU, FK
9             Listings(query, domain, −, −, −)                    FG, FQ, FD, FC
10            CTR(query, domain, −, −, −)                         FG, FQ, FD, FC
11            ADT(query, domain, −, −, −)                         FG, FQ, FD, FW
12            AvgPosn(query, domain, −, −, −)                     FG, FQ, FD, FR
13            HDCTR(query, domain, −, −, −)                       FG, FQ, FD, FC, FW
14            WCTR(query, domain, −, −, −)                        FG, FQ, FD, FC, FN
15            PosnCTR(query, domain, −, −, −)                     FG, FQ, FD, FC, FR
16-19         Listings(term 1..4, url, −, −, −)                   FG, FT, FU, FC
20-23         CTR(term 1..4, url, −, −, −)                        FG, FT, FU, FC
24-27         ADT(term 1..4, url, −, −, −)                        FG, FT, FU, FW
28-31         AvgPosn(term 1..4, url, −, −, −)                    FG, FT, FU, FR
32-35         HDCTR(term 1..4, url, −, −, −)                      FG, FT, FU, FC, FW
36-39         WCTR(term 1..4, url, −, −, −)                       FG, FT, FU, FC, FN
40-43         PosnCTR(term 1..4, url, −, −, −)                    FG, FT, FU, FC, FR
44-47         SkipRate(term 1..4, url, −, −, −)                   FG, FT, FU, FK
48-51         Listings(term 1..4, domain, −, −, −)                FG, FT, FD, FC
52-55         CTR(term 1..4, domain, −, −, −)                     FG, FT, FD, FC
56-59         ADT(term 1..4, domain, −, −, −)                     FG, FT, FD, FW
60-63         AvgPosn(term 1..4, domain, −, −, −)                 FG, FT, FD, FR
64-67         HDCTR(term 1..4, domain, −, −, −)                   FG, FT, FD, FC, FW
68-71         WCTR(term 1..4, domain, −, −, −)                    FG, FT, FD, FC, FN
72-75         PosnCTR(term 1..4, domain, −, −, −)                 FG, FT, FD, FC, FR
76            Listings(query, url, usr, −, time)                  FP, FQ, FU, FC
77            CTR(query, url, usr, −, time)                       FP, FQ, FU, FC
78            ADT(query, url, usr, −, time)                       FP, FQ, FU, FW
79            AvgPosn(query, url, usr, −, time)                   FP, FQ, FU, FR
80            HDCTR(query, url, usr, −, time)                     FP, FQ, FU, FC, FW
81            WCTR(query, url, usr, −, time)                      FP, FQ, FU, FC, FN
82            PosnCTR(query, url, usr, −, time)                   FP, FQ, FU, FC, FR
83            SkipRate(query, url, usr, −, time)                  FP, FQ, FU, FK
84            Listings(query, domain, usr, −, time)               FP, FQ, FD, FC
85            CTR(query, domain, usr, −, time)                    FP, FQ, FD, FC
86            ADT(query, domain, usr, −, time)                    FP, FQ, FD, FW
87            AvgPosn(query, domain, usr, −, time)                FP, FQ, FD, FR
88            HDCTR(query, domain, usr, −, time)                  FP, FQ, FD, FC, FW
89            WCTR(query, domain, usr, −, time)                   FP, FQ, FD, FC, FN
90            PosnCTR(query, domain, usr, −, time)                FP, FQ, FD, FC, FR
91-94         Listings(term 1..4, url, usr, −, time)              FP, FT, FU, FC
95-98         CTR(term 1..4, url, usr, −, time)                   FP, FT, FU, FC
99-102        ADT(term 1..4, url, usr, −, time)                   FP, FT, FU, FW
103-106       AvgPosn(term 1..4, url, usr, −, time)               FP, FT, FU, FR
107-110       HDCTR(term 1..4, url, usr, −, time)                 FP, FT, FU, FC, FW
111-114       WCTR(term 1..4, url, usr, −, time)                  FP, FT, FU, FC, FN
115-118       PosnCTR(term 1..4, url, usr, −, time)               FP, FT, FU, FC, FR
119-122       SkipRate(term 1..4, url, usr, −, time)              FP, FT, FU, FK
123-126       Listings(term 1..4, domain, usr, −, time)           FP, FT, FD, FC
127-130       CTR(term 1..4, domain, usr, −, time)                FP, FT, FD, FC
131-134       ADT(term 1..4, domain, usr, −, time)                FP, FT, FD, FW
135-138       AvgPosn(term 1..4, domain, usr, −, time)            FP, FT, FD, FR
139-142       HDCTR(term 1..4, domain, usr, −, time)              FP, FT, FD, FC, FW
143-146       WCTR(term 1..4, domain, usr, −, time)               FP, FT, FD, FC, FN
147-150       PosnCTR(term 1..4, domain, usr, −, time)            FP, FT, FD, FC, FR
151           Listings(query, url, usr, session, time)            FS, FQ, FU, FC
152           CTR(query, url, usr, session, time)                 FS, FQ, FU, FC
153           ADT(query, url, usr, session, time)                 FS, FQ, FU, FW
154           AvgPosn(query, url, usr, session, time)             FS, FQ, FU, FR
155           HDCTR(query, url, usr, session, time)               FS, FQ, FU, FC, FW
156           WCTR(query, url, usr, session, time)                FS, FQ, FU, FC, FN
157           PosnCTR(query, url, usr, session, time)             FS, FQ, FU, FC, FR
158           SkipRate(query, url, usr, session, time)            FS, FQ, FU, FK
159           Listings(query, domain, usr, session, time)         FS, FQ, FD, FC
160           CTR(query, domain, usr, session, time)              FS, FQ, FD, FC
161           ADT(query, domain, usr, session, time)              FS, FQ, FD, FW
162           AvgPosn(query, domain, usr, session, time)          FS, FQ, FD, FR
163           HDCTR(query, domain, usr, session, time)            FS, FQ, FD, FC, FW
164           WCTR(query, domain, usr, session, time)             FS, FQ, FD, FC, FN
165           PosnCTR(query, domain, usr, session, time)          FS, FQ, FD, FC, FR
166-169       Listings(term 1..4, url, usr, session, time)        FS, FT, FU, FC
170-173       CTR(term 1..4, url, usr, session, time)             FS, FT, FU, FC
174-177       ADT(term 1..4, url, usr, session, time)             FS, FT, FU, FW
178-181       AvgPosn(term 1..4, url, usr, session, time)         FS, FT, FU, FR
182-185       HDCTR(term 1..4, url, usr, session, time)           FS, FT, FU, FC, FW
186-189       WCTR(term 1..4, url, usr, session, time)            FS, FT, FU, FC, FN
190-193       PosnCTR(term 1..4, url, usr, session, time)         FS, FT, FU, FC, FR
194-197       SkipRate(term 1..4, url, usr, session, time)        FS, FT, FU, FK
198-201       Listings(term 1..4, domain, usr, session, time)     FS, FT, FD, FC
202-205       CTR(term 1..4, domain, usr, session, time)          FS, FT, FD, FC
206-209       ADT(term 1..4, domain, usr, session, time)          FS, FT, FD, FW
210-213       AvgPosn(term 1..4, domain, usr, session, time)      FS, FT, FD, FR
214-217       HDCTR(term 1..4, domain, usr, session, time)        FS, FT, FD, FC, FW
218-221       WCTR(term 1..4, domain, usr, session, time)         FS, FT, FD, FC, FN
222-225       PosnCTR(term 1..4, domain, usr, session, time)      FS, FT, FD, FC, FR
226           Listings(−, url, −, −, −)                           FG, FU, FC
227           CTR(−, url, −, −, −)                                FG, FU, FC
228           ADT(−, url, −, −, −)                                FG, FU, FW
229           AvgPosn(−, url, −, −, −)                            FG, FU, FR
230           HDCTR(−, url, −, −, −)                              FG, FU, FC, FW
231           WCTR(−, url, −, −, −)                               FG, FU, FC, FN
232           PosnCTR(−, url, −, −, −)                            FG, FU, FC, FR
233           SkipRate(−, url, −, −, −)                           FG, FU, FK
234 | Listings(−, domain, −, −, −) | FG, FD, FC
235 | CTR(−, domain, −, −, −) | FG, FD, FC
236 | ADT(−, domain, −, −, −) | FG, FD, FW
237 | AvgPosn(−, domain, −, −, −) | FG, FD, FR
238 | HDCTR(−, domain, −, −, −) | FG, FD, FC, FW
239 | WCTR(−, domain, −, −, −) | FG, FD, FC, FN
240 | PosnCTR(−, domain, −, −, −) | FG, FD, FC, FR
241 | Listings(−, url, usr, −, time) | FP, FU, FC
242 | CTR(−, url, usr, −, time) | FP, FU, FC
243 | ADT(−, url, usr, −, time) | FP, FU, FW
244 | AvgPosn(−, url, usr, −, time) | FP, FU, FR
245 | HDCTR(−, url, usr, −, time) | FP, FU, FC, FW
246 | WCTR(−, url, usr, −, time) | FP, FU, FC, FN
247 | PosnCTR(−, url, usr, −, time) | FP, FU, FC, FR
248 | SkipRate(−, url, usr, −, time) | FP, FU, FK
249 | Listings(−, domain, usr, −, time) | FP, FD, FC
250 | CTR(−, domain, usr, −, time) | FP, FD, FC
251 | ADT(−, domain, usr, −, time) | FP, FD, FW
252 | AvgPosn(−, domain, usr, −, time) | FP, FD, FR
253 | HDCTR(−, domain, usr, −, time) | FP, FD, FC, FW
254 | WCTR(−, domain, usr, −, time) | FP, FD, FC, FN
255 | PosnCTR(−, domain, usr, −, time) | FP, FD, FC, FR
256 | Listings(−, url, usr, session, time) | FS, FU, FC
257 | CTR(−, url, usr, session, time) | FS, FU, FC
258 | ADT(−, url, usr, session, time) | FS, FU, FW
259 | AvgPosn(−, url, usr, session, time) | FS, FU, FR
260 | HDCTR(−, url, usr, session, time) | FS, FU, FC, FW
261 | WCTR(−, url, usr, session, time) | FS, FU, FC, FN
262 | PosnCTR(−, url, usr, session, time) | FS, FU, FC, FR
263 | SkipRate(−, url, usr, session, time) | FS, FU, FK
264 | Listings(−, domain, usr, session, time) | FS, FD, FC
265 | CTR(−, domain, usr, session, time) | FS, FD, FC
266 | ADT(−, domain, usr, session, time) | FS, FD, FW
267 | AvgPosn(−, domain, usr, session, time) | FS, FD, FR
268 | HDCTR(−, domain, usr, session, time) | FS, FD, FC, FW
269 | WCTR(−, domain, usr, session, time) | FS, FD, FC, FN
270 | PosnCTR(−, domain, usr, session, time) | FS, FD, FC, FR
271 | Listings(−, −, usr, −, time) | FP, FC
272 | CTR(−, −, usr, −, time) | FP, FC
273 | HDCTR(−, −, usr, −, time) | FP, FC
274 | ADT(−, −, usr, −, time) | FP, FW
275 | Listings(−, −, usr, session, time) | FS, FC
276 | CTR(−, −, usr, session, time) | FS, FC
277 | HDCTR(−, −, usr, session, time) | FS, FC
278 | ADT(−, −, usr, session, time) | FS, FW
279-288 | ClickProb(usr, posn, time) | FP, FC
289 | NumSearchResults(query) | FG, FQ, FC
290 | NumClickedURLs(query) | FG, FQ, FC
291 | Day(usr, session, time) | FS
292 | QueryTerms(query) | FQ
293 | SERPRank(url, usr, session, search) | FS, FU
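To make the aggregation scheme in Table A1 concrete, the sketch below computes two representative features from a toy impression log: the global CTR(query, url, −, −, −) and the user-level CTR(query, url, usr, −, time). It is only an illustration under assumed definitions (CTR as clicks divided by listings, per the main text); the record layout, field names, and data are hypothetical, and the timestamp filtering implied by the "time" argument (restricting each feature to activity before the focal search) is omitted for brevity.

```python
from collections import defaultdict

# Hypothetical log records: one (query, url) impression per row, with
# whether the result was clicked and the dwell time on the click.
LOG = [
    # (user, session, query, url, position, clicked, dwell_secs)
    ("u1", "s1", "best crib", "babycenter.com/cribs", 3, True, 45.0),
    ("u1", "s2", "best crib", "babycenter.com/cribs", 1, False, 0.0),
    ("u2", "s9", "best crib", "babycenter.com/cribs", 2, True, 120.0),
]

def ctr(records):
    """CTR = clicks / listings over the given slice of the history."""
    listings = len(records)
    clicks = sum(r[5] for r in records)
    return clicks / listings if listings else 0.0

def feature_slices(log):
    """Group impressions at two of Table A1's aggregation levels."""
    global_qu = defaultdict(list)    # CTR(query, url, -, -, -): all users
    personal_qu = defaultdict(list)  # CTR(query, url, usr, -, time): one user
    for rec in log:
        user, _, query, url, *_ = rec
        global_qu[(query, url)].append(rec)
        personal_qu[(user, query, url)].append(rec)
    return global_qu, personal_qu

g, p = feature_slices(LOG)
print(ctr(g[("best crib", "babycenter.com/cribs")]))        # global CTR = 2/3
print(ctr(p[("u1", "best crib", "babycenter.com/cribs")]))  # u1's CTR = 1/2
```

The remaining features in the table follow the same pattern, varying only the grouping key (full query vs. individual terms, URL vs. domain, global vs. user vs. session history) and the statistic computed over each slice (e.g., ADT, AvgPosn, SkipRate).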
References

D. W. Aha, D. Kibler, and M. K. Albert. Instance-based Learning Algorithms. Machine Learning, 6(1):37–66, 1991.
E. Amaldi and V. Kann. On the Approximability of Minimizing Nonzero Variables or Unsatisfied Relations in Linear Systems. Theoretical Computer Science, 209(1):237–260, 1998.
A. Ansari and C. F. Mela. E-customization. Journal of Marketing Research, pages 131–145, 2003.
N. Arora and T. Henderson. Embedded Premium Promotion: Why it Works and How to Make it More Effective. Marketing Science, 26(4):514–531, 2007.
A. Atkins-Krüger. Yandex Launches Personalized Search Results For Eastern Europe, December 2012. URL http://searchengineland.com/yandex-launches-personalized-search-results-for-eastern-europe-142186.
S. Banerjee, S. Chakrabarti, and G. Ramakrishnan. Learning to Rank for Quantity Consensus Queries. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 243–250. ACM, 2009.
P. N. Bennett, R. W. White, W. Chu, S. T. Dumais, P. Bailey, F. Borisyuk, and X. Cui. Modeling the Impact of Short- and Long-term Behavior on Search Personalization. In Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 185–194, 2012.
L. Breiman, J. Friedman, C. Stone, and R. Olshen. Classification and Regression Trees. The Wadsworth and Brooks-Cole Statistics-Probability Series. Taylor & Francis, 1984.
C. Burges. From RankNet to LambdaRank to LambdaMART: An Overview. Technical report, Microsoft Research Technical Report MSR-TR-2010-82, 2010.
C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender. Learning to Rank Using Gradient Descent. In Proceedings of the 22nd International Conference on Machine Learning, pages 89–96. ACM, 2005.
C. J. Burges, R. Ragno, and Q. V. Le. Learning to Rank with Nonsmooth Cost Functions. In Advances in Neural Information Processing Systems, pages 193–200, 2006.
R. Caruana and A. Niculescu-Mizil. An Empirical Comparison of Supervised Learning Algorithms. In Proceedings of the 23rd International Conference on Machine Learning, pages 161–168. ACM, 2006.
O. Chapelle and Y. Chang. Yahoo! Learning to Rank Challenge Overview. Journal of Machine Learning Research - Proceedings Track, 14:1–24, 2011.
M. Ciaramita, V. Murdock, and V. Plachouras. Online Learning from Click Data for Sponsored Search. In Proceedings of the 17th International Conference on World Wide Web, pages 227–236. ACM, 2008.
N. Clayton. Yandex Overtakes Bing as World's Fourth Search Engine, 2013. URL http://blogs.wsj.com/tech-europe/2013/02/11/yandex-overtakes-bing-as-worlds-fourth-search-engine/.
comScore. December 2013 U.S. Search Engine Rankings, 2014. URL http://investor.google.com/financial/tables.html.
N. Craswell, O. Zoeter, M. Taylor, and B. Ramsey. An Experimental Comparison of Click Position-bias Models. In Proceedings of the 2008 International Conference on Web Search and Data Mining, pages 87–94, 2008.
V. Dang. RankLib. Online, 2011. URL http://www.cs.umass.edu/~vdang/ranklib.html.
P. Donmez, K. M. Svore, and C. J. Burges. On the Local Optimality of LambdaRank. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 460–467. ACM, 2009.
R. O. Duda and P. E. Hart. Pattern Recognition and Scene Analysis, 1973.
D. Dzyabura and J. R. Hauser. Active Machine Learning for Consideration Heuristics. Marketing Science, 30(5):801–819, 2011.
C. Eickhoff, K. Collins-Thompson, P. N. Bennett, and S. Dumais. Personalizing Atypical Web Search Sessions. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, pages 285–294, 2013.
T. Evgeniou, C. Boussios, and G. Zacharia. Generalized Robust Conjoint Estimation. Marketing Science, 24(3):415–429, 2005.
T. Evgeniou, M. Pontil, and O. Toubia. A Convex Optimization Approach to Modeling Consumer Heterogeneity in Conjoint Estimation. Marketing Science, 26(6):805–818, 2007.
M. Feuz, M. Fuller, and F. Stalder. Personal Web Searching in the Age of Semantic Capitalism: Diagnosing the Mechanisms of Personalisation. First Monday, 16(2), 2011.
J. Friedman, T. Hastie, R. Tibshirani, et al. Additive Logistic Regression: A Statistical View of Boosting (with Discussion and a Rejoinder by the Authors). The Annals of Statistics, 28(2):337–407, 2000.
E. Gabbert. Keywords vs. Search Queries: What's the Difference?, 2011. URL http://www.wordstream.com/blog/ws/2011/05/25/keywords-vs-search-queries.
A. Ghose, P. G. Ipeirotis, and B. Li. Examining the Impact of Ranking on Consumer Behavior and Search Engine Revenue. Management Science, 2014.
M. Ginsberg. Essentials of Artificial Intelligence. Newnes, 1993.
A. Göker and D. He. Analysing Web Search Logs to Determine Session Boundaries for User-oriented Learning. In Adaptive Hypermedia and Adaptive Web-Based Systems, pages 319–322, 2000.
Google. How Search Works, 2014. URL http://www.google.com/intl/en_us/insidesearch/howsearchworks/thestory/.
Google Official Blog. Personalized Search for Everyone, 2009. URL http://googleblog.blogspot.com/2009/12/personalized-search-for-everyone.html.
P. Grey. How Many Products Does Amazon Sell?, 2013. URL http://export-x.com/2013/12/15/many-products-amazon-sell/.
I. Guyon and A. Elisseeff. An Introduction to Variable and Feature Selection. The Journal of Machine Learning Research, 3:1157–1182, 2003.
A. Hannak, P. Sapiezynski, A. Molavi Kakhki, B. Krishnamurthy, D. Lazer, A. Mislove, and C. Wilson. Measuring Personalization of Web Search. In Proceedings of the 22nd International Conference on World Wide Web, pages 527–538. International World Wide Web Conferences Steering Committee, 2013.
E. F. Harrington. Online Ranking/Collaborative Filtering Using the Perceptron Algorithm. In ICML, volume 20, pages 250–257, 2003.
T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning, volume 2. Springer, 2009.
D. Huang and L. Luo. Consumer Preference Elicitation of Complex Products Using Fuzzy Support Vector Machine Active Learning. Marketing Science, 2015.
InstantWatcher. InstantWatcher.com, 2014. URL http://instantwatcher.com/titles/all.
K. Järvelin and J. Kekäläinen. IR Evaluation Methods for Retrieving Highly Relevant Documents. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '00, pages 41–48. ACM, 2000.
K. Järvelin and J. Kekäläinen. Cumulated Gain-based Evaluation of IR Techniques. ACM Transactions on Information Systems (TOIS), 20(4):422–446, 2002.
Kaggle. Personalized Web Search Challenge, 2013. URL http://www.kaggle.com/c/yandex-personalized-web-search-challenge.
Kaggle. Private Leaderboard - Personalized Web Search Challenge, 2014. URL https://www.kaggle.com/c/yandex-personalized-web-search-challenge/leaderboard.
R. Kohavi and G. H. John. Wrappers for Feature Subset Selection. Artificial Intelligence, 97(1):273–324, 1997.
A. Lambrecht and C. Tucker. When Does Retargeting Work? Information Specificity in Online Advertising. Journal of Marketing Research, 50(5):561–576, 2013.
H. Li and P. Kannan. Attributing Conversions in a Multichannel Online Marketing Environment: An Empirical Model and a Field Experiment. Journal of Marketing Research, 51(1):40–56, 2014.
T. Y. Liu. Learning to Rank for Information Retrieval. Springer, 2011.
P. Masurel, K. Lefèvre-Hasegawa, C. Bourguignat, and M. Scordia. Dataiku's Solution to Yandex's Personalized Web Search Challenge. Technical report, Dataiku, 2014.
D. Metzler and T. Kanungo. Machine Learned Sentence Selection Strategies for Query-biased Summarization. In SIGIR Learning to Rank Workshop, 2008.
K. P. Murphy. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012. ISBN 0262018020, 9780262018029.
O. Netzer, R. Feldman, J. Goldenberg, and M. Fresko. Mine Your Own Business: Market-Structure Surveillance Through Text-Mining. Marketing Science, 31(3):521–543, 2012.
H. Pavliva. Google-Beater Yandex Winning Over Wall Street on Ad View, 2013. URL http://www.bloomberg.com/news/2013-04-25/google-tamer-yandex-amasses-buy-ratings-russia-overnight.html.
T. Qin, T.-Y. Liu, J. Xu, and H. Li. LETOR: A Benchmark Collection for Research on Learning to Rank for Information Retrieval. Information Retrieval, 13(4):346–374, 2010.
F. Qiu and J. Cho. Automatic Identification of User Interest for Personalized Search. In Proceedings of the 15th International Conference on World Wide Web, pages 727–736, 2006.
J. R. Quinlan. C4.5: Programs for Machine Learning, volume 1. Morgan Kaufmann, 1993.
J. Reunanen. Overfitting in Making Comparisons Between Variable Selection Methods. The Journal of Machine Learning Research, 3:1371–1382, 2003.
P. E. Rossi and G. M. Allenby. Bayesian Statistics and Marketing. Marketing Science, 22(3):304–328, 2003.
P. E. Rossi, R. E. McCulloch, and G. M. Allenby. The Value of Purchase History Data in Target Marketing. Marketing Science, 15(4):321–340, 1996.
R. E. Schapire. The Strength of Weak Learnability. Machine Learning, 5(2):197–227, 1990.
B. Schwartz. Google: Previous Query Used On 0.3% Of Searches, 2012. URL http://www.seroundtable.com/google-previous-search-15924.html.
D. Silverman. IAB Internet Advertising Revenue Report, October 2013. URL http://www.iab.net/media/file/IABInternetAdvertisingRevenueReportHY2013FINALdoc.pdf.
A. Singhal. Some Thoughts on Personalization, 2011. URL http://insidesearch.blogspot.com/2011/11/some-thoughts-on-personalization.html.
G. Song. Point-Wise Approach for Yandex Personalized Web Search Challenge. Technical report, IEEE.org, 2014.
D. Sullivan. Bing Results Get Localized & Personalized, 2011. URL http://searchengineland.com/bing-results-get-localized-personalized-64284.
M. Surdeanu, M. Ciaramita, and H. Zaragoza. Learning to Rank Answers on Large Online QA Collections. In ACL, pages 719–727, 2008.
J. Teevan, S. T. Dumais, and D. J. Liebling. To Personalize or Not to Personalize: Modeling Queries with Variation in User Intent. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 163–170, 2008.
S. Tirunillai and G. J. Tellis. Extracting Dimensions of Consumer Satisfaction with Quality from Online Chatter: Strategic Brand Analysis of Big Data Using Latent Dirichlet Allocation. Working paper, 2014.
O. Toubia, D. I. Simester, J. R. Hauser, and E. Dahan. Fast Polyhedral Adaptive Conjoint Estimation. Marketing Science, 22(3):273–303, 2003.
O. Toubia, J. R. Hauser, and D. I. Simester. Polyhedral Methods for Adaptive Choice-based Conjoint Analysis. Journal of Marketing Research, 41(1):116–131, 2004.
O. Toubia, T. Evgeniou, and J. Hauser. Optimization-Based and Machine-Learning Methods for Conjoint Analysis: Estimation and Question Design. Conjoint Measurement: Methods and Applications, page 231, 2007.
M. Volkovs. Context Models For Web Search Personalization. Technical report, University of Toronto, 2014.
Q. Wu, C. J. Burges, K. M. Svore, and J. Gao. Adapting Boosting for Information Retrieval Measures. Information Retrieval, 13(3):254–270, 2010.
Yandex. It May Get Really Personal: We Have Rolled Out Our Second-Generation Personalised Search Program, 2013. URL http://company.yandex.com/press_center/blog/entry.xml?pid=20.
Yandex. Yandex Announces Third Quarter 2013 Financial Results, 2013. URL http://company.yandex.com/press_center/blog/entry.xml?pid=20.
YouTube. YouTube Statistics, 2014. URL https://www.youtube.com/yt/press/statistics.html.
Y. Yue and C. Burges. On Using Simultaneous Perturbation Stochastic Approximation for IR Measures, and the Empirical Optimality of LambdaRank. In NIPS Machine Learning for Web Search Workshop, 2007.
J. Zhang and M. Wedel. The Effectiveness of Customized Promotions in Online and Offline Stores. Journal of Marketing Research, 46(2):190–206, 2009.