Search Personalization using Machine Learning

Hema Yoganarasimhan ∗
University of Washington
Abstract
Query-based search is commonly used by many businesses to help consumers find
information/products on their websites. Examples include search engines (Google,
Bing), online retailers (Amazon, Macy’s), and entertainment sites (Hulu, YouTube).
Nevertheless, a significant portion of search sessions are unsuccessful, i.e., do not provide
information that the user was looking for. We present a machine learning framework that
improves the quality of search results through automated personalization based on a user’s
search history. Our framework consists of three modules – (a) Feature generation, (b)
NDCG-based LambdaMART algorithm, and (c) Feature selection wrapper. We estimate
our framework on large-scale data from a leading search engine using Amazon EC2
servers. We show that our framework offers a significant improvement in search quality
compared to non-personalized results. We also show that the returns to personalization
are monotonically, but concavely increasing with the length of user history. Next, we
find that personalization based on short-term history or “within-session” behavior is less
valuable than long-term or “across-session” personalization. We also derive the value of
different feature sets – user-specific features contribute over 50% of the improvement
and click-specific over 28%. Finally, we demonstrate scalability to big data and derive
the set of optimal features that maximizes accuracy while minimizing computing costs.
Keywords: marketing, online search, personalization, big data, machine learning
∗ The author is grateful to Yandex for providing the data used in this paper. She thanks Brett Gordon for detailed
comments on the paper. Thanks are also due to the participants of the 2014 UT Dallas FORMS, Marketing Science, and
Big Data and Marketing Analytics conferences for feedback. Please address all correspondence to: hemay@uw.edu.
1 Introduction
1.1 Online Search and Personalization
As the Internet has matured, the amount of information and products available in individual websites
has grown exponentially. Today, Google indexes over 30 trillion webpages (Google, 2014), Amazon
sells 232 million products (Grey, 2013), Netflix streams more than 10,000 titles (InstantWatcher,
2014), and over 100 hours of video are uploaded to YouTube every minute (YouTube, 2014). While
unprecedented growth in product assortments gives an abundance of choice to consumers, it nevertheless makes it hard for them to locate the exact product or information they are looking for. To
address this issue, most businesses today use a query-based search model to help consumers locate
the product/information that best fits their needs. Consumers enter queries (or keywords) in a search
box, and the firm presents a set of products/information that is deemed most relevant to the query.
We focus on the most common class of search – Internet search engines, which play an important
role in today’s digital economy. In the U.S. alone, over 18 billion search queries are made every month,
and search advertising generates over $4 billion in quarterly revenues (Silverman, 2013;
comScore, 2014). Consumers use search engines for a variety of reasons; e.g., to gather information
on a topic from authoritative sources, to remain up-to-date with breaking news, and to aid purchase
decisions. Search results and their relative ranking can therefore have significant implications for
consumers, businesses, governments, and for the search engines themselves.
However, a significant portion of searches are unsuccessful, i.e., they do not provide the information that the user was looking for. According to some experts, this is due to the fact that even
when two users search using the same keywords, they are often looking for different information.
Thus, a standardized ranking of results across users may not achieve the best results. To address
this concern, Google launched personalized search a few years back, customizing results based on a
user’s IP address and cookie (individual-level search and click data), using up to 180 days of history
(Google Official Blog, 2009). Since then, other search engines such as Bing and Yandex have also
experimented with personalization (Sullivan, 2011; Yandex, 2013). See Figure 1 for an example of
how personalization can influence search results.
However, the extent and effectiveness of search personalization has been a source of intense debate,
partly fueled by the secrecy of search engines on how they implement personalization (Schwartz,
2012; Hannak et al., 2013).1 Indeed, some researchers have even questioned the value of personalizing
search results, not only in the context of search engines, but also in other settings (Feuz et al., 2011;
1 In Google’s case, the only source of information on its personalization strategy is a post in its official blog, which suggests
that whether a user previously clicked a page and the temporal order of a user’s queries influence her personalized
rankings (Singhal, 2011). In Yandex’s case, all that we know comes from a quote by one of its executives, Mr. Bakunov –
“Our testing and research shows that our users appreciate an appropriate level of personalization and increase their use of
Yandex as a result, so it would be fair to say it can only have a positive impact on our users.” (Atkins-Krüger, 2012).
Figure 1: An example of personalization. The right panel shows results with personalization turned off by
using the Chrome browser in a fresh incognito session. The left panel shows personalized results. The fifth
result in the left panel from the domain babycenter.com does not appear in the right panel. This is because the
user in the left panel had previously visited this URL and domain multiple times in the past few days.
Ghose et al., 2014). In fact, this debate on the returns to search personalization is part of a larger
one on the returns to personalization of any marketing mix variable (see §2 for details). Given the
privacy implications of storing long user histories, both consumers and managers need to understand
the value of personalization.
1.2 Research Agenda
In this paper, we examine the returns to personalization in the context of Internet search. We address
three managerially relevant research questions:
1. Can we provide a general framework to implement personalization in the field and evaluate
the gains from it? In principle, if consumers’ tastes and behavior show persistence over time
in a given setting, personalization using data on their past behavior should improve outcomes.
However, we do not know the degree of persistence in consumers’ tastes/behavior, and the extent
to which personalization can improve search experience is an empirical question.
2. Gains from personalization depend on a large number of factors that are under the manager’s
control. These include the length of user history, the different attributes of data used in the
empirical framework, and whether data across multiple sessions is used or not. We examine the
impact of these factors on the effectiveness of personalization. Specifically:
(a) How does the improvement in personalization vary with the length of user history? For
example, how much better can we do if we have data on the past 100 queries of a user
compared to only her past 50 queries?
(b) Which attributes of the data, or features, lead to better results from personalization? For
example, which of these metrics is more effective – a user’s propensity to click on the
first result or her propensity to click on previously clicked results? Feature design is an
important aspect of any ranking problem and is therefore treated as a source of competitive
advantage and kept under wraps by most search engines (Liu, 2011). However, all firms
that use search would benefit from understanding the relative usefulness of different types
of features.
(c) What is the relative value of personalizing based on short-run data (a single session)
vs. long-run data (multiple sessions)? On the one hand, within-session personalization
might be more informative of the consumer’s current quest. On the other hand, across-session
data can help infer persistence in consumers’ behavior. Moreover, within-session
personalization has to happen in real time and is computationally intensive. So knowing
the returns to it can help managers decide whether it is worth the computational cost.
3. What is the scalability of personalized search to big data? Search datasets tend to be extremely
large and personalizing the results for each consumer is computationally intensive. For a
proposed framework to be implementable in the field, the framework has to be scalable and
efficient. Can we specify sparser models that would significantly reduce the computational
costs while minimizing the loss in accuracy?
1.3 Challenges and Research Framework
We now discuss the three key challenges involved in answering these questions and present an
overview of our research framework.
• Out-of-sample predictive accuracy: First, we need a framework with extremely high out-of-sample predictive accuracy. Note that the search engine’s problem is one of prediction – which
result or document will a given user find most relevant? More formally, what is the user-specific
rank-ordering of the documents in decreasing order of relevance, so the most relevant documents
are at the top for each user?2
2 Search engines don’t aim to provide the most objectively correct or useful information. Relevance is data-driven and
search engines aim to maximize the user’s probability of clicking on and dwelling on the results shown. Hence, it is
possible that non-personalized results may be perceived as more informative and the personalized ones more narrow, by
an objective observer. However, this is not an issue when maximizing the probability of click for a specific user.
Unfortunately, the traditional econometric tools used in marketing literature have been designed
for causal inference (to derive the best unbiased estimates) and not prediction. So they perform
poorly in out-of-sample predictions. Indeed, the well-known “bias-variance trade-off” suggests
that unbiased estimators are not always the best predictors because they tend to have high variance
and vice-versa. See Hastie et al. (2009) for an excellent discussion of this concept. The problems
of prediction and causal inference are thus fundamentally different and often at odds with each
other. Hence, we need a methodological framework that is optimized for predictive accuracy.
• Large number of attributes with complex interactions: The standard approach to modeling is –
assume a fixed functional form to model an outcome variable, include a small set of explanatory
variables, allow for a few interactions between them, and then infer the parameters associated
with the pre-specified functional form. However, this approach has poor predictive ability (as
we later show). In our setting, we have a large number of features (close to 300), and we need to
allow for non-linear interactions between them. This problem is exacerbated by the fact that we
don’t know which of these variables and which interaction effects matter, a priori. Indeed, it is
not possible for the researcher to come up with the best non-linear specification of the functional
form either by manually eyeballing the data or by including all possible interactions. The latter
would lead to over 45,000 variables, even if we restrict ourselves to simple pair-wise interactions.
Thus, we are faced with the problem of not only inferring the parameters of the model, but also
inferring the optimal functional form of the model itself. So we need a framework that can search
and identify the relevant set of features and all possible non-linear interactions between them.
• Scalability and efficiency to big data settings: Finally, as discussed earlier, our approach needs to
be both efficient and scalable. Efficiency refers to the idea that the run-time and deployment cost of the
model should be as small as possible with minimal loss in accuracy. Thus, an efficient model
may lose a very small amount of accuracy, but make huge gains on run-time. Scalability refers
to the idea that an X times increase in the data leads to no more than a 2X (i.e., roughly linear) increase in the
deployment time. Both scalability and efficiency are necessary to run the model in real settings.
To address these challenges, we turn to machine learning methods pioneered by computer
scientists. These are specifically designed to provide extremely high out-of-sample accuracy, work
with a large number of data attributes, and perform efficiently at scale.
Our framework employs a three-pronged approach – (a) Feature generation, (b) NDCG-based
LambdaMART algorithm, and (c) Feature selection using the Wrapper method. The first module
consists of a program that uses a series of functions to generate a large set of features (or attributes)
that function as inputs to the subsequent modules. These features consist of both global as well as
personalized metrics that characterize the aggregate as well as user-level data regarding relevance of
search results. The second module consists of the LambdaMART ranking algorithm that maximizes
search quality, as measured by Normalized Discounted Cumulative Gain (NDCG). This module
employs the Lambda ranking algorithm in conjunction with the machine learning algorithm called
Multiple Additive Regression Tree (MART) to maximize the predictive ability of the framework.
MART performs automatic variable selection and infers both the optimal functional form (allowing
for all types of interaction effects and non-linearities) as well as the parameters of the model. It is not
hampered by the large number of parameters and variables since it performs optimization efficiently
in gradient space using a greedy algorithm. The third module focuses on scalability and efficiency – it
consists of a Wrapper algorithm that wraps around the LambdaMART step to derive a sparser set of
features that preserves most of the gain in search quality at a significantly lower computing cost. The Wrapper thus identifies the
most efficient model that scales well.
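As a rough, hedged illustration of what the second module looks like in practice (this is not the paper’s actual code), a LambdaMART-style ranker can be trained with the open-source LightGBM library; the toy feature matrix, relevance labels, and query groupings below are synthetic placeholders.

```python
# Illustrative sketch only: a LambdaMART-style ranker trained with LightGBM.
# The feature matrix, relevance labels, and query groupings are synthetic placeholders.
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)

# Toy data: 1,000 searches x 10 candidate results each, 5 features per result.
n_queries, n_results, n_features = 1000, 10, 5
X = rng.normal(size=(n_queries * n_results, n_features))
y = rng.integers(0, 3, size=n_queries * n_results)   # relevance grades 0/1/2
group = np.full(n_queries, n_results)                # 10 candidate URLs per search

ranker = lgb.LGBMRanker(
    objective="lambdarank",   # LambdaMART = LambdaRank gradients + boosted trees (MART)
    metric="ndcg",
    n_estimators=200,
    learning_rate=0.05,
)
ranker.fit(X, y, group=group)

# Re-rank the 10 results of one search by descending predicted score.
scores = ranker.predict(X[:n_results])
print(np.argsort(-scores))
```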
Apart from the benefits discussed earlier, an additional advantage of our approach is its applicability to datasets that lack information on context, text, and links, factors that have been shown to play
a big role in improving search quality (Qin et al., 2010). Our framework only requires an existing
non-personalized ranking. As concerns about user privacy increase, this is valuable.
1.4 Findings and Contribution
We apply our framework to a large-scale dataset from Yandex. It is the fourth largest search engine
in the world, with over 60% market share in Russia and Eastern Europe (Clayton, 2013; Pavliva,
2013) and over $320 million in quarterly revenues (Yandex, 2013). In October 2013,
Yandex released a large-scale anonymized dataset for a ‘Yandex Personalized Web Search Challenge’,
which was hosted by Kaggle (Kaggle, 2013). The data consists of information on anonymized user
identifiers, queries, query terms, URLs, URL domains, and clicks. The data spans one month and consists
of 5.7 million users, 21 million unique queries, 34 million search sessions, and over 64 million clicks.
Because all three modules in our framework are memory and compute intensive, we estimate the
model on the fastest servers available on Amazon EC2, with 32 cores and 256 GB of memory.
We present five key results. First, we show that our personalized machine learning framework
leads to significant improvements in search quality as measured by NDCG. In situations where the
topmost link is not the most relevant (based on the current ranking), personalization improves search
quality by approximately 15%. From a consumer’s perspective, this leads to big changes in the
ordering of results they see – over 55% of the documents are re-ordered, of which 30% are re-ordered
by two or more positions. From the search engine’s perspective, personalization leads to a substantial
increase in click-through-rates at the top position, which is a leading measure of user satisfaction.
We also show that standard logit-style models used in the marketing literature lead to no improvement in search quality, even if we use the same set of personalized features generated by the machine
learning framework. This re-affirms the value of using a large-scale machine learning framework that
can provide high predictive accuracy.
Second, the quality of personalized results increases with the length of a user’s search history in a
monotonic but concave fashion (between 3% and 17% for the median user, who has made about 50 past
queries). Since data storage costs are linear, firms can use our estimates to derive the optimal length
of history that maximizes their profits.
Third, we examine the value of different types of features. User-specific features generate over
50% of the improvement from personalization. Click-related features (that are based on users’
clicks and dwell-times) are also valuable, contributing over 28% of the improvement in NDCG.
Both domain- and URL-specific features are also important, contributing over 10% of the gain in
NDCG. Interestingly, session-specific features are not very useful – they contribute only 4% of
the gain in NDCG. Thus, we find that “across-session” personalization based on user-level data is
more valuable than “within-session” personalization based on session-specific data. Firms often
struggle with implementing within-session personalization since it has to happen in real-time and is
computationally expensive. Our findings can help firms decide whether the relatively low value from
within-session personalization is sufficient to warrant the computing costs of real-time responses.
Fourth, we show that our method is scalable to big data. Going from 20% of the data to the full
dataset leads to a 2.38 times increase in computing time for feature-generation, 20.8 times increase for
training, and 11.6 times increase for testing/deployment of the final model. Thus, training and testing
costs increase super-linearly, whereas feature-generation costs increase sub-linearly. Crucially, none
of the increases in computing time are exponential, and using more data provides substantial improvements in NDCG.
Finally, to increase our framework’s efficiency, we provide a forward-selection wrapper, which
derives a sparser set of features that preserves most of the gain in search quality at a significantly lower computing cost. Of the
293 features provided by the feature-generation module, we optimally select 47 that provide 95% of
the improvement in NDCG at significantly lower computing costs. We find that training
the pruned model takes only 15% of the time taken to train the unpruned model. Similarly, we find
that deploying the pruned model is only 50% as expensive as deploying the full model.
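The wrapper is described in detail later in the paper. As a hedged sketch of the general idea only, greedy forward selection adds, at each step, the feature whose inclusion most improves validation NDCG and stops when the marginal gain becomes negligible; `train_and_score` below is a hypothetical placeholder for training the ranker on a candidate feature subset and returning its validation NDCG.

```python
# Sketch of a greedy forward-selection wrapper (not the paper's exact procedure).
# `train_and_score(features)` is assumed to train the ranker on the given feature
# subset and return its validation NDCG; it is a hypothetical placeholder here.
def forward_select(all_features, train_and_score, tol=1e-4):
    selected, best_score = [], 0.0
    remaining = set(all_features)
    while remaining:
        # Evaluate each candidate feature added to the current subset.
        scores = {f: train_and_score(selected + [f]) for f in remaining}
        best_feature, score = max(scores.items(), key=lambda kv: kv[1])
        if score - best_score < tol:      # stop when the marginal gain is negligible
            break
        selected.append(best_feature)
        remaining.remove(best_feature)
        best_score = score
    return selected, best_score
```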
In sum, our paper makes three main contributions to the marketing literature on personalization.
First, it presents a comprehensive machine learning framework to help researchers and managers
improve personalized marketing decisions using data on consumers’ past behavior. Second, it presents
empirical evidence in support of the returns to personalization in the online search context. Third, it
demonstrates how large datasets or big data can be leveraged to improve marketing outcomes without
compromising the speed or real-time performance of digital applications.
2 Related Literature
Our paper relates to many streams of literature in marketing and computer science.
First, it relates to the empirical literature on personalization in marketing. In an influential paper,
Rossi et al. (1996) quantify the benefits of one-to-one pricing using purchase history data and find
that personalization improves profits by 7.6%. Similarly, Ansari and Mela (2003) find that content-targeted emails can increase click-throughs by up to 62%, and Arora and Henderson (2007) find
that customized promotions in the form of ‘embedded premiums’ or social causes associated with
the product can improve profits. On the other hand, a series of recent papers question the returns to
personalization in a variety of contexts ranging from advertising to ranking of hotels in travel sites
(Zhang and Wedel, 2009; Lambrecht and Tucker, 2013; Ghose et al., 2014).3
Our paper differs from this literature on two important dimensions. Substantively, we focus on
product personalization (search results) unlike the earlier papers that focus on the personalization of
marketing mix variables such as price and advertising. Even papers that consider ranking outcomes,
e.g., Ghose et al. (2014), use marketing mix variables such as advertising and pricing to make ranking
recommendations. However, in the information retrieval context, there are no such marketing mix
variables. Moreover, with our anonymized data, we don’t even have access to the topic/context of
search. Second, from a methodological perspective, previous papers use computationally intensive
Bayesian methods that employ Markov Chain Monte Carlo (MCMC) procedures, or randomized
experiments, or discrete choice models – all of which are not only difficult to scale, but also lack
predictive power (Rossi and Allenby, 2003; Friedman et al., 2000). In contrast, we use machine
learning methods that can handle big data and offer high predictive performance.
In a similar vein, our paper also relates to the computer science literature on personalization
of search engine rankings. Qiu and Cho (2006) infer user interests from browsing history using
experimental data on ten subjects. Teevan et al. (2008) and Eickhoff et al. (2013) show that there could
be returns to personalization for some specific types of searches involving ambiguous queries and
atypical search sessions. Bennett et al. (2012) find that long-term history is useful early in the session
whereas short-term history is more useful later in the session. There are two main differences between these papers
and ours. First, in all these papers, the researchers had access to the data with context – information
on the URLs and domains of the webpages, the hyperlink structure between them, and the queries
used in the search process. In contrast, we work with anonymized data – we don’t have data on the
URL identifiers, texts, or links to/from webpages; nor do we know the queries. This makes our task
significantly more challenging. We show that personalization is not only feasible, but also beneficial
in such cases. Second, the previous papers work with much smaller datasets, which helps them avoid
the scalability issues that we face.
More broadly, our paper is relevant to the growing literature on machine learning methodologies
for Learning-to-Rank (LETOR) methods. Many information retrieval problems in business settings
often boil down to ranking. Hence, ranking methods have been studied in many contexts such as
3 A separate stream of literature examines how individual-level personalized measures can be used to improve multi-touch
attribution in online channels (Li and Kannan, 2014).
collaborative filtering (Harrington, 2003), question answering (Surdeanu et al., 2008; Banerjee et al.,
2009), text mining (Metzler and Kanungo, 2008), and digital advertising (Ciaramita et al., 2008). We
refer interested readers to Liu (2011) for details.
Our paper also contributes to the literature on machine learning methods in marketing. Most of
these applications have been in the context of online conjoint analysis and use statistical machine
learning techniques. These methods are employed to alleviate issues such as noisy data and complexity
control (similar to feature selection in our paper); they also help in model selection and improve
robustness to response error (in participants). Some prominent works in this area include Toubia et al.
(2003, 2004); Evgeniou et al. (2005, 2007). Please see Toubia et al. (2007) for a detailed discussion of
this literature, and Dzyabura and Hauser (2011); Tirunillai and Tellis (2014); Netzer et al. (2012);
Huang and Luo (2015) for other applications of machine learning in marketing.
3 Data
3.1 Data Description
We use the publicly available anonymized dataset released by Yandex for its personalization challenge
(Kaggle, 2013). The dataset is from 2011 (exact dates are not disclosed) and represents 27 days of
search activity. The queries and users are sampled from a large city in eastern Europe. The data is
processed such that two types of sessions are removed – those containing queries with commercial
intent (as detected by Yandex’s proprietary classifier) and those containing one or more of the top-K
most popular queries (where K is not disclosed). This minimizes risks of reverse engineering of
user identities and business interests. The current rankings shown in the data are based on Yandex’s
proprietary non-personalized algorithm.
Table 1 shows an example of a session from the data. As such, the data contains information on
the following anonymized variables:
• User: A user, indexed by i, is an agent who uses the search engine for information retrieval. Users
are tracked over time through a combination of cookies, IP addresses, and logged in accounts.
There are 5,659,229 unique users in the data.
• Query: A query is composed of one or more words, in response to which the search engine returns
a page of results. Queries are indexed by q. There are a total of 65,172,853 queries in the data,
of which 20,588,928 are unique.4 Figure 2 shows the distribution of queries in the data, which
follows a long-tail pattern – 1% of queries account for 47% of the searches, and 5% for 60%.
• Term: Each query consists of one or more terms, and terms are indexed by l. For example, the
query ‘pictures of richard armitage’ has three terms – ‘pictures’, ‘richard’, and ‘armitage’. There
are 4,701,603 unique terms in the data, and we present the distribution of the number of terms in
4 Queries are not the same as keywords, which is another term used frequently in connection with search engines. Please
see Gabbert (2011) for a simple explanation of the difference between the two phrases.
20  M    6  5
20  0    Q  0  6716819  2635270,4603267  70100440,5165392  17705992,1877254  41393646,3685579  46243617,3959876  26213624,2597528  29601078,2867232  53702486,4362178  11708750,1163891  53485090,4357824  4643491,616472
20  13   C  0  70100440
20  495  C  0  29601078
Table 1: Example of a session. The data is in a log format describing a stream of user actions, with each line
representing a session, query, or a click. In all lines, the first column represents the session (SessionID = 20 in
this example). The first row from a session contains the session-level metadata and is indicated by the letter M.
It has information on the day the session happened (6) and the UserID (5). Rows with the letter Q denote a
Query and those with C denote a Click. The user submitted the query with QueryID 6716819, which contains
terms with TermIDs 2635270, 4603267. The URL with URLID 70100440 and DomainID 5165392 is the top
result on the SERP. The remaining nine results follow similarly. 13 units of time into the session, the user clicked on
the result with URLID 70100440 (ranked first in the list) and 495 units of time into the session s/he clicked on
the result with URLID 29601078.
Figure 2: Cumulative distribution function of the number of unique queries – the fraction of overall searches
corresponding to unique queries, sorted by popularity.
all the queries seen in the data in Table 2.5
• Session: A search session, indexed by j, is a continuous block of time during which a user is
repeatedly issuing queries, browsing and clicking on results to obtain information on a topic.
Search engines use proprietary algorithms to detect session boundaries (Göker and He, 2000);
and the exact method used by Yandex is not publicly revealed. According to Yandex, there are
34,573,630 sessions in the data.
• URL: After a user issues a query, the search engine returns a set of ten results in the first page.
These results are essentially URLs of webpages relevant to the query. There are 69,134,718
unique URLs in the data, and URLs are indexed by u. Yandex only provides data for the first
Search Engine Results Page (SERP) for a given query because a majority of users never go beyond
5 Search engines typically ignore common keywords and prepositions used in searches such as ‘of’, ‘a’, ‘in’, because they
tend to be uninformative. Such words are called ‘Stop Words’.
Number of Terms Per Query    Percentage of Queries
1                            22.32
2                            24.85
3                            20.63
4                            14.12
5                            8.18
6                            4.67
7                            2.38
8                            1.27
9                            0.64
10 or more                   1.27

Table 2: Distribution of number of terms in a query. (Min, Max) = (1, 82).

Position    Probability of click
1           0.4451
2           0.1659
3           0.1067
4           0.0748
5           0.0556
6           0.0419
7           0.0320
8           0.0258
9           0.0221
10          0.0206

Table 3: Click probability based on position.
the first page, and click data for subsequent pages is very sparse. So from an implementation
perspective, it is difficult to personalize later SERPs; and from a managerial perspective, the
returns to personalizing them are low since most users never see them.
• URL domain: This is the domain to which a given URL belongs. For example, www.imdb.com is a
domain, while http://www.imdb.com/name/nm0035514/ and http://www.imdb.
com/name/nm0185819/ are URLs or webpages hosted at this domain. There are 5,199,927
unique domains in the data.
• Click: After issuing a query, users can click on one or all the results on the SERP. Clicking a
URL is an indication of user interest. We observe 64,693,054 clicks in the data. So, on average, a
user clicks 3.257 times following a query (including repeat clicks to the same result). Table 3
shows the probability of click by position or rank of the result. Note the steep drop off in click
probability with position – the first document is clicked 44.51% of times, whereas the fifth one is
clicked a mere 5.56% of times.
• Time: Yandex gives us the timeline of each session. However, to preserve anonymity, it does not
disclose how many milliseconds are in one unit of time. Each action in a session comes with a
time stamp that tells us how long into the session the action took place. Using these time stamps,
Yandex calculates ‘dwell time’ for each click, which is the time between a click and the next click
or the next query. Dwell time is informative of how much time a user spent with a document.
• Relevance: A result is relevant to a user if she finds it useful and irrelevant otherwise. Search
engines attribute relevance scores to search results based on user behavior – clicks and dwell
time. It is well known that dwell time is correlated with the probability that the user satisfied
her information needs with the clicked document. The labeling is done automatically, and can hence
differ for each query issued by the user. Yandex uses three grades of relevance:
• Irrelevant, r = 0.
Figure 3: Cumulative distribution function of the number of queries in a session (x-axis: number of queries; y-axis: fraction of sessions). Min, Max = 1, 280.
Figure 4: Cumulative distribution function of the number of queries per user observed in days 26 and 27 (x-axis: number of queries; y-axis: fraction of users). Min, Max = 1, 1260.
A result gets an irrelevant grade if it doesn’t get a click or if it receives a click with dwell time
strictly less than 50 units of time.
• Relevant, r = 1.
Results with dwell times between 50 and 399 time units receive a relevant grade.
• Highly relevant, r = 2.
This grade is given to two types of documents – (a) those that received clicks with a dwell time of 400
units or more, and (b) those where the document was clicked and the click was the last action of the
session (irrespective of dwell time). The last click of a session is considered to be highly
relevant because a user is unlikely to have terminated her search if she is not satisfied.
In cases where a document is clicked multiple times after a query, the maximum dwell time is
used to calculate the document’s relevance.
In sum, we have a total of 167,413,039 lines of data.
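The relevance-grading rules above translate directly into code. The sketch below is only an illustration under the stated thresholds (50 and 400 units); the function name and inputs are hypothetical and not part of the Yandex data format.

```python
# Sketch of the stated relevance-grading rules (thresholds 50 and 400 from the text).
def relevance_grade(dwell_times, is_last_click_of_session):
    """dwell_times: dwell time (in data units) of each click on the document after a
    given query; empty if the document was not clicked.
    is_last_click_of_session: True if one of those clicks ended the session."""
    if not dwell_times:
        return 0                      # no click -> irrelevant (r = 0)
    if is_last_click_of_session:
        return 2                      # last click of the session -> highly relevant (r = 2)
    dwell = max(dwell_times)          # multiple clicks: use the maximum dwell time
    if dwell >= 400:
        return 2                      # highly relevant (r = 2)
    if dwell >= 50:
        return 1                      # relevant (r = 1)
    return 0                          # dwell strictly below 50 units -> irrelevant (r = 0)
```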
3.2 Model-Free Analysis
For personalization to be effective, we need to observe three types of patterns in the data. First, we
need sufficient data on user history to provide personalized results. Second, there has to be significant
heterogeneity in users’ tastes. Third, users’ tastes have to exhibit persistence over time. Below, we
examine the extent to which search and click patterns satisfy these criteria.
First, consider user history metrics. If we examine the number of queries per session (see Figure 3),
we find that 60% of the sessions have only one query. This implies that within session personalization
is not possible for such sessions. However, 20% of sessions have two or more queries, making
them amenable to within session personalization. Next, we examine the feasibility of across-session
personalization for users observed in days 26 and 27 by considering the number of queries they have
Figure 5: Cumulative distribution function of the average number of clicks per query for a given user (x-axis: average clicks/query for user; y-axis: fraction of users).
Figure 6: Cumulative distribution function of average clicks per query for users who have issued 10 or more queries (x-axis: average clicks/query for user; y-axis: fraction of users).
issued in the past.6 We find that the median user has issued 16 queries and over 25% of the users have
issued at least 32 queries (see Figure 4). Together, these patterns suggest that a reasonable degree of
both short and long-term personalization is feasible in this context.
Figure 5 presents the distribution of the average number of clicks per query for the users in the
data. As we can see, there is considerable heterogeneity in users’ propensity to click. For example,
more than 15% of users don’t click at all, whereas 20% of them click on 1.5 results or more for each
query they issue. To ensure that these patterns are not driven by one-time or infrequent users, we
present the same analysis for users who have issued at least 10 queries in Figure 6, which shows
similar click patterns. Together, these two figures establish the presence of significant user-specific
heterogeneity in users’ clicking behavior.
Next, we examine whether there is variation or entropy in the URLs that are clicked for a given
query. Figure 7 presents the cumulative distribution of the number of unique URLs clicked for a
given query through the time period of the data. 37% of queries receive no clicks at all, and close to
another 37% receive a click to only one unique URL. These are likely to be navigational queries; e.g.,
almost everyone who searches for ‘cnn’ will click on www.cnn.com. However, close to 30% of
queries receive clicks on two or more URLs. This suggests that users are clicking on multiple results.
This could be either due to heterogeneity in users’ preferences for different results or because the
results themselves change a lot for the query during the observation period (high churn in the results
is likely for news-related queries, such as ‘earthquake’). To examine whether the latter explanation drives all
the variation in Figure 7, we plot the number of unique URLs and domains we see for a given query
through the course of the data in Figure 8. We find that more than 90% of queries have no churn in the
6 As explained in §6.1, users observed early in the data have no/very little past history because we have a data snapshot for a
fixed time period (27 days), and not a fixed history for each user. Hence, following Yandex’s suggestion, we run the
personalization algorithm for users observed in the last two days. In practice, of course, firms don’t have to work with
snapshots of data; rather they are likely to have longer histories stored at the user level.
Figure 7: Cumulative distribution function of the number of unique URLs clicked for a given query in the 27-day period (x-axis: total URLs clicked; y-axis: fraction of unique queries).
Figure 8: Cumulative distribution function of the number of unique URLs and domains shown for a given query in the 27-day period (x-axis: total URL/domain results; y-axis: fraction of unique queries; series: URL Results, Domain Results).
URLs shown in the results page. Hence, it is unlikely that the entropy in clicked URLs is driven by churn
in search engine results. Rather, it is more likely to be driven by heterogeneity in users’ tastes.
In sum, the basic patterns in the data point to significant heterogeneity in users’ preferences and
persistence of those preferences, which suggest the existence of positive returns to personalization.
4 Feature Generation
We now present the first module of our three-pronged empirical strategy – feature generation. In
this module, we conceptualize and define an exhaustive feature set that captures both global and
individual-level attributes that can help predict the most relevant documents for a given search by a
user i during a session j. Feature engineering is an important aspect of all machine learning problems
and is of both technical and substantive interest.
Before proceeding, we make note of two points. First, the term ‘features’ as used in the machine
learning literature may be interpreted as ‘attributes’ or ‘explanatory variables’ in the marketing
context. Second, features based on the hyper-link structure of the web and the text/context of the
URL, domain, and query are not visible to us since we work with anonymized data. Such features
play an important role in ranking algorithms and not having information on them is a significant
drawback (Qin et al., 2010). Nevertheless, we have one metric that can be interpreted as a discretized
combination of such features – the current rank based on Yandex’s algorithm (which does not take
personalized search history into account). While this is not perfect (since there is information-loss
due to discretization and because we cannot interact the underlying features with the user-specific
ones generated here), it is the best data we have, and we use it as a measure of features invisible to us.
Our framework consists of a series of functions that can be used to generate a large and varied
set of features. There are two main advantages to using such a framework instead of deriving and
defining each feature individually. First, it allows us to succinctly define and obtain a large number
of features. Second, the general framework ensures that the methods presented in this paper can be
extrapolated to other personalized ranking contexts, e.g., online retailers or entertainment sites.
4.1 Functions for Feature Generation
We now present a series of functions that are used to compute sets of features that capture attributes
associated with the past history of both the search result and the user.
1. Listings(query, url , usr , session, time): This function calculates the number of times a url
appeared in a SERP for searches issued by a given usr for a given query in a specific session
before a given instance of time.
This function can also be invoked with certain input values left unspecified, in which case it
is computed over all possible values of the unspecified inputs. For example, if the user and
session are not specified, then it is computed over all users and across all sessions. That is,
Listings(query, url , −, −, time) computes the number of occurrences of a url in any SERP for
a specific query issued before a given instance of time. Another example of a generalization
of this function involves not providing an input value for query, usr , and session (as in
Listings(−, url , −, −, time)), in which case the count is performed over all SERPs associated
with all queries, users, and sessions; i.e., we count the number of times a given url appears in
all SERPs seen in the data before a given point in time irrespective of the query.
We also define the following two variants of this function.
• If we supply a search term (term) as an input instead of a query, then the invocation
Listings(term, url , usr , session, time) takes into account all queries that include term as
one of the search terms.
• If we supply a domain as an input instead of a URL, then the invocation
Listings(query, domain, usr , session, time) will count the number of times that a result
from domain appeared in the SERP in the searches associated with a given query.
2. AvgPosn(query, url , usr , session, time): This function calculates the average position/rank of
a url within SERPs that list it in response to searches for a given query within a given session
and before an instance of time for a specific usr . It provides information over and above the
Listings function as it characterizes the average ranking of the result as opposed to simply
noting its presence in the SERP.
3. Clicks(query, url , usr , session, time): This function calculates the number of times that a
specific usr clicked on a given url after seeing it on a SERP provided in response to searches
for a given query within a specific session before a given instance of time. Clicks(·) can then
be used to compute the click-through rates (CTRs) as follows:
$$\text{CTR}(query, url, usr, session, time) = \frac{\text{Clicks}(query, url, usr, session, time)}{\text{Listings}(query, url, usr, session, time)} \qquad (1)$$
4. HDClicks(query, url , usr , session, time): This function calculates the number of times that
usr made a high duration click or a click of relevance r = 2 on a given url after seeing it on a
SERP provided in response to searches for a given query within a specific session and before a
given instance of time. HDClicks(·) can then be used to compute high-duration click-through
rates (HDCTRs) as follows:
$$\text{HDCTR}(query, url, usr, session, time) = \frac{\text{HDClicks}(query, url, usr, session, time)}{\text{Listings}(query, url, usr, session, time)} \qquad (2)$$
Note that HDClicks and HDCTR have information over and beyond Clicks and CTR since they
consider clicks of high relevance.
5. ClickOrder(query, url , usr , session, time) is the order of the click (conditional on a click) for
a given url by a specific usr in response to a given query within a specific session and before a
given instance of time. Then:
$$\text{WClicks}(query, url, usr, session, time) = \begin{cases} \dfrac{1}{\text{ClickOrder}(query, url, usr, session, time)} & \text{if } url \text{ was clicked} \\ 0 & \text{if } url \text{ was not clicked} \end{cases} \qquad (3)$$
WClicks calculates a weighted click metric that captures the idea that a URL that is clicked first
is more relevant than one that is clicked second, and so on. This is in contrast to Clicks that
assigns the same weight to URLs irrespective of the order in which they are clicked. WClicks is
used to compute weighted click-through rates (WCTRs) as follows:
$$\text{WCTR}(query, url, usr, session, time) = \frac{\text{WClicks}(query, url, usr, session, time)}{\text{Listings}(query, url, usr, session, time)} \qquad (4)$$
For instance, a URL that is always clicked first when it is displayed has WCTR = CTR, whereas
for a URL that is always clicked second, WCTR = (1/2) CTR, and for one that is always clicked
third, WCTR = (1/3) CTR.
6. PosnClicks(query, url , usr , session, posn, time). The position of a URL in the SERP is
denoted by posn. Then:
$$\text{PosnClicks}(query, url, usr, session, posn, time) = \frac{posn}{10} \times \text{Clicks}(query, url, usr, session, time) \qquad (5)$$
If a URL in a low position gets clicked, it is likely to be more relevant than one that is in a
high position and gets clicked. This stems from the idea that users usually go through the list
of URLs from top to bottom and that they only click on URLs in lower positions if they are
especially relevant. Using this, we can calculate the position weighted CTR as follows:
$$\text{PosnCTR}(query, url, usr, session, posn, time) = \frac{\text{PosnClicks}(query, url, usr, session, posn, time)}{\text{Listings}(query, url, usr, session, posn, time)} \qquad (6)$$
7. Skips(query, url , usr , session, time): Previous research suggests that users usually go over
the list of results in a SERP sequentially from top to bottom (Craswell et al., 2008). Thus, if the
user skips a top URL and clicks on one in a lower position, it is significant because the decision
to skip the top URL is a conscious decision. Thus, if a user clicks on position 2, after skipping
position 1, the implications for the URLs in positions 1 and 3 are different. The URL in position
1 has been examined and deemed irrelevant, whereas the URL in position 3 may not have been
clicked because the user never reached it. Thus, the implications of a lack of clicks are more
damning for the skipped URL in position 1 compared to the one in position 3.
The function Skips thus captures the number of times a specific url has been ‘skipped’ by a
given usr when it was shown in response to a specific query within a given session and before
time. For example, for a given query assume that the user clicks on the URL in position 5
followed by that in position 3. Then URLs in positions 1 and 2 were skipped twice, the URLs
in positions 3 and 4 were skipped once, and those in positions 6-10 were skipped zero times.
We then compute SkipRate as follows:
$$\text{SkipRate}(query, url, usr, session, time) = \frac{\text{Skips}(query, url, usr, session, time)}{\text{Listings}(query, url, usr, session, time)} \qquad (7)$$
8. Dwells(query, url , usr , session, time): This function calculates the total amount of time that
a given usr dwells on a specific url after clicking on it in response to a given query within a
given session and before time. Dwells can be used to compute the average dwell time (ADT),
which has additional information on the relevance of a url over and above the click-through
rate.
$$\text{ADT}(query, url, usr, session, time) = \frac{\text{Dwells}(query, url, usr, session, time)}{\text{Clicks}(query, url, usr, session, time)} \qquad (8)$$
As in the case of Listings, all the above functions can be invoked in their generalized forms without
specifying a query, user, or session. Similarly, they can also be invoked for terms and domains, except
SkipRate, which is not invoked for domains. The features generated from these functions are listed
in Table A1 in the Appendix. Note that we use only the first four terms for features computed using
search terms (e.g., those features numbered 16 through 19). Given that 80% of the queries have four
or fewer search terms (see Table 2), the information loss from doing so is minimal while keeping the
number of features reasonable.
In addition to the above, we also use the following functions to generate additional features.
9. QueryTerms(query) is the total number of terms in a given query.
10. Day(usr , session, time) is the day of the week (Monday, Tuesday, etc.) on which the query
was made by a given usr in specific session before a given instance of time.
11. ClickProb(usr , posn, time). This is the probability that a given usr clicks on a search result
that appears at posn within a SERP before a given instance of time. This function captures the
heterogeneity in users’ preferences for clicking on results appearing at different positions.
12. NumSearchResults(query) is the total number of search results appearing on any SERP associated with a given query over the observation period. It is an entropy metric on the extent to
which search results associated with a query vary over time. This number is likely to be high
for news-related queries.
13. NumClickedURLs(query) is the total number of unique URLs clicked on any SERP associated
with that query. It is a measure of the heterogeneity in users’ tastes for information/results
within a query. For instance, for navigational queries this number is likely to be very low,
whereas for other queries like ‘Egypt’, it is likely to be high since some users may be seeking
information on the recent political conflict in Egypt, others on tourism to Egypt, while yet
others on the history/culture of Egypt.
14. SERPRank(url , query, usr , session). Finally, as mentioned earlier, we include the position/rank of a result for a given query by a given usr in a specific session. This is informative
of many aspects of the query, page, and context that are not visible to us. By using SERPRank as
one of the features, we are able to fold in this information into a personalized ranking algorithm.
Note that SERPRank is the same as the position posn of the URL in the results page. However,
we define it as a separate feature since the position can vary across observations.
As before, some of these functions can be invoked more generally, and all the features generated
through them are listed in Table A1 in the Appendix.
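To make the counting functions above concrete, the following sketch computes Listings, Clicks, and CTR over a toy impression log with pandas. The column names and log layout are hypothetical stand-ins for the anonymized Yandex format, and only the fully specified invocation is shown; the generalized invocations simply drop the corresponding filters.

```python
# Illustrative sketch of Listings / Clicks / CTR over a toy log (hypothetical schema).
import pandas as pd

# One row per search result shown; `clicked` marks whether that listing was clicked.
log = pd.DataFrame({
    "usr":     [5, 5, 5, 5],
    "session": [20, 20, 21, 21],
    "query":   [6716819, 6716819, 6716819, 6716819],
    "url":     [70100440, 29601078, 70100440, 70100440],
    "time":    [0, 0, 30, 60],
    "clicked": [1, 1, 0, 1],
})

def listings(df, query=None, url=None, usr=None, session=None, time=None):
    """Count how often `url` was listed; unspecified arguments are left unrestricted."""
    mask = pd.Series(True, index=df.index)
    for col, val in [("query", query), ("url", url), ("usr", usr), ("session", session)]:
        if val is not None:
            mask &= df[col] == val
    if time is not None:
        mask &= df["time"] < time        # only impressions strictly before `time`
    return int(mask.sum())

def clicks(df, **kw):
    """Same filters as listings(), but count only clicked impressions."""
    return listings(df[df["clicked"] == 1], **kw)

def ctr(df, **kw):
    shown = listings(df, **kw)
    return clicks(df, **kw) / shown if shown else 0.0

print(ctr(log, query=6716819, url=70100440, usr=5, time=100))   # 2 clicks / 3 listings
```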
4.2 Categorization of Feature Set
The total number of features that we can generate from these functions is very large (293 to be
specific). To aid exposition and experiment with the impact of including or excluding certain sets of
features, we group the above set of features into the following (partially overlapping) categories. This
grouping also provides a general framework that can be used in other contexts.
1. FP includes all user or person-specific features except ones that are at session-level granularity.
For example, FP includes Feature 76, which counts the number of times a specific url appeared
in a SERP for searches issued by a given usr for a specific query.
2. FS includes all user-specific features at a session-level granularity. For example, FS includes
Feature 151, which measures the number of times a given url appeared in a SERP for searches
issued by a given usr for a specific query in a given session.
3. FG includes features that provide global or system-wide statistics regarding searches (i.e.,
features which are not user-specific). For instance, FG includes Feature 1, which measures
the number of times a specific url appeared in a SERP for searches involving a specific query
irrespective of the user issuing the search.
4. FU includes features associated with a given url . For example, FU includes Features 1, 76, 151,
but not Feature 9, which is a function of the domain name associated with a search result, but
not URL.
5. FD includes features that are attributes associated with the domain name of the search result.
This set includes features such as Feature 9, which measures the number of times a given
domain is the domain name of a result in a SERP provided for a specific query.
6. FQ includes features that are computed based on the query used for the search. For example,
FQ includes Feature 1, but does not include Features 16–19 (which are based on the individual
terms used in the search) and Feature 226 (which is not query-specific).
7. FT includes features that are computed based on the individual terms comprising a search query.
For example, FT includes Features 16–19.
8. FC includes features that are based on the clicks received by the result. Examples include
Features 2, 5, and 6.
9. FN includes features that are based on weighted clicks; e.g., it includes Features 36–39.
10. FW includes features that are computed based on the dwell times associated with user clicks;
e.g., it includes Feature 3, which is the average dwell time for a given url in response to a
specific query.
11. FR includes features that are computed based on the rankings of a search result on different
SERPs. Examples include Features 7 and 15.
12. FK includes features that measure how often a specific search result was skipped over by a user
across different search queries, e.g., Feature 8.
The third column of Table A1 in the Appendix provides a mapping from features to the feature sets
that they belong to. It is not necessary to include polynomial or other higher-order transformations of
the features because the optimizer we use (MART) is insensitive to monotone transformations of the
input features (Murphy, 2012).
5 Ranking Methodology
We now present the second module of our empirical framework, which consists of a methodology for
re-ranking the top 10 results for a given query, by a given user, at a given point in time. The goal of
re-ranking is to promote URLs that are more likely to be relevant to the user and demote those that are
less likely to be relevant. Relevance is data-driven and defined based on clicks and dwell times.
5.1 Evaluation Metric
To evaluate whether a specific re-ranking method improves overall search quality or not, we first
need to define an evaluation metric. The most commonly used metric in the ranking literature is
Normalized Discounted Cumulative Gain (NDCG). It is also used by Yandex to evaluate the quality
of its search results. Hence, we use this metric for evaluation too.
NDCG is a list-level graded relevance metric, i.e., it allows for multiple levels of relevance. NDCG
measures the usefulness, or gain, of a document based on its position in the result list. NDCG rewards
rank-orderings that place documents in decreasing order of relevance. It penalizes mistakes in top
positions more than those in lower positions, because users are more likely to focus and click on the
top documents.
We start by defining the simpler metric, Discounted Cumulative Gain, DCG (Järvelin and
Kekäläinen, 2000; Järvelin and Kekäläinen, 2002). DCG is a list-level metric defined for the
truncation position $P$. For example, if we are only concerned about the top 10 documents, then we
would work with $DCG_{10}$. The DCG at truncation position $P$ is defined as:
$$DCG_P = \sum_{p=1}^{P} \frac{2^{r_p} - 1}{\log_2(p + 1)} \qquad (9)$$
where $r_p$ is the relevance rating of the document (or URL) in position $p$.
A problem with DCG is that the contribution of each query to the overall DCG score of the data
varies. For example, a search that retrieves four ordered documents with relevance scores 2, 2, 1, 1,
will have higher DCG scores compared to a different search that retrieves four ordered documents
with relevance scores 1, 1, 0, 0. This is problematic, especially for re-ranking algorithms, where we
only care about the relative ordering of documents. To address this issue, a normalized version of
DCG is used:
$$NDCG_P = \frac{DCG_P}{IDCG_P} \qquad (10)$$
where $IDCG_P$ is the ideal DCG at truncation position $P$. The $IDCG_P$ for a list is obtained by rank-ordering all the documents in the list in decreasing order of relevance. NDCG always takes values
between 0 and 1. Since NDCG is computed for each search in the training data and then averaged
over all queries in the training set, no matter how poorly the documents associated with a particular
query are ranked, it will not dominate the evaluation process since each query contributes equally to
the average measure.
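As a minimal sketch (not Yandex’s or the paper’s implementation), the functions below compute $DCG_P$ and $NDCG_P$ from a ranked list of relevance grades and reproduce the point made above: the lists (2, 2, 1, 1) and (1, 1, 0, 0) have very different DCG values, yet both have NDCG = 1 when they are already ideally ordered.

```python
# Minimal sketch of DCG / NDCG at truncation position P (Equations 9 and 10).
import math

def dcg(relevances, P=10):
    return sum((2 ** r - 1) / math.log2(p + 1)
               for p, r in enumerate(relevances[:P], start=1))

def ndcg(relevances, P=10):
    ideal = dcg(sorted(relevances, reverse=True), P)   # IDCG: best possible ordering
    return dcg(relevances, P) / ideal if ideal > 0 else 0.0

# The two example lists from the text: very different DCG, identical NDCG of 1.0.
print(dcg([2, 2, 1, 1], P=4), dcg([1, 1, 0, 0], P=4))    # ~5.82 vs ~1.63
print(ndcg([2, 2, 1, 1], P=4), ndcg([1, 1, 0, 0], P=4))  # 1.0 and 1.0
```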
NDCG has two main advantages. First, it allows a researcher to specify the number of positions
that she is interested in. For example, if we only care about the first page of search results (as in our
case), then we can work with $NDCG_{10}$ (henceforth referred to as NDCG for notational simplicity).
Instead, if we are interested in the first two pages, we can work with $NDCG_{20}$. Second, NDCG is a
list-level metric; it allows us to evaluate the quality of results for the entire list of results with built-in
penalties for mistakes at each position.
Nevertheless, NDCG suffers from a tricky disadvantage. It is based on discrete positions, which
makes its optimization very difficult. For instance, if we were to use an optimizer that uses continuous
scores, then we would find that the relative positions of two documents (based on NDCG) will
not change unless there are significant changes in their rank scores. So we could obtain the same
rank orderings for multiple parameter values of the model. This is an important consideration that
constrains the types of optimization methods that can be used to maximize NDCG.
5.2 General Structure of the Ranking Problem
We now present the general structure of the problem and some nomenclature to help with exposition.
We use i to index users, j to index sessions, and k to index the search within a session. Further, let q
denote the query associated with a search, and u denote the URL of a given search result.
Let $x^{qu}_{ijk} \in \mathbb{R}^F$ denote a set of $F$ features for document/URL $u$ in response to a query $q$ for the $k$th
search in the $j$th session of user $i$. The function $f(\cdot)$ maps the feature vector to a score $s^{qu}_{ijk}$ such that
$s^{qu}_{ijk} = f(x^{qu}_{ijk}; w)$, where $w$ is a set of parameters that define the function $f(\cdot)$. $s^{qu}_{ijk}$ can be interpreted
as the gain or value associated with document $u$ for query $q$ for user $i$ in the $k$th search in her $j$th
session. Let $U^{qm}_{ijk} \triangleright U^{qn}_{ijk}$ denote the event that URL $m$ should be ranked higher than $n$ for query $q$ for
user $i$ in the $k$th search of her $j$th session. For example, if $m$ has been labeled highly relevant ($r_m = 2$)
and $n$ has been labeled irrelevant ($r_n = 0$), then $U^{qm}_{ijk} \triangleright U^{qn}_{ijk}$. Note that the labels for the same URLs
can vary with queries, users, and across and within the session; hence the superscripting by $q$ and
subscripting by $i$, $j$, and $k$.
Given two documents m and n associated with a search, we can write out the probability that
document m is more relevant than n using a logistic function as follows:
P^{qmn}_{ijk} = P(U^{qm}_{ijk} \succ U^{qn}_{ijk}) = \frac{1}{1 + e^{-\sigma(s^{qm}_{ijk} - s^{qn}_{ijk})}} \qquad (11)
where $\{x^{qm}_{ijk}, x^{qn}_{ijk}\}$ refer to the input features of the two documents and $\sigma$ determines the shape of this
sigmoid function. Using these probabilities, we can define the log-Likelihood of observing the data or
the Cross Entropy Cost function as follows. This is interpreted as the penalty imposed for deviations
of model predicted probabilities from the desired probabilities.
C^{qmn}_{ijk} = -\bar{P}^{qmn}_{ijk} \log(P^{qmn}_{ijk}) - (1 - \bar{P}^{qmn}_{ijk}) \log(1 - P^{qmn}_{ijk}) \qquad (12)
C = \sum_{i} \sum_{j} \sum_{k} \sum_{\forall m,n} C^{qmn}_{ijk} \qquad (13)
where $\bar{P}^{qmn}_{ijk}$ is the observed probability (in data) that $m$ is more relevant than $n$. The summation in
Equation (13) is taken in this order – first, over all pairs of URLs ({m, n}) in the consideration set
for the k th search in session j for user i, next over all the searches within session j, then over all the
sessions for the user i, and finally over all users.
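To make the pairwise construction in Equations (11)-(13) concrete, the following sketch computes the Cross Entropy Cost for a single search under simplifying assumptions: the scores, relevance labels, and shape parameter sigma are hypothetical, and the observed probability is set to 1, 0, or 0.5 depending on which document in a pair is labeled more relevant.

```python
import math
from itertools import combinations

def pair_probability(s_m, s_n, sigma=1.0):
    """Equation (11): modeled probability that document m is more relevant than n."""
    return 1.0 / (1.0 + math.exp(-sigma * (s_m - s_n)))

def cross_entropy(scores, relevances, sigma=1.0):
    """Equations (12)-(13) for one search: sum the pairwise cost over all document pairs."""
    cost = 0.0
    for m, n in combinations(range(len(scores)), 2):
        p = pair_probability(scores[m], scores[n], sigma)
        # Observed probability P-bar: 1 if m is more relevant, 0 if n is, 0.5 if tied.
        if relevances[m] > relevances[n]:
            p_bar = 1.0
        elif relevances[m] < relevances[n]:
            p_bar = 0.0
        else:
            p_bar = 0.5
        cost += -p_bar * math.log(p) - (1 - p_bar) * math.log(1 - p)
    return cost

# Hypothetical scores f(x; w) and relevance labels for the documents of one query:
print(cross_entropy(scores=[1.2, 0.4, -0.3], relevances=[2, 0, 1]))
```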
If the function f (·) is known or specified by the researcher, the cost C can be minimized using
commonly available optimizers such as Newton-Raphson or BFGS. Traditionally, this has been the
approach adopted by marketing researchers since they are usually interested in inferring the impact of
some marketing activity (e.g., price, promotion) on an outcome variable (e.g., awareness, purchase)
and therefore assume the functional form of $f(\cdot)$. In contrast, when the goal is to predict the best
rank-ordering that maximizes consumers' click probability and dwell-time, assuming the functional
form of $f(\cdot)$ is detrimental to the predictive ability of the model. Moreover, manual experimentation
to arrive at an optimal functional form becomes exponentially difficult as the number of features
expands. This is especially the case if we want the function to be sufficiently flexible (non-linear) to
capture the myriad interaction patterns in the data. To address these challenges, we turn to solutions
from the machine learning literature such as MART that infer both the optimal parameters w and
function f (·) from the data.7
5.3
MART – Multiple Additive Regression Trees
MART is a machine learning algorithm that models a dependent or output variable as a linear
combination of a set of shallow regression trees (a process known as boosting). In this section, we
introduce the concepts of Classification and Regression Trees (CART) and boosted CART (referred
to as MART). We present a high-level overview of these models here and refer interested readers to
Murphy (2012) for details.
Figure 9: Example of a CART model.
5.3.1
Classification and Regression Trees
CART methods are a popular class of prediction algorithms that recursively partition the input space
corresponding to a set of explanatory variables into multiple regions and assign an output value for
each region. This kind of partitioning can be represented by a tree structure, where each leaf of the
tree represents an output region. Consider a dataset with two input variables {x1 , x2 }, which are used
to predict or model an output variable y using a CART. An example tree with three leaves (or output
regions) is shown in Figure 9. This tree first asks if x1 is less than or equal to a threshold t1 . If yes, it
assigns the value of 1 to the output y. If not (i.e., if x1 > t1 ), it then asks if x2 is less than or equal
to a threshold t2 . If yes, then it assigns y = 2 to this region. If not, it assigns the value y = 3 to this
region. The chosen y value for a region corresponds to the mean value of y in that region in the case of
a continuous output and the dominant y in case of discrete outputs.
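For concreteness, the three-leaf tree of Figure 9 corresponds to a piecewise-constant prediction rule of the following form; the leaf values 1, 2, and 3 come from the example in the text, while the numerical thresholds t1 and t2 used below are hypothetical.

```python
def cart_predict(x1, x2, t1=0.5, t2=1.5):
    """Three-leaf CART from Figure 9: split on x1 at node A, then on x2 at node B."""
    if x1 <= t1:        # node A
        return 1.0      # leaf 1: mean of y in this region (for a continuous output)
    elif x2 <= t2:      # node B
        return 2.0      # leaf 2
    else:
        return 3.0      # leaf 3

print(cart_predict(0.2, 3.0))   # 1.0
print(cart_predict(0.9, 1.0))   # 2.0
print(cart_predict(0.9, 2.0))   # 3.0
```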
Trees are trained or grown using a pre-defined number of leaves and by specifying a cost function
that is minimized at each step of the tree using a greedy algorithm. The greedy algorithm implies
that at each split, the previous splits are taken as given, and the cost function is minimized going
forward. For instance, at node B in Figure 9, the algorithm does not revisit the split at node A. It
however considers all possible splits on all the variables at each node. Thus, the split points at each
node can be arbitrary, the tree can be highly unbalanced, and variables can potentially repeat at later
child nodes. All of this flexibility in tree construction can be used to capture a complex set of
interactions, which are not predefined but are learned from the data.
CART is popular in the machine learning literature because it is scalable, is easy to interpret, can
handle a mixture of discrete and continuous inputs, is insensitive to monotone transformations, and
performs automatic variable selection (Murphy, 2012). However, it has accuracy limitations because
of its discontinuous nature and because it is trained using greedy algorithms. These drawbacks can be
addressed (while preserving all the advantages) through boosting, which gives us MART.
7
In standard marketing models, ws are treated as structural parameters that define the utility associated with an option. So
their directionality and magnitude are of interest to researchers. However, in this case there is no structural interpretation
of w. In fact, with complex trainers such as MART, the number of parameters is large and difficult to interpret.
5.3.2
Boosting
Boosting is a technique that can be applied to any classification or prediction algorithm to improve
its accuracy (Schapire, 1990). Applying the additive boosting technique to CART produces MART,
which has now been shown empirically to be the best classifier available (Caruana and Niculescu-Mizil,
2006; Hastie et al., 2009). MART can be viewed as performing gradient descent in the function
space using shallow regression trees (with a small number of leaves). MART works well because it
combines the positive aspects of CART with those of boosting. CART models, especially shallow
regression trees, tend to have high bias but low variance. Boosting CART models addresses the bias
problem while retaining the low variance. Thus, MART produces high quality classifiers.
MART can be interpreted as a weighted linear combination of a series of regression trees, each
trained sequentially to improve the final output using a greedy algorithm. MART’s output L(x) can
be written as:
L_N(x) = \sum_{n=1}^{N} \alpha_n\, l_n(x, \beta_n) \qquad (14)
where $l_n(x, \beta_n)$ is the function modeled by the $n$th regression tree and $\alpha_n$ is the weight associated
with the $n$th tree. Both the $l_n(\cdot)$s and the $\alpha_n$s are learned during training or estimation. MARTs are also
trained using greedy algorithms: $l_n(x, \beta_n)$ is chosen so as to minimize a pre-specified cost function,
which is usually the least-squares error in the case of regression and an entropy or logit loss function
in the case of classification or discrete choice models.
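The following sketch illustrates the boosting idea behind Equation (14) for a squared-error cost: each shallow regression tree is fit to the residuals of the current ensemble and added with a fixed weight. It is a simplified stand-in for MART, not the implementation used in the paper; the data, shrinkage weight alpha, and tree size are hypothetical, and scikit-learn's DecisionTreeRegressor serves as the weak learner.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_mart(X, y, n_trees=100, n_leaves=8, alpha=0.1):
    """Boosted shallow regression trees: L_N(x) = sum_n alpha * l_n(x), in the spirit of Equation (14)."""
    trees, prediction = [], np.zeros(len(y))
    for _ in range(n_trees):
        residual = y - prediction                      # gradient of the squared loss
        tree = DecisionTreeRegressor(max_leaf_nodes=n_leaves)
        tree.fit(X, residual)                          # greedy fit of the n-th shallow tree
        prediction += alpha * tree.predict(X)          # add the weighted tree to the ensemble
        trees.append(tree)
    return trees

def predict_mart(trees, X, alpha=0.1):
    return sum(alpha * t.predict(X) for t in trees)

# Hypothetical data: the output depends non-linearly on two features.
rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 2))
y = np.sin(3 * X[:, 0]) + (X[:, 1] > 0.5) + rng.normal(scale=0.1, size=500)
trees = fit_mart(X, y)
print(np.mean((predict_mart(trees, X) - y) ** 2))      # in-sample squared error
```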
5.4
Learning Algorithms for Ranking
We now discuss two Learning to Rank algorithms that can be used to improve search quality – RankNet
and LambdaMART. Both of these algorithms can use MART as the underlying classifier.
5.4.1
RankNet – Pairwise learning with gradient descent
In an influential paper, Burges et al. (2005) proposed one of the earliest ranking algorithms – RankNet.
Versions of RankNet are used by commercial search engines such as Bing (Liu, 2011).
RankNet uses a gradient descent formulation to minimize the Cross-entropy function defined in
Equations (12) and (13). Recall that we use i to index users, j to index sessions, k to index the search
within a session, q to denote the query associated with a search, and u to denote the URL of a given
search result. The score and entropy functions are defined for two documents m and n.
Then, let $S^{qmn}_{ijk} \in \{0, 1, -1\}$, where $S^{qmn}_{ijk} = 1$ if $m$ is more relevant than $n$, $-1$ if $n$ is more
relevant than $m$, and $0$ if they are equally relevant. Thus, we have $\bar{P}^{qmn}_{ijk} = \frac{1}{2}(1 + S^{qmn}_{ijk})$, and we can
rewrite $C^{qmn}_{ijk}$ as:
C^{qmn}_{ijk} = \frac{1}{2}(1 - S^{qmn}_{ijk})\,\sigma(s^{qm}_{ijk} - s^{qn}_{ijk}) + \log\left(1 + e^{-\sigma(s^{qm}_{ijk} - s^{qn}_{ijk})}\right) \qquad (15)
Note that the Cross Entropy function is symmetric; it is invariant to swapping $m$ and $n$ and flipping the
sign of $S^{qmn}_{ijk}$.8 Further, we have:
C^{qmn}_{ijk} = \log\left(1 + e^{-\sigma(s^{qm}_{ijk} - s^{qn}_{ijk})}\right) \quad \text{if } S^{qmn}_{ijk} = 1
C^{qmn}_{ijk} = \log\left(1 + e^{-\sigma(s^{qn}_{ijk} - s^{qm}_{ijk})}\right) \quad \text{if } S^{qmn}_{ijk} = -1
C^{qmn}_{ijk} = \frac{1}{2}\sigma(s^{qm}_{ijk} - s^{qn}_{ijk}) + \log\left(1 + e^{-\sigma(s^{qm}_{ijk} - s^{qn}_{ijk})}\right) \quad \text{if } S^{qmn}_{ijk} = 0 \qquad (16)
The gradient of the Cross Entropy Cost function is:
\frac{\partial C^{qmn}_{ijk}}{\partial s^{qm}_{ijk}} = -\frac{\partial C^{qmn}_{ijk}}{\partial s^{qn}_{ijk}} = \sigma\left[\frac{1}{2}(1 - S^{qmn}_{ijk}) - \frac{1}{1 + e^{-\sigma(s^{qm}_{ijk} - s^{qn}_{ijk})}}\right] \qquad (17)
We can now employ a stochastic gradient descent algorithm to update the parameters, w, of the model,
as follows:
w \rightarrow w - \eta \frac{\partial C^{qmn}_{ijk}}{\partial w} = w - \eta\left(\frac{\partial C^{qmn}_{ijk}}{\partial s^{qm}_{ijk}}\frac{\partial s^{qm}_{ijk}}{\partial w} + \frac{\partial C^{qmn}_{ijk}}{\partial s^{qn}_{ijk}}\frac{\partial s^{qn}_{ijk}}{\partial w}\right) \qquad (18)
where η is a learning rate specified by the researcher. The advantage of splitting the gradient as shown
in Equation (18), instead of working directly with $\partial C^{qmn}_{ijk}/\partial w$, is that it speeds up the gradient calculation by
allowing us to use the analytical formulations for the component partial derivatives.
Most recent implementations of RankNet use the following factorization to improve speed.
\frac{\partial C^{qmn}_{ijk}}{\partial w} = \lambda^{qmn}_{ijk}\left(\frac{\partial s^{qm}_{ijk}}{\partial w} - \frac{\partial s^{qn}_{ijk}}{\partial w}\right) \qquad (19)
where
\lambda^{qmn}_{ijk} = \frac{\partial C^{qmn}_{ijk}}{\partial s^{qm}_{ijk}} = \sigma\left[\frac{1}{2}(1 - S^{qmn}_{ijk}) - \frac{1}{1 + e^{-\sigma(s^{qm}_{ijk} - s^{qn}_{ijk})}}\right] \qquad (20)
Given this cost and gradient formulation, MART can now be used to construct a sequence of
regression trees that sequentially minimize the cost function defined by RankNet. The nth step (or
tree) of MART can be conceptualized as the nth gradient descent step that refines the function f (·) in
order to lower the cost function. At the end of the process, both the optimal f (·) and ws are inferred.
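A minimal sketch of the λ computation for one document pair, following the form of Equation (20); the scores, the label S, and sigma in the usage example are hypothetical.

```python
import math

def ranknet_lambda(s_m, s_n, S_mn, sigma=1.0):
    """Pairwise lambda following the form of Equation (20).
    S_mn is 1 if m is more relevant than n, -1 if less relevant, 0 if equally relevant."""
    return sigma * (0.5 * (1 - S_mn) - 1.0 / (1.0 + math.exp(-sigma * (s_m - s_n))))

# Hypothetical scores for a pair where m is labeled more relevant than n (S_mn = 1);
# |lambda| is the magnitude of the pull applied to the pair.
print(ranknet_lambda(s_m=0.2, s_n=0.8, S_mn=1))
```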
While RankNet has been successfully implemented in many settings, it suffers from three
significant disadvantages. First, because it is a pairwise metric, in cases where there are multiple
levels of relevance, it doesn’t use all the information available. For example, consider two sets of
documents {m1 , n1 } and {m2 , n2 } with the following relevance values: {1, 0} and {2, 0}. In both
cases, RankNet only utilizes the information that m should be ranked higher than n and not the actual
relevance scores, which makes it inefficient. Second, it does not differentiate mistakes at the
top of the list from those at the bottom. This is problematic if the goal is to penalize mistakes at the top
of the list more than those at the bottom. Third, it optimizes the Cross Entropy Cost function, and not
NDCG, which is our true objective function. This mismatch between the true objective function and
the optimized objective function implies that RankNet is not the best algorithm to optimize NDCG.
8
Asymptotically, $C^{qmn}_{ijk}$ becomes linear if the scores give the wrong ranking and zero if they give the correct ranking.
5.4.2
LambdaMART: Optimizing for NDCG
Note that RankNet’s speed and versatility comes from its gradient descent formulation, which requires
the loss or cost function to be twice differentiable in model parameters. Ideally, we would like to
preserve the gradient descent formulation because of its attractive convergence properties. However,
we need to find a way to optimize for NDCG within this framework, even though it is a discontinuous
optimization metric that does not lend itself easily to gradient descent easily. LambdaMART addresses
this challenge (Burges et al., 2006; Wu et al., 2010).
The key insight of LambdaMART is that we can directly train a model without explicitly specifying
a cost function, as long as we have gradients of the cost and these gradients are consistent with the
cost function that we wish to minimize. The gradient functions, $\lambda^{qmn}_{ijk}$s, in RankNet can be interpreted
as directional forces that pull documents in the direction of increasing relevance. If a document $m$ is
more relevant than $n$, then $m$ gets an upward pull of magnitude $|\lambda^{qmn}_{ijk}|$ and $n$ gets a downward pull of
the same magnitude.
LambdaMART expands this idea of gradients functioning as directional forces by weighing them
with the change in NDCG caused by swapping the positions of m and n, as follows:
\lambda^{qmn}_{ijk} = \frac{-\sigma}{1 + e^{-\sigma(s^{qm}_{ijk} - s^{qn}_{ijk})}}\,\left|\Delta NDCG^{qmn}_{ijk}\right| \qquad (21)
Note that $\Delta NDCG^{qmn}_{ijk} = \Delta NDCG^{qnm}_{ijk} = 0$ when $m$ and $n$ have the same relevance; so $\lambda^{qmn}_{ijk} = 0$ in those
cases. (This is the reason why the term $\frac{\sigma}{2}(1 - S^{qmn}_{ijk})$ drops out.) Thus, the algorithm does not spend
effort swapping two documents of the same relevance.
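The sketch below illustrates the bookkeeping behind Equation (21) for a single query under simplifying assumptions: the documents are assumed to be listed in their current ranked order, the NDCG helper mirrors Equation (10), and the scores, labels, and sigma are hypothetical. It is meant to convey the idea of NDCG-weighted pairwise gradients, not to reproduce the RankLib implementation used in the paper.

```python
import math
from itertools import combinations

def ndcg(rels, P=10):
    """NDCG@P as in Equation (10)."""
    def dcg(rs):
        return sum((2 ** r - 1) / math.log2(p + 2) for p, r in enumerate(rs[:P]))
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal > 0 else 0.0

def lambdamart_lambdas(scores, relevances, sigma=1.0, P=10):
    """Pairwise lambdas in the spirit of Equation (21), accumulated per document.
    Both lists are assumed to be in the current ranked order (position 1 first)."""
    forces = [0.0] * len(scores)
    for i, j in combinations(range(len(scores)), 2):
        if relevances[i] == relevances[j]:
            continue                            # |Delta NDCG| = 0, so the pair contributes nothing
        m, n = (i, j) if relevances[i] > relevances[j] else (j, i)   # m = more relevant document
        swapped = list(relevances)
        swapped[m], swapped[n] = swapped[n], swapped[m]
        delta = abs(ndcg(swapped, P) - ndcg(relevances, P))          # NDCG change from swapping m and n
        lam = -sigma / (1.0 + math.exp(-sigma * (scores[m] - scores[n]))) * delta
        forces[m] += lam                        # contribution of the pair to the more relevant document
        forces[n] -= lam                        # equal and opposite contribution to the other document
    return forces

# Hypothetical query where the most relevant document is currently ranked last:
print(lambdamart_lambdas(scores=[0.9, 0.5, 0.1], relevances=[0, 1, 2]))
```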
Given that we want to maximize NDCG, suppose that there exists a utility or gain function $C^{qmn}_{ijk}$
that satisfies the above gradient equation. (We refer to $C^{qmn}_{ijk}$ as a gain function in this context, as
opposed to a cost function, since we are looking to maximize NDCG.) Then, we can continue to use
gradient descent to update our parameter vector $w$ as follows:
w \rightarrow w + \eta \frac{\partial C^{qmn}_{ijk}}{\partial w} \qquad (22)
Using Poincare’s Lemma and a series of empirical results, Yue and Burges (2007) and Donmez et al.
(2009), have shown that the gradient functions defined in Equation (21) correspond to a gain function
that maximizes NDCG. Thus, we can use these gradient formulations, and follow the same set of
25
steps as we did for RankNet to maximize NDCG.9
LambdaMART has been successfully used in many settings (Chapelle and Chang, 2011) and we
refer interested readers to Burges (2010) for a detailed discussion on the implementation and relative
merits of these algorithms. In our analysis, we use the RankLib implementation of LambdaMART
(Dang, 2011), which implements the above algorithm and learns the functional form of f (·) and the
parameters (ws) using MART.
6
Implementation
We now discuss the details of implementation and data preparation.
6.1
Datasets – Training, Validation, and Testing
Supervised machine learning algorithms (such as ours) typically require three datasets – training,
validation, and testing. Training data is the set of pre-classified data used for learning or estimating
the model, and inferring the model parameters. In marketing settings, usually the model is optimized
on the training dataset and its performance is then tested on a hold-out sample. However, this leads the
researcher to pick a model with great in-sample performance, without optimizing for out-of-sample
performance. In contrast, in most machine learning settings (including ours), the goal is to pick
a model with the best out-of-sample performance, hence the validation data. So at each step of
the optimization (which is done on the training data), the model’s performance is evaluated on the
validation data. In our analysis, we optimize the model on the training data, and stop the optimization
only when we see no improvement in its performance on the validation data for at least 100 steps.
Then we go back and pick the model that offered the best performance on the validation data. Because
both the training and validation datasets are used in model selection, we need a new dataset for testing
the predictive accuracy of the model. This is done on a separate hold-out sample, which is labeled
as the test data. The results from the final model on the test data are considered the final predictive
accuracy achieved by our framework.
The features described in §4.1 can be broadly categorized into two sets – (a) global features,
which are common to all the users in the data, and (b) user-specific features that are specific to a user
(could be within or across session features). In real business settings, managers typically have data
on both global and user-level features at any given point in time. Global features are constructed
by aggregating the behavior of all the users in the system. Because this requires aggregation over
millions of users and data stored across multiple servers and geographies, it is not feasible to keep this
up-to-date on a real-time basis. So firms usually update this data on a regular (weekly or biweekly)
basis. Further, global data is useful only if has been aggregated over a sufficiently long period of time.
In our setting, we cannot generate reliable global features for the first few days of observation because
9
The original version of LambdaRank was trained using neural networks. To improve its speed and performance, Wu et al.
(2010) proposed a boosted tree version of LambdaRank – LambdaMART, which is faster and more accurate.
Figure 10: Data-generation framework. (Sessions from days 1--25 form the global data; sessions of users observed in days 26--27 are split into training, validation, and test data.)
of insufficient history. In contrast, user-specific features have to be generated on a real-time basis and
need to include all the past activity of the user, including earlier activity in the same session. This
is necessary to ensure that we can make personalized recommendations for each user. Given these
constraints and considerations, we now explain how we partition the data and generate these features.
Recall that we have a snapshot of 27 days of data from Yandex. Figure 10 presents a schema of
how the three datasets (training, validation, test) are generated from the 27 days of data. First, using
the entire data from the first 25 days, we generate the global features for all three datasets. Next, we
split all the users observed in the last two days (days 26 − 27) into three random samples (training,
validation, and testing).10 Then, for each observation of a user, all the data up to that observation
is used to generate the user/session-level features. For example, Sessions A1, A2, B1, B2, and C1
are used to generate global features (such as the listing frequency and CTRs at the query, URL, and
domain level) for all three datasets. Thus, the global features for all three datasets do not include
information from sessions in days 26 − 27 (Sessions A3, C2, and C3). Next, for user A in the training
data, the user level features at any given point in Session A3 are calculated based on her entire history
starting from Session A1 (when she is first observed) till her last observed action in Session A3. User
B is not observed in the last two days. So her sessions are only used to generate aggregate global
features, i.e., she does not appear at the user-level in the training, validation, or test datasets. For user
C, the user-level features at any given point in Sessions C2 or C3 are calculated based on her entire
history starting from Session C1.
10
We observe a total of 1,174,096 users in days 26−27, who participate in 2,327,766 sessions. Thus, each dataset (training,
validation, and testing) has approximately one third of these users and sessions in it.
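A minimal sketch of this partitioning logic, assuming a hypothetical log of (user, day) session records: users seen in days 26-27 are split at random into training, validation, and test sets, while users seen only in days 1-25 contribute to the global features alone.

```python
import random

def partition_users(sessions, seed=0):
    """sessions: list of (user_id, day) tuples. Returns the day-26/27 users split into thirds."""
    recent_users = sorted({u for u, day in sessions if day >= 26})
    random.Random(seed).shuffle(recent_users)
    third = len(recent_users) // 3
    return (set(recent_users[:third]),             # training users
            set(recent_users[third:2 * third]),    # validation users
            set(recent_users[2 * third:]))         # test users

# Hypothetical toy log: users B and F appear only before day 26, so they feed global features only.
sessions = [("A", 26), ("B", 5), ("C", 27), ("D", 26), ("E", 27), ("F", 12), ("G", 26)]
train, validation, test = partition_users(sessions)
print(train, validation, test)
```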
Dataset      NDCG Baseline    NDCG with Personalization    % Gain
Dataset 1    0.7937           0.8121                       2.32%
Dataset 2    0.5085           0.5837                       14.78%
Table 4: Gains from Personalization
6.2
Data Preparation
Before constructing the three datasets (training, validation, and test) described above, we clean the
data to accommodate the test objective. Testing whether personalization is effective or not is only
possible if there is at least one click. So following Yandex’s recommendation, we filter the datasets
in two ways. First, for each user observed in days 26−27, we include only her queries from the
test period that have at least one click with a dwell time of 50 units or more (so that there is at least one
relevant or highly relevant document). Second, if a query in this period does not have any short-term
context (i.e., it is the first one in the session) and the user is new (i.e., she has no search sessions in the
training period), then we do not consider the query since it has no information for personalization.
Additional analysis of the test data shows that about 66% of the clicks observed are already for
the first position. Personalization cannot further improve user experience in these cases because the
most relevant document already seems to have been ranked at the highest position. Hence, to get a
clearer picture of the value of personalization, we need to focus on situations where the user clicks on
a document in a lower position (in the current ranking) and then examine if our proposed personalized
ranking improves the position of the clicked documents. Therefore, we consider two test datasets – 1)
the first is the entire test data, which we denote as Dataset 1, and 2) the second is a subset that only
consists of queries for which clicks are at position 2 or lower, which we denote as Dataset 2.
6.3
Model Training
We now comment on some issues related to training our model. The performance of LambdaMART
depends on a set of tunable parameters. For instance, the researcher has control over the number of
regression trees used in the estimation, the number of leaves in a given tree, the type of normalization
employed on the features, etc. It is important to ensure that the tuning parameters lead to a robust
model. A non-robust model usually leads to overfitting; that is, it leads to higher improvement in the
training data (in-sample fit), but lower improvements in validation and testing datasets (out-of-sample
fits). We take two key steps to ensure that the final model is robust. First, we normalize the features
using their z-scores to make the feature scales comparable. Second, we experimented with a large
number of specifications for the number of trees and leaves in a tree, and found that 1,000 trees and 20
leaves per tree was the most robust combination.
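A small sketch of the z-score normalization step; as an assumption on our part, the means and standard deviations are computed on the training features and then reused unchanged for the validation and test features.

```python
import numpy as np

def zscore_fit(X_train):
    """Column-wise means and standard deviations of the training features."""
    mu = X_train.mean(axis=0)
    sd = X_train.std(axis=0)
    sd[sd == 0] = 1.0                          # guard against constant features
    return mu, sd

def zscore_apply(X, mu, sd):
    return (X - mu) / sd

# Hypothetical feature matrix (rows = query-URL pairs, columns = the 293 features).
X_train = np.random.default_rng(1).uniform(size=(1000, 293))
mu, sd = zscore_fit(X_train)
X_norm = zscore_apply(X_train, mu, sd)
print(X_norm.mean(axis=0)[:3], X_norm.std(axis=0)[:3])   # approximately 0 and 1
```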
7
Results
7.1
Gains from Personalization
After training and validating the model, we test it on the two datasets described earlier – Dataset 1
and Dataset 2. The results are presented in Table 4. First, for Dataset 1, we find that personalization
improves the NDCG from a baseline (with SERPRank as the only feature) of 0.7937 to 0.8121, which
represents a 2.32% improvement over the baseline. Since 66% of the clicks in this dataset are already
for the top-most position, it is hard to interpret the value of personalization from these numbers. So
we focus on Dataset 2, which consists of queries that received clicks at positions 2 or lower, i.e., only
those queries where personalization can play a role in improving the search results. Here, we find the
results promising – compared to the baseline, our personalized ranking algorithm leads to a 14.78%
increase in search quality, as measured by NDCG.
These improvements are significant in the search industry. Recall that the baseline ranking
captures a large number of factors, including the relevance of page content, page rank, and the
link-structure of the webpages, which are not available to us. It is therefore notable that we are able to
further improve the ranking accuracy by incorporating personalized user/session-specific features.
Our overall improvement is higher than that achieved by the winning teams from the Kaggle contest
(Kaggle, 2014). This is in spite of the fact that the contestants used a longer dataset for training their
models.11 We conjecture that this is due to our use of a comprehensive set of personalized features in
conjunction with the parallelized LambdaMART algorithm, which allows us to exploit the complete
scale of the data. The top three teams in the contest each released a technical report discussing their
respective methodologies; interested readers may refer to Masurel et al. (2014), Song (2014), and
Volkovs (2014) for details on their approaches.
7.2
Value of Personalization from the Firm’s and Consumers’ Perspectives
So far we presented the results purely in terms of NDCG. However, the NDCG metric is hard to
interpret, and it is not obvious how changes in NDCG translate to differences in rank-orderings and
click-through-rates (CTRs) in a personalized page. So we now present some results on these metrics.
The first question we consider is this – From a consumer’s perspective, how different does the
SERP look when we deploy the personalized ranking algorithm compared to what it looks like under
the current non-personalized algorithm? Figures 11 and 12 summarize the shifts after personalization
for Dataset 1 and Dataset 2, respectively. For both datasets, over 55% of the documents are re-ordered,
which is a staggering number of re-orderings. (Note that the percentage of results (URLs) with zero
change in position is 44.32% for Dataset 1 and 41.85% for Dataset 2.) Approximately 25% of the
documents are re-ordered by one position (an increase or a decrease of one position), and the remaining
30% are re-ordered by two positions or more. Re-orderings by more than five positions are rare.
Overall, these results suggest that the rank-ordering of results in the personalized SERP is likely to
be quite different from the non-personalized SERP. These findings highlight the fact that even small
improvements in NDCG lead to significant changes in rank-orderings, as perceived by consumers.
Figure 11: Histogram of change in position for the results in Dataset 1.
Figure 12: Histogram of change in position for the results in Dataset 2.
11
The contestants used all the data from days 1−27 for building their models and had their models tested by Yandex on
an unmarked dataset (with no information on clicks made by users) from days 28−30. In contrast, we had to resort to
using the dataset from days 1−27 for both model building and testing. So the amount of past history we have for each
user in the test data is smaller, putting us at a relative disadvantage (especially since we find that longer histories lead to
higher NDCGs). Thus, it is noteworthy that our improvement is higher than that achieved by the winning teams.
Figure 13: Expected increase in CTR as a function of position for Dataset 1. CTRs are expressed in percentages; therefore the changes in CTRs are also expressed in percentages.
Figure 14: Expected increase in CTR as a function of position for Dataset 2. CTRs are expressed in percentages; therefore the changes in CTRs are also expressed in percentages.
Next, we consider the returns to personalization from the firm’s perspective. User satisfaction
depends on the search engine’s ability to serve relevant results at the top of a SERP. So any algorithm
that increases CTRs for top positions is valuable. We examine how position-based CTRs are expected
to change with personalization compared to the current non-personalized ranking. Figures 13 and 14
summarize our findings for Datasets 1 and 2. The y-axis depicts the difference in CTRs (in percentage)
after and before personalization, and the x-axis depicts the position. Since both datasets only include
queries with one or more clicks, the total number of clicks before and after personalization is the
same. (The sum of the changes in CTRs across all positions is zero.) However, after personalization,
the documents are re-ordered so that more relevant documents are displayed higher up. Thus, the
CTR for the topmost position increases, while those for lower positions decrease.
In both datasets, we see a significant increase in the CTR for position 1. For Dataset 1, the CTR for
position 1 increases by 3.5%. For Dataset 2, which consists of queries receiving clicks at positions 2 or
lower, this increase is even more substantial: the CTR for the first position is now higher by 18%. For
Dataset 1, the increase in CTR for position 1 mainly comes from positions 3 and 10, whereas for
Dataset 2, it comes from positions 2 and 3.
Overall, we find that deploying a personalization algorithm would lead to an increase in CTRs
for higher positions, thereby reducing the need for consumers to scroll further down the SERP for
relevant results. This in turn would increase user satisfaction and reduce switching – both of which
are valuable outcomes from the search engine's perspective.
Figure 15: NDCG improvements over the baseline for different lower-bounds on the number of prior queries by the user for Dataset 1.
Figure 16: NDCG improvements over the baseline for different lower-bounds on the number of prior queries by the user for Dataset 2.
7.3
Value of Search History
Next, we examine the returns to using longer search histories for personalization. Even though we
have less than 26 days of history on average, the median query (by a user) in the test datasets has 49
prior queries, and the 75th percentile has 100 prior queries.12 We find that this accumulated history
pays significant dividends in improving search quality, as shown in Figures 15 and 16. When we set
the lower-bound for past queries to zero, we get the standard 2.32% improvement over the baseline
for Dataset 1 and 14.78% for Dataset 2 (Table 4), as shown by the points on the y-axis in Figures 15
and 16. When we restrict the test data to only contain queries for which the user has issued at least 50
past queries (which is close to the median), we find that the NDCG improves by 2.76% for Dataset
1 and 17% for Dataset 2. When we increase the lower-bound to 100 queries (the 75th percentile for
our data), NDCG improves by 3.03% for Dataset 1 and 18.51% for Dataset 2, both of which are quite
significant when compared to the baseline improvements for the entire datasets.
12
The number of past queries for the median query (50) is a different metric than the one discussed earlier, the median
number of queries per user (shown in Figure 4 as 16), for two reasons. First, this number is calculated only for queries
that have one or more prior queries (see data preparation in §6.2). Second, not all users submit the same number of
queries. Indeed, a small subset of users account for most queries. So the average query has a larger median history than
the average user.
These findings can be summarized into two main patterns. First, the search history is more
valuable when we focus on searches where the user has to go further down in the SERP to look for
relevant results. Second, there is a monotonic, albeit concave increase in NDCG as a function of
the length of search history. These findings re-affirm the value of storing and using personalized
browsing histories to improve users’ search experience. Indeed, firms are currently struggling with
the decision of how much history to store, especially since storage of these massive personal datasets
can be extremely costly. By quantifying the improvement in search results for each additional piece
of past history, we provide firms with tools to decide whether the costs associated with storing longer
histories are offset by the improvements in consumer experience.
7.4
Comparison of Features
We now examine the value of different types of features in improving search results using a comparative analysis. In order to perform this analysis, we start with the full model, then remove a specific
feature set (as described in §4.2) and evaluate the loss in NDCG prediction accuracy as a consequence
of the removal. The percentage loss is defined as:
\%\,Loss = \frac{NDCG_{full} - NDCG_{current}}{NDCG_{full} - NDCG_{base}} \times 100 \qquad (23)
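For example, applying Equation (23) to the user-feature row of Table 5 recovers the reported loss:

```python
def percentage_loss(ndcg_full, ndcg_current, ndcg_base):
    """Equation (23): share of the full model's improvement lost when a feature set is removed."""
    return (ndcg_full - ndcg_current) / (ndcg_full - ndcg_base) * 100

# Removing user features (Table 5): NDCG drops from 0.8121 to 0.8023 against a 0.7937 baseline.
print(round(percentage_loss(0.8121, 0.8023, 0.7937), 2))   # 53.26
```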
Note that all the models always have SERPRank as a feature. Table 5 provides the results from this
experiment.
We observe that by eliminating the Global features, we lose 33.15% of the improvement that we
gain from using the full feature set. This suggests that the interaction effects between global and
personalized features play an important role in improving the accuracy of our algorithm. Next, we
find that eliminating user-level features results in a loss of a whopping 53.26%. Thus, over 50% of
the improvement we provide comes from using personalized user history. In contrast, eliminating
session-level features while retaining across-session user-level features results in a loss of only 3.80%.
This implies that across-session features are more valuable than within-session features and that
user behavior is stable over longer time periods. Thus, recommendation engines can forgo real-time
session-level data (which is costly to incorporate in terms of computing time and deployment) without
much loss in prediction accuracy.
Feature set removed           NDCG      % Loss
Full set                      0.8121    –
− Global features (FG)        0.8060    33.15
− User features (FP)          0.8023    53.26
− Session features (FS)       0.8114    3.80
− URL features (FU)           0.8101    10.87
− Domain features (FD)        0.8098    12.50
− Query features (FQ)         0.8106    8.15
− Term features (FT)          0.8119    1.09
− Click features (FC)         0.8068    28.80
− Position features (FR)      0.8116    2.72
− Skip features (FK)          0.8118    1.63
− Posn-click features (FN)    0.8120    0.54
− Dwell features (FW)         0.8112    4.89
Table 5: Loss in prediction accuracy when different feature sets are removed from the full model.
We also observe that eliminating domain name based features results in a significant loss in
prediction accuracy, as does the elimination of URL level features. This is because users prefer search
results from domains and URLs that they trust. Search engines can thus track user preferences over
domains and use this information to optimize search ranking. Query-specific features can also be
quite informative (8.15% loss in accuracy). This could be because they capture differences in user
behavior across queries.
We also find that features based on users’ past clicking and dwelling behavior can be informative,
with Click-specific features being the most valuable (with a 28.8% loss in accuracy). This is likely
because users tend to exhibit persistence in how they click on links – both across positions (some
users may only click on the first two positions while others may click on multiple positions) and
across URLs and domains. (For a given query, some users may have persistence in taste for certain
URLs; e.g., some users might be using a query for navigational purposes while others might be using
it to get information/news.) History on past clicking behavior can inform us of these proclivities and
help us improve the results shown to different users.
In sum, these findings give us insight into the relative value of different types of features. This can
help businesses decide which types of information to incorporate into their search algorithms and
which types of information to store for longer periods of time.
7.5
Comparison with Pairwise Maximum Likelihood
We now compare the effectiveness of machine learning based approaches to this problem to the
standard technique used in the marketing literature – Maximum Likelihood. We estimate a pairwise
Maximum Likelihood model with a linear combination of the set of 293 features used previously.
Note that this is similar to a RankNet model where the functional form f (·) is specified as a linear
combination of the features, and we are simply inferring the parameter vector w. Thus, it is a pairwise
optimization that does not use any machine learning algorithms.
                           Pairwise Maximum Likelihood        LambdaMART
                           Single feature   Full features     Single feature   Full features
NDCG on training data      0.7930           0.7932            0.7930           0.8141
NDCG on validation data    0.7933           0.7932            0.7933           0.8117
NDCG on test data          0.7937           0.7936            0.7937           0.8121
Table 6: Comparison with Pairwise Maximum Likelihood
We present the results of this analysis in Table 6. For both Maximum Likelihood and LambdaMART, we report results for a single feature model (with just SERPRank) and a model with all the
features. In both methods, the single feature model produces the baseline NDCG of 0.7937 on the
test data, which shows that both methods perform as expected when used with no additional features.
Also, as expected, in the case of full feature models, both methods show improvement on the training
data. However, the improvement is very marginal in the case of pairwise Maximum Likelihood, and
completely disappears when we take the model to the test data. While LambdaMART also suffers
from some loss of accuracy when moving from training to test data, much of its improvement is
sustained in the test data, which turns out to be quite significant. Together these results suggest that
Maximum Likelihood is not an appropriate technique for large-scale predictive models.
The lackluster performance of the Maximum Likelihood model stems from two issues. First,
because it is a pairwise gradient-based method, it does not maximize NDCG. Second, it does not
allow for the fully flexible functional forms and the large number of non-linear interactions that machine
learning techniques such as MART do.
7.6
Scalability
We now examine the scalability of our approach to large datasets. Note that there are three aspects
of modeling – feature generation, testing, and training – all of which are computing and memory
intensive. In general, feature generation is more memory intensive (requiring at least 140GB of
memory for the full dataset), and hence it is not parallelized. The training module is more computing
intensive. The RankLib package that we use for training and testing is parallelized and can run on
multi-core servers, which makes it scalable. All the stages (feature generation, training, and testing)
were run on Amazon EC2 nodes since the computational load and memory requirements for these
analyses cannot be handled by regular PCs. The specification of the compute server used on Amazon’s
EC2 node is – Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz with 32 cores and 256GB of memory.
This was the fastest server with the highest memory available at Amazon EC2.
We now discuss the scalability of each module – feature generation, training, and testing – one by
one, using Figures 17, 18 and 19. The runtimes are presented as the total CPU time added up over all
the 32 cores. To construct the 20% dataset, we continue to use all the users and sessions from days
1 − 25 to generate the global features, but use only 20% of the users from days 26 − 27 to generate
user- and session-specific features.
Figure 17: Runtimes for generating the features on different percentages of the data.
Figure 18: Runtimes for training the model on different percentages of the data.
Figure 19: Runtimes for testing the model on different percentages of the data.
Figure 20: NDCG improvements from using different percentages of the data.
For the full dataset, the feature generation module takes about 83 minutes on a single-core
computer (see Figure 17), the training module takes over 11 days (see Figure 18), and the testing
module takes only 35 minutes (see Figure 19). These numbers suggest that training is the primary
bottleneck in implementing the algorithm, especially if the search engine wants to update the trained
model on a regular basis. So we parallelize the training module over a 32-core server. This reduces the
training time to about 8.7 hours. Overall, with sufficient parallelization, we find that the framework
can be scaled to large datasets in reasonable time-frames.
Next, we compare the computing cost of going from 20% of the data to the full dataset. We find
that it increases by 2.38 times for feature generation, by 20.8 times for training, and by 11.6 times for
testing. Thus, training and testing costs increase super-linearly, whereas the cost of feature generation
increases sub-linearly. These increases in computing costs, however, bring a significant improvement
in model performance, as measured by NDCG (see Figure 20). Overall, these measurements affirm the
value of using large-scale datasets for personalized recommendations, especially since the cost of increasing
the size of the data is paid mostly at the training stage, which is easily parallelizable.
8
Feature Selection
We now discuss the third and final step of our empirical approach. As we saw in the previous
section, even when parallelized on 32-core servers, personalization algorithms can be very computing-intensive. Therefore, we now examine feature selection, where we choose a subset of
features that minimizes computing costs with minimal loss in the predictive accuracy of the algorithm. We
start by describing the background literature and frameworks for feature selection, then describe its
application to our context, followed by the results.
Feature selection is a standard step in machine learning settings that involve supervised
learning, for two reasons (Guyon and Elisseeff, 2003). First, it can provide a faster and more
cost-effective (i.e., more efficient) model by eliminating less relevant features with minimal loss in
accuracy. Efficiency is particularly relevant when training large datasets such as ours. Second, it
provides more comprehensible models, which offer a better understanding of the underlying data
generating process.13
The goal of feature selection is to find the smallest set of features that can provide a fixed predictive
accuracy. In principle, this is a straightforward problem since it simply involves an exhaustive search
of the feature space. However, even with a moderately large number of features, this is practically
impossible. With $F$ features, an exhaustive search requires $2^F$ runs of the algorithm on the training
dataset, which is exponentially increasing in $F$. In fact, this problem is known to be NP-hard (Amaldi
and Kann, 1998).
8.1
Wrapper Approach
The Wrapper method addresses this problem by using a greedy algorithm. We describe it briefly
here and refer interested readers to Kohavi and John (1997) for a detailed review. Wrappers can
be categorized into two types – forward selection and backward elimination. In forward selection,
features are progressively added until a desired prediction accuracy is reached or until the incremental
improvement is very small. In contrast, a backward elimination Wrapper starts with all the features
and sequentially eliminates the least valuable features. Both these Wrappers are greedy in the sense
that they do not revisit former decisions to include (in forward selection) or exclude features (in
backward elimination). In our implementation, we employ a forward-selection Wrapper.
13
In cases where the datasets are small and the number of features large, feature selection can actually improve the
predictive accuracy of the model by eliminating irrelevant features whose inclusion often results in overfitting. Many
machine learning algorithms, including neural networks, decision trees, CART, and Naive Bayes learners have been
shown to have significantly worse accuracy when trained on small datasets with superfluous features (Duda and Hart,
1973; Aha et al., 1991; Breiman et al., 1984; Quinlan, 1993). In our setting, we have an extremely large dataset, which
prevents the algorithm from making spurious inferences in the first place. So feature selection is not accuracy-improving
in our case; however, it leads to significantly lower computing times.
To enable a Wrapper algorithm, we also need to specify a selection rule as well as a stopping rule. We
use the robust Best-first selection rule (Ginsberg, 1993), wherein the most promising node is selected
at every decision point. For example, in a forward-selection algorithm with 10 features, at the first
node, this algorithm considers 10 versions of the model (each with one of the features added) and
then picks the feature whose addition offers the highest prediction accuracy. Further, we use the
following stopping rule, which is quite conservative. Let $NDCG_{base}$ be the NDCG from the base
model currently used by Yandex, which only uses SERPRank. Similarly, let $NDCG_{current}$ be the
improved NDCG at the current iteration, and $NDCG_{full}$ be the NDCG that can be obtained
from using a model with all the 293 features. Then our stopping rule is:
\frac{NDCG_{current} - NDCG_{base}}{NDCG_{full} - NDCG_{base}} > 0.95 \qquad (24)
According to this stopping rule, the Wrapper algorithm stops adding features when it achieves 95% of
the improvement offered by the full model.
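A sketch of this forward-selection Wrapper with the best-first rule and the Equation (24) stopping criterion; train_and_score is a hypothetical callback that would train and evaluate LambdaMART on a candidate set of feature classes, and the toy gains in the usage example echo the first few iterations reported later in Table 7.

```python
def forward_wrapper(feature_classes, train_and_score, ndcg_base, ndcg_full, threshold=0.95):
    """Greedy forward selection over feature classes with the Equation (24) stopping rule.
    train_and_score(selected) is a hypothetical callback returning the NDCG for that feature set."""
    selected, remaining = [], list(feature_classes)
    ndcg_current = ndcg_base
    while remaining and (ndcg_current - ndcg_base) / (ndcg_full - ndcg_base) <= threshold:
        # Best-first: evaluate every candidate addition and keep the most promising one.
        scores = {c: train_and_score(selected + [c]) for c in remaining}
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
        ndcg_current = scores[best]
    return selected, ndcg_current

# Toy usage with a made-up scoring function (in the paper, each call is a full LambdaMART run):
gains = {"User-Query-URL": 0.0096, "Global-Query-URL": 0.0037, "User-Null-Domain": 0.0024}
fake_score = lambda feats: 0.7937 + sum(gains.get(f, 0.001) for f in feats)
print(forward_wrapper(list(gains), fake_score, ndcg_base=0.7937, ndcg_full=0.8121))
```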
Wrappers offer many advantages. First, they are agnostic to the underlying learning algorithm
and the accuracy metric used for evaluating the predictor. Second, greedy Wrappers have been shown
to be computationally advantageous and robust against overfitting (Reunanen, 2003).
8.2
Feature Classification for Wrapper
Wrappers are greedy algorithms that offer computational advantages: they reduce the number of
runs of the training algorithm from $2^F$ to $F(F+1)/2$. While this is significantly less than $2^F$, it is still
prohibitively expensive for large feature sets. For instance, in our case, this would amount to 42,778
runs of the training algorithm, which is not feasible. Hence, we use a coarse Wrapper that runs on
sets or classes of features as opposed to individual features. We use domain-specific knowledge to
identify groups of features and perform the search process on these groups of features as opposed to
individual ones. This allows us to limit the number of iterations required for the search process to a
reasonable number.
Recall that we have 293 features. We classify these into 39 classes. First, we explain how
we classify the first 270 features listed in Table A1 in the Appendix. These features encompass
the following group of eight features – Listings, AvgPosn, CTR, HDCTR, WCTR, PosnCTR, ADT,
and SkipRate. We classify these into 36 classes, as shown in Figure 21. At the lowest level of aggregation, this
set of features can be obtained at the URL or the domain level.14 This gives us two sets of features. At the
next level, these features can be aggregated at the query level, at the term level (term1, term2, term3,
or term4), or over all queries and terms (referred to as Null). This gives us six possible
options. At a higher level, they can be further aggregated at the session level, over users (across all
their sessions), or over all users and sessions (referred to as Global). Thus, we have 2 × 6 × 3 = 36
sets of features.
14
Note that SkipRate is not calculated at the domain level; hence at the domain level this set consists of seven features.
Figure 21: Classes of features used in feature-selection.
Next, we classify features 271 − 278, which consist of Listings, CTR, HDCTR, ADT at user and
session levels, aggregated over all URLs and domains, into the 37th class. The 38th class consists
of user-specific click probabilities for all ten positions, aggregated over the user history (features
279 − 288 in Table A1 in the Appendix). Finally, the remaining features 289 − 292 form the 39th class.
Note that SERPRank is not classified under any class, since it is included in all models by default.
8.3
Results: Feature Selection
We use a search process that starts with a minimal singleton feature set that comprises just the
SERPRank feature. This produces the baseline NDCG of 0.7937. Then we sequentially explore the
feature classes using the best-first selection rule. Table 7 shows the results of this exploration process,
along with the class of features added in each iteration. We stop at iteration six (by which time we
have run the model 219 times), when the improvement in NDCG satisfies the condition laid out in
Equation (24). The improvement in NDCG at each iteration is also shown in Figure 22, along with the
maximum NDCG obtainable with a model that includes all the features. Though the condition for the
stopping rule is satisfied at iteration 6, we show the results until iteration 11 in Figure 22 to demonstrate
the fact that the improvement in NDCG after iteration 6 is minimal.
Iteration no.    No. of features    Feature class added at iteration    NDCG
0                1                  SERPRank                            0.7937
1                9                  User–Query–URL                      0.8033
2                17                 Global–Query–URL                    0.8070
3                24                 User–Null–Domain                    0.8094
4                32                 Global–Null–URL                     0.8104
5                40                 User-session–Null–URL               0.8108
6                47                 Global–Null–Domain                  0.8112
Table 7: Set of features added at each iteration of the Wrapper algorithm based on Figure 21. The corresponding NDCGs are also shown.
Figure 22: Improvement in NDCG at each iteration of the Wrapper algorithm. The tick marks in the red line refer to the iteration number and the blue line depicts the maximum NDCG obtained with all the features included.
Note that at iteration six, the
pruned model with 47 features produces an NDCG of 0.8112, which equals 95% of the improvement
offered by the full model. This small loss in accuracy brings a huge benefit in scalability as shown in
Figures 23 and 24. When using the entire dataset, the training time for the pruned model is ≈ 15% of
the time it takes to train the full model. Moreover, the pruned model scales to larger datasets much
better than the full model. For instance, the difference in training times for full and pruned models
is small when performed on 20% of the dataset. However, as the data size grows, the training time
for the full model increases more steeply than that for the pruned model. While training-time is the
main time sink in this application, training is only done once. In contrast, testing runtimes have more
consequences for the model’s applicability since testing/prediction has to be done in real-time for
most applications. As with training times, we find that the pruned model runs significantly faster on
the full dataset (≈ 50%) and scales better with larger datasets.
Together, these findings suggest that feature selection is a valuable and important step in reducing
runtimes and making personalization implementable in real-time applications when we have big data.
Figure 23: Runtimes for training the full and pruned models on different percentages of the data.
Figure 24: Runtimes for testing the full and pruned models on different percentages of the data.
9
Conclusion
Personalization of digital services remains the holy grail of marketing. In this paper, we study
an important personalization problem – that of search rankings in the context of online search or
information retrieval. Query-based web search is now ubiquitous and used by a large number of
consumers for a variety of reasons. Personalization of search results can therefore have significant
implications for consumers, businesses, governments, and for the search engines themselves.
We present a three-pronged machine learning framework that improves search results through
automated personalization using gradient-based ranking. We apply our framework to data from the
premier Eastern European search engine, Yandex, and provide evidence of substantial improvements
in search quality using personalization. In contrast, standard maximum likelihood based procedures
provide zero improvement, even when supplied with features generated by our framework. We
quantify the value of the length of user history, the relative effectiveness of short-term and long-term
context, and the impact of other factors on the returns to personalization. Finally, we show that our
framework can perform efficiently at scale, making it well suited for real-time deployment.
Our paper makes four key contributions to the marketing literature. First, it presents a general
machine learning framework that marketers can use to rank recommendations using personalized
data in many different settings. Second, it presents empirical evidence in support of the returns to
personalization in the online search context. Third, it demonstrates how big data can be leveraged to
improve marketing outcomes without compromising the speed or real-time performance of digital
applications. Finally, it provides insights on how machine learning methods can be adapted to
solve important marketing problems that have been technically unsolvable so far (using traditional
econometric or analytical models). We hope our work will spur more research on the application of
machine learning methods not only to personalization-related issues, but also to a broad array of
marketing issues.
A
Appendix
Table A1: List of features.
Feature No.    Feature Name                                   Feature Classes
1              Listings(query, url, −, −, −)                  FG, FQ, FU, FC
2              CTR(query, url, −, −, −)                       FG, FQ, FU, FC
3              ADT(query, url, −, −, −)                       FG, FQ, FU, FW
4              AvgPosn(query, url, −, −, −)                   FG, FQ, FU, FR
5              HDCTR(query, url, −, −, −)                     FG, FQ, FU, FC, FW
6              WCTR(query, url, −, −, −)                      FG, FQ, FU, FC, FN
7              PosnCTR(query, url, −, −, −)                   FG, FQ, FU, FC, FR
8              SkipRate(query, url, −, −, −)                  FG, FQ, FU, FK
9              Listings(query, domain, −, −, −)               FG, FQ, FD, FC
10             CTR(query, domain, −, −, −)                    FG, FQ, FD, FC
11             ADT(query, domain, −, −, −)                    FG, FQ, FD, FW
12             AvgPosn(query, domain, −, −, −)                FG, FQ, FD, FR
13             HDCTR(query, domain, −, −, −)                  FG, FQ, FD, FC, FW
14             WCTR(query, domain, −, −, −)                   FG, FQ, FD, FC, FN
15             PosnCTR(query, domain, −, −, −)                FG, FQ, FD, FC, FR
16-19          Listings(term 1..4, url, −, −, −)              FG, FT, FU, FC
20-23          CTR(term 1..4, url, −, −, −)                   FG, FT, FU, FC
24-27          ADT(term 1..4, url, −, −, −)                   FG, FT, FU, FW
28-31          AvgPosn(term 1..4, url, −, −, −)               FG, FT, FU, FR
32-35          HDCTR(term 1..4, url, −, −, −)                 FG, FT, FU, FC, FW
36-39          WCTR(term 1..4, url, −, −, −)                  FG, FT, FU, FC, FN
40-43          PosnCTR(term 1..4, url, −, −, −)               FG, FT, FU, FC, FR
44-47          SkipRate(term 1..4, url, −, −, −)              FG, FT, FU, FK
48-51          Listings(term 1..4, domain, −, −, −)           FG, FT, FD, FC
52-55          CTR(term 1..4, domain, −, −, −)                FG, FT, FD, FC
56-59          ADT(term 1..4, domain, −, −, −)                FG, FT, FD, FW
60-63          AvgPosn(term 1..4, domain, −, −, −)            FG, FT, FD, FR
64-67          HDCTR(term 1..4, domain, −, −, −)              FG, FT, FD, FC, FW
68-71          WCTR(term 1..4, domain, −, −, −)               FG, FT, FD, FC, FN
72-75          PosnCTR(term 1..4, domain, −, −, −)            FG, FT, FD, FC, FR
76             Listings(query, url, usr, −, time)             FP, FQ, FU, FC
77             CTR(query, url, usr, −, time)                  FP, FQ, FU, FC
78             ADT(query, url, usr, −, time)                  FP, FQ, FU, FW
79             AvgPosn(query, url, usr, −, time)              FP, FQ, FU, FR
80             HDCTR(query, url, usr, −, time)                FP, FQ, FU, FC, FW
81             WCTR(query, url, usr, −, time)                 FP, FQ, FU, FC, FN
82             PosnCTR(query, url, usr, −, time)              FP, FQ, FU, FC, FR
83             SkipRate(query, url, usr, −, time)             FP, FQ, FU, FK
84             Listings(query, domain, usr, −, time)          FP, FQ, FD, FC
85             CTR(query, domain, usr, −, time)               FP, FQ, FD, FC
86             ADT(query, domain, usr, −, time)               FP, FQ, FD, FW
87             AvgPosn(query, domain, usr, −, time)           FP, FQ, FD, FR
88             HDCTR(query, domain, usr, −, time)             FP, FQ, FD, FC, FW
89             WCTR(query, domain, usr, −, time)              FP, FQ, FD, FC, FN
90             PosnCTR(query, domain, usr, −, time)           FP, FQ, FD, FC, FR
91-94          Listings(term 1..4, url, usr, −, time)         FP, FT, FU, FC
95-98          CTR(term 1..4, url, usr, −, time)              FP, FT, FU, FC
Continued on next page
Table A1 – continued from previous page
Feature No.    Feature Name                                       Feature Classes
99-102         ADT(term 1..4, url, usr, −, time)                  FP, FT, FU, FW
103-106        AvgPosn(term 1..4, url, usr, −, time)              FP, FT, FU, FR
107-110        HDCTR(term 1..4, url, usr, −, time)                FP, FT, FU, FC, FW
111-114        WCTR(term 1..4, url, usr, −, time)                 FP, FT, FU, FC, FN
115-118        PosnCTR(term 1..4, url, usr, −, time)              FP, FT, FU, FC, FR
119-122        SkipRate(term 1..4, url, usr, −, time)             FP, FT, FU, FK
123-126        Listings(term 1..4, domain, usr, −, time)          FP, FT, FD, FC
127-130        CTR(term 1..4, domain, usr, −, time)               FP, FT, FD, FC
131-134        ADT(term 1..4, domain, usr, −, time)               FP, FT, FD, FW
135-138        AvgPosn(term 1..4, domain, usr, −, time)           FP, FT, FD, FR
139-142        HDCTR(term 1..4, domain, usr, −, time)             FP, FT, FD, FC, FW
143-146        WCTR(term 1..4, domain, usr, −, time)              FP, FT, FD, FC, FN
147-150        PosnCTR(term 1..4, domain, usr, −, time)           FP, FT, FD, FC, FR
151            Listings(query, url, usr, session, time)           FS, FQ, FU, FC
152            CTR(query, url, usr, session, time)                FS, FQ, FU, FC
153            ADT(query, url, usr, session, time)                FS, FQ, FU, FW
154            AvgPosn(query, url, usr, session, time)            FS, FQ, FU, FR
155            HDCTR(query, url, usr, session, time)              FS, FQ, FU, FC, FW
156            WCTR(query, url, usr, session, time)               FS, FQ, FU, FC, FN
157            PosnCTR(query, url, usr, session, time)            FS, FQ, FU, FC, FR
158            SkipRate(query, url, usr, session, time)           FS, FQ, FU, FK
159            Listings(query, domain, usr, session, time)        FS, FQ, FD, FC
160            CTR(query, domain, usr, session, time)             FS, FQ, FD, FC
161            ADT(query, domain, usr, session, time)             FS, FQ, FD, FW
162            AvgPosn(query, domain, usr, session, time)         FS, FQ, FD, FR
163            HDCTR(query, domain, usr, session, time)           FS, FQ, FD, FC, FW
164            WCTR(query, domain, usr, session, time)            FS, FQ, FD, FC, FN
165            PosnCTR(query, domain, usr, session, time)         FS, FQ, FD, FC, FR
166-169        Listings(term 1..4, url, usr, session, time)       FS, FT, FU, FC
170-173        CTR(term 1..4, url, usr, session, time)            FS, FT, FU, FC
174-177        ADT(term 1..4, url, usr, session, time)            FS, FT, FU, FW
178-181        AvgPosn(term 1..4, url, usr, session, time)        FS, FT, FU, FR
182-185        HDCTR(term 1..4, url, usr, session, time)          FS, FT, FU, FC, FW
186-189        WCTR(term 1..4, url, usr, session, time)           FS, FT, FU, FC, FN
190-193        PosnCTR(term 1..4, url, usr, session, time)        FS, FT, FU, FC, FR
194-197        SkipRate(term 1..4, url, usr, session, time)       FS, FT, FU, FK
198-201        Listings(term 1..4, domain, usr, session, time)    FS, FT, FD, FC
202-205        CTR(term 1..4, domain, usr, session, time)         FS, FT, FD, FC
206-209        ADT(term 1..4, domain, usr, session, time)         FS, FT, FD, FW
210-213        AvgPosn(term 1..4, domain, usr, session, time)     FS, FT, FD, FR
214-217        HDCTR(term 1..4, domain, usr, session, time)       FS, FT, FD, FC, FW
218-221        WCTR(term 1..4, domain, usr, session, time)        FS, FT, FD, FC, FN
222-225        PosnCTR(term 1..4, domain, usr, session, time)     FS, FT, FD, FC, FR
226            Listings(−, url, −, −, −)                          FG, FU, FC
227            CTR(−, url, −, −, −)                               FG, FU, FC
228            ADT(−, url, −, −, −)                               FG, FU, FW
229            AvgPosn(−, url, −, −, −)                           FG, FU, FR
230            HDCTR(−, url, −, −, −)                             FG, FU, FC, FW
231            WCTR(−, url, −, −, −)                              FG, FU, FC, FN
232            PosnCTR(−, url, −, −, −)                           FG, FU, FC, FR
233            SkipRate(−, url, −, −, −)                          FG, FU, FK
Continued on next page
Feature No.
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279-288
289
290
291
292
293
Table A1 – continued from previous page
Feature Name
Listings(−, domain, −, −, −)
CTR(−, domain, −, −, −)
ADT(−, domain, −, −, −)
AvgPosn(−, domain, −, −, −)
HDCTR(−, domain, −, −, −)
WCTR(−, domain, −, −, −)
PosnCTR(−, domain, −, −, −)
Listings(−, url , usr , −, time)
CTR(−, url , usr , −, time)
ADT(−, url , usr , −, time)
AvgPosn(−, url , usr , −, time)
HDCTR(−, url , usr , −, time)
WCTR(−, url , usr , −, time)
PosnCTR(−, url , usr , −, time)
SkipRate(−, url , usr , −, time)
Listings(−, domain, usr , −, time)
CTR(−, domain, usr , −, time)
ADT(−, domain, usr , −, time)
AvgPosn(−, domain, usr , −, time)
HDCTR(−, domain, usr , −, time)
WCTR(−, domain, usr , −, time)
PosnCTR(−, domain, usr , −, time)
Listings(−, url , usr , session, time)
CTR(−, url , usr , session, time)
ADT(−, url , usr , session, time)
AvgPosn(−, url , usr , session, time)
HDCTR(−, url , usr , session, time)
WCTR(−, url , usr , session, time)
PosnCTR(−, url , usr , session, time)
SkipRate(−, url , usr , session, time)
Listings(−, domain, usr , session, time)
CTR(−, domain, usr , session, time)
ADT(−, domain, usr , session, time)
AvgPosn(−, domain, usr , session, time)
HDCTR(−, domain, usr , session, time)
WCTR(−, domain, usr , session, time)
PosnCTR(−, domain, usr , session, time)
Listings(−, −, usr , −, time)
CTR(−, −, usr , −, time)
HDCTR(−, −, usr , −, time)
ADT(−, −, usr , −, time)
Listings(−, −, usr , session, time)
CTR(−, −, usr , session, time)
HDCTR(−, −, usr , session, time)
ADT(−, −, usr , session, time)
ClickProb(usr , posn, time)
NumSearchResults(query)
NumClickedURLs(query)
Day(usr , session, time)
QueryTerms(query)
SERPRank(url , usr , session, search)
iii
Feature Classes
FG , FD , FC
FG , FD , FC
FG , FD , FW
FG , FD , FR
FG , FD , FC , FW
FG , FD , FC , FN
FG , FD , FC , FR
FP , FU , FC
FP , FU , FC
FP , FU , FW
FP , FU , FR
FP , FU , FC , FW
FP , FU , FC , FN
FP , FU , FC , FR
FP , FU , FK
FP , FD , FC
FP , FD , FC
FP , FD , FW
FP , FD , FR
FP , FD , FC , FW
FP , FD , FC , FN
FP , FD , FC , FR
FS , FU , FC
FS , FU , FC
FS , FU , FW
FS , FU , FR
FS , FU , FC , FW
FS , FU , FC , FN
FS , FU , FC , FR
FS , FU , FK
FS , FD , FC
FS , FD , FC
FS , FD , FW
FS , FD , FR
FS , FD , FC , FW
FS , FD , FC , FN
FS , FD , FC , FR
FP , FC
FP , FC
FP , FC
FP , FW
FS , FC
FS , FC
FS , FC
FS , FW
FP , FC
FG , FQ , FC
FG , FQ , FC
FS , FU
FQ
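To make the aggregation scheme in Table A1 concrete, the sketch below computes a few of the listed statistics (Listings, CTR, ADT, AvgPosn, SkipRate) over a hypothetical click log at a chosen aggregation level. This is an illustrative Python sketch, not code from the paper: the log fields (query, url, user, position, clicked, dwell, skipped) and the aggregate_features helper are assumptions. Changing key_fields reproduces the table's aggregation levels, e.g., ("query", "url") for global-style features and ("query", "url", "user") for user-specific ones.

# Illustrative sketch only: the field names and aggregation keys below are
# assumptions used to show how Table A1's statistics can be built from a
# click log; this is not the paper's implementation.
from collections import defaultdict

def aggregate_features(impressions, key_fields=("query", "url", "user")):
    """Compute Listings, CTR, ADT, AvgPosn, and SkipRate per aggregation key."""
    stats = defaultdict(lambda: {"shown": 0, "clicks": 0, "dwell": 0.0,
                                 "posn": 0, "skips": 0})
    for imp in impressions:
        key = tuple(imp[f] for f in key_fields)
        s = stats[key]
        s["shown"] += 1                              # one more listing of this result
        s["posn"] += imp["position"]                 # accumulates position for AvgPosn
        s["skips"] += int(imp.get("skipped", False))
        if imp["clicked"]:
            s["clicks"] += 1
            s["dwell"] += imp.get("dwell", 0.0)      # dwell time on clicked results

    features = {}
    for key, s in stats.items():
        features[key] = {
            "Listings": s["shown"],
            "CTR": s["clicks"] / s["shown"],
            "ADT": s["dwell"] / s["clicks"] if s["clicks"] else 0.0,
            "AvgPosn": s["posn"] / s["shown"],
            "SkipRate": s["skips"] / s["shown"],
        }
    return features

# Example: the same (hypothetical) log aggregated at two levels.
log = [
    {"query": "q1", "url": "u1", "user": "A", "position": 1, "clicked": True,
     "dwell": 120, "skipped": False},
    {"query": "q1", "url": "u1", "user": "B", "position": 2, "clicked": False,
     "dwell": 0, "skipped": True},
]
print(aggregate_features(log, key_fields=("query", "url")))          # global-style aggregates
print(aggregate_features(log, key_fields=("query", "url", "user")))  # user-specific aggregates

Restricting the impressions passed in to those from the current session before aggregating would, under the same assumptions, yield the session-level counterparts of these statistics.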
References
D. W. Aha, D. Kibler, and M. K. Albert. Instance-based Learning Algorithms. Machine learning, 6(1):37–66,
1991.
E. Amaldi and V. Kann. On the Approximability of Minimizing Nonzero Variables or Unsatisfied Relations in
Linear Systems. Theoretical Computer Science, 209(1):237–260, 1998.
A. Ansari and C. F. Mela. E-customization. Journal of Marketing Research, pages 131–145, 2003.
N. Arora and T. Henderson. Embedded Premium Promotion: Why it Works and How to Make it More Effective.
Marketing Science, 26(4):514–531, 2007.
A. Atkins-Krüger. Yandex Launches Personalized Search Results For Eastern Europe, December 2012. URL
http://searchengineland.com/yandex-launches-personalized-search-results-for-eastern-europe-142186.
S. Banerjee, S. Chakrabarti, and G. Ramakrishnan. Learning to Rank for Quantity Consensus Queries. In
Proceedings of the 32nd international ACM SIGIR Conference on Research and Development in Information
Retrieval, pages 243–250. ACM, 2009.
P. N. Bennett, R. W. White, W. Chu, S. T. Dumais, P. Bailey, F. Borisyuk, and X. Cui. Modeling the Impact of
Short- and Long-term Behavior on Search Personalization. In Proceedings of the 35th International ACM
SIGIR Conference on Research and Development in Information Retrieval, pages 185–194, 2012.
L. Breiman, J. Friedman, C. Stone, and R. Olshen. Classification and Regression Trees. The Wadsworth and
Brooks-Cole statistics-probability series. Taylor & Francis, 1984.
C. Burges. From RankNet to LambdaRank to LambdaMART: An Overview. Technical report, Microsoft
Research Technical Report MSR-TR-2010-82, 2010.
C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender. Learning to Rank
Using Gradient Descent. In Proceedings of the 22nd International Conference on Machine Learning, pages
89–96. ACM, 2005.
C. J. Burges, R. Ragno, and Q. V. Le. Learning to Rank with Nonsmooth Cost Functions. In Advances in
Neural Information Processing Systems, pages 193–200, 2006.
R. Caruana and A. Niculescu-Mizil. An Empirical Comparison of Supervised Learning Algorithms. In
Proceedings of the 23rd International Conference on Machine Learning, pages 161–168. ACM, 2006.
O. Chapelle and Y. Chang. Yahoo! Learning to Rank Challenge Overview. Journal of Machine Learning
Research-Proceedings Track, 14:1–24, 2011.
M. Ciaramita, V. Murdock, and V. Plachouras. Online Learning from Click Data for Sponsored Search. In
Proceedings of the 17th International Conference on World Wide Web, pages 227–236. ACM, 2008.
N. Clayton. Yandex Overtakes Bing as World's Fourth Search Engine, 2013. URL
http://blogs.wsj.com/tech-europe/2013/02/11/yandex-overtakes-bing-as-worlds-fourth-search-engine/.
comScore. December 2013 U.S. Search Engine Rankings, 2014. URL http://investor.google.
com/financial/tables.html.
N. Craswell, O. Zoeter, M. Taylor, and B. Ramsey. An Experimental Comparison of Click Position-bias Models.
In Proceedings of the 2008 International Conference on Web Search and Data Mining, pages 87–94, 2008.
V. Dang. RankLib. Online, 2011. URL http://www.cs.umass.edu/~vdang/ranklib.html.
P. Donmez, K. M. Svore, and C. J. Burges. On the Local Optimality of LambdaRank. In Proceedings of the
32nd international ACM SIGIR Conference on Research and Development in Information Retrieval, pages
460–467. ACM, 2009.
R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis, 1973.
D. Dzyabura and J. R. Hauser. Active Machine Learning for Consideration Heuristics. Marketing Science, 30
(5):801–819, 2011.
C. Eickhoff, K. Collins-Thompson, P. N. Bennett, and S. Dumais. Personalizing Atypical Web Search Sessions.
In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, pages 285–294,
2013.
T. Evgeniou, C. Boussios, and G. Zacharia. Generalized Robust Conjoint Estimation. Marketing Science, 24
(3):415–429, 2005.
T. Evgeniou, M. Pontil, and O. Toubia. A Convex Optimization Approach to Modeling Consumer Heterogeneity
in Conjoint Estimation. Marketing Science, 26(6):805–818, 2007.
M. Feuz, M. Fuller, and F. Stalder. Personal Web Searching in the Age of Semantic Capitalism: Diagnosing the
Mechanisms of Personalisation. First Monday, 16(2), 2011.
J. Friedman, T. Hastie, R. Tibshirani, et al. Additive Logistic Regression: A Statistical View of Boosting (with
Discussion and a Rejoinder by the Authors). The Annals of Statistics, 28(2):337–407, 2000.
E. Gabbert. Keywords vs. Search Queries: What’s the Difference?, 2011. URL http://www.wordstream.
com/blog/ws/2011/05/25/keywords-vs-search-queries.
A. Ghose, P. G. Ipeirotis, and B. Li. Examining the Impact of Ranking on Consumer Behavior and Search
Engine Revenue. Management Science, 2014.
M. Ginsberg. Essentials of Artificial Intelligence. Newnes, 1993.
A. Göker and D. He. Analysing Web Search Logs to Determine Session Boundaries for User-oriented Learning.
In Adaptive Hypermedia and Adaptive Web-Based Systems, pages 319–322, 2000.
Google. How Search Works, 2014. URL http://www.google.com/intl/en_us/insidesearch/howsearchworks/thestory/.
Google Official Blog. Personalized Search for Everyone, 2009. URL http://googleblog.blogspot.
com/2009/12/personalized-search-for-everyone.html.
P. Grey. How Many Products Does Amazon Sell?, 2013. URL http://export-x.com/2013/12/15/
many-products-amazon-sell/.
I. Guyon and A. Elisseeff. An Introduction to Variable and Feature Selection. The Journal of Machine Learning
Research, 3:1157–1182, 2003.
A. Hannak, P. Sapiezynski, A. Molavi Kakhki, B. Krishnamurthy, D. Lazer, A. Mislove, and C. Wilson.
Measuring Personalization of Web Search. In Proceedings of the 22nd International Conference on World
Wide Web, pages 527–538. International World Wide Web Conferences Steering Committee, 2013.
E. F. Harrington. Online Ranking/Collaborative Filtering using the Perceptron Algorithm. In ICML, volume 20,
pages 250–257, 2003.
T. Hastie, R. Tibshirani, J. Friedman, T. Hastie, J. Friedman, and R. Tibshirani. The Elements of Statistical
Learning, volume 2. Springer, 2009.
D. Huang and L. Luo. Consumer Preference Elicitation of Complex Products using Fuzzy Support Vector
Machine Active Learning. Marketing Science, 2015.
InstantWatcher. InstantWatcher.com, 2014. URL http://instantwatcher.com/titles/all.
K. Järvelin and J. Kekäläinen. IR Evaluation Methods for Retrieving Highly Relevant Documents. In
Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in
Information Retrieval, SIGIR ’00, pages 41–48. ACM, 2000.
K. Järvelin and J. Kekäläinen. Cumulated Gain-based Evaluation of IR Techniques. ACM Transactions on
Information Systems (TOIS), 20(4):422–446, 2002.
Kaggle. Personalized Web Search Challenge, 2013. URL http://www.kaggle.com/c/yandex-personalized-web-search-challenge.
Kaggle. Private Leaderboard - Personalized Web Search Challenge, 2014. URL
https://www.kaggle.com/c/yandex-personalized-web-search-challenge/leaderboard.
R. Kohavi and G. H. John. Wrappers for Feature Subset Selection. Artificial intelligence, 97(1):273–324, 1997.
A. Lambrecht and C. Tucker. When Does Retargeting Work? Information Specificity in Online Advertising.
Journal of Marketing Research, 50(5):561–576, 2013.
H. Li and P. Kannan. Attributing Conversions in a Multichannel Online Marketing Environment: An Empirical
Model and a Field Experiment. Journal of Marketing Research, 51(1):40–56, 2014.
T. Y. Liu. Learning to Rank for Information Retrieval. Springer, 2011.
P. Masurel, K. Lefèvre-Hasegawa, C. Bourguignat, and M. Scordia. Dataiku's Solution to Yandex's Personalized
Web Search Challenge. Technical report, Dataiku, 2014.
D. Metzler and T. Kanungo. Machine Learned Sentence Selection Strategies for Query-biased Summarization.
In SIGIR Learning to Rank Workshop, 2008.
K. P. Murphy. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012. ISBN 0262018020,
9780262018029.
O. Netzer, R. Feldman, J. Goldenberg, and M. Fresko. Mine Your Own Business: Market-Structure Surveillance
Through Text-Mining. Marketing Science, 31(3):521–543, 2012.
H. Pavliva. Google-Beater Yandex Winning Over Wall Street on Ad View, 2013. URL
http://www.bloomberg.com/news/2013-04-25/google-tamer-yandex-amasses-buy-ratings-russia-overnight.html.
T. Qin, T.-Y. Liu, J. Xu, and H. Li. LETOR: A Benchmark Collection for Research on Learning to Rank for
Information Retrieval. Information Retrieval, 13(4):346–374, 2010.
F. Qiu and J. Cho. Automatic Identification of User Interest for Personalized Search. In Proceedings of the
15th International Conference on World Wide Web, pages 727–736, 2006.
J. R. Quinlan. C4.5: Programs for Machine Learning, volume 1. Morgan Kaufmann, 1993.
J. Reunanen. Overfitting in Making Comparisons Between Variable Selection Methods. The Journal of
Machine Learning Research, 3:1371–1382, 2003.
P. E. Rossi and G. M. Allenby. Bayesian Statistics and Marketing. Marketing Science, 22(3):304–328, 2003.
P. E. Rossi, R. E. McCulloch, and G. M. Allenby. The Value of Purchase History Data in Target Marketing.
Marketing Science, 15(4):321–340, 1996.
R. E. Schapire. The Strength of Weak Learnability. Machine learning, 5(2):197–227, 1990.
B. Schwartz. Google: Previous Query Used On 0.3% Of Searches, 2012. URL http://www.
seroundtable.com/google-previous-search-15924.html.
D. Silverman. IAB Internet Advertising Revenue Report, October 2013. URL http://www.iab.net/
media/file/IABInternetAdvertisingRevenueReportHY2013FINALdoc.pdf.
A. Singhal. Some Thoughts on Personalization, 2011. URL http://insidesearch.blogspot.com/
2011/11/some-thoughts-on-personalization.html.
G. Song. Point-Wise Approach for Yandex Personalized Web Search Challenge. Technical report, IEEE.org,
2014.
D. Sullivan. Bing Results Get Localized & Personalized, 2011. URL http://searchengineland.
com/bing-results-get-localized-personalized-64284.
M. Surdeanu, M. Ciaramita, and H. Zaragoza. Learning to Rank Answers on Large Online QA Collections. In
ACL, pages 719–727, 2008.
J. Teevan, S. T. Dumais, and D. J. Liebling. To Personalize or Not to Personalize: Modeling Queries with
Variation in User Intent. In Proceedings of the 31st Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval, pages 163–170, 2008.
S. Tirunillai and G. J. Tellis. Extracting Dimensions of Consumer Satisfaction with Quality from Online
Chatter: Strategic Brand Analysis of Big Data Using Latent Dirichlet Allocation. Working paper, 2014.
O. Toubia, D. I. Simester, J. R. Hauser, and E. Dahan. Fast Polyhedral Adaptive Conjoint Estimation. Marketing
Science, 22(3):273–303, 2003.
O. Toubia, J. R. Hauser, and D. I. Simester. Polyhedral Methods for Adaptive Choice-based Conjoint Analysis.
Journal of Marketing Research, 41(1):116–131, 2004.
O. Toubia, T. Evgeniou, and J. Hauser. Optimization-Based and Machine-Learning Methods for Conjoint
Analysis: Estimation and Question Design. Conjoint Measurement: Methods and Applications, page 231,
2007.
M. Volkovs. Context Models For Web Search Personalization. Technical report, University of Toronto, 2014.
Q. Wu, C. J. Burges, K. M. Svore, and J. Gao. Adapting Boosting for Information Retrieval Measures.
Information Retrieval, 13(3):254–270, 2010.
Yandex. It May Get Really Personal: We Have Rolled Out Our Second-Generation Personalised Search Program,
2013. URL http://company.yandex.com/press_center/blog/entry.xml?pid=20.
Yandex. Yandex Announces Third Quarter 2013 Financial Results, 2013. URL http://company.yandex.
com/press_center/blog/entry.xml?pid=20.
YouTube. YouTube Statistics, 2014. URL https://www.youtube.com/yt/press/statistics.
html.
Y. Yue and C. Burges. On Using Simultaneous Perturbation Stochastic Approximation for IR measures, and
the Empirical Optimality of LambdaRank. In NIPS Machine Learning for Web Search Workshop, 2007.
J. Zhang and M. Wedel. The Effectiveness of Customized Promotions in Online and Offline Stores. Journal of
Marketing Research, 46(2):190–206, 2009.