Improving Web Search Ranking by Incorporating User Behavior Information

Eugene Agichtein
Eric Brill
Susan Dumais
Microsoft Research
Web Search Ranking

Rank pages relevant for a query
– Content match, e.g., page terms, anchor text, term weights
– Prior document quality, e.g., web topology, spam features
– Hundreds of parameters

Tune ranking functions on explicit document relevance ratings
Query: SIGIR 2006

Users can help indicate most relevant results
Web Search Ranking: Revisited

Incorporate user behavior information
– Millions of users submit queries daily
– Rich user interaction features (earlier talk)
– Complementary to content and web topology

Some challenges:
– User behavior “in the wild” is not reliable
– How to integrate interactions into ranking
– What is the impact over all queries
Outline

Modelling user behavior for ranking

Incorporating user behavior into ranking

Empirical evaluation

Conclusions
Related Work

Personalization
– Rerank results based on user’s clickthrough and browsing history

Collaborative filtering
– Amazon, DirectHit: rank by clickthrough

General ranking
– Joachims et al. [KDD 2002], Radlinski et al. [KDD 2005]: tuning ranking functions with clickthrough
Rich User Behavior Feature Space

Observed and distributional features
– Aggregate observed values over all user interactions for each query and result pair
– Distributional features: deviations from the “expected” behavior for the query (sketch below)

Represent user interactions as vectors in user behavior space
– Presentation: what a user sees before a click
– Clickthrough: frequency and timing of clicks
– Browsing: what users do after a click
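As one illustration of a distributional feature, here is a minimal sketch of a click-deviation computation; the position-based expected-rate prior and the simple difference are assumptions for illustration, not the paper's exact definition.

```python
def click_deviation(observed_click_rate: float, position: int,
                    expected_rate_by_position: list) -> float:
    """Hypothetical distributional feature: how far a result's observed
    click rate deviates from the rate "expected" at its rank position
    (the prior could be estimated over all queries and results).
    Positive values mean the result attracts more clicks than expected."""
    return observed_click_rate - expected_rate_by_position[position]
```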
Some User Interaction Features

Presentation
– ResultPosition: position of the URL in the current ranking
– QueryTitleOverlap: fraction of query terms in the result title

Clickthrough
– DeliberationTime: seconds between query and first click
– ClickFrequency: fraction of all clicks landing on the page
– ClickDeviation: deviation from expected click frequency

Browsing
– DwellTime: result page dwell time
– DwellTimeDeviation: deviation from expected dwell time for the query
Training a User Behavior Model

Map user behavior features to relevance judgements

RankNet: Burges et al. [ICML 2005]
– Scalable neural net implementation
– Input: user behavior features + relevance labels
– Output: weights for behavior feature values
– Used as testbed for all experiments
Training RankNet [Burges et al. 2005]

For query results 1 and 2, present a pair of feature vectors and labels, with label(1) > label(2)
– Feature Vector1, Label1 → NN output 1
– Feature Vector2, Label2 → NN output 2
– Error is a function of both outputs; we desire output1 > output2 (cost sketch below)
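To make the pairwise objective concrete, here is a minimal sketch of the cross-entropy cost RankNet minimizes, assuming the standard logistic model of the output difference from Burges et al. [ICML 2005]; the function name and example values are illustrative.

```python
import math

def ranknet_pair_cost(output_1: float, output_2: float) -> float:
    """Cross-entropy cost for a training pair where label(1) > label(2).

    The modeled probability that result 1 should rank above result 2
    is sigmoid(output_1 - output_2); since the target probability is 1,
    the cost reduces to -log P(1 > 2). It shrinks as the net scores
    the preferred result higher, which is the desired ordering."""
    prob_1_over_2 = 1.0 / (1.0 + math.exp(-(output_1 - output_2)))
    return -math.log(prob_1_over_2)

print(ranknet_pair_cost(2.0, 0.5))  # correct ordering: small cost
print(ranknet_pair_cost(0.5, 2.0))  # inverted ordering: large cost
```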
Predicting with RankNet

Present an individual feature vector and get a score
– Feature Vector1 → NN output
Outline

Modelling user behavior

Incorporating user behavior into ranking

Empirical evaluation

Conclusions
User Behavior Models for Ranking

Use interactions from previous instances of the query
– General-purpose (not personalized)
– Only available for queries with past user interactions

Models:
– Rerank, clickthrough only: reorder results by number of clicks
– Rerank, predicted preferences (all user behavior features): reorder results by predicted preferences
– Integrate directly into ranker: incorporate user interactions as features for the ranker
Rerank, Clickthrough Only

Promote all clicked results to the top of the result list
– Re-order by click frequency

Retain relative ranking of un-clicked results (see the sketch below)
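A minimal sketch of this reranking rule, assuming a ranked list of URLs and a per-URL click count aggregated from past interactions (both hypothetical structures):

```python
def rerank_by_clicks(results, click_counts):
    """Promote clicked results, ordered by click frequency, above
    all un-clicked results; un-clicked results keep their original
    relative order (Python's sort is stable, so ties among clicked
    results also preserve the original ranking)."""
    clicked = [r for r in results if click_counts.get(r, 0) > 0]
    unclicked = [r for r in results if click_counts.get(r, 0) == 0]
    clicked.sort(key=lambda r: click_counts[r], reverse=True)
    return clicked + unclicked
```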
Rerank, Preference Predictions

Re-order results by a function of the preference prediction score

Experimented with different variants
– Using inverse of ranks
– Intuition: scores are not comparable → merge ranks:

  $\mathrm{Score}(I_d, O_d) = w_I \frac{1}{I_d + 1} + \frac{1}{O_d + 1}$

  where $I_d$ and $O_d$ are the ranks of document $d$ in the interaction-based and original orderings, and $w_I$ weights the interaction-based rank (sketch below)
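A direct transcription of the scoring function, with 0-based ranks and the parameter names as illustrative assumptions:

```python
def merged_rank_score(interaction_rank: int, original_rank: int,
                      w_interaction: float) -> float:
    """Score(I_d, O_d) = w_I / (I_d + 1) + 1 / (O_d + 1).

    interaction_rank (I_d): rank of the document by predicted
    preferences; original_rank (O_d): its rank in the original
    result list; w_interaction (w_I): relative weight given to the
    interaction-based ordering. Ranks are assumed 0-based here."""
    return (w_interaction / (interaction_rank + 1.0)
            + 1.0 / (original_rank + 1.0))
```

Merging inverse ranks rather than raw scores sidesteps the fact that the two models' score scales are not directly comparable.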
Integrate User Behavior Features Directly into Ranker

For a given query
– Merge the original feature set with user behavior features when available
– User behavior features are computed from previous interactions with the same query

Train RankNet on the enhanced feature set (see the sketch below)
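A minimal sketch of the feature-merging step, assuming features are kept as name-to-value dictionaries (the actual feature plumbing is not described at this level of detail):

```python
def build_feature_vector(content_features: dict,
                         behavior_features: dict) -> dict:
    """Merge the original ranking features with user behavior
    features for queries that have past interactions; for queries
    without interactions, behavior_features is empty and only the
    original features are used."""
    merged = dict(content_features)
    merged.update(behavior_features)
    return merged
```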
Outline

Modelling user behavior

Incorporating user behavior into ranking

Empirical evaluation

Conclusions
Evaluation Metrics

Precision at K: fraction of relevant results in the top K

NDCG at K: normalized discounted cumulative gain
– Top-ranked results are most important:

  $N_q = M_q \sum_{j=1}^{K} \frac{2^{r(j)} - 1}{\log(1 + j)}$

  where $r(j)$ is the relevance grade of the result at position $j$ and $M_q$ normalizes so that a perfect ordering scores 1 (see the sketch after this slide)

MAP: mean average precision
– Average precision for each query: mean of the precision at K values computed after each relevant document is retrieved
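A minimal sketch of the NDCG at K computation under the formula above, assuming the ideal ordering is taken over the same judged result set; the natural log matches the slide, and the log base cancels in the normalized ratio anyway.

```python
import math

def ndcg_at_k(relevance_grades, k: int) -> float:
    """NDCG at K: N_q = M_q * sum_{j=1..K} (2^r(j) - 1) / log(1 + j).

    relevance_grades: graded relevance r(j) of each returned result,
    listed in ranked order. M_q is 1 over the DCG of the ideal
    (relevance-sorted) ordering, so a perfect ranking scores 1."""
    def dcg(grades):
        return sum((2 ** r - 1) / math.log(1 + j)
                   for j, r in enumerate(grades[:k], start=1))
    ideal = dcg(sorted(relevance_grades, reverse=True))
    return dcg(relevance_grades) / ideal if ideal > 0 else 0.0

# e.g. a ranking with grades [2, 0, 1] scored against the ideal [2, 1, 0]
print(ndcg_at_k([2, 0, 1], k=3))
```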
Datasets

8 weeks of user behavior data from anonymized opt-in client instrumentation

Millions of unique queries and interaction traces

Random sample of 3,000 queries
– Gathered independently of user behavior
– 1,500 train, 500 validation, 1,000 test

Explicit relevance assessments for the top 10 results for each query in the sample
Methods Compared

Content only: BM25F

Full search engine: RN
– Hundreds of parameters for content match and document quality
– Tuned with RankNet

Incorporating user behavior
– Clickthrough only: Rerank-CT
– Full user behavior model predictions: Rerank-All
– Integrate all user behavior features directly: +All
Content, User Behavior: Precision at K, Queries with Interactions

[Chart: precision at K = 1, 3, 5, 10 for BM25, Rerank-CT, Rerank-All, and BM25+All]

BM25 < Rerank-CT < Rerank-All < +All
Content, User Behavior: NDCG

[Chart: NDCG at K = 1 through 10 for BM25, Rerank-CT, Rerank-All, and BM25+All]

BM25 < Rerank-CT < Rerank-All < +All
Full Search Engine, User Behavior: NDCG, MAP

[Chart: NDCG at K = 1 through 10 for RN, Rerank-All, and RN+All]

Method      MAP     Gain
RN          0.270
RN+All      0.321   0.052 (19.13%)
BM25        0.236
BM25+All    0.292   0.056 (23.71%)
Impact: All Queries, Precision at K

[Chart: precision at K = 1, 3, 5, 10 for RN, Rerank-All, and RN+All]

Fewer than 50% of test queries have prior interactions
+0.06 to 0.12 precision gain over all test queries
Impact: All Queries, NDCG

[Chart: NDCG at K = 1 through 10 for RN, Rerank-All, and RN+All]

+0.03 to 0.05 NDCG gain over all test queries
Which Queries Benefit Most

[Chart: histogram of query frequency and average gain across query buckets]

Most gains are for queries with poor ranking
Conclusions

Incorporating user behavior into web search ranking dramatically improves relevance

Providing rich user interaction features to the ranker is the most effective strategy

Large improvement shown for up to 50% of test queries
Thank you
Text Mining, Search, and Navigation group:
http://research.microsoft.com/tmsn/
Adaptive Systems and Interaction group:
http://research.microsoft.com/adapt/
Microsoft Research
Content, User Behavior: All Queries, Precision at K

[Chart: precision at K = 1, 3, 5, 10 for BM25, Rerank-CT, Rerank-All, and All]

BM25 < Rerank-CT < Rerank-All < All
Content, User Behavior: All Queries, NDCG

[Chart: NDCG at K = 1 through 10 for BM25, Rerank-CT, Rerank-All, and All]

BM25 << Rerank-CT << Rerank-All < All
Results Summary

Incorporating user behavior into web search ranking dramatically improves relevance

Incorporating user behavior features directly into the ranker is the most effective strategy

Impact on relevance is substantial

Poorly performing queries benefit most
Promising Extensions

Backoff (improve query coverage)

Model user intent/information need

Personalization of various degrees

Query segmentation