
Practical and Reliable Retrieval Evaluation
Through Online Experimentation
WSDM Workshop on Web Search Click Data
February 12th, 2012
Yisong Yue
Carnegie Mellon University
Offline Post-hoc Analysis
• Launch some ranking function on live traffic
– Collect usage data (clicks)
– Often beyond our control
• Do something with the data
– User modeling, learning to rank, etc
• Did we improve anything?
– Often only evaluated on pre-collected data
Evaluating via Click Logs
Suppose our model swaps results 1 and 6.
Did retrieval quality improve?
[Figure: "What results do users view/click?" – number of times each rank is selected (click frequency) and mean time (s) spent in each result's abstract, plotted by rank of result]
[Joachims et al. 2005, 2007]
Online Evaluation
• Try out new ranking function on real users
• Collect usage data
• Interpret usage data
• Conclude whether or not quality has improved
Challenges
• Establishing live system
• Getting real users
• Needs to be practical
– Evaluation shouldn't take too long
– I.e., a sensitive experiment
• Needs to be reliable
– Feedback needs to be properly interpretable
– Not too systematically biased
→ Interleaving Experiments!
Team Draft Interleaving

Ranking A:
1. Napa Valley – The authority for lodging... (www.napavalley.com)
2. Napa Valley Wineries – Plan your wine... (www.napavalley.com/wineries)
3. Napa Valley College (www.napavalley.edu/homex.asp)
4. Been There | Tips | Napa Valley (www.ivebeenthere.co.uk/tips/16681)
5. Napa Valley Wineries and Wine (www.napavintners.com)
6. Napa Country, California – Wikipedia (en.wikipedia.org/wiki/Napa_Valley)

Ranking B:
1. Napa Country, California – Wikipedia (en.wikipedia.org/wiki/Napa_Valley)
2. Napa Valley – The authority for lodging... (www.napavalley.com)
3. Napa: The Story of an American Eden... (books.google.co.uk/books?isbn=...)
4. Napa Valley Hotels – Bed and Breakfast... (www.napalinks.com)
5. NapaValley.org (www.napavalley.org)
6. The Napa Valley Marathon (www.napavalleymarathon.org)

Presented Ranking (each result is credited to the team, A or B, that contributed it):
1. Napa Valley – The authority for lodging... (www.napavalley.com) – A
2. Napa Country, California – Wikipedia (en.wikipedia.org/wiki/Napa_Valley) – B
3. Napa: The Story of an American Eden... (books.google.co.uk/books?isbn=...) – B
4. Napa Valley Wineries – Plan your wine... (www.napavalley.com/wineries) – A
5. Napa Valley Hotels – Bed and Breakfast... (www.napalinks.com) – B
6. Napa Valley College (www.napavalley.edu/homex.asp) – A
7. NapaValley.org (www.napavalley.org) – B

[Radlinski et al., 2008]
Team Draft Interleaving

[Same example, now with the user's clicks marked on the presented ranking. Each click is credited to the team (A or B) whose ranking contributed the clicked result; here both teams receive equal credit, so the session is scored as a tie.]

Tie!

[Radlinski et al., 2008]
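The mechanics above can be made concrete with a short sketch. Below is a minimal Python implementation of Team-Draft interleaving and click-based session scoring in the spirit of Radlinski et al. (2008); the function names, the pick-by-pick coin flip, and the tie handling are illustrative assumptions rather than the authors' reference code.

```python
import random

def team_draft_interleave(ranking_a, ranking_b, length=10):
    """Team-Draft interleaving (after Radlinski et al., 2008): the two 'teams'
    alternately contribute their highest-ranked result that is not yet shown."""
    interleaved, team_a, team_b = [], set(), set()
    while len(interleaved) < length:
        # The team with fewer picks goes next; ties are broken by a coin flip.
        a_turn = len(team_a) < len(team_b) or (
            len(team_a) == len(team_b) and random.random() < 0.5)
        order = [(ranking_a, team_a), (ranking_b, team_b)]
        if not a_turn:
            order.reverse()
        picked = False
        for source, team in order:
            candidates = [doc for doc in source if doc not in interleaved]
            if candidates:
                interleaved.append(candidates[0])
                team.add(candidates[0])
                picked = True
                break
        if not picked:  # both rankings exhausted
            break
    return interleaved, team_a, team_b

def score_session(clicked_docs, team_a, team_b):
    """Credit each click to the team whose ranking contributed the clicked result."""
    credit_a = len(set(clicked_docs) & team_a)
    credit_b = len(set(clicked_docs) & team_b)
    return "A wins" if credit_a > credit_b else "B wins" if credit_b > credit_a else "tie"

# Usage (documents as title strings, e.g. the Napa Valley example above):
# presented, team_a, team_b = team_draft_interleave(ranking_a, ranking_b, length=7)
# outcome = score_session(observed_clicks, team_a, team_b)
```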
Simple Example
• Two users, Alice & Bob
– Alice clicks a lot
– Bob clicks very little
• Two retrieval functions, r1 & r2
– r1 > r2
• Two ways of evaluating:
– Run r1 & r2 independently, measure absolute metrics
– Interleave r1 & r2, measure pairwise preference
• Absolute metrics:
  User  | Ret Func | #clicks
  Alice | r2       | 5
  Bob   | r1       | 1
→ Higher chance of falsely concluding that r2 > r1
• Interleaving:
  User  | #clicks on r1 | #clicks on r2
  Alice | 4             | 1
  Bob   | 1             | 0
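The point of this example can be replayed in a few lines: pooling raw clicks rewards whichever function happened to serve the heavier clicker, while the interleaved counts compare each user against themselves. The dictionaries below simply encode the two tables above; the aggregation rules are the assumptions being illustrated.

```python
# Click counts from the two tables above.
absolute = {"r1": 1, "r2": 5}          # Bob used r1 (1 click), Alice used r2 (5 clicks)
interleaved = {"Alice": {"r1": 4, "r2": 1},
               "Bob":   {"r1": 1, "r2": 0}}

# Absolute metrics pool unpaired users: r2 looks better only because Alice clicks more.
print("absolute winner:", max(absolute, key=absolute.get))   # -> r2 (misleading)

# Interleaving is a paired test: each user's clicks are compared against themselves.
prefs = {user: max(counts, key=counts.get) for user, counts in interleaved.items()}
print("per-user preferences:", prefs)                         # -> both prefer r1
```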
Comparison with Absolute Metrics (Online)
ArXiv.org Pair 2
Probability
Disagreement
p-value
ArXiv.org Pair 1
Query set size
•Experiments on arXiv.org
•About 1000 queries per experiment
•Interleaving is more sensitive and more reliable
Clicks@1 diverges in
preference estimate
Interleaving achieves
significance faster
[Radlinski et al. 2008; Chapelle et al., 2012]
Comparison with Absolute Metrics (Online)
[Figure: Yahoo! Pair 1 and Pair 2 – probability of disagreement and p-value as a function of query set size]
• Experiments on Yahoo! (smaller differences in quality)
• Large-scale experiment
• Interleaving is sensitive and more reliable (~7K queries for significance)
[Radlinski et al. 2008; Chapelle et al., 2012]
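The p-value curves in these plots measure how quickly the preference becomes statistically significant as queries accumulate. A common way to test this (assumed here; the exact test behind the plots may differ) is a two-sided binomial sign test on the per-query winners, ignoring ties. A minimal sketch, assuming SciPy is available:

```python
from scipy.stats import binomtest  # SciPy >= 1.7 is assumed to be available

def interleaving_p_value(wins_a, wins_b):
    """Two-sided sign test: under H0 every non-tied session is a fair coin flip."""
    n = wins_a + wins_b
    if n == 0:
        return 1.0
    return binomtest(wins_a, n=n, p=0.5, alternative="two-sided").pvalue

# Hypothetical example: out of ~7000 non-tied sessions, A wins 3650 and B wins 3350.
print(interleaving_p_value(3650, 3350))  # small p-value -> significant preference for A
```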
Benefits & Drawbacks of Interleaving
• Benefits
– A more direct way to elicit user preferences
– A more direct way to perform retrieval evaluation
– Deals with issues of position bias and calibration
• Drawbacks
– Can only elicit pairwise ranking-level preferences
– Unclear how to interpret at document-level
– Unclear how to derive user model
Demo!
http://www.yisongyue.com/downloads/sigir_tutorial_demo_scripts.tar.gz
Story So Far
• Interleaving is an efficient and consistent
online experiment framework.
• How can we improve interleaving
experiments?
• How do we efficiently schedule multiple
interleaving experiments?
Not All Clicks Created Equal
• Interleaving constructs a paired test
– Controls for position bias
– Calibrates clicks
• But not all clicks are equally informative
– Attractive summaries
– Last click vs first click
– Clicks at rank 1
Title Bias Effect
[Figure: click percentage on the bottom result, for adjacent rank positions]
• Bars should be equal if no title bias
[Yue et al., 2010]
Not All Clicks Created Equal
• Example: query session with 2 clicks
– One click at rank 1 (from A)
– Later click at rank 4 (from B)
– Normally would count this query session as a tie
– But the second click is probably more informative…
– …so B should get more credit for this query
Linear Model for Weighting Clicks
• Feature vector φ(q,c):

  $$\phi(q,c)=\begin{bmatrix} 1 \;\text{always}\\ 1 \;\text{if click led to download}\\ 1 \;\text{if last click}\\ 1 \;\text{if higher rank than previous click}\\ \vdots \end{bmatrix}$$

• Weight of click is $w^\top \phi(q,c)$
[Yue et al., 2010; Chapelle et al., 2012]
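A minimal sketch of how such a weighted click score could be computed. The indicator features mirror the ones listed above, but the dictionary keys, helper names, and the example weights are illustrative assumptions.

```python
def click_features(query, click):
    """phi(q, c): indicator features of a single click (illustrative subset;
    `query` is unused here but kept to mirror the phi(q, c) notation)."""
    return [
        1.0,                                           # constant feature (always 1)
        1.0 if click.get("led_to_download") else 0.0,  # click led to a download
        1.0 if click.get("is_last_click") else 0.0,    # last click of the session
        1.0 if click.get("above_previous") else 0.0,   # higher rank than previous click
    ]

def click_weight(w, query, click):
    """Weight of a click is the inner product w^T phi(q, c)."""
    return sum(wi * xi for wi, xi in zip(w, click_features(query, click)))

# Conventional click counting corresponds to w = (1, 0, 0, 0); a learned w shifts
# credit toward more informative clicks, e.g. w = (0.2, 1.5, 1.0, 0.8) (made-up values).
```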
Example

  $$\phi(q,c)=\begin{bmatrix} 1 \;\text{if } c \text{ is last click; } 0 \text{ else}\\ 1 \;\text{if } c \text{ is not last click; } 0 \text{ else} \end{bmatrix}$$

• $w^\top \phi(q,c)$ differentiates last clicks and other clicks
• Interleave A vs B
– 3 clicks per session
– Last click 60% on result from A
– Other 2 clicks random
• Conventional w = (1,1) – has significant variance
• Only count last click w = (1,0) – minimizes variance
[Yue et al., 2010; Chapelle et al., 2012]
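This example is easy to check by simulation. The sketch below encodes the session model stated above (last click on A with probability 0.6, two other clicks at random) and compares the per-session mean/std ratio, a proxy for sensitivity, under the two weightings; the simulation itself is an illustrative assumption, not the authors' experiment.

```python
import random, statistics

def simulate_session():
    """3 clicks: the last click goes to A with prob. 0.6, the other two are 50/50."""
    last = "A" if random.random() < 0.6 else "B"
    others = [random.choice("AB") for _ in range(2)]
    # phi(q,c) per click: (is_last_click, is_not_last_click)
    return [(c, (0.0, 1.0)) for c in others] + [(last, (1.0, 0.0))]

def session_score(clicks, w):
    """Weighted credit for A minus weighted credit for B within one session."""
    total = 0.0
    for team, phi in clicks:
        weight = w[0] * phi[0] + w[1] * phi[1]
        total += weight if team == "A" else -weight
    return total

random.seed(0)
sessions = [simulate_session() for _ in range(100_000)]
for w in [(1.0, 1.0), (1.0, 0.0)]:
    scores = [session_score(s, w) for s in sessions]
    ratio = statistics.mean(scores) / statistics.pstdev(scores)
    print(w, "mean/std =", round(ratio, 3))
# Expected: the last-click-only weighting (1, 0) gives a larger mean/std ratio,
# i.e. a more sensitive experiment, as the slide states.
```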
Learning Parameters
• Training set: interleaved click data on pairs of
retrieval functions (A,B)
– We know A > B
• Learning: train parameters w to maximize sensitivity
of interleaving experiments
• Example: z-test depends on z-score = mean / std
– The larger the z-score, the more confident the test
– Inverse z-test learns w to maximize z-score on training set
[Yue et al., 2010; Chapelle et al., 2012]
Inverse z-Test

  $$\Psi_q = \frac{1}{|q|}\Bigl[\sum_{c \in A(q)} \phi(q,c) \;-\; \sum_{c \in B(q)} \phi(q,c)\Bigr]$$   (aggregate features of all clicks in a query)

  $$\mathrm{mean}(w) = \frac{1}{n}\sum_q w^\top \Psi_q$$

  $$\mathrm{std}(w) = \sqrt{\frac{1}{n}\sum_q \bigl(w^\top \Psi_q - \mathrm{mean}(w)\bigr)^2}$$

  $$w^* = \arg\max_w \frac{\mathrm{mean}(w)}{\mathrm{std}(w)} = \frac{\Sigma^{-1}\bar{\Psi}}{\sqrt{\bar{\Psi}^\top \Sigma^{-1}\bar{\Psi}}}$$

  (choose w* to maximize the resulting z-score; here $\bar{\Psi}$ and $\Sigma$ are the sample mean and covariance of the $\Psi_q$)

[Yue et al., 2010; Chapelle et al., 2012]
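A sketch of the closed-form solution above in NumPy. Here Psi is assumed to be an (n_queries x n_features) matrix whose rows are the per-query aggregated feature differences Ψ_q; the ridge term is an added assumption for numerical stability, not part of the original formulation.

```python
import numpy as np

def inverse_z_test_weights(Psi, ridge=1e-6):
    """w* maximizing mean(w)/std(w); proportional to Sigma^{-1} Psi_bar."""
    psi_bar = Psi.mean(axis=0)                    # mean per-query feature difference
    centered = Psi - psi_bar
    sigma = centered.T @ centered / Psi.shape[0]  # empirical covariance
    sigma += ridge * np.eye(sigma.shape[0])       # small ridge for invertibility (assumption)
    w = np.linalg.solve(sigma, psi_bar)           # Sigma^{-1} psi_bar
    return w / np.sqrt(psi_bar @ w)               # normalize so that std(w*) is ~1

def z_score(Psi, w):
    """Sensitivity of the interleaving experiment under click weights w."""
    scores = Psi @ w
    return scores.mean() / scores.std()
```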
ArXiv.org Experiments
[Figure: per-experiment test statistic of the learned weights vs. the baseline, shown as the ratio Learned / Baseline]
• Trained on 6 interleaving experiments
• Tested on 12 interleaving experiments
• Median relative score of 1.37
• Baseline requires 1.88 times more data
[Yue et al., 2010; Chapelle et al., 2012]
Yahoo! Experiments
[Figure: per-experiment test statistic of the learned weights vs. the baseline, shown as the ratio Learned / Baseline]
• 16 markets, 4-6 interleaving experiments each
• Leave-one-market-out validation
• Median relative score of 1.25
• Baseline requires 1.56 times more data
[Yue et al., 2010; Chapelle et al., 2012]
Improving Interleaving Experiments
• Can re-weight clicks based on importance
– Reduces noise
– Parameters correlated so hard to interpret
– Largest weight on “single click at rank > 1”
• Can alter the interleaving mechanism
– Probabilistic interleaving [Hofmann et al., 2011]
• Reusing interleaving usage data
Story So Far
• Interleaving is an efficient and consistent
online experiment framework.
• How can we improve interleaving
experiments?
• How do we efficiently schedule multiple
interleaving experiments?
[Figure: users issue queries to the information systems; each query is answered by interleaving one pair of retrieval functions, and the outcome updates a running table of pairwise wins (left wins / right wins).]

Interleave A vs B:   A vs B: 1 / 0    A vs C: 0 / 0    B vs C: 0 / 0
Interleave A vs C:   A vs B: 1 / 0    A vs C: 0 / 1    B vs C: 0 / 0
Interleave B vs C:   A vs B: 1 / 0    A vs C: 0 / 1    B vs C: 1 / 0
Interleave A vs B:   A vs B: 1 / 1    A vs C: 0 / 1    B vs C: 1 / 0

Which pair should we interleave next? Exploration / Exploitation Tradeoff!
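A minimal sketch of the bookkeeping illustrated above: each interleaved comparison updates a table of pairwise wins, and the open question is which pair to interleave next. The class and method names are illustrative.

```python
from collections import defaultdict
from itertools import combinations

class PairwiseWins:
    """Running table of interleaving outcomes between retrieval functions."""
    def __init__(self, rankers):
        self.rankers = list(rankers)
        self.wins = defaultdict(int)        # wins[(winner, loser)] = count

    def record(self, ranker_x, ranker_y, winner):
        """Record the outcome of one interleaved session of x vs y (ties ignored)."""
        if winner == ranker_x:
            self.wins[(ranker_x, ranker_y)] += 1
        elif winner == ranker_y:
            self.wins[(ranker_y, ranker_x)] += 1

    def table(self):
        """Left wins / right wins for every pair, as in the slide."""
        return {f"{x} vs {y}": (self.wins[(x, y)], self.wins[(y, x)])
                for x, y in combinations(self.rankers, 2)}

# Replaying the sequence of sessions from the slide:
t = PairwiseWins(["A", "B", "C"])
for pair, winner in [(("A", "B"), "A"), (("A", "C"), "C"),
                     (("B", "C"), "B"), (("A", "B"), "B")]:
    t.record(*pair, winner)
print(t.table())   # {'A vs B': (1, 1), 'A vs C': (0, 1), 'B vs C': (1, 0)}
```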
Identifying Best Retrieval Function
• Tournament
– E.g., tennis
– Eliminated by an arbitrary player
• Champion
– E.g., boxing
– Eliminated by champion
• Swiss
– E.g., group rounds
– Eliminated based on overall record
Tournaments are Bad
• Suppose two bad retrieval functions are dueling
– If they are similar to each other, it takes a long time to decide the winner
– The tournament can't make progress until the duel is decided
• We suffer very high regret for each such comparison
– We could have been using better retrieval functions
Champion is Good
• The champion gets better fast
– If it starts out bad, it quickly gets replaced
– It duels against each competitor in a round robin
– One of the current competitors will become the next champion
• Treat the sequence of champions as a random walk
– Logarithmic number of rounds to arrive at the best retrieval function
[Yue et al., 2009]
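A schematic sketch of the champion style of scheduling, in the spirit of (but not identical to) the dueling-bandits algorithms of Yue et al. (2009). duel(champion, challenger) is an assumed helper that runs interleaved comparisons until it can conclude, with confidence, whether the challenger beats the champion.

```python
def champion_schedule(rankers, duel):
    """Repeatedly defend the champion in a round robin against all challengers."""
    champion = rankers[0]
    improved = True
    while improved:
        improved = False
        for challenger in rankers:
            if challenger == champion:
                continue
            if duel(champion, challenger):   # True -> challenger is reliably better
                champion = challenger        # bad champions get replaced quickly
                improved = True
                break                        # restart the round robin with the new champion
    return champion
```

Each call to duel is itself a sequence of interleaving experiments, which is where the exploration/exploitation tradeoff from the earlier slides shows up.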
Swiss is Even Better
• Champion has a lot of variance
– Depends on initial champion
• Swiss offers low-variance alternative
– Successively eliminate retrieval function with
worst record
• Analysis & intuition more complicated
[Yue & Joachims, 2011]
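For contrast, a schematic sketch of the Swiss-style alternative, in the spirit of (but not identical to) Beat the Mean (Yue & Joachims, 2011): keep comparing all active retrieval functions and successively eliminate the one with the worst overall record. play_round is an assumed helper.

```python
def swiss_schedule(rankers, play_round):
    """Successively eliminate the retrieval function with the worst overall record."""
    active = list(rankers)
    while len(active) > 1:
        # play_round is assumed to run interleaved comparisons among the active set
        # and return each function's overall win fraction, e.g. {"A": 0.61, "B": 0.52, ...}.
        records = play_round(active)
        worst = min(active, key=lambda r: records[r])
        active.remove(worst)   # elimination is based on the full record -> lower variance
    return active[0]
```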
Interleaving for Online Evaluation
• Interleaving is practical for online evaluation
– High sensitivity
– Low bias (preemptively controls for position bias)
• Interleaving can be improved
– Dealing with secondary sources of noise/bias
– New interleaving mechanisms
• Exploration/exploitation tradeoff
– Need to balance evaluation with servicing users
References:
Large Scale Validation and Analysis of Interleaved Search Evaluation (TOIS 2012)
Olivier Chapelle, Thorsten Joachims, Filip Radlinski, Yisong Yue
A Probabilistic Method for Inferring Preferences from Clicks (CIKM 2011)
Katja Hofmann, Shimon Whiteson, Maarten de Rijke
Evaluating the Accuracy of Implicit Feedback from Clicks and Query Reformulations (TOIS 2007)
Thorsten Joachims, Laura Granka, Bing Pan, Helen Hembrooke, Filip Radlinski, Geri Gay
How Does Clickthrough Data Reflect Retrieval Quality? (CIKM 2008)
Filip Radlinski, Madhu Kurup, Thorsten Joachims
The K-armed Dueling Bandits Problem (COLT 2009)
Yisong Yue, Josef Broder, Robert Kleinberg, Thorsten Joachims
Learning More Powerful Test Statistics for Click-Based Retrieval Evaluation (SIGIR 2010)
Yisong Yue, Yue Gao, Olivier Chapelle, Ya Zhang, Thorsten Joachims
Beat the Mean Bandit (ICML 2011)
Yisong Yue, Thorsten Joachims
Beyond Position Bias: Examining Result Attractiveness as a Source of Presentation Bias in
Clickthrough Data (WWW 2010)
Yisong Yue, Rajan Patel, Hein Roehrig
Papers and demo scripts available at www.yisongyue.com