Inferring Commercial Intent

Towards Inferring Searcher Intent
Eugene Agichtein
Intelligent Information Access Lab (IRLab)
• Text and data mining
• Modeling information seeking behavior
• Web search and social media search
• Tools for medical informatics and public health
Ablimit Aji (2nd-year PhD)
Qi Guo (3rd-year PhD)
In collaboration with:
- Beth Buffalo (Neurology)
- Charlie Clarke (Waterloo)
- Ernie Garcia (Radiology)
- Phil Wolff (Psychology)
- Hongyuan Zha (GaTech)
1st year graduate students: Julia Kiseleva,
Dmitry Lagun, Qiaoling Liu, Wang Yu
Online Behavior and Interactions
Information sharing:
blogs, forums, discussions
Search logs:
queries, clicks
Client-side behavior:
Gaze tracking, mouse
movement, scrolling
Research Overview
Discover Models of Behavior
(machine learning/data mining)
• Intelligent search
• Information sharing
• Health informatics
• Cognitive diagnostics
Main Application Areas
• Search: ranking, evaluation, advertising, search
interfaces, medical search (clinicians, patients)
• Collaborative information sharing: searcher intent,
success, expertise, content quality
• Health informatics: self reporting of drug side effects,
co-morbidity, outreach/education
• Automatic cognitive diagnostics: stress, frustration,
other impairments …
Talk Outline
Overview of the Emory IR Lab
Intent-centric Web Search
Classifying intent of a query
Contextualized search intent detection
Web Retrieval Architecture
[from Baeza-Yates and Jones, WWW 2008 tutorial]
Example of a centralized parallel architecture [diagram: Web, crawlers, and the search back-end]
Information Retrieval Process (User view)
[Flow diagram] Source Selection → (Resource) → Query Formulation → (Query) → Search → (Ranked List) → Selection → (Documents) → Examination → (Documents) → Delivery, with feedback loops for query reformulation, vocabulary learning, relevance feedback, and source reselection.
Some Key Challenges for Web Search
• Query interpretation (infer intent)
• Ranking (high dimensionality)
• Evaluation (system improvement)
• Result presentation (information visualization)
Intent is “Hidden State” Generating Actions
Intent “states”: Satisfied, Unsatisfied
• First (naïve) generative model of user actions:
– Given a state (e.g., “Unsatisfied” with the results), the user generates actions such as query, click, and browse (a minimal sketch follows below)
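A minimal Python sketch of this naive generative view; the states, action vocabulary, and probabilities below are invented for illustration and are not from the talk:

```python
import random

# Hypothetical per-state action distributions; the numbers are illustrative only.
ACTION_PROBS = {
    "satisfied":   {"click": 0.5, "browse": 0.4, "query": 0.1},
    "unsatisfied": {"click": 0.2, "browse": 0.2, "query": 0.6},
}

def generate_actions(state, n_steps, seed=0):
    """Sample observable actions given a hidden intent state (naive generative model)."""
    rng = random.Random(seed)
    actions, weights = zip(*ACTION_PROBS[state].items())
    return [rng.choices(actions, weights=weights, k=1)[0] for _ in range(n_steps)]

print(generate_actions("unsatisfied", 5))   # e.g. ['query', 'query', 'click', 'query', 'browse']
```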
Problem Statement
• Given: a sequence of user actions and background knowledge, predict user intent and future actions
– Intent classes and actions are defined next
• Example applications:
– Predict document relevance (ranking, result
presentation, summarization)
– Predict next query (query suggestion, spelling
correction)
– Predict user satisfaction (market share)
Intent Classes (top level only)
[from SIGIR 2008 Tutorial, Baeza-Yates and Jones]
User intent taxonomy (Broder 2002)
– Informational – want to learn about something (~40% / 65%)
  • e.g., history of nonya food
– Navigational – want to go to that page (~25% / 15%)
  • e.g., Singapore Airlines
– Transactional – want to do something, web-mediated (~35% / 20%)
  • Access a service: Jakarta weather
  • Downloads: Kalimantan satellite images
  • Shop: Nikon Finepix
– Gray areas
  • Find a good hub: car rental Kuala Lumpur
  • Exploratory search: “see what’s there”
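For illustration only, here is a crude rule-based guess at these top-level classes from the query string alone; the cue lists and rules are hypothetical and far weaker than the learned classifiers discussed later in the talk:

```python
import re

TRANSACTIONAL_CUES = {"download", "buy", "shop", "rental", "order", "price"}
NAVIGATIONAL_CUES = re.compile(r"\.(com|org|net)\b|^www\.", re.I)

def broder_class(query: str) -> str:
    """Very rough top-level intent guess in the spirit of the Broder (2002) taxonomy."""
    tokens = set(query.lower().split())
    if NAVIGATIONAL_CUES.search(query) or (len(tokens) <= 2 and query.istitle()):
        return "navigational"        # e.g. "Singapore Airlines", "cnn.com"
    if tokens & TRANSACTIONAL_CUES:
        return "transactional"       # e.g. "Kuala Lumpur car rental"
    return "informational"           # default: learn about something

for q in ["Singapore Airlines", "Kuala Lumpur car rental", "history of nonya food"]:
    print(q, "->", broder_class(q))
```

Gray-area queries are exactly the ones such hand-written rules get wrong, which motivates the learned classifiers below.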
Search Actions
• Keystrokes: query typing, CTRL-C, …
• GUI: scrolling, button presses, clicks
• Mouse: movement, scrolling, button down/up
• Browser: new tab, close, back/forward
All of these can be easily captured on the SERP (JavaScript).
Problem 1: Detect Query Intent
[Ashkan et al., ECIR 2009]
• Query intent detection along multiple dimensions:
– Commercial, if the assumed purpose of the query is to make an immediate or future purchase
– Navigational, if the assumed purpose of the query is to locate a specific website; informational otherwise
• Clickthrough calculation: estimate the average ad clickthrough rate for each query type
Dataset Construction
[Ashkan et al., ECIR 2009]
• Microsoft adCenter Search query log
~100M search impressions
~8M ad clicks associated with the impressions
• Seed: 1700 queries labeled by three researchers
– Examine query, search result page (SRP)
• MTurk: 3,000 new queries + 1,000 Seed queries
– 40 batches of 100 queries, each with 25 Seed and 75 new queries
– If agreement on Seed queries < 60% → reject and redo; if > 75% → pay a bonus (see the sketch below)
• Results after label resolution:
– 42% Commercial; 55% Navigational
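A sketch of the seed-based quality control just described; the 60%/75% thresholds come from the slide, while the data layout and decision labels are invented for illustration:

```python
def score_batch(worker_labels, seed_gold):
    """Fraction of seed queries in a batch that the worker labeled in agreement with the gold labels."""
    seed_items = [q for q in worker_labels if q in seed_gold]
    agree = sum(worker_labels[q] == seed_gold[q] for q in seed_items)
    return agree / len(seed_items)

def batch_decision(agreement):
    if agreement < 0.60:
        return "reject and re-issue batch"
    if agreement > 0.75:
        return "accept with bonus"
    return "accept"

# Hypothetical example: 25 seed queries inside a 100-query batch.
gold = {f"seed{i}": "commercial" for i in range(25)}
labels = {f"seed{i}": ("commercial" if i < 20 else "noncommercial") for i in range(25)}
print(batch_decision(score_batch(labels, gold)))   # 0.8 agreement -> "accept with bonus"
```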
Amazon Mechanical Turk Service
Use Support Vector Machine (SVM) Classifier
• SVMs maximize the margin around
the separating hyperplane.
• A.k.a. large margin
classifiers
• The decision function is fully
specified by a subset of training
samples, the support vectors.
• Quadratic programming problem
• Seen by many as most successful
current text classification method
Features for Classification
[Ashkan et al., ECIR 2009]

Query-specific features:
– Query length: number of characters in the query string
– Query segments: number of words in the query string
– URL-element: whether the query string contains any URL element, such as .com, .org

Content features:
– Organic domain: total number of domains listed among the organic results of which the query string is a substring
– SERP: frequency of keywords extracted from the first search result page

Clickthrough features:
– Host #: number of different target ad hosts clicked as results of the query
– Click per host: total number of ad clicks recorded for the query divided by Host #
– Top host significance: number of times a click happens on the most frequent target host as a result of the query, divided by click per host
– Decrease level for top two hosts: number of times a click happens on the most frequent target host divided by the number of times the second most frequent target host receives a click
– Average substring #: number of target hosts of which the query is a substring divided by the total number of different hosts clicked for the query
– Substring ratio: total number of clicks on target hosts of which the query is a substring divided by the total number of ad clicks for the query
– Deliberation time: the average time between entering a query and an ad click for that query
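To make the clickthrough features more concrete, here is a hedged sketch of how a few of them (Host #, click per host, substring ratio, deliberation time) could be computed from a toy ad-click log; the log schema and the crude substring test are assumptions, not the paper's exact definitions:

```python
from collections import Counter

# Hypothetical ad-click log: (query, clicked ad host, seconds from query to ad click)
ad_clicks = [
    ("nikon finepix", "shop.example.com", 12.0),
    ("nikon finepix", "shop.example.com", 9.5),
    ("nikon finepix", "camerastore.example.org", 20.0),
]

def clickthrough_features(query, clicks):
    rows = [(host, dt) for q, host, dt in clicks if q == query]
    hosts = Counter(host for host, _ in rows)
    n_clicks = len(rows)
    host_count = len(hosts)                              # "Host #"
    click_per_host = n_clicks / host_count               # "Click per host"
    substr_clicks = sum(c for h, c in hosts.items()
                        if query.replace(" ", "") in h.replace(".", ""))
    substring_ratio = substr_clicks / n_clicks           # "Substring ratio" (crude substring test)
    deliberation = sum(dt for _, dt in rows) / n_clicks  # "Deliberation time"
    return dict(host_count=host_count, click_per_host=click_per_host,
                substring_ratio=substring_ratio, deliberation_time=deliberation)

print(clickthrough_features("nikon finepix", ad_clicks))
```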
Intent Classification: Results
[Ashkan et al., ECIR 2009]
Commercial vs. noncommercial classifier:
Setting | Commercial P / R | Noncommercial P / R | Accuracy
Query + SERP + Clickthrough | 0.90 / 0.89 | 0.83 / 0.94 | 90%
Query + SERP | 0.85 / 0.86 | 0.80 / 0.90 | 85.5%

Navigational vs. informational classifier:
Setting | Navigational P / R | Informational P / R | Accuracy
Query + SERP + Clickthrough | 0.86 / 0.81 | 0.87 / 0.80 | 84.5%
Query + SERP | 0.83 / 0.79 | 0.84 / 0.81 | 83.7%
Clickthrough for Varying Intent
[Ashkan et al., ECIR 2009]
Talk Outline
Overview of the Emory IR Lab
Intent-centric Web Search
Classifying intent of a query
Contextualized search intent detection
How Do We Know “True” User Intent?
Adapted from [Daniel M. Russell, 2007]
• Ask the user (surveys, field
studies, pop-ups)
– Does not scale, users get annoyed
• Observe user actions and guess
– Intent usually obvious to humans
but not always
• Detect signals from the user’s brain (fMRI, EEG) and attempt to interpret neural activity
“Eyes are a Window to the Soul”
• Eye tracking (camera-based) gives information about search interests:
– Eye position
– Pupil diameter
– Saccades and fixations
[Figures: gaze patterns during reading vs. visual search]
“An Eye Tracker on Every Table”
• And “nuclear reactor in every back yard”… Unlikely.
• Eye tracking equipment is bulky and expensive
• Can we infer gaze position from observable actions?
• Exploratory study from Google (Rodden et al.) says
maybe: mouse position is sometimes related to eye
position
Relationship Between Mouse and Gaze Position
[K. Rodden, X. Fu, A. Aula, and I. Spiro, Eye-mouse coordination patterns
on web search results pages, Extended Abstracts of ACM CHI 2008]
• Searchers might use
the mouse to focus
reading attention,
bookmark promising
results, or not at all.
• Behavior varies with
task difficulty and
user expertise
Assume “Transitivity” Holds
• Given:
– gaze position ==> user intent, and
– mouse movement ==> gaze position
• Then: mouse movement ==> user intent
• Restated problem: given user actions, infer the current user’s intent, focusing on the individual user’s actions
From Query Type to Search Intent
• “obama” → navigational? or informational? (depends on the instance)
• Other examples:
– Query bookmarks (re-finding): ~40% of queries (J. Teevan et al., SIGIR 2007)
– Research vs. immediate purchase
• It is incorrect to classify the query string into a single intent
→ Instead, classify the user’s goal for each query instance
Dataset Creation: EMU
[System diagram: an HTTP server collects usage data into an HTTP log, which feeds data mining & management and is used to train prediction models]
• Firefox + LibX plugin
• Tracks whitelisted sites, e.g., Emory, Google, Yahoo search…
• All SERP events logged (asynchronous HTTP requests)
• 150 public-use machines, ~5,000 opted-in users
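A sketch of how such a log could be grouped into per-query interaction records; the JSON-lines event format used here is hypothetical, not the actual EMU schema:

```python
import json
from collections import defaultdict

def group_events_by_query(log_lines):
    """Group logged SERP events into per-(session, query) interaction sequences, sorted by time."""
    interactions = defaultdict(list)
    for line in log_lines:
        ev = json.loads(line)
        interactions[(ev["session"], ev["query"])].append(ev)
    for key in interactions:
        interactions[key].sort(key=lambda ev: ev["t"])
    return interactions

# Hypothetical log lines: one JSON event per asynchronous request.
sample = [
    '{"session": "s1", "query": "spanish wine", "event": "mousemove", "x": 210, "y": 340, "t": 1.42}',
    '{"session": "s1", "query": "spanish wine", "event": "click", "x": 215, "y": 350, "t": 3.10}',
]
print(group_events_by_query(sample))
```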
EMU: Querying Behavior Data
Playback Example
Problem 1: Search Intent Classification
• Infer “personalized” intent {NAV, INFO, TRANSACT} for each search instance using the EMU instrumentation
– “obama” instance 1 → NAV, but instance 2 → INFO
• Focus: the contribution of client/GUI events (mouse movements)
Navigational query: “facebook”
Informational query: “spanish wine”
Transactional query: “integrator”
Mouse Features: Simple
• First representation:
– Trajectory length
– Horizontal range
– Vertical range
Mouse Features: Full
• Second representation:
– 5 segments:
initial, early, middle, late,
and end
– Each segment:
speed, acceleration, rotation,
slope, etc.
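A sketch of both mouse-feature representations on a toy (x, y, t) trajectory; the exact definitions of length, range, speed, and slope are plausible guesses rather than the formulas used in the study:

```python
import math

def simple_features(points):
    """Trajectory length plus horizontal and vertical range of a mouse path [(x, y, t), ...]."""
    xs, ys, _ = zip(*points)
    length = sum(math.dist(points[i][:2], points[i + 1][:2]) for i in range(len(points) - 1))
    return {"trajectory_length": length,
            "horizontal_range": max(xs) - min(xs),
            "vertical_range": max(ys) - min(ys)}

def segment_features(points, n_segments=5):
    """Per-segment speed and slope for the initial/early/middle/late/end pieces of the path."""
    feats = {}
    step = max(1, len(points) // n_segments)
    for s in range(n_segments):
        seg = points[s * step:(s + 1) * step + 1]
        if len(seg) < 2:
            continue
        dist = sum(math.dist(seg[i][:2], seg[i + 1][:2]) for i in range(len(seg) - 1))
        dt = seg[-1][2] - seg[0][2] or 1e-9
        dx = seg[-1][0] - seg[0][0] or 1e-9
        feats[f"seg{s + 1}_speed"] = dist / dt
        feats[f"seg{s + 1}_slope"] = (seg[-1][1] - seg[0][1]) / dx
    return feats

path = [(10, 20, 0.0), (40, 25, 0.2), (80, 60, 0.5), (120, 90, 0.9), (130, 200, 1.4), (135, 400, 2.0)]
print(simple_features(path))
print(segment_features(path))
```

In the actual system these points would come from the logged SERP mouse events rather than a hand-written toy path.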
Learning to recover single search intent
• Represent full client-side interactions with each SERP page as
feature vectors
• Apply standard machine learning classification methods
[Pipeline diagram: feature vectors (training, test) and manual labels feed CSIP (using WEKA SVM, decision trees, etc.), which outputs predictions for the test instances]
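A hedged sketch of this training/evaluation loop, substituting scikit-learn for the WEKA classifiers named on the slide; the feature vectors and labels are random stand-ins for the real per-query interaction features and manual labels:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Stand-in data: one feature vector per query instance (e.g. mouse + click features),
# with a manual intent label {0: NAV, 1: INFO, 2: TRANSACT}.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 12))
y = rng.integers(0, 3, size=300)

for name, clf in [("SVM", SVC(kernel="rbf")),
                  ("Decision tree", DecisionTreeClassifier(max_depth=5))]:
    scores = cross_val_score(clf, X, y, cv=5)      # 5-fold cross-validation accuracy
    print(f"{name}: mean accuracy {scores.mean():.2f}")
```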
Experimental Setup
• Dataset:
– Gathered from mid-January 2008 until mid-March 2008 from the public-use machines in the Emory University libraries
– Consists of ~1,500 initial query instances/search sessions
– Randomly sampled 300 initial query instances
• Behavioral patterns for follow-up queries might be different
Creating “Truth” Labels
• Use our best guess based on clues:
– Query terms
– Next URL (e.g., the clicked result)
– How the user behaves before the click/exit
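An illustrative version of such a labeling rule; the cues and thresholds below are invented, and in the study these clues were weighed by hand rather than by a script:

```python
def guess_intent(query, clicked_url, dwell_seconds, results_examined):
    """Heuristic 'truth' label from query text, clicked URL, and pre-click behavior (illustrative only)."""
    q = query.lower().replace(" ", "")
    if clicked_url and q in clicked_url.lower() and results_examined <= 1:
        return "NAV"       # query is a substring of the clicked site and the user went straight there
    if any(w in query.lower() for w in ("buy", "download", "login", "order")):
        return "TRANSACT"
    if dwell_seconds > 30 or results_examined > 2:
        return "INFO"      # lingering on the SERP / examining several results
    return "INFO"

print(guess_intent("facebook", "http://www.facebook.com", 4, 1))        # NAV
print(guess_intent("spanish wine", "http://wine.example.org", 45, 3))   # INFO
```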
Intent Statistics in Labeled Sample
Results: Classifying Search Intent
CSIP > CF >> CS > S
Results II: {Info/Transact} vs. Nav
All improved.
Still, CSIP > CF >> CS > S
Salient features (by Info Gain)
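A sketch of ranking features by information gain; scikit-learn’s mutual information estimator stands in for the info-gain computation here, and the feature names and data are synthetic:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(1)
feature_names = ["trajectory_length", "vertical_range", "deliberation_time", "click_count"]
X = rng.normal(size=(300, len(feature_names)))
# Synthetic label that depends mostly on two of the features.
y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=300) > 0).astype(int)

gains = mutual_info_classif(X, y, random_state=0)
for name, g in sorted(zip(feature_names, gains), key=lambda p: -p[1]):
    print(f"{name}: {g:.3f}")
```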
Case Studies Summary
• CSIP can help identify:
– Relatively rare navigational queries (re-finding queries or
queries for obscure websites)
– Informational queries that resemble navigational queries (e.g., the query coincides with the name of a website)
Outline
• Overview of research at the Emory IR Lab
• Dimensions of (commercial) search intent
• Classifying intent of a query
• Contextualized search intent detection
Informational vs. Transactional:
Research vs. Purchase Intent
• 10 users (grad students and staff) were asked to:
– 1. Search for the best deal on an item they want to purchase immediately (Purchase intent)
– 2. Research a product they want to purchase eventually (Research intent)
• Eye tracking and browser instrumentation run in parallel
• EyeTech TM3 eye tracker (integrated)
– At reasonable resolution, samples reliably at ~12–15 Hz
Research Intent
Purchase Intent
Relationship between behavior and intent?
• Search intent is contextualized within a
search session
• Implication 1: model session-level state
• Implication 2: improve detection based on client-side interactions
Contextualized Intent Inference
• SERP text
• Mouse trajectory, hovering/dynamics
• Scrolling
• Clicks
Model: Linear Chain CRF
Conditional Random Fields (CRFs)
[from Lafferty, McCallum, Pereira 2001]
From HMMs to MEMMs to CRFs. Let $\vec{s} = s_1, s_2, \ldots, s_n$ be the hidden state sequence and $\vec{o} = o_1, o_2, \ldots, o_n$ the observation sequence.

HMM (joint model):
$P(\vec{s}, \vec{o}) = \prod_{t=1}^{|\vec{o}|} P(s_t \mid s_{t-1})\, P(o_t \mid s_t)$

MEMM (locally normalized conditional model):
$P(\vec{s} \mid \vec{o}) = \prod_{t=1}^{|\vec{o}|} P(s_t \mid s_{t-1}, o_t) = \prod_{t=1}^{|\vec{o}|} \frac{1}{Z_{s_{t-1}, o_t}} \exp\Big( \sum_j \lambda_j f_j(s_t, s_{t-1}) + \sum_k \mu_k g_k(s_t, o_t) \Big)$

CRF (globally normalized conditional model):
$P(\vec{s} \mid \vec{o}) = \frac{1}{Z_{\vec{o}}} \prod_{t=1}^{|\vec{o}|} \exp\Big( \sum_j \lambda_j f_j(s_t, s_{t-1}) + \sum_k \mu_k g_k(s_t, o_t) \Big)$

[Graphical models: HMM with directed arcs $s_{t-1} \to s_t$ and $s_t \to o_t$; MEMM with arcs into $s_t$ from both $s_{t-1}$ and $o_t$; CRF with an undirected chain over the states $s_t$ conditioned on the observations $o_t$]
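To connect the CRF formula to code, here is a small NumPy sketch that scores a label sequence under a toy linear-chain CRF: fixed transition and emission scores play the role of the weighted features $\lambda_j f_j$ and $\mu_k g_k$, and the partition function $Z_{\vec{o}}$ is computed with the forward algorithm. All weights, states, and the sequence length are invented for illustration:

```python
import numpy as np

# Toy linear-chain CRF over two hidden "receptiveness" states and a 3-step observation sequence.
STATES = ["receptive", "not_receptive"]
W_trans = np.array([[1.0, -0.5],      # score of moving from state i to state j
                    [-0.5, 1.0]])
W_emit = np.array([[0.8, 0.1, 1.2],   # per-step observation score for "receptive"
                   [0.2, 0.9, 0.1]])  # per-step observation score for "not_receptive"

def sequence_log_prob(states):
    """log P(s | o) = score(s, o) - log Z(o) for the toy linear-chain CRF above."""
    idx = [STATES.index(s) for s in states]
    score = W_emit[idx[0], 0] + sum(W_trans[idx[t - 1], idx[t]] + W_emit[idx[t], t]
                                    for t in range(1, len(idx)))
    # Forward algorithm (in log space) for the partition function log Z(o).
    alpha = W_emit[:, 0].copy()
    for t in range(1, W_emit.shape[1]):
        alpha = W_emit[:, t] + np.logaddexp.reduce(alpha[:, None] + W_trans, axis=0)
    log_Z = np.logaddexp.reduce(alpha)
    return score - log_Z

print(np.exp(sequence_log_prob(["receptive", "receptive", "receptive"])))
```

In the actual model the emission scores would be computed from the per-step SERP text, mouse, scrolling, and click features listed above, and the weights would be learned from labeled sessions.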
Problem 2: Search Ad Receptiveness
• Hypothesis: the right time to serve search ads is when the searcher is receptive to seeing ads
• Receptiveness ≈ a particular kind of search intent:
– Commercial? (navigational or informational)
– Non-commercial?
– “Background” interest
Predict Future Ad Clicks Within Session
Dataset: 440 Emory College Students
Results: Ad Click Prediction
• 200%+ precision improvement (within mission)
Varying Model Structure
Feature Analysis
Error Analysis: Mouse Noise
Within-mission intent
change/frustration/digression
Current and Future Work
• Unsupervised intent clustering
• User vs. task
• Personalized behavior models
• Long-term interests/effects
• User mental state (frustration, satisfaction, …)
Challenges
• Separate context from intent (e.g., smart phones)
• User variability: individual differences, tasks
• Scale of data: representation, compression
• Privacy: client-side data similar to other PII
– Can be abused and must be protected
• Obtaining realistic user data: see above
– EMU toolbar tracking since 2007 in Emory Libraries (biased)
Other Application Areas
• Search: ranking, evaluation, advertising, search
interfaces, medical search (clinicians, patients)
• Collaborative information sharing: searcher intent,
success, expertise, content quality
• Health informatics: self reporting of drug side effects,
co-morbidity, outreach/education
• Automatic cognitive diagnostics: stress, frustration,
other impairments….
Summary: From Behavior to State of Mind
• Approach:
– Machine learning methods for detecting searcher intent
– Calibrated and augmented with lab studies
• Foundational contributions:
– Methods to mine and integrate wide range of interactions
– Data-driven discovery of user state-of-mind
• Impact:
– Intelligent, intuitive search and information sharing
Main References
• Azin Ashkan, Charles L. A. Clarke, Eugene Agichtein, and Qi Guo. Classifying and Characterizing Query Intent. In Proc. of ECIR 2009.
• Qi Guo and Eugene Agichtein. Exploring Client-Side Instrumentation for Personalized Search Intent Inference: Preliminary Experiments. In Proc. of the AAAI 2008 Workshop on Intelligent Techniques for Web Personalization (ITWP 2008).
• Qi Guo, Eugene Agichtein, Azin Ashkan, and Charles L. A. Clarke. In the Mood to Click? Inferring Searcher Advertising Receptiveness. In Proc. of WI 2009.
• Other papers: http://www.mathcs.emory.edu/~eugene/publications.html
Thank you!
• Yandex (for hosting my visit)
Supported by: