Enrich Query Representation by Query Understanding

Gu Xu
Microsoft Research Asia
Mismatching Problem
• Mismatching is a fundamental problem in search
  – Examples:
    • NY ↔ New York, game cheats ↔ game cheatcodes
• Search engine challenges
  – Head (frequent) queries
    • Rich information available: clicks, query sessions, anchor texts, etc.
  – Tail (infrequent) queries
    • Information becomes sparse and limited
• Our proposal
  – Enrich both queries and documents, and conduct matching on the enriched representations.
Matching at Different Semantic Levels

Levels of semantics, from lowest to highest:

• Term — match exactly the same terms
  – e.g., "NY" vs. "New York" and "disk" vs. "disc" do not match at this level
• Sense — match terms with the same meaning
  – e.g., utube ↔ youtube, NY ↔ New York, motherboard ↔ mainboard
• Topic — match the topics of the query and the documents
  – e.g., query "Microsoft Office" (topic: PC Software) vs. document "… working for Microsoft … my office is in …" (topic: Personal Homepage)
• Structure — match intent with answers (structures of query and document)
  – "Microsoft Office home" → find the homepage of Microsoft Office
  – "21 movie" → find the movie named 21
  – "buy laptop less than 1000" → find online dealers selling laptops for less than 1000 dollars
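The difference between term-level and sense-level matching can be illustrated with a toy sketch; the synonym table and function names below are hypothetical, not part of any system described in these slides:

```python
# Term-level matching requires identical terms; sense-level matching
# can bridge synonyms via a (hypothetical) normalization table.
SYNONYMS = {"ny": "new york", "utube": "youtube", "mainboard": "motherboard"}

def normalize(term):
    t = term.lower()
    return SYNONYMS.get(t, t)

def term_match(a, b):
    """Match at the term level: exactly the same terms."""
    return a.lower() == b.lower()

def sense_match(a, b):
    """Match at the sense level: same meaning after normalization."""
    return normalize(a) == normalize(b)

print(term_match("NY", "New York"))   # → False
print(sense_match("NY", "New York"))  # → True
```

Exact term matching misses "NY" vs. "New York"; matching through the synonym table recovers it, which is the gap the higher semantic levels are meant to close.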
Enrich Query Representation

Example query: michael jordan berkele

• Term level — Tokenization
  <token>michael</token> <token>jordan</token> <token>berkele</token>
  – Issues: ambiguity (e.g., msil or mail), equivalence or dependency (department vs. dept, login vs. sign on), and special tokens (C# vs. C, MAX_PATH vs. MAX PATH, 1,000 vs. 1 000)
• Sense level — Query Refinement (for ill-formed queries) and Alternative Query Finding
  <correction token="berkele">berkeley</correction>
  <similar-queries>michael I. jordan berkeley</similar-queries>
• Topic level — Query Classification
  <query-topics>academic</query-topics>
  – Issues: definition of classes, accuracy & efficiency
• Structure level — Query Parsing (for well-formed queries): named entity segmentation and disambiguation
  <person-name>michael jordan</person-name> <location>berkeley</location>
  – Requires a large-scale knowledge base
QUERY REFINEMENT USING CRF-QR
(SIGIR’08)
Query Refinement

Example: "Papers on Machin Learn" → Papers on "Machine Learning"
• Spelling error correction: Machin → Machine
• Inflection: Learn → Learning
• Phrase segmentation: "Machine Learning" as one phrase

The operations are mutually dependent: spelling error correction, inflection, and phrase segmentation must be performed jointly.
Conventional CRF

• Input X = (x0, x1, x2, x3) = (papers, on, machin, learn)
• Output Y = (y0, y1, y2, y3), where each yi ranges over the entire vocabulary (papers, paper, upon, on, in, machin, machine, machines, learns, learning, "machine learning", …)
• Because the label space of each yi is the whole vocabulary, learning and inference with a conventional CRF are intractable.
CRF for Query Refinement

Variables: input query X, refinement operations O, refined query Y.

Letter-level operations:
• Deletion — delete a letter in a word
• Insertion — insert a letter into a word
• Substitution — replace one letter with another
• Exchange — switch two letters in a word
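As an illustrative sketch (not the paper's implementation), the four letter-level operations can enumerate every candidate word one operation away from an input word; the alphabet and function name below are assumptions:

```python
# Generate all strings reachable from `word` by one refinement operation:
# Deletion, Insertion, Substitution, or Exchange (of two adjacent letters).
import string

ALPHABET = string.ascii_lowercase

def candidates(word):
    """All strings one letter-level operation away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletions = {l + r[1:] for l, r in splits if r}
    insertions = {l + ch + r for l, r in splits for ch in ALPHABET}
    substitutions = {l + ch + r[1:] for l, r in splits if r for ch in ALPHABET}
    exchanges = {l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1}
    return deletions | insertions | substitutions | exchanges

print("machine" in candidates("machin"))  # → True (Insertion of 'e')
```

Conditioning on the operation is what lets the CRF-QR model work over this small candidate set instead of the whole vocabulary.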
CRF for Query Refinement (cont.)

• Candidates for x2 = "machin" include machined, machi, macin, machine, machina, machining, …; candidates for x3 = "learn" include learned, lear, clearn, blearn, lean, learning, …
• Operations such as Insertion (+ed, +ing) and Deletion index these candidates.

Key ideas:
1. O constrains the mapping from X to Y (reduces the search space).
2. O indexes the mapping from X to Y (shares parameters across words).
NAMED ENTITY RECOGNITION IN
QUERY (SIGIR’09, SIGKDD’09)
Named Entity Recognition in Query

• harry potter → Movie (0.5), Book (0.4), Game (0.1)
• harry potter film → Movie (0.95)
• harry potter author → Book (0.95)
Challenges
• Named entity recognition in queries vs. in documents
• Challenges
  – Queries are short (2–3 words on average)
    • Fewer context features
  – Queries are not well formed (typos, lower-cased, …)
    • Fewer content features
• Knowledge base
  – Coverage and freshness
  – Ambiguity
Our Approach to NERQ

Example: query q = "Harry Potter Walkthrough" decomposes into
• e = "Harry Potter" (named entity)
• t = "# Walkthrough" (context)
• c = "Game" (class)

• The goal of NERQ becomes finding the best triple (e, t, c)* for query q:

  (e, t, c)* = argmax_{(e,t,c) ∈ G(q)} p(e, t, c, q)
             = argmax_{(e,t,c) ∈ G(q)} p(e) p(c|e) p(t|c)
Training With Topic Model

• Ideal training data T = {(e_i, t_i, c_i)}:

  max Π_i p(e_i, t_i, c_i)

• Real training data T = {(e_i, t_i, *)} — the class is unobserved:
  – Queries are ambiguous (harry potter, harry potter review)
  – Training data are relatively scarce

  max Π_i Σ_c p(e_i, t_i, c) = max Π_i Σ_c p(e_i) p(c|e_i) p(t_i|c)
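A small numeric sketch of the marginalized objective with the class treated as latent; every probability below is an invented toy value, with the log taken for numerical convenience:

```python
# Evaluate sum_i log sum_c p(e_i) p(c|e_i) p(t_i|c) on toy training
# pairs (e_i, t_i) whose class c is unobserved. All numbers are made up.
import math

p_e = {"harry potter": 0.01}
p_c_given_e = {"harry potter": {"Movie": 0.5, "Book": 0.4, "Game": 0.1}}
p_t_given_c = {
    "# walkthrough": {"Movie": 0.01, "Book": 0.01, "Game": 0.6},
    "# review":      {"Movie": 0.4,  "Book": 0.4,  "Game": 0.2},
}

def log_likelihood(pairs):
    """Log of the marginalized objective over the latent class c."""
    total = 0.0
    for e, t in pairs:
        total += math.log(sum(
            p_e[e] * p_c_given_e[e][c] * p_t_given_c[t][c]
            for c in p_c_given_e[e]
        ))
    return total

pairs = [("harry potter", "# walkthrough"), ("harry potter", "# review")]
print(log_likelihood(pairs))
```

Maximizing this objective over the tables p(c|e) and p(t|c) is exactly the topic-model estimation described on the next slide.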
Training With Topic Model (cont.)

Grouping the objective by entity e:

  max Π_e p(e)^{n_e} Π_{i: e_i = e} Σ_c p(c|e) p(t_i|c), where n_e = |{i : e_i = e}|

This has the structure of a topic model:
• Named entities e (harry potter, kung fu panda, iron man, …) play the role of documents
• Contexts t (# wallpapers, # movies, # walkthrough, # book price, …) play the role of words
• Classes c (Movie, Game, Book, …) play the role of topics

(# is a placeholder for the named entity; here # means "harry potter".)
Weakly Supervised Topic Model
• Introducing supervision
  – Supervision always helps
  – Aligns the implicit topics with explicit classes
• Weak supervision
  – Label named entities rather than queries (analogous to document class labels)
  – Multiple class labels per entity (binary indicators)
• Example: "Kung Fu Panda" has an unknown distribution over the classes Movie, Game, and Book.
WS-LDA

• LDA + soft constraints (w.r.t. the supervision):

  L(w, y) = log p(w | α, β) + λ C(y, θ)

  where log p(w | α, β) is the LDA log-likelihood and C(y, θ) is the soft-constraint term.

• Soft constraints:

  C(y, θ) = Σ_i y_i z_i

  where z_i is the document's probability on the i-th class (topic) and y_i ∈ {0, 1} is the document's binary label on the i-th class.
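The soft-constraint term can be illustrated numerically; the labels and class distribution below are made up, and the full WS-LDA objective would add the LDA likelihood term:

```python
# C(y, z) = sum_i y_i * z_i rewards topic (class) probability mass
# placed on classes whose binary supervision label y_i is 1.
def soft_constraint(y, z):
    """Soft constraint for binary labels y and class probabilities z."""
    return sum(yi * zi for yi, zi in zip(y, z))

# "Kung Fu Panda" labeled as Movie and Game, but not Book:
y = [1, 1, 0]        # (Movie, Game, Book) binary labels
z = [0.6, 0.3, 0.1]  # inferred class distribution
print(soft_constraint(y, z))
```

For this toy entity the constraint sums the mass on Movie and Game (0.6 + 0.3 ≈ 0.9), so maximizing it pushes the inferred topics toward the labeled classes.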
Extension: Leveraging Clicks

• Original contexts t: query contexts (# wallpapers, # movies, # walkthrough, # book price, …)
• Extended contexts t′ from clicked results: URL words, title words, snippet words, content words, the clicked host name, and other features
  – Example clicked hosts: www.imdb.com, www.wikipedia.com, www.gamespot.com, www.sparknotes.com, cheats.ign.com, …
• Classes: Movie, Game, Book, …
Summary
The goal of query understanding is to enrich the query representation and thereby essentially solve the problem of term mismatching.
THANKS!