Bruce Croft
Center for Intelligent Information Retrieval
UMass Amherst
• Query Representation and Understanding Workshop at SIGIR 2010
• Research projects in the CIIR
• “Query intent” has become a popular phrase at conferences and at companies
• Research with query logs = acceptance of paper
• Few standards in these papers about test collections, metrics, even tasks
• Query processing has been part of IR for a long time
– e.g., stemming, expansion, relevance feedback
• Most retrieval models say little about queries
• So, what’s going on and what’s interesting?
• Query intent (or search intent) is the same thing as information need
– The notion of an information need or problem underlying a query has been discussed in the IR literature for many years, and it was generally agreed that query intent is another way of referring to the same idea
• Query representation involves modeling the intent or need
– Query understanding refers to the process of identifying the underlying intent or need based on a particular representation
• Intent classes, intent dimensions, and query classes
– terms used to talk about the many different types of information needs and problems
• Query rewriting, query transformation, query refinement, query alteration, and query reformulation
– names given to the process of changing the original query to better represent the underlying intent (and consequently improve ranking)
• Query expansion, substitution, reduction, segmentation
– some of the techniques or steps used in the query transformation process
• Query
– Most research assumes the query is the string entered by the user
– Transformation can produce many different representations of the query
– The difference between explicit and implicit queries is important
• How to develop a unified and general framework for query understanding?
• How to formally define a query representation?
• How to develop new system architectures for query understanding?
• How to combine query understanding with other components in information retrieval systems?
• How to conduct evaluations of query understanding?
• How to make effective use of both human knowledge and machine learning in query understanding?
• Long query relevance
• Query reduction
• Similar query finding
• Query classification
• Named entity recognition in queries
• Context-aware search
– Intent-aware search
• Must agree on tasks, evaluation metrics, and text collections
• TREC-style vs. “black-box” evaluations
• Crowdsourcing for annotations
• Resources such as query collections, document collections, query logs, etc. differ widely in their availability in academic and industry settings
• Document collections – TREC ClueWeb collection preferred
• Query collections – need collections of different query types (e.g. long, location, product…) validated by industry
• Query logs – critical resource for some approaches, not available in academia. Alternatives include MSN/AOL logs, KDD queries, anchor text logs, logs from other applications (Wikipedia), logs from some restricted environment (e.g. academic library)
• N-grams, etc. – corpus and query language statistics from web collections
• Modeling structure in queries
• Modeling distributions of queries
• Modeling diversity in queries
• Transforming long queries
• Generating queries from documents
• Generating query logs from anchor text
• Finding similar queries
• User inputs a string of characters
• Query structure is never explicitly observed and is difficult to infer
– Short and ambiguous search queries: “new york times square”
– Idiosyncratic grammar: “do grover cleveland have kids”
– No capitalization or punctuation: “talking to heaven movie”
• A query Q has a hierarchical representation
– A query is a set of structures: $Q = \{S_1, \ldots, S_n\}$
– Each structure is a set of concepts: $S_i = \{c_1, c_2, \ldots\}$
• The hierarchical representation makes it possible to
– Model arbitrary term dependencies as concepts
– Group concepts by structures
– Assign weights to concepts/structures
Query: members rock group nirvana

Structure      Concepts
Terms          [members] [rock] [group] [nirvana]
Bigrams        [members rock] [rock group] [group nirvana]
Chunks         [members] [rock group] [nirvana]
Key Concepts   [members] [nirvana]
Dependence     [members nirvana] [rock group]
[Figure: a document linked to the concepts of Structure 1 through Structure n]
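A minimal Python sketch of this representation, assuming a concept is a tuple of query terms. Only the Terms and Bigrams structures are built, because they follow mechanically from the token sequence; chunks, key concepts, and dependencies would need a trained annotator. The names `Structure` and `represent` are illustrative, not from the slides.

```python
from dataclasses import dataclass
from typing import List, Tuple

Concept = Tuple[str, ...]  # a concept is a tuple of one or more query terms


@dataclass
class Structure:
    name: str
    concepts: List[Concept]


def term_structure(tokens):
    """Every query term is its own concept."""
    return Structure("Terms", [(t,) for t in tokens])


def bigram_structure(tokens):
    """Every pair of adjacent query terms is a concept."""
    pairs = [tuple(tokens[i:i + 2]) for i in range(len(tokens) - 1)]
    return Structure("Bigrams", pairs)


def represent(query):
    """Return the query as a list of structures (terms and bigrams only)."""
    tokens = query.split()
    return [term_structure(tokens), bigram_structure(tokens)]


for structure in represent("members rock group nirvana"):
    print(structure.name, structure.concepts)
```

Weights could be attached per concept or per structure by adding a field to either class.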
Weighted Sequential Dependence Model (WSD)
• Allow the parameters of the sequential dependence model to depend on the concept
• Assume the parameters take a simple parametric form
– maintains reasonable model complexity
– w: free parameters
– g: concept importance features
[Bendersky, Metzler, and Croft, 2009]
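One simple parametric form consistent with this description is a linear combination of the importance features. The symbol $\lambda(c)$ for the weight of concept $c$ and the feature count $k$ are notation assumed here; in practice, separate parameter sets would be kept for terms and for bigrams.

```latex
\lambda(c) = \sum_{j=1}^{k} w_j \, g_j(c)
```

With $k$ features, the model has $k$ free parameters per concept type, regardless of vocabulary size.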
• Features g define the concept importance
• Depend on the concept (term/bigram)
• Independent of a specific document/document corpus
• Combine several sources for more accurate weighting
– Endogenous features: collection-dependent
– Exogenous features: collection-independent
• Score document D by:
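A rough Python sketch of weighted scoring, under several assumptions: each concept weight is a dot product of free parameters and features, the per-concept match score is a Dirichlet-smoothed log-probability, and bigrams are matched only as exact ordered pairs (unordered-window matching is left out for brevity). All function names, the smoothing parameter `mu`, and the toy feature values are illustrative.

```python
import math


def concept_weight(w, g):
    """Weight of one concept: dot product of free parameters and features."""
    return sum(wj * gj for wj, gj in zip(w, g))


def log_prob(count, doc_len, coll_prob, mu=2500.0):
    """Dirichlet-smoothed log-probability of a concept in a document."""
    return math.log((count + mu * coll_prob) / (doc_len + mu))


def wsd_score(query, doc, features, w_term, w_bigram, coll_prob):
    """Sum of weighted term matches and weighted exact-bigram matches."""
    score = 0.0
    for t in query:
        tf = doc.count(t)
        weight = concept_weight(w_term, features[(t,)])
        score += weight * log_prob(tf, len(doc), coll_prob[(t,)])
    doc_bigrams = list(zip(doc, doc[1:]))
    for bg in zip(query, query[1:]):
        tf = doc_bigrams.count(bg)
        weight = concept_weight(w_bigram, features[bg])
        score += weight * log_prob(tf, len(doc), coll_prob[bg])
    return score


# Toy data: one feature per concept, uniform collection probabilities.
features = {("civil",): [1.0], ("war",): [1.0], ("civil", "war"): [1.0]}
coll_prob = {concept: 0.001 for concept in features}
d1 = "the civil war ended".split()
d2 = "the war was civil".split()
s1 = wsd_score(["civil", "war"], d1, features, [0.5], [0.5], coll_prob)
s2 = wsd_score(["civil", "war"], d2, features, [0.5], [0.5], coll_prob)
print(s1 > s2)  # d1 contains the bigram "civil war", d2 does not
```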
Query: “civil war battle reenactments”

                      Importance Features
Concept               GF     …    DF     Weight
civil                 16.9   …    14.1   0.0619
war                   17.9   …    12.8   0.1947
battle                16.6   …    12.6   0.0913
reenactments          10.8   …    9.7    0.3487
civil war             14.5   …    10.8   0.1959
war battle            9.5    …    7.4    0.2458
battle reenactments   7.6    …    4.7    0.0540
Concept weights may vary even if concept DF is similar
Good segments do not necessarily predict important concepts
[Figure: retrieval effectiveness of QL, SD, and WSD on ROBUST04, WT10G, and GOV2 (y-axis from 0.1 to 0.3), annotated with +6.3%, +1.6%, and +24.1%]
Distribution of Terms (DOT): words + phrases, original or new
– Relevance Model [Lavrenko and Croft, SIGIR01]
– DOT does not consider how …
Single Reformulated Query (SRQ): a single reformulation operation
– Query Segmentation
– Query Substitution
– SRQ does not consider information about alternative reformulations
Uncertainty in PRF [Collins-Thompson and Callan, SIGIR07]
Distribution of Queries (DOQ): each query is the output of applying single or multiple reformulation operations
Original TREC query: oil industry history

Distribution of Terms (DOT)
– Relevance Model: { 0.44 “industry”, 0.28 “oil”, 0.08 “petroleum”, 0.08 “gas”, 0.08 “county”, 0.04 “history”, … }
– Sequential Dependence Model [Metzler, SIGIR05]: { 0.28 “oil”, 0.28 “industry”, 0.28 “history”, 0.08 “oil industry”, 0.08 “industry history”, … }
Single Reformulated Query (SRQ)
– Query Substitution: “petroleum industry history”
– Query Segmentation: “(oil industry)(history)”
Distribution of Queries (DOQ)
– 0.28 “(oil industry)(history)”
– 0.24 “(petroleum industry)(history)”
– 0.20 “(oil and gas industry)(history)”
– 0.18 “(oil)(industrialized)(history)” …
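To illustrate how a distribution of queries might be used at retrieval time, the Python sketch below scores a document as a probability-weighted mixture of per-query scores. The mixture rule, the renormalization over the listed queries, and the toy `segment_match_score` function are all assumptions made for illustration; the slides do not state how the distribution is combined.

```python
def segment_match_score(segments, doc_text):
    """Toy scorer: fraction of query segments found as whole phrases."""
    padded = f" {doc_text} "
    hits = sum(1 for seg in segments if f" {seg} " in padded)
    return hits / len(segments)


def doq_score(query_dist, doc_text, score_fn=segment_match_score):
    """Mixture of per-query scores, renormalized over the listed queries."""
    total = sum(p for _, p in query_dist)
    return sum((p / total) * score_fn(q, doc_text) for q, p in query_dist)


# The four reformulations listed above (the remaining mass is not shown).
query_dist = [
    (("oil industry", "history"), 0.28),
    (("petroleum industry", "history"), 0.24),
    (("oil and gas industry", "history"), 0.20),
    (("oil", "industrialized", "history"), 0.18),
]
doc = "a history of the petroleum industry"
print(round(doq_score(query_dist, doc), 3))
```

A document that matches only a lower-probability reformulation still receives credit, which a single reformulated query cannot provide.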
• Reducing Long Queries [Xue, Huston, and Croft, CIKM2010]
– A novel CRF-based model learns a distribution over subset queries and directly optimizes retrieval performance
– (1): using the top 1 subset query; (K): using the top K subset queries
– q, d indicate significantly better than QL and DM
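The candidate space for query reduction can be sketched in Python as follows. This is not the CRF-based model: the `quality` function is a hypothetical stand-in for a learned scorer, and the term weights are invented for the example. It only shows what "subset queries" and "top K" mean operationally.

```python
from itertools import combinations


def subset_queries(terms, min_len=1):
    """Yield every non-empty, order-preserving subset of the query terms."""
    for k in range(min_len, len(terms) + 1):
        for combo in combinations(terms, k):
            yield combo


def top_k_subsets(terms, quality, k=1):
    """Rank all subset queries by a quality function and keep the best k."""
    ranked = sorted(subset_queries(terms), key=quality, reverse=True)
    return ranked[:k]


# Hypothetical term weights standing in for a learned model.
weights = {"history": 1.0, "of": -1.0, "the": -1.0, "oil": 2.0, "industry": 1.5}


def quality(subset):
    return sum(weights[t] for t in subset)


long_query = "history of the oil industry".split()
print(top_k_subsets(long_query, quality, k=1))
```

Exhaustive enumeration grows as 2^n in the query length, so a real system would need to prune or sample candidates.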
• The context of a word is the unigram preceding it
• Context distribution
– The probability that the term $c_i$ appears in w’s context:
$$P(c_i \mid w) = \frac{\mathrm{count}_w(c_i)}{\sum_{c_j \in C(w)} \mathrm{count}_w(c_j)}$$
• The translation model
– Based on the KL divergence between the context distributions of w and s:
$$t(s \mid w) = \frac{e^{-D(P(\cdot \mid w) \,\|\, P(\cdot \mid s))}}{Z}$$
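A small Python sketch of the context distribution and translation model, assuming the context is the single preceding word and that Z normalizes over all candidate terms other than w. The additive smoothing constant `eps` is an assumption added so that the KL divergence stays finite when two words share no contexts; the toy query list is invented.

```python
import math
from collections import Counter, defaultdict


def context_distributions(queries):
    """P(c | w): relative frequency of each word c directly preceding w."""
    counts = defaultdict(Counter)
    for q in queries:
        tokens = q.split()
        for prev, w in zip(tokens, tokens[1:]):
            counts[w][prev] += 1
    return {w: {c: n / sum(ctr.values()) for c, n in ctr.items()}
            for w, ctr in counts.items()}


def kl_divergence(p, q, vocab, eps=1e-6):
    """D(p || q) over vocab, with additive smoothing (an assumption)."""
    def smooth(dist, c):
        return (dist.get(c, 0.0) + eps) / (1.0 + eps * len(vocab))
    return sum(smooth(p, c) * math.log(smooth(p, c) / smooth(q, c))
               for c in vocab)


def translation_probs(w, dists, eps=1e-6):
    """t(s | w) = exp(-D(P(.|w) || P(.|s))) / Z for every candidate s."""
    vocab = sorted({c for d in dists.values() for c in d})
    raw = {s: math.exp(-kl_divergence(dists[w], dists[s], vocab, eps))
           for s in dists if s != w}
    z = sum(raw.values())
    return {s: v / z for s, v in raw.items()}


log = ["cheap airfare", "cheap flight", "cheap airfare deals",
       "cheap flight deals", "new york hotel"]
dists = context_distributions(log)
probs = translation_probs("airfare", dists)
print(max(probs, key=probs.get))  # the candidate with the closest context
```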
• The substitution model
– How well the new term fits the context of the current query
– $Q = q_1, \ldots, q_{i-2}, q_{i-1}, q_i, q_{i+1}, q_{i+2}, \ldots, q_n$, candidate $= s$
$$P(w_i \to s) \propto t(s \mid w_i) \times P(q_{i-2}\, q_{i-1}\, \_\, q_{i+1}\, q_{i+2} \mid s)$$
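A hedged Python sketch of the substitution score: the translation probability multiplied by how well the candidate fits the surrounding query terms within a window of two positions on each side. Treating the context words as independent given s, and the `floor` value for unseen pairs, are simplifying assumptions; the lookup tables `t` and `context_prob` hold invented numbers and would normally be estimated from a corpus or query log.

```python
def substitution_score(query, i, s, t, context_prob, window=2, floor=1e-6):
    """Score replacing query[i] with s.

    t maps (s, w) to t(s | w); context_prob maps (c, s) to P(c | s).
    Context words are multiplied independently (an assumption).
    """
    lo = max(0, i - window)
    hi = min(len(query), i + window + 1)
    fit = 1.0
    for j in range(lo, hi):
        if j != i:
            fit *= context_prob.get((query[j], s), floor)
    return t.get((s, query[i]), 0.0) * fit


query = ["cheap", "airfare", "deals"]
t = {("flight", "airfare"): 0.6, ("ticket", "airfare"): 0.4}
context_prob = {("cheap", "flight"): 0.5, ("deals", "flight"): 0.2,
                ("cheap", "ticket"): 0.3}

flight = substitution_score(query, 1, "flight", t, context_prob)
ticket = substitution_score(query, 1, "ticket", t, context_prob)
print(flight > ticket)
```

A candidate with a decent translation probability can still lose if it rarely co-occurs with the rest of the query.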
• Probabilities are estimated from a corpus or a query log
– Using text passages is nearly the same as pseudo-relevance feedback
• Query expansion is similar to substitution
– We add the new term and keep the original term
– Substitution: “cheap airfare” → “cheap flight”
– Expansion: “cheap airfare” → “cheap airfare flight”
• Stemming
– New terms are restricted to those sharing a Porter-stemmed root with the original term
– “drive direction” → “drive driving direction”
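The difference between the two operations is small enough to show directly. In this Python sketch (function names are illustrative), substitution replaces the original term while expansion inserts the new term right after it; the stemming case is treated as an expansion whose new term is supplied by the caller rather than computed by a stemmer.

```python
def substitute(query, old, new):
    """Replace every occurrence of old with new."""
    return [new if term == old else term for term in query]


def expand(query, old, new):
    """Keep old and insert new immediately after it."""
    out = []
    for term in query:
        out.append(term)
        if term == old:
            out.append(new)
    return out


print(substitute(["cheap", "airfare"], "airfare", "flight"))
print(expand(["cheap", "airfare"], "airfare", "flight"))
print(expand(["drive", "direction"], "drive", "driving"))
```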
• Extract < anchor , url > pairs from the Gov-2 collection to create the anchor log [Dang and Croft, 2009]
                    MSN Log      Anchor Log
# Total Queries     14 million   526 million
# Unique Queries    6 million    20 million
Avg. Query Length   2.68         2.62
• The anchor log is very noisy
– “click here”, “print version”, … don’t represent the linked page
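A minimal Python sketch of building such a log from HTML, using the standard-library `html.parser`. The `NOISE` set contains only the two examples named above and is an assumption about how filtering might be done; a real stop list would be much larger. The sample markup and URLs are invented.

```python
from html.parser import HTMLParser

# Only the two noisy anchors named on the slide; illustrative, not complete.
NOISE = {"click here", "print version"}


class AnchorExtractor(HTMLParser):
    """Collect (anchor text, url) pairs from <a href=...> elements."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Normalize whitespace and case before filtering.
            text = " ".join("".join(self._text).split()).lower()
            if text and text not in NOISE:
                self.pairs.append((text, self._href))
            self._href = None


def extract_anchor_log(html):
    parser = AnchorExtractor()
    parser.feed(html)
    parser.close()
    return parser.pairs


page = ('<p><a href="http://example.gov/forms">Tax  Forms</a> '
        '<a href="http://example.gov/print">click here</a></p>')
print(extract_anchor_log(page))
```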
• Anchor text gives comparable performance to MSN log for substitution, expansion, stemming
[Dang, Bendersky, and Croft, 2010]
• Reformulating Short Queries [Xue et al, CIKM2010]
– Passage information is used to generate candidate queries and estimate probabilities
– Results reported on Gov2; o, w, m, a represent different methods of generating candidate queries
– q, d, r indicate significantly better than QL, SDM, and RM
• Studying query intent is not new, but more data is leading to many new insights
• Not just a web search issue, but more obvious in web search
• Lots of interesting research to do, but field needs more coherence in terms of research goals, testbeds