Generating Query Substitutions

Alicia Wood
What is the problem to be solved?
Problem
• Imperfect description of need
• Search engine not able to retrieve documents matching query
• Need accurate and related query substitutions
Problem (cont.)
• Given a query
• Want to generate modified query (related)
– Improvements (specification)
– Neutral (spelling change, synonym)
– Loss of original meaning (generalization)
Who cares about this problem and why?
Who cares?
• User typing the query
• Want correct results with imperfect query
What have others done to solve this problem and why is this inadequate?
Previous Work
• Relevance/Pseudo relevance feedback
• Query term deletion
• Substituting query terms with related terms
• Latent Semantic Indexing (LSI)
Relevance/Pseudo relevance feedback
• Submit query for initial retrieval
• Process resulting documents
• Modify the query by expanding with
additional terms from documents
• Perform second retrieval with modified
query
• Can cause query drift
• Computationally expensive
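The feedback loop above can be sketched in a few lines. This is a minimal illustration, not the method of any particular system: the toy collection `DOCS`, the term-count scoring in `retrieve`, and the parameters `k` and `n_terms` are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical toy collection; a real system would query a search index.
DOCS = {
    "d1": "jaguar car dealer jaguar xj6 car",
    "d2": "jaguar car review xj6",
    "d3": "apple os x jaguar release",
}

def retrieve(query_terms, docs, k=2):
    """Rank documents by how often they contain the query terms (toy scoring)."""
    scored = []
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        score = sum(tokens.count(t) for t in query_terms)
        if score > 0:
            scored.append((score, doc_id))
    # Highest score first; ties broken by document id for determinism.
    scored.sort(key=lambda p: (-p[0], p[1]))
    return [doc_id for _, doc_id in scored[:k]]

def expand_query(query_terms, docs, top_docs, n_terms=2):
    """Add the most frequent non-query terms from the top-ranked documents."""
    counts = Counter()
    for doc_id in top_docs:
        for tok in docs[doc_id].lower().split():
            if tok not in query_terms:
                counts[tok] += 1
    ranked = sorted(counts.items(), key=lambda p: (-p[1], p[0]))
    return list(query_terms) + [t for t, _ in ranked[:n_terms]]

def pseudo_relevance_feedback(query, docs, k=2, n_terms=2):
    terms = query.lower().split()
    first_pass = retrieve(terms, docs, k)        # initial retrieval
    expanded = expand_query(terms, docs, first_pass, n_terms)
    second_pass = retrieve(expanded, docs, k)    # second retrieval
    return expanded, second_pass

expanded, results = pseudo_relevance_feedback("jaguar", DOCS)
```

Both drawbacks are visible even at this scale: every query costs two retrieval passes plus document processing, and the added terms pull the query toward whatever the first pass happened to rank highest, which is how drift sets in.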
Query term deletion
• Loss of specificity from original query
Substituting query terms
• Relies on an initial retrieval
Latent Semantic Indexing (LSI)
• Identify patterns in relationships between terms
and concepts in unstructured collection of text
• Computationally expensive
What is the proposed solution to the problem?
Solution
• Query modification based on precomputed query and phrase similarity
– Ranking proposed queries
– Similar queries /phrases derived from user
query sessions
– Learned models used to re-rank
• Based on similarity of new query to original query
Contributions
1. Identification of new source of data to
identify similar queries and phrases
2. The definition of a scheme for scoring query
suggestions
3. An algorithm to combine query and phrase
suggestions
– Finds highly and broadly relevant phrases
4. Identification of features that are predictive
of highly relevant query suggestions
Classes of Suggestion Relevance
• Precise rewriting
– Match user’s intent, preserve core meaning
automobile insurance <-> automotive insurance
• Approximate rewriting
– Direct, close relationship to topic; scope narrowed or broadened
Apple music player <-> ipod shuffle
• Possible rewriting
– Categorical relationship to initial query; complementary but distinct product
Eye glasses <-> contact lenses
• Clear mismatch
– No clear relationship
Jaguar xj6 <-> os x jaguar
Classes of Rewriting
• Specific Rewriting (1+2)
– closely related query
– highly relevant
• Broad Rewriting (1+2+3)
– query expansion
– relevant to user interests
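The two rewriting classes are unions of the four relevance classes, which can be made concrete as two binary labelings. In this small sketch the numeric codes 1 to 4 follow the order of the relevance classes on the previous slide; the function names and the `EXAMPLES` list are hypothetical and only reuse the example pairs from that slide.

```python
# 1 = precise, 2 = approximate, 3 = possible, 4 = clear mismatch
EXAMPLES = [
    ("automobile insurance", "automotive insurance", 1),
    ("apple music player", "ipod shuffle", 2),
    ("eye glasses", "contact lenses", 3),
    ("jaguar xj6", "os x jaguar", 4),
]

def is_specific_rewriting(label):
    """Specific rewriting: classes 1 and 2 count as positive."""
    return label in (1, 2)

def is_broad_rewriting(label):
    """Broad rewriting: classes 1, 2 and 3 count as positive."""
    return label in (1, 2, 3)

specific = [(q, s) for q, s, label in EXAMPLES if is_specific_rewriting(label)]
broad = [(q, s) for q, s, label in EXAMPLES if is_broad_rewriting(label)]
```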
Substitutables
• Initial query -> generate relevant queries
– Replace query as whole or phrases
– Segment query into phrases
– Find query pairs where one segment has
changed
• (britney spears) (mp3s) -> (britney spears) (lyrics)
• Score candidate pairs with a likelihood ratio test of the pair independence hypothesis
– High value = strong dependence between the two terms
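The slide names the test but not its formula, so the following is a sketch under assumptions: it mines (old, new) phrase pairs from consecutive segmented queries, then scores a pair with the standard binomial log-likelihood ratio, -2 log λ. The `SESSIONS` data, the function names, and the way counts are tallied are illustrative, not taken from the original work.

```python
import math

# Hypothetical consecutive query pairs from user sessions, already segmented.
SESSIONS = [
    (("britney spears", "mp3s"), ("britney spears", "lyrics")),
    (("madonna", "mp3s"), ("madonna", "lyrics")),
    (("madonna", "mp3s"), ("madonna", "songs")),
    (("cheap", "flights"), ("cheap", "hotels")),
    (("paris", "flights"), ("paris", "hotels")),
    (("britney spears", "mp3s"), ("christina aguilera", "lyrics")),
]

def changed_segment(a, b):
    """Return (old, new) if exactly one segment differs, else None."""
    if len(a) != len(b):
        return None
    diffs = [(x, y) for x, y in zip(a, b) if x != y]
    return diffs[0] if len(diffs) == 1 else None

def pair_counts(pairs, q1, q2):
    """Counts for testing whether seeing q2 depends on having started from q1."""
    n1 = sum(1 for a, _ in pairs if a == q1)               # rewrites of q1
    k1 = sum(1 for a, b in pairs if a == q1 and b == q2)   # q1 -> q2
    n2 = len(pairs) - n1                                   # rewrites of anything else
    k2 = sum(1 for a, b in pairs if a != q1 and b == q2)   # other -> q2
    return k1, n1, k2, n2

def _log_lik(k, n, p):
    # log of p^k * (1-p)^(n-k); terms with a zero exponent are skipped.
    ll = 0.0
    if k > 0:
        ll += k * math.log(p)
    if n - k > 0:
        ll += (n - k) * math.log(1.0 - p)
    return ll

def llr(k1, n1, k2, n2):
    """-2 log(lambda) for H1: p1 == p2 against H2: p1 != p2 (needs n1, n2 > 0)."""
    p = (k1 + k2) / (n1 + n2)
    p1, p2 = k1 / n1, k2 / n2
    null = _log_lik(k1, n1, p) + _log_lik(k2, n2, p)
    alt = _log_lik(k1, n1, p1) + _log_lik(k2, n2, p2)
    return -2.0 * (null - alt)

pairs = [p for p in (changed_segment(a, b) for a, b in SESSIONS) if p is not None]
score = llr(*pair_counts(pairs, "mp3s", "lyrics"))
```

The last session is skipped because two segments changed at once. With real logs the counts would be far larger, and pairs would typically be kept only when the score clears a chosen threshold.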
Validation
• 1000 initial queries
– Generate single suggestion (qj) for each
• Evaluate accuracy of approaches
• Train machine learned classifier
• Evaluate ability to produce higher quality suggestions
– Predictive features: word distance, normalized edit distance, number of substitutions
• Suggestions criteria:
– Some words from initial query
– Modifications shouldn’t be made at start of query
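Two of the listed features and both suggestion criteria can be computed directly from a query and its suggestion. The slides do not define the features precisely, so this sketch assumes word distance is Levenshtein distance over word tokens and normalized edit distance is character-level Levenshtein distance divided by the longer string's length. The number of substitutions would come from the suggestion generator and is not computed here. All names are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (strings or lists of words)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (x != y),  # substitution (free if equal)
            ))
        prev = cur
    return prev[-1]

def suggestion_features(query, suggestion):
    q_words, s_words = query.split(), suggestion.split()
    char_dist = levenshtein(query, suggestion)
    return {
        # Assumed definition: edit distance counted in whole words.
        "word_distance": levenshtein(q_words, s_words),
        # Assumed definition: character edits scaled by the longer string.
        "normalized_edit_distance": char_dist / max(len(query), len(suggestion), 1),
        # Criterion: suggestion keeps some words from the initial query.
        "shares_words": bool(set(q_words) & set(s_words)),
        # Criterion: the start of the query is left unmodified.
        "first_word_kept": q_words[:1] == s_words[:1],
    }

feats = suggestion_features("automobile insurance", "automotive insurance")
```

A classifier trained on labeled pairs would take a dictionary like `feats` as its input vector.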
Future Work
• Build semantic classifier
– Predict semantic class of rewriting
• Take inspiration from machine translation
techniques
• Introduce language model
– Avoid producing nonsensical queries