Generating Query Substitutions
Alicia Wood

What is the problem to be solved?

Problem
• Imperfect description of need
• Search engine not able to retrieve documents matching query
• Need accurate and related query substitutions

Problem (cont.)
• Given a query
• Want to generate modified query (related)
  – Improvements (specification)
  – Neutral (spelling change, synonym)
  – Loss of original meaning (generalization)

Who cares about this problem and why?

Who cares?
• User typing the query
• Want correct results with imperfect query

What have others done to solve this problem and why is this inadequate?

Previous Work
• Relevance/Pseudo relevance feedback
• Query term deletion
• Substituting query terms with related terms
• Latent Semantic Indexing (LSI)

Relevance/Pseudo relevance feedback
• Submit query for initial retrieval
• Processing resulting documents
• Modify the query by expanding with additional terms from documents
• Perform second retrieval with modified query
• Can cause query drift
• Computationally expensive

Query term deletion
• Loss of specificity from original query

Substituting query terms
• Relies on an initial retrieval

Latent Semantic Indexing (LSI)
• Identify patterns in relationships between terms and concepts in unstructured collection of text
• Computationally expensive

What is the proposed solution to the problem?

Solution
• Query modification based on precomputed query and phrase similarity
  – Ranking proposed queries
  – Similar queries/phrases derived from user query sessions
  – Learned models used to re-rank
• Based on similarity of new query to original query

Contributions
1. Identification of new source of data to identify similar queries and phrases
2. The definition of a scheme for scoring query suggestions
3. An algorithm to combine query and phrase suggestions
  – Finds highly and broadly relevant phrases
4. Identification of features that are predictive of highly relevant query suggestions

Classes of Suggestion Relevance
• Precise rewriting
  – Match user’s intent, preserve core meaning
    automobile insurance <-> automotive insurance
• Approximate rewriting
  – Direct close relationship to topic, scope narrowed or broadened
    Apple music player <-> ipod shuffle
• Possible rewriting
  – Categorical relationship to initial query, complementary product but distinct
    Eye glasses <-> contact lenses
• Clear mismatch
  – No clear relationship
    Jaguar xj6 <-> os x jaguar

Classes of Rewriting
• Specific Rewriting (1+2)
  – Closely related query
  – Highly relevant
• Broad Rewriting (1+2+3)
  – Query expansion
  – Relevant to user interests

Substitutables
• Initial query -> generate relevant queries
  – Replace query as whole or phrases
  – Segment query into phrases
  – Find query pairs where one segment has changed
    • (britney spears) (mp3s) -> (britney spears) (lyrics)
• Pair Independence Hypothesis Likelihood Ratio
  – High value = strong dependence between two terms

Validation
• 1000 initial queries
  – Generate single suggestion (qj) for each
• Evaluate accuracy of approaches
• Train machine learned classifier
• Evaluate ability to produce higher quality suggestions
  – Word distance, normalized edit distance, number of substitutions
• Suggestions criteria:
  – Some words from initial query
  – Modifications shouldn’t be made at start of query

Future Work
• Build semantic classifier
  – Predict semantic class of rewriting
• Take inspiration from machine translation techniques
• Introduce language model
  – Avoid producing nonsensical queries
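
Illustration: pair independence likelihood ratio

The Substitutables slide names a pair independence hypothesis likelihood ratio without giving a formula. The sketch below assumes one common form, the binomial log-likelihood ratio often used for collocation testing; the slides do not confirm that this exact statistic is used. The function name, argument names, and counts are hypothetical.

```python
import math


def _log_binomial(k, n, p):
    """Log-likelihood of k successes in n trials with success probability p.

    The binomial coefficient is omitted because it cancels in the ratio.
    """
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)  # keep log() finite at p = 0 or p = 1
    return k * math.log(p) + (n - k) * math.log(1.0 - p)


def log_likelihood_ratio(c12, c1, c2, n):
    """Return -2 log(lambda) for the hypothesis that two terms are independent.

    Assumed (hypothetical) counts:
      c12: events where term 1 is followed by / rewritten to term 2
      c1:  events containing term 1
      c2:  events containing term 2
      n:   total number of events
    """
    if not (0 < c1 < n):
        raise ValueError("need 0 < c1 < n")
    p = c2 / n                   # independence: one shared probability
    p1 = c12 / c1                # dependence: P(term 2 | term 1)
    p2 = (c2 - c12) / (n - c1)   # dependence: P(term 2 | not term 1)
    log_h1 = _log_binomial(c12, c1, p) + _log_binomial(c2 - c12, n - c1, p)
    log_h2 = _log_binomial(c12, c1, p1) + _log_binomial(c2 - c12, n - c1, p2)
    return -2.0 * (log_h1 - log_h2)


independent = log_likelihood_ratio(c12=20, c1=100, c2=200, n=1000)
dependent = log_likelihood_ratio(c12=80, c1=100, c2=200, n=1000)
```

Under these assumptions, a pair whose co-occurrence matches what independence predicts scores near zero, while a pair that co-occurs far more often than expected scores high, which matches the slide's "high value = strong dependence".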
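
Illustration: distance features between a query and a suggestion

The Validation slide lists word distance and normalized edit distance as features but does not define them. The sketch below assumes plain Levenshtein distance, taken over characters and divided by the longer string's length for the normalized form, and taken over whitespace-separated words for word distance. These definitions and the function names are assumptions for illustration; the number of substitutions depends on the phrase segmentation and is not computed here.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (strings or lists of words)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity_features(original, suggestion):
    """Return assumed distance features for an (original, suggestion) pair."""
    char_distance = levenshtein(original, suggestion)
    longest = max(len(original), len(suggestion), 1)  # avoid dividing by zero
    word_distance = levenshtein(original.split(), suggestion.split())
    return {
        "normalized_edit_distance": char_distance / longest,
        "word_distance": word_distance,
    }


features = similarity_features("automobile insurance", "automotive insurance")
```

For the precise-rewriting example from the slides, two characters out of twenty change and one word out of two changes, so both features stay small. Features like these could feed the learned classifier that re-ranks suggestions.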