Enrich Query Representation by Query Understanding
Gu Xu
Microsoft Research Asia

Mismatching Problem
• Mismatching is a fundamental problem in search
  – Examples: NY ↔ New York, game cheats ↔ game cheatcodes
• Search engine challenges
  – Head (frequent) queries: rich information is available (clicks, query sessions, anchor texts, etc.)
  – Tail (infrequent) queries: information becomes sparse and limited
• Our proposal
  – Enrich both queries and documents, and conduct matching on the enriched representations.

Matching at Different Semantic Levels
(Figure: four levels of matching, from shallow to deep.)
• Term level: match exactly the same terms
  – NY and New York, or disk and disc, do not match as terms
• Sense level: match terms with the same meaning
  – utube ↔ youtube, NY ↔ New York, motherboard ↔ mainboard
• Topic level: match the topics of query and document
  – Query "Microsoft Office" (topic: PC software) should not match a document "... working for Microsoft ... my office is in ..." (topic: personal homepage)
• Structure level: match intent with answers (structures of query and document)
  – Microsoft Office home → find the homepage of Microsoft Office
  – 21 movie → find the movie named 21
  – buy laptop less than 1000 → find online dealers selling laptops for less than 1,000 dollars

Enrich Query Representation
(Figure: a pipeline enriching the query "michael jordan berkele" level by level.)
• Tokenization (term level): <token>michael</token> <token>jordan</token> <token>berkele</token>
• Query Refinement (sense level): <correction token="berkele">berkeley</correction>
  – Input queries are often ill-formed
  – Ambiguity: msil or mail; equivalence (or dependency): department or dept, login or sign on
• Query Parsing (structure level): <person-name>michael jordan</person-name> <location>berkeley</location>
  – Named entity segmentation and disambiguation; relies on a large-scale knowledge base
• Query Classification (topic level): <query-topics>academic</query-topics>
  – Requires a definition of classes; accuracy & efficiency both matter
• Alternative Query Finding: <similar-queries>michael I. jordan berkeley</similar-queries>

QUERY REFINEMENT USING CRF-QR (SIGIR'08)

Query Refinement
• Example: Papers on Machin Learn → Papers on "Machine Learning"
  – Spelling error correction: machin → machine
  – Inflection: learn → learning
  – Phrase segmentation: "machine learning"
• The operations (spelling error correction, inflection, phrase segmentation) are mutually dependent.

Conventional CRF
(Figure: a lattice over the query X = papers machin on learn, where each word x_i fans out to every candidate refinement y_i, e.g., machin → machine, machines, machining, ... and learn → learning, learns, learned, ...)
• Y ranges over all word sequences, so the label space explodes: inference and training are intractable.

CRF for Query Refinement
• Introduce a sequence of refinement operations O between the query X and the refined query Y.
• Operations for spelling error correction:
  – Deletion: delete a letter in a word
  – Insertion: insert a letter into a word
  – Substitution: replace one letter with another
  – Exchange: switch two letters in a word
(Figure: the lattice again; each candidate is now reached from x_i through an operation, e.g., machined → machine by Deletion, learn → learning by +ing, so unrelated words such as walk, super, soccer, data are pruned away.)
1. O constrains the mapping from X to Y (reduces the candidate space).
2. O indexes the mapping from X to Y (the same operation shares parameters across words); see the sketch below.
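The following is a minimal Python sketch of operation-constrained candidate generation in the spirit of CRF-QR, not the paper's implementation: the toy LEXICON is an assumption, and Exchange is read as swapping two adjacent letters. It shows how the operations both shrink the candidate space (only lexicon words one operation away survive) and index each surviving candidate, so one weight per operation can be shared across all words.

```python
# A minimal sketch (not the CRF-QR implementation): enumerate refinement
# candidates for one query word via the four spelling operations, keeping
# only candidates found in a lexicon. The lexicon here is a toy assumption.
LEXICON = {"machine", "machines", "machining", "learn", "learning", "papers", "on"}
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def candidates(word):
    """Yield (operation, candidate) pairs; the operation both constrains
    the candidate space and indexes the shared model parameters."""
    for i in range(len(word)):                      # Deletion
        yield "deletion", word[:i] + word[i + 1:]
    for i in range(len(word) + 1):                  # Insertion
        for ch in ALPHABET:
            yield "insertion", word[:i] + ch + word[i:]
    for i in range(len(word)):                      # Substitution
        for ch in ALPHABET:
            if ch != word[i]:
                yield "substitution", word[:i] + ch + word[i + 1:]
    for i in range(len(word) - 1):                  # Exchange (adjacent swap)
        yield "exchange", word[:i] + word[i + 1] + word[i] + word[i + 2:]

def constrained_candidates(word):
    # Only candidates reachable by one operation AND present in the lexicon
    # survive, which is why the label space stays tractable.
    return {(op, c) for op, c in candidates(word) if c in LEXICON}

print(constrained_candidates("machin"))   # {('insertion', 'machine')}
```

In the full model the same operation applied to different words shares feature weights, which is what keeps training feasible with limited data.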
NAMED ENTITY RECOGNITION IN QUERY (SIGIR'09, SIGKDD'09)

Named Entity Recognition in Query
• harry potter → harry potter: Movie (0.5), Book (0.4), Game (0.1)
• harry potter film → harry potter: Movie (0.95)
• harry potter author → harry potter: Book (0.95)

Challenges
• Named entity recognition in queries is harder than in documents
  – Queries are short (2-3 words on average): fewer context features
  – Queries are not well formed (typos, lower-casing, ...): fewer content features
• Knowledge base issues
  – Coverage and freshness
  – Ambiguity

Our Approach to NERQ
• Decompose a query q into a named entity e and a context t, with a class c:
  – q = "Harry Potter Walkthrough": e = "Harry Potter" (named entity), t = "# Walkthrough" (context), c = "Game" (class); q = e + t
• The goal of NERQ becomes finding the best triple (e, t, c)* for query q:
  $(e,t,c)^* = \arg\max_{(e,t,c)\in G(q)} p(e,t,c,q) = \arg\max_{(e,t,c)\in G(q)} p(e)\,p(c \mid e)\,p(t \mid c)$
  where G(q) denotes the set of triples consistent with q (a scoring sketch is given in the appendix).

Training With Topic Model
• Ideal training data T = {(e_i, t_i, c_i)}:
  $\max \sum_i \log p(e_i, t_i, c_i)$
• Real training data T = {(e_i, t_i, *)}:
  – Queries are ambiguous (harry potter vs. harry potter review)
  – Training data are relatively few
  $\max \sum_i \log \sum_c p(e_i, t_i, c) = \max \sum_i \log \sum_c p(e_i)\, p(c \mid e_i)\, p(t_i \mid c)$

Training With Topic Model (cont.)
(Figure: a matrix of named entities (harry potter, kung fu panda, iron man, ...) against contexts (# wallpapers, # movies, # walkthrough, # book price, ...), with topics such as Movie, Game, Book. # is a placeholder for the named entity; here # means "harry potter".)
• Entities play the role of documents, contexts the role of words, and classes the role of topics, so $p(c \mid e)$ and $p(t \mid c)$ can be estimated with a topic model.

Weakly Supervised Topic Model
• Introducing supervision
  – Supervision is always better
  – Aligns the implicit topics with the explicit classes
• Weak supervision
  – Label named entities rather than queries (analogous to document class labels)
  – Multiple class labels per entity, as binary indicators
  – Example: Kung Fu Panda → Movie? Game? Book? (a distribution over classes)

WS-LDA
• LDA + soft constraints (w.r.t. the supervision):
  $\mathcal{L}(\mathbf{w}, \mathbf{y}) = \log p(\mathbf{w} \mid \alpha, \beta) + \lambda\, C(\mathbf{y}, \mathbf{z})$
  (the first term is the LDA probability, the second the soft constraints)
• Soft constraints:
  $C(\mathbf{y}, \mathbf{z}) = \sum_i y_i\, p(z_i)$, where $p(z_i)$ is the document probability on the i-th class and $y_i \in \{0, 1\}$ is the document's binary label on the i-th class.

Extension: Leveraging Clicks
• Extend the context t to t' with click-through features
  – Query contexts: # wallpapers, # movies, # walkthrough, # book price, ...
  – URL words, title words, snippet words, content words
  – Clicked host names: www.imdb.com, www.wikipedia.com, www.gamespot.com, www.sparknotes.com, cheats.ign.com, ...
  – Other features
• The enriched contexts t' are again tied to classes such as Movie, Game, Book.

Summary
• The goal of query understanding is to enrich the query representation and, in essence, to solve the problem of term mismatching.

THANKS!
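Appendix: a minimal sketch of the NERQ triple-scoring rule $(e,t,c)^* = \arg\max p(e)\,p(c \mid e)\,p(t \mid c)$. All probability tables below are made-up illustrative numbers; in the actual system p(e), p(c|e), and p(t|c) are estimated from query logs and the weakly supervised topic model. Considering only prefix entities is also a simplification for brevity.

```python
# A minimal sketch (with made-up probabilities) of scoring the triples
# (e, t, c) for a query, following (e,t,c)* = argmax p(e) p(c|e) p(t|c).
p_e = {"harry potter": 0.01}                        # entity popularity, assumed
p_c_given_e = {"harry potter": {"Movie": 0.5, "Book": 0.4, "Game": 0.1}}
p_t_given_c = {                                     # context likelihoods, assumed
    "# walkthrough": {"Movie": 0.01, "Book": 0.01, "Game": 0.30},
}

def score_triples(query):
    """Enumerate candidate segmentations of the query into (entity, context)
    and score each (e, t, c) by p(e) * p(c|e) * p(t|c)."""
    words = query.lower().split()
    triples = []
    for i in range(1, len(words)):                  # entity = prefix, context = rest
        e = " ".join(words[:i])
        t = "# " + " ".join(words[i:])              # '#' marks the entity slot
        if e not in p_e or t not in p_t_given_c:
            continue                                # outside the toy tables
        for c, pc in p_c_given_e[e].items():
            triples.append((p_e[e] * pc * p_t_given_c[t][c], e, t, c))
    return sorted(triples, reverse=True)

print(score_triples("harry potter walkthrough")[0])
# ≈ (0.0003, 'harry potter', '# walkthrough', 'Game')
```

Note how the generic entity prior prefers Movie, but the "# walkthrough" context flips the decision to Game, which is exactly the disambiguation behavior shown on the harry potter slides.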
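Appendix: a small numeric illustration of the WS-LDA soft constraint as reconstructed above. The label vector and class probabilities are made-up numbers; the point is only that the constraint term rewards topic mass on the labeled classes, pushing the implicit topics to align with the explicit classes.

```python
# Toy illustration (made-up numbers) of the WS-LDA soft constraint
# C(y, z) = sum_i y_i * p(z_i): it rewards topic mass on labeled classes.
classes = ["Movie", "Game", "Book"]
y = [1, 1, 0]            # Kung Fu Panda labeled as Movie and Game, not Book
p_z = [0.6, 0.3, 0.1]    # entity's (document's) probability on each class

constraint = sum(yi * pzi for yi, pzi in zip(y, p_z))
print(constraint)        # 0.9: high because mass sits on the labeled classes
# Training maximizes  log p(w | alpha, beta) + lambda * C(y, z),
# trading off LDA likelihood against agreement with the weak labels.
```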