XML Keyword Search Refinement 郭青松 LOGO Outline Introduction Query Refinement in Traditional IR XML Keyword Query Refinement My work Why we need query refinement? User express their query intention by keywords, but their don’t know how to formulate good query Lack of experience Too many expression forms Unfamiliar with the system Have no idea about the data „Query Refinement Refine the query and get good results What is Query Refinement? Query expansion(query reformulation) Given an ill-formed query from the user, we refine the query and help the user to better retrieve documents. The goal is to improve precision and/or recall. Example: “cars” “car”, “automobile”, “auto” XML Search Tag + Keyword search book: xml Path Expression + Keyword search (CAS Queries) /book[./title about “xml db”] Structure query XPath, XQuery Keyword search (CO Queries) “xml” XML Keywords Search VS IR IR Flat HTML pages Whole page returned XML Model(tree、graph) Structural(semi-structural) Semantic-based query(LCA, SLCA…) Information fragment returned Need of XML Keyword Query Refinement Hard to know the XML content Especially big xml document Information fragments(LCA\SLCA) Easily affect the results(Precision ) Huge difference of query results IR style refinement methods is not suitable for xml Only content be considered Need structure information to form a good query Outline Introduction Query Refinement in Traditional IR XML Keyword Query Refinement My work Tasks Spelling Correction Word Splitting/Word Merging Phrase Segmentation Word Stemming Acronym Expansion Add/Delete Terms Substitution Classes of Query Refinement Relevance feedback Users mark documents(relevant, nonrelevant) Reweight the terms in the query Automatic query Refinement System analysis the relevance of documents and query, give refined query automatically Global analysis Local analysis Relevance Feedback Began in the 1960s Improvement in recall and precision Basic process as follows 1. The user issues their initial query 2. The system returns an initial result set. 3. The user then marks some returned documents as relevant or nonrelevant. 4. The system then re-weights the terms and refine the query results Relevance Feedback Models Boolean. Terms appear in document: relevance Vector Space. q=(t1, t2,…, tn) d=(w1, w2,…, wn) sim (q, d ) n q d i 1 i n i (q i 1 i n ) 2 (d i 1 Probabilistic. Relevance of a query and documents evaluate as probability Probabilistic ranking principle i )2 Rocchio algorithm for vector-space model qm :refined query vector q0: the original query vector Dr : relevant documents , Dnr: nonrelevant documents α, β, γ: weights attached to each term Average relevantdocument vector Average non-relevant document vector Global analysis(1) Using all documents to compute the similarity of query q and terms in the documents Similarity Thesaurus based d1 d 2 t1 w11 w12 K ( D ) t 2 w21 w22 ... ... ... t n wn1 wn 2 ... d n ... w1n ... w2 n ... ... ... wnn ti : term vector q k1 , k2 ,..., km , ks {t} wij : weight of term ti and document t j Global analysis(2) Similarity of terms Query vector cu ,v tu tv wu , j wv , j d j q wi ,q ti ti q Similarity of query and terms sim (q, k j ) q k j wi ,q ti t j wi ,q ci , j ti q k i q Select r terms with highest sim value and adding into initial query , reformulate the new query www.themegallery.com Local analysis Local analysis: Using initial query results(especially documents front ,local documents) to refine the query Local clustering Clustering the term of local documents Query refined with the relevant cluster Similarity of terms in query and terms in documents Local context analysis(LCA) Get the most similar term in local documents with the query q to expanse Similarity of q and terms in documents Company name Outline Introduction Query Refinement in Traditional IR XML Keyword Query Refinement My work www.themegallery.com XML Refinement Manner(1) Query refined form Keywords query New Keywords Query • Treat as traditional IR problem • IR with XML Keyword search Semantics Keywords Structural Query User participant Manually(User Interactive ) • Structural Feedback Automatic Company name XML Refinement Manner (2) Manually Refined to new Keywords Query IR(consider the structure of xml) Manually Transform to Structural Query Relevance Feedback Automatic Refined to new Keywords Query Lu jiaheng: Automatic Transform to Structural Query NLP Automatic Refined to new Keywords Query(1) Query Refined Query Rule based Operation Term merging: k1 , k2 ,..., kn k Original query Term splitting:Refined k k1query , k2 ,..., kn IR,2003,Mike Information Retrieval,2003,Mike ' ' ' Term substitution: k , k ,..., k k , k ,..., k 1 2 n 1 2 n Mike, publication Mike, publications Database, Term paper deletion Database, in-proceedings XML, John,2003 XML, John machin, learn machine, learning Hobby, news, paper Hobby, newspaper On, line, data, base Online, database Automatic Refined to new Keywords Query(2) Ranking Refined query candidates set S(RQ) Refinement cost Cost: the step of “op” from “Q” to “RQ” Dynamic programming Efficient Refinement Algorithms Avoid the multiple scan invert list stack-based ,stack-based, short-list-eager approach RQ candidates have the same refinement cost Q={XML, Jim, 2001}{XML, 2001}, {Jim, 2001} or {XML, Jim} NLPX Natural Language Query (NLQ) NEXI NEXI(Narrowed Extended XPath I) //A[about(//B,C)] A: path expression, B :relative path expression to A C is the content requirement. ‘about’ clause represents an individual information request. NLPX—Lexical and Semantic Tagging structural words: content requirements boundary words: Path expression instruction words R :return request , S :support request. Find sections about compression in articles about information retrieval Tagged: Find/XIN sections/XST about/XBD compression/NN in/IN articles/XST about/XBD information/NN retrieval/NN NLPX—Template Matching most queries correspond to a small set of patterns formulate grammar templates with patterns Query: Request+ Grammar Templates Request : CO_Request | CAS_Request CO_Request: NounPhrase+ CAS_Request: SupportRequest | ReturnRequest SupportRequest: Structure [Bound] NounPhrase+ ReturnRequest: Instruction Structure [Bound] NounPhrase+ Request 1 Structural: Content: Instruction: Request 2 Information Requests /article/sec /articlec compression information retrieval R S NLPX—NEXI Query Production merge the information request into NEXI query. A[about(.,C)] A :the request structural attribute and C : the request content attribute. //article[about(.,information retrieval)]//sec[about (.,compression)] Query generation process Create target component Break up the query into units Generate initial target combinations of input target components Generate queries modifying a target component combing two components Initialization Breaks up the input query into terms Structure( XML tags or attributes) Content term(refer to text) Create component Structure term unbound target Content term binding to a bound target Probability enumeration Target component and target sets Query: Papers by jennifer widom papers Jennifer widom {//article} {//inproceedings} 0.5000 0.5000 {//author[~’jennifer widom’]} {//editor[~’jennifer widom’]} {//title[~’jennifer widom’]} 0.6842 0.3150 0.0004 {//article} {//author[∼‘jennifer widom’]} {//inproceedings} {//author[∼‘jennifer widom’]} {//inproceedings} {//editor[∼‘jennifer widom’]} {//article} {//editor[∼‘jennifer widom’]} {//inproceedings} {//title[∼‘jennifer widom’]} {//article} {//title[∼‘jennifer widom’]} 0.3421 0.3421 0.1577 0.1577 0.0002 0.0002 Transformation Operators(1) Aggregation: merge targets with same tag {//a}, {//a[~’x’]} {//a[~’x’]} {//a[~’x’]} , {//a[~’y’]} {//a[~’x y’]} Prefix expansion: add an ancestor condition {//b} {//a//b} {//b[~’x’]} {//a//b[~’x’]} Ordering: combine targets {//a}, {//b} {//a//b} or {//a[//b]} {//a}, {//b[~’x’]} {//a//b[~’x’]} or {//a[//b[~’x’]]} Conclusion Two stronger assumption Keyword query non-ambiguity Availability of XML thesaurus Accuracy: terms classification didn’t consider specific XML context Time costly: Term classification Targets create scan the XML documents Outline Introduction Query Refinement in Traditional IR XML Keyword Query Refinement My work LOGO www.themegallery.com