The SphereSearch Engine for Unified Ranked Retrieval of Heterogeneous XML and Web Documents
Ralf Schenkel
joint work with Jens Graupmann and Gerhard Weikum
VLDB 2005, Trondheim, Norway

Outline
• Where existing search engines fail
• SphereSearch Concepts
• Transformation and Annotation
• Query Language and Scoring
• Experimental Evaluation
• Summary

Example query #1
Which professors from Saarbrücken do research on XML?
Different terminology in query and Web pages, e.g. "Director of Department 5 DBS & IS", "Professor at Saarland University"
Needed: abstraction awareness

Example query #2
Conferences about XML in Norway 2005
Information is not present on a single page, but distributed across linked pages, e.g. a page "VLDB Conference 2005, Trondheim, Norway" linked to a "Call for Papers" page containing "…XML…"
Needed: context awareness

Example query #3
What are the publications of Max Planck?
Max Planck should be an instance of the concept person, not of the concept institute
Needed: concept awareness

SphereSearch Concepts
Goal: Increase recall & precision for hard queries on linked and heterogeneous data
• Unified search for unstructured, semistructured, and structured data from heterogeneous sources
• Graph-based model, including links
• Annotation engines from NLP to recognize classes of named entities (persons, locations, dates, …) for concept-aware queries
• Flexible yet simple abstraction-aware query language with context-aware scoring
• Compactness-based scores

Some Related Work
• Web query languages, e.g., W3QS [VLDB95], WebOQL [ICDE95], …
• Web IR with thesauri, e.g., Qiu et al. [SIGIR93], Liu et al. [SIGIR04], …
• XML IR, e.g., XXL [WebDB00], XIRQL [SIGIR01], XSearch [VLDB03], XRank [SIGMOD03], …
• Information extraction, e.g., Lixto, KnowItAll, …
• Advanced Web graph IR, e.g., BANKS [ICDE02], Hristidis et al. [VLDB03], …

Unifying Search on Heterogeneous Data
Sources (Web, Intranet, Databases, Enterprise Information Systems, …) are converted to XML by heuristics and type-specific transformations

Heuristic Transformation of HTML
Goal: Transform layout tags to semantic annotations
• Headlines
  <h1>Experiments</h1> <h2>Settings</h2> We evaluated... <h2>Results</h2> Our system...
  becomes
  <Experiments> <Settings>...</Settings> <Results>...</Results> </Experiments>
• Patterns
  <b>Topic:</b>XML
  becomes
  <Topic>XML</Topic>
• Rules for tables, lists, …

(Almost) Generic XML Data Model
<Professor> Gerhard Weikum <Course> IR </Course> Saarbrücken <Research> XML </Research> </Professor>
Element nodes:
• 1: docid=1, tag="Professor", content="Gerhard Weikum Saarbrücken"
• 2: docid=1, tag="Course", content="IR"
• 3: docid=1, tag="Research", content="XML"
Tags annotate content with the corresponding concept
Automatic annotation of important concepts (persons, locations, dates, money amounts) with tools from Information Extraction, e.g. "Gerhard Weikum" as person and "Saarbrücken" as location

Information Extraction (IE)
• Named Entity Recognition (NER)
• Named Entity ~ abstract datatype, concept (location, person, …, IP address)
• Mature (out-of-the-box products, e.g. GATE/ANNIE)
• Extensible
Example:
The Pelican Hotel in Salvador, operated by Roberto Cardoso, offers comfortable rooms starting at $100 a night, including breakfast. Please check in before 7pm.
becomes
The <company> Pelican Hotel </company> in <location> Salvador </location>, operated by <person> Roberto Cardoso </person>, offers comfortable rooms starting at <price> $100 </price> a night, including breakfast. Please check in before <time> 7pm </time>.
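The annotation step in the hotel example can be illustrated with a minimal sketch. It is not how GATE/ANNIE works internally: it assumes a tiny hand-written gazetteer and two regular expressions as stand-ins for a real named entity recognizer, and the names GAZETTEER, PATTERNS, and annotate are hypothetical.

```python
import re

# Hypothetical toy gazetteer: a stand-in for a real IE tool such as GATE/ANNIE.
GAZETTEER = {
    "company": ["Pelican Hotel"],
    "location": ["Salvador"],
    "person": ["Roberto Cardoso"],
}

# Hypothetical patterns for concepts that are recognized by their shape.
PATTERNS = {
    "price": r"\$\d+(?:\.\d{2})?",
    "time": r"\b\d{1,2}(?::\d{2})?\s?(?:am|pm)\b",
}


def annotate(text):
    """Wrap every recognized entity in a tag named after its concept."""
    spans = []
    for concept, names in GAZETTEER.items():
        for name in names:
            for match in re.finditer(re.escape(name), text):
                spans.append((match.start(), match.end(), concept))
    for concept, pattern in PATTERNS.items():
        for match in re.finditer(pattern, text):
            spans.append((match.start(), match.end(), concept))

    spans.sort()
    parts, pos = [], 0
    for start, end, concept in spans:
        if start < pos:
            continue  # skip a span that overlaps an already annotated one
        parts.append(text[pos:start])
        parts.append(f"<{concept}>{text[start:end]}</{concept}>")
        pos = end
    parts.append(text[pos:])
    return "".join(parts)


TEXT = (
    "The Pelican Hotel in Salvador, operated by Roberto Cardoso, offers "
    "comfortable rooms starting at $100 a night, including breakfast. "
    "Please check in before 7pm."
)
ANNOTATED = annotate(TEXT)
print(ANNOTATED)
```

A dictionary lookup like this cannot disambiguate (for example "Max Planck" as person versus institute); that is the part a trained NER component is expected to contribute.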

Unifying Search on Heterogeneous Data
Sources (Web, Intranet, Databases, Enterprise Information Systems, …) are converted to XML by heuristics and type-specific transformations; named entities are then annotated with IE tools (e.g., GATE), yielding annotated XML

Annotation-Aware Data Model
<Professor> Gerhard Weikum <Course>IR</Course> Saarbrücken <Research>XML</Research> </Professor>
Before annotation:
• 1: docid=1, tag="Professor", content="Gerhard Weikum Saarbrücken"
• 2: docid=1, tag="Course", content="IR"
• 3: docid=1, tag="Research", content="XML"
Annotation with GATE: "Saarbrücken" of type "location"
After annotation:
• 1: docid=1, tag="Professor", content="Gerhard Weikum"
• 2: docid=1, tag="Course", content="IR"
• 3: docid=1, tag="Research", content="XML"
• 4: docid=1, tag="location", content="Saarbrücken"
Annotation introduces new tags

Data Model for Links

Architecture
[Figure: sources (SIGIR website, hotel website, Graupmann homepage, flight schedule, e-mail, tourist guide (XML), Web portal) are read by adapters (Web, XML, EMail, …); an IE processor with annotation modules (DATE, PRICE, LOCATION, …) adds annotations such as Location=Salvador, Location=Frankfurt, Person=Schenkel, Event=SIGIR, Date=15-18 August, Time=13:15, Price=89 $, FROM=SIGIR, SUBJECT=Notification; the result is stored in the index used by the search engine]

Outline
• Where existing search engines fail
• SphereSearch Concepts
• Transformation and Annotation
• Query Language and Scoring
• Experimental Evaluation
• Current and Future Work

SphereSearch Queries
Extended keyword queries:
• similarity conditions: ~professor, ~Saarbrücken
• concept-based conditions: person=Max Planck, location=Trondheim
• grouping
• join conditions
Ranked results with context-aware scoring

Score Aggregation: SphereScore
Local score sL(e) for each element e (tf/idf, BM25, …)
Sphere score s(e): weighted aggregation of local scores in the environment of the element:
$s(e) = \sum_{d=0}^{D} \alpha^{d} \sum_{e' :\, dist(e,e') = d} s_L(e')$, with $\alpha$ between 0 and 1
Rewards proximity of terms and compactness of term distribution
Context awareness

Similarity Conditions
Similarity conditions like ~professor, ~Saarbrücken
• Thesaurus/Ontology: concepts, relationships, glosses from WordNet, Gazetteers, Web forms & tables, Wikipedia
• Disambiguation
• Query expansion: $\delta\text{-exp}(x) = \{ w \mid sim(x,w) > \delta \}$
• Local score: weighted max over all expansion terms
  $s_L(e, \sim professor) = \max_{t \in \delta\text{-exp}(professor)} \{ sim(professor,t) \cdot s_L(e,t) \}$
[Figure: thesaurus excerpt around "professor" with related terms such as researcher, scholar, scientist, educator, teacher, lecturer, mentor, academic/academician/faculty member, intellectual, investigator, director, artist, wizard, alchemist, primadonna; edges are labeled with relationship type and weight, e.g. HYPONYM (0.7)]
Relationships quantified by statistical co-occurrence measures
Abstraction awareness

Concept-based conditions
Goal: Exploit explicit (tags) and automatic annotations in documents
Condition location=Trondheim: concept = location, value = Trondheim
Matching element e: docid=1, tag="location", content="Trondheim"
sL(e,c=v) = score for concept-tag match + concept-specific score for value-content match
Allows similarity and range queries (for annotated concepts) like location~Trondheim or 1970<date<1980, with concept-specific distance measures
Concept awareness

Query Groups
Goal: Related terms should occur in the same context
Group conditions that relate to the same "entity":
professor teaching IR research XML
becomes
professor T(teaching IR) R(research XML)
• SphereScore computed for each group
• Find compact sets with one result for each group

Scores for Query Results
Query result R: one result per query group
$score(R) = \beta \sum_{e_i \in R} s_i(e_i) + (1 - \beta) \cdot compactness(R)$
compactness ~ 1/size of a minimal spanning tree
[Figure: four example result graphs N1 to N4 over nodes A, B, X with weighted edges; compactness values C(N1) = 1/3, C(N2) = 1/4, C(N3) = 1/5, C(N4) = 1/6]
Context awareness

Join conditions
Goal: Connect results of different query groups
Example: A(research, XML) B(VLDB 2005 paper) A.person=B.person
[Figure: results for groups A and B with person nodes "Ralf Schenkel" and "R.Schenkel" connected by weighted join links (weights 1.0 and 0.9 shown)]
Join links are, depending on database size and application:
• Precomputed
• Computed during query execution
Properties:
• Join conditions do not change the score for a node
• Join conditions create a new link with a specific weight

Score for Join Conditions
Join condition A.T=B.S:
• For all nodes n1 with type T, n2 with type S, add edge (n1,n2) with weight $sim(n_1,n_2)^{-1}$
• sim(n1,n2): content-based similarity
[Figure: example result graph N4 with an added join edge and its resulting compactness C(N4)]

Outline
• Where existing search engines fail
• SphereSearch Concepts
• Transformation and Annotation
• Query Language and Scoring
• Experimental Evaluation
• Current and Future Work

Setup for Experiments
No existing benchmark (INEX, TREC, …) fits
Three corpora:
• Wikipedia
• extended Wikipedia with links to IMDB
• extended DBLP corpus with links to homepages
50 queries like
• A(actor birthday 1970<date<1980) western
• G(California,governor) M(movie)
• A(Madonna,husband) B(director) A.person=B.director
Opponent: keyword queries with standard TF/IDF-based score ("simplified Google")

Incremental Language Levels
• SSE-Join (join conditions)
• SSE-QG (query groups)
• SSE-CV (concept-based conditions)
• SSE-basic (keywords, SphereScores)

Experimental Results on Wikipedia

Experimental Results on Wiki++ and DBLP++
• SphereScores better than local scores
• New SSE features nearly double precision

Current and Future Work
• Improve graphical user interface
• Refined type-specific similarity measures (like geographic distances) [SIGIR-WS 2005]
• Deep Web search through automatic portal queries
• Parameter tuning with relevance feedback
• Efficiency of query evaluation through precomputation and integrated top-k (TopX talk this afternoon)

Thank you!
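
Supplementary sketch: the sphere score from the "Score Aggregation: SphereScore" slide can be read as a damped sum of local scores over all elements within distance D of an element. The code below is only an illustration under stated assumptions: the element graph is an undirected adjacency list, distances are hop counts found by breadth-first search, and the names sphere_score, max_dist, and alpha as well as the toy scores are hypothetical.

```python
from collections import deque


def sphere_score(graph, local_score, start, max_dist=2, alpha=0.5):
    """Sum the local scores of all nodes within max_dist hops of start,
    weighting a node at distance d by alpha ** d."""
    dist = {start: 0}
    queue = deque([start])
    total = 0.0
    while queue:
        node = queue.popleft()
        d = dist[node]
        total += (alpha ** d) * local_score.get(node, 0.0)
        if d == max_dist:
            continue  # do not expand beyond the sphere radius
        for neighbor in graph.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = d + 1
                queue.append(neighbor)
    return total


# Toy element graph: 1 = Professor, 2 = Course, 3 = Research (undirected).
GRAPH = {1: [2, 3], 2: [1], 3: [1]}
# Hypothetical local scores for some query term.
LOCAL = {1: 1.0, 2: 0.0, 3: 2.0}

print(sphere_score(GRAPH, LOCAL, start=1, max_dist=1, alpha=0.5))
print(sphere_score(GRAPH, LOCAL, start=2, max_dist=2, alpha=0.5))
```

With alpha below 1, a matching element far from the start node contributes less than a nearby one, which is the proximity reward the slide describes.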