Databases & Information Retrieval
Maya Ramanath

Further Reading:
– Combining Database and Information-Retrieval Techniques for Knowledge Discovery. G. Weikum, G. Kasneci, M. Ramanath and F.M. Suchanek, CACM, April 2009
– DB & IR: Both Sides Now. G. Weikum, Keynote at SIGMOD 2007

DB and IR: Different Motivations
• Both deal with large amounts of information, but…

                DB                              IR
  Applications  online reservation, banking     libraries
  Emphasis      data consistency, efficiency    result quality, user satisfaction
  Data          structured records              unstructured text
  Queries       precise                         interpretations vary
  Results       exact match/all results         ranked/top-k results

Why Combine Now?
• The applications drive the need
  – The need to manage both structured and unstructured data in an integrated manner
• Healthcare example
  – Find young patients in central Europe who have been reported, in the last two weeks, to have symptoms of tropical virus diseases and an indication of anomalies.
• Newspaper archives, product catalogues, etc.

Integrating DB & IR
[Figure: a grid of query type against data type. Query types: unstructured queries with ranked results (keywords/top-k) and structured queries with boolean-match results (SQL). Data types: structured data (relational) and unstructured data (text). DB Systems and IR Systems each occupy one cell; the remaining labels are the techniques that bridge them:]
• top-k processing, keyword search on graphs
• query processing for text search, effective query interfaces, extracting entities and relationships, ranking for entities
• ranking for structured data

Modules
1. Top-k Processing
2. Query Processing and Interfaces
3. Keyword Search on Graphs
4. Entity and Relationship Extraction
5. Ranking and Structured Data

1. Top-k Processing (1/2)
• Structured data, with scores in multiple dimensions
• Return the top-k “objects”

  Car           Color     Car           Mileage     Car           Service
  BMW X1        0.9       Honda City    0.8         Tata Nano     0.7
  Honda City    0.8       Maruti Swift  0.6         Maruti Swift  0.6
  Maruti Swift  0.6       Tata Nano     0.3         Honda City    0.3
  Tata Nano     0.1       BMW X1        0.1         BMW X1        0.1

\[ \mathrm{Score}(O) = \sum_{i \in \{\text{color},\, \text{mileage},\, \text{service}\}} S_i(O) \]

1. Top-k Processing (2/2)
• Top-k Joins
  – Example: Return the best house-school pair

  Houses  Rating  Location      Schools  Rating  Location
  H1      0.9     L1            S1       0.4     L2
  H2      0.8     L2            S2       0.2     L2
  H3      0.6     L3            S3       0.8     L3
  H4      0.1     L3            S4       0.1     L3

2. Query Processing and Interfaces (1/3)
• Given: Database of text documents and a text-centric task.
  – Extract information about disease outbreaks
• Strategies
  – Scan all documents: very expensive
  – Filter promising documents: affects recall
• Develop cost models and execution strategies appropriate for this setting

2. Query Processing and Interfaces (2/3)
Querying with “typed” keywords
• Keyword querying: Easy to use
• Structured queries: Precise
Find the middle ground…
Instead of
  “german has won nobel award” (plain keywords)
  q(x) :- GERMAN(x), hasWonPrize(x,y), NOBEL_PRIZE(y) (structured query)
use typed keywords:
  “german, has won (nobel award)”

2. Query Processing and Interfaces (3/3)
• Does the output have to be a boring list of ranked results?
• Nope!
[Figure 1: The faceted retrieval interface of Facetedpedia. Source: WWW 2010 full paper, April 26-30, Raleigh, NC, US.]

3. Keyword Search on Graphs (1/3)
• Lots of graphs around
  – Relational DB (tuples + foreign keys)
  – XML data (elements/sub-elements/id/idrefs)
  – RDF (graph-structured knowledge bases)
• Easy to query with keywords, instead of SQL/XQuery/SPARQL
• Results are the top-k interconnections between the keywords

3. Keyword Search on Graphs (2/3)
[Figure]

3. Keyword Search on Graphs (3/3)
Query: “Einstein”, “Bohr”
[Figure: a graph over the nodes Einstein, Bohr, Nobel Prize, vegetarian, Tom Cruise and 1962. Edges as recovered from the labels:]
  – Einstein won Nobel Prize; Bohr won Nobel Prize
  – Einstein isa vegetarian; Tom Cruise isa vegetarian
  – Tom Cruise bornIn 1962; Bohr diedIn 1962

4. Entity and Relationship Extraction (1/2)
Information Extraction (or Knowledge Harvesting)
• Bill Gates was the founder of Microsoft and later its CEO.
• Apple was established on April 1, 1976 by Steve Jobs, Steve Wozniak, and Ronald Wayne.
• Infosys was founded on 2 July 1981 by seven entrepreneurs: N. R. Narayana Murthy, Nandan Nilekani, …

  Company    Founder
  Microsoft  Bill Gates
  Apple      Steve Jobs
  Apple      Steve Wozniak
  Infosys    N. R. Narayana Murthy

4. Entity and Relationship Extraction (2/2)
• How to build a knowledge base of facts?
  – Structure Wikipedia
  – Construct rules for extraction
• How do I acquire all the facts in the world?
  – Extract “everything”
  – Don’t stop extracting

5. Ranking and Structured Data
• Not the same as top-k processing
• Given: Data with structure in it
  – Relational tables (flat)
  – XML (trees/graphs)
  – Text documents consisting of entities
• Task: Rank the query results
  – SQL/XQuery/“typed” keywords

QUESTIONS?
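
Worked sketch for module 1 (Top-k Processing)
The car example from the Top-k Processing (1/2) slide can be tried out in a few lines of Python. The slides do not say which algorithm is used; this sketch assumes the threshold algorithm (TA), a standard top-k technique, and sums the per-dimension scores as in the Score(O) formula. The names LISTS and threshold_top_k are made up for illustration, and the scores are read off the slide's three sorted lists.

```python
import heapq

# Sorted lists from the "Top-k Processing (1/2)" slide: (car, score), best first.
LISTS = {
    "color": [("BMW X1", 0.9), ("Honda City", 0.8),
              ("Maruti Swift", 0.6), ("Tata Nano", 0.1)],
    "mileage": [("Honda City", 0.8), ("Maruti Swift", 0.6),
                ("Tata Nano", 0.3), ("BMW X1", 0.1)],
    "service": [("Tata Nano", 0.7), ("Maruti Swift", 0.6),
                ("Honda City", 0.3), ("BMW X1", 0.1)],
}


def threshold_top_k(lists, k):
    """Return the k objects with the highest summed score as (object, score) pairs.

    Assumption: this is the threshold algorithm (TA), not necessarily the
    method the slides have in mind.
    """
    # Random access: one dictionary per list, mapping object -> score.
    lookup = {name: dict(entries) for name, entries in lists.items()}
    totals = {}  # every object seen so far -> its full aggregated score
    depth = max(len(entries) for entries in lists.values())

    for row in range(depth):
        # Sorted access: read one entry from each list at the current depth.
        threshold = 0.0
        for name, entries in lists.items():
            if row >= len(entries):
                continue  # exhausted list adds nothing to the threshold
            obj, score = entries[row]
            threshold += score
            if obj not in totals:
                totals[obj] = sum(lookup[n].get(obj, 0.0) for n in lists)

        # No unseen object can score above the threshold, so stop as soon as
        # the current k-th best score reaches it.
        best = heapq.nlargest(k, totals.items(), key=lambda item: item[1])
        if len(best) == k and best[-1][1] >= threshold:
            return best

    return heapq.nlargest(k, totals.items(), key=lambda item: item[1])


if __name__ == "__main__":
    for car, score in threshold_top_k(LISTS, 2):
        print(f"{car}: {score:.1f}")
```

With k = 2 this returns Honda City (about 1.9) followed by Maruti Swift (about 1.8), and it stops after reading three of the four rows in each list, which is the point of top-k processing: the answer is found without scanning everything.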