DB & IR: Both Sides Now
Gerhard Weikum
weikum@mpi-inf.mpg.de
http://www.mpi-inf.mpg.de/~weikum/
in collaboration with Georgiana Ifrim, Gjergji Kasneci, Josiane Parreira, Maya Ramanath, Ralf Schenkel, Fabian Suchanek, Martin Theobald

DB and IR: Two Parallel Universes
Database Systems | Information Retrieval
canonical application: accounting | libraries
data type: numbers, short strings | text
foundation: algebraic / logic based | probabilistic / statistics based
search paradigm: Boolean retrieval (exact queries, result sets/bags) | ranked retrieval (vague queries, result lists)
market leaders: Oracle, IBM DB2, MS SQL Server, etc. | Google, Yahoo!, MSN, Verity, Fast, etc.
Parallel universes forever?
Gerhard Weikum June 14, 2007 2/41

Why DB&IR Now? - Application Needs
Simplify life for application areas like:
• Global health-care management for monitoring epidemics
• News archives for journalists, press agencies, etc.
• Product catalogs for houses, cars, vacation places, etc.
• Customer support & CRM in insurance, telecom, retail, software, etc.
• Bulletin boards for social communities
• Enterprise search for projects, skills, know-how, etc.
• Personalized & collaborative search in digital libraries, Web, etc.
• Comprehensive archive of blogs with time-travel search
Typical data:
  Disease (DId, Name, Category, Pathogen …)
  UMLS-Categories ( … )
  Patient (… Age, HId, Date, Report, TreatedDId)
  Hospital (HId, Address …)
Typical query: symptoms of tropical virus diseases and reported anomalies with young patients in central Europe in the last two weeks

Why DB&IR Now? - Platform Desiderata
[Figure: quadrant diagram, search type (unstructured vs. structured) against data type (structured vs. unstructured)]
• Structured search (SQL, XQuery) on structured data (records): DB Systems
• Unstructured search (keywords) on structured data (records): Keyword Search on Relational Graphs (IIT Bombay, UCSD, MSR, Hebrew U, CU Hong Kong, Duke U, ...)
• Structured search (SQL, XQuery) on unstructured data (documents): Querying entities & relations from IE (MSR Beijing, UW Seattle, IBM Almaden, UIUC, MPI, …)
• Unstructured search (keywords) on unstructured data (documents): IR Systems, Search Engines
• Center of the diagram: Integrated DB&IR Platform
Platform desiderata (from app developer's viewpoint):
• Flexible ranking on text, categorical, numerical attributes
  • cope with "too many answers" and "no answers"
• Ontologies (dimensions, facets) for products, locations, orgs, etc.
  • for query rewriting (relaxation, strengthening)
• Complex queries combining text & structured attributes
  • XPath/XQuery Full-Text with ranking
• High update rate concurrently with high query load

Why DB&IR Forever?
Turn the Web, Web2.0, and Web3.0 into the world's most comprehensive knowledge base ("semantic DB")!
• Data enrichment at very large scale
• Text and speech are key sources of knowledge production (publications, patents, conferences, meetings, ...)
Source | 2000 | 2007
indexed Web | 2 billion | 20 billion
Flickr photos | -- | 100 million
digital photos | ? | 150 billion
Wikipedia | 8 000 | 1.8 million
OECD researchers | 7.4 million | 8.4 million
patents world-wide | ? | 60 million
US Library of Congress | 115 million | 134 million
Google Scholar | --- | 500 million

Outline
• Past: Matter, Antimatter, and Wormholes
• Present: XML and Graph IR
• Future: From Data to Knowledge

Parallel Universes: A Closer Look
Matter:
• user = programmer
• query = precise spec. of info request
• interaction via API
• strength: indexing, QP
• weakness: user model
• eval. measure: efficiency (throughput, response time, TPC-H, XMark, …)
Antimatter:
• user = your kids
• query = approximation of user's real info needs
• interaction process via GUI
• strength: ranking model
• weakness: interoperability
• eval. measure: effectiveness (precision, recall, F1, MAP, NDCG, TREC & INEX benchmarks, …)

[Figure: timeline 1990, 1995, 2000, 2005 of research threads between the DB side (top) and the IR side (bottom)]
• Web Query Languages: W3QS, WebOQL, Araneus …
• Uncertain & Prob. Relations: Mystiq, Trio …
• Prob. DB (Cavallo & Pittarelli)
• Prob. Tuples (Barbara et al.)
• Semistructured Data: Lore, Xyleme …
• XPath, XPath Full-Text
• VAGUE (Motro)
• WHIRL (Cohen)
• XML IR (1st and 2nd generation): XXL, XIRQL, Elixir, JuruXML, XRank, Timber, TIJAH, XSearch, FleXPath, CoXML, TopX, MarkLogic, Fast …
• INEX
• Deep Web Search
• Prob. Datalog (Fuhr et al.)
• Digital Libraries, Struct. Docs, Multimedia IR
• Proximal Nodes (Baeza-Yates et al.)
• Faceted Search: Flamenco …
• Graph IR
• Web Entity Search: Libra, Avatar, ExDB …

WHIRL: IR over Relations [W.W. Cohen: SIGMOD'98]
Add text-similarity selection and join to relational algebra
Example:
  Select * From Movies M, Reviews R
  Where M.Plot ~ "fight" And M.Year > 1990 And R.Rating > 3
  And M.Title ~ R.Title And M.Plot ~ R.Comment
Movies (Title | Plot | Year):
  Matrix | In the near future … computer hacker Neo … fight training … | 1999
  Hero | In ancient China … fights … sword fight … fights Broken Sword … | 2002
  Shrek 2 | In Far Far Away … our lovely hero fights with cat killer … | 2004
Reviews (Title | Comment | Rating):
  Matrix 1 | … cool fights … new techniques … | 4
  Matrix Reloaded | … fights … and more fights … fairly boring … | 1
  Matrix Eigenvalues | … matrix spectrum … orthonormal … | 5
  Ying xiong aka. Hero | … fight for peace … sword fight … dramatic colors … | 5
Scoring and ranking:
  $s(\langle x,y \rangle, q{:}\, A \sim B) = \mathrm{cosine}(x.A, y.B)$ with $x_j \sim tf(\text{word } j \text{ in } x) \cdot idf(\text{word } j)$, with dampening & normalization
  $s(\langle x,y \rangle, q_1 \wedge \dots \wedge q_m) = \prod_{i=1}^{m} s(\langle x,y \rangle, q_i)$
• DB&IR for query-time data integration
• More recent work: MinorThird, Spider, DBLife, etc.
• But scoring models fairly ad hoc

XXL: Early XML IR [Anja Theobald, GW: Adding Relevance to XML, WebDB'00]
Union of heterogeneous sources without global schema
Similarity-aware XPath:
  Which professors from Saarbruecken (SB): //~Professor [//* = "~SB"]
  are teaching IR: [//~Course [//* = "~IR"] ]
  and have research projects on XML? [//~Research [//* = "~XML"] ]
[Figure: two example XML trees. Professor: Name: Gerhard Weikum; Address (City: SB, Country: Germany); Teaching with Course (Title: IR, Description: Information retrieval ..., Syllabus, Literature: Book, Article); Research with Project (Title: Intelligent Search of Heterogeneous XML Data, Funding: EU). Lecturer: Name: Ralf Schenkel; Address: Max-Planck Institute for Informatics, Germany; Teaching with Seminar (Title/Contents: Ranked retrieval …); Activities: Scientific (Name: INEX task coordinator (Initiative for the Evaluation of XML …), Sponsor: EU), Other …]

XXL: Early XML IR [Anja Theobald, GW: Adding Relevance to XML, WebDB'00]
Motivation: Union of heterogeneous sources has no schema
Similarity-aware XPath: same query as before, with //~Professor [//* = "~Saarbruecken"]
Scoring and ranking:
• tf*idf for content condition
• ontological similarity for relaxed tag condition
• score aggregation with probabilistic independence
Query expansion model: disjunction of tags
[Figure: ontology neighborhood of "professor" with the terms alchemist, primadonna, magician, artist, director, wizard, investigator, intellectual, researcher, scientist, scholar, mentor, academic, academician, faculty member, lecturer, teacher; edge labels RELATED (0.48) and HYPONYM (0.749)]
Similarity measures:
• Wu & Palmer: |path| through lca(x,y)
• Dice coeff.: 2 #(x,y) / (#x + #y) on Web

The Past: Lessons Learned
• DB&IR: added flexible ranking to (semi-)structured querying to cope with schema and instance diversity
• but ranking seems "ad hoc" and not consistently good in benchmarks
  [Figure: precision vs. recall plot]
• to win benchmark: tuning needed, but tuning is easier if ranking is principled!
• ontologies are mixed blessing: quality diverse, concept similarity subtle, danger of topic drift
  [Figure: concept hierarchy with the nodes entity, substance, solid, food, produce, edible fruit, pome, apple, Golden Delicious, and element, gold]
• ontology-based query expansion (into large disjunctions) poses efficiency challenge
  // ~Professor [...] expands to // { Professor, Researcher, Lecturer, Scientist, Scholar, Academic, ... }[...]
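The expansion of a relaxed tag such as ~Professor into a disjunction of similar tags can be illustrated with a small sketch. This is a minimal Python example, not the XXL implementation: it assumes the Dice coefficient 2 #(x,y) / (#x + #y) named on the XXL slide as the similarity measure, uses made-up occurrence and co-occurrence counts, and introduces the hypothetical helpers `dice`, `expand_tag`, and `to_xpath`. The similarity threshold is one assumed way to keep the disjunction small and limit topic drift.

```python
from typing import Dict, List, Tuple


def dice(count_x: int, count_y: int, count_xy: int) -> float:
    """Dice coefficient 2 * #(x,y) / (#x + #y)."""
    total = count_x + count_y
    return 2.0 * count_xy / total if total else 0.0


# Hypothetical occurrence counts (#x) and co-occurrence counts (#(x,y)).
# All numbers are invented for illustration only.
COUNTS: Dict[str, int] = {
    "professor": 1000, "researcher": 800, "lecturer": 400, "magician": 200,
}
CO_COUNTS: Dict[Tuple[str, str], int] = {
    ("professor", "researcher"): 450,
    ("professor", "lecturer"): 280,
    ("professor", "magician"): 6,
}


def expand_tag(tag: str, threshold: float = 0.3) -> List[Tuple[str, float]]:
    """Return the tag plus all similar tags whose Dice score reaches the threshold."""
    expansions = [(tag, 1.0)]  # the original tag always matches with score 1
    for (x, y), n_xy in CO_COUNTS.items():
        if x != tag:
            continue
        sim = dice(COUNTS[x], COUNTS[y], n_xy)
        if sim >= threshold:
            expansions.append((y, sim))
    # Highest similarity first, so a top-k processor can try good tags early.
    return sorted(expansions, key=lambda pair: -pair[1])


def to_xpath(expansions: List[Tuple[str, float]]) -> str:
    """Render the expansion as a tag disjunction in the slide's notation."""
    names = ", ".join(name.capitalize() for name, _ in expansions)
    return "//{ " + names + " }[...]"


if __name__ == "__main__":
    print(to_xpath(expand_tag("professor")))
```

With these invented counts the threshold keeps Researcher and Lecturer and drops Magician. Lowering the threshold grows the disjunction, which is exactly where the efficiency challenge mentioned above comes from.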
Outline
  Past: Matter, Antimatter, and Wormholes
• Present: XML and Graph IR
• Future: From Data to Knowledge

TopX: 2nd Generation XML IR [Martin Theobald, Ralf Schenkel, GW: VLDB'05, VLDB Journal]
• Exploit tags & structure for better precision
• Can relax tag names & structure for better recall
• Principled ranking by probabilistic IR (Okapi BM25 for XML)
• Efficient top-k query processing (using improved TA)
• Robust ontology integration (self-throttling to avoid topic drift)
• Efficient query expansion (on demand, by extended TA)
• Relevance feedback for automatic query rewriting
"Semantic" XPath Full-Text query:
  /Article [ftcontains(//Person, "Max Planck")]
  [ftcontains(//Work, "quantum physics")]
  //Children[@Gender = "female"]//Birthdates
supported by TopX engine:
  http://infao5501.ag5.mpi-sb.mpg.de:8080/topx/
  http://topx.sourceforge.net

Commercial Break [Martin Theobald, Ralf Schenkel, GW: VLDB'05]
TopX demo today 3:30-5:30

Principled Ranking by Probabilistic IR
"God does not play dice." (Einstein) IR does.
• binary features, conditional independence of features [Robertson & Sparck-Jones 1976]
• related to but different from statistical language models
Odds for item d with terms $d_i$ being relevant for query $q = \{q_1, \dots, q_m\}$:
$$ s(d,q) = \frac{P[d \in R(q) \mid \text{contents of } d]}{P[d \notin R(q) \mid \text{contents of } d]} = \frac{P[R \mid d]}{P[\bar{R} \mid d]} \sim \prod_{i=1}^{m} \frac{P[d_i \mid R]}{P[d_i \mid \bar{R}]} \sim \sum_{i \in q \cap d} \left( \log \frac{p_i}{1-p_i} + \log \frac{1-q_i}{q_i} \right) $$
with $p_i = P[d_i \mid R]$ and $q_i = P[d_i \mid \bar{R}]$
Relationship to tf*idf:
$$ s(d,q) \sim \sum_{i} \log \left( \frac{tf(i,d)}{\sum_k tf(k,d)} \cdot \frac{\sum_k df(k)}{df(i)} \right) $$
• led to Okapi BM25 (wins TREC tasks)
• adapted and extended to XML in TopX, ...
Now estimate $p_i$ and $q_i$ values from
• relevance feedback,
• pseudo-relevance feedback,
• corpus statistics
by MLE (with statistical smoothing) and store precomputed $p_i$, $q_i$ in index:
  $\hat{p}_i = (\#\,\text{rel. docs}) / \#\,\text{docs}$ or $\hat{p}_i = tf(i,d) / \sum_k tf(k,d)$
  $q_i \sim P[d_i \mid \text{corpus}]$, $\hat{q}_i = df(i) / \sum_k df(k)$

Probabilistic Ranking for SQL [S. Chaudhuri, G. Das, V. Hristidis, GW: TODS'06]
SQL queries that return many answers need ranking
Examples:
• Houses (Id, City, Price, #Rooms, View, Pool, SchoolDistrict, …)
  Select * From Houses Where View = "Lake" And City In ("Redmond", "Bellevue")
• Movies (Id, Title, Genre, Country, Era, Format, Director, Actor1, Actor2, …)
  Select * From Movies Where Genre = "Romance" And Era = "90s"
Odds for tuple d with attributes $X \cup Y$ being relevant for query q: $X_1 = x_1 \wedge \dots \wedge X_m = x_m$:
$$ s(d,q) = \frac{P[R \mid d]}{P[\bar{R} \mid d]} \sim \frac{P[d \mid R]}{P[d \mid \bar{R}]} = \frac{P[XY \mid R]}{P[XY \mid \bar{R}]} \sim \frac{P[Y \mid R]}{P[Y]} \cdot \frac{1}{P[X \mid Y]} $$
Estimate probabilities, exploiting workload W: $P[Y \mid R] \approx P[Y \mid X, W]$
Example: frequent queries
• … Where Genre = "Romance" And Actor1 = "Hugh Grant"
• … Where Actor1 = "Hugh Grant" And Actor2 = "Julia Roberts"
boost HG and JR movies in the ranking for Genre = "Romance" And Era = "90s"

From Tables and Trees to Graphs [BANKS, Discover, DBXplorer, KUPS, SphereSearch, BLINKS]
Schema-agnostic keyword search over multiple tables: graph of tuples with foreign-key relationships as edges
Example:
  Conferences (CId, Title, Location, Year)
  Journals (JId, Title)
  CPublications (PId, Title, CId)
  JPublications (PId, Title, Vol, No, Year)
  Authors (PId, Person)
  Editors (CId, Person)
  Select * From * Where * Contains "Gray, DeWitt, XML, Performance" And Year > 95
Result is connected tree with nodes that contain as many query keywords as possible
Related use cases:
• XML beyond trees
• RDF graphs
• ER graphs (e.g. from IE)
• social networks
Ranking:
$$ s(\text{tree}, q) = \alpha \cdot \sum_{\text{nodes } n} \text{nodeScore}(n, q) + (1 - \alpha) \cdot \frac{1}{1 + \sum_{\text{edges } e} \text{edgeScore}(e)} $$
with nodeScore based on tf*idf or prob. IR and edgeScore reflecting importance of relationships (or confidence, authority, etc.)
Top-k querying: compute best trees, e.g. Steiner trees (NP-hard)

The Present: Observations & Opportunities
• Probabilistic IR and statistical language models yield principled ranking and high effectiveness (related to prob. relational models (Suciu, Getoor, …) but different)
• Structural similarity and ranking based on tree edit distance (FleXPath, Timber, …)
  [Figure: two trees over the labels movie, actor, director, plot]
• Aim for comprehensive XML ranking model capturing content, structure, ontologies
• Aim to generate structure skeleton in XPath query from user feedback
  "life physicist Max Planck" becomes //article[//person "Max Planck"] [//category "physicist"] //biography
• Good progress on performance but still many open efficiency issues

Outline
  Past: Matter, Antimatter, and Wormholes
  Present: XML and Graph IR
• Future: From Data to Knowledge

Knowledge Queries
Turn the Web, Web2.0, and Web3.0 into the world's most comprehensive knowledge base ("semantic DB")!
Answer "knowledge queries" such as:
• proteins that inhibit both protease and some other enzyme
• neutron stars with X-ray bursts > $10^{40}$ erg s$^{-1}$ & black holes in 10''
• differences in Rembetiko music from Greece and from Turkey
• connection between Thomas Mann and Goethe
• market impact of Web2.0 technology in December 2006
• sympathy or antipathy for Germany from May to August 2006
• Nobel laureate who survived both world wars and his children
• drama with three women making a prophecy to a British nobleman that he will become king

Three Roads to Knowledge
• Handcrafted High-Quality Knowledge Bases (Semantic-Web-style ontologies, encyclopedias, etc.)
• Large-scale Information Extraction & Harvesting (using pattern matching, NLP, statistical learning, etc. for product search, Web entity/object search, ...)
• Social Wisdom from Web 2.0 Communities (social tagging, folksonomies, human computing, e.g.: del.icio.us, flickr, answers.yahoo, iknow.baidu, ...)

High-Quality Knowledge Sources
• universal "common-sense" ontologies:
  • SUMO (Suggested Upper Merged Ontology): 60 000 OWL axioms
  • Cyc: 5 million facts (OpenCyc: 2 million facts)
• domain-specific ontologies:
  • UMLS (Unified Medical Language System): 1 million biomedical concepts, 135 categories, 54 relations (e.g. virus causes disease | symptom)
  • GeneOntology, etc.
• thesauri and concept networks:
  • WordNet: 200 000 concepts (word senses) and hypernym/hyponym relations
  • can be cast into OWL-lite (or typed graph with statistical weights)
• lexical sources:
  • Wikipedia (1.8 million articles, 40 million links, 100 languages) etc.
• hand-tagged natural-language corpora:
  • TEI (Text Encoding Initiative) markup of historic encyclopedia
  • FrameNet: sentences classified into frames with semantic roles
growing with strong momentum

High-Quality Knowledge Sources
General-purpose thesauri and concept networks: WordNet family
can be cast into
• OWL-lite or into
• graph, with weights for relation strengths (derived from co-occurrence statistics)
enzyme -- (any of several complex proteins that are produced by cells and act as catalysts in specific biochemical reactions)
  => protein -- (any of a large group of nitrogenous organic compounds that are essential constituents of living cells; ...)
    => macromolecule, supermolecule
      ...
        => organic compound -- (any compound of carbon and another element or a radical)
          ...
  => catalyst, accelerator -- ((chemistry) a substance that initiates or accelerates a chemical reaction without itself being affected)
    => activator -- ((biology) any agency bringing about activation; ...)
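Casting such a thesaurus excerpt into a typed graph with weighted edges can be sketched as follows. This is a minimal Python illustration under stated assumptions: the concepts come from the WordNet excerpt above, the nesting of the hypernym chain is assumed, levels elided on the slide are collapsed into single edges, and every weight is a made-up placeholder rather than a real co-occurrence statistic. The helper name `related_concepts` and the rule "path strength = product of edge weights, best path wins" are illustrative choices, not something the slides prescribe.

```python
from typing import Dict, List, Tuple

# Typed, weighted edges: concept -> [(relation, target concept, weight)].
# Weights are hypothetical placeholders for relation strengths.
GRAPH: Dict[str, List[Tuple[str, str, float]]] = {
    "enzyme": [("hypernym", "protein", 0.8), ("hypernym", "catalyst", 0.6)],
    "protein": [("hypernym", "macromolecule", 0.7)],
    # Levels elided on the slide ("...") are collapsed into one edge here.
    "macromolecule": [("hypernym", "organic compound", 0.5)],
    "catalyst": [("hypernym", "activator", 0.4)],
}


def related_concepts(start: str, relation: str = "hypernym") -> Dict[str, float]:
    """Map each concept reachable via `relation` to its best path strength.

    Path strength is the product of edge weights along the path; when several
    paths reach the same concept, the strongest one is kept.
    """
    best: Dict[str, float] = {start: 1.0}
    stack = [start]
    while stack:
        node = stack.pop()
        for rel, target, weight in GRAPH.get(node, []):
            if rel != relation:
                continue
            strength = best[node] * weight
            # Only revisit a concept when a strictly stronger path is found.
            if strength > best.get(target, 0.0):
                best[target] = strength
                stack.append(target)
    del best[start]  # report only the related concepts, not the start itself
    return best


if __name__ == "__main__":
    for concept, strength in sorted(related_concepts("enzyme").items(),
                                    key=lambda item: -item[1]):
        print(f"{concept}: {strength:.2f}")
```

Because strengths shrink along a path, distant hypernyms such as "organic compound" end up with low scores, which is one simple way a query processor could throttle expansion toward overly general concepts.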
High-Quality Knowledge Sources
Wikipedia and other lexical sources

Exploit Hand-Crafted Knowledge
Wikipedia, WordNet, and other lexical sources
{{Infobox_Scientist
| name = Max Planck
| birth_date = [[April 23]], [[1858]]
| birth_place = [[Kiel]], [[Germany]]
| death_date = [[October 4]], [[1947]]
| death_place = [[Göttingen]], [[Germany]]
| residence = [[Germany]]
| nationality = [[Germany|German]]
| field = [[Physicist]]
| work_institution = [[University of Kiel]]</br> [[Humboldt-Universität zu Berlin]]</br> [[Georg-August-Universität Göttingen]]
| alma_mater = [[Ludwig-Maximilians-Universität München]]
| doctoral_advisor = [[Philipp von Jolly]]
| doctoral_students = [[Gustav Ludwig Hertz]]</br> …
| known_for = [[Planck's constant]], [[Quantum mechanics|quantum theory]]
| prizes = [[Nobel Prize in Physics]] (1918)
…

YAGO: Yet Another Great Ontology [F. Suchanek, G. Kasneci, GW: WWW 2007]
• Turn Wikipedia into explicit knowledge base (semantic DB)
• Exploit hand-crafted categories and templates
• Represent facts as explicit knowledge triples: relation (entity1, entity2), drawn as entity1 relation entity2 (in 1st-order logic, compatible with RDF, OWL-lite, XML, etc.)
• Map (and disambiguate) relations into WordNet concept DAG
Examples:
  Max_Planck bornIn Kiel
  Kiel isInstanceOf City

YAGO Knowledge Representation
Knowledge Base | # Facts
KnowItAll | 30 000
SUMO | 60 000
WordNet | 200 000
OpenCyc | 300 000
Cyc | 5 000 000
YAGO | 6 000 000
Accuracy: 97%
[Figure: YAGO graph in three layers. Concepts: Entity, Person, Scientist, Biologist, Physicist, Location, City, Country, linked by subclass edges. Individuals: Max_Planck, Erwin_Planck, Kiel, Nobel Prize, April 23, 1858, October 4, 1947, linked by instanceOf, hasWon, bornIn, bornOn, diedOn, FatherOf edges. Words: "Max Planck", "Max Karl Ernst Ludwig Planck", "Dr. Planck", linked to individuals by means edges.]
Online access and download at http://www.mpi-inf.mpg.de/~suchanek/yago/

NAGA: Graph IR on YAGO [G. Kasneci et al.: WWW'07]
Graph-based search on YAGO-style knowledge bases with built-in ranking based on confidence and informativeness
• statistical language model for result graphs
• conjunctive queries:
  $x bornIn Kiel
  $x isa scientist
• queries with regular expressions:
  $x hasFirstName | hasLastName Ling
  $x isa scientist
  $x (coAuthor | advisor)* Beng Chin Ooi
  $x worksFor $y
  $y locatedIn* Zhejiang

Ranking Factors
Confidence: Prefer results that are likely to be correct
• Certainty of IE
• Authenticity and Authority of Sources
  bornIn (Max Planck, Kiel) from "Max Planck was born in Kiel" (Wikipedia)
  livesIn (Elvis Presley, Mars) from "They believe Elvis hides on Mars" (Martian Bloggeria)
Informativeness: Prefer results that are likely important; may prefer results that are likely new to user
• Frequency in answer
• Frequency in corpus (e.g. Web)
• Frequency in query log
  q: isa (Einstein, $y) gives isa (Einstein, scientist), isa (Einstein, vegetarian)
  q: isa ($x, vegetarian) gives isa (Einstein, vegetarian), isa (Al Nobody, vegetarian)
Compactness: Prefer results that are tightly connected
• Size of answer graph
  [Figure: answer graph over Einstein, vegetarian, Tom Cruise, Nobel Prize, Bohr, 1962 with isa, won, bornIn, diedIn edges]

Information Extraction (IE): Text to Records
Person | BirthDate | BirthPlace | ...
Max Planck | 4/23, 1858 | Kiel
Albert Einstein | 3/14, 1879 | Ulm
Mahatma Gandhi | 10/2, 1869 | Porbandar

Person | ScientificResult
Max Planck | Quantum Theory

Constant | Value | Dimension
Planck's constant | $6.226 \cdot 10^{23}$ | Js

Person | Collaborator
Max Planck | Albert Einstein
Max Planck | Niels Bohr
combine NLP, pattern matching, lexicons, statistical learning

Knowledge Acquisition from the Web
Learn Semantic Relations from Entire Corpora at Large Scale (as exhaustively as possible but with high accuracy)
Examples:
• all cities, all basketball players, all composers
• headquarters of companies, CEOs of companies, synonyms of proteins
• birthdates of people, capitals of countries, rivers in cities
• which musician plays which instruments
• who discovered or invented what
• which enzyme catalyzes which biochemical reaction
Existing approaches and tools (Snowball [Gravano et al. 2000], KnowItAll [Etzioni et al. 2004], …): almost-unsupervised pattern matching and learning:
seeds (known facts) → patterns (in text) → (extraction) rules → (new) facts

Methods for Web-Scale Fact Extraction
seeds → text → rules → new facts
Example:
seed | text | rule
city (Seattle) | in downtown Seattle | in downtown X
city (Seattle) | Seattle and other towns | X and other towns
city (Las Vegas) | Las Vegas and other towns | X and other towns
plays (Zappa, guitar) | playing guitar: … Zappa | playing Y: … X
plays (Davis, trumpet) | Davis … blows trumpet | X … blows Y
Applying the rules to new text yields new facts:
• in downtown Beijing: city (Beijing)
• Coltrane blows sax: plays (Coltrane, sax)
New facts in turn yield new rules:
• city (Beijing) with "old center of Beijing": old center of X
• plays (C., sax) with "sax player Coltrane": Y player X
Assessment of facts & generation of rules based on statistics
Rules can be more sophisticated:
  playing NN: (ADJ|ADV)* NP & class(NN)=instrument & class(head(NP))=person → plays(head(NP), NN)

Performance of Web-IE
State-of-the-art precision/recall results:
relation | precision | recall | corpus | systems
countries | 80% | 90% | Web | KnowItAll
cities | 80% | ??? | Web | KnowItAll
scientists | 60% | ??? | Web | KnowItAll
headquarters | 90% | 50% | News | Snowball, LEILA
birthdates | 80% | 70% | Wikipedia | LEILA
instanceOf | 40% | 20% | Web | Text2Onto, LEILA
Open IE | 80% | ??? | Web | TextRunner
precision value-chain: entities 80%, attributes 70%, facts 60%, events 50%
Anecdotal evidence:
  invented (A.G. Bell, telephone), married (Hillary Clinton, Bill Clinton), isa (yoga, relaxation technique), isa (zearalenone, mycotoxin), contains (chocolate, theobromine), contains (Singapore sling, gin)
  invented (Johannes Kepler, logarithm tables), married (Segolene Royal, Francois Hollande), isa (yoga, excellent way), isa (your day, good one), contains (chocolate, raisins), plays (the liver, central role), makes (everybody, mistakes)

Beyond Surface Learning with LEILA
Learning to Extract Information by Linguistic Analysis [F. Suchanek, G. Ifrim, GW: KDD'06]
Limitation of surface patterns: who discovered or invented what
  "Tesla's work formed the basis of AC electric power"
  "Al Gore funded more work for a better basis of the Internet"
Almost-unsupervised Statistical Learning with Dependency Parsing
  (Cologne, Rhine), (Cairo, Nile), …
  (Cairo, Rhine), (Rome, 0911), (, [0..9]*), …
[Figure: dependency parses (link labels such as Ss, MVp, Ds, Js, Mp, Jp, Pv, DG) and phrase structure (NP, VP, PP) for the sentences "Cologne lies on the banks of the Rhine", "People in Cairo like wine from the Rhine valley", and "Paris was founded on an island in the Seine", the last one yielding (Paris, Seine)]
LEILA outperforms other Web-IE methods in terms of precision, recall, F1, but:
• dependency parser is slow
• one relation at a time

IE Efficiency and Accuracy Tradeoffs [see also tutorials by Cohen, Doan/Ramakrishnan/Vaithyanathan, Agichtein/Sarawagi]
IE is cool, but what's in it for DB folks?
• precision vs. recall: two-stage processing (filter pipeline)
  1) recall-oriented harvesting
  2) precision-oriented scrutinizing
• preprocessing
  • indexing: NLP trees & graphs, N-grams, PoS-tag patterns?
  • exploit ontologies? exploit usage logs?
• turn crawl&extract into set-oriented query processing
  • candidate finding
  • efficient phrase, pattern, and proximity queries
  • optimizing entire text-mining workflows [Ipeirotis et al.: SIGMOD'06]

The Future: Challenges
• Generalize YAGO approach (Wikipedia + WordNet)
• Methods for comprehensive, highly accurate mappings across many knowledge sources
  • cross-lingual, cross-temporal
  • scalable in size, diversity, number of sources
• Pursue DB support towards efficient IE (and NLP)
• Achieve Web-scale IE throughput that can
  • sustain rate of new content production (e.g. blogs)
  • with > 90% accuracy and Wikipedia-like coverage
• Integrate handcrafted knowledge with NLP/ML-based IE
• Incorporate social tagging and human computing

Outline
  Past: Matter, Antimatter, and Wormholes
  Present: XML and Graph IR
  Future: From Data to Knowledge

Major Trends in DB and IR
[Figure: two columns, Database Systems and Information Retrieval, with the trends: malleable schema (later), record linkage, deep NLP, adding structure, info extraction, graph mining, entity-relationship graph IR, dataspaces, ontologies, Web objects, statistical language models, data uncertainty, ranking, programmability, search as Web Service, Web 2.0 (in both columns)]

Conclusion
• DB&IR integration agenda:
  • models
    - ranking, ontologies, prob. SQL?, graph IR?
  • languages and APIs
    - XQuery Full-Text++?
  • systems
    - drop SQL, go light-weight?
    - combine with P2P, Deep Web, ...?
• Rethink progress measures and experimental methodology
• Address killer app(s) and grand challenge(s):
  • from data to knowledge (Web, products, enterprises)
  • integrate knowledge bases, info extraction, social wisdom
  • cope with uncertainty; ranking as first-class principle
• Bridge cultural differences between DB and IR:
  • co-locate SIGIR and SIGMOD

DB&IR: Both Sides Now
Joni Mitchell (1969): Both Sides Now
  …
  I've looked at life from both sides now,
  From up and down, and still somehow
  It's life's illusions I recall.
  I really don't know life at all.
Thank You!