EPFL seminar
Increasing the Precision when Obtaining Information from the Web
Gio Wiederhold, Stanford University
4 April 2000
Related report: www-db.stanford.edu/pub/gio/1999/miti.htm
Supported by the AFOSR New World Vistas Program
March 2000 Gio XIT 1

Growth Factors
[Diagram labels: Research & innovation; Tool building; General technology; Push; Information technology; Consumer; Product building & marketing; Pull; Business needs; Government responsibilities]

Trends 1998 : 1999
• Users of the Internet: 40% → 52% of U.S. population
• Growth of Net sites (now 2.2M public sites with 288M pages)
• Expected growth in e-commerce by Internet users [BW, 6 Sep. 1999]

  Segment          1998     1999
  books            7.2%    16.0%
  music & video    6.3%    16.4%
  toys             3.1%    10.3%
  travel           2.6%     4.0%
  tickets          1.4%     4.2%
  Overall          8.0%    33.0%  (= $9.5 billion)

Centroid in 1999: ~1% of total market
"An unsustainable trend cannot be sustained" [Herbert Stein]
[Chart: E-penetration for toys, % by year: 1998: 0.3, 1999: 1, 2000: 3, 2001: 9, 2002: 27, 2003: 81, 2004: **; annotation: new services]

Expect continuing growth
• Hardware technology will continue to lead and encourage broader usage
• Communication technology will continue to lead and become more economical
• User interfaces will improve and not be a barrier to the acceptance of technology
• Government policies will not hinder open interaction, or will not be able to

The Problem of Information Growth:
"We are drowning in information but starved for knowledge. This level of information is clearly impossible to be handled by present means. Uncontrolled and unorganized information is no longer a resource in an information society, instead it becomes the enemy."
-- John Naisbitt, author of the 1982 bestseller Megatrends
. . .
and it’s not getting better
Dealing with this issue requires Precision:
• Helpful for casual users
  – reduce human filtering when browsing
• Essential for business
  – regular tasks require automation
This needs knowledge.

Data + Knowledge → Information
• The product: Information
[Diagram labels: Data; Observations; Aggregation of instances; Integration of sources; Analyses; Filters; Knowledge from experts, encoded for reuse]

Precision to be improved in
• Relevance of Information for the Customer
  – modeling the customer
• Timeliness of Information
  – resolving temporal mismatch for past data
• Search for Information
  – precision versus recall
• Meaning of the Information (our focus here)
  – resolving semantic mismatch
Service model to achieve these objectives: services add value by increasing precision

Search techniques add value
• Yahoo: humans catalog and organize useful web sites.
• Junglee: integrates diverse sources using wrappers.
• AltaVista: automatically surfs and indexes the web.
• Excite: also tracks queries and classifies customers.
• Firefly: provides customer control over their profiles.
• Cookies: track users’ activities between sessions.
• Alexa: collects webpages and their usage.
• Google: ranks the reference importance of web pages.
• ...

Problems for search engines and progress
• Unsuitable source representations (being improved; at what rate?)
  – part classification: HTML → XML
  – print formats: PostScript, Adobe PDF
  – non-text: images, sound, video
  – hidden in databases behind CGI scripts
• Inconsistent semantics
  – context distinct / scope / view
• Naïve modeling of customers
  – roles & growth
Search engines cannot solve all problems

Large quantities affect cost
[Diagram: Progress / Nature]
• 1 human: the human genome, ~4 000 000 000 base pairs
• ~10 000 proteins
• ? diseases
• 6 000 000 000 humans
• <1000 systems
• ~2 000 000 molecules
[Diagram labels: Genes, and gene abnormalities; Everybody’s genes; Metabolic pathways; Small organic molecules (affect proteins, suitable for drugs)]

Need for precision
More precision is needed as data volume increases: a small error rate still leads to too many errors.
• False positives have to be investigated (an attractive-looking supplier that makes toys, not real cars; an apparent drug target with poor annotation)
• False negatives cause lost opportunities, suboptimal to some degree
• False positives (= poor precision) typically cost more than false negatives (= poor recall)
• Testing a false lead in pharmaceutics costs > $100 000 in stage 1
[Chart: data errors versus information quantity, with an "Information Wall"; adapted from Warren Powell, Princeton Univ.]

Heterogeneity among Domains
If interoperation involves distinct domains, mismatch ensues
• Autonomy conflicts with consistency
  – Local needs have priority
  – Outside uses are a byproduct
Heterogeneity must be addressed
• Platform and Operating Systems
• Representation and Access Conventions
• Naming and Ontology

Semantic Mismatches
Information comes from many autonomous sources
• Differing viewpoints (by source)
  – differing terms for similar items: { lorry, truck }
  – same terms for dissimilar items: trunk (luggage, car)
  – differing coverage: vehicles (DMV, AIA)
  – differing granularity: trucks (shipper, manuf.)
  – different scope: student (museum fee, Stanford)
• Hinders use of information from disjoint sources
  – missed linkages: loss of information, opportunities
  – irrelevant linkages: overload on user or application program
• Poor precision when merged: ok for web browsing, poor for business

Proposed Solutions
Specify and standardize terminology usage: ontology
• Globally, for all interacting sources
  – wonderful for users and their programs
  – long time to achieve: 2 sources (UAL, BA), 3 (+ trucks), 4, … all ?
  – costly maintenance, since all sources evolve
  – who has the authority to dictate conformance?
• Domain-specific (XML DTD assumption)
  – small, focused, cooperating groups
  – high quality; some examples: genomics, arthritis, Shakespeare plays
  – allows sharable, formal tools
  – ongoing, local maintenance affecting users: annual updates
  – poor interoperation; users still face inter-domain mismatches
• Solves only part of the problem

Domains and Consistency
• a domain will contain many objects
• the object configuration is consistent
• within a domain all terms are consistent &
• relationships among objects are consistent
• context is implicit
[Diagram label: Domain Ontology]
No committee is needed to forge compromises * within a domain
* Compromises hide valuable details

Objective of Scalable Knowledge Composition (SKC)
Provide for Maintainable Application Ontologies
• devolve maintenance onto many domain-specific experts / authorities
• provide an algebra to compute composed ontologies that are limited to their articulation terms
• enable interpretation within the source contexts

Sample Operation: INTERSECTION
• Source Domain 1: owned and maintained by Store
• Source Domain 2: owned and maintained by Factory
• Articulation: result contains shared terms, useful for purchasing

Tools to create articulations
[Diagram labels: Graph matcher for articulation-creating expert; Transport ontology; Vehicle ontology; Suggestions for articulations]

continue from initial point
Tool suggests terms for further articulation:
• by spelling similarity
• by graph position
• by term match nexus
Expert response:
1. Okay
2. False
3. Irrelevant to this articulation
All results are recorded
Okays are converted into articulation rules

Candidate Match Nexus
Term linkages automatically extracted from Webster’s* / Oxford+ dictionary
* freely available   + restricted
Based on processing headwords and their definitions using algebra primitives
Notice presence of 2 domains: chemistry, transport

Using the Match Nexus
Experiment: on government structures of NATO countries, the SKEIN system resolved over 70% of unmatched terms

An Ontology Algebra
A knowledge-based algebra for ontologies

  Operation       Result                        Effect
  Intersection    create a subset ontology      keep sharable entries
  Union           create a joint ontology       merge entries
  Difference      create a distinct ontology    remove shared entries

The Articulation Ontology (AO) consists of matching rules that link domain ontologies

INTERSECTION support
• Store Ontology and Factory Ontology are the 2 source domains
• Articulation ontology: matching rules that use terms from the 2 source domains
• Result: terms useful for purchasing

Other Basic Operations
• DIFFERENCE: material fully under local control
• UNION: merging entire ontologies
• Articulation ontology: typically prior intersections

Features of an algebra
• Operations can be composed
• Operations can be rearranged
• Alternate arrangements can be evaluated
• Optimization is enabled
• The record of past operations can be kept and reused when sources change

Knowledge Composition
• Composed knowledge for applications using A, B, C, E: (A ∩ B) ∪ (B ∩ C) ∪ (C ∩ E)
• Articulation knowledge for (A ∩ B), (B ∩ C), (C ∩ E), (C ∩ D)
• Knowledge resources A, B, C, D, E
Legend: ∪ : union, ∩ : intersection

SKC Primitive Operations
Model and Instance
Unary
• Summarize: abstract
• Glossarize: list terms
• Filter: reduce instances
• Extract: move into context
Binary
• Match: data corroboration
• Difference: distance measure
• Intersect: use of articulation
• Union: search broadening
Constructors
• create object
• create set
Connectors
• match object
• match set
Editors
• insert value
• edit value
• move value
• delete value
Converters
• object - value
• object indirection
• reference indirection

Exploiting the result
• Result has links to source
• Avoid the n² problem of interpreter mapping
• Processing & query evaluation is best performed within Source Domains & by their engines

Domain Specialization
• Knowledge Acquisition (20% effort) &
• Knowledge Maintenance (80% effort *)
to be performed by
• Domain specialists
• Professional organizations
• Field teams of modest size
Autonomously maintainable: empowerment
* based on experience with software

Summary
To sustain the growth of web usage:
1. The value of the results has to keep increasing: precision, relevance; not volume, nor recall
2. Value is provided by experts, encoded as models of diverse resources, customers
Problems to be addressed: mismatches, quality, scalable maintenance
→ Clear models + tools for these tasks

Acknowledgments
Supported by AF Office of Scientific Research, New World Vistas program
Participants
• David Maluf, postdoc, PhD EE, McGill Univ., 1997.
• Jan Jannink, PhD candidate, CS, grad. June 2000?
• Shrish Agarwal, MS graduate, CS, 1999.
• Prasenjit Mitra, PhD candidate, EE, grad. 2001?
• Martin Kerstens, PhD, summer visitor from CWI.
• Stefan Decker, postdoc, PhD Univ. Karlsruhe 1999.

Seminar Course on Intelligent Information Systems
• April-June 2000, at 14:15 - 15:15, room ? Presentations in English, but I'll try to manage discussions in French and/or German.
• I plan to cover the material in an integrating fashion, drawing from concepts in databases, artificial intelligence, software engineering, and business principles.
1. 13/4 Historical background, enabling technology: ARPA, Internet, DB, OO, AI, IR, XML.
2. 27/4 Search engines and methods (recall, precision, overload, semantic problems).
3. 4/5 Digital libraries, information resources. Value of services, copyright.
4. 11/5 E-commerce. Client-servers. Portals. Payment mechanisms, dynamic pricing.
5. 19/5 Mediated systems. Functions, interfaces, and standards. Intelligence in processing. Role of humans and automation, maintenance.
6. 26/5 Software composition. Distribution of functions. Parallelism. [ww D.Beringer]
7. 31/5 Application to Bioinformatics.
8. 15/6 Educational challenges. Expected changes in teaching and learning.
9. 22/6 Privacy protection and security. Security mediation.
10. 29/6 Summary and projection for the future.
• Feedback and comments are appreciated.
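
Illustration: the ontology algebra
The talk describes an ontology algebra (intersection, union, difference) driven by articulation rules that link terms of two source domains. The sketch below is only a toy illustration of that idea, not the SKC implementation: ontologies are assumed to be plain sets of domain-qualified term strings, articulation rules are assumed to be simple term pairs, and all names (`store`, `factory`, `rules`, the three functions) are hypothetical.

```python
# Toy sketch of the ontology algebra; every name here is hypothetical.
# Assumption: an ontology is a set of domain-qualified terms, and an
# articulation rule is a pair of terms judged equivalent by an expert.

def intersection(a, b, rules):
    """Articulation: linked term pairs (term in a, term in b)."""
    forward = {(s, t) for s, t in rules if s in a and t in b}
    backward = {(t, s) for s, t in rules if t in a and s in b}
    return forward | backward

def difference(a, b, rules):
    """Terms of a that no rule shares with b (fully under local control)."""
    shared = {s for s, _ in intersection(a, b, rules)}
    return a - shared

def union(a, b, rules):
    """Joint ontology: linked terms merged into one entry, the rest kept."""
    links = intersection(a, b, rules)
    merged = {frozenset(pair) for pair in links}
    linked_a = {s for s, _ in links}
    linked_b = {t for _, t in links}
    singles = {frozenset([x]) for x in (a - linked_a) | (b - linked_b)}
    return merged | singles

# Two autonomous source domains, each using its own terms.
store = {"store.truck", "store.price", "store.shelf"}
factory = {"factory.lorry", "factory.cost", "factory.lathe"}

# Articulation rules, as an expert might confirm them ("Okay" responses).
# The last rule refers to terms neither source has, so it is ignored.
rules = {
    ("store.truck", "factory.lorry"),
    ("store.price", "factory.cost"),
    ("store.van", "factory.van"),
}

articulation = intersection(store, factory, rules)
local_to_store = difference(store, factory, rules)
joint = union(store, factory, rules)
```

Terms are qualified by domain so that identical spellings in different sources are never merged by accident; only an explicit rule links them, which mirrors the point that the same term can denote dissimilar items in different domains.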