Intelligent Information Systems
Finish 2. Search
3. Digital Libraries
Gio Wiederhold
EPFL, April-June 2000, at 14:15 - 15:15, room INJ 218

Schedule for Seminar Course on Intelligent Information Systems
Presentations in English, but I'll try to manage discussions in French and/or German.
• I plan to cover the material in an integrating fashion, drawing from concepts in databases, artificial intelligence, software engineering, and business principles.
1. 13/4 Historical background, enabling technology: ARPA, Internet, DB, OO, AI, IR, XML.
2. 27/4 Search engines and methods (recall, precision, overload, semantic problems).
3. 4/5 Digital libraries, information resources. Value of services, copyright.
4. 11/5 E-commerce. Client-servers. Portals. Payment mechanisms, dynamic pricing.
5. 19/5 Mediated systems. Functions, interfaces, and standards. Intelligence in processing. Role of humans and automation, maintenance.
6. 26/5 Software composition. Distribution of functions. Parallelism. [ww D.Beringer]
7. 31/5 Application to Bioinformatics.
8. 15/6 Ontology Algebra. Educational challenges in teaching and learning.
9. 22/6 Privacy protection and security. Security mediation.
10. 29/6 Summary and projection for the future.
• Feedback and comments are appreciated.
7/26/2016 EPFL3D - Gio spring 2000 2

Summary of Search
Effective search requires increasing precision as the volume of base material increases
Many methods have been invented; they can be used in combination, although their relative effectiveness will decrease
Formalization and quantification of ad-hoc methods is a research topic
• Customer models – to control and simplify the process (practice)
• Value models – to increase relevance
• Semantic consistency for the customer – semantic translations from contexts (theory)
Technology transfer: how to integrate good ideas operationally?
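The first point of the search summary can be illustrated with a small calculation. This is a minimal sketch, assuming the usual set-based definitions of precision and recall. The function name `precision_recall` and the document ids are hypothetical, chosen only for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for sets of retrieved and relevant document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were actually returned
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical collection: 4 documents are relevant to the customer's task.
relevant = {1, 2, 3, 4}

# Small base: the query returns 5 documents, 3 of them relevant.
small_result = {1, 2, 3, 10, 11}

# Larger base, same query: the same 3 relevant hits plus 95 more irrelevant ones.
large_result = small_result | set(range(100, 195))

small_scores = precision_recall(small_result, relevant)  # precision 0.6, recall 0.75
large_scores = precision_recall(large_result, relevant)  # precision 0.03, recall 0.75
```

Recall is unchanged, but the customer now has to inspect 100 results instead of 5 to find the same three documents. That is the overload problem the slide refers to.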

Domain Specific Catalogs
Objective: more depth than a general catalog can provide
Accessed directly or by higher-level search engines
• Knowledge Acquisition (20% effort) &
• Knowledge Maintenance (80% effort *)
to be performed by
• Domain specialists
• Professional organizations
• Field teams of modest size
autonomously maintained: Empowerment
* based on experience with software

Use of domain specific catalogs
• In specialized e-com services within a domain
– domain-specific vocabulary is understood
– use of domain specific terms and abbreviations is effective
• From higher level catalogs [Yahoo, ..]
– context is established when accessed via hierarchy
– global semantics of terms found are unclear, error prone
Opportunities for domain specialists to make their mark
• Professional organizations (ACM Computing Reviews hierarchy)
• Research Institutes (EMBL for genomic terms)
– Reimbursement, payment models will differ
• professional advertisement should be constrained

Customer models
Customer is defined to be {a person, one specific task}
• arranging a vacation trip
   • activity/interests, location, town, days, hotel by grade, flight / tour bus, public transport, rented car
• arranging a business trip
   • location & date, hotel by corp. plan, flight, taxi, limo, or rented car
• getting a computer for Joe Cheap
   • search CPU by price, modem, display
• getting a computer for Peter Fast
   • search CPU by speed, storage, display, network
A customer model is
   Hierarchical
   computable, unambiguous
   alternatives at each level (evaluate, closure, commit, rollback)

Example: Result modes for ranking
Databases:
• Completeness
• All the answers
Prolog:
• Correctness
• The first answer
Optimization:
• The best choice
• Assumes all factors are known, no human decision
Customer:
• wants choices
also (but rarely invoked)
• explanation for trust
• provider background

Ranking
Qualitative Significant Differences: in terms of the customer model
Plan 1. UA59 dep. Wash. Dulles 17:10, arr. LAX 19:49
Plan 2. AA75 dep. Wash. Dulles 18:00, arr. LAX 20:10
Plan 3. UA119 dep. Wash. Dulles 9:25, arr. LAX 12:00
Busy Joe: P1 = P2, P3
Speedy Mike: P2, P1 = P3
Greedy Pete: P1 = P3, P2

Personal vs. Customer Model
Actual Person has multiple roles
how to switch
   explicitly - awkward
   implicitly - hard to perform fast
keep past contexts
return to prior local state
Switching rate will differ
• work versus fun
• adequacy of models
Concept not yet proven
• experimentally
• in practice

* Combining the models
Identify articulations
• Match customer and resource terms
• semantic mismatches
• thesauri, matching rules
Match level of detail
• Match customer and resource values, summarize numbers, result ranks
• completeness, unit mismatches, text
• indicate constraints in models
• textual abstraction
• input for visualization

Digital Libraries
Gio Wiederhold
Stanford University
Partially based on presentation for ACM February 1995

Participants
[Figure: participants diagram. Labels: Action, Publisher, Analyst, Printer, Editors, Concepts, Reader, Bookseller, (acq., copy), Referee, Customer, Librarian, Indexer, Cataloger, Library, Author]

1550? - printing invented
1969 - Arpanet has 5 nodes, 4 shared computing sites
1972 - 12 nodes, 37 sites, some data sharing
1975 - shared storage on VAX etc., enables local sharing networks
1978 - NLM computer-based bibliographic search in academic libraries
1979 - many computer scientists have/need access to networks
1981 - Stanford & Xerox router / gateway protocols for local subnets
1982 - email formalized through SMTP - new scientific medium
1989? - database publishing required for data supporting genomic papers
1990? - HTML for physics preprints [Berners-Lee]
1991 - Internet base for science (NREN, NSF backbone)
1991 - Web browsers Mosaic (UIUC, [Andreessen, Bina]), Netscape
1992 - commercial domains permitted - ICANN established
1994 - Digital Library initiative [ARPA, NASA, NSF]
1995 - fully commercial operation, research use by grants
1996 - NSF research initiative Next Generation Internet
1999 - 2.2M sites on Internet, 288M public pages
200x - paper publishing for scientific material nearly all on-line

500 years? Foundation for Digital Libraries

Services (Preparation)
Service: Example (Provider @ Support)
• Writing: the ACM article (author @ investment)
• Locating citations: references (librarian, friends @ inst.)
• Selecting works: refereeing (publishers & editors @ pubs.)
• Editing: spelling, word usage (staff @ pub. inst.)
• Graphics: figures, icons (staff artist @ inst. pub)
• Ontologies: ACM CR (professional society @ pubs., dues)
• Layout: figure placing, white space (staff @ pubs.)
• Composition: creating a topic issue (EiC @ inst., pubs.)

Services (Production & Distribution)
Service: Example (Provider @ Support)
• Printing, binding: a book (printer @ pub., vanity author)
• Master cataloging: assigning ISBN code (LoC @ government)
• Storage: source for orders (warehouse @ publisher)
• Distribution: shipping (mailing house @ pub., vendor)
• Local storage: store shelves (bookstore @ bookstore sales)
• Advertising: brochures, journal ads (pub., store @ sales)
• Reviewing: critic's columns (Prof. Soc. @ subscribers)
• Acquisition: university library (staff @ customer inst.)
• Local cataloging: university library (staff, services @ inst.)

Services (Dissemination)
Service: Example (Provider @ Support)
• Indexing: preparing MEDLINE entry (funded service @ gov)
• Retrieving: getting a book (librarian, clerk @ inst., store)
• Copying: class use (copy center, staff @ customer inst.)
• Revenue collection 1: royalties (vendor @ fraction of sales)
• Revenue collection 2: copyright fee (copyright center @ use)
• Validation: contact users cited (staff @ customer inst.)
• Abstracting: executive summary (staff @ customer inst.)
• Integration: summary (staff, consultant @ customer inst.)
• Presentation: viewgraphs (staff @ customer inst.)

Opportunities
• On-line editor services (given audience level)
• Integration of documents and figures for WWW
• Review services (domain-specific, value-added)
• Abstraction services (with references to base mat.)
• Alternative ranking (flights, investments, . . . )
• Active documents (function evaluation, plotting)
• Dynamic books (curriculum -> text book)
• Test generators and checkers (people, equipment)
• Test and demonstration datasets
• Services for remote education (tutors, . . . )

Stanford HighWire Press for Electronic Journals:
Journal of Biological Chemistry (weekly, 2-3 Gbyte/issue)
http://www-jbc.stanford.edu/jbc/
Cost was 8-9M/year, 40% is printing, 4200 libraries
more: Keller “publish Core Sciences”
Many Others
JIAR (Morgan Kaufmann) versus JAAAI (AAAI)

Ex.: Professional Organization
• 1998: ACM Digital Libraries bought by 69 libraries and 1 consortium (UC with 9 campuses, Office of the President, at a cost of 9x -40%, one print copy).
• Currently for: non-profits, large corporations; small corporations.
• Cost of electronic publication is 60% of print price.
• Now electronic only is charged at 80%: $4600.- for everything for a University
• $4000.- print only for a University.
• Libraries want institutional pricing, not institutional membership.

Publishing Paths
[Figure: publishing paths diagram. Labels: Manuscript Preparation, MS, submit, review, Review, Edit, accept, reject, typesetting, Print, Distribute, publish, Archiving, scanning, GOP, years, Library access, borrow, subscribe, buy, Readers, access rights, Commentary, Digital, On-line Availability, give up on paper ?]

Barriers to progress
• Large base of information: NSF/ARPA/NASA digital library program
– Publishers are unsure, provide mixed approaches [JUCS. ...]
• Common formats ± commercial needs dominate, science is dependent on them
• Ease of reading: adequate screens, printers are becoming ubiquitous
• Ease of annotation and their sharing
– several technologies, no standards
• Economical access for all ? role for public libraries, schools, foreign aid
• Quality control ? responsible initiatives by professional societies
• Acceptance by academic promotion committees of publications
– requires accepted, recognized quality control

Web Services
Browsers for HTML
– Mosaic [Andreessen, Bina at UIUC], Netscape, …
• obviate writing, loading of specialized search & output routines
• (a major effort for early on-line library services)
Search engines (Topic 2)
– locate sources for material
– scientific material is swamped by commercial priorities
URL based citations
– addresses, people, institutions move, change URL
– URI initiative by CNRI, supported by Library of Congress
– Los Alamos (publishers) offer persistent access for preprints

Political interest + Digital Earth [Vice President Gore, 1998]
• A multi-resolution, three-dimensional representation of the planet, into which we can embed vast quantities of geo-referenced data.
• a ‘collaboratory’ – a laboratory without walls – for research scientists seeking to understand the complex interactions between humanity and our environment.
• a “user interface” – a browsable, 3-D version of the planet available at various levels of resolution, a rapidly growing universe of networked geospatial information, and the mechanisms for integrating and displaying information from multiple sources.
Source: NSF 1998 planning viewgraph

Projects
Many research projects in U.S., Europe, Japan
• Most are government supported, at academic institutes
• Some industrial efforts, mainly in space sensing
– Spot, Lockheed Martin, … (spy spinoffs)
Conflict of technology versus contents
• computer scientists prefer advancing technology, steal contents
• domain scientists want to grow, validate contents, steal software
– cooperation?
• Requires appreciation of each other's scientific objectives (lacking)
• flexibility of formats for contents, processing
• failure to recognize domain-specific semantics

Components of Geo Informatics
[Figure: components diagram. Labels: Reality, Perception, Humans, Cognition, Action, Search & control, Data acquisition, GI systems]
Courtesy of Michael Goodchild et al.: NCGIA workshop, NSF, 14 Jan 1999; “Rediscovering the world through GIS”, GIS Planet meeting, Lisbon, Sept. 1998.

The TSIMMIS Project at Stanford
Ramana Yerneni, Yannis Papakonstantinou, ...
• Objective: Support mediation technology
– integrated access to distributed, autonomous, heterogeneous data sources, using object fusion
– wrapper toolkit
   – to rapidly create wrappers, based on a source specification
   – a uniform interface to heterogeneous sources
– mediator toolkit
   – to rapidly construct mediators, based on a mediator specification and wrapper specs
   – to integrate data from a set of wrappers
   – omit redundant documents [SPAM]

Fuse Information from Multiple Sources
[Figure: sources feeding the fusion step. Labels: Network, Ticker Tape, WWW, Personal database]
• group together information about the same real-world entity
• remove redundancies
• resolve conflicts

A Common Integration Architecture
[Figure: layered architecture diagram. Labels: Client Application (portfolios for each company); Mediator; Wrapper, Wrapper; Ticker Tape (stock market prices); Dialog (business reports)]

Source representations do not have Schemas
• semistructured
– irregular
– deeply nested
– incomplete instances
• implicit schema
– autonomous
– dynamic
Examples
• bibliographic information
• SGML documents
• World Wide Web (HTML & Links)
• genome data
• chemical structures
• files
Self-describing Dynamic Object Models (OEM, XML):
<structure> := <name> data | <name> <substructure>
(partial schema extracted when needed)

Wrappers & Mediators from High-Level Specifications
[Figure: specification-driven architecture. Labels: Client; Declarative Mediator Specification in OEM*; Mediator; Mediator Specification Interpreter; Wrapper, Wrapper; Wrapper Specification Interpreter; Source, Source; Declarative Source Specifications in OEM; Extracted Hints]
* being replaced by XML

Mediator Specification Interpreter Architecture
[Figure: interpreter pipeline. Labels: Query, Result; Mediator Specification; Query Rewriter; logical datamerge program; Cost-Based Optimizer; plan; Datamerge Engine; Queries to Wrappers; Results]

TSIMMIS Status sum 1999
• Mediator Specification Interpreter running on Ultrix, AIX, OSF.
• 9000 lines of C/C++ code
• 4000 C++ lines of Server/Client Support Libraries
• Integration of three disparate bibliographic sources
– legacy system
– flat BibTeX files
– relational DB
– wwWeb files

Progress ?
• On the invention of writing: “This discovery will create forgetfulness, people that read will seem omniscient but know nothing” [Socrates]
• The invention of printing is just a passing fad and will induce sloppiness [paraphrase of statement by a Jesuit expert in ~ 1600]
• Paper books will never be replaced by electronic gadgets [Many librarians today]
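
As a closing illustration, the three fusion steps listed under "Fuse Information from Multiple Sources" (group, remove redundancies, resolve conflicts) can be sketched in a few lines. This is a simplified sketch, not the TSIMMIS implementation. The function `fuse`, the field names, the trust ordering, and the sample records are all hypothetical, and conflicts are resolved by a simple "most trusted source wins" rule, which is an assumption made only for this example.

```python
from collections import defaultdict


def fuse(records, key="title", trust=()):
    """Fuse flat records from several sources into one record per entity.

    records: list of dicts, each with a "source" field plus data fields.
    key:     field that identifies the same real-world entity.
    trust:   source names, most trusted first (hypothetical conflict policy).
    """
    rank = {src: i for i, src in enumerate(trust)}

    # Step 1: group together information about the same real-world entity.
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key].strip().lower()].append(rec)

    fused = []
    for recs in groups.values():
        # Step 3 (policy): visit the most trusted source first.
        recs.sort(key=lambda r: rank.get(r["source"], len(rank)))
        merged = {}
        for rec in recs:
            for field, value in rec.items():
                if field != "source" and value is not None:
                    # Step 2: a field already filled is not repeated;
                    # on conflict the earlier (more trusted) value is kept.
                    merged.setdefault(field, value)
        merged["sources"] = [r["source"] for r in recs]
        fused.append(merged)
    return fused


# Hypothetical bibliographic records from three kinds of sources.
records = [
    {"source": "bibtex", "title": "Paper A", "year": 1992, "pages": None},
    {"source": "web", "title": "paper a ", "year": 1991, "pages": "38-49"},
    {"source": "relational", "title": "Paper B", "year": 1995, "pages": "1-10"},
]

result = fuse(records, trust=("relational", "bibtex", "web"))
# result[0] combines the two "Paper A" records: year 1992 comes from the
# more trusted bibtex source, pages "38-49" is filled in from the web source.
```

In a mediated system the grouping key and the conflict policy would come from the declarative mediator specification rather than being hard-coded as they are here.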