
Intelligent Information Systems
Finish 2. Search
3. Digital Libraries
Gio Wiederhold
EPFL, April-June 2000, 14:15 - 15:15, room INJ 218
Schedule for Seminar Course on
Intelligent Information Systems
Presentations in English -- but I'll try to manage discussions in French and/or German.
• I plan to cover the material in an integrating fashion, drawing from concepts in
databases, artificial intelligence, software engineering, and business principles.
1. 13/4 Historical background, enabling technology: ARPA, Internet, DB, OO, AI, IR, XML.
2. 27/4 Search engines and methods (recall, precision, overload, semantic problems).
3. 4/5 Digital libraries, information resources. Value of services, copyright.
4. 11/5 E-commerce. Client-servers. Portals. Payment mechanisms, dynamic pricing.
5. 19/5 Mediated systems. Functions, interfaces, and standards. Intelligence in
processing. Role of humans and automation, maintenance.
6. 26/5 Software composition. Distribution of functions. Parallelism. [ww D.Beringer]
7. 31/5 Application to Bioinformatics.
8. 15/6 Ontology Algebra. Educational challenges in teaching and learning.
9. 22/6 Privacy protection and security. Security mediation.
10. 29/6 Summary and projection for the future.
• Feedback and comments are appreciated.
7/26/2016
EPFL3D - Gio spring 2000
2
Summary of Search
Effective search requires increasing precision as the volume of base material increases.
Many methods have been invented and can be used in combination, although their relative effectiveness will decrease.
Formalization and quantification of ad-hoc methods is a research topic.
• Customer models
– to control and simplify the process
• Value models
– to increase relevance
• Semantic consistency for the customer
– semantic translations from contexts
[Slide labels: practice, theory]
Technology transfer: how to integrate good ideas operationally?
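A small numeric sketch of the first point of this summary, assuming (hypothetically) that a search method holds a fixed precision while the number of retrieved items grows with the volume of base material. The volumes, the 90% precision, and the tolerance of 10 irrelevant items are illustrative values, not figures from the lecture.

```python
def irrelevant_results(retrieved: int, precision: float) -> int:
    """Number of retrieved items that are not relevant."""
    return round(retrieved * (1.0 - precision))


def precision_needed(retrieved: int, tolerable_irrelevant: int) -> float:
    """Precision required so that at most `tolerable_irrelevant` items are noise."""
    return 1.0 - tolerable_irrelevant / retrieved


# Hypothetical result-set sizes as the base material grows.
for retrieved in (100, 10_000, 1_000_000):
    noise = irrelevant_results(retrieved, precision=0.9)
    needed = precision_needed(retrieved, tolerable_irrelevant=10)
    print(f"{retrieved:>9} retrieved: {noise:>7} irrelevant at 90% precision; "
          f"precision {needed:.5f} needed to keep the noise at 10 items")
```

At constant precision the absolute amount of noise grows linearly with the result set, which is why the customer needs ever higher precision to keep the overload bounded.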
Domain Specific Catalogs
Objective: more depth than a general catalog can provide
Accessed directly or by higher-level search engines
• Knowledge Acquisition (20% effort) &
• Knowledge Maintenance (80% effort *)
to be performed by
• Domain specialists
• Professional organizations
• Field teams of modest size
autonomously maintained
Empowerment
* based on experience with software
Use of domain specific catalogs
• In specialized e-com services within a domain
– domain-specific vocabulary is understood
– use of domain specific terms and abbreviations is effective
• From higher level catalogs [Yahoo, ..]
– context is established when accessed via hierarchy
– global semantics of terms found are unclear, error prone
Opportunities for domain specialists to make their mark
• Professional organizations (ACM Computing Reviews hierarchy)
• Research Institutes (EMBL for genomic terms)
– Reimbursement, payment models will differ
• professional advertisement should be constrained
Customer models
Customer is defined to be {a person, one specific task}
• arranging a vacation trip
– activity/interests → location town → days → hotel by grade → flight / tour bus → public transport → rented car
• arranging a business trip
– location & date → hotel by corp. plan → flight → taxi, limo, or rented car
• getting a computer for Joe Cheap
– search CPU by price → modem → display
• getting a computer for Peter Fast
– search CPU by speed → storage → display → network
⇒ A customer model is hierarchical ⇒ computable, unambiguous
⇒ alternatives at each level (evaluate, closure, commit, rollback)
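One way to read this slide is that a customer model is an ordered list of decision levels for a single task, with a choice committed (or rolled back) at each level. The sketch below illustrates that reading only; the class names `CustomerModel` and `Session`, and the example choices, are hypothetical and not part of the lecture.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CustomerModel:
    """A task-specific customer model: an ordered hierarchy of decision levels."""
    task: str
    levels: List[str]


@dataclass
class Session:
    """Walks the levels in order; each level ends with one committed choice."""
    model: CustomerModel
    choices: List[str] = field(default_factory=list)

    def current_level(self) -> Optional[str]:
        """Level awaiting a decision, or None once every level is decided."""
        i = len(self.choices)
        return self.model.levels[i] if i < len(self.model.levels) else None

    def commit(self, choice: str) -> None:
        """Fix the choice for the current level and move one level down."""
        if self.current_level() is None:
            raise ValueError("all levels are already decided")
        self.choices.append(choice)

    def rollback(self) -> str:
        """Undo the most recent commitment and return to that level."""
        return self.choices.pop()

    def closed(self) -> bool:
        """True when a choice has been committed at every level."""
        return self.current_level() is None


joe_cheap = CustomerModel("getting a computer for Joe Cheap",
                          ["CPU by price", "modem", "display"])
session = Session(joe_cheap)
session.commit("cheapest CPU on offer")   # hypothetical choice
session.commit("external modem")          # hypothetical choice
session.rollback()                        # reconsider the modem
print(session.current_level())            # modem
```

Because the levels are fixed per task, the same person would get a different `CustomerModel` for a different task.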
Example: Result modes for ranking
Databases:
• Completeness
• All the answers
Prolog:
• Correctness
• The first answer
Optimization:
• The best choice
• Assumes all factors are known, no human decision
Customer:
• wants choices
also (but rarely invoked)
• explanation for trust
• provider background
Ranking
Qualitative Significant Differences: in terms of the customer model
Plan 1. UA59 dep. Wash. Dulles 17:10, arr. LAX 19:49
Plan 2. AA75 dep. Wash. Dulles 18:00, arr. LAX 20:10
Plan 3. UA119 dep. Wash. Dulles 9:25, arr. LAX 12:00
Busy Joe: P1 = P2, P3
Speedy Mike: P2, P1 = P3
Greedy Pete: P1 = P3, P2
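The idea of ranking by qualitatively significant differences can be sketched as follows: score each plan under a customer model, sort, and treat plans whose scores differ by less than a threshold as equal. The scoring rules and thresholds here are assumptions made for illustration (Speedy Mike scored by elapsed flight time, Busy Joe by working time lost before an assumed 17:00 end of day). Greedy Pete is left out because the slide gives no fares.

```python
# Flight plans from the slide: (flight, local departure, local arrival).
PLANS = {
    "P1": ("UA59", "17:10", "19:49"),
    "P2": ("AA75", "18:00", "20:10"),
    "P3": ("UA119", "9:25", "12:00"),
}
TIME_ZONE_SHIFT = 3 * 60   # Washington is 3 hours ahead of Los Angeles
END_OF_WORK = 17 * 60      # assumption: Busy Joe's working day ends at 17:00


def minutes(clock: str) -> int:
    """Convert 'H:MM' to minutes after midnight."""
    hours, mins = clock.split(":")
    return int(hours) * 60 + int(mins)


def flight_minutes(dep: str, arr: str) -> int:
    """Elapsed time of a westbound Dulles to LAX flight."""
    return minutes(arr) - minutes(dep) + TIME_ZONE_SHIFT


def rank(scores: dict, threshold: int) -> str:
    """Order plans by score (lower is better); scores within threshold tie."""
    ordered = sorted(scores, key=scores.get)
    groups = [[ordered[0]]]
    for prev, cur in zip(ordered, ordered[1:]):
        if scores[cur] - scores[prev] < threshold:
            groups[-1].append(cur)     # not significantly different
        else:
            groups.append([cur])
    return ", ".join("=".join(sorted(group)) for group in groups)


durations = {p: flight_minutes(dep, arr) for p, (_, dep, arr) in PLANS.items()}
work_lost = {p: max(0, END_OF_WORK - minutes(dep)) for p, (_, dep, _) in PLANS.items()}

print("Speedy Mike:", rank(durations, threshold=10))   # P2, P1=P3
print("Busy Joe:   ", rank(work_lost, threshold=30))   # P1=P2, P3
```

The same three plans come out in different orders because the threshold is applied to a different attribute for each customer model.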
Personal vs. Customer Model
Actual person has multiple roles
• how to switch
– explicitly - awkward
– implicitly - hard to perform fast
• keep past contexts, return to prior local state
Switching rate will differ
• work versus fun
• adequacy of models
Concept not yet proven
• experimentally
• in practice
* Combining the models
Identify articulations
• Match customer and resource terms
• semantic mismatches
• thesauri, matching rules
Match level of detail
• Match customer and resource values, summarize numbers, result ranks
• completeness, unit mismatches, text
• indicate constraints in models
• textual abstraction
• input for visualization
Digital Libraries
Gio Wiederhold
Stanford University
Partially based on presentation for ACM February 1995
Participants
[Figure: participants in publishing. Labels: Concepts, Action, Author, Referee, Editors, Publisher, Printer, Bookseller, Library, Librarian, Indexer, Cataloger, Reader, Analyst, Customer, (acq., copy)]
Foundation for Digital Libraries
1450? - printing invented (500 years?)
1969 - Arpanet has 5 nodes, 4 shared computing sites
1972 - 12 nodes, 37 sites, some data sharing
1975 - shared storage on VAX etc., enables local sharing networks
1978 - NLM computer-based bibliographic search in academic libraries
1979 - many computer scientists have/need access to networks
1981 - Stanford & Xerox router / gateway protocols for local subnets
1982 - email formalized through SMTP - new scientific medium
1989? - database publishing required for data supporting genomic papers
1990? - HTML for physics preprints [Berners-Lee]
1991 - Internet base for science (NREN, NSF backbone)
1991 - Web browsers Mosaic (UIUC, [Andreessen, Bina]), Netscape
1992 - commercial domains permitted - ICANN established
1994 - Digital Library initiative [ARPA, NASA, NSF]
1995 - fully commercial operation, research use by grants
1996 - NSF research initiative Next Generation Internet
1999 - 2.2M sites on Internet, 288M public pages
200x - paper publishing for scientific material nearly all on-line
Services (Preparation)
Service: Example (Provider @ Support)
• Writing: the ACM article (author @ investment)
• Locating citations: references (librarian, friends @ inst.)
• Selecting works: refereeing (publishers&editors @ pubs.)
• Editing: spelling, word usage (staff @ pub. inst.)
• Graphics: figures, icons (staff artist @ inst. pub)
• Ontologies: ACM CR (professional society @ pubs., dues)
• Layout: figure placing, white space (staff @ pubs.)
• Composition: creating a topic issue (EiC @ inst., pubs.)
Services (Production & Distribution)
Service: Example (Provider @ Support)
• Printing, binding: a book (printer @ pub., vanity author)
• Master cataloging: assigning ISBN code (LoC @ government)
• Storage: source for orders (warehouse @ publisher)
• Distribution: shipping (mailing house @ pub., vendor)
• Local storage: store shelves (bookstore @ bookstore sales)
• Advertising: brochures, journal ads (pub., store @ sales)
• Reviewing: critic's columns (Prof. Soc. @ subscribers)
• Acquisition: university library (staff @ customer inst.)
• Local cataloging: university library (staff, services @ inst.)
Services (Dissemination)
Service: Example (Provider @ Support)
• Indexing: preparing MEDLINE entry (funded service @ gov.)
• Retrieving: getting a book (librarian, clerk @ inst., store)
• Copying: class use (copy center, staff @ customer inst.)
• Revenue collection 1: royalties (vendor @ fraction of sales)
• Revenue collection 2: copyright fee (copyright center @ use)
• Validation: contact users cited (staff @ customer inst.)
• Abstracting: executive summary (staff @ customer inst.)
• Integration: summary (staff, consultant @ customer inst.)
• Presentation: viewgraphs (staff @ customer inst.)
Opportunities
• On-line editor services (given audience level)
• Integration of documents and figures for WWW
• Review services (domain-specific, value-added)
• Abstraction services (with references to base mat.)
• Alternative ranking (flights, investments, . . . )
• Active documents (function evaluation, plotting)
• Dynamic books (curriculum -> text book)
• Test generators and checkers (people, equipment)
• Test and demonstration datasets
• Services for remote education (tutors, . . . )
Stanford High Wire Press for Electronic Journals:
Journal of Biological Chemistry. (weekly 2-3 Gbyte/issue)
http://www-jbc.stanford.edu/jbc/
Cost was 8-9M/year, 40% is printing, 4200 libraries
more: Keller “publish Core Sciences”
Many Others
JAIR (Morgan Kaufmann) versus JAAAI (AAAI)
Ex.: Professional Organization
• 1998: ACM Digital Library bought by 69 libraries and 1 consortium (UC with 9 campuses, Office of the President, at a cost of 9x - 40%, one print copy).
• Currently for: non-profits, large corporations; small corporations.
• Cost of electronic publication is 60% of print price.
• Now electronic only charged 80%,
• $4600.- for everything for a University
• $4000.- print only for a University. Libraries want institutional pricing, not institutional membership.
Publishing Paths
[Figure: flow diagram of publishing paths. Labels: Manuscript Preparation, submit, review MS, Review, accept, reject, Edit, typesetting, Print, Distribute, publish, Library access, Readers (subscribe, buy, borrow), access rights, Commentary, Archiving, scanning, years, GOP, Digital, On-line Availability, give up on paper?]
Barriers to progress
• Large base of information
✓ NSF/ARPA/NASA digital library program
– Publishers are unsure, provide mixed approaches [JUCS. ...]
• Common formats
± commercial needs dominate, science is dependent on them
• Ease of reading
✓ adequate screens, printers are becoming ubiquitous
• Ease of annotation and their sharing
– several technologies, no standards
• Economical access for all
? role for public libraries, schools, foreign aid
• Quality control
? responsible initiatives by professional societies
• Acceptance by academic promotion committees of publications
– requires accepted, recognized quality control
Web Services
Browsers for HTML
– Mosaic [Andreessen, Bina at UIUC], Netscape, …
• obviate writing, loading of specialized search & output routines
• (a major effort for early on-line library services)
Search engines (Topic 2)
– locate sources for material
– scientific material is swamped by commercial priorities
URL based citations
– addresses, people, institutions move, change URL
– URI initiative by CNRI, supported by Library of Congress
– Los Alamos (publishers) offer persistent access for preprints
Political interest + Digital Earth
[Vice President Gore, 1998]
• A multi-resolution, three-dimensional representation of the planet,
into which we can embed vast quantities of geo-referenced data.
• a ‘collaboratory’ – a laboratory without walls – for research scientists
seeking to understand the complex interactions between humanity
and our environment.
• a “user interface” – a browsable, 3-D version of the planet available
at various levels of resolution, a rapidly growing universe of
networked geospatial information, and the mechanisms for
integrating and displaying information from multiple sources.
Source: NSF 1998 planning viewgraph
Projects
Many research projects in U.S., Europe, Japan
• Most government supported to academic institutes
• Some industrial efforts, mainly in space sensing
– Spot, Lockheed Martin, … (spy spinoffs)
Conflict of technology versus contents
• computer scientists prefer advancing technology, steal contents
• domain scientists want to grow, validate contents, steal software
– cooperation?
• Requires appreciation of each other's scientific objectives (lacking)
• flexibility of formats for contents, processing
• failure to recognize domain-specific semantics
Components of Geo Informatics
[Figure: components of geo informatics. Labels: Reality, Perception, Humans, Cognition, Action, Search & control, Data acquisition, GI systems]
Courtesy of Michael Goodchild et al., NCGIA workshop, NSF, 14 Jan 1999; "Rediscovering the world through GIS", GIS Planet meeting, Lisbon, Sept. 1998.
The TSIMMIS Project at Stanford
Ramana Yerneni, Yannis Papakonstantinou, ...
• Objective: Support mediation technology
– integrated access to distributed, autonomous, heterogeneous data sources, using object fusion
• Wrapper toolkit
– to rapidly create wrappers, based on source specification
– a uniform interface to heterogeneous sources
• Mediator toolkit
– to rapidly construct mediators, based on a mediator specification and wrapper specs
– to integrate data from a set of wrappers
– omit redundant documents [SPAM]
Fuse Information from Multiple Sources
[Figure: sources being fused: Network, Ticker Tape, WWW, Personal database]
• group together information about the same real-world entity
• remove redundancies
• resolve conflicts
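The three fusion steps can be sketched on a handful of records. This is a minimal illustration, not the TSIMMIS implementation: the records, the `entity_key` matching rule (ignore case and punctuation), and the conflict rule (the most recent value wins) are all assumptions chosen for the example.

```python
from collections import defaultdict

# Hypothetical records about one company, arriving from different sources.
records = [
    {"source": "ticker", "entity": "IBM", "price": 110.5, "as_of": 3},
    {"source": "www", "entity": "I.B.M.", "price": 109.0, "as_of": 1},
    {"source": "www", "entity": "I.B.M.", "price": 109.0, "as_of": 1},  # duplicate
    {"source": "personal", "entity": "IBM", "shares": 200, "as_of": 2},
]

BOOKKEEPING = ("source", "entity", "as_of")


def entity_key(name: str) -> str:
    """Assumed matching rule: names are equal up to case and punctuation."""
    return "".join(ch for ch in name.upper() if ch.isalnum())


def fuse(records):
    """Group by real-world entity, drop duplicates, resolve conflicts."""
    groups = defaultdict(list)
    for rec in records:
        group = groups[entity_key(rec["entity"])]
        if rec not in group:               # remove redundancies
            group.append(rec)

    fused = {}
    for key, recs in groups.items():
        obj = {}
        # Resolve conflicts: later records overwrite earlier ones.
        for rec in sorted(recs, key=lambda r: r["as_of"]):
            for attr, value in rec.items():
                if attr not in BOOKKEEPING:
                    obj[attr] = value
        fused[key] = obj
    return fused


print(fuse(records))   # {'IBM': {'price': 110.5, 'shares': 200}}
```

In practice the matching and conflict rules are the hard, domain-specific part; here they are reduced to one-liners to keep the structure visible.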
A Common Integration Architecture
[Figure: a Client Application (portfolios for each company) queries a Mediator; the Mediator draws on two Wrappers, one over Ticker Tape (stock market prices) and one over Dialog (business reports)]
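A minimal sketch of this layering, assuming each wrapper hides its source's native format behind the same `query(company)` method and the mediator combines what the wrappers return. The class names, the input formats, and the sample data are hypothetical; a real wrapper would talk to the ticker feed or to Dialog rather than to in-memory lists.

```python
class TickerWrapper:
    """Wraps a hypothetical ticker feed delivered as 'SYMBOL price' lines."""

    def __init__(self, lines):
        self.lines = lines

    def query(self, company):
        for line in self.lines:
            symbol, price = line.split()
            if symbol == company:
                yield {"company": symbol, "price": float(price)}


class ReportWrapper:
    """Wraps a hypothetical report service delivered as (company, headline) rows."""

    def __init__(self, rows):
        self.rows = rows

    def query(self, company):
        for name, headline in self.rows:
            if name == company:
                yield {"company": name, "report": headline}


class Mediator:
    """Builds one portfolio object per company from all wrapped sources."""

    def __init__(self, wrappers):
        self.wrappers = wrappers

    def portfolio(self, company):
        result = {"company": company, "price": None, "reports": []}
        for wrapper in self.wrappers:          # uniform interface: query()
            for item in wrapper.query(company):
                if "price" in item:
                    result["price"] = item["price"]   # latest quote wins
                if "report" in item:
                    result["reports"].append(item["report"])
        return result


mediator = Mediator([
    TickerWrapper(["ACME 12.50", "INIT 99.00", "ACME 12.75"]),
    ReportWrapper([("ACME", "Quarterly results"), ("INIT", "New product")]),
])
print(mediator.portfolio("ACME"))
```

The client sees only the mediator; adding a source means adding a wrapper, not changing the client application.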
Source representations do not have Schemas
• semistructured
– irregular
– deeply nested
– incomplete instances
• implicit schema
– autonomous
– dynamic
Examples
• bibliographic information
• SGML documents
• World Wide Web (HTML & links)
• genome data
• chemical structures
• files
Self-describing Dynamic Object Models (OEM → XML):
<structure> := <name> data | <name> <substructure>
(partial schema extracted when needed)
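The grammar above can be mimicked with nested Python tuples: each object is a `(name, value)` pair whose value is either atomic data or a list of sub-objects. The sketch below is an illustration under that assumption, not OEM's actual syntax, and the bibliography entries are made up. It extracts a partial schema on demand as the set of label paths that actually occur.

```python
# Self-describing objects: every value carries its own label.
bibliography = ("bibliography", [
    ("entry", [
        ("title", "First example title"),
        ("author", "A. Author"),
        ("author", "B. Author"),
    ]),
    ("entry", [                       # irregular: no author, but a year
        ("title", "Second example title"),
        ("year", 1995),
    ]),
])


def label_paths(obj, prefix=""):
    """Return the set of label paths occurring in a (name, value) object."""
    name, value = obj
    path = f"{prefix}/{name}"
    paths = {path}
    if isinstance(value, list):       # <name> <substructure>
        for child in value:
            paths |= label_paths(child, path)
    return paths                      # <name> data contributes only its own path


for path in sorted(label_paths(bibliography)):
    print(path)
```

No schema is declared up front: the two entries disagree on their fields, and the extracted paths simply record the union of what was seen.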
Wrappers & Mediators from High-Level Specifications
[Figure: a Client above a Mediator above two Wrappers, each over a Source. Labels: Mediator Specification Interpreter, fed by a Declarative Mediator Specification in OEM*; Wrapper Specification Interpreter, fed by Declarative Source Specifications in OEM; Extracted Hints]
* being replaced by XML
Mediator Specification Interpreter Architecture
[Figure: Query → Query Rewriter (using the Mediator Specification) → logical datamerge program → Cost-Based Optimizer → plan → Datamerge Engine → Queries to Wrappers; Results from the wrappers return through the Datamerge Engine as the Result]
TSIMMIS Status sum 1999
• Mediator Specification Interpreter running on Ultrix, AIX, OSF.
• 9000 lines of C/C++ code
• 4000 C++ lines of Server/Client Support Libraries
• Integration of three disparate bibliographic sources
– legacy system
– flat BibTeX files
– relational DB
– wwWeb files
Progress ?
• On the invention of writing: “This discovery will create forgetfulness, people that read will seem omniscient but know nothing” [Socrates]
• The invention of printing is just a passing fad and will induce sloppiness [paraphrase of statement by a Jesuit expert in ~ 1600]
• Paper books will never be replaced by electronic gadgets [many librarians today]