Increasing the Precision when Obtaining Information from the Web Gio Wiederhold

EPFL seminar
Increasing the Precision when
Obtaining Information from the Web
Gio Wiederhold
Stanford University
4 April 2000
related report: www-db.stanford.edu/pub/gio/1999/miti.htm
Supported by the AFOSR New World Vistas Program
March 2000
Gio XIT 1
Growth Factors
[Figure: push and pull forces acting on Information Technology]
• General Technology Push: Research & Innovation, Tool building
• Consumer Pull: Product building & marketing, Business needs, Government responsibilities
Gio XIT 2
Trends 1998 : 1999
• Users of the Internet: 40% → 52% of U.S. population
• Growth of Net Sites (now 2.2M public sites with 288M pages)
• Expected growth in E-commerce by Internet users [BW, 6 Sep. 1999]

segment          1998    1999
books            7.2%    16.0%
music & video    6.3%    16.4%
toys             3.1%    10.3%
travel           2.6%    4.0%
tickets          1.4%    4.2%
Overall          8.0%    33.0% = $9.5 Billion
Centroid, in 1999: ~1% of total market

[Figure: E-penetration (%) by year, Toys]
Year   98    99   00   01   02   03   04
%      0.3   1    3    9    27   81   **
An unsustainable trend cannot be sustained [Herbert Stein]
new services
Gio XIT 3
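The E-penetration series on the slide (0.3, 1, 3, 9, 27, 81, **) roughly triples each year. The sketch below extends that pattern one more year to show why the 2004 entry cannot be filled in. The function name and the choice to start from the 1% shown for 1999 are assumptions made for illustration only.

```python
def project_penetration(start_year, start_pct, factor, years):
    """Return [(year, percent)], assuming penetration multiplies by `factor` yearly."""
    series = []
    pct = start_pct
    for offset in range(years):
        series.append((start_year + offset, pct))
        pct *= factor
    return series


# Assumption: start at the 1% shown for 1999 and triple every year.
projection = project_penetration(1999, 1.0, 3, 6)

for year, pct in projection:
    note = "  <- above 100%, so the trend must break" if pct > 100 else ""
    print(f"{year}: {pct:g}%{note}")
```

Under this assumption the value for 2004 exceeds 100%, which is the point of the Stein quotation: the growth rate has to change before then.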
Expect continuing growth
• Hardware technology will continue to lead and encourage broader usage
• Communication technology will continue to lead and become more economical
• User interfaces will improve and not be a barrier to the acceptance of technology
• Government policies will not hinder open interaction (or will not be able to)
Gio XIT 4
The Problem of Information Growth:
"We are drowning in information but starved for knowledge. This level of
information is clearly impossible to be handled by present means.
Uncontrolled and unorganized information is no longer a resource in
an information society, instead it becomes the enemy."
-- John Naisbitt, author of 1982 bestseller Megatrends
. . . and it’s not getting better
Dealing with this issue requires Precision:
• Helpful for casual users: reduces human filtering when browsing
• Essential for business: regular tasks require automation
needs Knowledge
Gio XIT 5
Data + Knowledge → Information
• The product: Information
• Data: Observations
– Aggregation of instances
– Integration of sources
– Analyses
– Filters
• Knowledge: from Experts, encoded for reuse
Gio XIT 6
Precision to be improved in
• Relevance of Information for the Customer
– modeling the customer
• Timeliness of Information
– resolving temporal mismatch for past data
• Search for Information
– precision versus recall
• Meaning of the Information (our focus here)
– resolving semantic mismatch
Service model to achieve these objectives:
services add value by increasing precision
Gio XIT 7
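The "precision versus recall" trade-off named above can be made concrete with the standard definitions: precision is the fraction of retrieved items that are relevant, and recall is the fraction of relevant items that were retrieved. The document identifiers and the two example queries below are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Compute (precision, recall) for one query result."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical collection: four documents are actually relevant.
relevant = {"d1", "d2", "d3", "d4"}

# A narrow query returns little, but everything it returns is relevant.
narrow = precision_recall(["d1", "d2"], relevant)

# A broad query finds every relevant document, plus as much noise.
broad = precision_recall(["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"], relevant)

print("narrow:", narrow)
print("broad: ", broad)
```

The broad query maximizes recall at the cost of precision, which is the direction the talk argues against as volume grows.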
Search techniques add value
Yahoo       humans catalog and organize useful web sites.
Junglee     integrates diverse sources using wrappers.
AltaVista   automatically surfs and indexes the web.
Excite      also tracks queries and classifies customers.
Firefly     provides customer control over their profiles.
Cookies     track users’ activities between sessions.
Alexa       collects webpages and their usage.
Google      ranks the reference importance of web pages.
...
Gio XIT 8
Problems for search engines and progress
• Unsuitable source representations (being improved; rate?)
– part classification: HTML → XML
– print formats: PostScript, Adobe PDF
– non-text: images, sound, video
– hidden in databases behind CGI scripts
• Inconsistent semantics
– context distinct / scope / view
• Naïve modeling of customers
– roles & growth
Search engines cannot solve all problems
Gio XIT 9
Large quantities affect cost
[Figure: Progress versus Nature]
• 1 human: the human genome, ~4 000 000 000 base pairs
• ~10 000 proteins
• ? diseases
• 6 000 000 000 humans
• <1000 systems
• ~2 000 000 molecules
Captions:
– Genes, and gene abnormalities
– Everybody’s genes
– Metabolic pathways
– Small organic molecules: affect proteins, suitable for drugs
Gio XIT 10
Need for precision
More precision is needed as data volume increases
--- a small error rate still leads to too many errors
False Positives have to be investigated
(attractive-looking supplier that makes toys, not real cars; apparent drug-target with poor annotation)
False Negatives cause lost opportunities, suboptimal to some degree
False positives = poor precision typically cost more than false negatives = poor recall
Testing a false lead in pharmaceutics costs > $100 000 in stage 1.
[Figure: data errors versus information quantity, showing the Information Wall; adapted from Warren Powell, Princeton Univ.]
Gio XIT 11
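The claim that a small error rate still yields too many errors is simple arithmetic. The sketch below holds the false-positive rate fixed and lets the number of candidates grow. The rate of 1000 per million (0.1%) and the candidate counts are assumptions; the cost per false lead uses the slide's stage-1 figure as a lower bound.

```python
COST_PER_FALSE_LEAD = 100_000  # slide: "> $100 000 in stage 1", used as a lower bound


def false_positive_burden(candidates, fp_per_million, cost_per_lead=COST_PER_FALSE_LEAD):
    """Return (expected false positives, minimum cost of investigating them)."""
    false_positives = candidates * fp_per_million // 1_000_000
    return false_positives, false_positives * cost_per_lead


# Assumption: a fixed false-positive rate of 1000 per million (0.1%).
burden = {n: false_positive_burden(n, 1000) for n in (1_000, 100_000, 10_000_000)}

for n, (fp, cost) in burden.items():
    print(f"{n:>10} candidates -> {fp:>6} false leads, at least ${cost:,}")
```

The rate never changes, yet the investigation cost grows in proportion to the volume searched.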
Heterogeneity among Domains
If interoperation involves distinct
domains, mismatch ensues
• Autonomy conflicts with consistency,
– Local Needs have Priority,
– Outside uses are a Byproduct
Heterogeneity must be addressed
• Platform and Operating Systems ✔✔
• Representation and Access Conventions ✔
• Naming and Ontology :
Gio XIT 12
Semantic Mismatches
Information comes from many autonomous sources
• Differing viewpoints (by source)
– differing terms for similar items: { lorry, truck }
– same terms for dissimilar items: trunk (luggage, car)
– differing coverage: vehicles (DMV, AIA)
– differing granularity: trucks (shipper, manuf.)
– different scope: student museum fee, Stanford
• Hinders use of information from disjoint sources
– missed linkages: loss of information, opportunities
– irrelevant linkages: overload on user or application program
• Poor precision when merged
ok for web browsing, poor for business
Gio XIT 13
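The two failure modes listed above, missed and irrelevant linkages, both appear when sources are joined on spelling alone. The following sketch uses the slide's lorry/truck and trunk examples. The two source dictionaries and the concept names they map to are hypothetical.

```python
# Each source maps its own term to the concept it denotes.
# Source contents and concept names are illustrative assumptions.
uk_shipper = {"lorry": "freight_vehicle", "trunk": "luggage_case"}
us_dealer = {"truck": "freight_vehicle", "trunk": "car_storage_compartment"}


def naive_links(a, b):
    """Link terms purely because they are spelled the same."""
    return {term for term in a if term in b}


def classify_linkages(a, b):
    """Return (irrelevant linkages, missed linkages) of the naive approach."""
    linked = naive_links(a, b)
    # Same spelling, different meaning: a false positive.
    irrelevant = {t for t in linked if a[t] != b[t]}
    # Same meaning, different spelling: a false negative.
    missed = {
        (ta, tb)
        for ta, ca in a.items()
        for tb, cb in b.items()
        if ca == cb and ta != tb
    }
    return irrelevant, missed


irrelevant, missed = classify_linkages(uk_shipper, us_dealer)
print("irrelevant:", irrelevant)
print("missed:    ", missed)
```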
Proposed Solutions
Specify and standardize terminology usage: ontology
• Globally: all interacting sources
– wonderful for users and their programs
– long time to achieve: 2 sources (UAL, BA), 3 (+ trucks), 4, … all?
– costly maintenance, since all sources evolve
– who has the authority to dictate conformance?
• Domain-specific: XML DTD assumption
– small, focused, cooperating groups
– high quality; some examples: genomics, arthritis, Shakespeare plays
– allows sharable, formal tools
– ongoing, local maintenance affecting users: annual updates
– poor interoperation; users still face inter-domain mismatches
• solves only part of the problem
Gio XIT 14
Domains and Consistency
• a domain will contain many objects
• the object configuration is consistent
• within a domain all terms are consistent &
• relationships among objects are consistent
• context is implicit
Domain Ontology
No committee is needed to forge compromises* within a domain
* Compromises hide valuable details
Gio XIT 15
Objective of Scalable Knowledge Composition (SKC)
Provide for Maintainable Application Ontologies
• devolve maintenance onto many domain-specific experts / authorities
• provide an algebra to compute composed ontologies that are limited to their articulation terms
• enable interpretation within the source contexts
Gio XIT 16
Sample Operation: INTERSECTION
[Figure: articulation between two source domains]
• Source Domain 1: owned and maintained by Store
• Source Domain 2: owned and maintained by Factory
• Articulation: result contains shared terms, useful for purchasing
Gio XIT 17
Tools to create articulations
[Figure: articulation tool]
• Transport ontology, Vehicle ontology
• Graph matcher for the articulation-creating Expert
• Suggestions for articulations
Gio XIT 18
continue from initial point
Tool suggests terms
for further articulation:
• by spelling similarity,
• by graph position
• by term match nexus
Expert response:
1. Okay
2. False
3. Irrelevant to this articulation
All results are recorded
Okays are converted into articulation rules
Gio XIT 19
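The suggest-and-confirm loop above can be sketched for the first suggestion criterion only, spelling similarity. This is not the SKC tool itself: it substitutes Python's standard `difflib.SequenceMatcher` for whatever matcher the tool uses, and the term lists, the 0.7 threshold, and the canned expert answers are all assumptions.

```python
from difflib import SequenceMatcher


def suggest_by_spelling(terms_a, terms_b, threshold=0.7):
    """Propose term pairs whose spelling similarity meets the threshold."""
    candidates = []
    for a in terms_a:
        for b in terms_b:
            score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
            if score >= threshold:
                candidates.append((a, b))
    return candidates


def record_responses(candidates, expert):
    """Record every expert verdict; only 'okay' pairs become articulation rules."""
    log = [(pair, expert(pair)) for pair in candidates]
    rules = [pair for pair, verdict in log if verdict == "okay"]
    return log, rules


# Hypothetical fragments of the Transport and Vehicle ontologies.
transport_terms = ["Truck", "Cargo", "Driver"]
vehicle_terms = ["truck", "Car", "Engine"]


def expert(pair):
    """Stand-in for the human expert: canned answers for this example."""
    answers = {("Truck", "truck"): "okay", ("Cargo", "Car"): "false"}
    return answers.get(pair, "irrelevant")


candidates = suggest_by_spelling(transport_terms, vehicle_terms)
log, rules = record_responses(candidates, expert)
print("suggested:", candidates)
print("rules:    ", rules)
```

The Cargo/Car pair shows why the expert stays in the loop: the spelling is close but the terms are unrelated, so the suggestion is recorded as False and produces no rule.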
Candidate Match Nexus
Term linkages automatically extracted from Webster’s* / Oxford+ dictionary
* freely available
+ restricted
Based on processing headwords → definitions using algebra primitives
Notice presence of 2 domains: chemistry, transport
Gio XIT 20
Using the Match Nexus
Experiment:
On government structures of
NATO countries:
SKEIN system resolved
over 70% of unmatched terms
Gio XIT 21
Using the Match Nexus
Gio XIT 22
An Ontology Algebra
A knowledge-based algebra for ontologies
Intersection   create a subset ontology     keep sharable entries
Union          create a joint ontology      merge entries
Difference     create a distinct ontology   remove shared entries
The Articulation Ontology (AO) consists of matching rules that link domain ontologies
Gio XIT 23
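A toy version of the three operations may help fix the idea. It makes strong simplifying assumptions: an ontology is reduced to a flat set of terms, and an articulation rule is a pair (term in source A, term in source B). The real algebra works on richer graph structures, and the Store and Factory terms below are hypothetical.

```python
def intersection(rules):
    """Articulation ontology: only the term pairs linked by matching rules."""
    return set(rules)


def union(onto_a, onto_b, rules):
    """Joint ontology: every term of both sources, with linked terms merged."""
    linked_a = {a for a, _ in rules}
    linked_b = {b for _, b in rules}
    merged = {f"{a}={b}" for a, b in rules}
    return merged | (onto_a - linked_a) | (onto_b - linked_b)


def difference(onto, rules, side):
    """Distinct ontology: terms of one source not shared via the articulation.

    side is 0 for the first source of each rule, 1 for the second.
    """
    return onto - {rule[side] for rule in rules}


# Hypothetical term sets and matching rules.
store = {"shoe", "size", "shelf"}
factory = {"footwear", "dimension", "lathe"}
rules = [("shoe", "footwear"), ("size", "dimension")]

print("intersection:", intersection(rules))
print("union:       ", union(store, factory, rules))
print("store only:  ", difference(store, rules, 0))
print("factory only:", difference(factory, rules, 1))
```

Note that the intersection is defined by the rules rather than by identical spelling, which is what distinguishes it from a plain set intersection.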
INTERSECTION support
[Figure: articulation ontology between the Store Ontology and the Factory Ontology]
• Articulation ontology: terms useful for purchasing
• Matching rules that use terms from the 2 source domains
Gio XIT 24
Other Basic Operations
DIFFERENCE: material fully under local control
UNION: merging entire ontologies
Articulation ontology: typically prior intersections
Gio XIT 25
Features of an algebra
Operations can be composed
Operations can be rearranged
Alternate arrangements can be evaluated
Optimization is enabled
The record of past operations can be
kept and reused when sources change
Gio XIT 26
Knowledge Composition
[Figure: composition of knowledge resources A, B, C, D, E through articulations]
• Knowledge resources: A, B, C, D, E
• Articulation knowledge for (A ∩ B), (B ∩ C), (C ∩ D), (C ∩ E)
• Composed knowledge for applications using A, B, C, E: (A ∩ B) ∪ (B ∩ C) ∪ (C ∩ E)
Legend: ∪ : union, ∩ : intersection
Gio XIT 27
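The composed expression (A ∩ B) ∪ (B ∩ C) ∪ (C ∩ E) can be evaluated on toy data to show that operations compose and that the order of the unions does not matter. As a loud simplification, an articulation is modeled here as a plain set intersection of shared terms, whereas the talk describes articulations built from expert-confirmed rules. All resource contents are hypothetical.

```python
from functools import reduce


def articulate(a, b):
    """Stand-in for an articulation: simply the terms two resources share."""
    return a & b


def compose(articulations):
    """Union a sequence of articulations into one composed term set."""
    return reduce(set.union, articulations, set())


# Hypothetical knowledge resources.
A = {"vehicle", "owner", "price"}
B = {"vehicle", "price", "route"}
C = {"route", "cargo", "port"}
E = {"port", "customs"}

ab, bc, ce = articulate(A, B), articulate(B, C), articulate(C, E)

composed = compose([ab, bc, ce])
rearranged = compose([ce, ab, bc])  # same operations, different order

print("composed:", sorted(composed))
print("order-independent:", composed == rearranged)
```

Because each articulation is kept as a separate intermediate result, only the ones touching a changed source need to be recomputed, which matches the slide's point about reusing the record of past operations.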
SKC Primitive Operations
Model and Instance
Unary
• Summarize -- abstract
• Glossarize - list terms
• Filter - reduce instances
• Extract - move into context
Binary
• Match - data corroboration
• Difference - distance measure
• Intersect - use of articulation
• Union - search broadening
Constructors
• create object
• create set
Connectors
• match object
• match set
Editors
• insert value
• edit value
• move value
• delete value
Converters
• object - value
• object indirection
• reference indirection
Gio XIT 28
Exploiting the result
Result has links to source.
Avoid the n² problem of interpreter mapping
Processing & query evaluation is best performed within Source Domains & by their engines
Gio XIT 29
Domain Specialization
• Knowledge Acquisition (20% effort) &
• Knowledge Maintenance (80% effort *)
to be performed by
• Domain specialists
• Professional organizations
• Field teams of modest size
autonomously maintainable
Empowerment
* based on experience with software
Gio XIT 30
Summary
To sustain the growth of web usage
1. The value of the results has to keep increasing:
precision, relevance; not volume, nor recall
2. Value is provided by experts, encoded as models of diverse resources, customers
Problems to be addressed
• mismatches
• quality
• maintenance
} Clear, scalable models + Tools for these tasks
Gio XIT 31
Acknowledgments
Supported by AF Office of Scientific Research
– New World Vistas program
Participants
• David Maluf, postdoc, PhD EE, McGill Univ., 1997.
• Jan Jannink, PhD candidate, CS, grad. June 2000?
• Shrish Agarwal, MS graduate, CS, 1999.
• Prasenjit Mitra, PhD candidate, EE, grad. 2001?
• Martin Kerstens, PhD, summer visitor from CWI.
• Stefan Decker, postdoc, PhD Univ. Karlsruhe 1999.
Gio XIT 32
Seminar Course on Intelligent Information Systems
• April-June 2000, at 14:15 - 15:15, room ?
Presentations in English -- but I'll try to manage discussions in French and/or German.
• I plan to cover the material in an integrating fashion, drawing from concepts in
databases, artificial intelligence, software engineering, and business principles.
1. 13/4 Historical background, enabling technology: ARPA, Internet, DB, OO, AI, IR, XML.
2. 27/4 Search engines and methods (recall, precision, overload, semantic problems).
3. 4/5 Digital libraries, information resources. Value of services, copyright.
4. 11/5 E-commerce. Client-servers. Portals. Payment mechanisms, dynamic pricing.
5. 19/5 Mediated systems. Functions, interfaces, and standards. Intelligence in processing. Role of humans and automation, maintenance.
6. 26/5 Software composition. Distribution of functions. Parallelism. [ww D. Beringer]
7. 31/5 Application to Bioinformatics.
8. 15/6 Educational challenges. Expected changes in teaching and learning.
9. 22/6 Privacy protection and security. Security mediation.
10. 29/6 Summary and projection for the future.
• Feedback and comments are appreciated.
Gio XIT 33