New Bases for New Data
Omar Benjelloun
Stanford University
Omar Benjelloun - New Bases for New Data
January 27th, 2006
Relational databases are great
A simple, understandable model for data
Boss
  Emp    Manager
  Joe    Bill
  Bill   Steve
High-level, declarative language for queries and updates: SQL
Efficient optimization techniques
Relational databases are the cornerstone of the management of
homogeneous, regular, exact, centralized information
… but data has changed
• Data is distributed, behind applications, dynamically changing
• Data is heterogeneous
• Data may be uncertain
Today
• Data is stored in relational databases (or XML)
• Techniques for data integration, data exchange
• … Lots of code
Traditional Database Management Systems (DBMS’s) are too rigid
New characteristics should be represented in the data
New bases are needed
• Foundations (models and languages)
• Processing and optimization techniques
Applications
Information integration
• Data is distributed on multiple heterogeneous, independent sources
• Conflicting information from the sources: inconsistency, uncertainty
• Varying and evolving reliability of sources
• Where data came from can be critical information
Scientific data management
Receptor (e.g., sensor) data management
Data cleaning (entity resolution)
And many others…
Agenda
Distributed and dynamic data: Active XML
• A “glue” language to connect data and programs
• XML documents with embedded calls to Web services
• Distributed interactions through the exchange of AXML data
• Techniques to query and control the exchange of AXML data
Uncertain data: ULDB’s
• An extension of the relational model with uncertainty and lineage
• Efficient query evaluation
• Computing probabilities
Conclusion
Active XML
Distributed data management
Information is everywhere
[Figure: data warehouses, databases, Web sites, and devices (PCs, PDAs, cell phones, home appliances, cars…) exchange XML through Web services over the Internet.]
The golden triangle
of distributed data management
XML: a standard for data representation & exchange
• Extensible Markup Language
• Labeled ordered trees
• Rich types: XML Schema
Query languages
• XPath, XQuery
Web services
• Standards for distributed computing (SOAP, WSDL)
What is Active XML (AXML)?
AXML is a declarative language
for distributed information management
and
an infrastructure to support this language,
in a peer-to-peer framework.
Active XML documents
XML documents with embedded calls to Web services
Intensional
• Some of the data is given explicitly
• Some is given intensionally
(i.e. the means to acquire data when needed are given)
Dynamic
• If the external sources change, the same document will provide different information
• Reaction to world changes
Not a new idea in databases, nor on the Web
Mixing calls and data is an old idea
• Procedural attributes in relational systems
• Basis of Object-oriented Databases
In Web programming
• Sun’s JSP, PHP+MySQL
Calls to Web services inside documents
• Macromedia FLEX, Apache Jelly, Microsoft XAML
What is new is the exploitation of the idea…
Web services in brief
A number of standards
• XML
• SOAP: Exchange of messages between applications
• WSDL: Description of service interfaces (e.g. input/output types)
• UDDI: Advertisement and discovery of services
• … other proposed standards (choreography, security, etc.)
For us: means to provide, invoke and describe
remote functions with XML input/output.
They make AXML documents universally understandable.
A sample AXML document
<?xml version="1.0" ?>
<newspaper>
  <title>Le Monde</title>
  <date>06/10/2003</date>
  <call svc="Yahoo.GetTemp">
    <city>Paris</city>
  </call>
  <call svc="TimeOut.GetEvents">
    exhibits
  </call>
</newspaper>
[Figure: tree view of the document. The newspaper node has children title (“Le Monde”), date (“06/10/2003”), GetTemp (with child city, “Paris”), and GetEvents (“Exhibits”).]
AXML documents may contain calls:
• to any existing Web services (e-bay.net, google.com…)
• to any AXML Web services (to be defined)
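As a rough illustration of what "embedded calls" means in practice, the sketch below parses a copy of the sample document with Python's standard library and lists the service calls it contains. The helper name `list_calls` is an assumption made for this example and is not part of the AXML system.

```python
import xml.etree.ElementTree as ET

# Copy of the sample document from the slide (XML declaration omitted).
AXML = """<newspaper>
  <title>Le Monde</title>
  <date>06/10/2003</date>
  <call svc="Yahoo.GetTemp">
    <city>Paris</city>
  </call>
  <call svc="TimeOut.GetEvents">
    exhibits
  </call>
</newspaper>"""


def list_calls(document):
    """Return (service name, parameter text) for every embedded <call>."""
    root = ET.fromstring(document)
    return [
        (call.get("svc"), "".join(call.itertext()).strip())
        for call in root.iter("call")
    ]


for service, parameter in list_calls(AXML):
    print(service, "<-", parameter)
```

The rest of the document (title, date) is ordinary extensional XML; only the `call` elements are intensional.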
Materialization
<?xml version="1.0" ?>
<newspaper>
  <title>Le Monde</title>
  <date>06/10/2003</date>
  <temp>16°C</temp>
  <call svc="Yahoo.GetTemp">
    <city>Paris</city>
  </call>
  <call svc="TimeOut.GetEvents">
    exhibits
  </call>
</newspaper>
[Figure: tree view of the materialized document. A SOAP call to Yahoo (Y!) returns the temp node (“16°C”), which appears next to the GetTemp call.]
• Replacing the call by its result is not the only option
• Calls are not necessarily RPC-style synchronous invocations
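Materialization can be sketched in a few lines, assuming a local stand-in for the remote service. Here `fake_get_temp`, the `SERVICES` registry, and the `keep_call` flag are hypothetical names for this illustration; a real peer would send a SOAP request instead. The flag shows the two options mentioned above: keep the call next to its result, or replace the call by its result.

```python
import xml.etree.ElementTree as ET

DOC = """<newspaper>
  <title>Le Monde</title>
  <date>06/10/2003</date>
  <call svc="Yahoo.GetTemp"><city>Paris</city></call>
  <call svc="TimeOut.GetEvents">exhibits</call>
</newspaper>"""


def fake_get_temp(call):
    """Hypothetical stand-in for the remote service; returns a fixed answer."""
    temp = ET.Element("temp")
    temp.text = "16°C"
    return [temp]


# Only services listed here are invoked; other calls stay intensional.
SERVICES = {"Yahoo.GetTemp": fake_get_temp}


def materialize(root, keep_call=True):
    """Insert the result of each known call before the call node."""
    for parent in list(root.iter()):
        # Walk children backwards so insertions do not shift pending indices.
        for index, child in reversed(list(enumerate(parent))):
            if child.tag != "call" or child.get("svc") not in SERVICES:
                continue
            results = SERVICES[child.get("svc")](child)
            if not keep_call:
                parent.remove(child)
            for offset, node in enumerate(results):
                parent.insert(index + offset, node)
    return root


def child_tags(root):
    return [child.tag for child in root]


print(child_tags(materialize(ET.fromstring(DOC))))
```

The GetEvents call is left untouched because no stand-in is registered for it, which mirrors a peer that chooses not to invoke some calls.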
AXML Web services
Parameters: AXML data
Result: AXML data
Great flexibility
Distribute computations: by sending as parameters
data containing service calls, one can delegate some
work to other peers.
Partial computations: by returning data containing
service calls, one can give to the receiver the control
of these calls.
Distributed interactions
Exchanging Active XML
To call or not to call ?
[Figure: tree view of the newspaper document, with the temp node (“16°C”) obtained from the GetTemp call to Yahoo (Y!).]
Materialization can be performed
• by the sender, before sending a document…
• or by the receiver, after receiving it.
Why control the materialization of calls?
For added functionality, e.g.
• Intensional data lets the receiver get up-to-date information.
For security reasons or capabilities, e.g.
• I don’t trust this Web service/domain,
• I don’t have the right credentials to invoke it,
• It costs money,
• Maybe the receiver doesn’t know Active XML!
For performance reasons, e.g.
• A proxy can invoke all the services on behalf of a PDA.
… and many more reasons you can think of!
How to control it? Using types
We extend XML Schema with intensional types: XML Schema_int
[Figure: a Sender and a Receiver, each with its own capabilities, ACL, cost, ..., are linked by a data exchange schema; the exchanged trees contain calls to services f, g, q, and r.]
Static analysis algorithms use signatures of services: WSDL_int
The extended schema language
To simplify, we use here a DTD-like syntax
Data:
  newspaper = title.date.(GetTemp|temp).(GetEvents|exhibit*)
  title     = data
  date      = data
  temp      = data
  city      = data
  exhibit   = title.(GetDate|date)
Functions:
  GetTemp(city)   -> temp
  GetEvents(data) -> (exhibit|performance)*
  GetDate(title)  -> date
[Figure: tree view of the sample newspaper document, with title (“Le Monde”), date (“06/10/2003”), GetTemp (city “Paris”), and GetEvents (“Exhibits”).]
Rewriting: replace call(s) by an arbitrary output of the service.
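One way to read these content models is as regular expressions over the sequence of child labels, where a function name is just another label. The sketch below is an assumption-laden illustration: it transcribes the `newspaper` and `exhibit` types by hand into Python regular expressions over space-separated labels; `CONTENT_MODELS` and `conforms` are names invented for this example.

```python
import re

# Hand-written translations of the DTD-like types into regular
# expressions over space-separated child labels.
CONTENT_MODELS = {
    # newspaper = title.date.(GetTemp|temp).(GetEvents|exhibit*)
    "newspaper": r"title date (GetTemp|temp)( GetEvents|( exhibit)*)",
    # exhibit = title.(GetDate|date)
    "exhibit": r"title (GetDate|date)",
}


def conforms(element, children):
    """Check whether a list of child labels matches the element's type."""
    word = " ".join(children)
    return re.fullmatch(CONTENT_MODELS[element], word) is not None


# The intensional document and a fully materialized one both conform.
print(conforms("newspaper", ["title", "date", "GetTemp", "GetEvents"]))
print(conforms("newspaper", ["title", "date", "temp", "exhibit", "exhibit"]))
# A performance is not allowed by the newspaper type.
print(conforms("newspaper", ["title", "date", "temp", "performance"]))
```

The last case is the interesting one for rewriting: GetEvents may return a performance, so invoking it cannot be guaranteed to yield a valid newspaper.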
Rewritings
The Goal:
Given
• an AXML document d
• a schema s,
Can we rewrite d so that it matches s?
Safe rewriting: one that for sure leads to s
(we know without making any call)
Possible rewriting: one that may lead to s
(depending on the answers of services)
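The difference between safe and possible rewritings can be illustrated with a brute-force sketch. It is not the automata-based algorithm of the talk: it assumes 1-depth rewritings only, approximates the output type `(exhibit|performance)*` by all sequences of length at most 2, and encodes target types as regular expressions over space-separated labels. The names `SIGNATURES`, `classify`, `TARGET_A`, and `TARGET_B` are invented for the example.

```python
import re
from itertools import product

# Bounded approximation of the service signatures (an assumption):
# GetEvents may return any sequence of up to two exhibits/performances.
SIGNATURES = {
    "GetTemp": [["temp"]],
    "GetEvents": [
        list(seq)
        for n in range(3)
        for seq in product(["exhibit", "performance"], repeat=n)
    ],
}


def matches(word, pattern):
    return re.fullmatch(pattern, " ".join(word)) is not None


def classify(word, pattern):
    """Return 'safe', 'possible', or 'none' for 1-depth rewritings."""
    calls = [i for i, label in enumerate(word) if label in SIGNATURES]
    possible = False
    # Try every choice of which calls to invoke.
    for choice in product([False, True], repeat=len(calls)):
        invoked = {i for i, chosen in zip(calls, choice) if chosen}
        options = [
            SIGNATURES[label] if i in invoked else [[label]]
            for i, label in enumerate(word)
        ]
        results = [sum(parts, []) for parts in product(*options)]
        verdicts = [matches(result, pattern) for result in results]
        if all(verdicts):
            return "safe"  # every possible answer leads to the target
        possible = possible or any(verdicts)
    return "possible" if possible else "none"


WORD = ["title", "date", "GetTemp", "GetEvents"]
TARGET_A = r"title date temp( GetEvents|( exhibit)*)"
TARGET_B = r"title date temp( exhibit)*"
print(classify(WORD, TARGET_A), classify(WORD, TARGET_B))
```

For `TARGET_A`, invoking only GetTemp is safe because its output is always a single `temp`. For `TARGET_B`, GetEvents must be invoked and may return a performance, so the rewriting is only possible.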
Difficulties
Infinite search space
• Vertical
• Horizontal
Main problem
• The result of a Web service call is unknown
• We just know a signature (input/output types)
We want a very efficient solution
Foundations of the problem
• String & tree automata,
• with existential and universal transitions.
Results
The general problem is undecidable [MSS03]
Restrictions on the considered rewritings
• Left-to-right: No “going back and forth”
• K-depth: bound on the nesting of function calls
(Search space still infinite but finitely representable)
Under these restrictions
• We have algorithms to find safe/possible rewritings.
• They run in PTIME (for deterministic schemas).
• We can also do it between schemas.
Implementation
• demo at VLDB 2003 (customizable news syndication)
Safe rewriting algorithm (flavor)
Build an FSA that accepts all k-depth rewritings of the initial word.
[Figure: automaton A_w^1 with states q0 to q7 and transitions labeled title, date, GetTemp, temp, GetEvents, exhibit, and performance.]
Build an FSA that recognizes the complement of the target type.
[Figure: complement automaton with states p0 to p6 and transitions labeled title, date, temp, GetEvents, exhibit, and *.]
Safe rewriting algorithm
Compute the intersection of these languages:
[Figure: product of the automaton A_w^k and the complement automaton, with states such as (q0,p0), (q1,p1), (q2,p2), (q3,p3), (q5,p2), (q6,p3), (q7,p5), and (q4,p4), and transitions labeled title, date, GetTemp, temp, GetEvents, exhibit, and performance.]
A smart marking determines whether a safe rewriting exists.
Then run the word on the marked automaton to find an actual rewriting.
Optimizations: lazy construction of the automata, parallel evaluation of calls
Querying Active XML
Querying AXML Data
Given a (tree pattern) query:
/newspaper[temp > 18°C]/exhibits//exhibit[location="Le Louvre"]
Materialize the document?
Call only the services that may contribute data to the query answer.
[Figure: the newspaper document tree, with nodes title (“Le Monde”), temp (“19°C”), and exhibits, and calls to GetTemp, GetEvents, GetExhibits, and getDate.]
The problem: Lazy evaluation of service calls
To call or not to call, this time when evaluating a query
Lazy evaluation
Difficulties:
• Calls can be found everywhere in the document
• May appear dynamically (as a result of previous calls)
• May become (ir)relevant due to previous invocations
• Need to take signatures of calls into consideration
A possible approach: modify the query processor
• Top-down evaluation
• Trigger the calls found on the way
• Not so great:
  – Computation is blocked
  – Optimization opportunities are lost
NFQ’s
Given a query to evaluate:
[Figure: query tree. newspaper has children temp (> 18°C) and exhibits; exhibits leads to exhibit, with child location (“Le Louvre”).]
Derive a set of “node-focused” queries (NFQ) that find the relevant calls when evaluated on the document.
[Figure: example NFQ over newspaper, temp (> 18°C), and exhibits, with wildcard (*) nodes. Etc.]
The NFQs need to be reevaluated as the document evolves!
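A much simplified version of the idea can be sketched for linear path queries: a call is a candidate if it sits under a node reached by some prefix of the query path. The sketch below assumes child steps only (no `//`, no predicates) and ignores service signatures, so it returns a superset of the truly relevant calls. The sample document, the `Hypothetical.GetScores` service, and the function `relevant_calls` are made up for this example.

```python
import xml.etree.ElementTree as ET

DOC = """<newspaper>
  <title>Le Monde</title>
  <call svc="Yahoo.GetTemp"><city>Paris</city></call>
  <exhibits>
    <call svc="TimeOut.GetExhibits"><city>Paris</city></call>
  </exhibits>
  <sports>
    <call svc="Hypothetical.GetScores"/>
  </sports>
</newspaper>"""


def relevant_calls(root, steps):
    """Collect calls found under nodes matched by a prefix of the path."""
    frontier = [root] if root.tag == steps[0] else []
    found = []
    for next_step in steps[1:] + [None]:
        for node in frontier:
            found.extend(call.get("svc") for call in node.findall("call"))
        if next_step is None:
            break
        # Advance one step along the query path.
        frontier = [
            child for node in frontier for child in node.findall(next_step)
        ]
    return found


query = ["newspaper", "exhibits", "exhibit", "location"]
print(relevant_calls(ET.fromstring(DOC), query))
```

The call under `sports` is never reached, so it would not be invoked. After a call returns new data, the function would have to be run again, since the result may contain further calls.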
Optimizations
Service calls sequencing
• Analysis of the relationship between calls (through the NFQ’s)
• Layering, and parallelization inside each layer
Filtering by type analysis
• Match output types of services to the data expected by queries
“Pushing” queries to capable services
Acceleration:
• Via relaxation:
  – NFQ approximation
  – Superset of the relevant calls
• Via a special access structure, similar to a DataGuide:
  – Restricted to paths that lead to service calls
  – Indexes the calls
Experimental assessment
• 10x speed-up when combining optimizations
There is more…
The AXML peer system
• Manages persistent AXML documents
• Provides AXML services
• Open source
Language extensions to control the activation of calls
Continuous services
Theoretical foundations
…check out http://www.activexml.net
Uncertain data
Basic Premise
Traditional relational DB
• Every data item’s value must be exact
• Every data item is in the database or not
• Where data came from and how it evolves is not important
ULDB’s relax these constraints by making
1. Data
2. Uncertainty
3. Lineage
all first-class interrelated concepts
Previous work
Models for uncertainty
• Labeled nulls, c-tables, probabilistic models,...
Trade-off between
• Expressiveness
• Simplicity of representation, complexity of operations
• We investigated this space in [DBHM06]
Models for lineage
• In relational databases, data warehouses
• Definition of lineage can be tricky for complex queries
ULDB’s are the first to consider lineage together with uncertainty
Uncertainty
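SAW (Witness, Car), a relation made of x-tuples:
  x-tuple 1: (Granny, VW) || (Granny, BMW)
  x-tuple 2: (Cop, Ford) || (Cop, VW)
Each x-tuple lists its alternates; a ‘?’ marks an x-tuple as “maybe”.
Possible worlds:
  1. (Granny, VW), (Cop, Ford)
  2. (Granny, BMW), (Cop, Ford)
  3. (Granny, VW), (Cop, VW)
  4. (Granny, BMW), (Cop, VW)
Simple formalism
• not complete
• not closed under joins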
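The possible worlds listed on this slide can be enumerated mechanically: pick one alternate per x-tuple, and for a “maybe” x-tuple also allow picking nothing. The sketch below is a minimal illustration; the dictionary layout and the function name `possible_worlds` are assumptions made for the example.

```python
from itertools import product

# Two x-tuples, each with two alternates and no '?' annotation.
SAW = [
    {"alternates": [("Granny", "VW"), ("Granny", "BMW")], "maybe": False},
    {"alternates": [("Cop", "Ford"), ("Cop", "VW")], "maybe": False},
]


def possible_worlds(x_tuples):
    """Enumerate every relation instance represented by the x-tuples."""
    choices = [
        # None stands for "this x-tuple is absent from the world".
        xt["alternates"] + ([None] if xt["maybe"] else [])
        for xt in x_tuples
    ]
    return [
        [tup for tup in combination if tup is not None]
        for combination in product(*choices)
    ]


for world in possible_worlds(SAW):
    print(world)
```

With both x-tuples certain, this yields the four worlds above. Marking the second x-tuple as “maybe” would add two more worlds that contain only a Granny tuple.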
Lineage
SAW (Witness, Car)
  (Granny, VW)
  (Cop, Ford)
OWNS (Suspect, Car)
  (Chris, VW)
  (Chris, BMW)
  (Mike, VW)
  (Mike, Ford)
ACCUSES = π witness, suspect (SAW ⋈ OWNS)
ACCUSES (Witness, Suspect)
  (Granny, Chris)
  (Granny, Mike)
  (Cop, Mike)
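The ACCUSES relation of this slide can be recomputed while recording, for each result tuple, which SAW and OWNS tuples it was derived from. This is only a sketch of lineage tracking on certain data; the tuple identifiers (`s1`, `o1`, ...) and the function name `accuses` are invented for the example.

```python
# Base relations, keyed by hypothetical tuple identifiers.
SAW = {"s1": ("Granny", "VW"), "s2": ("Cop", "Ford")}
OWNS = {
    "o1": ("Chris", "VW"),
    "o2": ("Chris", "BMW"),
    "o3": ("Mike", "VW"),
    "o4": ("Mike", "Ford"),
}


def accuses(saw, owns):
    """Join SAW and OWNS on the car, project on (witness, suspect),
    and keep the identifiers of the contributing tuples as lineage."""
    result = []
    for saw_id, (witness, seen_car) in saw.items():
        for owns_id, (suspect, owned_car) in owns.items():
            if seen_car == owned_car:
                result.append(((witness, suspect), {saw_id, owns_id}))
    return result


for tup, lineage in accuses(SAW, OWNS):
    print(tup, "derived from", sorted(lineage))
```

Each result carries exactly one SAW identifier and one OWNS identifier, which is what later allows uncertainty in the base data to be propagated to the result.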
ULDB’s
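SAW (Witness, Car)
  (Granny, VW) || (Granny, BMW)
  (Cop, Ford)
OWNS (Suspect, Car)
  (Chris, VW)
  (Chris, BMW)
  (Mike, VW)
  (Mike, Ford)
ACCUSES (Witness, Suspect)
  (Granny, Chris)
  (Granny, Mike)
  (Cop, Mike)
  (Granny, Chris)
[Figure: ‘?’ annotations mark the uncertain x-tuples, and lineage links connect each ACCUSES alternate to the SAW and OWNS alternates it was derived from.]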
Properties
ULDB’s are simple
• x-tuples: set of alternate tuples, with or without ‘?’
• lineage: associates with each alternate a set of alternates / external symbols
ULDB’s are expressive
• Complete: can represent any finite set of possible worlds (with lineage)
• Simple implementation of monotonic queries, with correct lineages
• Natural probabilistic extension
ULDB’s are efficient
• Query processing can use existing query optimizers
• Tuple certainty/membership can be tested in polynomial time
Query processing
Querying ULDB’s
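[Figure: commuting diagram. At the ULDB level, the algorithm maps D to Q(D). At the level of relational databases (with lineage), D has possible worlds D1, D2, …, Dn, and the query semantics maps them to Q(D1), Q(D2), …, Q(Dn), which are the possible worlds of Q(D).]
Q(Di): add the query result, as a new relation with its lineage, to Di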
Algorithm
[Figure: animated walk-through of the algorithm on the SAW, OWNS, and ACCUSES (π witness, suspect) relations, including tuples for an additional witness, Kid, and ‘?’ annotations on the uncertain results.]
Properties
Efficient algorithm
• Query processing phase can use standard query optimizer
• Lineages are easy to propagate
• “Grouping” phase requires a single pass on the result
Initial prototype
• represents a ULDB as a relational DB
• uses simple query rewriting techniques
Algorithm works for any monotonic query
(including SPJU queries)
Probabilities
Probabilistic ULDB’s
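[Table: SAW (Witness, Car) with alternates (Granny, VW), (Granny, BMW), (Cop, Ford), and (Cop, VW), each annotated with a probability; the values 0.3, 0.2, 0.5, 0.3, and 0.7 and a ‘?’ appear in the figure.]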
Semantics: As before, with a probability for each possible world
Without lineages
• Alternates of the same x-tuple correspond to disjoint events
• Alternates of different x-tuples correspond to independent events
Lineages
• Capture correlations
• Help propagate probabilities for query results
Probabilistic query answering
Compute queries as before
Compute probabilities on demand
• Traverse lineages transitively to the leaves
• Combine probabilities of reached alternates
[Figure: lineage graph over alternates, annotated with ‘?’ marks and probabilities.]
Optimizations: memoize probabilities, efficiently detect ‘closest independent ancestors’
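The on-demand computation can be sketched as follows, under strong simplifying assumptions: lineage is purely conjunctive, and all base alternates reached belong to different x-tuples, so their events are independent and their probabilities multiply. The identifiers, the probability values, and the function names are hypothetical; memoization is done with `functools.lru_cache`.

```python
from functools import lru_cache
from math import prod

# Hypothetical probabilities of base alternates (from different x-tuples).
BASE_PROB = {"s1": 0.5, "o1": 1.0, "o3": 0.8}

# Hypothetical lineage: each derived alternate points to its sources.
LINEAGE = {
    "a1": ("s1", "o1"),
    "a2": ("s1", "o3"),
    "b1": ("a1", "a2"),  # derived from two derived alternates
}


@lru_cache(maxsize=None)
def base_alternates(alternate):
    """Follow lineage transitively down to the base alternates (leaves)."""
    if alternate not in LINEAGE:
        return frozenset([alternate])
    return frozenset().union(
        *(base_alternates(source) for source in LINEAGE[alternate])
    )


def probability(alternate):
    """Multiply the probabilities of the distinct leaves reached."""
    return prod(BASE_PROB[leaf] for leaf in base_alternates(alternate))


print(probability("a1"), probability("a2"), probability("b1"))
```

Collecting the leaves as a set matters for `b1`: both of its sources depend on `s1`, and multiplying the probabilities of `a1` and `a2` directly would count `s1` twice. This is the kind of correlation that lineage captures.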
Future work
Richer queries
• Duplicate elimination, difference, aggregation
• Supported through new kinds of lineages (e.g., disjunctive, negative)
• Querying the uncertainty and the lineage
More operations
• Updates (and their lineage), close to versioning
• “Uncertain operations”, e.g., entity resolution, inconsistency repairs
More optimization techniques
More theory
Conclusion
New “Bases” for new data
The database way
• Simple models
• Declarative languages
• Optimization techniques
… for new features of data
• Distribution and decentralization: Active XML
• Uncertainty and lineage: ULDB’s
There are more challenges
• Real-world side effects, semantic reasoning
and strong requirements
• security, privacy, personalization
Big challenge: Doing it all in a coherent way
• One “big” model?
• Integration of models?
Thank you