QUERY-BASED DATA PRICING
Paraschos Koutris
Prasang Upadhyaya
Magdalena Balazinska
Bill Howe
Dan Suciu
University of Washington
PODS 2012
MOTIVATION
• Data is increasingly sold and bought on the web
• Websites that sell data:
– AggData [www.aggdata.com]
– Xignite (financial data) [www.xignite.com]
– Gnip (social media) [www.gnip.com]
• Data marketplace services:
– Windows Azure Marketplace (100+ datasets) [datamarket.azure.com]
– Infochimps (15,000 datasets) [www.infochimps.com]
Query-based pricing customized for buyers
CURRENT PRICING (1)
• A fixed price for the whole dataset or for a specific set of views
• Example: CustomLists
– USA Business Database for $399
– Email addresses for $299
– Businesses in WA for $199
• Limitations:
– Restaurants in WA?
– Businesses in cities with population > 100,000?
CURRENT PRICING (2)
• API Subscriptions (Azure Marketplace, Infochimps)
– Allow queries over the data
– Pay by number of transactions (page of results)
ISSUES WITH PRICING
• Buyers today need to buy a superset of the data they are interested in
• Sellers can’t easily anticipate all possible queries that buyers might ask
• Solution: we need a more flexible pricing scheme, parameterized by queries
OUTLINE
1. The Pricing Framework
2. The Pricing Formula
3. The Complexity of Pricing
4. Dichotomy and Algorithms for Selections
THE PRICING FRAMEWORK
• The seller defines price points (view-price pairs):
S = { (V1,p1), (V2,p2), … }
• A buyer can buy any query Q
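• The system will compute price_D^S(Q)
[Diagram: the seller gives the price points (V1,p1), (V2,p2), … to the pricing system, which holds the database D; the buyer asks for Q(D) and the system returns price_D^S(Q)]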
INSTANCE-BASED DETERMINACY
Definition.
V = V1,…,Vk determine Q given D, denoted D ⊢ V ↠ Q, if:
for all D’, if V(D) = V(D’), then Q(D) = Q(D’)
Intuitively, “V1,…,Vk determine Q” means that Q(D) can be answered from V1(D),…,Vk(D) alone, without accessing the database instance D
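The definition can be checked by brute force on a very small example. The sketch below is only an illustration: it assumes a single relation whose possible tuples come from a tiny, explicitly listed universe, and it models views and queries as plain Python functions. The names `all_databases`, `determines`, `select_shape`, `whole`, and `any_white` are hypothetical and do not come from the talk.

```python
from itertools import chain, combinations

def all_databases(universe):
    """Yield every database D' that can be built from the given possible tuples."""
    items = list(universe)
    subsets = chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1))
    return (frozenset(s) for s in subsets)

def determines(views, query, db, universe):
    """Instance-based determinacy: every D' that agrees with D on all views
    must also agree with D on the query."""
    db = frozenset(db)
    for other in all_databases(universe):
        same_views = all(v(other) == v(db) for v in views)
        if same_views and query(other) != query(db):
            return False
    return True

def select_shape(shape):
    """Selection view: all tuples with the given shape."""
    return lambda db: frozenset(t for t in db if t[0] == shape)

# Assumed toy universe: (shape, color) pairs over two shapes and two colors.
universe = {(s, c) for s in ("Swan", "Dragon") for c in ("White", "Yellow")}
whole = lambda db: frozenset(db)                         # the entire relation
any_white = lambda db: any(c == "White" for _, c in db)  # a Boolean query

d1 = {("Swan", "White"), ("Dragon", "Yellow")}
d2 = {("Dragon", "Yellow")}

print(determines([select_shape("Swan"), select_shape("Dragon")], whole, d1, universe))
print(determines([select_shape("Swan")], whole, d1, universe))
print(determines([select_shape("Swan")], any_white, d1, universe))
print(determines([select_shape("Swan")], any_white, d2, universe))
```

The last two calls use the same view and the same query but different instances, and they give different answers. That is the sense in which this notion of determinacy depends on D.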
ARBITRAGE-FREE
Axiom 1.
Given D, the pricing function price_D(Q) is arbitrage-free if for all views V1, …, Vk and every query Q where D ⊢ V1, …, Vk ↠ Q:
price_D(Q) ≤ price_D(V1) + … + price_D(Vk)
Suppose V determines Q and price_D(Q) > price_D(V). Then, we can
1. buy V(D) for price_D(V)
2. compute Q(D) from V(D)
3. now we have answered Q at price p = price_D(V) < price_D(Q)
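A minimal sketch of the axiom as a check over a price list. It assumes the determinacy facts are already known and passed in as (views, query) pairs. The function name `arbitrage_violations`, the view names, and the prices are illustrative assumptions, not part of the talk.

```python
def arbitrage_violations(prices, determinacy_facts):
    """Return every (views, query, bundle_price) where the query costs more
    than the bundle of views that determines it."""
    violations = []
    for views, query in determinacy_facts:
        bundle = sum(prices[v] for v in views)
        if prices[query] > bundle:
            # A buyer could get Q more cheaply by buying the views instead.
            violations.append((views, query, bundle))
    return violations

# Assumed fact: four views together determine Q.
facts = [(("V1", "V2", "V3", "V4"), "Q")]

overpriced = {"V1": 2, "V2": 2, "V3": 2, "V4": 2, "Q": 10}
fair = {"V1": 2, "V2": 2, "V3": 2, "V4": 2, "Q": 8}

print(arbitrage_violations(overpriced, facts))
print(arbitrage_violations(fair, facts))
```

An empty result means only that the listed facts are respected. It says nothing about determinacy relationships that were not supplied.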
DISCOUNT-FREE
Axiom 2.
The pricing function price_D(Q) should not offer any additional discounts beyond the explicit price points defined by the seller.
• The intuition is that the price points represent discounts that the seller offers relative to the price of the whole database
• A pricing function is discount-free if it is maximal: no arbitrage-free pricing function that respects the price points charges more for any query
EXAMPLE: ORIGAMI DATABASE
Database S
Shape    Color    Picture
Swan     White    . . . . .
Swan     Yellow   . . . . .
Dragon   Yellow   . . . . .
Car      Yellow   . . . . .
Fish     White    . . . . .

Price points
View                                  Price
V1(x,y,z) :- S(x,y,z), x=‘Swan’       $2
V2(x,y,z) :- S(x,y,z), x=‘Dragon’     $2   (get all dragon origami for $2)
V3(x,y,z) :- S(x,y,z), x=‘Car’        $2
V4(x,y,z) :- S(x,y,z), x=‘Fish’       $2
W1(x,y,z) :- S(x,y,z), y=‘White’      $3
W2(x,y,z) :- S(x,y,z), y=‘Yellow’     $3
W3(x,y,z) :- S(x,y,z), y=‘Red’        $3   (get all red origami for $3)

What is the price of the entire database?
Q(x,y,z) :- S(x,y,z)
• V1, V2, V3, V4 determine Q: price(Q) ≤ $8
• W1, W2, W3 determine Q: price(Q) ≤ $9
  (the views exhaust the active domain)
• price(Q) = $8
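The $8 figure can be reproduced with a small brute-force sketch. It rests on two assumptions that the slide only hints at: the Shape column ranges over exactly {Swan, Dragon, Car, Fish} and the Color column over exactly {White, Yellow, Red}, and a set of selection views determines a query exactly when every (shape, color) combination the query could return is covered by a purchased shape view or color view. The names `covers` and `price` are hypothetical.

```python
from itertools import combinations, product

SHAPES = ["Swan", "Dragon", "Car", "Fish"]   # assumed domain of Shape
COLORS = ["White", "Yellow", "Red"]          # assumed domain of Color

# Price points from the slide: $2 per shape selection, $3 per color selection.
PRICE_POINTS = {("shape", s): 2 for s in SHAPES}
PRICE_POINTS.update({("color", c): 3 for c in COLORS})

def covers(bought, cells):
    """Every (shape, color) cell is revealed by some purchased selection."""
    return all(("shape", s) in bought or ("color", c) in bought
               for s, c in cells)

def price(cells):
    """Cheapest set of price points covering the cells (exhaustive search)."""
    cells = list(cells)
    views = list(PRICE_POINTS)
    best = None
    for r in range(len(views) + 1):
        for subset in combinations(views, r):
            if covers(set(subset), cells):
                cost = sum(PRICE_POINTS[v] for v in subset)
                if best is None or cost < best:
                    best = cost
    return best

whole_database = list(product(SHAPES, COLORS))
dragons = [("Dragon", c) for c in COLORS]

print(price(whole_database))  # 8: all four shape views beat all three color views
print(price(dragons))         # 2: the single Dragon view
```

Under these assumptions a mix of shape and color views never helps for the whole database: as soon as one shape is left unbought, every color has to be bought.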
EXAMPLE: ORIGAMI DATABASE
R
Shape    Instructions
Swan     fold, cut, fold…
Dragon   cut, fold, cut,…

S
Shape    Color    Picture
Swan     White    . . . . .
Swan     Yellow   . . . . .
Dragon   Yellow   . . . . .
Car      Yellow   . . . . .
Fish     White    . . . . .

T
Color    PaperSpecs
White    15g/100, $10
Black    20g/100, $15

Price points shown next to the tables: p(σshape)=$99, p(σshape)=$2, p(σcolor)=$50, p(σcolor)=$5
What is the price of the full join?
Q(x,y,z,u,v) :- R(x,u), S(x,y,z), T(y,v)
OUTLINE
1. The Pricing Framework
2. The Pricing Formula
3. The Complexity of Pricing
4. Dichotomy and Algorithms for Selections
THE QUERY PRICING FORMULA
Given:
1. Price points S = {(V1,p1),…,(Vk, pk)}
2. Database instance D
3. Query Q.
Compute: price_D^S(Q)
Properties: (a) arbitrage-free, (b) discount-free, (c) price_D^S(Vi) = pi
If such a pricing function exists, we say that the price points are consistent
Method:
• Consider all subsets of V = {V1,…,Vk} that determine Q
• Let C be the subset with the minimum total price
• Define p_D(Q) = Σ pi over all Vi in C
Theorem.
(a) The price points are consistent iff p_D(Vi) = pi for every price point i=1,…,k
(b) price_D^S(Q) = p_D(Q) is the unique arbitrage-free, discount-free pricing function that agrees with the price points
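The method and part (a) of the theorem can be sketched as an exhaustive search with a pluggable determinacy test. Everything about the test used here is an assumption for illustration: each view is modeled as the set of "cells" it reveals, and a set of views determines a query when it reveals all of the query's cells. The names `min_price`, `is_consistent`, `CELLS`, and `determines` are hypothetical.

```python
from itertools import combinations

def min_price(price_points, determines, query):
    """p_D(Q): cheapest total price of a subset of views that determines Q,
    or None if no subset does."""
    names = list(price_points)
    best = None
    for r in range(len(names) + 1):
        for subset in combinations(names, r):
            if determines(frozenset(subset), query):
                cost = sum(price_points[v] for v in subset)
                if best is None or cost < best:
                    best = cost
    return best

def is_consistent(price_points, determines):
    """Theorem (a): consistent iff p_D(Vi) = pi for every price point."""
    return all(min_price(price_points, determines, v) == p
               for v, p in price_points.items())

# Toy determinacy oracle (an assumption): views reveal sets of cells.
CELLS = {"V1": {1, 2}, "V2": {3, 4}, "ALL": {1, 2, 3, 4}}

def determines(bought, query):
    covered = set().union(*(CELLS[v] for v in bought))
    return CELLS[query] <= covered

print(is_consistent({"V1": 2, "V2": 2, "ALL": 3}, determines))   # True
print(is_consistent({"V1": 2, "V2": 2, "ALL": 10}, determines))  # False
print(is_consistent({"V1": 2, "V2": 2, "ALL": 1}, determines))   # False
```

In the second price list, ALL is listed at 10 but V1 and V2 together determine it for 4. In the third, ALL at 1 undercuts the listed price of V1. The search visits all 2^k subsets, so it is only usable for a handful of price points.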
DISCUSSION
• If the result of Q1 is always a subset of the result of Q2, should Q1 be priced less than Q2? No!
Example:
– V(x,y) :- Fortune500(x,y)
– Q(x,y) :- Fortune500(x,y), StrongBuyRec(x)
– price(Q) >> price(V), even though Q(D) ⊆ V(D): V does not determine Q
• We ignore computation costs in our framework
– Cost of computing query Q
– Q(D)=f(V(D)), but f can be hard to compute
OUTLINE
1. The Pricing Framework
2. The Pricing Formula
3. The Complexity of Pricing
4. Dichotomy and Algorithms for Selections
DETERMINACY
Definition. [Instance-dependent]
V determines Q given D, denoted as D ⊢ V ↠ Q, if:
forall D’, if V(D’) = V(D), then Q(D) = Q(D’)
[Nash, Segoufin, Vianu ‘07]
Definition. [Instance-independent]
V determines Q, denoted as V ↠ Q, if:
forall D, D’, if V(D) = V(D’), then Q(D) = Q(D’)
V ↠ Q iff there exists a function f
such that Q(D) = f(V(D)) for all D
iff for every D, we have that D ⊢ V ↠ Q
COMPLEXITY OF DETERMINACY
                                   V, Q are UCQ                  V, Q are CQ
Instance-independent (V ↠ Q)       Undecidable [NSV ’07]         ?
Instance-dependent (D ⊢ V ↠ Q)
  data complexity                  coNP-complete [this paper]    coNP-complete [this paper]
  combined complexity              Π_2^p [this paper]            Π_2^p [this paper]
Open Question: is the bound on the combined
complexity tight?
COMPLEXITY OF PRICING
Corollary.
Deciding whether price_D^S(Q) ≤ k is:
• Combined complexity [input S, D]: Σ_2^p
• Data complexity [input D]: coNP-hard
Proposition.
Pricing is at least as hard as determinacy
How do we deal with the hardness of computation?
OUTLINE
1. The Pricing Framework
2. The Pricing Formula
3. The Complexity of Pricing
4. Dichotomy and Algorithms for Selections
RESTRICTING PRICE POINTS TO SELECTIONS
• A seller can specify only the prices of selection queries of the form σR.X=a: prices on columns
• The domain of each column is finite and known to buyers and sellers
• Price points on selections are how prices are set in most cases today
DICHOTOMY THEOREM
Theorem.
Assuming selection views only, for any Conjunctive
Query w/o self-joins Q, one of the following holds
(data complexity):
(a) computing price_Q^S(D) is in PTIME
(b) checking whether price_Q^S(D) ≤ k is NP-complete
• PTIME:
– Q(x,y,z,u,v) :- R(x,u),S(x,y,z),T(y,v) [Chains]
– Q(x1,…,xk) :- R1(x1,x2),…,Rk(xk,x1) [Cycles]
• NP-complete:
– Q(x) :- R(x,y) [Projections]
– Q(x,y,z) :- R(x,y,z),S(x),T(y),U(z)
ALGORITHM FOR PTIME CASES
• The algorithm uses a reduction to maximum flow
• Edges of finite capacity represent price points
• A set of finite-capacity edges is a cut iff the corresponding views determine the query, so the price is the capacity of a minimum cut
• Example:
– Chain query Q(x,y) :- R(x), S(x,y), T(y)
R        S            T
X        X    Y       Y
a1       a1   b1      b1
a2       a2   b2      b3
         a2   b2
         a3   b2
         a4   b1
Dom(X) = {a1,a2,a3,a4}
Dom(Y) = {b1,b2,b3}
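As a cross-check on this instance, the price can also be found by exhaustive search. This is not the max-flow reduction from the talk. The sketch makes three assumptions: every selection view costs 1 (the slide gives no prices), a purchased selection σ on a column value reveals exactly which tuples with that value are present, and the purchased views determine the chain query when each answer has all three of its tuples revealed and each non-answer has at least one missing tuple revealed. The slide lists the row (a2, b2) twice; the set below keeps one copy. The names `determines_chain` and `chain_price` are hypothetical.

```python
from itertools import combinations, product

DOM_X = ["a1", "a2", "a3", "a4"]
DOM_Y = ["b1", "b2", "b3"]
R = {"a1", "a2"}
S = {("a1", "b1"), ("a2", "b2"), ("a3", "b2"), ("a4", "b1")}
T = {"b1", "b3"}

# One selection view per column value; unit prices are an assumption.
VIEWS = ([("R.X", a) for a in DOM_X] + [("S.X", a) for a in DOM_X]
         + [("S.Y", b) for b in DOM_Y] + [("T.Y", b) for b in DOM_Y])
PRICES = {v: 1 for v in VIEWS}

def determines_chain(bought):
    """Check the coverage condition for Q(x,y) :- R(x), S(x,y), T(y)."""
    for a, b in product(DOM_X, DOM_Y):
        # (tuple is present in D, tuple is revealed by a purchased view)
        atoms = [
            (a in R, ("R.X", a) in bought),
            ((a, b) in S, ("S.X", a) in bought or ("S.Y", b) in bought),
            (b in T, ("T.Y", b) in bought),
        ]
        if all(present for present, _ in atoms):
            ok = all(revealed for _, revealed in atoms)   # answer must be certain
        else:
            ok = any(revealed for present, revealed in atoms if not present)
        if not ok:
            return False
    return True

def chain_price():
    """Cheapest determining set of views, by trying all 2^14 subsets."""
    best = None
    for r in range(len(VIEWS) + 1):
        for subset in combinations(VIEWS, r):
            if determines_chain(set(subset)):
                cost = sum(PRICES[v] for v in subset)
                if best is None or cost < best:
                    best = cost
    return best

print(chain_price())  # 6 with unit prices
```

One cheapest set under these assumptions is σR.X=a1, σR.X=a4, σT.Y=b1, σT.Y=b2, σS.Y=b1, σS.Y=b3. The search is exponential in the number of views, which is the cost the max-flow reduction avoids.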
FLOW GRAPH
[Figure: the tables R(X), S(X,Y), T(Y) of the chain-query example next to the flow graph built from them, with nodes labeled by the values a1,…,a4 of X and b1,…,b3 of Y]
A set of edges of finite cost is a cut iff they determine the query
CONCLUSIONS
• Summary:
– The seller sets prices for some views, while the system computes the price of any query
– Interesting application of query determinacy
– Complexity: dichotomy for CQs w/o self-joins
• Future Work:
– Pricing in the presence of updates
– How do we overcome the hardness of pricing for intractable queries?
– Connection of pricing and privacy
Thank you!