QUERY-BASED DATA PRICING Paraschos Koutris Prasang Upadhyaya Magdalena Balazinska Bill Howe Dan Suciu University of Washington PODS 2012 MOTIVATION • Data is increasingly sold and bought on the web • Websites that sell data: – AggData [www.aggdata.com] – Xignite (financial data) [www.xignite.com] – Gnip (social media) [www.gnip.com] • Data marketplace services: – Windows Azure Marketplace (100+ datasets) [datamarket.azure.com] – Infochimps (15,000 datasets) [www.infochimps.com] Query-based pricing customized for buyers 2 CURRENT PRICING (1) • A fixed price for the whole dataset or for a specific set of views • Example: CustomLists – USA Business Database for $399 – Email addresses for $299 – Businesses in WA for $199 • Limitations: – Restaurants in WA ? – Businesses in cities with population >100,000 ? 3 CURRENT PRICING (2) • API Subscriptions (Azure Marketplace, Infochimps) – Allow queries over the data – Pay by number of transactions (page of results) 4 ISSUES WITH PRICING • Buyers today need to buy a superset of the data they are interested in • Sellers can’t easily anticipate all possible queries that buyers might ask • Solution: we need a more flexible pricing scheme, parameterized by queries 5 OUTLINE 1. The Pricing Framework 2. The Pricing Formula 3. The Complexity of Pricing 4. Dichotomy and Algorithms for Selections 6 THE PRICING FRAMEWORK • The seller defines price points (view-price pairs): S = { (V1,p1), (V2,p2), … } • A buyer can buy any query Q • The system will compute priceDS(Q) Buyer Q(D) ? Seller V1,p1 V2,p2 … Pricing System + Database D priceDS(Q) 7 INSTANCE-BASED DETERMINACY Definition. V = V1,…,Vk determine Q given D, denoted D ⊢ V ↠ Q, if: forall D’, if V(D) = V(D’), then Q(D) = Q(D’) Intuitively, “V1,…, Vk determine Q” means that Q(D) can be answered only from V1(D),…,Vk(D), without accessing the database instance D 8 ARBITRAGE-FREE Axiom 1. Given D, the pricing function priceD(Q) is arbitragefree if for all views V1, …, Vk and query Q where D ⊢ V1, …, Vk ↠ Q: priceD(Q) ≤ priceD(V1) + … + priceD(Vk) Suppose V determines Q and priceD(Q) > priceD(V). Then, we can 1. buy V(D) for priceD(V) 2. compute Q(D) from V(D) 3. now we have answered Q at some price p<priceD(Q) 9 DISCOUNT-FREE Axiom 2. The pricing function priceD(Q) should not offer any other additional discounts except for the explicit price points defined by the seller. • The intuition is that the price points represent discounts that the seller offers relative to the price of the whole database • A pricing function is discount-free if it is maximal 10 EXAMPLE: ORIGAMI DATABASE 11 EXAMPLE: ORIGAMI DATABASE Database S Shape Color Price points Swan White . . . . . Swan Yellow . . . . . Dragon Car Fish Yellow Yellow White View Price V1(x,y,z) :- S(x,y,z), x=‘Swan’ $2 V2(x,y,z) :- S(x,y,z), x=‘Dragon’ $2 V3(x,y,z) :- S(x,y,z), x=‘Car’ $2 V4(x,y,z) :- S(x,y,z), x=‘Fish’ $2 Picture Get all dragon origami for $2 . . . . . . . . . . . . . . . W1(x,y,z) :- S(x,y,z), y=‘White’ $3 W2(x,y,z) :- S(x,y,z), y=‘Yellow’ $3 W3(x,y,z) :- S(x,y,z), y=‘Red’ $3 What is the price of the entire database? Q(x,y,z) :- S(x,y,z) V1, V2, V3, V4 determine Q: price(Q) ≤ $8 W1, W2, W3 determine Q: price(Q) ≤ $9 Get all red origami for $3 Exhausts the active domain price(Q)=$8 12 R EXAMPLE: ORIGAMI DATABASE Shape Instructions Shape Color Picture Color PaperSpecs Swan fold, cut, fold… Swan White . . . . . White 15g/100, $10 Dragon cut, fold, cut,… Swan Yellow . . . . . Black 20g/100, $15 Dragon Yellow . . . . . p(σshape)=$99 S Car Yellow . . . . . Fish White . . . . . p(σshape)=$2 T p(σcolor)=$50 p(σcolor)=$5 What is the price of the full join? Q(x,y,z,u,v) :- R(x,u), S(x,y,z), T(y,v) 13 OUTLINE 1. The Pricing Framework 2. The Pricing Formula 3. The Complexity of Pricing 4. Dichotomy and Algorithms for Selections 14 THE QUERY PRICING FORMULA Given: 1. Price points S = {(V1,p1),…,(Vk, pk)} 2. Database instance D 3. Query Q. Compute: priceDS(Q) Properties: (a) arbitrage-free, (b) discount-free, (c) priceDS(Vi)=pi If it exists, we say that the price points are consistent Method: • Consider all subsets of V ={V1,…,Vk} that determine Q • Let C be the subset with the minimum price, Σi pi, for Vi in C • Define pD(Q) = Σi pi Theorem. (a)The price points are consistent iff pD(Vi)=pi for any price point i=1,…,k (b) priceDS(Q) = pD(Q) is the unique arbitrage-free, discount-free pricing function that agrees with the price points 15 DISCUSSION • If the result of Q1 is always a subset of Q2, should Q1 be priced less than Q2? No! Example: – V(x,y) :- Fortune500(x,y) Q(x,y) :- Fortune500(x,y), StrongBuyRec(x) – price(Q) >> price(V) • We ignore computation costs in our framework – Cost of computing query Q – Q(D)=f(V(D)), but f can be hard to compute 16 OUTLINE 1. The Pricing Framework 2. The Pricing Formula 3. The Complexity of Pricing 4. Dichotomy and Algorithms for Selections 17 DETERMINACY Definition. [Instance-dependent] V determines Q given D, denoted as D ⊢ V ↠ Q, if: forall D’, if V(D’) = V(D), then Q(D) = Q(D’) [Nash, Segoufin, Vianu ‘07] Definition. [Instance-independent] V determines Q, denoted as V ↠ Q, if: forall D, D’, if V(D) = V(D’), then Q(D) = Q(D’) V ↠ Q iff there exists a function f such that Q(D) = f(V(D)) for all D iff for every D, we have that D ⊢ V ↠ Q 18 COMPLEXITY OF DETERMINACY V, Q are UCQ V, Q are CQ Undecidable [NSV ’07] ? data coNP-complete [this paper] coNP-complete [this paper] combined Π 2P [this paper] Π 2P [this paper] Instance-independent V↠Q Instancedependent D⊢V↠Q Open Question: is the bound on the combined complexity tight? 19 COMPLEXITY OF PRICING Corollary. Deciding whether priceDS(Q) ≤ k is: • Combined complexity [input S, D]: Σp2 • Data complexity [input D]: coNP-hard Proposition. Pricing is at least as hard as determinacy How do we deal with the hardness of computation? 20 OUTLINE 1. The Pricing Framework 2. The Pricing Formula 3. The Complexity of Pricing 4. Dichotomy and Algorithms for Selections 21 RESTRICTING PRICE POINTS TO SELECTIONS • A seller can specify only the prices of selection queries of the form σR.X=a: prices on columns • The domain of each column is finite and known to buyers and sellers • Price points on selections is how prices are set in most cases today 22 DICHOTOMY THEOREM Theorem. Assuming selection views only, for any Conjunctive Query w/o self-joins Q, one of the following holds (data complexity): (a) priceQS(D) is in PTIME (b) checking whether priceQS(D)≤k is NP-complete • PTIME: – Q(x,y,z,u,v) :- R(x,u),S(x,y,z),T(y,v) [Chains] – Q(x1,…,xk) :- R1(x1,x2),…,Rk(xk,x1) [Cycles] • NP-complete: – Q(x) :- R(x,y) [Projections] – Q(x,y,z) :- R(x,y,z),S(x),T(y),U(z) 23 ALGORITHM FOR PTIME CASES • The algorithm uses a reduction to maximum flow • Edges of finite capacity represent price points • A set of edges of finite cost is a cut iff they determine the query • Example: – Chain query Q(x,y):-R(x),S(x,y),T(y) S R X a1 a2 X Y a1 b1 a2 b2 a2 b2 a3 b2 a4 b1 Y b1 b3 T Dom(X) = {a1,a2,a3,a4} Dom(Y) = {b1,b2,b3} 24 FLOW GRAPH R S X Y a1 b1 Y a1 a2 b2 b1 a2 a2 b2 b3 a3 b2 a4 b1 X T R T a4 b1 a3 b2 a2 b3 a1 a4 a3 A set of edges of finite cost is a cut iff they determine the query a2 a1 b1 b2 b3 S 25 CONCLUSIONS • Summary: – The seller sets prices to some views, while the system computes the price of any query – Interesting application of query determinacy – Complexity: dichotomy for CQs w/o self-joins • Future Work: – Pricing in the presence of updates – How do we overcome pricing for intractable queries? – Connection of pricing and privacy 26 Thank you ! 27