TOWARD PRACTICAL QUERY PRICING WITH QUERYMARKET Paraschos Koutris Prasang Upadhyaya Magdalena Balazinska Bill Howe Dan Suciu University of Washington SIGMOD 2013 MOTIVATION • Data is increasingly sold and bought on the web • Websites that sell data: – Xignite (financial) – Gnip (social) • Data marketplace services: – – – – Windows Azure Marketplace Infochimps Factual DataMarket 2 A PRICING SCENARIO (1) English-German dictionary T english german thanks Danke car Auto day Tag road Strasse Road Weg … … PRICING SCHEMES Sell the whole table T for a fixed price • Q: translate only the word “thanks” • The user pays for redundant information Price per output tuple • Q: Does the word “thanks” translate to “Auto” ? • An empty result still carries information 3 A PRICING SCENARIO (2) English-German dictionary T english german thanks Danke car Auto day Tag road Strasse Road Weg … … Word Frequency Stats UF word frequency genre rank rock 0.025 music 20 pop 0.030 music 10 database 0.001 science 1453 … … .. … Q1: Return all translations to German of top 10 words in the genre “music” Q2: Return all translations to German of top 20 words in the genre “music” • Current systems do not sell queries that combine datasets • Queries issued by a user may have overlapping content 4 HOW TO PRICE DATA English-German dictionary T english german p(σT.english=‘thanks’)=$0.1 thanks Danke p(σT.english=‘car’)=$0.1 car Auto p(σT.english=‘day’)=$0.1 day Tag road Strasse p(σT.english=‘road’)=$0.15 road Weg p(σT.english=‘cat’)=$0.05 … … p(σT.german=‘Auto’)=$0.5 … Price points • selection queries on single table • exhaust the possible values (ColA) of some attribute A • may select on values not in the active domain 5 QUERYMARKET: CONTRIBUTIONS • A formal pricing framework where: – sellers specify a set of price points as selection queries – buyers can purchase any query on the database – the system automatically computes the price of the query • Support efficient computation of prices for a large class of SQL queries • Support the necessary functionality for a marketplace: – Pricing queries with overlapping information content – Database updates – Revenue sharing among different sellers? 6 OUTLINE 1. 2. 3. 4. The Pricing Framework Computing the Price Query History Revenue Sharing 7 THE PRICING FRAMEWORK [Koutris et al., PODS 2012] • The seller defines price points (view-price pairs): S = { (V1,p1), (V2,p2), … } • A buyer can buy any query Q • The system will compute priceDS(Q) Buyer Q(D) ? Seller Price points Pricing System + Database D priceDS(Q) 8 PROPERTIES OF PRICES We say that the views V1,…, Vk determine Q if one can compute Q(D) from V1(D),…, Vk(D) without access to D Arbitrage-free: Given D, priceD(Q) is arbitrage-free if for all views V1, …, Vk that determine Q: priceD(Q) ≤ priceD(V1) + … + priceD(Vk) Discount-free: priceD(Q) must not offer additional discounts except for the explicit price points defined by the seller 9 THE PRICING FORMULA Arbitrage-Price: • The price of the cheapest set of views from price points S that determine the query Q • unique + arbitrage-free + discount-free + agrees with price points Table R price = $1 A a1 Table S price = $2 A B a1 b a2 b ColA = { a1, a2, a3 } ColB = { b } Q(y) = R(x),S(x,y) price = $3 • {σ[R.A=a1], σ[S.B=b] } determines Q • cost = 1 + 3 = 4 • {σ[R.A=a1], σ[S.A=a1] } also determines Q • cost = 1 + 2 = 3 (cheapest possible) 10 OUTLINE 1. 2. 3. 4. The Pricing Framework Computing the Price Query History Revenue Sharing 11 COMPUTING THE PRICE • The problem of computing the arbitrage price even for SELECT-PROJECT-JOIN queries is coNP-complete • For some queries, the price can be computed fast: • Selections, joins w/o projection • We describe pricing as an Integer Linear Program (ILP) and then use fast ILP solvers (e.g. GLPK, CPLEX) • Classes of queries supported: • • • • Selections/Projections/Joins Unions User-Defined Functions (UDF) Bundles of queries 12 ILP CONSTRUCTION (1) Table R A Table S a1 price = $1 price = $2 A B a1 b a2 b ColA = { a1, a2, a3 } ColB = { b } price = $3 • Price the query Q(x,y) = R(x), S(x,y) • Introduce a {0/1} variable x[attribute,value] for each price point: x[R.A, a2], x[S.A, a1], x[S.B, b], … 13 ILP CONSTRUCTION (2) Table R A Table S a1 A B a1 b a2 b ColA = { a1, a2, a3 } ColB = { b } Q(x,y) = R(x), S(x,y) • Minimize (independent of the query): price = x[R.A,a1] + x[R.A,a2] + x[R.A,a3] +2x[S.A,a1] + 2x[S.A,a2] + 2x[S.A,a3] +3x[S.B,b] • Constraints: • (a1,b) in Q: x[R.A,a1] ≥1 x[S.A,a1] + x[S.B,b] ≥ 1 • (a2,b) not in Q: x[R.A,a2] ≥1 • (a3,b) not in Q: x[R.A,a3] + x[S.A,a3] + x[S.B,b] ≥ 1 14 ILP CONSTRUCTION (3) Table R Table S A B a1 a1 b a2 a2 b A ColA = { a1, a2, a3} ColB = { b} New variable for each tuple in Qfull • Projection: Q(y) = R(x), S(x,y) • Constraints: • (a1,b) in Qfull: x[R.A,a1] • (a2,b) in Qfull: x[R.A,a2] • (b) in Q : ≥ z1 x[S.A,a1] + x[S.B,b] ≥ z1 ≥ z2 x[S.A,a2] + x[S.B,b] ≥ z2 z 1 + z2 ≥ 1 15 QUERYMARKET SYSTEM • Runs on top of any SQL database • Information stored in the database: – Price points are stored in the database in price tables – Keeping track of price tables with an index table • The dataset: – – – – English-german translation: English-french translation : UDF to find hashtags : Word frequency stats : Ten,gr(w, w’) Ten,fr(w, w’) IsHashtag(w) WF(w, genre, frequency, rank) 16 PRICE COMPUTATION (1) • Small dataset where columns have size ~ 102 10 ILP solving time ILP construction time Time in seconds 8 6 4 2 0 1 2 3 4 selections 5 6 7 8 Query 9 10 11 12 13 14 15 3-way join 2-way joins 2-way joins w/o projections with projections 17 PRICE COMPUTATION (2) • Larger dataset where columns have size ~ 103 80 ILP solving time 70 ILP construction time Time in seconds 60 50 40 30 20 10 0 1 2 3 4 selections 5 6 7 8 Query 9 10 11 12 13 14 15 3-way join 2-way joins 2-way joins w/o projections with projections 18 OUTLINE 1. 2. 3. 4. The Pricing Framework Computing the Price Query History Revenue Sharing 19 QUERY HISTORY A user asks a sequence of queries over time of varying information overlap Q = Q1, Q2, …, Qk Experiment with 30 selection/join queries • • Oblivious pricing: each query priced independently 18 16 price in dollars 14 High Overlap 12 Bundle pricing: each query Qi priced p(Q1,…,Qi)- p(Q1,…,Qi-1) 10 8 6 4 2 0 1 2 3 4 5 6 7 8 9 101112131415161718192021222324252627282930 query View pricing: when a query is purchased, the purchased views are free for later queries 20 QUERY HISTORY (2) 25 price in dollars 20 Moderate Overlap Oblivious pricing 15 View pricing 10 Bundle pricing 5 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 16 14 query Weak Overlap price in dollars 12 10 Oblivious pricing 8 Bundle pricing 6 View pricing 4 2 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 query 21 VIEW PRICING • View Pricing is our proposed strategy: – – – • Computationally efficient Low storage overhead Close to optimal (bundle) price View Pricing can be used for dynamic databases: if view V is purchased at some point and then updated, the user pays only an update price 22 OUTLINE 1. 2. 3. 4. The Pricing Framework Computing the Price Query History Revenue Sharing 23 REVENUE SHARING • How is the revenue shared between sellers if several datasets contribute to the answer? • What if the cheapest set of views to determine a query is not unique ? • Example: – Q(‘sigmod13’) = isHashtag(‘sigmod13’), isNoun(‘sigmod13’) – Seller 1 prices $1 per entry for isHashtag, so does seller 2 – If both isHashtag, isNoun are false and each costs $1, purchasing either of the entries answers Q 24 REVENUE SHARING: SOLUTION • For a seller s, share(s, Q) is the maximum revenue of s over all minimum-cost set of price points that determine Q • share(s, Q) can be computed in our framework • Solution: split price(Q) among sellers proportionally to their shares • Example: – Both shares are $1 – The revenue of each seller will be $0.5, since their shares are equal 25 CONCLUSIONS • QueryMarket: the first system that supports pricing a large class of SQL queries within a formal framework • We presented solutions to address the requirements of a real-world marketplace • Future work includes: – Scaling the price computation (bucketization) – Full SQL Support (aggregates, negation) – Query answering under limited budget 26 Thank you ! 27