BigBench Discussion
Tilmann Rabl, msrg.org, UofT
Third Workshop on Big Data Benchmarking, Xi’an
July 16, 2013

Introduction

BigBench
• End-to-end benchmark
• Application level
• Based on a product retailer (TPC-DS)
• Focus on
  • Parallel DBMS
  • MR engines
• History
  • Launched at 1st WBDB, San Jose
  • Published at SIGMOD 2013
  • Full spec in WBDB 2012 proceedings (in progress)
• Collaboration with industry & academia
  • Teradata, University of Toronto, InfoSizing, Oracle

Outline
• Data Model
  • Variety, Volume, Velocity
• Workload
  • Main driver: retail big data analytics
  • Covers: data sources, declarative & procedural processing, and machine learning algorithms
• Metrics
• Discussion

Data Model I
• 3 parts
  • Structured: TPC-DS + market prices
  • Semi-structured: website click-stream
  • Unstructured: customers’ reviews

Data Model II
• Variety
  • Different schema parts
• Volume
  • Based on scale factor
  • Similar to TPC-DS scaling
  • Weblogs & product reviews also scaled
• Velocity
  • Periodic refreshes for all data
  • Different velocity for different areas
  • V_structured < V_unstructured < V_semi-structured
  • Queries run concurrently with refreshes

Workload
• Workload queries
  • 30 queries
  • Specified in English
  • No required syntax
• Business functions (adapted from McKinsey+)
  • Marketing
    • Cross-selling, customer micro-segmentation, sentiment analysis, enhancing multichannel consumer experiences
  • Merchandising
    • Assortment optimization, pricing optimization
  • Operations
    • Performance transparency, product return analysis
  • Supply chain
    • Inventory management
  • Reporting (customers and products)

SQL-MR Query 1
SELECT category_cd1 AS category1_cd,
       category_cd2 AS category2_cd,
       COUNT(*) AS cnt
FROM basket_generator(
       ON (SELECT i.i_category_id       AS category_cd,
                  s.ws_bill_customer_sk AS customer_id
           FROM web_sales s
           INNER JOIN item i ON s.ws_item_sk = i.i_item_sk)
       PARTITION BY customer_id
       BASKET_ITEM('category_cd')
       ITEM_SET_MAX(500))
GROUP BY 1, 2
ORDER BY 1, 3, 2;

Workload – Technical Aspects

Data source        Number of queries   Percentage
Structured         18                  60%
Semi-structured    7                   23%
Un-structured      5                   17%

Analytic technique    Number of queries   Percentage
Statistics analysis   6                   20%
Data mining           17                  57%
Reporting             8                   27%

Metrics
• Discussion topic
• Initial thoughts
  • Focus on loading and type of processing
  • MR engines good at loading
  • DBMS good at SQL
  • Metric = (T_D * T_P * T_B * T_L)^(1/4)
    • T_D: execution time for declarative queries
    • T_P: execution time for procedural queries
    • T_B: execution time for queries using both
    • T_L: execution time for loading
• Applicable to single- and multi-stream execution

Discussion Topics
• What did we miss?
  • Extending/completing BigBench
  • How to go beyond MapReduce?
    • E.g., add a social/graph component (e.g., LinkBench)
    • NoSQL, cube model, recommendations
  • What kinds of queries and schema extensions are necessary?
• What has to be measured?
  • Metrics and properties
  • Price/performance, (energy) efficiency
• What properties does the system have to fulfill?
• Any other concerns?

Discussion – Organization
• Break-out groups (2)
  • Discussion leader
  • Scribe: prepare a 5-minute presentation
• Student groups (2)
  • Find an English-speaking member (presenter)
  • Prepare 1+ slide on each question
• Schedule
  • 14:15 – 15:15 Discussion 1
  • 15 min break
  • 15:30 – 16:15 Discussion 2
  • 16:15 – 17:00 Plenary
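
Metric Computation (Illustrative Sketch)
A minimal Python sketch of how the proposed metric could be computed: the fourth root of the product of the four measured execution times, i.e., their geometric mean. The function name, variable names, and example values are assumptions for illustration only, not part of the BigBench specification.

from math import prod

def bigbench_metric(t_declarative, t_procedural, t_both, t_loading):
    # Geometric mean of the four measured execution times,
    # all expressed in the same unit (e.g., seconds).
    times = [t_declarative, t_procedural, t_both, t_loading]
    return prod(times) ** (1.0 / len(times))

# Illustrative values only (seconds): declarative, procedural, both, loading.
print(bigbench_metric(3600.0, 5400.0, 4200.0, 1800.0))

Because the geometric mean is used, the metric rewards balanced performance across loading, declarative, procedural, and mixed processing, rather than strength in a single area (e.g., an MR engine that loads quickly or a DBMS that excels at SQL).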