Introduction to BigBench

advertisement
BigBench Discussion
Tilmann Rabl, msrg.org, UofT
Third Workshop on Big Data Benchmarking, Xi’an
July 16, 2013
Introduction BigBench
• End to end benchmark
• Application level
• Based on a product retailer (TPC-DS)
• Focus on
• Parallel DBMS
• MR engines
• History
• Launched at 1st WBDB, San Jose
• Published at SIGMOD 2013
• Full spec at WBDB proceedings 2012 (in progress)
• Collaboration with Industry & Academia
• Teradata, University of Toronto, InfoSizing , Oracle
Outline
• Data Model
• Variety, Volume, Velocity
• Workload
• Main driver: retail big data analytics
• Covers : data source, declarative & procedural and
machine learning algorithms.
• Metrics
• Discussion
Data Model I
• 3 Parts
• Structured: TPC-DS + market prices
• Semi-structured: website click-stream
• Unstructured: customers’ reviews
Data Model II
• Variety
• Different schema parts
• Volume
• Based on scale factor
• Similar to TPC-DS scaling
• Weblogs & product reviews also scaled
• Velocity
• Periodic refreshes for all data
• Different velocity for different areas
• Vstructured < Vunstructured < Vsemistructured
• Queries run with refresh
Workload
• Workload Queries
• 30 queries
• Specified in English
• No required syntax
• Business functions (Adapted from McKinsey+)
• Marketing
• Cross-selling, Customer micro-segmentation, Sentiment
analysis, Enhancing multichannel consumer experiences
• Merchandising
• Assortment optimization, Pricing optimization
• Operations
• Performance transparency, Product return analysis
• Supply chain
• Inventory management
• Reporting (customers and products)
SQL-MR Query 1
SELECT
category_cd1 AS category1_cd ,
category_cd2 AS category2_cd , COUNT (*) AS cnt
FROM
basket_generator (
ON
( SELECT i. i_category_id AS category_cd ,
s. ws_bill_customer_sk AS customer_id
FROM web_sales s INNER JOIN item i
ON s. ws_item_sk = i_item_sk
)
PARTITION BY customer_id
BASKET_ITEM (' category_cd ')
ITEM_SET_MAX (500)
)
GROUP BY 1,2
order by 1 ,3 ,2;
Workload – Technical Aspects
Data Sources
Number of
Queries
Percentage
Structured
18
60%
Semi-structured
7
23%
Un-structured
5
17%
Analytic techniques
Number of
Queries
Percentage
Statistics analysis
6
20%
Data mining
17
57%
Reporting
8
27%
Metrics
• Discussion topic..
• Initial thoughts
•
•
•
•
Focus on loading and type of processing
MR engines good at loading
DBMS good at SQL
Metric = 4 𝑇𝐷 ∗ 𝑇𝑃 ∗ 𝑇𝐡 ∗ 𝑇𝐿
•
•
•
•
𝑇𝐷 ∢ 𝐸π‘₯π‘’π‘π‘’π‘‘π‘–π‘œπ‘› π‘‘π‘–π‘šπ‘’ π‘“π‘œπ‘Ÿ π·π‘’π‘π‘™π‘Žπ‘Ÿπ‘Žπ‘‘π‘–π‘£π‘’ π‘„π‘’π‘’π‘Ÿπ‘–π‘’π‘ 
𝑇𝑃 : 𝐸π‘₯π‘’π‘π‘’π‘‘π‘–π‘œπ‘› π‘‘π‘–π‘šπ‘’ π‘“π‘œπ‘Ÿ π‘ƒπ‘Ÿπ‘œπ‘π‘’π‘‘π‘’π‘Ÿπ‘Žπ‘™ π‘„π‘’π‘’π‘Ÿπ‘–π‘’π‘ 
𝑇𝐡 ∢ 𝐸π‘₯π‘’π‘π‘’π‘‘π‘–π‘œπ‘› π‘‘π‘–π‘šπ‘’ π‘“π‘œπ‘Ÿ π΅π‘œπ‘‘β„Ž
𝑇𝐿 : 𝐸π‘₯π‘’π‘π‘’π‘‘π‘–π‘œπ‘› π‘‘π‘–π‘šπ‘’ π‘“π‘œπ‘Ÿ πΏπ‘œπ‘Žπ‘‘π‘–π‘›π‘”
• Applicable to single and multi-streams
Discussion Topics
• What did we miss?
• Extending/completing BigBench
• How to go beyond MapReduce?
• E.g. add social/graph component (e.g., LinkBench)
• NoSQL, cube model, recommendations
• What kind of queries, schema extensions are necessary?
• What has to be measured?
• Metrics and properties
• Price performance, (energy) efficiency
• What properties does the system have to fulfill?
• Any other concerns?
Discussion – Organization
• Break-out groups (2)
• Discussion leader
• Scribe: Prepare 5 minute presentation
• Student groups (2)
• Find an English speaking member (presenter)
• Prepare 1+ slide on each question
• Schedule:
•
•
•
•
14:15 – 15:15 Discussion 1
15 min break
15:30 – 16:15 Discussion 2
16:15 – 17:00 Plenary
Download