SQL-MR Multiple Inputs

advertisement
BigBench: Big Data Benchmark Proposal
Ahmad Ghazal, Tilmann Rabl, Minqing Hu, Francois Raab, Meikel Poess,
Alain Crolotte, Hans-Arno Jacobsen
BigBench
• End to end benchmark
• Based on a product retailer
• Focus on
- Parallel DBMS
- MR engines
• Initial work presented at 1st WBDB, San Jose
• Collaboration with Industry & Academia
- Teradata, University of Toronto, InfoSizing , Oracle
• Full spec at 3rd WBDB, Xian, China
2
Outline
• Data Model
- Variety, Volume, Velocity
• Data Generator
- PDGF for structured data
- Enhancement : Semi-structured & Text generation
• Workload specification
- Main driver: retail big data analytics
- Covers : data source, declarative & procedural and machine learning
algorithms.
• Metrics
• Evaluation
- Done on Teradata Aster
3
Data Model
• Big Data is not about size only
• 3 V’s
- Variety
• Different types of data
- Volume
• Huge size
- Velocity
• Continuous updates
4
Data Model
(Variety)
5
• Volume
Data Model
(continued)
- Based on scale factor
- Similar to TPC-DS scaling
- Weblogs & product reviews also scaled
• Velocity
- Periodic refreshes for all data
- Different velocity for different areas
• Vstructured
< Vunstructured < Vsemistructured
- Queries run with refresh
6
Data Generator
• "Parallel Data Generation Framework“ PDGF
- For the structured part of model
- Scale factor similar to TPC-DS
• Extensions to PDGF
- Web logs
- Product reviews
• Web logs: retail customers/guests visiting site
- Web logs similar to apache web server logs
- Coupled with structured data
• Product reviews: Customers and guest users
- Algorithm based on Markov chain
- Real data set sample input
- Coupled with structured data
7
Data Generator
TextGen
• Input
-
real review data
Category information
Rating
Review text
• Initial setup
- Parse text
• Produce tokens
• Correlation between tokens
-
Build repository by category
• API for data generation
- Integrated with PDGF
- Based on category ID
8
Workload (continued)
• Workload Queries
- 30 queries
- Specified in English
- No required syntax
• Business functions (Adapted from McKinsey+)
- Marketing
• Cross-selling, Customer micro-segmentation, Sentiment analysis,
Enhancing multichannel consumer experiences
- Merchandising
• Assortment optimization, Pricing optimization
- Operations
• Performance transparency, Product return analysis
- Supply chain
• Inventory management
- Reporting (customers and products)
9
Workload (continued)
Technical Functions
• Data source dimension
- Structured
- Semi-structured
- Un-structured
• Processing type dimension
- Declarative (SQL, HQL)
- Procedural
- Mix of both
• Analytic technique dimension
- Statistical analysis: correlation analysis, time-series, regression
- Data mining: classification, clustering, association mining, pattern
analysis and text analysis
- Simple reporting: ad hoc queries not covered above
10
Workload (continued)
Business Categories Query Breakdown
Business category
11
Number of
Queries
Percentage
Marketing
18
60%
Merchandising
5
17%
Operations
4
13%
Supply chain
2
7%
New business
models
1
3%
Workload (continued)
Technical Dimensions Breakdown
Query processing type
12
Number of
Queries
Percentage
Declarative
10
33%
Procedural
7
23%
Declarative + Procedural
13
43%
Technical Dimensions Breakdown
(continued)
Data Sources
Percentage
Structured
18
60%
Semi-structured
7
23%
Un-structured
5
17%
Number of
Queries
Percentage
Analytic techniques
13
Number of
Queries
Statistics analysis
6
20%
Data mining
17
57%
Reporting
8
27%
Metrics
• Future Work
• Initial thoughts
- Focus on loading and type of processing
- MR engines good at loading
- DBMS good at SQL
- Metric =
4
𝑇𝐷 ∗ 𝑇𝑃 ∗ 𝑇𝐡 ∗ 𝑇𝐿
• 𝑇𝐷 ∢ 𝐸π‘₯π‘’π‘π‘’π‘‘π‘–π‘œπ‘› π‘‘π‘–π‘šπ‘’ π‘“π‘œπ‘Ÿ π·π‘’π‘π‘™π‘Žπ‘Ÿπ‘Žπ‘‘π‘–π‘£π‘’ π‘„π‘’π‘’π‘Ÿπ‘–π‘’π‘ 
• 𝑇𝑃 : 𝐸π‘₯π‘’π‘π‘’π‘‘π‘–π‘œπ‘› π‘‘π‘–π‘šπ‘’ π‘“π‘œπ‘Ÿ π‘ƒπ‘Ÿπ‘œπ‘π‘’π‘‘π‘’π‘Ÿπ‘Žπ‘™ π‘„π‘’π‘’π‘Ÿπ‘–π‘’π‘ 
• 𝑇𝐡 ∢ 𝐸π‘₯π‘’π‘π‘’π‘‘π‘–π‘œπ‘› π‘‘π‘–π‘šπ‘’ π‘“π‘œπ‘Ÿ π΅π‘œπ‘‘β„Ž
• 𝑇𝐿 : 𝐸π‘₯π‘’π‘π‘’π‘‘π‘–π‘œπ‘› π‘‘π‘–π‘šπ‘’ π‘“π‘œπ‘Ÿ πΏπ‘œπ‘Žπ‘‘π‘–π‘›π‘”
• Applicable to single and multi-streams
14
Evaluation
• BigBench proof of concept
• DBMS
-
Typically data loaded into tables
Possibly parsing weblogs to get schema
Reviews captured as VARCHAR or BLOB fields
Queries run using SQL + UDF
• MR engine
- Data can be loaded on HDFS
- MR, HQL, PigLatin can be used
• DBMS & MR engine
- DBMS with Hadoop connectors
- Data can be placed and split among both
- Processing can also be split among two
15
Evaluation (continued)
• Teradata Aster
- MPP database
- Discovery platform
- Supports SQL-MR
• System
- Aster 5.0
- 8 node cluster
- 200 GB of data
16
Evaluation (continued)
• Data generation
- DSDGen produced structured part
- PDGF+ produced semi-structured and un-structured
• Data loaded into tables
- Parsed weblogs into Weblogs table
- Product reviews table
• Queries
- SQL-MR syntax
17
Evaluation (continued)
• Example query
• Product category affinity analysis
- Computes the probability of browsing products from a category
after customers viewed items from another category.
- One form of market basket
• Business case
- Marketing
- cross-selling
• Type of source
- Structured (web sales)
• Processing type
- mix of declarative and procedural
• Analytics type
- Data mining
18
Evaluation (continued)
SELECT
category_cd1 AS category1_cd ,
category_cd2 AS category2_cd , COUNT (*) AS cnt
FROM
basket_generator (
ON
( SELECT i. i_category_id AS category_cd ,
s. ws_bill_customer_sk AS customer_id
FROM web_sales s INNER JOIN item i
ON s. ws_item_sk = i_item_sk
)
PARTITION BY customer_id
BASKET_ITEM (' category_cd ')
ITEM_SET_MAX (500)
)
GROUP BY 1,2
order by 1 ,3 ,2;
19
Evaluation
(Sample Queries)
20
Evaluation
(Processing Type)
21
Conclusion
• End to end big data benchmark
• Data Model
- Adding semi-structured and unstructured data
- Different velocity for structured, semi-structured and unstructured
- Volume based on scale factor
• Data Generation
- Unstructured data using Markov chain model.
- Enhancing PDGF for semi-structured and integrating it with
unstructured
• Workload
- 30 queries based on McKinsey report
- Covers processing type, data source and data mining algorithms
• Evaluation
- Aster SQL-MR
22
Future Work
• Add late binding to web logs
- No schema/keys upfront
• Finalizing
- Data generator
- Metric specifications.
- Downloadable kit
• More POC’s
- Hadoop ecosystem like HIVE
23
Download