Benchmarking *No One Size Fits All* Big Data Analytics

BigFrame Team
The Hong Kong Polytechnic University
Duke University
HP Labs
Analytics System Landscape
• MPP DB
  o Greenplum, SQL Server PDW, Teradata, etc.
• Columnar
  o Vertica, Redshift, Vectorwise, etc.
• MapReduce
  o Hadoop, Hive, HadoopDB, Tenzing, etc.
• Streaming
  o Storm, StreamBase, etc.
• Graph
  o Pregel, GraphLab, etc.
• Multi-tenancy
  o Mesos, YARN, etc.
What does this mean for Big Data Practitioners?
It gives them a lot of power!
But even the mighty may need a little help.
Challenges for Practitioners
App Developers, Data Scientists: Which system to use for the app that I am developing?
• Features (e.g., graph data)
• Performance (e.g., claims like "System A is 50x faster than B")
• Resource efficiency
• Growth and scalability
• Multi-tenancy
Challenges for Practitioners
App Developers, Data Scientists: Different parts of my app have different requirements.
Compose "best of breed" systems, or use a "one size fits all" system?
Challenges for Practitioners
• App Developers, Data Scientists: Which system to use? Different parts of my app have different requirements.
• System Admins: Managing many systems is hard!
• CIO: What is the Total Cost of Ownership (TCO)?
Need
Benchmarks
One Approach
Categorize systems, then develop a benchmark per system category.
Useful, but ...
• MPP DB, Columnar
  o TPC-H/TPC-DS, Berkeley Big Data Benchmark, etc.
• MapReduce
  o TeraSort, DFSIO, GridMix, HiBench, etc.
• Streaming
  o Linear Road, etc.
• Graph
  o Graph 500, PageRank, etc.
Problem: May Miss the Big Picture
• Cannot capture the complexities and end-to-end behavior of big data applications and deployments:
  o Bottlenecks
  o Data conversion, transfer, & loading overheads
  o Storage costs & other parts of the data life-cycle
  o Resource management challenges
  o Total Cost of Ownership (TCO)
A Better Approach:
BigBench or Deep Analytics Pipeline
• Application-driven
• Involves multiple types of data:
  o Structured
  o Semi-structured
  o Unstructured
• Involves multiple types of operators:
  o Relational operators: join, group by
  o Text analytics: sentiment analysis
  o Machine learning
Problem: a fixed benchmark is not enough
"Give a man a fish and you will feed him for a day. Give him fishing gear and you will feed him for life." --Anonymous
Not a single benchmark, but a Benchmark Generator.
BigFrame: A Benchmark Generator for Big Data Analytics
How a User Uses BigFrame
1. Through the BigFrame interface, the user supplies a bigif (benchmark input format).
2. The Benchmark Generator turns the bigif into a bigspec (benchmark specification).
3. A Benchmark Driver for the system under test (e.g., MapReduce, Hive, HBase) runs the benchmark and returns the result.
What should be captured by the benchmark input format?
• The 3Vs: Volume, Velocity, Variety
bigif: BigFrame's input format
Benchmark Generation
A bigif (benchmark input format) describes points in a discrete space of
{Data, Query} X {Variety, Volume, Velocity}
From a bigif, the Benchmark Generator produces a bigspec (benchmark specification) containing:
1. Initial data to load
2. Data refresh pattern
3. Query streams
4. Evaluation metrics
Benchmark generation can be addressed as a search problem within a rich application domain.
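The mapping from a bigif point to a bigspec can be sketched as follows. The slides do not show BigFrame's actual formats, so every key and value below is invented purely to illustrate the idea of picking a point in the {Data, Query} X {Variety, Volume, Velocity} space:

```python
# Hypothetical sketch of a bigif (benchmark input format); all keys and
# values are invented for illustration, not BigFrame's real syntax.
bigif = {
    "data": {
        "variety": ["relational", "text"],  # which data types to generate
        "volume": "100GB",                  # initial data size to load
        "velocity": "continuous",           # data refresh pattern
    },
    "query": {
        "variety": "macro",                 # micro vs. application-level
        "volume": 20,                       # number of query streams
        "velocity": "continuous",           # exploratory vs. continuous
    },
}

# A generator would turn this point into a concrete bigspec with the four
# components listed above: data, refresh pattern, query streams, metrics.
bigspec = {
    "initial_data": f"{bigif['data']['volume']} of {'+'.join(bigif['data']['variety'])}",
    "refresh_pattern": bigif["data"]["velocity"],
    "query_streams": bigif["query"]["volume"],
    "metrics": ["latency", "throughput", "TCO"],
}
print(bigspec["initial_data"])  # 100GB of relational+text
```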
Application Domain Modeled Currently
• E-commerce: sales, promotions, recommendations
• Social media: sentiment & influence
Example tables: Web_sales, Promotion, Item
Use Case 1: Exploratory BI
• Large volumes of relational data
  o Data Variety = {Relational}
• Mostly aggregation and few joins
  o Query Variety = {Micro}
• Can Spark's performance match that of an MPP DB?
BigFrame will generate a benchmark specification containing relational data and (SQL-ish) queries.
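A "micro" query of this kind might look like the sketch below. The table and column names are invented (loosely echoing the Web_sales table from the modeled domain), and SQLite stands in for the system under test:

```python
import sqlite3

# Hypothetical micro aggregation query of the kind a generated bigspec
# might contain; schema and data are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Web_sales (item_id INTEGER, promo_id INTEGER, price REAL)")
conn.executemany(
    "INSERT INTO Web_sales VALUES (?, ?, ?)",
    [(1, 10, 5.0), (1, 10, 7.0), (2, 11, 3.0)],
)

# Mostly aggregation, few joins: total revenue per item.
rows = conn.execute(
    "SELECT item_id, SUM(price) FROM Web_sales GROUP BY item_id ORDER BY item_id"
).fetchall()
print(rows)  # [(1, 12.0), (2, 3.0)]
```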
Use Case 2: Complex BI
• Large volumes of relational data, and even larger volumes of text data
  o Data Variety = {Relational, Text}
• Combined analytics
  o Query Variety = {Macro} (application-focused instead of micro-benchmark)
BigFrame will generate a benchmark specification that includes sentiment analysis tasks over tweets.
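As a toy stand-in for such a sentiment task, a lexicon-based scorer can be sketched as below. The word lists and tweets are invented; a real benchmark task would use a trained model or a full sentiment lexicon:

```python
# Toy lexicon-based sentiment scorer; word lists are invented and only
# illustrate the shape of a "sentiment analysis over tweets" task.
POSITIVE = {"love", "great", "awesome"}
NEGATIVE = {"hate", "awful", "broken"}

def sentiment(tweet: str) -> int:
    """Return (#positive words - #negative words) in the tweet."""
    words = tweet.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

tweets = ["I love this promotion great deal", "awful service I hate it"]
scores = [sentiment(t) for t in tweets]
print(scores)  # [2, -2]
```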
Use Case 3: Dashboards
• Large volume and velocity of relational and text data
  o Data Velocity = {Fast}
• Continuously-updated dashboards
  o Query Variety = {Continuous} (as opposed to exploratory)
BigFrame will generate a benchmark specification that includes data refresh as well as continuous queries whose results change upon data refresh.
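A continuous query of this sort can be sketched as a running aggregate that is re-evaluated after every data refresh. The refresh batches and the "top-selling item" query below are invented for illustration:

```python
# Minimal sketch of a continuous query: each data refresh updates a running
# aggregate, and the query result is re-evaluated after every refresh.
from collections import defaultdict

dashboard = defaultdict(float)  # item_id -> running revenue

def refresh(batch):
    """Apply one data refresh, then re-evaluate the continuous query."""
    for item_id, price in batch:
        dashboard[item_id] += price
    # Continuous query: top-selling item after this refresh.
    return max(dashboard, key=dashboard.get)

print(refresh([(1, 5.0), (2, 9.0)]))  # 2
print(refresh([(1, 6.0), (1, 3.0)]))  # 1  (result changed upon refresh)
```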
Working with the Community
• First release of BigFrame planned for August 2013
  o Open source with extensibility APIs
• Benchmark Drivers for more systems
• Utilities (accessed through the Benchmark Driver) to drill down into system behavior during benchmarking
• Instantiating the BigFrame pipeline for more app domains
Take Away
• "Benchmarks shape a field (for better or worse); they are how we determine the value of change." --David Patterson, University of California, Berkeley, 1994
• Benchmarks meet different needs for different people
  o End customers, application developers, system designers, system administrators, researchers, CIOs
• BigFrame helps users generate benchmarks that best meet their needs
Download