Benchmarking “No One Size Fits All” Big Data Analytics
BigFrame Team
The Hong Kong Polytechnic University, Duke University, HP Labs

Analytics System Landscape
• MPP DB
  o Greenplum, SQL Server PDW, Teradata, etc.
• Columnar
  o Vertica, Redshift, Vectorwise, etc.
• MapReduce
  o Hadoop, Hive, HadoopDB, Tenzing, etc.
• Streaming
  o Storm, StreamBase, etc.
• Graph
  o Pregel, GraphLab, etc.
• Multi-tenancy
  o Mesos, YARN, etc.

What does this mean for Big Data Practitioners?
It gives them a lot of power! But even the mighty may need a little help.

Challenges for Practitioners
Which system should I use for the application I am developing? (App Developers, Data Scientists)
• Features (e.g., graph data)
• Performance (e.g., claims like "System A is 50x faster than System B")
• Resource efficiency
• Growth and scalability
• Multi-tenancy

Different parts of my application have different requirements. Should I compose "best of breed" systems, or use a "one size fits all" system? (App Developers, Data Scientists)
Managing many systems is hard! (System Admins)
What is the Total Cost of Ownership (TCO)? (CIO)
We need benchmarks.

One Approach
Categorize systems and develop a benchmark per system category.

Useful, But...
• MPP DB, Columnar
  o TPC-H/TPC-DS, Berkeley Big Data Benchmark, etc.
• MapReduce
  o TeraSort, DFSIO, GridMix, HiBench, etc.
• Streaming
  o Linear Road, etc.
• Graph
  o Graph 500, PageRank, etc.

Problem: May Miss the Big Picture
• Per-category benchmarks cannot capture the complexities and end-to-end behavior of big data applications and deployments:
  o Bottlenecks
  o Data conversion, transfer, and loading overheads
  o Storage costs and other parts of the data life-cycle
  o Resource management challenges
  o Total Cost of Ownership (TCO)

A Better Approach: BigBench or the Deep Analytics Pipeline
• Application-driven
• Involves multiple types of data:
  o Structured
  o Semi-structured
  o Unstructured
• Involves multiple types of operators:
  o Relational operators: join, group by
  o Text analytics: sentiment analysis
  o Machine learning

Problem: Benchmark vs. Benchmark Generator
"Give a man a fish and you will feed him for a day. Give him fishing gear and you will feed him for life." --Anonymous
A fixed benchmark is the fish; a benchmark generator is the fishing gear.

BigFrame
A Benchmark Generator for Big Data Analytics

How a User Uses BigFrame
• The user describes requirements in a bigif (benchmark input format) through the BigFrame interface.
• The Benchmark Generator turns the bigif into a bigspec (benchmark specification).
• A Benchmark Driver for the system under test (e.g., MapReduce, Hive, HBase) runs the benchmark and returns the results.

What Should Be Captured by the Benchmark Input Format?
• The 3 Vs: Volume, Velocity, Variety

bigif: BigFrame's Input Format
A bigif describes points in a discrete space of {Data, Query} × {Variety, Volume, Velocity}.

bigspec: Benchmark Specification
From a bigif, the Benchmark Generator produces a bigspec containing:
1. Initial data to load
2. Data refresh pattern
3. Query streams
4. Evaluation metrics
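The slides do not show the concrete syntax of a bigif, so the following is only a minimal Scala sketch of the {Data, Query} × {Variety, Volume, Velocity} space that a bigif describes. All type and field names here are hypothetical illustrations, not BigFrame's actual input format.

```scala
// Illustrative sketch only: hypothetical names, not the actual bigif syntax.

// Variety of data and queries requested by the user.
sealed trait DataVariety
case object Relational extends DataVariety
case object Text       extends DataVariety

sealed trait QueryVariety
case object Micro extends QueryVariety  // narrow, operator-level queries
case object Macro extends QueryVariety  // application-focused, end-to-end queries

// Velocity of data arrival and of query issuance.
sealed trait Velocity
case object Batch      extends Velocity
case object Continuous extends Velocity

// One point in the discrete {Data, Query} x {Variety, Volume, Velocity} space.
final case class BigIf(
  dataVariety:   Set[DataVariety],
  dataVolumeGB:  Long,         // Volume: target size of the initial load
  dataVelocity:  Velocity,     // how often the data is refreshed
  queryVariety:  QueryVariety,
  queryVelocity: Velocity      // exploratory (batch) vs. continuous queries
)

object BigIfExample {
  // Example point: mixed relational and text data, application-level queries,
  // over a static (batch-loaded) data set of roughly 1 TB.
  val example: BigIf = BigIf(
    dataVariety   = Set(Relational, Text),
    dataVolumeGB  = 1000L,
    dataVelocity  = Batch,
    queryVariety  = Macro,
    queryVelocity = Batch
  )
}
```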
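In the same spirit, the four components of a bigspec can be pictured as a simple data structure that the generator fills in from a bigif point (the `BigIf` type sketched above). Again, these names are assumptions for illustration, not BigFrame's real output format.

```scala
// Illustrative sketch only: hypothetical names, not the actual bigspec format.

final case class TableToLoad(name: String, path: String, sizeGB: Long)
final case class RefreshPattern(intervalSeconds: Int, tablesRefreshed: Seq[String])
final case class QueryStream(id: Int, queries: Seq[String])   // e.g., SQL-ish query text
final case class Metrics(names: Seq[String])                  // e.g., latency, throughput

final case class BigSpec(
  initialData:    Seq[TableToLoad],        // 1. initial data to load
  refreshPattern: Option[RefreshPattern],  // 2. data refresh pattern (None for static data)
  queryStreams:   Seq[QueryStream],        // 3. query streams to run
  metrics:        Metrics                  // 4. evaluation metrics to report
)

object BenchmarkGenerator {
  // Conceptually, the generator maps a point in the bigif space to a concrete bigspec.
  def generate(in: BigIf): BigSpec = ???
}
```

A Benchmark Driver for a specific system under test (Hive, Hadoop, HBase, etc.) would then consume such a specification to load the data, apply refreshes, and replay the query streams.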
Benchmark generation can be addressed as a search problem within a rich application domain.

Application Domain Modeled Currently
• E-commerce: sales, promotions, recommendations
• Social media: sentiment and influence
(Schema excerpt: Web_sales, Promotion, and Item tables.)

Use Case 1: Exploratory BI
• Large volumes of relational data → Data Variety = {Relational}
• Mostly aggregation and few joins → Query Variety = {Micro}
• Example question: can Spark's performance match that of an MPP DB?
BigFrame will generate a benchmark specification containing relational data and (SQL-ish) queries.

Use Case 2: Complex BI
• Large volumes of relational data and even larger volumes of text data → Data Variety = {Relational, Text}
• Combined analytics → Query Variety = {Macro} (application-focused instead of micro-benchmark)
BigFrame will generate a benchmark specification that includes sentiment analysis tasks over tweets.

Use Case 3: Dashboards
• Large volume and velocity of relational and text data → Data Velocity = Fast
• Continuously updated dashboards → Query Variety = Continuous (as opposed to exploratory)
BigFrame will generate a benchmark specification that includes data refreshes as well as continuous queries whose results change upon each refresh.

Working with the Community
• First release of BigFrame planned for August 2013, open source with extensibility APIs
• Benchmark Drivers for more systems
• Utilities, accessed through the Benchmark Driver, to drill down into system behavior during benchmarking
• Instantiating the BigFrame pipeline for more application domains

Take Away
• "Benchmarks shape a field (for better or worse); they are how we determine the value of change." --David Patterson, University of California, Berkeley, 1994
• Benchmarks meet different needs for different people: end customers, application developers, system designers, system administrators, researchers, CIOs.
• BigFrame helps users generate benchmarks that best meet their needs.