Visualizing Big Data
David Schmittdiel
CSC 9010-003
9/16/2014

Outline
• Me
• Big Data review and background
• Problem statement
• Case study: StubHub

Intro
I don’t have a Computer Science background (but I really, really regret it)
MATLAB, PHP, MySQL, Oracle
Manager of Business Intelligence Development at StubHub
• Bringing actionable data to the masses
• Self-service, on-demand, exploratory BI
• Data discovery through visualization
• Automation

Big Data, Big Ruse?
Stephen Few: “What the hell is Big Data anyway?”
BI vendor-driven responses:
• Increased data volume AND velocity
• New data sources (unstructured)
Fundamental question: Do you really need Big Data?
“Until you’ve figured out how to use the data that you already have, collecting more will only distract you from the real task. Time spent collecting more data is time that could be better spent weaving it into something meaningful.”
Stephen Few, Perceptual Edge - July/August/September 2012, “Big Data, Big Ruse”
http://www.perceptualedge.com/articles/visual_business_intelligence/big_data_big_ruse.pdf

The Real Task
Transforming raw data into meaningful, useful, actionable information
Leveraging the past to guide future endeavors
Finding the signals amidst the noise
Driving forces:
• Scientific research
• Business (ecommerce)
• Government
Stephen Few: “The success of BI … [is] measured in our increased ability to understand data and then make better decisions based on that understanding.”

Visualizing Small Data
MS Excel
• Ease of use for tasks involving smaller data sets, limited interactivity
• Stephen Few: “building applications on top of Excel can be arduous and painful”
Stephen Few, Perceptual Edge - September/October 2009, “Fundamental Differences in Analytical Tools”
http://www.perceptualedge.com/articles/visual_business_intelligence/differences_in_analytical_tools.pdf

Visualizing Small Data
Static dashboards: “custom analytics”
• Time-consuming to build but relatively easy to maintain
• “Remove …
functionality that isn’t relevant to the analytical objective of its users”

Unique Challenges
Juliana Freire: “Visualization: Big Data Considerations”
• Interactivity is key, but challenging for Big Data
• Need better integration between data management and visualization components
Phil Simon describing Netflix’s data mindset:
• Data should be accessible, easy to discover, and easy to process for everyone
• The longer you take to find the data, the less valuable it becomes
• Whether a dataset is large or small, being able to visualize it makes it easier to explain
Juliana Freire, DIMACS 2013, “Big Data Analysis and Integration”
http://dimacs.rutgers.edu/Workshops/BigData/Slides/2013-dimacs.pdf
Phil Simon, HBR Webinar, “The Visual Organization: Data Visualization, Big Data, and the Quest for Better Decisions”
http://www.scribd.com/doc/232032215/HBR-Webinar-Summary-The-Visual-Organization

Case Study: StubHub
Using SAP Business Objects (BO) since at least 2008 on top of Oracle 11g DW
Included in the “Leaders” quadrant of 2014 Gartner report
• BO “delivers a broad range of BI and analytic capabilities through a semantic layer best suited for large IT-managed deployments that require robust governance and administrative capabilities”
• Customers use “primarily for reporting; the number that use it for interactive discovery or visualization was well below the average”
Gartner, Magic Quadrant for Business Intelligence and Analytics Platforms
www.gartner.com/technology/reprints.do?id=11QLGACN&ct=140210&st=sb

Case Study: StubHub
Feedback from business users was universally poor
• Hard to use
• Limited number of (inadequate) visualizations available
• Not interactive
• Supported by Tech org only
Reporting Team within Analytics org formed in January, 2013
• Innovative
• Responsive
• Promote self-service
• Objective vs subjective use of data

Case Study: StubHub
General concept: aggregate any metrics by any breakdown, over any time period, filtered for anything
Supports
“exploratory analytics”: pursue each question as it arises
Settle instead for a collection of dashboards categorized by business use case

Case Study: StubHub
First iteration: Dynamic SQL
• Complicated rules for commenting based on front-end selections

    select
        -- DATE: sp.src_created_dttm_sale
        g.genre_cat_final as "GCF", -- DISPLAY CATEGORY: GCF
        g.genre_descr as "Genre", -- DISPLAY CATEGORY: Genre
        sum(sp.ticket_cost) as "GTS", -- DATA METRIC: GTS
        count(distinct transaction_id) as "# Orders", -- DATA METRIC: # Orders
    from owbruntarget_dw.dw_sales_pipeline_fact sp
    join owbruntarget_dw.dw_genre_dim g on sp.genre_dw_id = g.genre_dw_id -- DISPLAY CATEGORY or FILTER: GCF, Genre
    where 1=1
        -- FILTER: g.genre_cat_final for GCF
        -- FILTER: g.genre_descr for Genre
        AND trunc(src_created_dttm_sale) between :startdate and :enddate
    group by
        g.genre_cat_final, -- DISPLAY CATEGORY: GCF
        g.genre_descr, -- DISPLAY CATEGORY: Genre
        -- DATEG: sp.src_created_dttm_sale
        ''

Proved unworkable because of long query execution times, even after incorporating bind variables

Case Study: StubHub
Next iteration: “pandas” dataframes
• Open source Python library for data manipulation and analysis
• Fast and efficient DataFrame object for data manipulation with integrated indexing
• Tools for reading and writing data between in-memory data structures and different formats (e.g. CSV)
For each dashboard, one static query
• Tuning + Oracle query optimizer
• Retrieve comprehensive data set needed to power the dashboard
• Store data in CSV files on network
• “Jukebox” functionality: only files needed are loaded into memory for processing
Pandas: http://pandas.pydata.org/pandas-docs/stable/index.html

Case Study: StubHub
Results:
• Huge decrease in dashboard run times
• Corresponding increase in adoption rate

Case Study: StubHub
Where does the interactivity necessary for data discovery come from?
Template-based front end built with PHP + HTML + CSS + jQuery
• Provide different levels of granularity
• Decreases amount of time needed to create a new dashboard (vs. Tableau)
Menus control requests for:
• Categories: group by
• Metrics: aggregate functions
• Filters: where clause
• Date range
• Chart types, date aggregation

Case Study: StubHub
How to provide integration between back-end data management and front-end visualization components?
Solution is Data-Driven Documents (D3.js)
• JavaScript library to drive the creation and control of dynamic and interactive graphical forms which run in web browsers
• W3C-compliant, making use of the widely implemented Scalable Vector Graphics (SVG), JavaScript, HTML5, and Cascading Style Sheets (CSS3) standards
• Large data sets can be easily bound to SVG objects using JSON and simple D3 functions to generate charts and diagrams
D3: http://d3js.org/

Case Study: StubHub
Summary of approach
Create a collection of BI dashboards that are:
• Fast
• Customizable
• Interactive
• Highly visual
• On-demand
• Scalable
• Consistent
Custom build EVERYTHING as needed
Leverage open source technologies whenever possible
Data source agnostic to accommodate new data stores as they become available
• Output from MapReduce jobs in CSV format
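
Illustration (not from the original deck): a minimal sketch of the general concept described in the case study, where a dashboard's CSV extract is aggregated by whichever categories, filters, and date range the front-end menus request, then serialized as JSON of the kind D3 can bind to chart elements. The column names, sample rows, and the `aggregate` function are all hypothetical. It uses only the Python standard library so that it runs without extra dependencies; in the setup the slides describe, a pandas DataFrame would presumably take this role.

```python
import csv
import io
import json
from collections import defaultdict
from datetime import date

# Hypothetical extract standing in for one dashboard's CSV file on the network.
SAMPLE_CSV = """sale_date,gcf,genre,ticket_cost,transaction_id
2014-09-01,Sports,MLB,120.00,t1
2014-09-01,Sports,MLB,80.00,t1
2014-09-02,Sports,NFL,200.00,t2
2014-09-03,Concerts,Rock,150.00,t3
2014-10-01,Concerts,Rock,90.00,t4
"""


def aggregate(csv_file, categories, start, end, filters=None):
    """Group rows by `categories`; return GTS (sum) and distinct order counts.

    `filters` maps a column name to the set of values allowed for it,
    mirroring the optional FILTER lines of the dynamic SQL template.
    """
    filters = filters or {}
    gts = defaultdict(float)    # sum(ticket_cost) per group
    orders = defaultdict(set)   # distinct transaction ids per group
    for row in csv.DictReader(csv_file):
        sale_date = date.fromisoformat(row["sale_date"])
        if not (start <= sale_date <= end):
            continue  # outside the requested date range
        if any(row[col] not in allowed for col, allowed in filters.items()):
            continue  # excluded by a front-end filter
        key = tuple(row[col] for col in categories)
        gts[key] += float(row["ticket_cost"])
        orders[key].add(row["transaction_id"])
    return [
        {**dict(zip(categories, key)), "GTS": gts[key], "# Orders": len(orders[key])}
        for key in sorted(gts)
    ]


# September sales broken down by top-level category only.
records = aggregate(
    io.StringIO(SAMPLE_CSV), ["gcf"], date(2014, 9, 1), date(2014, 9, 30)
)
# A list of flat records is a shape that D3 can consume directly as JSON.
chart_json = json.dumps(records)
```

Changing the `categories` list or passing `filters={"gcf": {"Sports"}}` produces a different breakdown from the same extract with no new query, which is the property the static-query-plus-CSV design relies on.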