Visualizing Big Data - Villanova University

Visualizing Big Data
David Schmittdiel
CSC 9010-003
9/16/2014
Outline
• Me
• Big Data review and background
• Problem statement
• Case study: StubHub
Intro
I don’t have a Computer Science background (but I really, really regret it)
MATLAB → PHP → MySQL → Oracle
Manager of Business Intelligence Development at StubHub
• Bringing actionable data to the masses
• Self-service, on-demand, exploratory BI
• Data discovery through visualization
• Automation
Big Data, Big Ruse?
Stephen Few: “What the hell is Big Data anyway?”
BI vendor-driven responses:
• Increased data volume AND velocity
• New data sources (unstructured)
Fundamental question: Do you really need Big Data?
“Until you’ve figured out how to use the data that you already have, collecting
more will only distract you from the real task. Time spent collecting more data is
time that could be better spent weaving it into something meaningful.”
Stephen Few, Perceptual Edge - July/August/September 2012, “Big Data, Big Ruse”
http://www.perceptualedge.com/articles/visual_business_intelligence/big_data_big_ruse.pdf
The Real Task
Transforming raw data into meaningful, useful, actionable information
• Leveraging the past to guide future endeavors
• Finding the signals amidst the noise
Driving forces:
• Scientific research
• Business (ecommerce)
• Government
Stephen Few: “The success of BI … [is] measured in our increased ability to
understand data and then make better decisions based on that understanding.”
Visualizing Small Data
MS Excel
• Ease of use for tasks involving smaller data sets, limited interactivity
• Stephen Few: “building applications on top of Excel can be arduous and painful”
Stephen Few, Perceptual Edge – September/October 2009, “Fundamental Differences in Analytical Tools”
http://www.perceptualedge.com/articles/visual_business_intelligence/differences_in_analytical_tools.pdf
Visualizing Small Data
Static dashboards: “custom analytics”
• Time-consuming to build but relatively easy to maintain
• “Remove … functionality that isn’t relevant to the analytical objective of its users”
Unique Challenges
Juliana Freire: “Visualization: Big Data Considerations”
• Interactivity is key, but challenging for Big Data
• Need better integration between data management and visualization components
Phil Simon describing Netflix’s data mindset:
• Data should be accessible, easy to discover, and easy to process for everyone
• The longer you take to find the data, the less valuable it becomes
• Whether a dataset is large or small, being able to visualize it makes it easier to explain
Juliana Freire, DIMACS 2013, “Big Data Analysis and Integration”
http://dimacs.rutgers.edu/Workshops/BigData/Slides/2013-dimacs.pdf
Phil Simon, HBR Webinar, “The Visual Organization: Data Visualization, Big Data, and the Quest for Better Decisions”
http://www.scribd.com/doc/232032215/HBR-Webinar-Summary-The-Visual-Organization
Case Study: StubHub
Using SAP Business Objects (BO) since at least 2008 on top of Oracle 11g DW
Included in the “Leaders” quadrant of 2014 Gartner report
• BO “delivers a broad range of BI and analytic capabilities through a semantic layer best suited for large IT-managed deployments that require robust governance and administrative capabilities”
• Customers use “primarily for reporting; the number that use it for interactive discovery or visualization was well below the average”
Gartner, Magic Quadrant for Business Intelligence and Analytics Platforms
www.gartner.com/technology/reprints.do?id=11QLGACN&ct=140210&st=sb
Case Study: StubHub
Feedback from business users was universally poor
• Hard to use
• Limited number of (inadequate) visualizations available
• Not interactive
• Supported by Tech org only
Reporting Team within Analytics org formed in January 2013
• Innovative
• Responsive
• Promote self-service
• Objective vs subjective use of data
Case Study: StubHub
General concept: aggregate any metrics by any breakdown, over any time period, filtered for anything
• Supports “exploratory analytics”: pursue each question as it arises
Settle instead for a collection of dashboards categorized by business use case
Case Study: StubHub
First iteration: Dynamic SQL
• Complicated rules for commenting based on front-end selections
select
    -- DATE: sp.src_created_dttm_sale
    g.genre_cat_final as "GCF", -- DISPLAY CATEGORY: GCF
    g.genre_descr as "Genre", -- DISPLAY CATEGORY: Genre
    sum(sp.ticket_cost) as "GTS", -- DATA METRIC: GTS
    count(distinct transaction_id) as "# Orders" -- DATA METRIC: # Orders
from owbruntarget_dw.dw_sales_pipeline_fact sp
join owbruntarget_dw.dw_genre_dim g on sp.genre_dw_id = g.genre_dw_id -- DISPLAY CATEGORY or FILTER: GCF, Genre
where 1=1
    -- FILTER: g.genre_cat_final for GCF
    -- FILTER: g.genre_descr for Genre
    AND trunc(src_created_dttm_sale) between :startdate and :enddate
group by
    g.genre_cat_final, -- DISPLAY CATEGORY: GCF
    g.genre_descr -- DISPLAY CATEGORY: Genre
    -- DATEG: sp.src_created_dttm_sale ''
Proved unworkable because of long query execution times, even after incorporating bind variables
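One alternative to toggling comments is to assemble the statement from whitelisted fragments. The sketch below is illustrative only and is not StubHub's implementation: the `CATEGORIES` and `METRICS` maps and the `build_query` helper are hypothetical names, while the table and column names are copied from the query above. Filter values are passed as bind variables, and the caller is assumed to supply `:startdate` and `:enddate` at execution time.

```python
# Hypothetical whitelists; the SQL fragments are taken from the query above.
CATEGORIES = {
    "GCF": "g.genre_cat_final",
    "Genre": "g.genre_descr",
}
METRICS = {
    "GTS": "sum(sp.ticket_cost)",
    "# Orders": "count(distinct transaction_id)",
}


def build_query(categories, metrics, filters=None):
    """Return (sql, binds) for the selected categories, metrics, and filters.

    Unknown category or metric names raise KeyError, so only whitelisted
    fragments ever reach the SQL text.
    """
    filters = filters or {}
    select_parts = [f'{CATEGORIES[c]} as "{c}"' for c in categories]
    select_parts += [f'{METRICS[m]} as "{m}"' for m in metrics]

    # The date range is always applied; its bind values come from the caller.
    where_parts = ["trunc(src_created_dttm_sale) between :startdate and :enddate"]
    binds = {}
    for i, (name, value) in enumerate(filters.items()):
        key = f"f{i}"  # generated bind variable name
        where_parts.append(f"{CATEGORIES[name]} = :{key}")
        binds[key] = value

    sql = (
        "select " + ", ".join(select_parts)
        + " from owbruntarget_dw.dw_sales_pipeline_fact sp"
        + " join owbruntarget_dw.dw_genre_dim g on sp.genre_dw_id = g.genre_dw_id"
        + " where " + " and ".join(where_parts)
    )
    if categories:
        sql += " group by " + ", ".join(CATEGORIES[c] for c in categories)
    return sql, binds
```

Building the text this way removes the comment rules, but it does not address the slow execution times noted above, since the database still runs one query per request.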
Case Study: StubHub
Next iteration: “pandas” dataframes
• Open source Python library for data manipulation and analysis
• Fast and efficient DataFrame object for data manipulation with integrated indexing
• Tools for reading and writing data between in-memory data structures and different formats (e.g. CSV)
For each dashboard, one static query
• Tuning + Oracle query optimizer
• Retrieve comprehensive data set needed to power the dashboard
• Store data in CSV files on network
• “Jukebox” functionality: only files needed are loaded into memory for processing
Pandas: http://pandas.pydata.org/pandas-docs/stable/index.html
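The “jukebox” idea can be sketched roughly as follows. This is an assumption about how such loading might work, not the actual StubHub code: `load_extract`, `load_dashboard_data`, the one-file-per-extract naming, and the cache size are all hypothetical. Only the extracts named in a request are read with `pandas.read_csv`, and repeat requests reuse the copy already in memory.

```python
from functools import lru_cache
from pathlib import Path

import pandas as pd


@lru_cache(maxsize=8)  # assumed cache size; tune to available memory
def load_extract(path):
    """Read one CSV extract into a DataFrame; repeat calls reuse the cached copy."""
    return pd.read_csv(path)


def load_dashboard_data(data_dir, names):
    """Load only the named extracts needed by the current request."""
    return {
        name: load_extract(str(Path(data_dir) / f"{name}.csv"))
        for name in names
    }
```

Because the cached DataFrame is shared between requests, callers should treat it as read-only and work on filtered or aggregated results.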
Case Study: StubHub
Results:
• Huge decrease in dashboard run times
• Corresponding increase in adoption rate
Case Study: StubHub
Where does the interactivity necessary for data discovery come from?
Template-based front end built with PHP + HTML + CSS + jQuery
• Provide different levels of granularity
• Decreases amount of time needed to create a new dashboard (vs. Tableau)
Menus control requests for:
• Categories → group by
• Metrics → aggregate functions
• Filters → where clause
• Date range
• Chart types, date aggregation
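With the data held in a pandas DataFrame, the menu-to-query mapping above might look like the sketch below. The column names (`ticket_cost`, `transaction_id`, `sale_date`), the `METRIC_AGGS` map, and the `run_request` helper are assumptions made for illustration. The sketch expects at least one category and a datetime-typed date column.

```python
import pandas as pd

# Hypothetical metric definitions: output name -> (source column, aggregate function)
METRIC_AGGS = {
    "GTS": ("ticket_cost", "sum"),
    "# Orders": ("transaction_id", "nunique"),
}


def run_request(df, categories, metrics, filters=None, start=None, end=None,
                date_col="sale_date"):
    """Apply menu selections to an in-memory extract and return the aggregate."""
    mask = pd.Series(True, index=df.index)

    # Filters play the role of the where clause.
    for col, allowed in (filters or {}).items():
        mask &= df[col].isin(allowed)

    # Date range (inclusive on both ends).
    if start is not None:
        mask &= df[date_col] >= pd.Timestamp(start)
    if end is not None:
        mask &= df[date_col] <= pd.Timestamp(end)

    # Categories play the role of group by; metrics become aggregate functions.
    aggs = {m: METRIC_AGGS[m] for m in metrics}
    return df[mask].groupby(categories, as_index=False).agg(**aggs)
```

For example, `run_request(df, ["genre"], ["GTS"], start="2014-01-01", end="2014-01-31")` would return one row per genre with the summed ticket cost for January 2014.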
Case Study: StubHub
How to provide integration between back-end data management and front-end visualization components?
Solution is Data-Driven Documents (D3.js)
• JavaScript library to drive the creation and control of dynamic and interactive graphical forms which run in web browsers
• W3C-compliant, making use of the widely implemented Scalable Vector Graphics (SVG), JavaScript, HTML5, and Cascading Style Sheets (CSS3) standards
• Large data sets can be easily bound to SVG objects using JSON and simple D3 functions to generate charts and diagrams
D3: http://d3js.org/
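On the back end, one plausible hand-off to D3 is to serialize the aggregated DataFrame as a JSON array of row objects, which is the shape D3's data binding typically consumes. The helper name `to_d3_records` is hypothetical, and the deck does not state how the JSON is actually produced.

```python
import pandas as pd


def to_d3_records(df):
    """Serialize a DataFrame as a JSON array with one object per row."""
    # orient="records" emits [{"col": value, ...}, ...]; dates become ISO 8601 strings.
    return df.to_json(orient="records", date_format="iso")
```

The front end could then fetch this payload and bind each row object to an SVG element.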
Case Study: StubHub
Summary of approach
Create a collection of BI dashboards that are:
• Fast
• Customizable
• Interactive
• Highly visual
• On-demand
• Scalable
• Consistent
Custom build EVERYTHING as needed
Leverage open source technologies whenever possible
Data source agnostic to accommodate new data stores as they become available
• Output from MapReduce jobs in CSV format