Slides PDF - Spark Summit

Using Spark and Shark for
Fast Cycle Analysis on Diverse Data
Vaibhav Nivargi
12.2.13
clearstorydata.com
About ClearStory Data
Analysis in the New Data Landscape
New use cases seen in all industries.
• Live situational analysis requiring fast-cycle
analysis across internal data and sources of
external data
• Multi-source analysis with data refreshing on
new insights, as data from sources evolves
• Large-scale analysis of structured and
unstructured data combined in integrated
insights
Example: Interactive Multi-source Analysis
More data and more people change the analysis.
Data Intelligence: interactive analysis on diverse internal & external data
• News Coverage: Online, Print, Television
• Donations: New Members, Donations
• Website Traffic: Traffic, Referrals, Content
• Facebook: Shares, Likes, Comments
• Twitter: Followers, Tweets, Retweets
• Corporate Sponsors: Corporate Engagement, New Inquiries
Today’s Need is Speed, Scale & Ad Hoc Flexibility
With more sources, more data and more people.
Why Spark and Shark?
• RDDs
– Low latency & scale
– Iterative and Interactive computation
• Lineage and fault tolerance
– Able to re-derive data
• Expressive power of Scala and SQL
– Operations beyond aggregations, joins, and statistical operators
– Advanced: ML, data mining, segmentation, approximate queries,
graphs …
• Support for structured and semi-structured data
• BDAS Stack & AMPLab
– Tachyon, MLBase, BlinkDB, GraphX …
• Community and adoption
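The lineage and fault-tolerance points above can be illustrated with a small sketch. It is plain Python, not Spark: the class `ToyRDD` and its methods `evict` and `computations` are hypothetical names made up for this example. The sketch only shows the idea that a derived dataset remembers how it was built, so a lost in-memory copy can be re-derived rather than restored from a replica.

```python
from typing import Callable, List, Optional


class ToyRDD:
    """Toy stand-in for an RDD (hypothetical; this is not Spark's API)."""

    def __init__(self,
                 source: Optional[List[int]] = None,
                 parent: Optional["ToyRDD"] = None,
                 fn: Optional[Callable[[List[int]], List[int]]] = None):
        self.source = source    # set only on the root dataset
        self.parent = parent    # lineage: where this dataset came from
        self.fn = fn            # lineage: how it was derived from the parent
        self.cached: Optional[List[int]] = None
        self.computations = 0   # how many times this dataset was (re)derived

    def map(self, f):
        return ToyRDD(parent=self, fn=lambda rows: [f(r) for r in rows])

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda rows: [r for r in rows if pred(r)])

    def cache(self):
        # Materialize once and keep the result in memory.
        self.cached = self.collect()
        return self

    def evict(self):
        # Simulate losing the in-memory copy (for example, a failed node).
        self.cached = None

    def collect(self):
        if self.cached is not None:
            return list(self.cached)
        self.computations += 1
        if self.parent is None:
            return list(self.source)
        # Walk the lineage: recompute from the parent's data.
        return self.fn(self.parent.collect())


base = ToyRDD(source=[1, 2, 3, 4, 5, 6])
evens_squared = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x).cache()

first = evens_squared.collect()    # served from the cached copy
evens_squared.evict()              # the cached copy is lost
second = evens_squared.collect()   # re-derived by replaying the lineage
```

Both `first` and `second` hold the same rows; the second call simply pays the recomputation cost again.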
The ClearStory Solution
• Data Sources
• ClearStory Platform
– Harmonization
– Data Inference & Profiling
– In-Memory Data Units
• ClearStory Application
– Visualization
– Collaboration
Where do Spark & Shark fit?
User Application
ClearStory API
Harmonization Engine and Blended Data Processing
Spark Cluster + ClearStory IP
Data Access, Inference and Lineage
Data Source API
Data sources: RDBMS, Hadoop, Files, Public Web, Premium
How we leverage Spark & Shark
• User intent captured and translated to custom API
• Harmonization-as-a-Service
• Manages Spark and Shark query execution
• Reads cached data from HDFS
• RESTful
• Merges datasets (RDDs) on the fly, on user request
• Supports conversion of user actions to backend queries
• Query optimizations
• Performance optimizations
• Mixed-mode execution (sql2rdd & Spark native)
• Caching
• Pre-computation
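As a rough sketch of the mixed-mode and on-the-fly merge bullets above: one step is expressed in SQL, and its result is then processed further with ordinary functions. Everything here is an assumption for illustration. Python's built-in `sqlite3` stands in for Shark, plain lists stand in for RDDs, `merge_by_key` is a hypothetical helper, and the rows are made-up sample values. It does not show the Shark or ClearStory API.

```python
import sqlite3

# Made-up sample rows: (day, donation amount) from an internal source.
internal_rows = [("day-1", 120), ("day-1", 80), ("day-2", 50)]
# Made-up sample rows: (day, tweet count) from an external source.
external_rows = [("day-1", 900), ("day-2", 300)]

# Step 1 (SQL mode): aggregate with SQL. sqlite3 stands in for Shark here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE donations (day TEXT, amount INTEGER)")
conn.executemany("INSERT INTO donations VALUES (?, ?)", internal_rows)
sql_result = conn.execute(
    "SELECT day, SUM(amount) FROM donations GROUP BY day ORDER BY day"
).fetchall()
conn.close()


# Step 2 (native mode): keep working on the SQL result with plain functions,
# here merging it with the external dataset by key.
def merge_by_key(left, right):
    right_index = dict(right)
    return [(key, value, right_index[key])
            for key, value in left if key in right_index]


blended = merge_by_key(sql_result, external_rows)
```

The design point is that the SQL result is not a dead end: it stays available as a dataset that later, non-SQL steps can join, filter, or feed into other computations.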
How we leverage Spark & Shark (continued)
• Query results returned to the application for
scalable visualization and ClearStory-specific viz
techniques
• RDDs cached/un-cached and materialized at
strategic points based on usage patterns and
signals
• Data updates automatically processed as source
data changes
• ClearStory’s own deployment, packaging, and
integrated monitoring for operations at scale
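The bullet on caching and un-caching RDDs based on usage patterns could look something like the sketch below. The slides do not say which signals or policy ClearStory uses, so this is only one possible policy, chosen for illustration: cache a dataset once it has been requested `threshold` times, and drop the least-requested cached dataset when `capacity` is reached. The class `UsageDrivenCache` and all its parameters are hypothetical.

```python
from collections import Counter


class UsageDrivenCache:
    """Hypothetical policy: cache only datasets that are requested often."""

    def __init__(self, compute, threshold=2, capacity=2):
        self.compute = compute      # function that derives a dataset by name
        self.threshold = threshold  # requests needed before caching
        self.capacity = capacity    # maximum number of cached datasets
        self.hits = Counter()       # usage signal: request count per dataset
        self.store = {}             # the cached datasets

    def get(self, key):
        self.hits[key] += 1
        if key in self.store:
            return self.store[key]
        value = self.compute(key)
        if self.hits[key] >= self.threshold:
            if len(self.store) >= self.capacity:
                # Un-cache the dataset with the fewest requests so far.
                coldest = min(self.store, key=lambda k: self.hits[k])
                del self.store[coldest]
            self.store[key] = value
        return value


calls = []


def expensive(name):
    # Stand-in for an expensive derivation; records each real computation.
    calls.append(name)
    return name.upper()


cache = UsageDrivenCache(expensive, threshold=2, capacity=1)
cache.get("sales")            # computed; requested once, not cached yet
cache.get("sales")            # computed again; now cached
third = cache.get("sales")    # served from the cache, no recomputation
```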
Spark Developments – What We Like
• Query cancellation, progress indication (0.8.1 and
beyond)
• More performance breakthroughs
• Workload Management
• BlinkDB
• MLBase
• Tachyon
• GraphX
We’re Hiring!
• Working with the community, giving back
• Lots of exciting new developments
• This is like the early days of Hadoop – massive
momentum gathering
The First Spark Summit!
More Meet-ups!