Using Spark and Shark for Fast Cycle Analysis on Diverse Data
Vaibhav Nivargi
12.2.13
clearstorydata.com

About ClearStory Data

Analysis in the New Data Landscape
New use cases seen in all industries.
• Live situational analysis requiring fast-cycle analysis across internal data and sources of external data
• Multi-source analysis with data refreshing on new insights, as data from sources evolves
• Large-scale analysis of structured and unstructured data combined in integrated insights

Example: Interactive Multi-source Analysis
More data and more people change the analysis.
Data Intelligence: interactive analysis on diverse internal & external data
• News Coverage: Online, Print, Television
• Donations: New Members, Donations
• Website Traffic: Traffic, Referrals, Content
• Facebook: Shares, Likes, Comments
• Twitter: Followers, Tweets, Retweets
• Corporate Sponsors: Corporate Engagement, New Inquiries

Today’s Need is Speed, Scale & Ad Hoc Flexibility
With more sources, more data and more people.

Why Spark and Shark?
• RDDs
  – Low latency & scale
  – Iterative and interactive computation
• Lineage and fault tolerance
  – Able to re-derive data
• Expressive power of Scala and SQL
  – Operations beyond aggregations, joins, and statistical operators
  – Advanced: ML, data mining, segmentation, approximate queries, graphs …
• Support for structured and semi-structured data
• BDAS Stack & AMPLab
  – Tachyon, MLBase, BlinkDB, GraphX …
• Community and adoption

The ClearStory Solution
• Data Sources
• ClearStory Platform
• ClearStory Application
• Harmonization
• Data Inference & Profiling
• In-Memory Data Units
• Visualization
• Collaboration

Where do Spark & Shark fit?
Stack layers, as listed on the slide:
• User Application
• ClearStory API
• Harmonization Engine and Blended Data Processing
• Spark Cluster + ClearStory IP
• Data Access, Inference and Lineage
• Data Source API
• Sources: RDBMS, Hadoop, Files, Public Web, Premium

How we leverage Spark & Shark
• User intent captured and translated to custom API
• Harmonization-as-a-Service
• Manages Spark and Shark query execution
• Read cached data from HDFS
• RESTful
• Merges datasets (RDDs) on the fly – on user request
• Support conversion of user actions to backend queries
• Query optimizations
• Performance optimizations
• Mixed-mode execution (sql2rdd & Spark native)
• Caching
• Pre-computation

How we leverage Spark & Shark
• Query results returned to the application for scalable visualization and ClearStory-specific viz techniques
• RDDs cached/un-cached and materialized at strategic points based on usage patterns and signals
• Data updates automatically processed as source data changes
• ClearStory’s own deployment, packaging, and integrated monitoring for operations at scale

Spark Developments – What We Like
• Query cancellation, progress indication (0.8.1 and beyond)
• More performance breakthroughs
• Workload Management
• BlinkDB
• MLBase
• Tachyon
• GraphX

We’re Hiring!
• Working with the community, giving back
• Lots of exciting new developments
• This is like the early days of Hadoop – massive momentum gathering

The First Spark Summit! More Meet-ups!
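
Appendix note: the slides mention lineage ("able to re-derive data") and RDDs being cached or un-cached at strategic points. The toy sketch below illustrates that idea in plain Python. It is not Spark's API and not ClearStory's implementation; the class `LineageDataset` and its methods `cache`, `evict`, and `collect` are hypothetical names chosen for illustration only.

```python
class LineageDataset:
    """Toy stand-in for an RDD (hypothetical, not the Spark API).

    Each dataset remembers its parent and the transformation that
    produced it, so its rows can be re-derived if a cached copy is lost.
    """

    def __init__(self, source=None, parent=None, transform=None):
        self._source = source        # raw rows, set only on a root dataset
        self._parent = parent        # upstream dataset in the lineage chain
        self._transform = transform  # function: list of rows -> list of rows
        self._persist = False        # whether to keep computed rows in memory
        self._cache = None           # cached rows, or None if not materialized
        self.compute_count = 0       # how many times rows were (re)computed

    def map(self, fn):
        return LineageDataset(parent=self,
                              transform=lambda rows: [fn(r) for r in rows])

    def filter(self, pred):
        return LineageDataset(parent=self,
                              transform=lambda rows: [r for r in rows if pred(r)])

    def cache(self):
        # Mark this dataset for in-memory reuse; rows are stored on first collect().
        self._persist = True
        return self

    def evict(self):
        # Simulate losing the cached copy (un-caching or a node failure).
        self._cache = None

    def collect(self):
        if self._cache is not None:
            return list(self._cache)
        # Re-derive from lineage: walk up to the parent and re-apply the transform.
        if self._parent is None:
            rows = list(self._source)
        else:
            rows = self._transform(self._parent.collect())
        self.compute_count += 1
        if self._persist:
            self._cache = rows
        return list(rows)


base = LineageDataset(source=[1, 2, 3, 4, 5, 6])
evens = base.filter(lambda x: x % 2 == 0).cache()
squares = evens.map(lambda x: x * x)

squares.collect()   # computes evens once and caches it
squares.collect()   # evens is served from the cache
evens.evict()       # cached rows are lost
squares.collect()   # evens is re-derived from base via its lineage
```

In this sketch, caching `evens` avoids recomputing the filter on repeated queries, and evicting it costs only a recomputation, not the data, which is the trade-off behind caching at strategic points based on usage.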