Hadoop Update
Big Data Analytics
May 23rd, 2012
Matt Mead, Cloudera

What is Hadoop?
Apache Hadoop is an open source platform for data storage and processing that is:
• Scalable
• Fault tolerant
• Distributed

Core Hadoop system components:
• Hadoop Distributed File System (HDFS) – self-healing, high-bandwidth clustered storage
• MapReduce – distributed computing framework
Together they provide storage and computation in a single, scalable system.

Why Use Hadoop?
Move beyond rigid legacy frameworks.
1. Hadoop handles any data type, in any quantity.
   • Structured, unstructured
   • Schema, no schema
   • High volume, low volume
   • All kinds of analytic applications
2. Hadoop grows with your business.
   • Proven at petabyte scale
   • Capacity and performance grow simultaneously
   • Leverages commodity hardware to mitigate costs
3. Hadoop helps you derive the complete value of all your data.
   • Drives revenue by extracting value from data that was previously out of reach
   • Controls costs by storing data more affordably than any other platform
Hadoop is 100% Apache® licensed and open source:
• No vendor lock-in
• Community development
• Rich ecosystem of related projects

The Need for CDH
1. The Apache Hadoop ecosystem is complex
   – Many different components – lots of moving parts
   – Most companies require more than just HDFS and MapReduce
   – Creating a Hadoop stack is time-consuming and requires specific expertise:
     • Component and version selection
     • Integration (internal & external)
     • System test with end-to-end workflows
2. Enterprises consume software in a certain way
   – System, not silo
   – Tested and stable
   – Documented and supported
   – Predictable release schedule

Core Values of CDH
A set of open source components with everything you need for production use.

Components of the CDH Stack
• File System Mount – FUSE-DFS
• UI Framework – HUE
• SDK – HUE SDK
• Workflow – APACHE OOZIE
• Scheduling – APACHE OOZIE
• Metadata – APACHE HIVE
• Languages / Compilers – APACHE PIG, APACHE HIVE, APACHE MAHOUT
• Data Integration – APACHE FLUME, APACHE SQOOP
• Fast Read/Write Access – APACHE HBASE
• Coordination – APACHE ZOOKEEPER
• Storage & Computation – HDFS, MAPREDUCE

The CDH Stack: A Set of Open Source Components
• HDFS – Distributed, scalable, fault-tolerant file system
• MapReduce – Parallel processing framework for large data sets
• Apache Hive – SQL-like language and metadata repository
• Apache Pig – High-level language for expressing data analysis programs
• Apache HBase – Hadoop database for random, real-time read/write access
• Apache Oozie – Server-based workflow engine for Hadoop activities
• Apache ZooKeeper – Highly reliable distributed coordination service
• Apache Sqoop – Tool for integrating Hadoop with relational databases (RDBMS)
• Apache Flume – Distributed service for collecting and aggregating log and event data
• Fuse-DFS – Module within Hadoop for mounting HDFS as a traditional file system
• Apache Mahout – Library of machine learning algorithms for Apache Hadoop
• Hue – Browser-based desktop interface for interacting with Hadoop
• Apache Whirr – Library for running Hadoop in the cloud

Core Hadoop Use Cases
Two core use cases, (1) data processing and (2) advanced analytics, applied across verticals:

Industry Term (Advanced Analytics) | Vertical       | Industry Term (Data Processing)
Social Network Analysis            | Web            | Clickstream Sessionization
Content Optimization               | Media          | Engagement
Network Analytics                  | Telco          | Mediation
Loyalty & Promotions Analysis      | Retail         | Data Factory
Fraud Analysis                     | Financial      | Trade Reconciliation
Entity Analysis                    | Federal        | SIGINT
Sequencing Analysis                | Bioinformatics | Genome Mapping

FMV & Image Processing
Data Processing – Full Motion Video & Image Processing
• Record-by-record processing -> easy parallelization
  – The "unit of work" is important
  – Raw data lands in HDFS
• Adaptation of existing image analyzers to Map-Only / MapReduce jobs (see the sketch after this slide)
• Scales horizontally
• Simple detections:
  – Vehicles
  – Structures
  – Faces
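Sketch: Adapting an Analyzer to a Map-Only Job

To make the map-only adaptation above concrete, here is a minimal sketch using the org.apache.hadoop.mapreduce API. Everything specific in it is assumed for illustration: ImageAnalysisDriver, ImageAnalyzerMapper, and the detect() placeholder are hypothetical names, not part of Hadoop or CDH. The point is structural: wrap an existing record-by-record analyzer in a Mapper and set the reducer count to zero, and the job scales horizontally with no shuffle or reduce phase.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ImageAnalysisDriver {

  // Hypothetical wrapper: each map() call hands one input record (e.g. a
  // frame or image reference) to an existing analyzer and emits detections.
  public static class ImageAnalyzerMapper
      extends Mapper<Object, Text, Text, NullWritable> {

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      // detect() stands in for the existing image-analysis library call.
      String detection = detect(value.toString());
      if (detection != null) {
        context.write(new Text(detection), NullWritable.get());
      }
    }

    private String detect(String record) {
      // Placeholder for vehicle/structure/face detection logic.
      return record.isEmpty() ? null : "DETECTION\t" + record;
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "image-analysis");
    job.setJarByClass(ImageAnalysisDriver.class);
    job.setMapperClass(ImageAnalyzerMapper.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);
    job.setNumReduceTasks(0); // map-only: no shuffle, no reduce phase
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Because each record is independent, the "unit of work" is simply whatever one map() call receives; choosing an input format whose records match the analyzer's natural unit (a frame, an image reference, a file) is the main adaptation decision.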
Cybersecurity Analysis
Advanced Analytics – Cybersecurity Analysis
• Rates and flows – ingest can exceed several gigabytes per second
• Can be complex because of mixed-workload clusters
• Typically involves ad-hoc analysis
  – Question-oriented analytics
  – "Productionized" use cases allow insight by non-analysts
• Existing open source solution: SHERPASURFING
  – Focuses on the cybersecurity analysis underpinnings for common data sets (pcap, NetFlow, audit logs, etc.)
  – Provides a means to ask questions without reinventing all the plumbing

Index Preparation
Data Processing – Index Preparation
• Hadoop's seminal use case
• Dynamic partitioning -> easy parallelization
• String interning
• Inverted index construction (see the sketch at the end of this deck)
• Dimensional data capture
• Destination indices:
  – Lucene/Solr (and derivatives)
  – Endeca
• Existing solution: USA Search (http://usasearch.howto.gov/)

Data Landing Zone
Data Processing – Schema-less Enterprise Data Warehouse / Landing Zone
• Begins as storage, light ingest processing, and retrieval
• Capacity scales horizontally
• Schema-less -> holds arbitrary content
• Schema-less -> allows ad-hoc fusion and analysis
• Additional analytic workload forces decisions

Hadoop: Getting Started
• Reactive
  – Forced by scale or the cost of scaling
• Proactive
  – Seek talent ahead of the need to build
  – Identify data sets
  – Determine high-value use cases that change organizational outcomes
  – Start with 10-20 nodes and 10+ TB unless data sets are super-dimensional
• Either way
  – Talent is a major challenge
  – Start with "data processing" use cases
  – Physical infrastructure is complex; make the software infrastructure simple to manage

Customer Success
Self-Source Deployment vs. Cloudera Enterprise – 500-node deployment
• Option 1: Use Cloudera Enterprise – estimated cost $2 million, deployment time ~2 months
• Option 2: Self-source – estimated cost $4.8 million, deployment time ~6 months
[Chart: cost ($ millions) vs. time required for production deployment (months) for the two options.]
Note: Cost estimates include personnel, software & hardware. Source: Cloudera internal estimates.

Customer Success
Cloudera Enterprise Subscription vs. Self-Source

Item                                              | Cloudera Enterprise                                         | Self-Source or Contract
Support Offering                                  | World-class, global, dedicated contributors and committers | Must recruit, hire, train and retain Hadoop experts
Monitoring and Management                         | Fully integrated application for Hadoop intelligence       | Must be developed and maintained in house
Support for the Full Hadoop Stack                 | Full stack*                                                 | Unknown
Regular Scheduled Releases                        | Yearly major, quarterly minor, hot fixes                    | N/A
Training and Certification for the Full Stack     | Available worldwide                                         | None
Support for Full Lifecycle                        | All-inclusive, development through production               | Community support
Rich Knowledge Base                               | 500+ articles                                               | None
Production Solution Guides                        | Included                                                    | None

* Flume, FuseDFS, HBase, HDFS, Hive, Hue, Mahout, MR1, MR2, Oozie, Pig, Sqoop, ZooKeeper

Contact Us
• Erin Hawley – Business Development, Cloudera DoD Engagement – ehawley@cloudera.com
• Matt Mead – Sr. Systems Engineer, Cloudera Federal Engagements – mmead@cloudera.com
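Backup – Sketch: Inverted Index Construction in MapReduce

As a concrete illustration of the inverted-index construction listed under "Index Preparation", the minimal sketch below builds a posting list per term in a single MapReduce pass. The class names, the tokenization, and the use of the input file name as a document id are illustrative assumptions, not the USA Search implementation or any CDH component.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class InvertedIndex {

  // Emits a (term, documentId) pair for every term in every input line.
  public static class TermMapper extends Mapper<Object, Text, Text, Text> {
    private final Text term = new Text();
    private final Text docId = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      // Use the source file name as a stand-in for a document identifier.
      docId.set(((FileSplit) context.getInputSplit()).getPath().getName());
      for (String token : value.toString().toLowerCase().split("\\W+")) {
        if (!token.isEmpty()) {
          term.set(token);
          context.write(term, docId);
        }
      }
    }
  }

  // Concatenates all document ids for a term into one posting list.
  // Duplicates are left in for brevity; a real job would deduplicate/count.
  public static class PostingReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      StringBuilder postings = new StringBuilder();
      for (Text v : values) {
        if (postings.length() > 0) postings.append(',');
        postings.append(v.toString());
      }
      context.write(key, new Text(postings.toString()));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "inverted-index");
    job.setJarByClass(InvertedIndex.class);
    job.setMapperClass(TermMapper.class);
    job.setReducerClass(PostingReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The shuffle does the real work here: grouping by term is exactly the inversion, which is why this use case parallelizes so easily. A production job would also deduplicate and count term occurrences and write output in a form that downstream Lucene/Solr or Endeca ingest tooling can consume.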