Getting Started with Hadoop

Who We Are
Mission: To help organizations profit from their data

How We Do It: We deliver relevant products and services.
• A distribution of Apache Hadoop that is tested, certified and supported
• Comprehensive support and professional service offerings
• A suite of management software for Hadoop operations
• Training and certification programs for developers, administrators, managers and data scientists

Credentials: The Apache Hadoop experts.
Technical Team: Unmatched knowledge and experience.
• Number 1 distribution of Apache Hadoop in the world
• Founders, committers and contributors to Hadoop
• Largest contributor to the open source Hadoop ecosystem
• A wealth of experience in the design and delivery of production software
• More committers on staff than any other company
• More than 100 customers across a wide variety of industries
• Strong growth in revenue and new accounts

Leadership: Strong executive team with proven abilities.
• Mike Olson, CEO
• Kirk Dunn, COO
• Charles Zedlewski, VP, Product
• Mary Rorabaugh, CFO
• Jeff Hammerbacher, Chief Scientist
• Amr Awadallah, VP Engineering
• Doug Cutting, Chief Architect
• Omer Trajman, VP, Customer Solutions

©2011 Cloudera, Inc. All Rights Reserved.

Users of Cloudera
• Financial
• Web
• Telecom
• Media
• Retail & Consumer

What is Apache Hadoop?
Hadoop is a platform for data storage and processing that is…
• Scalable
• Fault tolerant
• Open source

Core Hadoop Components
• Hadoop Distributed File System (HDFS): file sharing and data protection across physical servers
• MapReduce: distributed computing across physical servers

Flexibility
• A single repository for storing, processing and analyzing any type of data
• Not bound by a single schema

Scalability
• Scale-out architecture divides workloads across multiple nodes
• Flexible file system eliminates ETL bottlenecks
Low Cost
• Can be deployed on commodity hardware
• Open source platform guards against vendor lock-in

What Makes Hadoop Different?
• Ability to scale out to petabytes in size using commodity hardware
• Processing (MapReduce) jobs are sent to the data, rather than shipping the data to be processed
• Hadoop doesn’t impose a single data format, so it can easily handle structured, semi-structured and unstructured data
• Manages fault tolerance and data replication automatically

Why the Need for Hadoop?
• 1.8 trillion gigabytes of data was created in 2011…
• More than 90% is unstructured data
• Approx. 500 quadrillion files
• Quantity doubles every 2 years
[Chart: gigabytes of data created (in billions), 2005 to 2015, structured data vs. unstructured data. Source: IDC 2011]

Hadoop Use Cases
Advanced Analytics              Industry          Data Processing
Social Network Analysis         Web               Clickstream Sessionization
Content Optimization            Media             Clickstream Sessionization
Network Analytics               Telco             Mediation
Loyalty & Promotions Analysis   Retail            Data Factory
Fraud Analysis                  Financial         Trade Reconciliation
Entity Analysis                 Federal           SIGINT
Sequencing Analysis             Bioinformatics    Genome Mapping

Hadoop in the Enterprise
[Diagram: operators, engineers, analysts, business users and customers work through management tools, IDEs, BI / analytics, enterprise reporting and web applications; the enterprise data warehouse sits alongside; data sources are logs, files, web data and relational databases]

What is CDH?
Cloudera’s Distribution Including Apache Hadoop (CDH) is an enterprise-ready distribution of Hadoop that is…
• 100% Apache open source
• Contains all components needed for deployment
• Fully documented and supported
• Released on a reliable schedule

Fastest Path to Success
• No need to write your own scripts or do integration testing on different components
• Works with a wide range of operating systems, hardware, databases and data warehouses

Stable and Reliable
• Extensive Cloudera QA systems, software & processes
• Proven at scale in dozens of enterprise environments
• Tested & run in production at scale

Community Driven
• Incorporates only main-line components from the Apache Hadoop ecosystem, with no forks or proprietary underpinnings
• FREE

Cloudera’s Commitment to the Open Source Community
Component    Cloudera Committers   Cloudera Founder   2011 Commits (rank)
Common       6                     Yes                #1
HDFS         6                     Yes                #2
MapReduce    5                     Yes                #1
HBase        2                     No                 #2
Zookeeper    1                     Yes                #2
Oozie        1                     Yes                #1
Pig          0                     No                 #3
Hive         1                     No                 #2
Sqoop        2                     Yes                #1
Flume        3                     Yes                #1
Hue          3                     Yes                #1
Snappy       2                     No                 #1
Bigtop       8                     Yes                #1
Avro         4                     Yes                #1
Whirr        2                     Yes                #1

Components of CDH
• User Interface: Hue
• Workflow: Apache Oozie
• Scheduling: Apache Oozie
• File System Mount: FUSE-DFS
• Languages / Compilers: Apache Pig, Apache Hive
• Data Integration: Apache Flume, Apache Sqoop
• Fast Read/Write Access: Apache HBase
• Coordination: Apache ZooKeeper
[Diagram: the CDH component stack, with Cloudera Enterprise shown alongside]

Hadoop Distributed File System
• Block size = 64 MB
• Replication factor = 3
• Cost is $400-$500/TB
[Diagram: a file split into blocks 1 to 5, with copies of each block spread across HDFS nodes]

Components of Hadoop
• NameNode
  – Holds all metadata for HDFS
  – Needs to be a highly reliable machine
    • RAID drives, typically RAID 10
    • Dual power supplies
    • Dual network cards, bonded
  – The more memory the better, typically 36 GB to 64 GB
• Secondary NameNode
  – Provides checkpointing for the NameNode
  – Same hardware as the NameNode should be used

Components of Hadoop
• DataNodes
  – Hardware will depend on the specific needs of the cluster
  – No RAID needed; JBOD (just a bunch of disks) is used
  – Typical ratio is:
    • 1 hard drive
    • 2 cores
    • 4 GB of RAM

Networking
• One of the most important things to consider when setting up a Hadoop cluster
• Typically a top-of-rack switch is used with Hadoop, together with a core switch
• Be careful not to oversubscribe the backplane of the switch!

Map
• Records from the data source (lines out of files, rows of a database, etc.) are fed into the map function as key/value pairs: e.g., (filename, line).
• map() produces one or more intermediate values along with an output key from the input.
[Diagram: (key, values) inputs go to Map Tasks, which emit (key, intermediate values); the Shuffle Phase groups them by key; the Reduce Task produces the final (key, values)]

Reduce
• After the map phase is over, all the intermediate values for a given output key are combined together into a list
• reduce() combines those intermediate values into one or more final values for that same output key

[Figure: MapReduce Execution]

Sqoop
• SQL to Hadoop
• Tool to import/export any JDBC-supported database into Hadoop
• Transfers data between Hadoop and external databases or an EDW
• High performance connectors for some RDBMS
• Developed at Cloudera
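Returning to the Map and Reduce slides above, the three phases can be sketched in a few lines of plain Python. This is an in-memory illustration only, not Hadoop's API: the function names map_fn, shuffle, reduce_fn and run_job, and the sample records, are assumptions made for the example.

```python
from collections import defaultdict


def map_fn(filename, line):
    """Map: emit an intermediate (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all intermediate values by their output key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_fn(key, values):
    """Reduce: combine the list of values for one key into a final value."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence over (filename, line) records."""
    intermediate = []
    for filename, line in records:
        intermediate.extend(map_fn(filename, line))
    grouped = shuffle(intermediate)
    return dict(reduce_fn(key, values) for key, values in sorted(grouped.items()))


# Hypothetical input records in the (filename, line) form used on the Map slide
records = [
    ("a.txt", "the quick brown fox"),
    ("a.txt", "the lazy dog"),
    ("b.txt", "The fox"),
]
counts = run_job(records)
print(counts)
```

On a real cluster the map and reduce calls run as separate tasks on many nodes and the shuffle moves data over the network; the sketch only shows how the key/value pairs flow from one phase to the next.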
Flume
• Distributed, reliable, available service for efficiently moving large amounts of data as it is produced
  – Suited for gathering logs from multiple systems
  – Inserts them into HDFS as they are generated
• Design goals
  – Reliability, Scalability, Manageability, Extensibility
• Developed at Cloudera

Flume: high-level architecture
• Master sends configuration to all Agents
• Configurable levels of reliability; guaranteed delivery in the event of failure
• Deployable, centrally administered
• Optionally pre-process incoming data: perform transformations, suppressions, metadata enrichment
• Writes to multiple HDFS file formats (text, sequence, JSON, Avro, others)
• Parallelized writes across many collectors – as much write throughput as …
• Flexibly deploy decorators at any step to improve performance, reliability or security
[Diagram: Agents, Processors and Collector(s) administered by the Master, with encrypt, compress and batch decorators]

HBase
• Column-family store, based on the design of Google BigTable
• Provides interactive access to information
• Holds extremely large datasets (multi-TB)
• Constrained access model
  – (key, value) lookup
  – Limited transactions (only one row)

[Figure: HBase]

Hive
• SQL-based data warehousing application
• Language is SQL-like
  – Supports SELECT, JOIN, GROUP BY, etc.
• Features for analyzing very large data sets
  – Partition columns, sampling, buckets
Example:
    SELECT s.word, s.freq, k.freq
    FROM shakespeare s JOIN kjv k ON (s.word = k.word)
    WHERE s.freq >= 5;
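The Hive example above joins two word-frequency tables on the word column and keeps rows above a frequency threshold. A minimal sketch of those join semantics follows, run with Python's built-in sqlite3 module rather than Hive. The table names shakespeare and kjv and every row of data are assumptions made for illustration; a real Hive query would run as jobs over files in HDFS.

```python
import sqlite3

# In-memory database standing in for the Hive warehouse (assumption for the sketch)
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE shakespeare (word TEXT, freq INTEGER);
    CREATE TABLE kjv (word TEXT, freq INTEGER);
""")

# Made-up word frequencies, not real corpus counts
conn.executemany(
    "INSERT INTO shakespeare VALUES (?, ?)",
    [("love", 12), ("thou", 30), ("sword", 3), ("crown", 7)],
)
conn.executemany(
    "INSERT INTO kjv VALUES (?, ?)",
    [("love", 9), ("thou", 45), ("sword", 20), ("ark", 11)],
)

# Inner join on word, filtered on the first table's frequency
rows = conn.execute("""
    SELECT s.word, s.freq, k.freq
    FROM shakespeare s JOIN kjv k ON (s.word = k.word)
    WHERE s.freq >= 5
    ORDER BY s.word
""").fetchall()
print(rows)
```

Only words present in both tables survive the join, and "sword" is dropped by the frequency filter even though it appears in both.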
Pig
• Data-flow oriented language (“Pig Latin”)
• Datatypes include sets, associative arrays, tuples
• High-level language for routing data; allows easy integration of Java for complex tasks
Example:
    emps = LOAD 'people.txt' AS (id, name, salary);
    rich = FILTER emps BY salary > 100000;
    srtd = ORDER rich BY salary DESC;
    STORE srtd INTO 'rich_people.txt';

Oozie
• Oozie is a workflow/coordination service to manage data processing jobs for Hadoop

ZooKeeper
• ZooKeeper is a distributed consensus engine
• Provides well-defined concurrent access semantics:
  – Leader election
  – Service discovery
  – Distributed locking / mutual exclusion
  – Message board / mailboxes

Pipes and Streaming
• Multi-language connector libraries for MapReduce
• Write native-code MapReduce in C++
• Write MapReduce passes in any scripting language, including
  – Perl
  – Python

FUSE-DFS
• Allows mounting of HDFS volumes via the Linux FUSE file system
  – Does allow easy integration with other systems for data import/export
  – Does not imply HDFS can be used as a general-purpose file system

Hadoop Security
• Authentication is secured by Kerberos v5 and integrated with LDAP
• The Hadoop server can ensure that users and groups are who they say they are
• Job Control includes Access Control Lists, which means jobs can specify who can view logs, counters and configurations, and who can modify a job
• Tasks now run as the user who launched the job
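Returning to the Pig example above, each statement maps onto one step of a data flow: load, filter, order, store. The sketch below mirrors those four steps in plain Python to show what the script computes. It is not how Pig executes (Pig compiles the flow into MapReduce jobs); the helper names load and store and the sample rows are assumptions, and the input is assumed to be tab-separated, which is the default delimiter for Pig's PigStorage loader.

```python
import csv
import io

# Hypothetical contents of people.txt: tab-separated id, name, salary
PEOPLE_TXT = "1\tAda\t150000\n2\tBob\t90000\n3\tCy\t120000\n"


def load(text):
    """LOAD: parse tab-separated lines into (id, name, salary) tuples."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return [(int(i), name, int(salary)) for i, name, salary in reader]


def store(rows):
    """STORE: serialize tuples back to tab-separated text."""
    out = io.StringIO()
    csv.writer(out, delimiter="\t", lineterminator="\n").writerows(rows)
    return out.getvalue()


emps = load(PEOPLE_TXT)
rich = [e for e in emps if e[2] > 100000]              # FILTER emps BY salary > 100000
srtd = sorted(rich, key=lambda e: e[2], reverse=True)  # ORDER rich BY salary DESC
result = store(srtd)                                   # STORE srtd INTO ...
print(result)
```

Each intermediate name (emps, rich, srtd) holds a complete relation, which is the sense in which the slide calls Pig a data-flow language.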
Cloudera Enterprise
Cloudera Enterprise makes open source Hadoop enterprise-easy
• Simplify and Accelerate Hadoop Deployment
• Reduce Adoption Costs and Risks
• Lower the Cost of Administration
• Increase the Transparency and Control of Hadoop
• Leverage the Experience of Our Experts

Cloudera Enterprise Components
• Cloudera Manager: End-to-End Management Application for Apache Hadoop
• Production-Level Support: Our Team of Experts On Call to Help You Meet Your SLAs

• Effectiveness: Ensuring You Get Value From Your Hadoop Deployment
• Efficiency: Enabling You to Affordably Run Hadoop in Production

Cloudera Manager
• The industry’s first … for Apache Hadoop
• … the Apache Hadoop stack: HDFS, MapReduce, HBase, ZooKeeper, Oozie, Hue
• Automates the … of Apache Hadoop
[Diagram: Discover, Diagnose, Act, Optimize]

Cloudera Enterprise Including Cloudera Support
Feature: Benefit
• Flexible Support Windows: Choose from 8x5 or 24x7 options to meet SLA requirements
• Configuration Checks: Verify that your Hadoop cluster is fine-tuned for your environment
• Issue Resolution and Escalation Processes: Proven processes ensure that support cases get resolved with maximum efficiency
• Comprehensive Knowledgebase: Browse through hundreds of Articles and Tech Notes to expand your knowledge of Apache Hadoop
• Certified Connectors: Connect your Apache Hadoop cluster to your existing data analysis tools such as IBM Netezza and Revolution Analytics
• Notification of New Developments and Events: Stay up to speed with what’s going on in the Apache Hadoop community
Cloudera University
Public and Private Training to Enable Your Success
Class: Description
• Developer Training & Certification (4 Days): Hands-on training and certification for developers who want to analyze their data but are new to Apache Hadoop
• System Administrator Training & Certification (3 Days): Hands-on training and certification for administrators who will be responsible for setting up, configuring and monitoring an Apache Hadoop cluster
• HBase Training (2 Days): Covers the HBase architecture, data model, and Java API as well as some advanced topics and best practices
• Analyzing Data with Hive and Pig (2 Days): Hive and Pig training is designed for people who have a basic understanding of how Apache Hadoop works and want to use these languages to analyze their data
• Essentials for Managers (1 Day): Provides decision-makers the information they need to know about Apache Hadoop, answering questions such as “when is Hadoop appropriate?”, “what are people using Hadoop for?” and “what do I need to know about choosing Hadoop?”

Cloudera Consulting Services
Put Our Expertise To Work For You.
Cloudera’s team of Solutions Architects provides guidance and hands-on expertise to address unique enterprise challenges.
Service: Description
• Use Case Discovery: Assess the appropriateness and value of Hadoop for your organization
• New Hadoop Deployment: Set up and configure high performance, production-ready Hadoop clusters
• Proof of Concept: Verify the prototype functionality and project feasibility for a new Hadoop cluster
• Production Pilot: Deploy your first production-level project using Hadoop
• Process and Team Development: Define the requirements and processes for creating a new Hadoop team
• Hadoop Deployment Certification: Perform periodic health checks to certify and tune up existing Hadoop clusters
Journey of the Cloudera Customer
• Discover the Benefits of Apache Hadoop: Flexibility to store and mine all types of data
• Cloudera’s Distribution: The fastest, surest path to success with Apache Hadoop
• Subscribe to Cloudera Enterprise: Simplify and accelerate Apache Hadoop deployment

Cloudera in Production
[Diagram labels: Cloudera Services (Consulting Services, Cloudera University); Operators, Engineers, Analysts, Business Users, Customers; Management Tools, IDEs, BI / Analytics, Enterprise Reporting, Web Application; Cloudera Enterprise, Cloudera Management Suite, Cloudera Support; Enterprise Data Warehouse; Cloudera’s Distribution Including Apache Hadoop (CDH) & SCM Express; Operational Rules Engines; Logs, Files, Web Data, Relational Databases]

Get Hadoop
Cloudera helps you profit from all your data.
+1 (888) 789-1488
sales@cloudera.com
cloudera.com
twitter.com/cloudera
facebook.com/cloudera

Cloudera Manager
The … Hadoop management application that:
• Manages the …
• Manages and monitors the …
• Incorporates comprehensive …
• Has built-in …

Cloudera Manager: Key … and …
• Installs the complete Hadoop stack in minutes. The simple, wizard-based interface guides you through the steps.
• Gives you complete, end-to-end visibility and control over your Hadoop cluster from a single interface
• Set server roles, configure services and manage security across the cluster
• Gracefully start, stop and restart services as needed
• Maintains a complete record of configuration changes for SOX compliance
• Monitors dozens of service performance metrics and alerts you when you approach critical thresholds
• Gather, view and search Hadoop logs collected from across the cluster
• Scans Hadoop logs for irregularities and warns you before they impact the cluster
[Badge shown beside several of these features: ONLY CLOUDERA]
Cloudera Manager: Key … and …
• Establishes the time context globally for almost all views
• Correlates jobs, activities, logs, system changes, configuration changes and service metrics along a single timeline to simplify diagnosis
• Takes a snapshot of the cluster state and automatically sends it to Cloudera support to assist with resolution
• Creates and aggregates relevant Hadoop events pertaining to system health, log messages, user services and activities and makes them available for alerting and searching
• Generates email alerts when certain events occur
• Visualize current and historical disk usage by user, group and directory
• Track MapReduce activity on the cluster by job or user
• View information pertaining to hosts in your cluster including status, resident memory, virtual memory and roles
[Badge shown beside several of these features: ONLY CLOUDERA]

Two Editions: Free Edition and Enterprise Edition**
Max Number of Nodes Supported: 50 (Free Edition), Unlimited (Enterprise Edition)
Features compared:
• Automated Deployment
• Host-Level Monitoring
• Secure Communication Between Server & Agents
• Configuration Management
• Manage HDFS, MapReduce, HBase, Hue, Oozie & ZooKeeper
• Audit Trails
• Start/Stop/Restart Services
• Add/Restart/Decommission Role Instances
• Configuration Versioning & History
• Support for Kerberos
• Service Monitoring
• Proactive Health Checks
• Status & Health Summary
• Intelligent Log Management
• Events Management & Alerts
• Activity Monitoring
• Operational Reporting
• Global Time Control
• Support Integration
** Part of the Cloudera Enterprise subscription

[Screenshot: View Service Health and Performance]
[Screenshot: Get Host-Level Snapshots]
[Screenshot: Monitor and Diagnose Cluster Workloads]
[Screenshot: Gather, View and Search Hadoop Logs]
[Screenshot: Track Events From Across the Cluster]
[Screenshot: Run Reports on System Performance & Usage]

New in Cloudera Manager 3.7
• Proactive Health Checks: Monitors dozens of service performance metrics and alerts you when you approach critical thresholds
• Intelligent Log Management: Gathers and scans Hadoop logs for irregularities and warns you before they impact the cluster
• Global Time Control: Correlates jobs, activities, logs, system changes, configuration changes and service metrics along a single timeline to simplify diagnosis
• Support Integration: Takes a snapshot of the cluster state and automatically sends it to Cloudera support to assist with resolution
• Event Management: Creates and aggregates relevant Hadoop events pertaining to system health, log messages, user services and activities and makes them available for alerting and searching
• Alerts: Generates email alerts when certain events occur
• Audit Trails: Maintains a complete record of configuration changes for SOX compliance
• Operational Reporting: Visualize current and historical disk usage by user, group and directory and track MapReduce activity on the cluster by job or user
[Badge shown beside several of these features: ONLY CLOUDERA]
Cloudera Support
Our team of experts on call to help you meet your SLAs
Feature: Benefit
• Flexible Support Windows: Choose from 8x5 or 24x7 options to meet SLA requirements
• Configuration Checks: Verify that your Hadoop cluster is fine-tuned for your environment
• Issue Resolution and Escalation Processes: Proven processes ensure that support cases get resolved with maximum efficiency
• Comprehensive Knowledgebase: Browse through hundreds of Articles and Tech Notes to expand your knowledge of Apache Hadoop
• Certified Connectors: Connect your Apache Hadoop cluster to your existing data analysis tools such as IBM Netezza, Revolution Analytics, and MicroStrategy
• Proactive Notification of New Developments and Events: Stay up to speed with what’s going on in the Apache Hadoop community

Cloudera Enterprise
The Fastest Path to Success Running Apache Hadoop in Production.

Why Cloudera Enterprise?
• Apache Hadoop is a distributed system that presents unique operational challenges
• The fixed cost of managing an internal patch and release infrastructure is prohibitive
• Apache Hadoop skills and expertise are scarce
• It’s challenging to keep consistent track of community development efforts

Only Cloudera Enterprise
• Has a management application that supports the full lifecycle of operationalizing Apache Hadoop
• Has production support backed by the Apache committers
• Has the depth of experience supporting hundreds of production Apache Hadoop clusters