Getting Started with Hadoop
Who We Are
Mission: To help organizations profit from their data
How We Do It: We deliver relevant products and services.
• A distribution of Apache Hadoop that is tested, certified and supported
• Comprehensive support and professional service offerings
• A suite of management software for Hadoop operations
• Training and certification programs for developers, administrators, managers and data scientists
Credentials: The Apache Hadoop experts.
• Number 1 distribution of Apache Hadoop in the world
• Largest contributor to the open source Hadoop ecosystem
• More than 100 customers across a wide variety of industries
• Strong growth in revenue and new accounts
Technical Team: Unmatched knowledge and experience.
• Founders, committers and contributors to Hadoop
• A wealth of experience in the design and delivery of production software
• More committers on staff than any other company
Leadership: Strong executive team with proven abilities.
©2011 Cloudera, Inc. All Rights Reserved.
Mike Olson – CEO
Kirk Dunn – COO
Charles Zedlewski – VP, Product
Mary Rorabaugh – CFO
Jeff Hammerbacher – Chief Scientist
Amr Awadallah – VP Engineering
Doug Cutting – Chief Architect
Omer Trajman – VP, Customer Solutions
Users of Cloudera
Industries represented: Financial, Web, Telecom, Media, Retail & Consumer
What is Apache Hadoop?
Hadoop is a platform for data storage and processing that is:
• Scalable
• Fault tolerant
• Open source
CORE HADOOP COMPONENTS
• Hadoop Distributed File System (HDFS): File Sharing & Data Protection Across Physical Servers
• MapReduce: Distributed Computing Across Physical Servers
Flexibility
• A single repository for storing, processing & analyzing any type of data
• Not bound by a single schema
Scalability
• Scale-out architecture divides workloads across multiple nodes
• Flexible file system eliminates ETL bottlenecks
Low Cost
• Can be deployed on commodity hardware
• Open source platform guards against vendor lock-in
What Makes Hadoop Different?
• Ability to scale out to petabytes in size using commodity hardware
• Processing (MapReduce) jobs are sent to the data, rather than shipping the data to be processed
• Hadoop doesn’t impose a single data format, so it can easily handle structured, semi-structured and unstructured data
• Manages fault tolerance and data replication automatically
Why the Need for Hadoop?
1.8 trillion gigabytes of data was created in 2011…
• More than 90% is unstructured data
• Approx. 500 quadrillion files
• Quantity doubles every 2 years
[Chart: gigabytes of data created (in billions), 2005 to 2015, structured vs. unstructured data. Source: IDC 2011]
Hadoop Use Cases
Advanced Analytics Application | Industry | Data Processing Application
Social Network Analysis | Web | Clickstream Sessionization
Content Optimization | Media | Clickstream Sessionization
Network Analytics | Telco | Mediation
Loyalty & Promotions Analysis | Retail | Data Factory
Fraud Analysis | Financial | Trade Reconciliation
Entity Analysis | Federal | SIGINT
Sequencing Analysis | Bioinformatics | Genome Mapping
Hadoop in the Enterprise
Users and their tools:
• OPERATORS: Management Tools
• ENGINEERS: IDEs
• ANALYSTS: BI / Analytics
• BUSINESS USERS: Enterprise Reporting
• CUSTOMERS: Web Application
Connected systems: Enterprise Data Warehouse
Data sources: Logs, Files, Web Data, Relational Databases
What is CDH?
Cloudera’s Distribution Including Apache Hadoop (CDH) is an enterprise-ready distribution of Hadoop that is:
• 100% Apache open source
• Contains all components needed for deployment
• Fully documented and supported
• Released on a reliable schedule
Fastest Path to Success
• No need to write your own scripts or do integration testing on different components
• Works with a wide range of operating systems, hardware, databases and data warehouses
Stable and Reliable
• Extensive Cloudera QA systems, software & processes
• Proven at scale in dozens of enterprise environments
• Tested & run in production at scale
Community Driven
• Incorporates only main-line components from the Apache Hadoop ecosystem – no forks or proprietary underpinnings
• FREE
Cloudera’s Commitment to the Open Source Community
Component | Cloudera Committers | Cloudera Founder | 2011 Commits (rank)
Common | 6 | Yes | #1
HDFS | 6 | Yes | #2
MapReduce | 5 | Yes | #1
HBase | 2 | No | #2
ZooKeeper | 1 | Yes | #2
Oozie | 1 | Yes | #1
Pig | 0 | No | #3
Hive | 1 | No | #2
Sqoop | 2 | Yes | #1
Flume | 3 | Yes | #1
Hue | 3 | Yes | #1
Snappy | 2 | No | #1
Bigtop | 8 | Yes | #1
Avro | 4 | Yes | #1
Whirr | 2 | Yes | #1
Components of CDH
Cloudera Enterprise
• User Interface: HUE
• Workflow: APACHE OOZIE
• Scheduling: APACHE OOZIE
• File System Mount: FUSE-DFS
• Languages / Compilers: APACHE PIG, APACHE HIVE
• Data Integration: APACHE FLUME, APACHE SQOOP
• Fast Read/Write Access: APACHE HBASE
• Coordination: APACHE ZOOKEEPER
Hadoop Distributed File System
Block Size = 64MB
Replication Factor = 3
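As a rough illustration of what these two settings mean for storage, the sketch below estimates how many blocks a file occupies and how much raw disk its replicas consume. The function name `hdfs_footprint` and the 1000 MB file are assumptions made for the example; only the 64MB block size and the replication factor of 3 come from the slide.

```python
import math

BLOCK_SIZE_MB = 64       # block size stated on the slide
REPLICATION_FACTOR = 3   # replication factor stated on the slide

def hdfs_footprint(file_size_mb):
    """Return (blocks, block_replicas, raw_storage_mb) for a single file.

    The last block only holds the remaining bytes, so raw storage is
    the file size multiplied by the replication factor.
    """
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    block_replicas = blocks * REPLICATION_FACTOR
    raw_storage_mb = file_size_mb * REPLICATION_FACTOR
    return blocks, block_replicas, raw_storage_mb

# Hypothetical 1000 MB file: 16 blocks, 48 block replicas, 3000 MB of raw disk
print(hdfs_footprint(1000))
```

Under these assumptions, every terabyte of user data needs roughly three terabytes of raw disk across the cluster.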
[Diagram: a file split into blocks 1 to 5, with copies of each block spread across the DataNodes of the HDFS cluster]
Cost is $400-$500/TB
Components of Hadoop
• NameNode – Holds all metadata for HDFS
  – Needs to be a highly reliable machine
    • RAID drives – typically RAID 10
    • Dual power supplies
    • Dual network cards – bonded
  – The more memory the better – typically 36GB to 64GB
• Secondary NameNode – Provides checkpointing for the NameNode. The same hardware as the NameNode should be used
Components of Hadoop
• DataNodes – Hardware will depend on the specific needs of the cluster
– No RAID needed; JBOD (just a bunch of disks) is used
– Typical ratio is:
• 1 hard drive
• 2 cores
• 4GB of RAM
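To make the ratio concrete, here is a minimal sketch that scales cores and memory with the number of drives in a DataNode. The helper name `datanode_profile` and the 12-drive node are hypothetical; the 1 : 2 : 4GB ratio is the one listed above, and real sizing depends on the workload.

```python
def datanode_profile(hard_drives):
    """Scale cores and RAM with the drive count using the 1 drive : 2 cores : 4GB ratio."""
    return {
        "hard_drives": hard_drives,
        "cores": 2 * hard_drives,
        "ram_gb": 4 * hard_drives,
    }

# Hypothetical 12-drive DataNode
print(datanode_profile(12))
```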
Networking
• One of the most important things to consider when setting up a Hadoop cluster
• Typically, top-of-rack switches are used with Hadoop, connected to a core switch
• Be careful not to oversubscribe the backplane of the switch!
Map
• Records from the data source (lines out of files, rows of a database, etc.) are fed into the map function as key-value pairs: e.g., (filename, line).
• map() produces one or more intermediate values along
with an output key from the input.
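The map step can be sketched in plain Python as a word count, assuming (filename, line) input records as in the example above. The function name `map_fn` and the sample records are illustrative only; this is not the Hadoop API, just the shape of the computation.

```python
def map_fn(filename, line):
    """Emit one intermediate (word, 1) pair for every word in the input line."""
    for word in line.split():
        yield (word.lower(), 1)

# Hypothetical input records in (filename, line) form
records = [
    ("notes.txt", "the quick fox"),
    ("notes.txt", "the lazy dog"),
]

# Run the map function over every record and collect the intermediate pairs
intermediate = [pair for key, value in records for pair in map_fn(key, value)]
print(intermediate)
```

Each input record may produce any number of intermediate pairs, and the output key (the word) is unrelated to the input key (the filename).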
[Diagram: (key, values) inputs → Map Task → (key, int. values) pairs → Shuffle Phase → Reduce Task → Final (key, values)]
Reduce
• After the map phase is over, all the intermediate values for
a given output key are combined together into a list
• reduce() combines those intermediate values into one or
more final values for that same output key
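A minimal sketch of the shuffle and reduce steps, again as a word count in plain Python. The helper names `shuffle` and `reduce_fn` and the sample pairs are assumptions for illustration; in Hadoop the framework performs the grouping between the map and reduce phases.

```python
from collections import defaultdict

def shuffle(pairs):
    """Group intermediate values into a list per output key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_fn(key, values):
    """Combine the intermediate values for one key into a final value."""
    yield (key, sum(values))

# Hypothetical intermediate pairs produced by a map phase
intermediate = [("the", 1), ("quick", 1), ("the", 1)]

final = [
    result
    for key, values in sorted(shuffle(intermediate).items())
    for result in reduce_fn(key, values)
]
print(final)
```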
[Diagram: (key, values) inputs → Map Task → (key, int. values) pairs → Shuffle Phase → Reduce Task → Final (key, values)]
MapReduce Execution
Sqoop
SQL to Hadoop
 Tool to import/export any JDBC-supported database into Hadoop
 Transfer data between Hadoop and external databases or EDW
 High performance connectors for some RDBMS
 Developed at Cloudera
Flume
Distributed, reliable, available service for efficiently moving large amounts of data as it is produced
• Suited for gathering logs from multiple systems and inserting them into HDFS as they are generated
Design goals
• Reliability, Scalability, Manageability, Extensibility
Developed at Cloudera
Flume: high-level architecture
• Master sends configuration to all Agents
• Configurable levels of reliability: guarantee delivery in event of failure
• Deployable, centrally administered
• Optionally pre-process incoming data: perform transformations, suppressions, metadata enrichment
• Writes to multiple HDFS file formats (text, sequence, JSON, Avro, others)
• Parallelized writes across many collectors – as much write throughput as needed
• Flexibly deploy decorators at any step to improve performance, reliability or security
[Diagram: Agents → Processors (encrypt, compress, batch) → Collector(s), with the MASTER configuring the Agents]
HBase
Column-family store. Based on design of Google BigTable
 Provides interactive access to information
 Holds extremely large datasets (multi-TB)
 Constrained access model
 (key, value) lookup
 Limited transactions (only one row)
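The access model described for HBase (lookup by key, columns grouped into families, atomicity limited to a single row) can be pictured with a toy in-memory table. The class `ToyTable`, its methods and the sample row are hypothetical and are not the HBase client API; the sketch only models the shape of the data.

```python
import threading
from collections import defaultdict

class ToyTable:
    """In-memory stand-in for a column-family table (not the HBase API)."""

    def __init__(self, families):
        self.families = set(families)
        self.rows = defaultdict(dict)               # row key -> {"family:qualifier": value}
        self.row_locks = defaultdict(threading.Lock)

    def put(self, row_key, cells):
        """Write several cells of ONE row as a unit; there is no multi-row transaction."""
        for column in cells:
            family = column.split(":", 1)[0]
            if family not in self.families:
                raise KeyError(f"unknown column family: {family}")
        with self.row_locks[row_key]:
            self.rows[row_key].update(cells)

    def get(self, row_key, column):
        """Look up a single cell by row key and column name."""
        return self.rows.get(row_key, {}).get(column)

table = ToyTable(families=["info"])
table.put("user42", {"info:name": "Ada", "info:city": "London"})
print(table.get("user42", "info:name"))
```

All column names are validated before anything is written, so a `put` either updates the whole row or leaves it unchanged.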
Hive
SQL-based data warehousing application
 Language is SQL-like
 Supports SELECT, JOIN, GROUP BY, etc.
 Features for analyzing very large data sets
 Partition columns, Sampling, Buckets
 Example:
SELECT s.word, s.freq, k.freq FROM shakespeare s
JOIN kjv k ON (s.word = k.word) WHERE s.freq >= 5;
Pig
Data-flow oriented language – “Pig Latin”
 Datatypes include sets, associative arrays, tuples
 High-level language for routing data, allows easy
integration of Java for complex tasks
 Example:
emps = LOAD 'people.txt' AS (id, name, salary);
rich = FILTER emps BY salary > 100000;
srtd = ORDER rich BY salary DESC;
STORE srtd INTO 'rich_people.txt';
Oozie
Oozie is a workflow/coordination service to manage data processing jobs for Hadoop
ZooKeeper
ZooKeeper is a distributed consensus engine
• Provides well-defined concurrent access semantics:
  – Leader election
  – Service discovery
  – Distributed locking / mutual exclusion
  – Message board / mailboxes
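Leader election is commonly built on ZooKeeper by having each candidate create a sequence-numbered ephemeral node, with the lowest number acting as leader. The toy model below imitates that idea in plain Python; the class `ToyElection` and the client names are hypothetical, and nothing here talks to a real ZooKeeper ensemble.

```python
import itertools

class ToyElection:
    """Toy model of sequence-numbered ephemeral nodes (not the ZooKeeper API)."""

    def __init__(self):
        self._counter = itertools.count()
        self._candidates = {}  # sequence number -> client name

    def join(self, client):
        """Register a candidate and return its sequence number."""
        seq = next(self._counter)
        self._candidates[seq] = client
        return seq

    def leave(self, seq):
        """Model an ephemeral node disappearing when its client session ends."""
        self._candidates.pop(seq, None)

    def leader(self):
        """The candidate holding the lowest sequence number is the leader."""
        if not self._candidates:
            return None
        return self._candidates[min(self._candidates)]

election = ToyElection()
a = election.join("node-a")
b = election.join("node-b")
first_leader = election.leader()
election.leave(a)            # node-a's session ends
second_leader = election.leader()
print(first_leader, second_leader)
```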
Pipes and Streaming
Multi-language connector libraries for MapReduce
• Write native-code MapReduce in C++
• Write MapReduce passes in any scripting language, including Perl and Python
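With Streaming, each pass is an ordinary program that reads text lines and writes tab-separated key/value lines. The sketch below shows a word-count mapper and reducer in Python written against that line format; the function names and the sample input are assumptions, and the sorting that the framework performs between the two passes is imitated here with `sorted`.

```python
from itertools import groupby

def mapper(lines):
    """Turn input lines into 'word<TAB>1' output lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    """Sum the counts of consecutive lines sharing a key (input must be sorted by key)."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"

# Local simulation of map -> sort -> reduce on hypothetical input
mapped = sorted(mapper(["b a", "a"]))
reduced = list(reducer(mapped))
print(reduced)
```

In an actual Streaming job, each function would be wrapped in its own script that iterates over `sys.stdin` and prints every result line to standard output.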
FUSE-DFS
Allows mounting of HDFS volumes via the Linux FUSE file system
• Does allow easy integration with other systems for data import/export
• Does not imply HDFS can be used as a general-purpose file system
Hadoop Security
• Authentication is secured by Kerberos v5 and integrated with LDAP
• Hadoop server can ensure that users and groups are who they say they are
• Job Control includes Access Control Lists, which means jobs can specify who can view logs, counters and configurations, and who can modify a job
• Tasks now run as the user who launched the job
Cloudera Enterprise
Cloudera Enterprise makes open source Hadoop enterprise-easy
• Simplify and Accelerate Hadoop Deployment
• Reduce Adoption Costs and Risks
• Lower the Cost of Administration
• Increase the Transparency and Control of Hadoop
• Leverage the Experience of Our Experts
CLOUDERA ENTERPRISE COMPONENTS
• Cloudera Manager: End-to-End Management Application for Apache Hadoop
• Production-Level Support: Our Team of Experts On-Call to Help You Meet Your SLAs
EFFECTIVENESS: Ensuring You Get Value From Your Hadoop Deployment
EFFICIENCY: Enabling You to Affordably Run Hadoop in Production
Cloudera Manager
• The industry’s first […] for Apache Hadoop
• […] the Apache Hadoop stack: HDFS, MAPREDUCE, HBASE, ZOOKEEPER, OOZIE, HUE
• Automates the […] of Apache Hadoop
• DISCOVER, DIAGNOSE, ACT, OPTIMIZE
Cloudera Enterprise
Including Cloudera Support
Feature | Benefit
Flexible Support Windows | Choose from 8x5 or 24x7 options to meet SLA requirements
Configuration Checks | Verify that your Hadoop cluster is fine-tuned for your environment
Issue Resolution and Escalation Processes | Proven processes ensure that support cases get resolved with maximum efficiency
Comprehensive Knowledgebase | Browse through hundreds of Articles and Tech Notes to expand upon your knowledge of Apache Hadoop
Certified Connectors | Connect your Apache Hadoop cluster to your existing data analysis tools such as IBM Netezza and Revolution Analytics
Notification of New Developments and Events | Stay up to speed with what’s going on in the Apache Hadoop community
Cloudera University
Public and Private Training to Enable Your Success
Class | Description
Developer Training & Certification (4 Days) | Hands-on training and certification for developers who want to analyze their data but are new to Apache Hadoop
System Administrator Training & Certification (3 Days) | Hands-on training and certification for administrators who will be responsible for setting up, configuring and monitoring an Apache Hadoop cluster
HBase Training (2 Days) | Covers the HBase architecture, data model, and Java API as well as some advanced topics and best practices
Analyzing Data with Hive and Pig (2 Days) | Hive and Pig training is designed for people who have a basic understanding of how Apache Hadoop works and want to utilize these languages for analysis of their data
Essentials for Managers (1 Day) | Provides decision-makers the information they need to know about Apache Hadoop, answering questions such as “when is Hadoop appropriate?”, “what are people using Hadoop for?” and “what do I need to know about choosing Hadoop?”
Cloudera Consulting Services
Put Our Expertise To Work For You.
Cloudera’s team of Solutions Architects provides guidance and
hands-on expertise to address unique enterprise challenges.
Service | Description
Use Case Discovery | Assess the appropriateness and value of Hadoop for your organization
New Hadoop Deployment | Set up and configure high performance, production-ready Hadoop clusters
Proof of Concept | Verify the prototype functionality and project feasibility for a new Hadoop cluster
Production Pilot | Deploy your first production-level project using Hadoop
Process and Team Development | Define the requirements and processes for creating a new Hadoop team
Hadoop Deployment Certification | Perform periodic health checks to certify and tune up existing Hadoop clusters
Journey of the Cloudera Customer
• Discover the Benefits of Apache Hadoop: Flexibility to store and mine all types of data
• Cloudera’s Distribution: The fastest, surest path to success with Apache Hadoop
• Subscribe to Cloudera Enterprise: Simplify and accelerate Apache Hadoop deployment
Cloudera in Production
• Cloudera Services: Consulting Services, Cloudera University
• Cloudera Enterprise: Cloudera Management Suite, Cloudera Support
• Cloudera’s Distribution Including Apache Hadoop (CDH) & SCM Express
Users and their tools:
• OPERATORS: Management Tools
• ENGINEERS: IDEs
• ANALYSTS: BI / Analytics
• BUSINESS USERS: Enterprise Reporting
• CUSTOMERS: Web Application
Connected systems: Enterprise Data Warehouse, Operational Rules Engines
Data sources: Logs, Files, Web Data, Relational Databases
Get Hadoop
Cloudera helps you profit from all your data.
+1 (888) 789-1488
sales@cloudera.com
cloudera.com
twitter.com/cloudera
facebook.com/cloudera
Cloudera Manager
The […] Hadoop management application that:
• Manages the […]
• Manages and monitors the […]
• Incorporates comprehensive […]
• Has built-in […]
Cloudera Manager
Key […] and […]
• Installs the complete Hadoop stack in minutes. The simple, wizard-based interface guides you through the steps.
• Gives you complete, end-to-end visibility and control over your Hadoop cluster from a single interface
• Set server roles, configure services and manage security across the cluster
• Gracefully start, stop and restart services as needed
• Maintains a complete record of configuration changes for SOX compliance
• Monitors dozens of service performance metrics and alerts you when you approach critical thresholds
• Gather, view and search Hadoop logs collected from across the cluster
• Scans Hadoop logs for irregularities and warns you before they impact the cluster
[Five of these features carry an ONLY CLOUDERA badge on the slide]
Cloudera Manager
Key […] and […]
• Establishes the time context globally for almost all views
• Correlates jobs, activities, logs, system changes, configuration changes and service metrics along a single timeline to simplify diagnosis
• Takes a snapshot of the cluster state and automatically sends it to Cloudera support to assist with resolution
• Creates and aggregates relevant Hadoop events pertaining to system health, log messages, user services and activities and makes them available for alerting and searching
• Generates email alerts when certain events occur
• Visualize current and historical disk usage by user, group and directory
• Track MapReduce activity on the cluster by job or user
• View information pertaining to hosts in your cluster including status, resident memory, virtual memory and roles
[Four of these features carry an ONLY CLOUDERA badge on the slide]
Two Editions:
Max Number of Nodes Supported: 50 (FREE EDITION), Unlimited (ENTERPRISE EDITION**)
[Table: per-edition availability of the following features]
• Automated Deployment
• Host-Level Monitoring
• Secure Communication Between Server & Agents
• Configuration Management
• Manage HDFS, MapReduce, HBase, Hue, Oozie & ZooKeeper
• Audit Trails
• Start/Stop/Restart Services
• Add/Restart/Decommission Role Instances
• Configuration Versioning & History
• Support for Kerberos
• Service Monitoring
• Proactive Health Checks
• Status & Health Summary
• Intelligent Log Management
• Events Management & Alerts
• Activity Monitoring
• Operational Reporting
• Global Time Control
• Support Integration
** Part of the Cloudera Enterprise subscription
View Service Health and Performance
Get Host-Level Snapshots
Monitor and Diagnose Cluster Workloads
Gather, View and Search Hadoop Logs
Track Events From Across the Cluster
Run Reports on System Performance & Usage
New in Cloudera Manager 3.7
Feature | Description
Proactive Health Checks | Monitors dozens of service performance metrics and alerts you when you approach critical thresholds
Intelligent Log Management | Gathers and scans Hadoop logs for irregularities and warns you before they impact the cluster
Global Time Control | Correlates jobs, activities, logs, system changes, configuration changes and service metrics along a single timeline to simplify diagnosis
Support Integration | Takes a snapshot of the cluster state and automatically sends it to Cloudera support to assist with resolution
Event Management | Creates and aggregates relevant Hadoop events pertaining to system health, log messages, user services and activities and makes them available for alerting and searching
Alerts | Generates email alerts when certain events occur
Audit Trails | Maintains a complete record of configuration changes for SOX compliance
Operational Reporting | Visualize current and historical disk usage by user, group and directory and track MapReduce activity on the cluster by job or user
[Seven of these eight features carry an ONLY CLOUDERA badge on the slide]
Cloudera Support
Our team of experts on call to help you meet your SLAs
Feature | Benefit
Flexible Support Windows | Choose from 8x5 or 24x7 options to meet SLA requirements
Configuration Checks | Verify that your Hadoop cluster is fine-tuned for your environment
Issue Resolution and Escalation Processes | Proven processes ensure that support cases get resolved with maximum efficiency
Comprehensive Knowledgebase | Browse through hundreds of Articles and Tech Notes to expand upon your knowledge of Apache Hadoop
Certified Connectors | Connect your Apache Hadoop cluster to your existing data analysis tools such as IBM Netezza, Revolution Analytics, and MicroStrategy
Proactive Notification of New Developments and Events | Stay up to speed with what’s going on in the Apache Hadoop community
Cloudera Enterprise
The Fastest Path to Success Running Apache Hadoop in Production.
Why Cloudera Enterprise?
• Apache Hadoop is a distributed system that presents unique operational challenges
• The fixed cost of managing an internal patch and release infrastructure is prohibitive
• Apache Hadoop skills and expertise are scarce
• It’s challenging to track consistently to community development efforts
Only Cloudera Enterprise
• Has a management application that supports the full lifecycle of operationalizing Apache Hadoop
• Has production support backed by the Apache committers
• Has the depth of experience supporting hundreds of production Apache Hadoop clusters