Hadoop - Mil-OSS

Hadoop Update
Big Data Analytics
May 23rd, 2012
Matt Mead, Cloudera
What is Hadoop?
Apache Hadoop is an open source
platform for data storage and processing
that is…
• Scalable
• Fault tolerant
• Distributed
CORE HADOOP SYSTEM COMPONENTS
Provides storage and computation in a single, scalable system.
• Hadoop Distributed File System (HDFS) – Self-Healing, High Bandwidth Clustered Storage
• MapReduce – Distributed Computing Framework
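To make the MapReduce model concrete, here is a minimal in-memory sketch of a word count. It does not use Hadoop; the function names (map_phase, shuffle, reduce_phase, run_job) are illustrative assumptions, and a real job would run the map and reduce steps in parallel across the cluster.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence on a list of text records."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(run_job(["hadoop stores data", "hadoop processes data"]))
```

Each record is mapped independently, which is what lets the framework spread the work over many machines.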
Why Use Hadoop?
Move beyond rigid legacy frameworks.
1. Hadoop handles any data type, in any quantity.
– Structured, unstructured
– Schema, no schema
– High volume, low volume
– All kinds of analytic applications
2. Hadoop grows with your business.
– Proven at petabyte scale
– Capacity and performance grow simultaneously
– Leverages commodity hardware to mitigate costs
3. Hadoop is 100% Apache® licensed and open source.
– No vendor lock-in
– Community development
– Rich ecosystem of related projects
Hadoop helps you derive the complete value of all your data.
– Drives revenue by extracting value from data that was previously out of reach
– Controls costs by storing data more affordably than any other platform
The Need for CDH
1. The Apache Hadoop ecosystem is complex
– Many different components – lots of moving parts
– Most companies require more than just HDFS and MapReduce
– Creating a Hadoop stack is time-consuming and requires specific expertise
• Component and version selection
• Integration (internal & external)
• System test w/end-to-end workflows
2. Enterprises consume software in a certain way
– System, not silo
– Tested and stable
– Documented and supported
– Predictable release schedule
Core Values of CDH
A
with everything you need for production use.
Components of the CDH Stack
• File System Mount – FUSE-DFS
• UI Framework – HUE
• SDK – HUE SDK
• Workflow – APACHE OOZIE
• Scheduling – APACHE OOZIE
• Metadata – APACHE HIVE
• Languages / Compilers – APACHE PIG, APACHE HIVE, APACHE MAHOUT
• Data Integration – APACHE FLUME, APACHE SQOOP
• Fast Read/Write Access – APACHE HBASE
• Coordination – APACHE ZOOKEEPER
• HDFS, MAPREDUCE
Diagram layers: Storage, Computation, Integration, Coordination, Access
The Need for CDH
A set of open source components,
HDFS – Distributed, scalable, fault tolerant file system
MapReduce – Parallel processing framework for large data sets
Apache Hive – SQL-like language and metadata repository
Apache Pig – High level language for expressing data analysis programs
Apache HBase – Hadoop database for random, realtime read/write access
Apache Oozie – Server-based workflow engine for Hadoop activities
Apache ZooKeeper – Highly reliable distributed coordination service
Apache Sqoop – Integrating Hadoop with RDBMS
Apache Flume – Distributed service for collecting and aggregating log and event data
Fuse-DFS – Module within Hadoop for mounting HDFS as a traditional file system
Apache Mahout – Library of machine learning algorithms for Apache Hadoop
Hue – Browser-based desktop interface for interacting with Hadoop
Apache Whirr – Library for running Hadoop in the cloud
Core Hadoop Use Cases
Two Core Use Cases Applied Across Verticals
1 ADVANCED ANALYTICS (INDUSTRY TERM) | VERTICAL | 2 DATA PROCESSING (INDUSTRY TERM)
Social Network Analysis | Web | Clickstream Sessionization
Content Optimization | Media | Engagement
Network Analytics | Telco | Mediation
Loyalty & Promotions Analysis | Retail | Data Factory
Fraud Analysis | Financial | Trade Reconciliation
Entity Analysis | Federal | SIGINT
Sequencing Analysis | Bioinformatics | Genome Mapping
FMV & Image Processing
Data Processing – Full Motion Video & Image Processing
• Record by record -> Easy Parallelization
– “Unit of work” is important
– Raw data in HDFS
• Adaptation of existing image analyzers to Map Only / Map Reduce
• Scales horizontally
• Simple detections
– Vehicles
– Structures
– Faces
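The record-by-record pattern above can be sketched as a map-only job: each frame is one unit of work and no reduce step is needed. This is a local illustration only. The detect function is a hypothetical stand-in for an existing image analyzer, the frame layout (frame_id, pixels) is assumed, and a thread pool stands in for the cluster's parallel map tasks.

```python
from concurrent.futures import ThreadPoolExecutor


def detect(frame):
    """Hypothetical analyzer: flag a frame whose mean pixel value exceeds 128."""
    pixels = frame["pixels"]
    return {
        "frame_id": frame["frame_id"],
        "detection": sum(pixels) / len(pixels) > 128,
    }


def map_only_job(frames, workers=4):
    """Apply detect() to each frame independently; results keep input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(detect, frames))


if __name__ == "__main__":
    sample = [
        {"frame_id": 1, "pixels": [200, 220]},
        {"frame_id": 2, "pixels": [10, 20]},
    ]
    print(map_only_job(sample))
```

Because no frame depends on any other, adding workers (or nodes) increases throughput without changing the analyzer.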
Cybersecurity Analysis
Advanced Analytics – Cybersecurity Analysis
• Rates and flows – ingest can exceed multiple gigabytes per second
• Can be complex because of mixed-workload clusters
• Typically involves ad-hoc analysis
– Question oriented analytics
• “Productionized” use cases allow insight by non-analysts
• Existing open source solution SHERPASURFING
– Focuses on the cybersecurity analysis underpinnings for common data-sets (pcap, netflow, audit logs, etc.)
– Provides a means to ask questions without reinventing all the plumbing
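As an illustration of a question-oriented query, the sketch below answers "which sources sent the most bytes?" over flow records. It is not SHERPASURFING's API; the record fields (src, bytes) and the function names are assumptions made for the example.

```python
from collections import Counter


def bytes_by_source(flows):
    """Sum bytes per source address (a reduce-by-key over flow records)."""
    totals = Counter()
    for flow in flows:
        totals[flow["src"]] += flow["bytes"]
    return totals


def top_talkers(flows, n=3):
    """Return the n sources with the highest byte totals, largest first."""
    return bytes_by_source(flows).most_common(n)


if __name__ == "__main__":
    sample = [
        {"src": "10.0.0.1", "bytes": 500},
        {"src": "10.0.0.2", "bytes": 100},
        {"src": "10.0.0.1", "bytes": 250},
    ]
    print(top_talkers(sample, n=1))
```

On a cluster, the per-source sum would be the reduce step and the top-n selection a small final pass.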
Index Preparation
Data Processing – Index Preparation
• Hadoop’s Seminal Use Case
• Dynamic Partitioning -> Easy Parallelization
• String Interning
• Inverted Index Construction
• Dimensional data capture
• Destination indices
– Lucene/Solr (and derivatives)
– Endeca
• Existing solution USA Search (http://usasearch.howto.gov/)
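A small sketch of how interning, partitioning, and inverted index construction fit together. It runs in memory and is not tied to Lucene/Solr or Endeca; the build_index name, the whitespace tokenizer, and the modulo partitioning rule are assumptions chosen for illustration.

```python
from collections import defaultdict


def build_index(docs, num_partitions=2):
    """Return (term_ids, partitions).

    term_ids maps each distinct term to a small integer (string interning).
    partitions[p] maps a term id to the sorted list of documents containing it.
    """
    term_ids = {}
    partitions = [defaultdict(set) for _ in range(num_partitions)]
    for doc_id, text in docs.items():
        for term in text.lower().split():
            # Intern the term: reuse its id if seen before, else assign the next one.
            tid = term_ids.setdefault(term, len(term_ids))
            # Route the posting to a partition so partitions can be built in parallel.
            partitions[tid % num_partitions][tid].add(doc_id)
    return term_ids, [
        {tid: sorted(ids) for tid, ids in part.items()} for part in partitions
    ]


if __name__ == "__main__":
    ids, parts = build_index({1: "big data", 2: "big cluster"})
    print(ids)
    print(parts)
```

Interning keeps the postings compact, and because each partition owns a disjoint set of terms, partitions can be written to the destination index independently.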
Data Landing Zone
Data Processing – Schema-less Enterprise Data Warehouse / Landing Zone
• Begins as storage, light ingest processing, retrieval
• Capacity scales horizontally
• Schema-less -> holds arbitrary content
• Schema-less -> allows ad-hoc fusion and analysis
• Additional analytic workload forces decisions
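The schema-less idea can be sketched as "store raw, apply structure at read time". In this minimal example a Python list stands in for files in HDFS, and the land and read_with_schema names are hypothetical.

```python
import json


def land(raw_store, line):
    """Ingest: append the raw record untouched; no schema is enforced."""
    raw_store.append(line)


def read_with_schema(raw_store, fields):
    """Retrieval: project each JSON record onto the fields a query needs."""
    for line in raw_store:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # non-JSON content stays in the store but is skipped here
        if isinstance(record, dict):
            yield {field: record.get(field) for field in fields}


if __name__ == "__main__":
    store = []
    land(store, '{"user": "a", "ip": "1.2.3.4"}')
    land(store, "free text")
    land(store, '{"user": "b"}')
    print(list(read_with_schema(store, ["user", "ip"])))
```

Nothing is rejected at ingest, so different analyses can later impose different views on the same raw data.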
Hadoop: Getting Started
• Reactive
– Forced by scale or cost of scaling
• Proactive
– Seek talent ahead of need to build
– Identify data-sets
– Determine high-value use cases that change organizational outcomes
– Start with 10-20 nodes and 10+TB unless data-sets are super-dimensional
• Either way
– Talent a major challenge
– Start with “Data Processing” use cases
– Physical infrastructure is complex; make the software infrastructure simple to manage
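For the starting-size guidance, a rough back-of-the-envelope sketch of raw disk per node. It assumes the HDFS default replication factor of 3 and, as a labeled assumption not taken from the slides, 25% of disk kept free for temporary and intermediate data.

```python
def raw_storage_tb(data_tb, replication=3, headroom=0.25):
    """Raw disk needed for data_tb of data.

    replication=3 is the HDFS default block replication.
    headroom is an assumed fraction of disk kept free (not from the slides).
    """
    return data_tb * replication / (1 - headroom)


def per_node_tb(data_tb, nodes, **kwargs):
    """Raw disk each node must provide if data is spread evenly."""
    return raw_storage_tb(data_tb, **kwargs) / nodes


if __name__ == "__main__":
    for nodes in (10, 20):
        print(nodes, "nodes:", per_node_tb(10, nodes), "TB raw per node")
```

Under these assumptions, 10 TB of data needs about 40 TB of raw disk, or roughly 2 to 4 TB per node on a 10-20 node cluster.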
Customer Success
Self-Source Deployment vs. Cloudera Enterprise – 500 node deployment
[Chart: Cost, $Millions vs. Time required for Production Deployment (Months)]
Option 1: Use Cloudera Enterprise – Estimated Cost: $2 million; Deployment Time: ~2 Months
Option 2: Self-Source – Estimated Cost: $4.8 million; Deployment Time: ~6 Months
Note: Cost estimates include personnel, software & hardware
Source: Cloudera internal estimates
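The difference between the two options can be worked out directly from the estimates quoted on the slide. The compare helper below is a hypothetical name; the inputs are Cloudera's own internal estimates, not independent figures.

```python
def compare(option_a, option_b):
    """Return the savings of option_a relative to option_b."""
    saved = option_b["cost_usd"] - option_a["cost_usd"]
    return {
        "cost_saved_usd": saved,
        "cost_saved_pct": round(100 * saved / option_b["cost_usd"], 1),
        "months_saved": option_b["months"] - option_a["months"],
    }


if __name__ == "__main__":
    cloudera_enterprise = {"cost_usd": 2_000_000, "months": 2}
    self_source = {"cost_usd": 4_800_000, "months": 6}
    print(compare(cloudera_enterprise, self_source))
```

On the slide's numbers this is a $2.8 million (about 58%) lower estimated cost and roughly four months less deployment time.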
Customer Success
Cloudera Enterprise Subscription vs. Self-Source
Item | Cloudera Enterprise | Self-Source or Contract
Support Offering | World-Class, Global, Dedicated Contributors and Committers | Must recruit, hire, train and retain Hadoop experts
Monitoring and Management | Fully Integrated application for Hadoop Intelligence | Must be developed and maintained in house
Support for the Full Hadoop Stack | Full Stack* | Unknown
Regular Scheduled Releases | Yearly Major, Quarterly Minor, Hot Fixes? | N/A
Training and Certification for the Full Hadoop Stack | Available Worldwide | None
Support for Full Lifecycle | All Inclusive Development through Production | Community support
Rich Knowledge-base | 500+ Articles | None
Production Solution Guides | Included | None
* Flume, FuseDFS, HBase, HDFS, Hive, Hue, Mahout, MR1, MR2, Oozie, Pig, Sqoop, Zookeeper
Contact Us
• Erin Hawley
– Business Development, Cloudera DoD Engagement
– ehawley@cloudera.com
• Matt Mead
– Sr. Systems Engineer, Cloudera Federal Engagements
– mmead@cloudera.com