Hadoop - interface:systems

BigData

From Experiment to Production

Mario Vosschmidt

Consulting Systems Engineer

© 2014 NetApp, Inc. All rights reserved. NetApp Proprietary – Limited Use Only

Agenda

BigData or SmartData?

1) What is "BigData"?

2) Requirements and challenges

3) Which scenarios do we focus on?

4) What do solution approaches look like?

5) How do I implement these solutions?

6) Summary


The Big Data Landscape

BigData

The 3V Paradigm

Variety

 Multiple data sources

 Multiple data formats

Velocity

 High speed processing

 Fast changing requirements

Volume

 Huge amounts of data

 Process and persist


NetApp Confidential - Internal Use Only

Entering a New Era of Scale


Big Data Solution Portfolio

The ABCs of Big Data at NetApp

Analytics: Insight from extremely large datasets

Bandwidth: Performance for data intensive workloads

Content: Secure boundless data storage

Not Even to The “Peak”

[Figure: hype cycle curve, visibility over time, with the phases Technology Trigger, Peak of Inflated Expectations, Trough of Disillusionment, Slope of Enlightenment, Plateau of Productivity]

35 Zettabytes: estimated size of the digital universe in 2020

30 Billion: pieces of new content added to Facebook per month

5 Billion: smartphones

80%: unstructured data

Big Data Vendor Landscape

A Lot of Hype and Buzz – Everyone is Jumping In

[Chart: Funding for Hadoop and NoSQL, Jan-08 to Nov-11 (source: 451 Research). Funding rounds shown: Cloudera series B, C and D; MapR series A and B; 10gen series D; DataStax series B; Neo Technology series A; Opera Solutions series A; Platfora series A; Couchbase series C]

Market is expected to grow from $3.2 billion in 2010 to $16.9 billion in 2015

Most firms are taking a pragmatic approach

 Big data is in the very early stages of maturity

 Best practices are not mature

IDC Big Data Survey


"The Big Data market is expanding rapidly …

For technology buyers, opportunities exist to use Big Data technology to improve operational efficiency and to drive innovation.

Use cases are already present across industries and geographic regions."

Dan Vesset, Vice President, IDC


Data Growth Impact on Business

“Big Data” refers to datasets whose size is beyond the ability of typical tools to capture, store, manage and analyze

[Figure: data growth from 2010 to 2020 along the dimensions volume, complexity and speed. Beyond an inflection point, either information becomes a propellant to business or data becomes a burden to IT infrastructure]

The Big Data Opportunities

Financial Services

 Fraud detection & prevention

 Anti-money laundering

 Risk management

Government

 Law enforcement

 Counter-terrorism

 Research and Education


Manufacturing

 Supply chain optimization

 Defect tracking

 Root cause analysis

 RFID correlation

Healthcare

 Drug development

 Patient Records

 Evidence-based medicine

Why Should You Care?

It’s the Value of Your Data

Top line revenue

– Leverage your data assets into business advantage

 5 Billion Records

 Anywhere, Anytime

 Faster time to market

 50% Increase in Revenue

Bottom Line savings

– Lower the cost of compliance

– Manage ever growing data efficiently


 Over 1PB of data

 Growth of 175% YOY

 90 days of data within 24 hours of a failure

NetApp Big Data


Why NetApp?

Practical solutions that solve today’s problems

Get Control: NetApp helps you turn your exploding data from threat to opportunity. Manage your data effectively and affordably.

Break Through: Break through the limits. With NetApp, you can take on even the most massive and complex data projects.

Gain Insight: Turn insight to action. NetApp helps you get to clarity and insight faster and more reliably.


Experience Managing Data at Scale

NetApp’s Largest Customer

100 PB

4 Customers

50 PB

10 Customers

20 PB

50 Customers

10 PB

100 Customers

NetApp Big Data Strategy

Open

Best-of-Breed

Choice

 Best-of-breed storage for Big Data applications

 Built on open standards with best-in-class partnerships

 Validated with ecosystem leaders

 Complete server, network and storage “racks”

 Delivered via trusted high-value partners


Analytics

Smart Data

Big Analytics Strategy

Smart Data

DSS/DW (traditional analytics)

 Solutions partners include IBM, Oracle, Microsoft, ParAccel, Exasol and SAND

Big Analytics

 Enterprise class Hadoop-based solutions

 MapR, Hortonworks, Cloudera

Leverage partners to complete Big Analytics stack

 Solutions for validated server, network and storage


Big Analytics Solutions

Data Warehouse

Fast, space-efficient backup and recovery with storage utilization up to 90%. Less raw capacity with modular scalability

Mixed Use Database, Cubes

Optimized for IBM, Oracle and Microsoft.

Simplified data management and protection. Zero down time

Hadoop

Enterprise class Hadoop with lower total cost of ownership and based on open standards

The Value Proposition:

Some problems require an Enterprise Class Hadoop Solution

Enterprise Class Hadoop

Packaged ready-to-deploy modular compute / memory intensive Hadoop cluster

 Compute intensive applications

Tick data analysis

 Extremely tight Service Level expectations

 Severe financial consequences if the analytic run is late

Enterprise Class Hadoop

Packaged ready-to-deploy modular Hadoop cluster

The Data has intrinsic value $$$

 Usable capacity must expand faster than compute

 Higher storage performance

Real human consequences if the system fails

(Threats, treatments, financial losses)

System has to allow for asymmetric growth

White Box Hadoop

Values associated with early adopters of Hadoop

 Social Media Space

Contributors to Apache

 Strong bias to JBOD

Skeptical of ALL vendors

Enterprise Class Hadoop

Bounded Compute algorithm / Memory intensive Hadoop cluster

 Compute intensive applications

 Additional CPUs do not improve run time

Extremely tight Service Level expectations

 Severe financial consequences if the analytic run is late

 Need for deeper storage per datanode

Storage Capacity


Challenges with Hadoop in the Enterprise

Availability

 NameNode is a single point of failure

 Slow recovery from disk drive failure

 Expensive process to replace failed disks online

 Most common Hadoop support issue is disk drive failure

Operations

 Requires three copies of data, larger footprint, and more storage

 Limited flexibility; storage and servers tied together affects scalability

 Low cluster efficiency, higher network congestion
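As a rough illustration of the three-copy point above, the sketch below estimates the raw disk capacity a cluster needs for a given amount of user data. It assumes plain block replication with a factor of 3 (the HDFS default) and ignores temporary space and operating system overhead. The function names are hypothetical and not part of any Hadoop API.

```python
def raw_capacity_needed(usable_tb: float, replication_factor: int = 3) -> float:
    """Raw disk capacity (TB) needed to hold `usable_tb` of user data.

    Assumption: every block is stored `replication_factor` times and
    nothing else consumes space.
    """
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    return usable_tb * replication_factor


def storage_efficiency(replication_factor: int = 3) -> float:
    """Fraction of raw capacity that holds unique user data."""
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    return 1.0 / replication_factor


raw_tb = raw_capacity_needed(100)   # 300 TB of raw disk for 100 TB of data
efficiency = storage_efficiency()   # about 0.33
print(f"raw capacity: {raw_tb} TB, efficiency: {efficiency:.0%}")
```

Under these assumptions only about a third of the purchased capacity holds unique data, which is the footprint issue the slide refers to.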

Implementation

 Need to keep up with the fast-paced patches and projects of an open source platform

 Need to decide on distribution of Hadoop

 Skills are not common

 Integration with existing IT infrastructure can be difficult

 Tuning expertise needed to make Hadoop perform optimally


Cisco and NetApp Confidential. For Internal Use Only. Do Not Distribute.


Why Big Data and Analytics as a service is important!


FlexPod Converged Infrastructure Family

FlexPod® Express (MSB/Branch Office)

For smaller, less-dynamic requirements and VAR velocity

 Compute pool, network pool, storage pool: Cisco UCS C-Series, Nexus® 3K, FAS2xx0

 Two fixed pod sizes

 Cisco UCS Director, VMware®, and Microsoft®

FlexPod Data Center (Enterprise/Service Provider)

Massively scalable shared virtual data center infrastructure

 Compute pool, network pool, storage pool: Cisco UCS C-Series/B-Series, Nexus® 5k, FAS storage

 Flexible pod sizes

 FlexPod validated management and ecosystem

FlexPod Select (Dedicated)

Big data analytics, scientific, HPC

 Compute nodes: Cisco UCS C-Series

 Network / direct: Nexus, Catalyst®, MDS

 Storage: E-Series, FAS

 Reference architecture and/or designs

 Application-based management

NetApp Reference Architecture


Example: FlexPod Select with Cloudera

Components: Cisco UCS® C-Series Rack Mount Servers, Cisco UCS Fabric Interconnect, Cisco UCS Manager, NetApp® FAS Storage Systems, NetApp E-Series Storage Array

 Converged big data platform from NetApp and Cisco for Hadoop

 Enterprise-class Hadoop: Innovative storage, servers, networking validated with leading Hadoop distributions

 Faster time to value: Prevalidated configuration accelerates deployment

 High availability: Less downtime, higher serviceability to meet tight SLAs around data applications and processes

 Flexible scaling: Independently scale servers and storage; modular design for scaling as data needs grow

* NetApp 50% Storage Guarantee http://www.netapp.com/us/solutions/infrastructure/virtualization/guarantee.html


FlexPod Select with Hadoop

NetApp and Cisco deliver enterprise class Hadoop for high availability, performance, scalability

Cloudera or Hortonworks Distribution of Hadoop

Master

Expansion

Architected for the enterprise

 Superior NameNode protection

 Faster recovery from failover

 Lower cluster downtime

Faster time to value

 Validated, presized configurations

 Low-latency, high-bandwidth networking

 12 DataNodes in master, 16 in expansion

Coexistence with current applications and infrastructure

 Supports existing applications from SAP, Microsoft, Oracle

 Data management and monitoring with Cloudera Manager, Cisco UCS® Manager


Service-Level Expectations Around Data

High-Value Time-Sensitive Problems

Accelerate time to insights

Fast deployment with validated, preconfigured, reference designs

Store, process, analyze all data for new opportunities and business impact

More time to focus on data analysis rather than deal with cluster downtime

Making the Hadoop experience better

Optimized, tuned, fully configured cluster

Hadoop integrated with storage, compute, networking

Monitoring and management tools with SANtricity® and from partners (Cloudera Manager, Cisco UCS® Manager)

High density and capacity reduce data center footprint

Reduce risk in an open ecosystem

Compatibility with existing infrastructure and applications

Best-in-class partnerships, not entire stack from one vendor

Future-proof against lock-in and benefit from evolving ecosystem

FlexPod Select for Hadoop with Cloudera

Ease of Setup and Deployment

Preconfigured – Pre-Validated

Use Case Example: NetApp AutoSupport

Phone-home data representing information about the status of NetApp storage controllers

 Correlate disk latency (hot) with disk type

 24 billion records

 4 weeks to run query

 Hadoop implementation 10.5 hours

 Bug detection through pattern matching

 240 billion records – Too large to run

 Hadoop implementation 18 hours
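The first use case above implies a speedup that is easy to check by hand. The sketch below works it out, assuming "4 weeks" means four full calendar weeks of wall-clock time (an assumption; the slide does not say how the original runtime was measured). The function name is illustrative only.

```python
HOURS_PER_WEEK = 7 * 24  # 168


def speedup(baseline_hours: float, new_hours: float) -> float:
    """How many times faster the new run is than the baseline."""
    if new_hours <= 0:
        raise ValueError("new_hours must be positive")
    return baseline_hours / new_hours


# Disk latency correlation: 4 weeks before, 10.5 hours on Hadoop.
baseline_hours = 4 * HOURS_PER_WEEK            # 672
latency_speedup = speedup(baseline_hours, 10.5)
print(latency_speedup)                         # 64.0

# Records processed per hour in the same Hadoop run (24 billion records).
records_per_hour = 24e9 / 10.5
print(f"{records_per_hour:.2e} records/hour")
```

No comparable ratio exists for the bug detection query, because the slide states that the 240 billion record run could not be completed at all before the Hadoop implementation.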


Wireless Service Provider

[Diagram: agent servers (AS) at each remote site send data to collector servers (CS) at the central site. Archiving and indexing tools run on the NetApp Hadoop solution, shown as eight DataNodes (DN) on the Hadoop Distributed File System (HDFS)]

The solution consists of an eight-node Hadoop cluster at the central site. All data from the remote sites is transported over the WAN to the central site.

The data is collected, ingested, compressed and archived into the Hadoop cluster via HDFS. It is then categorized, put into separate containers, and indexed based on its record-keeping tags.
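The flow described above (compress and archive, categorize into containers, index by tags) can be sketched in a few lines. This is an in-memory illustration only: it does not talk to HDFS, and the record layout with `id`, `category` and `tags` fields is a hypothetical stand-in for whatever the real system stores.

```python
import gzip
import json
from collections import defaultdict


def archive(records):
    """Compress a batch of records into one gzip blob (stand-in for an HDFS write)."""
    payload = "\n".join(json.dumps(r, sort_keys=True) for r in records)
    return gzip.compress(payload.encode("utf-8"))


def restore(blob):
    """Read an archived batch back into a list of records."""
    text = gzip.decompress(blob).decode("utf-8")
    return [json.loads(line) for line in text.splitlines()]


def categorize(records):
    """Put records into separate containers, one per category."""
    containers = defaultdict(list)
    for record in records:
        containers[record["category"]].append(record)
    return dict(containers)


def build_index(records):
    """Map each record-keeping tag to the IDs of the records that carry it."""
    index = defaultdict(list)
    for record in records:
        for tag in record["tags"]:
            index[tag].append(record["id"])
    return dict(index)


sample = [
    {"id": 1, "category": "voice", "tags": ["billing", "2014"]},
    {"id": 2, "category": "data", "tags": ["billing"]},
    {"id": 3, "category": "voice", "tags": ["roaming"]},
]
blob = archive(sample)
containers = categorize(restore(blob))
index = build_index(sample)
print(sorted(containers), index["billing"])
```

Keeping the tag index separate from the compressed archive means a lookup by tag does not have to decompress every batch.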

Telco Industry

Provides wireless voice and data services globally


Analytics & Enterprise Apps Environment

[Diagram: data sources (OLTP, OLAP, mobile devices, location/GPS, logs, sensors, applications, other) feed through ETL into a layered stack of reporting/dashboard/visualization, applications, analytics, data management, storage file systems, and infrastructure storage (content shared storage plus all other storage, i.e. internal DAS)]

Bandwidth

Big Bandwidth Solutions

Full Motion Video

Scalable density and performance to ingest and simultaneously analyze UAV and satellite video data

Video Storage for Surveillance

High bandwidth & density supporting hundreds or thousands of HD cameras

Media Content Management

High ingest & play-out rates with support for media and entertainment workflows

HPC: Lustre, GPFS, BeeGFS

Massively parallel distributed file system for large scale cluster computing and O&G seismic processing

Big Bandwidth Solutions

Applications

Storage File Systems

Density

Reliability

Modularity

E-Series Storage

Performance

Efficiency

Flexibility

Full-Motion Video Storage Solution

High bandwidth HD Video Ingest

• Satellite

• UAV

Full-Motion Video

Built on E-Stack

E5460 Stack

Quantum® StorNext File System

Massively Scalable

Single Data Container

Multi-Stream

Video Playout

Processing

• Exploitation

• Analyst viewing

Turnkey solution in a 40U industry-standard rack

 Single architecture for ingest, exploitation and dissemination

 1.8PB Raw Capacity

– 4000+ hours of uncompressed 720p HD video

 >20 GB/s R/W Performance, >30 GB/s Peak Performance

 Scale to multiple Petabytes in a single data container
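The "4000+ hours" figure above can be sanity-checked with simple arithmetic. The slide does not state the pixel format or frame rate, so the sketch below assumes 1280x720 frames, 2 bytes per pixel (8-bit 4:2:2 sampling), 60 frames per second, decimal petabytes, and no RAID or file system overhead. All of these are assumptions chosen for illustration.

```python
def uncompressed_rate_bytes_per_s(width: int, height: int,
                                  bytes_per_pixel: int, fps: int) -> int:
    """Data rate of one uncompressed video stream in bytes per second."""
    return width * height * bytes_per_pixel * fps


def hours_of_video(capacity_bytes: float, rate_bytes_per_s: float) -> float:
    """Hours of video that fit into the given capacity at the given rate."""
    return capacity_bytes / rate_bytes_per_s / 3600


RAW_CAPACITY_BYTES = 1.8e15  # 1.8 PB, decimal (assumption)

rate = uncompressed_rate_bytes_per_s(1280, 720, 2, 60)  # about 110.6 MB/s
hours = hours_of_video(RAW_CAPACITY_BYTES, rate)        # roughly 4500 hours
print(f"{rate / 1e6:.1f} MB/s per stream, {hours:.0f} hours per rack")
```

With these parameters the result lands slightly above 4500 hours, which is consistent with the "4000+" claim. A different frame rate or pixel format would shift the number accordingly.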

HPC: Lustre

 Performance to meet the needs of the world’s fastest supercomputers

 High Bandwidth & Density

– 1.8PB & 30GB/s per 40U rack

 Highly available

– No single points of failure

– Extensive RAS features

 NetApp provided 7x24 Lustre support

 NetApp Professional Services


NetApp Confidential – Limited Use

Lawrence Livermore National Lab

Sequoia – announced as the fastest supercomputer and storage combination on the planet at ISC 2012

 Supercomputer storage to support twenty thousand trillion arithmetic operations per second with access speeds up to 1 TB/sec

 55PB of usable storage

 Simulations for nuclear weapons viability

 Counter Terrorism

 Energy Security

 Understanding Climate Change

Press Release: http://www.netapp.com/us/company/news/news-rel-20110928-990734.html


Video Surveillance Storage

Enhance public safety with better physical security

 Industry trends are driving explosive storage growth

 Analog to Digital

 SD to HD

 7 days to 30+ Days

 Open Platform Solution

 Best of breed industry partners

 Flexible deployments

 Modular scalability

 99.999% up time

40

Unique Out-of-Band Recording

No servers required between cameras and storage

 Saves hardware/software, licensing and footprint; very robust; saves a lot of network cabling; easy to scale.

Media Content Management

 Highly scalable digital repository

 Consolidates collaborative production

 Multi-format distribution workflows

 Industry-leading bandwidth per rack to reduce bottlenecks

 Highest capacity density to minimize power and cooling

 Single namespace for multi-petabyte repositories

 Unmatched breadth of production client support


Content Management


Big Content Solutions

File Services

Multi-application workloads

Non-disruptive operation

Integrated data protection, efficiency

Enterprise Content Repository

Infinite container

Fixed content

Non-disruptive operation

Integrated data protection, efficiency

Distributed Content Repository

Large, multi-site repository

Policy based data management

Metadata-enabled object storage


File Services

ONTAP Cluster Mode


 Heterogeneous cluster:

 A mix of controller types in a single cluster per workload needs

 Entry, mid, and high-end platforms

 Native and third-party storage (FAS and V-Series)

 Multiprotocol: NFS, pNFS, CIFS, iSCSI, FCP

 Integrated Data Protection

 Virtual storage tier:

 Match data to disk price and performance

 Manage multiple tiers in the same namespace or many


Enterprise Content Repositories

ONTAP Cluster Mode with Infinite Volume

Single large content repository

 Scales to PBs and billions of files across cluster

 Native storage efficiency

Simplified operations

 Multi-tenancy

 Simplifies application workflows

 Load balances data at ingest

 Starts small, grows granularly

High availability

 Protects against disk and hardware failures

 Snapshots & Replication for quick recovery

 Manage & Upgrade non-disruptively


Content Repository

Object Storage Insights

 Flat Namespace

 No filesystem hierarchy

 Metadata separated

 Not within data space

 Metadata serve as descriptors

 Can change over time

 However, data is persistent

 Objects referenced by ID

Index

 Write once read many

Similar to library

Objects do not change

Single writer multiple readers


 Less data management overhead

 High Metadata rates

 Less space management

 Data is replicated across geographies

 Simplified rights management
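The object storage properties listed above (flat namespace, objects referenced by ID, metadata kept apart from data, write once read many) can be shown with a toy in-memory store. This is a conceptual sketch only, not the StorageGRID API; the class and method names are hypothetical.

```python
import uuid


class ObjectStore:
    """Toy write-once object store with a flat namespace."""

    def __init__(self):
        self._data = {}      # object ID -> immutable payload
        self._metadata = {}  # object ID -> descriptors, stored separately

    def put(self, payload: bytes, **metadata) -> str:
        """Store a new object and return its ID. There is no overwrite call."""
        object_id = uuid.uuid4().hex
        self._data[object_id] = bytes(payload)
        self._metadata[object_id] = dict(metadata)
        return object_id

    def get(self, object_id: str) -> bytes:
        """Read an object by ID; there are no paths or directories."""
        return self._data[object_id]

    def metadata(self, object_id: str) -> dict:
        """Return a copy of the object's current descriptors."""
        return dict(self._metadata[object_id])

    def update_metadata(self, object_id: str, **changes) -> None:
        """Metadata can change over time; the payload stays untouched."""
        if object_id not in self._data:
            raise KeyError(object_id)
        self._metadata[object_id].update(changes)

    def find(self, **criteria) -> list:
        """Query objects by metadata instead of walking a hierarchy."""
        return [
            object_id
            for object_id, md in self._metadata.items()
            if all(md.get(key) == value for key, value in criteria.items())
        ]


store = ObjectStore()
oid = store.put(b"scan-0001", kind="image", site="A")
store.update_metadata(oid, site="B")
print(store.metadata(oid), store.find(kind="image") == [oid])
```

Because payloads never change after `put`, any number of readers can fetch the same ID without coordination, which matches the single writer, multiple readers point on the slide.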

Distributed Content Repositories

StorageGRID

Large content repository for big, unstructured data

 Billions of data sets, dozens of petabytes

Create, manage and consume content globally

 Predictable access to data independent of location

 Policy-controlled data stores at each site

Intelligent data classification and access

 Metadata-based management


StorageGRID Functional Diagram

NAS

I/O

Object Ingest and Retrieval

NAS

Protocols

(SG 9)

HTTP API / CDMI

Metadata Tagging and Query

Global Object Namespace

Object-Level Data Management

Location-Transparent Distributed Object Store

Policy-Driven

Data Placement

Storage Systems

“We’ve increased the number of retail partners we work with from 2,000 to almost 20,000 in just a few years. In the past 6 years, we’ve seen a 1,900% increase in transactions. This plus the massive increase in digital images uploaded by consumers demanded a more robust and highly scalable storage infrastructure.”

Zach Wickes, Vice President of Technology, PNI


Media Content Repository

PNI Digital Media

 High-performance, scalable storage infrastructure built to support 17 million revenue-generating transactions annually

 100% uptime even during peak holiday access when transactions increase 6 to 10 times

 3PB of rich media data

 Consumer access to 950 million digital images

 20,000 worldwide retail locations, online fulfillment partners and in-store kiosks

 WalMart Canada, Costco, Sam’s Club, Tesco, CVS/pharmacy, and Kodak

 NetApp FAS6280 and FAS3200, Data ONTAP, and FlashCache


Health in the Cloud

 STaaS offering for healthcare providers

 Medical Image Archive Cloud

 Two sites with ~1PB each

 2TB+ local cache at each edge site

 8x growth in capacity last 12 months

 100% uptime since start of service

 “Forever” retention policies

 ~60% of customers use hybrid cloud model

 Solution offers a proven 100% uptime with automated data movement from on-premises to off-premises public clouds, with a “keep forever” retention policy and indefinite growth

Press Release: http://www.netapp.com/us/company/news/news-rel-20111128-36413.html

Integrated Big Data Solutions and Expertise

Planning and implementation expertise for Big Data

Turn-key solution stacks and Big Data services

Big Data System Integrators

Solutions Built on NetApp®

Reference Material


FlexPod Select

Common Architecture

Software Solution

Solution Rack

+

Appliance

Application Packaging

Visualization

Analytics

Integration

Management

Efficiency

Validated Architecture

& SKUs

Infrastructure Integration

& Distribution

Operational Integration

& System Integrators


Big Data Summary

 Enable enterprise customers to gain business advantage

 Practical solutions proven to reduce complexity, increase efficiency and lower cost of ownership

 Open standards based with best-in-class partnerships

For more information: http://www.netapp.com/us/company/leadership/big-data/

Next Steps - Team with the Experts

 Strategic Assessment

 Business goals

 Data growth needs

 Use case discovery (partner delivery)

 Consult

 Solution architecture and design (NetApp delivery)

 Deploy

 Installation and implementation (NetApp delivery)

 Solution implementation (partner delivery)

Support options:

Global support available from NetApp and partners

Thank You
