… and other stuff
Who Am I?
• http://www.linkedin.com/in/richardbhpark
• rpark@linkedin.com
I'm this guy!
Hadoop… what is it good for?
Directly influenced by Hadoop
Indirectly influenced by Hadoop
Additionally, 50% for business analytics
A long long time ago (or 2009)
• 40 million members
• Apache Hadoop 0.19
• 20-node cluster
• Machines built from Fry's (pizza boxes)
• PYMK (People You May Know) in 3 days!
Now-ish
• Over 5,000 nodes
• 6 clusters (1 production, 1 dev, 2 ETL, 2 test)
• Apache Hadoop 1.0.4 (Hadoop 2.0 soon-ish)
• Security turned on
• About 900 users
• 15-20K Hadoop job submissions a day
• PYMK in < 12 hours!
Current Setup
• Use Avro (mostly)
• Dev/ad hoc cluster
o Used for development and testing of workflows
o For analytic queries
• Prod clusters
o Data that will appear on our website
o Only reviewed workflows
• ETL clusters
o Walled off
Three Common Problems
[Diagram: Hadoop cluster (not to scale) with the three problem areas: data in, processing, data out]
Databases (c. 2009-2010)
• Originally pulled directly through JDBC on a backup DB
o Pulled deltas when available and merged
• Data comes extra late (wait for replication of replicas)
o Large data pulls affected by daily locks
• Very manual: schemas, connections, repairs
• No deltas meant no Sqoop
• Costly (Oracle)
[Diagram: live site DBs → offline copies in the DWH (24 hr) → Hadoop (5-12 hr)]
Databases (Present)
• Commit logs/deltas from production
• Copied directly to HDFS
• Converted/merged to Avro (write sketch below)
• Schema is inferred
[Diagram: live site DBs → Hadoop, < 12 hr]
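The conversion step lands the merged records as Avro files. A minimal sketch of writing Avro with the Java API; the record schema and field names here are invented for illustration, and a local file stands in for the HDFS destination:

    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import java.io.File;

    public class AvroWriteSketch {
      public static void main(String[] args) throws Exception {
        // Hypothetical schema for a merged member-update record
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"MemberUpdate\",\"fields\":["
          + "{\"name\":\"memberId\",\"type\":\"long\"},"
          + "{\"name\":\"field\",\"type\":\"string\"},"
          + "{\"name\":\"value\",\"type\":\"string\"}]}");

        GenericRecord rec = new GenericData.Record(schema);
        rec.put("memberId", 42L);
        rec.put("field", "headline");
        rec.put("value", "Hadoop person");

        // In production this would be written to HDFS instead of a local file
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
          writer.create(schema, new File("member_update.avro"));
          writer.append(rec);
        }
      }
    }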
Databases (Future 2014?)
• Diffs sent directly to Hadoop
• Avro format
• Lazily merge
• Explicit schema
[Diagram: datastores → Databus → Hadoop, < 15 min]
Webtrack (c. 2009-2011)
• Flat files (XML)
• Pulled from every server periodically, grouped and gzipped
• Uploaded into Hadoop
• Failures nearly untraceable
[Diagram: servers → NAS → ? → NAS → Hadoop. I seriously don't know how many hops and copies.]
Webtrack (Present)
• Apache Kafka!! Yay!
• Avro in, Avro out
• Automatic pulls into Hadoop
• Auditing
[Diagram: many Kafka brokers feeding Hadoop, 5-10 mins end to end]
Apache Kafka
• LinkedIn events
• Service metrics
• Use schema registry (sketch below)
o Compact data (MD5)
o Auto register
o Validate schema
o Get latest schema
• Migrating to Kafka 0.8
o Replication
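The schema-registry idea can be sketched roughly as: each Avro message carries the MD5 hash of its writer schema instead of the full schema text, and readers resolve that hash against the registry. This is a hypothetical illustration, not the actual LinkedIn registry API:

    import java.nio.ByteBuffer;
    import java.security.MessageDigest;
    import java.util.HashMap;
    import java.util.Map;

    public class SchemaRegistrySketch {
      // Hypothetical in-memory registry: MD5 of schema text -> schema text
      private final Map<ByteBuffer, String> byHash = new HashMap<>();

      public byte[] register(String schemaJson) throws Exception {
        byte[] md5 = MessageDigest.getInstance("MD5").digest(schemaJson.getBytes("UTF-8"));
        byHash.put(ByteBuffer.wrap(md5), schemaJson);   // "auto register"
        return md5;                                     // 16 bytes prepended to every message
      }

      public String lookup(byte[] md5) {                // "get latest schema" by id
        return byHash.get(ByteBuffer.wrap(md5));
      }

      // Envelope: 16-byte schema hash followed by the Avro-encoded payload
      public byte[] wrap(byte[] md5, byte[] avroPayload) {
        return ByteBuffer.allocate(md5.length + avroPayload.length)
                         .put(md5).put(avroPayload).array();
      }
    }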
Apache Kafka + Hadoop = Camus
• Avro only
• Uses ZooKeeper
o Discover new topics
o Find all brokers
o Find all partitions
• Mappers pull from Kafka
• Keeps offsets in HDFS (sketch below)
• Partitions data by hour
• Counts incoming events
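Keeping offsets in HDFS is what lets each Camus run resume where the previous one stopped. A rough sketch of that checkpointing idea using the Hadoop FileSystem API; the path and file format are invented for illustration, not Camus's actual layout:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    public class OffsetCheckpointSketch {
      // Hypothetical location; Camus keeps its own execution/history directories
      private static final Path OFFSETS = new Path("/data/camus/offsets/latest.tsv");

      public static void save(FileSystem fs, String topic, int partition, long offset) throws Exception {
        try (FSDataOutputStream out = fs.create(OFFSETS, true)) {  // overwrite previous checkpoint
          out.writeBytes(topic + "\t" + partition + "\t" + offset + "\n");
        }
      }

      public static long load(FileSystem fs) throws Exception {
        try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(OFFSETS)))) {
          String[] parts = in.readLine().split("\t");
          return Long.parseLong(parts[2]);                         // resume from this offset
        }
      }

      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        save(fs, "page_view_event", 0, 123456L);
        System.out.println("next run starts at offset " + load(fs));
      }
    }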
Kafka Auditing
• Use Kafka to audit itself
• Tool to audit and alert
• Compare counts (sketch below)
• Kafka 0.8?
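The count comparison boils down to: producers and consumers each report how many events they saw per topic per time bucket, and the audit tool flags buckets that disagree. A toy version of that check (names and key format are hypothetical):

    import java.util.Map;

    public class AuditCheckSketch {
      // counts keyed by "topic/hour", e.g. "page_view_event/2013-06-25-14"
      public static void compare(Map<String, Long> produced, Map<String, Long> consumed) {
        for (Map.Entry<String, Long> e : produced.entrySet()) {
          long got = consumed.getOrDefault(e.getKey(), 0L);
          if (got != e.getValue()) {
            // in the real tool this would alert someone instead of printing
            System.out.printf("ALERT %s: produced %d, consumed %d%n", e.getKey(), e.getValue(), got);
          }
        }
      }
    }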
Lessons We Learned
• Avoid lots of small files
• Automation with auditing = sleep for me
• Group similar data = smaller/faster
• Spend time writing to spend less time reading
o Convert to binary, partition, compress
• Future:
o Adaptive replication (higher for new data, lower for old)
o Metadata store (HCatalog)
o Columnar store (ORC? Parquet?)
Pure Java
• Time-consuming to write jobs
• Little code re-use
• Shoot yourself in the face
• Only used when necessary
o Performance
o Memory
• Lots of libraries to help with the boilerplate (example below)
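For context on the boilerplate: even the classic word count needs a mapper class, a reducer class, and a driver in raw MapReduce. A minimal, stock example:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
      public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
          for (String tok : value.toString().split("\\s+")) {
            word.set(tok);
            ctx.write(word, ONE);        // emit (word, 1) for each token
          }
        }
      }

      public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) sum += v.get();
          ctx.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }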
Little Piggy (Apache Pig)
• Mainly a pigsty (Pig 0.11)
• Used by data products
• Transparent
• Good performance, tunable
• UDFs, DataFu (UDF sketch below)
• Tuples and bags? WTF
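Custom logic usually goes into Java UDFs (DataFu is a library of these). A minimal sketch of a Pig EvalFunc, with a made-up function name:

    import java.io.IOException;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    // Hypothetical UDF: upper-cases its first argument.
    // From Pig:  REGISTER my-udfs.jar;  b = FOREACH a GENERATE ToUpper(name);
    public class ToUpper extends EvalFunc<String> {
      @Override
      public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
          return null;                  // pass nulls through
        }
        return input.get(0).toString().toUpperCase();
      }
    }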
Hive
• Hive 0.11
• Only for ad hoc queries
o Biz ops, PMs, analysts
• Hard to tune
• Easy to use
• Lots of adoption
• ETL data in external tables :/
• HiveServer2 for JDBC (example below)
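Connecting to HiveServer2 over JDBC looks roughly like this; the host, user, and table are placeholders:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcSketch {
      public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");   // HiveServer2 driver
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hiveserver.example.com:10000/default", "rpark", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT page, COUNT(*) FROM page_views GROUP BY page")) {
          while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
          }
        }
      }
    }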
Disturbing Mascot
Future in Processing
• Giraph
• Impala, Shark/Spark, etc.
• Tez
• Crunch
• Other?
• Say no to streaming
Workflows
Azkaban
• Run Hadoop jobs in order
• Run regular schedules
• Be notified on failures
• Understand how flows are executed
• View execution history
• Easy to use
Azkaban @ LinkedIn
• Used at LinkedIn since early 2009
• Powers all our Hadoop data products
• Been using 2.0+ since late 2012
• 2.0 and 2.1 quietly released early 2013
Azkaban @ LinkedIn
• One Azkaban instance per cluster
• 6 clusters total
• 900 users
• 1,500 projects
• 10,000 flows
• 2,500 flow executions per day
• 6,500 job executions per day
Azkaban (before)
Engineer-designed UI...
Azkaban 2.0
Azkaban Features
• Schedules DAGs for execution
• Web UI
• Simple job files to create dependencies (example below)
• Authorization/authentication
• Project isolation
• Extensible through plugins (works with any version of Hadoop)
• Prison for dark wizards
Azkaban - Upload
• Zip job files, jars, project files and upload
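A flow is just a set of small .job property files zipped together; the dependencies key defines the DAG. A sketch of two hypothetical job files (names and commands are made up):

    # process-data.job
    type=command
    command=hadoop jar my-etl.jar com.example.ProcessData

    # load-voldemort.job -- runs only after process-data succeeds
    type=command
    dependencies=process-data
    command=hadoop jar my-etl.jar com.example.BuildAndPush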
Azkaban - Execute
Azkaban - Schedule
Azkaban - Viewer Plugins
HDFS Browser
Reportal
Future Azkaban Work
• Higher availability
• Generic triggering/actions
• Embedded graphs
• Conditional branching
• Admin client
Voldemort
• Distributed key-value store (client sketch below)
• Based on Amazon Dynamo
• Pluggable
• Open source
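Reading and writing from a Java client looks roughly like this; the bootstrap URL, store name, and keys are placeholders, following the client example on project-voldemort.com:

    import voldemort.client.ClientConfig;
    import voldemort.client.SocketStoreClientFactory;
    import voldemort.client.StoreClient;
    import voldemort.client.StoreClientFactory;
    import voldemort.versioning.Versioned;

    public class VoldemortClientSketch {
      public static void main(String[] args) {
        StoreClientFactory factory = new SocketStoreClientFactory(
            new ClientConfig().setBootstrapUrls("tcp://voldemort.example.com:6666"));
        StoreClient<String, String> client = factory.getStoreClient("member-features");

        client.put("member:42", "{\"pymk\":[7,13,99]}");   // write
        Versioned<String> value = client.get("member:42"); // read (versioned for conflict resolution)
        System.out.println(value.getValue());
      }
    }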
Voldemort Read-Only
• Filesystem store for read-only
• Create data files and index on Hadoop
• Copy data to Voldemort
• Swap
Voldemort + Hadoop
• Transfers are parallel
• Transfers records in bulk
• Ability to roll back
• Simple, operationally low maintenance
• Why not HBase or Cassandra?
o Legacy, and no compelling reason to change
o Simplicity is nice
o Real answer: I don't know. It works, we're happy.
Apache Kafka
• Reverse the flow
• Messages produced by Hadoop
• Consumer upstream takes action
• Used for emails, r/w store updates, etc., where Voldemort doesn't make sense (producer sketch below)
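Producing a message from the Hadoop side is just a normal Kafka producer call. A sketch with the current Java producer API (newer than the 0.8-era client this deck would have used); the topic and broker names are placeholders:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class HadoopToKafkaSketch {
      public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka.example.com:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
          // e.g. tell an upstream email service that a batch-computed digest is ready
          producer.send(new ProducerRecord<>("email-digest-ready", "member:42", "digest-2013-06-25"));
        }
      }
    }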
Misc Hadoop at LinkedIn
• Believe in optimization
o File size, task count and utilization
o Reviews, culture
• Strict limits
o Quotas on size/file count
o 10K task limit
• Use the capacity scheduler
o Default queue with a 15 min limit
o "Marathon" queue for others
We do a lot with little…
• 50-60% cluster utilization
o Or about 5x more work than some other companies
• Every job is reviewed for production
o Teaches good practices
o Scheduled to optimize utilization
o Prevents future headaches
• These keep our team size small
o Since 2009, Hadoop users grew 90x, clusters grew 25x, LinkedIn employees grew 15x, and the Hadoop team grew 5x (to 5 people)
More info
Our data site: data.linkedin.com
Kafka: kafka.apache.org
Azkaban: azkaban.github.io/azkaban2
Voldemort: project-voldemort.com
The End