

Hadoop @ LinkedIn

… and other stuff

Who Am I?

• http://www.linkedin.com/in/richardbhpark

• rpark@linkedin.com

I'm this guy!

Hadoop… what is it good for?

Products directly influenced by Hadoop

Products indirectly influenced by Hadoop

Additionally, ~50% of usage is for business analytics

A long long time ago (or 2009)

40 Million members

Apache Hadoop 0.19

20 node cluster

Machines built from Fry's parts (pizza boxes)

PYMK (People You May Know) in 3 days!

Now-ish

Over 5000 nodes

6 clusters (1 production, 1 dev, 2 ETL, 2 test)

Apache Hadoop 1.0.4 (Hadoop 2.0 soon-ish)

Security turned on

About 900 users

15-20K Hadoop job submissions a day

PYMK < 12 hours!

Current Setup

Use Avro (mostly)

Dev/ad hoc cluster
o Used for development and testing of workflows
o For analytic queries

Prod clusters
o Data that will appear on our website
o Only reviewed workflows

ETL clusters
o Walled off

Three Common Problems

[Diagram: Data In → Hadoop cluster (processing data) → Data Out. (not to scale)]

Data In

Databases (c. 2009-2010)

Originally pulled directly through JDBC on a backup DB
o Pulled deltas when available and merged

Data arrives extra late (waiting for replication of replicas)
o Large data pulls affected by daily locks

Very manual: schemas, connections, repairs

No deltas meant no Sqoop

Costly (Oracle)

[Diagram: live-site databases replicate to offline copies (24 hr), which feed the DWH and Hadoop (5-12 hr)]

Databases (Present)

Commit logs/deltas from Production

Copied directly to HDFS

Converted/merged to Avro (sketch below)

Schema is inferred

[Diagram: live-site databases → Hadoop, < 12 hr end to end]
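A minimal sketch of the merge step, assuming change records carry a primary key and a commit timestamp; the names here are hypothetical, not our actual job:

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Hypothetical delta merge: group base + delta records by primary key
    // and keep the record with the newest commit timestamp.
    public class DeltaMerge {
        static class Change {
            final long key; final long commitTs; final byte[] row;
            Change(long key, long commitTs, byte[] row) {
                this.key = key; this.commitTs = commitTs; this.row = row;
            }
        }

        // In the real flow this would be a reduce over records grouped by key.
        static Map<Long, Change> merge(List<Change> baseAndDeltas) {
            Map<Long, Change> latest = new HashMap<Long, Change>();
            for (Change c : baseAndDeltas) {
                Change prev = latest.get(c.key);
                if (prev == null || c.commitTs > prev.commitTs) {
                    latest.put(c.key, c);
                }
            }
            return latest;
        }
    }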

Databases (Future 2014?)

Diffs sent directly to Hadoop

Avro format

Lazily merge

Explicit schema

[Diagram: datastores → Databus → Hadoop, < 15 min]

Webtrack (c. 2009-2011)

Flat files (XML)

Pulled from every server periodically, grouped, and gzipped

Uploaded into Hadoop

Failures nearly untraceable

[Diagram: servers → NAS → ? → NAS → Hadoop. I seriously don't know how many hops and copies.]

Webtrack (Present)

Apache Kafka!! Yay!

Avro in, Avro out

Automatic pulls into Hadoop

Auditing

[Diagram: services → Kafka brokers → Hadoop, 5-10 mins end to end]

Apache Kafka

LinkedIn Events

Service metrics

Use schema registry (sketch below)
o Compact data (md5)
o Auto register
o Validate schema
o Get latest schema

Migrating to Kafka 0.8
o Replication
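A minimal sketch of the md5 framing, assuming each message carries a magic byte plus the MD5 of its writer schema ahead of the Avro payload; the exact byte layout is an assumption, not our wire format:

    import java.io.ByteArrayOutputStream;
    import java.security.MessageDigest;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.EncoderFactory;

    public class SchemaIdFraming {
        // Frame = magic byte + 16-byte MD5 of the writer schema + Avro binary.
        // A consumer resolves the MD5 against the schema registry to decode.
        static byte[] encode(Schema schema, GenericRecord record) throws Exception {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            out.write(0x0); // magic byte (assumed value)
            out.write(MessageDigest.getInstance("MD5")
                .digest(schema.toString(false).getBytes("UTF-8")));
            BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
            new GenericDatumWriter<GenericRecord>(schema).write(record, enc);
            enc.flush();
            return out.toByteArray();
        }
    }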

Apache Kafka + Hadoop = Camus

Avro only

Uses ZooKeeper
o Discover new topics
o Find all brokers
o Find all partitions

Mappers pull from Kafka

Keeps offsets in HDFS (sketch below)

Partitions output by hour

Counts incoming events
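The offset bookkeeping looks roughly like this: a sketch in the spirit of Camus with made-up paths and file format (Camus's real classes differ):

    import java.io.IOException;
    import java.text.SimpleDateFormat;
    import java.util.Date;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Per-(topic, partition) offsets live in HDFS so the next run resumes
    // where the last one stopped; output paths are bucketed by hour.
    public class OffsetStore {
        private final FileSystem fs;
        private final Path base; // e.g. /data/camus/offsets (assumed layout)

        public OffsetStore(Configuration conf, Path base) throws IOException {
            this.fs = FileSystem.get(conf);
            this.base = base;
        }

        public long read(String topic, int partition) throws IOException {
            Path p = new Path(base, topic + "/" + partition);
            if (!fs.exists(p)) return 0L; // new topic: start from the beginning
            FSDataInputStream in = fs.open(p);
            try { return in.readLong(); } finally { in.close(); }
        }

        public void write(String topic, int partition, long offset) throws IOException {
            Path p = new Path(base, topic + "/" + partition);
            FSDataOutputStream out = fs.create(p, true); // overwrite old offset
            try { out.writeLong(offset); } finally { out.close(); }
        }

        // Hourly partitioning, e.g. /data/tracking/<topic>/2013/06/26/14
        public static String hourlyPath(String topic, long eventTimeMs) {
            return "/data/tracking/" + topic + "/"
                + new SimpleDateFormat("yyyy/MM/dd/HH").format(new Date(eventTimeMs));
        }
    }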

Kafka Auditing

Use Kafka to Audit itself

Tool to audit and alert

Compare counts (sketch below)

Kafka 0.8?
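Conceptually the check is simple; a sketch, assuming each tier (producers, brokers, Hadoop) reports per-topic counts for the same time window onto an audit topic:

    import java.util.Map;

    // Hypothetical audit check: alert when tiers disagree on how many
    // events a topic saw in a given window, beyond some tolerance.
    public class AuditCheck {
        static boolean countsMatch(Map<String, Long> countsByTier, double tolerance) {
            long min = Long.MAX_VALUE, max = 0;
            for (long c : countsByTier.values()) {
                if (c < min) min = c;
                if (c > max) max = c;
            }
            if (max == 0) return true; // nothing flowed anywhere
            return (max - min) / (double) max <= tolerance;
        }
    }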

Lessons We Learned

Avoid lots of small files

Automation with Auditing = sleep for me

Group similar data = smaller/faster

Spend time writing to spend less time reading (sketch after this list)
o Convert to binary, partition, compress

Future:
o Adaptive replication (higher for new data, lower for old)
o Metadata store (HCatalog)
o Columnar store (ORC? Parquet?)
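For example, "convert to binary, partition, compress" can be as simple as writing a deflate-compressed Avro container file; the schema and field names below are invented for illustration:

    import java.io.File;
    import java.io.IOException;
    import org.apache.avro.Schema;
    import org.apache.avro.file.CodecFactory;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    // Write records as a compressed binary Avro file instead of raw text.
    public class AvroWriteExample {
        public static void main(String[] args) throws IOException {
            Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"PageView\",\"fields\":["
              + "{\"name\":\"memberId\",\"type\":\"long\"},"
              + "{\"name\":\"url\",\"type\":\"string\"}]}");

            DataFileWriter<GenericRecord> writer =
                new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
            writer.setCodec(CodecFactory.deflateCodec(6)); // block compression
            writer.create(schema, new File("part-00000.avro"));

            GenericRecord r = new GenericData.Record(schema);
            r.put("memberId", 42L);
            r.put("url", "http://www.linkedin.com/");
            writer.append(r);
            writer.close();
        }
    }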

Processing Data

Pure Java

Time-consuming to write jobs (sketch at the end of this list)

Little code re-use

Shoot yourself in the face

Only used when necessary
o Performance
o Memory

Lots of libraries to help (boilerplate stuff)
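For a sense of the boilerplate, here is the smallest useful piece of a raw Java job: a sketch of a mapper counting events by type, with an assumed log format:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Even a trivial job needs this much ceremony (hypothetical example).
    public class EventCountMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text eventType = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            // assume tab-separated logs with the event type in column 0
            eventType.set(value.toString().split("\t", 2)[0]);
            ctx.write(eventType, ONE);
        }
    }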

Little Piggy (Apache Pig)

Mainly a pigsty (Pig 0.11)

Used by data products

Transparent

Good performance, tunable

UDFs, DataFu (sketch below)

Tuples and bags? WTF
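Writing a UDF is a small amount of Java; a minimal sketch (the UDF itself is made up, not one of ours or DataFu's):

    import java.io.IOException;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    // Hypothetical Pig UDF: upper-cases its first field.
    // In Pig: REGISTER myudfs.jar; B = FOREACH A GENERATE ToUpper(name);
    public class ToUpper extends EvalFunc<String> {
        @Override
        public String exec(Tuple input) throws IOException {
            if (input == null || input.size() == 0 || input.get(0) == null) {
                return null;
            }
            return ((String) input.get(0)).toUpperCase();
        }
    }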

Hive

Hive 0.11

Only for ad hoc queries
o Biz ops, PMs, analysts

Hard to tune

Easy to use

Lots of adoption

ETL data in external tables :/

HiveServer2 for JDBC (sketch below)
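A minimal sketch of an ad hoc query through HiveServer2's JDBC driver; host, credentials, and table are placeholders:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcExample {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hiveserver.example.com:10000/default", "user", "");
            Statement stmt = conn.createStatement();
            ResultSet rs = stmt.executeQuery(
                "SELECT page, COUNT(*) FROM page_views GROUP BY page");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
            rs.close(); stmt.close(); conn.close();
        }
    }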

Disturbing Mascot

Future in Processing

Giraph

Impala, Shark/Spark… etc

Tez

Crunch

Other?

Say no to streaming

Workflows

Azkaban

Run Hadoop jobs in order

Run regular schedules

Be notified on failures

Understand how flows are executed

View execution history

Easy to use

Azkaban @ LinkedIn

Used in LinkedIn since early 2009

Powers all our Hadoop data products

Been using 2.0+ since late 2012

2.0 and 2.1 quietly released early 2013

Azkaban @ LinkedIn

One Azkaban instance per cluster

6 clusters total

900 Users

1500 projects

10,000 flows

2,500 flow executions per day

6,500 job executions per day

Azkaban (before)

Engineer-designed UI...

Azkaban 2.0

Azkaban Features

Schedules DAGs for executions

Web UI

Simple job files to create dependencies (example after this list)

Authorization/Authentication

Project Isolation

Extensible through plugins (works with any version of Hadoop)

Prison for dark wizards
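Dependencies are declared in plain job files; a minimal made-up pair where foo waits for bar, in Azkaban 2 property syntax:

    # bar.job
    type=command
    command=echo "load data"

    # foo.job
    type=command
    command=echo "build index"
    dependencies=bar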

Azkaban - Upload

Zip Job files, jars, project files

Azkaban - Execute

Azkaban - Schedule

Azkaban - Viewer Plugins

HDFS Browser

Reportal

Future Azkaban Work

Higher availability

Generic Triggering/Actions

Embedded graphs

Conditional branching

Admin client

Data Out

Voldemort

Distributed Key-Value Store

Based on Amazon Dynamo

Pluggable

Open-source

Voldemort Read-Only

Filesystem store for RO

Create data files and index on Hadoop

Copy data to Voldemort

Swap (conceptual sketch below)
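The swap/rollback idea, sketched with plain filesystem operations; this is a conceptual illustration, not Voldemort's actual API. Each push lands in a fresh version directory and a "current" link is atomically repointed:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;

    // Conceptual store swap: readers follow "current"; rollback is just
    // repointing it at the previous version directory.
    public class StoreSwap {
        static void pointCurrentAt(Path storeDir, int version) throws IOException {
            Path target = storeDir.resolve("version-" + version);
            Path tmp = storeDir.resolve("current.tmp");
            Files.deleteIfExists(tmp);
            Files.createSymbolicLink(tmp, target);
            // on POSIX, rename(2) replaces the old link atomically
            Files.move(tmp, storeDir.resolve("current"),
                       StandardCopyOption.ATOMIC_MOVE);
        }
    }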

Voldemort + Hadoop

Transfers are parallel

Transfer records in bulk

Ability to Roll back

Simple, operationally low maintenance

Why not HBase or Cassandra?
o Legacy, and no compelling reason to change
o Simplicity is nice
o Real answer: I don't know. It works, we're happy.

Apache Kafka

Reverse the flow

Messages produced by Hadoop

Consumer upstream takes action

Used for emails, r/w store updates, where Voldemort doesn't make sense, etc. (sketch below)
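A sketch of the producing side using the Kafka 0.8 Java producer; broker list, topic, and payload are placeholders:

    import java.util.Properties;
    import kafka.javaapi.producer.Producer;
    import kafka.producer.KeyedMessage;
    import kafka.producer.ProducerConfig;

    // Reverse the flow: a job emits its results as Kafka messages for an
    // upstream consumer (e.g. the email service) to act on.
    public class ResultPublisher {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("metadata.broker.list", "broker1:9092,broker2:9092");
            props.put("serializer.class", "kafka.serializer.DefaultEncoder");

            Producer<byte[], byte[]> producer =
                new Producer<byte[], byte[]>(new ProducerConfig(props));
            byte[] payload = "member=42,template=digest".getBytes(); // stand-in for an Avro record
            producer.send(new KeyedMessage<byte[], byte[]>("email-queue", payload));
            producer.close();
        }
    }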

Nearing the End

Misc Hadoop at LinkedIn

Believe in optimization
o File size, task count, and utilization
o Reviews, culture

Strict limits
o Quotas on size/file count
o 10K task limit

Use capacity scheduler
o Default queue with 15-minute limit
o marathon queue for others

We do a lot with little…

50-60% cluster utilization
o Or about 5x more work than some other companies

Every job is reviewed for production
o Teaches good practices
o Scheduled to optimize utilization
o Prevents future headaches

These keep our team size small
o Since 2009, Hadoop users grew 90x, clusters grew 25x, LinkedIn employees grew 15x
o Hadoop team grew 5x (to 5 people)

More info

Our data site: data.linkedin.com

Kafka: kafka.apache.org

Azkaban: azkaban.github.io/azkaban2

Voldemort: project-voldemort.com

The End
