… and other stuff
Who Am I?
• http://www.linkedin.com/in/richardbhpark
• rpark@linkedin.com
I'm this guy!
Hadoop… what is it good for?
Directly influenced by Hadoop
Indirectly influenced by Hadoop
Additionally, 50% for business analytics
A long long time ago (or 2009)
• 40 million members
• Apache Hadoop 0.19
• 20-node cluster
• Machines built from Fry's (pizza boxes)
• PYMK (People You May Know) in 3 days!
Now-ish
• Over 5,000 nodes
• 6 clusters (1 production, 1 dev, 2 ETL, 2 test)
• Apache Hadoop 1.0.4 (Hadoop 2.0 soon-ish)
• Security turned on
• About 900 users
• 15-20K Hadoop job submissions a day
• PYMK in < 12 hours!
Current Setup
• Use Avro (mostly)
• Dev/ad hoc cluster
o Used for development and testing of workflows
o For analytic queries
• Prod clusters
o Data that will appear on our website
o Only reviewed workflows
• ETL clusters
o Walled off
Three Common Problems
[Diagram: Hadoop cluster (not to scale) with the three problem areas: data in, processing, data out]
Databases (c. 2009-2010)
• Originally pulled directly through JDBC on a backup DB
o Pulled deltas when available and merged
• Data comes extra late (wait for replication of replicas)
o Large data pulls affected by daily locks
• Very manual: schemas, connections, repairs
• No deltas meant no Sqoop
• Costly (Oracle)
[Diagram: live site DBs → offline copies in the DWH (24 hr) → Hadoop (5-12 hr)]
Databases (Present)
• Commit logs/deltas from production
• Copied directly to HDFS
• Converted/merged to Avro (write sketch below)
• Schema is inferred
[Diagram: live site DBs → Hadoop, < 12 hr]
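The conversion step lands the merged records as Avro files. A minimal sketch of writing Avro with the Java API; the record schema and field names here are invented for illustration, and a local file stands in for the HDFS destination:

    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import java.io.File;

    public class AvroWriteSketch {
      public static void main(String[] args) throws Exception {
        // Hypothetical schema for a merged member-update record
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"MemberUpdate\",\"fields\":["
          + "{\"name\":\"memberId\",\"type\":\"long\"},"
          + "{\"name\":\"field\",\"type\":\"string\"},"
          + "{\"name\":\"value\",\"type\":\"string\"}]}");

        GenericRecord rec = new GenericData.Record(schema);
        rec.put("memberId", 42L);
        rec.put("field", "headline");
        rec.put("value", "Hadoop person");

        // In production this would be written to HDFS instead of a local file
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
          writer.create(schema, new File("member_update.avro"));
          writer.append(rec);
        }
      }
    }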
Databases (Future 2014?)
• Diffs sent directly to Hadoop
• Avro format
• Lazily merge
• Explicit schema
[Diagram: datastores → Databus → Hadoop, < 15 min]
Webtrack (c. 2009-2011)
• Flat files (XML)
• Pulled from every server periodically, grouped and gzipped
• Uploaded into Hadoop
• Failures nearly untraceable
[Diagram: servers → NAS → ? → NAS → Hadoop. I seriously don't know how many hops and copies.]
Webtrack (Present)
• Apache Kafka!! Yay!
• Avro in, Avro out
• Automatic pulls into Hadoop
• Auditing
[Diagram: many Kafka brokers feeding Hadoop, 5-10 mins end to end]
Apache Kafka
• LinkedIn events
• Service metrics
• Use schema registry (sketch below)
o Compact data (MD5)
o Auto register
o Validate schema
o Get latest schema
• Migrating to Kafka 0.8
o Replication
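The schema-registry idea can be sketched roughly as: each Avro message carries the MD5 hash of its writer schema instead of the full schema text, and readers resolve that hash against the registry. This is a hypothetical illustration, not the actual LinkedIn registry API:

    import java.nio.ByteBuffer;
    import java.security.MessageDigest;
    import java.util.HashMap;
    import java.util.Map;

    public class SchemaRegistrySketch {
      // Hypothetical in-memory registry: MD5 of schema text -> schema text
      private final Map<ByteBuffer, String> byHash = new HashMap<>();

      public byte[] register(String schemaJson) throws Exception {
        byte[] md5 = MessageDigest.getInstance("MD5").digest(schemaJson.getBytes("UTF-8"));
        byHash.put(ByteBuffer.wrap(md5), schemaJson);   // "auto register"
        return md5;                                     // 16 bytes prepended to every message
      }

      public String lookup(byte[] md5) {                // "get latest schema" by id
        return byHash.get(ByteBuffer.wrap(md5));
      }

      // Envelope: 16-byte schema hash followed by the Avro-encoded payload
      public byte[] wrap(byte[] md5, byte[] avroPayload) {
        return ByteBuffer.allocate(md5.length + avroPayload.length)
                         .put(md5).put(avroPayload).array();
      }
    }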
Apache Kafka + Hadoop = Camus
• Avro only
• Uses ZooKeeper
o Discover new topics
o Find all brokers
o Find all partitions
• Mappers pull from Kafka
• Keeps offsets in HDFS (sketch below)
• Partitions data by hour
• Counts incoming events
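Keeping offsets in HDFS is what lets each Camus run resume where the previous one stopped. A rough sketch of that checkpointing idea using the Hadoop FileSystem API; the path and file format are invented for illustration, not Camus's actual layout:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    public class OffsetCheckpointSketch {
      // Hypothetical location; Camus keeps its own execution/history directories
      private static final Path OFFSETS = new Path("/data/camus/offsets/latest.tsv");

      public static void save(FileSystem fs, String topic, int partition, long offset) throws Exception {
        try (FSDataOutputStream out = fs.create(OFFSETS, true)) {  // overwrite previous checkpoint
          out.writeBytes(topic + "\t" + partition + "\t" + offset + "\n");
        }
      }

      public static long load(FileSystem fs) throws Exception {
        try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(OFFSETS)))) {
          String[] parts = in.readLine().split("\t");
          return Long.parseLong(parts[2]);                         // resume from this offset
        }
      }

      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        save(fs, "page_view_event", 0, 123456L);
        System.out.println("next run starts at offset " + load(fs));
      }
    }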
Kafka Auditing
• Use Kafka to audit itself
• Tool to audit and alert
• Compare counts (sketch below)
• Kafka 0.8?
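The count comparison boils down to: producers and consumers each report how many events they saw per topic per time bucket, and the audit tool flags buckets that disagree. A toy version of that check (names and key format are hypothetical):

    import java.util.Map;

    public class AuditCheckSketch {
      // counts keyed by "topic/hour", e.g. "page_view_event/2013-06-25-14"
      public static void compare(Map<String, Long> produced, Map<String, Long> consumed) {
        for (Map.Entry<String, Long> e : produced.entrySet()) {
          long got = consumed.getOrDefault(e.getKey(), 0L);
          if (got != e.getValue()) {
            // in the real tool this would alert someone instead of printing
            System.out.printf("ALERT %s: produced %d, consumed %d%n", e.getKey(), e.getValue(), got);
          }
        }
      }
    }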
Lessons We Learned
• Avoid lots of small files
• Automation with auditing = sleep for me
• Group similar data = smaller/faster
• Spend time writing to spend less time reading
o Convert to binary, partition, compress
• Future:
o Adaptive replication (higher for new data, lower for old)
o Metadata store (HCatalog)
o Columnar store (ORC? Parquet?)
Pure Java
• Time-consuming to write jobs
• Little code re-use
• Shoot yourself in the face
• Only used when necessary
o Performance
o Memory
• Lots of libraries to help with the boilerplate (example below)
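For context on the boilerplate: even the classic word count needs a mapper class, a reducer class, and a driver in raw MapReduce. A minimal, stock example:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
      public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
          for (String tok : value.toString().split("\\s+")) {
            word.set(tok);
            ctx.write(word, ONE);        // emit (word, 1) for each token
          }
        }
      }

      public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) sum += v.get();
          ctx.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }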
Little Piggy (Apache Pig)
• Mainly a pigsty (Pig 0.11)
• Used by data products
• Transparent
• Good performance, tunable
• UDFs, DataFu (UDF sketch below)
• Tuples and bags? WTF
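Custom logic usually goes into Java UDFs (DataFu is a library of these). A minimal sketch of a Pig EvalFunc, with a made-up function name:

    import java.io.IOException;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    // Hypothetical UDF: upper-cases its first argument.
    // From Pig:  REGISTER my-udfs.jar;  b = FOREACH a GENERATE ToUpper(name);
    public class ToUpper extends EvalFunc<String> {
      @Override
      public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
          return null;                  // pass nulls through
        }
        return input.get(0).toString().toUpperCase();
      }
    }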
Hive
• Hive 0.11
• Only for ad hoc queries
o Biz ops, PMs, analysts
• Hard to tune
• Easy to use
• Lots of adoption
• ETL data in external tables :/
• HiveServer2 for JDBC (example below)
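Connecting to HiveServer2 over JDBC looks roughly like this; the host, user, and table are placeholders:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcSketch {
      public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");   // HiveServer2 driver
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hiveserver.example.com:10000/default", "rpark", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT page, COUNT(*) FROM page_views GROUP BY page")) {
          while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
          }
        }
      }
    }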
Disturbing Mascot
Future in Processing
• Giraph
• Impala, Shark/Spark, etc.
• Tez
• Crunch
• Other?
• Say no to streaming
Workflows
Azkaban
• Run Hadoop jobs in order
• Run regular schedules
• Be notified on failures
• Understand how flows are executed
• View execution history
• Easy to use
Azkaban @ LinkedIn
• Used at LinkedIn since early 2009
• Powers all our Hadoop data products
• Been using 2.0+ since late 2012
• 2.0 and 2.1 quietly released early 2013
Azkaban @ LinkedIn
• One Azkaban instance per cluster
• 6 clusters total
• 900 users
• 1,500 projects
• 10,000 flows
• 2,500 flow executions per day
• 6,500 job executions per day
Azkaban (before)
Engineer-designed UI...
Azkaban 2.0
Azkaban Features
• Schedules DAGs for execution
• Web UI
• Simple job files to create dependencies (example below)
• Authorization/authentication
• Project isolation
• Extensible through plugins (works with any version of Hadoop)
• Prison for dark wizards
Azkaban - Upload
• Zip job files, jars, project files and upload
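A flow is just a set of small .job property files zipped together; the dependencies key defines the DAG. A sketch of two hypothetical job files (names and commands are made up):

    # process-data.job
    type=command
    command=hadoop jar my-etl.jar com.example.ProcessData

    # load-voldemort.job -- runs only after process-data succeeds
    type=command
    dependencies=process-data
    command=hadoop jar my-etl.jar com.example.BuildAndPush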
Azkaban - Execute
Azkaban - Schedule
Azkaban - Viewer Plugins
HDFS Browser
Reportal
Future Azkaban Work
• Higher availability
• Generic triggering/actions
• Embedded graphs
• Conditional branching
• Admin client
Voldemort
• Distributed key-value store (client sketch below)
• Based on Amazon Dynamo
• Pluggable
• Open source
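Reading and writing from a Java client looks roughly like this; the bootstrap URL, store name, and keys are placeholders, following the client example on project-voldemort.com:

    import voldemort.client.ClientConfig;
    import voldemort.client.SocketStoreClientFactory;
    import voldemort.client.StoreClient;
    import voldemort.client.StoreClientFactory;
    import voldemort.versioning.Versioned;

    public class VoldemortClientSketch {
      public static void main(String[] args) {
        StoreClientFactory factory = new SocketStoreClientFactory(
            new ClientConfig().setBootstrapUrls("tcp://voldemort.example.com:6666"));
        StoreClient<String, String> client = factory.getStoreClient("member-features");

        client.put("member:42", "{\"pymk\":[7,13,99]}");   // write
        Versioned<String> value = client.get("member:42"); // read (versioned for conflict resolution)
        System.out.println(value.getValue());
      }
    }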
Voldemort Read-Only
• Filesystem store for read-only
• Create data files and index on Hadoop
• Copy data to Voldemort
• Swap
Voldemort + Hadoop
• Transfers are parallel
• Transfers records in bulk
• Ability to roll back
• Simple, operationally low maintenance
• Why not HBase or Cassandra?
o Legacy, and no compelling reason to change
o Simplicity is nice
o Real answer: I don't know. It works, we're happy.
Apache Kafka
• Reverse the flow
• Messages produced by Hadoop
• Consumer upstream takes action
• Used for emails, r/w store updates, etc., where Voldemort doesn't make sense (producer sketch below)
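Producing a message from the Hadoop side is just a normal Kafka producer call. A sketch with the current Java producer API (newer than the 0.8-era client this deck would have used); the topic and broker names are placeholders:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class HadoopToKafkaSketch {
      public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka.example.com:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
          // e.g. tell an upstream email service that a batch-computed digest is ready
          producer.send(new ProducerRecord<>("email-digest-ready", "member:42", "digest-2013-06-25"));
        }
      }
    }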
Misc Hadoop at LinkedIn
• Believe in optimization
o File size, task count and utilization
o Reviews, culture
• Strict limits
o Quotas on size/file count
o 10K task limit
• Use the capacity scheduler
o Default queue with a 15 min limit
o "Marathon" queue for others
We do a lot with little…
• 50-60% cluster utilization
o Or about 5x more work than some other companies
• Every job is reviewed for production
o Teaches good practices
o Scheduled to optimize utilization
o Prevents future headaches
• These keep our team size small
o Since 2009, Hadoop users grew 90x, clusters grew 25x, LinkedIn employees grew 15x, and the Hadoop team grew 5x (to 5 people)
More info
Our data site: data.linkedin.com
Kafka: kafka.apache.org
Azkaban: azkaban.github.io/azkaban2
Voldemort: project-voldemort.com
The End