Monitoring and diagnostics tools for Hadoop

Matthew Jacobs
About this Talk
• Building monitoring and diagnostic tools
for Hadoop
• How we think about Hadoop monitoring
and diagnostics
• Interesting problems we have
• A few things we've learned in the process
What is Hadoop?
• Platform for distributed processing and
storage of petabytes of data on clusters of
commodity hardware
• Operating system for the cluster
• Services that interact and are composable
• HDFS, MapReduce, HBase, Pig, Hive, ZK, etc.
• Open source
• Different Apache projects, different communities
Managing the Complexity
• Hadoop distributions, e.g. Cloudera's CDH
• Packaged services, well tested
• Existing tools
• Ganglia, Nagios, Chef, Puppet, etc.
• Management tools for Hadoop
• Cloudera Manager
• Deployment, configuration, reporting,
monitoring, diagnosis
• Used by operators at Fortune 50 companies
Thinking about Hadoop
Hadoop: services with many hosts
rather than: hosts with many services
• Tools should be service-oriented
• Most general existing management tools
are host-oriented
• Provide insight into the operation of the
• Challenges:
• Knowing what to collect
• Collecting, storing efficiently at scale
• Deciding how to present data
Hadoop Monitoring Data (1)
• Operators care about
• Resource and scheduling information
• Performance and health metrics
• Important log events
• Come from
• Metrics exposed via JMX (metrics/metrics2)
• Logs (Hadoop services, OS)
• Operating system (/proc, syscalls, etc.)
Hadoop Monitoring Data (2)
• Choosing what to collect
• Not all! Some are just confusing
• e.g. DN corrupt replicas vs. blocks with corrupt replicas
• We’re filtering for users
• Add more when we see customer
• But…
• Interfaces change between versions
• Just messy
Example Metric Data, HDFS
• I/O metrics, read/write bytes, counts
• Blocks, replicas, corruptions
• FS info, volume failures, usage/capacity
• NameNode info, time since checkpoint,
transactions since checkpoint, num DNs
• Many more...
Hadoop Monitoring, What to show
• Building an intuitive user interface is hard
• Especially for a complex system like Hadoop
• Need service-oriented view
• Pre-baked visualizations (charts,
heatmaps, etc.)
• Generic data visualization capabilities
• Experts know exactly what they want to see
• e.g. chart number of corrupt DN block replicas
by rack
• Inform operators when something is wrong
• E.g. datanode has too many corrupt blocks
• Hard problem
• No single solution
• Need multiple tools for diagnosis
• Really don't want to be wrong
• Operators lose faith in the tool
Health Checks
• Set of rule-based checks for specific
• Simple, stateless, based on metric data
• Well targeted, catch real problems
• Easy to get 'right'
• Learn from real customer problems
• Add checks when customers hit hard-to-diagnose
• E.g. customer saw slow HBase reads
• Hard to find! Bad switch → packet frame errors
Health Checks, Examples
• HDFS missing blocks, corrupt replicas
• DataNode connectivity, volume failures
• NameNode checkpoint age, safe mode
• GC duration, number of file descriptors, etc.
• Canary-based checks
• e.g. can write a file to HDFS,
can perform basic HBase operations
• Many more...
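A simple, stateless, rule-based check like the ones listed above could be sketched as a pure function of the latest metric sample. This is only an illustration: the metric names (`missing_blocks`, `seconds_since_checkpoint`), the thresholds, and the three-level result are assumptions, not the checks shipped in any real tool.

```python
from dataclasses import dataclass
from typing import Dict

GOOD, CONCERNING, BAD = "GOOD", "CONCERNING", "BAD"


@dataclass(frozen=True)
class ThresholdCheck:
    """Stateless rule: compare one metric against warn/fail thresholds."""
    name: str
    metric: str      # hypothetical metric name
    warn_at: float   # hypothetical threshold
    fail_at: float   # hypothetical threshold

    def evaluate(self, metrics: Dict[str, float]) -> str:
        value = metrics[self.metric]
        if value >= self.fail_at:
            return BAD
        if value >= self.warn_at:
            return CONCERNING
        return GOOD


# Illustrative rules only; names and numbers are made up for this sketch.
CHECKS = [
    ThresholdCheck("HDFS missing blocks", "missing_blocks", 1, 10),
    ThresholdCheck("NameNode checkpoint age", "seconds_since_checkpoint",
                   3600, 6 * 3600),
]


def run_checks(metrics: Dict[str, float]) -> Dict[str, str]:
    """Evaluate every rule against one snapshot of metric values."""
    return {check.name: check.evaluate(metrics) for check in CHECKS}
```

Because each rule looks only at the current snapshot, adding a check after a new customer problem means appending one entry to the list.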
Health Checks (2)
• Not good for performance, context-aware
• Have to build manually, time consuming
• Can take these further
• Add more knowledge about root cause
• Taking actions in some cases
Anomaly Detection
• Simple statistics, e.g. std deviation
• More clever machine learning algorithms
• Local outliers in high-dimensional 'metric space'
• Streaming algorithm seems feasible
• Identify what's abnormal for a particular
• Must use carefully – outlier != problem
• Measure of 'potential interestingness'
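The "simple statistics" option can be sketched as a standard-deviation score of the newest value against that entity's own history. The function name and the idea of treating the score as a ranking signal rather than an alert are assumptions for illustration.

```python
from statistics import mean, pstdev
from typing import Sequence


def interestingness(history: Sequence[float], latest: float) -> float:
    """Distance of `latest` from the history mean, in standard deviations.

    A high score means "unusual for this entity", not "broken".
    """
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        # Flat history: any change at all is maximally unusual.
        return 0.0 if latest == mu else float("inf")
    return abs(latest - mu) / sigma
```

A caller might rank entities by this score and surface the top few for a human to look at, which keeps the "outlier != problem" caveat intact.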
Other Diagnostic Tools and Challenges
• Anomaly detection via log data
• Need data across services
• E.g. slow HBase reads caused by HDFS
• Better instrumentation in platform
• E.g. Dapper-like tracing through the stack
• Future work to extend to HDFS
Challenges: Hadoop Fault Tolerance
• Hadoop is built to tolerate failures
• E.g. HDFS replication
• Not clear when to report a problem
• E.g. 1 failed DN may not be concerning
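One way to account for fault tolerance is to judge the fleet by the fraction of failed DataNodes rather than by any single failure. A minimal sketch follows; the 5% and 20% thresholds are hypothetical and would need tuning per cluster.

```python
def datanode_fleet_health(total: int, failed: int,
                          warn_fraction: float = 0.05,
                          fail_fraction: float = 0.20) -> str:
    """Grade HDFS by the share of DataNodes that are down.

    Thresholds are illustrative assumptions, not recommended values.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    fraction = failed / total
    if fraction >= fail_fraction:
        return "BAD"
    if fraction >= warn_fraction:
        return "CONCERNING"
    return "GOOD"
```

With these defaults, one failed DN out of 1000 stays GOOD, while 3 out of 10 is BAD.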
Challenges in Diagnostics (2)
• Entities interact
• E.g. health of HDFS depends on health of DNs, NNs,
• Relations describe graph of computation to evaluate
• Evaluating cluster/service/host health becomes
• Data arrives from different sources at different
• When to evaluate health? Every minute? When data
• Complete failures >> partial failures
Challenges Operating at Scale (1)
• Building a distributed system to monitor a
distributed system
• Collect metrics for lots of 'things' (entities)
• DataNodes, NameNodes, TaskTrackers,
JobTrackers, RegionServers, Regions, etc.
• Hosts, disks, NICs, data directories, etc.
• Aggregate many metrics too
• e.g. aggregate DN metrics → HDFS-wide metrics
aggregate region metrics → table metrics
• Cluster-wide, service-wide, rack-wide, etc.
• Becomes a big data problem
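The roll-ups above (DN → HDFS-wide, rack-wide, and so on) amount to grouping per-entity samples by some attribute and combining them. A minimal sketch, assuming each sample is an (entity attributes, value) pair and that summing is the right combiner for the metric; the attribute names are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Tuple

Entity = Dict[str, str]


def aggregate(samples: Iterable[Tuple[Entity, float]],
              key: Callable[[Entity], str]) -> Dict[str, float]:
    """Sum one metric across entities, grouped by key(entity)."""
    totals: Dict[str, float] = defaultdict(float)
    for entity, value in samples:
        totals[key(entity)] += value
    return dict(totals)


# Hypothetical DN "bytes read" samples for one collection interval.
dn_bytes_read = [
    ({"host": "dn1", "rack": "r1", "service": "hdfs"}, 100.0),
    ({"host": "dn2", "rack": "r1", "service": "hdfs"}, 250.0),
    ({"host": "dn3", "rack": "r2", "service": "hdfs"}, 50.0),
]

by_rack = aggregate(dn_bytes_read, lambda e: e["rack"])
by_service = aggregate(dn_bytes_read, lambda e: e["service"])
```

Every extra grouping multiplies the number of streams to store, which is part of why this becomes a big data problem.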
Challenges Operating at Scale (2)
• At 1000 nodes...
• Hundreds of thousands of entities
• Millions of metrics written per minute
• Increase polling frequency? Every 30 sec? 10 sec?
• Simple RDBMS is OK for a while...
• Shard, partition, etc.
Storage for Monitoring Data
• Hadoop (HBase) + OpenTSDB is great
• But we don't eat our own tail...
• Can use other TS databases
• Modify HBase, make 'embedded' version
• Just a single node, just a single Region
• Or use LevelDB
• Fast key-value store from Google, open source
LevelDB (or HBase), an example
• Data model, simplified
• Have time series for many entities: tsId
• e.g. DNs, Regions, hosts, disks, etc.
• Have many metric streams: metricId
• e.g. DN bytes read, JVM gc count, etc.
• LevelDB, fast key-value store
• Key: byte array of
• Value: data
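The slide does not spell out the key layout, so the following is an assumed one: tsId, metricId, and a timestamp base, each packed as a fixed-width big-endian unsigned integer. The field widths are arbitrary choices for the sketch. LevelDB sorts keys bytewise by default, so big-endian packing keeps one stream's rows adjacent and in time order.

```python
import struct

# Assumed layout: 8-byte tsId, 4-byte metricId, 8-byte timestamp base.
# ">" means big-endian with no padding, so byte order matches numeric order.
_KEY = struct.Struct(">QIQ")


def encode_key(ts_id: int, metric_id: int, ts_base: int) -> bytes:
    """Pack (tsId, metricId, timestamp base) into a sortable byte key."""
    return _KEY.pack(ts_id, metric_id, ts_base)


def decode_key(key: bytes) -> tuple:
    """Inverse of encode_key."""
    return _KEY.unpack(key)
```

With this ordering, reading one metric for one entity over a time range is a single contiguous scan between two keys.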
LevelDB example (2)
• Can write many data points per row
• Timestamp in key is timestamp base
• Write each data point's time delta before its value
• E.g. value: “<delta1><val1><delta2><val2>...”
or “<delta1><delta2>...<val1><val2>...”
• Will compress well
• Very similar to what OpenTSDB does
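The interleaved "<delta1><val1><delta2><val2>..." layout can be sketched as below. The one-hour row span, the 2-byte delta in seconds, and the 8-byte double for the value are assumptions chosen for the example, and the store itself is left out so the encoding stands alone.

```python
import struct
from typing import List, Tuple

ROW_SPAN = 3600                # assumption: one row covers one hour
_POINT = struct.Struct(">Hd")  # 2-byte delta (seconds) + 8-byte double


def row_base(ts: int) -> int:
    """Timestamp base stored in the key: ts rounded down to the row span."""
    return ts - ts % ROW_SPAN


def append_point(row_value: bytes, ts: int, value: float) -> bytes:
    """Append one <delta><val> pair to the row's value bytes."""
    delta = ts - row_base(ts)
    return row_value + _POINT.pack(delta, value)


def decode_row(base: int, row_value: bytes) -> List[Tuple[int, float]]:
    """Recover (timestamp, value) pairs from a row."""
    return [(base + delta, value)
            for delta, value in _POINT.iter_unpack(row_value)]
```

Each point costs 10 bytes here instead of a full key plus value, and the small, slowly growing deltas are the part that should compress well.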