HADOOP MONITORING AND DIAGNOSTICS: CHALLENGES AND LESSONS LEARNED
Matthew Jacobs
[email protected]

About this Talk
• Building monitoring and diagnostic tools for Hadoop
• How we think about Hadoop monitoring and diagnostics
• Interesting problems we have
• A few things we've learned in the process

What is Hadoop?
• Platform for distributed processing and storage of petabytes of data on clusters of commodity hardware
• Operating system for the cluster
• Services that interact and are composable
  • HDFS, MapReduce, HBase, Pig, Hive, ZK, etc...
• Open source
  • Different Apache projects, different communities

Managing the Complexity
• Hadoop distributions, e.g. Cloudera's CDH
  • Packaged services, well tested
• Existing tools
  • Ganglia, Nagios, Chef, Puppet, etc.
• Management tools for Hadoop
  • Cloudera Manager
  • Deployment, configuration, reporting, monitoring, diagnosis
  • Used by operators @ Fortune 50 companies

Thinking about Hadoop
Hadoop: services with many hosts
rather than: hosts with many services
• Tools should be service-oriented
• Most general existing management tools are host-oriented

Monitoring
• Provide insight into the operation of the system
• Challenges:
  • Knowing what to collect
  • Collecting, storing efficiently at scale
  • Deciding how to present data

Hadoop Monitoring Data (1)
• Operators care about
  • Resource and scheduling information
  • Performance and health metrics
  • Important log events
• Come from
  • Metrics exposed via JMX (metrics/metrics2)
  • Logs (Hadoop services, OS)
  • Operating system (/proc, syscalls, etc.)

Hadoop Monitoring Data (2)
• Choosing what to collect
  • Not all! Some are just confusing
    • e.g. DN corrupt replicas vs. blocks with corrupt replicas
  • We’re filtering for users
  • Add more when we see customer problems
• But…
  • Interfaces change between versions
  • Just messy

Example Metric Data, HDFS
• I/O metrics, read/write bytes, counts
• Blocks, replicas, corruptions
• FS info, volume failures, usage/capacity
• NameNode info, time since checkpoint, transactions since checkpoint, num DNs failed
• Many more...

Hadoop Monitoring, What to show
• Building an intuitive user interface is hard
  • Especially for a complex system like Hadoop
• Need service-oriented view
• Pre-baked visualizations (charts, heatmaps, etc.)
• Generic data visualization capabilities
  • Experts know exactly what they want to see
  • e.g. chart number of corrupt DN block replicas by rack

Diagnostics
• Inform operators when something is wrong
  • E.g. datanode has too many corrupt blocks
• Hard problem
  • No single solution
  • Need multiple tools for diagnosis
• Really don't want to be wrong
  • Operators lose faith in the tool

Health Checks
• Set of rule-based checks for specific problems
  • Simple, stateless, based on metric data
  • Well targeted, catch real problems
  • Easy to get 'right'
• Learn from real customer problems
  • Add checks when customers hit hard-to-diagnose problems
  • E.g. customer saw slow HBase reads
    • Hard to find! bad switch → packet frame errors

Health Checks, Examples
• HDFS missing blocks, corrupt replicas
• DataNode connectivity, volume failures
• NameNode checkpoint age, safe mode
• GC duration, number of file descriptors, etc...
• Canary-based checks
  • e.g. can write a file to HDFS, can perform basic HBase operations
• Many more...

Health Checks (2)
• Not good for performance, context-aware issues
• Have to build manually, time consuming
• Can take these further
  • Add more knowledge about root cause
  • Taking actions in some cases

Anomaly Detection
• Simple statistics, e.g. std deviation
• More clever machine learning algorithms
  • Local outliers in high-dimensional 'metric space'
  • Streaming algorithm seems feasible
• Identify what's abnormal for a particular cluster
• Must use carefully – outlier != problem
  • Measure of 'potential interestingness'

Other Diagnostic Tools and Challenges
• Anomaly detection via log data
• Need data across services
  • E.g. slow HBase reads caused by HDFS latency
• Better instrumentation in platform
  • E.g. Dapper-like tracing through the stack (HBASE-6449)
  • Future work to extend to HDFS

Challenges: Hadoop Fault Tolerance
• Hadoop is built to tolerate failures
  • E.g. HDFS replication
• Not clear when to report a problem
  • E.g. 1 failed DN may not be concerning enough

Challenges in Diagnostics (2)
• Entities interact
  • E.g. health of HDFS depends on health of DNs, NNs, etc…
  • Relations describe graph of computation to evaluate health
• Evaluating cluster/service/host health becomes challenging
  • Data arrives from different sources at different times
  • When to evaluate health? Every minute? When data changes?
• Complete failures >> partial failures

Challenges Operating at Scale (1)
• Building a distributed system to monitor a distributed system
• Collect metrics for lots of 'things' (entities)
  • DataNodes, NameNodes, TaskTrackers, JobTrackers, RegionServers, Regions, etc.
  • Hosts, disks, NICs, data directories, etc.
• Aggregate many metrics too
  • e.g. aggregate DN metrics → HDFS-wide metrics
  • e.g. aggregate region metrics → table metrics
  • Cluster-wide, service-wide, rack-wide, etc.
• Becomes a big data problem

Challenges Operating at Scale (2)
• At 1000 nodes...
  • Hundreds of thousands of entities
  • Millions of metrics written per minute
  • Increase polling? Every 30 sec? 10 sec?
• Simple RDBMS is OK for a while...
  • Shard, partition, etc.

Storage for Monitoring Data
• Hadoop (HBase) + OpenTSDB is great
  • But we don't eat our own tail...
• Can use other TS databases
• Modify HBase, make 'embedded' version
  • Just a single node, just a single Region
• Or use LevelDB
  • Fast key-value store from Google, open source

LevelDB (or HBase), an example
• Data model, simplified
  • Have time series for many entities: tsId
    • e.g. DNs, Regions, hosts, disks, etc.
  • Have many metric streams: metricId
    • e.g. DN bytes read, JVM gc count, etc.
• LevelDB, fast key-value store
  • Key: byte array of “<tsId><metricId><timestamp>”
  • Value: data

LevelDB example (2)
• Can write many data points per row
  • Timestamp in key is timestamp base
  • Write each data point time delta before value
  • E.g. value: “<delta1><val1><delta2><val2>...”
    or “<delta1><delta2>...<val1><val2>...”
• Will compress well
• Very similar to what OpenTSDB does

QUESTIONS?
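
Appendix sketch: the key and value layout from the LevelDB example slides could be encoded roughly as below. This is a minimal illustration, not the actual implementation. The field widths (8-byte tsId, 4-byte metricId, 8-byte base timestamp, 4-byte delta, 8-byte float value), the function names, and the use of seconds for timestamps are all assumptions made for the example. Big-endian packing is chosen so that byte-wise key order matches (tsId, metricId, timestamp) order, which is what a sorted key-value store needs for range scans over one metric stream.

```python
import struct

# Assumed fixed-width layouts (hypothetical, not from the talk):
# key   = <tsId: 8 bytes><metricId: 4 bytes><base timestamp: 8 bytes>
# point = <delta: 4 bytes><value: 8-byte float>
# ">" means big-endian with no padding, so keys sort byte-wise
# in (tsId, metricId, timestamp) order.
KEY_FORMAT = ">QIQ"
POINT_FORMAT = ">Id"


def encode_key(ts_id, metric_id, base_timestamp):
    """Build the row key "<tsId><metricId><timestamp>" as bytes."""
    return struct.pack(KEY_FORMAT, ts_id, metric_id, base_timestamp)


def decode_key(key):
    """Return (ts_id, metric_id, base_timestamp) from a row key."""
    return struct.unpack(KEY_FORMAT, key)


def encode_row(base_timestamp, points):
    """Pack (timestamp, value) pairs as "<delta1><val1><delta2><val2>...".

    Each delta is the offset from the base timestamp stored in the key.
    struct.pack raises struct.error if a delta is negative or does not
    fit in 4 bytes.
    """
    chunks = []
    for timestamp, value in points:
        delta = timestamp - base_timestamp
        chunks.append(struct.pack(POINT_FORMAT, delta, value))
    return b"".join(chunks)


def decode_row(base_timestamp, row):
    """Recover the (timestamp, value) pairs written by encode_row."""
    points = []
    for delta, value in struct.iter_unpack(POINT_FORMAT, row):
        points.append((base_timestamp + delta, value))
    return points
```

The resulting key and row bytes would be handed to the store's put operation; reading a time range for one metric stream then becomes a scan between two encoded keys. The interleaved "<delta><val>" form is shown here; the alternative grouped form from the slide ("<delta1><delta2>...<val1><val2>...") would pack all deltas first and all values after.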