Monitoring and diagnostics tools for Hadoop

HADOOP MONITORING AND DIAGNOSTICS: CHALLENGES AND LESSONS LEARNED
Matthew Jacobs
mj@cloudera.com

About this Talk
• Building monitoring and diagnostic tools for Hadoop
• How we think about Hadoop monitoring and diagnostics
• Interesting problems we have
• A few things we've learned in the process

What is Hadoop?
• Platform for distributed processing and storage of petabytes of data on clusters of commodity hardware
• Operating system for the cluster
• Services that interact and are composable
  • HDFS, MapReduce, HBase, Pig, Hive, ZK, etc...
• Open source
  • Different Apache projects, different communities

Managing the Complexity
• Hadoop distributions, e.g. Cloudera's CDH
  • Packaged services, well tested
• Existing tools
  • Ganglia, Nagios, Chef, Puppet, etc.
• Management tools for Hadoop
  • Cloudera Manager
  • Deployment, configuration, reporting, monitoring, diagnosis
  • Used by operators @ Fortune 50 companies

Thinking about Hadoop
Hadoop: services with many hosts
rather than: hosts with many services
• Tools should be service-oriented
• Most general existing management tools are host-oriented

Monitoring
• Provide insight into the operation of the system
• Challenges:
  • Knowing what to collect
  • Collecting, storing efficiently at scale
  • Deciding how to present data

Hadoop Monitoring Data (1)
• Operators care about
  • Resource and scheduling information
  • Performance and health metrics
  • Important log events
• Data comes from
  • Metrics exposed via JMX (metrics/metrics2)
  • Logs (Hadoop services, OS)
  • Operating system (/proc, syscalls, etc.)

Hadoop Monitoring Data (2)
• Choosing what to collect
  • Not all! Some are just confusing
  • e.g. DN corrupt replicas vs. blocks with corrupt replicas
  • We’re filtering for users
  • Add more when we see customer problems
• But…
  • Interfaces change between versions
  • Just messy

Example Metric Data, HDFS
• I/O metrics, read/write bytes, counts
• Blocks, replicas, corruptions
• FS info, volume failures, usage/capacity
• NameNode info, time since checkpoint, transactions since checkpoint, num DNs failed
• Many more...

Hadoop Monitoring, What to show
• Building an intuitive user interface is hard
  • Especially for a complex system like Hadoop
• Need service-oriented view
• Pre-baked visualizations (charts, heatmaps, etc.)
• Generic data visualization capabilities
  • Experts know exactly what they want to see
  • e.g. chart number of corrupt DN block replicas by rack

Diagnostics
• Inform operators when something is wrong
  • E.g. datanode has too many corrupt blocks
• Hard problem
  • No single solution
  • Need multiple tools for diagnosis
• Really don't want to be wrong
  • Operators lose faith in the tool

Health Checks
• Set of rule-based checks for specific problems
  • Simple, stateless, based on metric data
  • Well targeted, catch real problems
  • Easy to get 'right'
• Learn from real customer problems
  • Add checks when customers hit hard-to-diagnose problems
  • E.g. customer saw slow HBase reads
  • Hard to find! bad switch → packet frame errors

Health Checks, Examples
• HDFS missing blocks, corrupt replicas
• DataNode connectivity, volume failures
• NameNode checkpoint age, safe mode
• GC duration, number of file descriptors, etc...
• Canary-based checks
  • e.g. can write a file to HDFS, can perform basic HBase operations
• Many more...

Health Checks (2)
• Not good for performance, context-aware issues
• Have to build manually, time consuming
• Can take these further
  • Add more knowledge about root cause
  • Taking actions in some cases

Anomaly Detection
• Simple statistics, e.g. std deviation
• More clever machine learning algorithms
  • Local outliers in high-dimensional 'metric space'
  • Streaming algorithm seems feasible
• Identify what's abnormal for a particular cluster
• Must use carefully – outlier != problem
  • Measure of 'potential interestingness'

Other Diagnostic Tools and Challenges
• Anomaly detection via log data
• Need data across services
  • E.g. slow HBase reads caused by HDFS latency
• Better instrumentation in platform
  • E.g. Dapper-like tracing through the stack (HBASE-6449)
  • Future work to extend to HDFS

Challenges: Hadoop Fault Tolerance
• Hadoop is built to tolerate failures
  • E.g. HDFS replication
• Not clear when to report a problem
  • E.g. 1 failed DN maybe not concerning enough

Challenges in Diagnostics (2)
• Entities interact
  • E.g. health of HDFS depends on health of DNs, NNs, etc…
  • Relations describe graph of computation to evaluate health
• Evaluating cluster/service/host health becomes challenging
  • Data arrives from different sources at different times
  • When to evaluate health? Every minute? When data changes?
• Complete failures >> partial failures

Challenges Operating at Scale (1)
• Building a distributed system to monitor a distributed system
• Collect metrics for lots of 'things' (entities)
  • DataNodes, NameNodes, TaskTrackers, JobTrackers, RegionServers, Regions, etc.
  • Hosts, disks, NICs, data directories, etc.
• Aggregate many metrics too
  • e.g. aggregate DN metrics → HDFS-wide metrics
  • e.g. aggregate region metrics → table metrics
  • Cluster-wide, service-wide, rack-wide, etc.
• Becomes a big data problem

Challenges Operating at Scale (2)
• At 1000 nodes...
  • Hundreds of thousands of entities
  • Millions of metrics written per minute
  • Increase polling? Every 30 sec? 10 sec?
• Simple RDBMS is OK for a while...
  • Shard, partition, etc.

Storage for Monitoring Data
• Hadoop (HBase) + OpenTSDB is great
  • But we don't eat our own tail...
• Can use other TS databases
• Modify HBase, make 'embedded' version
  • Just a single node, just a single Region
• Or use LevelDB
  • Fast key-value store from Google, open source

LevelDB (or HBase), an example
• Data model, simplified
• Have time series for many entities: tsId
  • e.g. DNs, Regions, hosts, disks, etc.
• Have many metric streams: metricId
  • e.g. DN bytes read, JVM gc count, etc.
• LevelDB, fast key-value store
  • Key: byte array of “<tsId><metricId><timestamp>”
  • Value: data

LevelDB example (2)
• Can write many data points per row
  • Timestamp in key is timestamp base
  • Write each data point time delta before value
  • E.g. value: “<delta1><val1><delta2><val2>...” or “<delta1><delta2>...<val1><val2>...”
  • Will compress well
• Very similar to what OpenTSDB does

QUESTIONS?
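
Sketch: the LevelDB row layout from the last two slides

The key and value layout described in the LevelDB example slides can be sketched in a few lines of Python. This is a minimal illustration, not the encoding the talk's system actually uses. The field widths (8-byte tsId, 4-byte metricId, 8-byte base timestamp, 2-byte delta, 8-byte float value), the one-hour row span, and the names encode_row, decode_row, make_key and row_base are all assumptions made for the example. It uses only the standard library and does not call LevelDB itself; any ordered key-value store would take the resulting byte strings.

```python
import struct

# Assumption: one row holds up to an hour of points, so deltas fit in 2 bytes.
ROW_SPAN = 3600

# Assumed fixed-width, big-endian layouts. Big-endian unsigned fields make the
# byte-wise key order match numeric (tsId, metricId, timestamp) order, which is
# what an ordered key-value store needs for range scans.
KEY = struct.Struct(">QIQ")    # <tsId><metricId><base timestamp>
POINT = struct.Struct(">Hd")   # <delta seconds><value>


def row_base(ts):
    """Round a timestamp down to the start of its row."""
    return ts - ts % ROW_SPAN


def make_key(ts_id, metric_id, ts):
    """Build the row key; the timestamp in the key is the row's base."""
    return KEY.pack(ts_id, metric_id, row_base(ts))


def encode_row(ts_id, metric_id, points):
    """Pack (timestamp, value) pairs as <delta1><val1><delta2><val2>..."""
    base = row_base(points[0][0])
    chunks = []
    for ts, value in points:
        delta = ts - base
        if not 0 <= delta < ROW_SPAN:
            raise ValueError("point does not belong to this row")
        chunks.append(POINT.pack(delta, value))
    return make_key(ts_id, metric_id, base), b"".join(chunks)


def decode_row(key, value):
    """Recover (tsId, metricId, [(timestamp, value), ...]) from one row."""
    ts_id, metric_id, base = KEY.unpack(key)
    points = [(base + delta, v) for delta, v in POINT.iter_unpack(value)]
    return ts_id, metric_id, points
```

Calling encode_row(7, 42, points) returns a 20-byte key and 10 bytes of value per data point under these assumed widths, and decode_row reverses it. Because each point stores only a small delta instead of a full timestamp, the values are compact before any block compression is applied. The alternative layout on the slide, with all deltas first and all values after, would group similar bytes together; this sketch shows only the interleaved form.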
