Apache Kafka Guide Important Notice © 2010-2021 Cloudera, Inc. All rights reserved. Cloudera, the Cloudera logo, and any other product or service names or slogans contained in this document are trademarks of Cloudera and its suppliers or licensors, and may not be copied, imitated or used, in whole or in part, without the prior written permission of Cloudera or the applicable trademark holder. If this documentation includes code, including but not limited to, code examples, Cloudera makes this available to you under the terms of the Apache License, Version 2.0, including any required notices. A copy of the Apache License Version 2.0, including any notices, is included herein. A copy of the Apache License Version 2.0 can also be found here: https://opensource.org/licenses/Apache-2.0 Hadoop and the Hadoop elephant logo are trademarks of the Apache Software Foundation. All other trademarks, registered trademarks, product names and company names or logos mentioned in this document are the property of their respective owners. Reference to any products, services, processes or other information, by trade name, trademark, manufacturer, supplier or otherwise does not constitute or imply endorsement, sponsorship or recommendation thereof by us. Complying with all applicable copyright laws is the responsibility of the user. Without limiting the rights under copyright, no part of this document may be reproduced, stored in or introduced into a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopying, recording, or otherwise), or for any purpose, without the express written permission of Cloudera. Cloudera may have patents, patent applications, trademarks, copyrights, or other intellectual property rights covering subject matter in this document. Except as expressly provided in any written license agreement from Cloudera, the furnishing of this document does not give you any license to these patents, trademarks copyrights, or other intellectual property. For information about patents covering Cloudera products, see http://tiny.cloudera.com/patents. The information in this document is subject to change without notice. Cloudera shall not be liable for any damages resulting from technical errors or omissions which may be present in this document, or from use of this document. Cloudera, Inc. 
395 Page Mill Road Palo Alto, CA 94306 info@cloudera.com US: 1-888-789-1488 Intl: 1-650-362-0488 www.cloudera.com Release Information Version: CDH 6.0.x Date: August 2, 2021 Table of Contents Apache Kafka Guide.................................................................................................7 Ideal Publish-Subscribe System............................................................................................................................7 Kafka Architecture................................................................................................................................................7 Topics.....................................................................................................................................................................................8 Brokers...................................................................................................................................................................................8 Records..................................................................................................................................................................................9 Partitions................................................................................................................................................................................9 Record Order and Assignment.............................................................................................................................................10 Logs and Log Segments........................................................................................................................................................11 Kafka Brokers and ZooKeeper..............................................................................................................................................12 Kafka Setup............................................................................................................14 Hardware Requirements....................................................................................................................................14 Brokers.................................................................................................................................................................................14 ZooKeeper............................................................................................................................................................................14 Kafka Performance Considerations....................................................................................................................15 Operating System Requirements........................................................................................................................15 SUSE Linux Enterprise Server (SLES).....................................................................................................................................15 Kernel Limits ........................................................................................................................................................................15 Kafka in Cloudera Manager....................................................................................16 Kafka Clients..........................................................................................................17 Commands for Client 
Interactions......................................................................................................................17 Kafka Producers..................................................................................................................................................18 Kafka Consumers................................................................................................................................................19 Subscribing to a topic...........................................................................................................................................................19 Groups and Fetching............................................................................................................................................................20 Protocol between Consumer and Broker.............................................................................................................................20 Rebalancing Partitions.........................................................................................................................................................22 Consumer Configuration Properties.....................................................................................................................................23 Retries..................................................................................................................................................................................23 Kafka Clients and ZooKeeper..............................................................................................................................23 Kafka Brokers.........................................................................................................25 Single Cluster Scenarios.....................................................................................................................................25 Leader Positions...................................................................................................................................................................25 In-Sync Replicas....................................................................................................................................................................26 Topic Configuration............................................................................................................................................26 Topic Creation......................................................................................................................................................................27 Topic Properties...................................................................................................................................................................27 Partition Management.......................................................................................................................................27 Partition Reassignment........................................................................................................................................................28 Adding Partitions.................................................................................................................................................................28 Choosing the Number of Partitions......................................................................................................................................28 
Controller.............................................................................................................................................................................28 Kafka Integration....................................................................................................30 Kafka Security.....................................................................................................................................................30 Client-Broker Security with TLS............................................................................................................................................30 Using Kafka’s Inter-Broker Security......................................................................................................................................33 Enabling Kerberos Authentication.......................................................................................................................................34 Enabling Encryption at Rest.................................................................................................................................................35 Topic Authorization with Kerberos and Sentry.....................................................................................................................36 Managing Multiple Kafka Versions.....................................................................................................................39 Kafka Feature Support in Cloudera Manager and CDH........................................................................................................39 Client/Broker Compatibility Across Kafka Versions..............................................................................................................40 Upgrading your Kafka Cluster..............................................................................................................................................40 Managing Topics across Multiple Kafka Clusters................................................................................................42 Consumer/Producer Compatibility.......................................................................................................................................42 Topic Differences between Clusters......................................................................................................................................43 Optimize Mirror Maker Producer Location..........................................................................................................................43 Destination Cluster Configuration........................................................................................................................................43 Kerberos and Mirror Maker.................................................................................................................................................43 Setting up Mirror Maker in Cloudera Manager...................................................................................................................43 Setting up an End-to-End Data Streaming Pipeline............................................................................................44 Data Streaming Pipeline......................................................................................................................................................44 Ingest Using Kafka with Apache 
Flume................................................................................................................................44 Using Kafka with Apache Spark Streaming for Stream Processing......................................................................................51 Developing Kafka Clients....................................................................................................................................52 Simple Client Examples........................................................................................................................................................52 Moving Kafka Clients to Production.....................................................................................................................................55 Kafka Metrics......................................................................................................................................................57 Metrics Categories...............................................................................................................................................................57 Viewing Metrics...................................................................................................................................................................57 Building Cloudera Manager Charts with Kafka Metrics.......................................................................................................58 Kafka Administration..............................................................................................59 Kafka Administration Basics...............................................................................................................................59 Broker Log Management.....................................................................................................................................................59 Record Management...........................................................................................................................................................59 Broker Garbage Log Collection and Log Rotation................................................................................................................60 Adding Users as Kafka Administrators.................................................................................................................................60 Migrating Brokers in a Cluster............................................................................................................................60 Using rsync to Copy Files from One Broker to Another........................................................................................................61 Setting User Limits for Kafka..............................................................................................................................61 Quotas................................................................................................................................................................61 Setting Quotas.....................................................................................................................................................................62 Kafka Administration Using Command Line Tools..............................................................................................62 Unsupported Command Line 
Tools.....................................................................................................................................62 Notes on Kafka CLI Administration.....................................................................................................................................63 kafka-topics..........................................................................................................................................................................64 kafka-configs........................................................................................................................................................................64 kafka-console-consumer......................................................................................................................................................64 kafka-console-producer.......................................................................................................................................................65 kafka-consumer-groups.......................................................................................................................................................65 kafka-reassign-partitions.....................................................................................................................................................65 kafka-log-dirs.......................................................................................................................................................................69 kafka-*-perf-test..................................................................................................................................................................70 Enabling DEBUG or TRACE in command line scripts............................................................................................................71 Understanding the kafka-run-class Bash Script.................................................................................................................71 JBOD...................................................................................................................................................................71 JBOD Setup and Migration...................................................................................................................................................71 JBOD Operational Procedures..............................................................................................................................................74 Kafka Performance Tuning......................................................................................77 Tuning Brokers....................................................................................................................................................77 Tuning Producers................................................................................................................................................77 Tuning Consumers..............................................................................................................................................78 Mirror Maker Performance................................................................................................................................78 Kafka Tuning: Handling Large Messages.............................................................................................................78 Kafka Cluster 
Sizing............................................................................................................................................79 Cluster Sizing - Network and Disk Message Throughput.....................................................................................................79 Choosing the Number of Partitions for a Topic....................................................................................................................80 Kafka Performance Broker Configuration...........................................................................................................82 JVM and Garbage Collection................................................................................................................................................82 Network and I/O Threads....................................................................................................................................................82 ISR Management.................................................................................................................................................................82 Log Cleaner..........................................................................................................................................................................83 Kafka Performance: System-Level Broker Tuning...............................................................................................83 File Descriptor Limits............................................................................................................................................................83 Filesystems...........................................................................................................................................................................84 Virtual Memory Handling....................................................................................................................................................84 Networking Parameters.......................................................................................................................................................84 Configuring JMX Ephemeral Ports.......................................................................................................................................84 Kafka-ZooKeeper Performance Tuning...............................................................................................................85 Kafka Reference.....................................................................................................86 Metrics Reference..............................................................................................................................................86 Useful Shell Command Reference....................................................................................................................161 Hardware Information.......................................................................................................................................................161 Disk Space..........................................................................................................................................................................161 I/O Activity and Utilization.................................................................................................................................................161 File Descriptor 
Usage.........................................................................................................................................................162 Network Ports, States, and Connections............................................................................162 Process Information...........................................................................................................................................162 Kernel Configuration..........................................................................................................................................162 Kafka Public APIs..................................................................................................163 Kafka Frequently Asked Questions.......................................................................164 Basics................................................................................................................................164 Use Cases.........................................................................................................166 References........................................................................................................................172 Appendix: Apache License, Version 2.0.................................................................173

Apache Kafka Guide

Apache Kafka is a streaming message platform. It is designed to be high performance, highly available, and redundant. Examples of applications that can use such a platform include:
• Internet of Things. TVs, refrigerators, washing machines, dryers, thermostats, and personal health monitors can all send telemetry data back to a server through the Internet.
• Sensor Networks. Areas (farms, amusement parks, forests) and complex devices (engines) can be designed with an array of sensors to track data or current status.
• Positional Data. Delivery trucks or massively multiplayer online games can send location data to a central platform.
• Other Real-Time Data. Satellites and medical sensors can send information to a central area for processing.

Ideal Publish-Subscribe System

The ideal publish-subscribe system is straightforward: Publisher A’s messages must make their way to Subscriber A, Publisher B’s messages must make their way to Subscriber B, and so on.

Figure 1: Ideal Publish-Subscribe System

An ideal system has the following benefits:
• Unlimited Lookback. A new Subscriber A1 can read Publisher A’s stream at any point in time.
• Message Retention. No messages are lost.
• Unlimited Storage. The publish-subscribe system has unlimited storage of messages.
• No Downtime. The publish-subscribe system is never down.
• Unlimited Scaling. The publish-subscribe system can handle any number of publishers and/or subscribers with constant message delivery latency.

Now let's see how Kafka’s implementation relates to this ideal system.

Kafka Architecture

As is the case with all real-world systems, Kafka's architecture deviates from the ideal publish-subscribe system. Some of the key differences are:
• Messaging is implemented on top of a replicated, distributed commit log.
• The client has more functionality and, therefore, more responsibility.
• Messaging is optimized for batches instead of individual messages.
• Messages are retained even after they are consumed; they can be consumed again.

The results of these design decisions are:
• Extreme horizontal scalability
• Very high throughput
• High availability, but with different semantics and message delivery guarantees

The next few sections provide an overview of some of the more important parts, while later sections describe design specifics and operations in greater detail.

Topics

In the ideal system presented above, messages from one publisher would somehow find their way to each subscriber. Kafka implements the concept of a topic. A topic allows easy matching between publishers and subscribers.

Figure 2: Topics in a Publish-Subscribe System

A topic is a queue of messages written by one or more producers and read by one or more consumers. A topic is identified by its name. This name is part of a global namespace of that Kafka cluster. Specific to Kafka:
• Publishers are called producers.
• Subscribers are called consumers.
As each producer or consumer connects to the publish-subscribe system, it can read from or write to a specific topic.

Brokers

Kafka is a distributed system that implements the basic features of the ideal publish-subscribe system described above. Each host in the Kafka cluster runs a server called a broker that stores messages sent to the topics and serves consumer requests.

Figure 3: Brokers in a Publish-Subscribe System

Kafka is designed to run on multiple hosts, with one broker per host. If a host goes offline, Kafka does its best to ensure that the other hosts continue running. This solves part of the “No Downtime” and “Unlimited Scaling” goals from the ideal publish-subscribe system. Kafka brokers all talk to ZooKeeper for distributed coordination, which provides additional help toward the "Unlimited Scaling" goal from the ideal system. Topics are replicated across brokers. Replication is an important part of the “No Downtime,” “Unlimited Scaling,” and “Message Retention” goals. There is one broker that is responsible for coordinating the cluster. That broker is called the controller.

As mentioned earlier, an ideal topic behaves as a queue of messages. In reality, having a single queue has scaling issues. Kafka implements partitions to address these scaling issues and add robustness to topics.

Records

In Kafka, a publish-subscribe message is called a record. A record consists of a key/value pair and metadata, including a timestamp. The key is not required, but can be used to identify messages from the same data source. Kafka stores keys and values as arrays of bytes. It does not otherwise care about the format. The metadata of each record can include headers. Headers may store application-specific metadata as key-value pairs. In the context of the header, keys are strings and values are byte arrays. For specific details of the record format, see the Record definition in the Apache Kafka documentation.

Partitions

Instead of all records handled by the system being stored in a single log, Kafka divides records into partitions. Partitions can be thought of as a subset of all the records for a topic. Partitions help with the ideal of “Unlimited Scaling”. Records in the same partition are stored in order of arrival. When a topic is created, it is configured with two properties:

partition count
The number of partitions that records for this topic will be spread among.
replication factor
The number of copies of a partition that are maintained to ensure consumers always have access to the queue of records for a given topic.

Each partition has one leader copy, called the leader partition. If the replication factor is greater than one, there will be additional follower partitions. (For a replication factor of M, there will be M-1 follower partitions.) Any Kafka client (a producer or consumer) communicates only with the leader partition for data. All other partitions exist for redundancy and failover. Follower partitions are responsible for copying new records from their leader partitions. Ideally, the follower partitions have an exact copy of the contents of the leader. Such partitions are called in-sync replicas (ISR).

With N brokers and a topic replication factor of M:
• If M < N, each broker will have a subset of all the partitions.
• If M = N, each broker will have a complete copy of the partitions.
In the following illustration, there are N = 2 brokers and a replication factor of M = 2. Each producer may generate records that are assigned across multiple partitions.

Figure 4: Records in a Topic are Stored in Partitions, Partitions are Replicated across Brokers

Partitions are the key to maintaining good record throughput. Choosing the correct number of partitions and partition replicas for a topic:
• Spreads leader partitions evenly across brokers throughout the cluster.
• Keeps partitions within the same topic roughly the same size.
• Balances the load on brokers.

Record Order and Assignment

By default, Kafka assigns records to partitions in a round-robin fashion. There is no guarantee that records sent to multiple partitions will retain the order in which they were produced. Within a single consumer, your program will only see record ordering among the records belonging to the same partition. This tends to be sufficient for many use cases, but does add some complexity to the stream processing logic.

Tip: Kafka guarantees that records in the same partition will be in the same order in all replicas of that partition.

If the order of records is important, the producer can ensure that records are sent to the same partition. The producer can include metadata in the record to override the default assignment in one of two ways:
• The record can indicate a specific partition.
• The record can include an assignment key. The hash of the key and the number of partitions in the topic determines which partition the record is assigned to. Including the same key in multiple records ensures all the records are appended to the same partition.

Logs and Log Segments

Within each topic, each partition in Kafka stores records in a log-structured format. Conceptually, each record is stored sequentially in this type of “log”.

Figure 5: Partitions in Log Structured Format

Note: These references to “log” should not be confused with the location where the Kafka brokers store their operational logs.

In actuality, each partition does not keep all the records sequentially in a single file. Instead, it breaks each log into log segments. Log segments can be defined using a size limit (for example, 1 GB), a time limit (for example, 1 day), or both. Administration around Kafka records often occurs at the log segment level. Each of the partitions is broken into segments, with Segment N containing the most recent records and Segment 1 containing the oldest retained records. This is configurable on a per-topic basis.
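As a hedged illustration of that last point, the sketch below sets per-topic segment limits with the Kafka AdminClient (available in Kafka 0.11 and later); the same settings are normally applied with the kafka-topics or kafka-configs tools described later in this guide. The broker address is an assumption and the topic name test matches the command-line examples; note that alterConfigs replaces any existing per-topic overrides rather than merging with them.

import java.util.Arrays;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class SegmentConfigSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // Roll a new log segment for topic "test" at 1 GB or after 1 day,
            // whichever limit is reached first.
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "test");
            Config segmentLimits = new Config(Arrays.asList(
                    new ConfigEntry("segment.bytes", "1073741824"), // 1 GB size limit
                    new ConfigEntry("segment.ms", "86400000")));    // 1 day time limit
            admin.alterConfigs(Collections.singletonMap(topic, segmentLimits)).all().get();
        }
    }
}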
Apache Kafka Guide | 11 Apache Kafka Guide Figure 6: Partition Log Segments Kafka Brokers and ZooKeeper The broker, topic, and partition information are maintained in Zookeeper. In particular, the partition information, including partition and replica locations, updates fairly frequently. Because of frequent metadata refreshes, the connection between the brokers and the Zookeeper cluster needs to be reliable. Similarly, if the Zookeeper cluster has other intensive processes running on it, that can add sufficient latency to the broker/Zookeeper interactions to cause issues. • Kafka Controller maintains leadership via Zookeeper (shown in orange) • Kafka Brokers also store other relevant metadata in Zookeeper (also in orange) • Kafka Partitions maintain replica information in Zookeeper (shown in blue) 12 | Apache Kafka Guide Apache Kafka Guide Figure 7: Broker/ZooKeeper Dependencies Apache Kafka Guide | 13 Kafka Setup Kafka Setup Hardware Requirements Kafka can function on a fairly small amount of resources, especially with some configuration tuning. Out of the box configurations can run on little as 1 core and 1 GB memory with storage scaled based on data retention requirements. These are the defaults for both broker and Mirror Maker in Cloudera Manager version 6.x. Brokers Kafka requires a fairly small amount of resources, especially with some configuration tuning. By default, Kafka, can run on as little as 1 core and 1GB memory with storage scaled based on requirements for data retention. CPU is rarely a bottleneck because Kafka is I/O heavy, but a moderately-sized CPU with enough threads is still important to handle concurrent connections and background tasks. Kafka brokers tend to have a similar hardware profile to HDFS data nodes. How you build them depends on what is important for your Kafka use cases. Use the following guidelines: To affect performance of these features: Adjust these parameters: Message Retention Disk size Client Throughput (Producer & Consumer) Network capacity Producer throughput Disk I/O Consumer throughput Memory A common choice for a Kafka node is as follows: Component Broker Memory/Java Heap • RAM: 64 GB • Recommended Java heap: 4 GB CPU 12- 24 cores Set this value using the Java Heap Size of Broker Kafka configuration property. Disk • 1 HDD For operating system • 1 HDD for Zookeeper dataLogDir • 10- HDDs, using Raid 10, for Kafka data See Other Kafka Broker Properties table. MirrorMaker 1 GB heap Set this value using the Java Heap Size of MirrorMaker Kafka configuration property. 1 core per 3-4 streams No disk space needed on MirrorMaker instance. Destination brokers should have sufficient disk space to store the topics being copied over. Networking requirements: Gigabit Ethernet or 10 Gigabit Ethernet. Avoid clusters that span multiple data centers. ZooKeeper It is common to run ZooKeeper on 3 broker nodes that are dedicated for Kafka. However, for optimal performance Cloudera recommends the usage of dedicated Zookeeper hosts. This is especially true for larger, production environments. 14 | Apache Kafka Guide Kafka Setup Kafka Performance Considerations The simplest recommendation for running Kafka with maximum performance is to have dedicated hosts for the Kafka brokers and a dedicated ZooKeeper cluster for the Kafka cluster. If that is not an option, consider these additional guidelines for resource sharing with the Kafka cluster: Do not run in VMs It is common practice in modern data centers to run processes in virtual machines. 
This generally allows for better sharing of resources. Kafka is sufficiently sensitive to I/O throughput that VMs interfere with the regular operation of brokers. For this reason, it is highly recommended to not use VMs for Kafka; if you are running Kafka in a virtual environment you will need to rely on your VM vendor for help optimizing Kafka performance. Do not run other processes with Brokers or ZooKeeper Due to I/O contention with other processes, it is generally recommended to avoid running other such processes on the same hosts as Kafka brokers. Keep the Kafka-ZooKeeper Connection Stable Kafka relies heavily on having a stable ZooKeeper connection. Putting an unreliable network between Kafka and ZooKeeper will appear as if ZooKeeper is offline to Kafka. Examples of unreliable networks include: • Do not put Kafka/ZooKeeper nodes on separated networks • Do not put Kafka/ZooKeeper nodes on the same network with other high network loads Operating System Requirements SUSE Linux Enterprise Server (SLES) Unlike CentOS, SLES limits virtual memory by default. Changing this default requires adding the following entries to the /etc/security/limits.conf file: * hard as unlimited * soft as unlimited Kernel Limits There are three settings you must configure properly for the kernel. • File Descriptors You can set these in Cloudera Manager via Kafka > Configuration > Maximum Process File Descriptors. We recommend a configuration of 100000 or higher. • Max Memory Map You must configure this in your specific kernel settings. We recommend a configuration of 32000 or higher. • Max Socket Buffer Size Set the buffer size larger than any Kafka send buffers that you define. Apache Kafka Guide | 15 Kafka in Cloudera Manager Kafka in Cloudera Manager Cloudera Manager is the recommended way to administer your cluster, including administering Kafka. When you open the Kafka service from Cloudera Manager, you see the following dashboard: Figure 8: Kafka Service in Cloudera Manager 16 | Apache Kafka Guide Kafka Clients Kafka Clients Kafka clients are created to read data from and write data to the Kafka system. Clients can be producers, which publish content to Kafka topics. Clients can be subscribers, which read content from Kafka topics. Commands for Client Interactions Assuming you have Kafka running on your cluster, here are some commands that describe the typical steps you would need to exercise Kafka functionality: Create a topic kafka-topics --create --zookeeper zkinfo --replication-factor 1 --partitions 1 --topic test where the ZooKeeper connect string zkinfo is a comma-separated list of the Zookeeper nodes in host: port format. Validate the topic was created successfully kafka-topics --list --zookeeper zkinfo Produce messages The following command can be used to publish a message to the Kafka cluster. After the command, each typed line is a message that is sent to Kafka. After the last message, send an EOF or stop the command with Ctrl-D. $ kafka-console-producer --broker-list kafkainfo --topic test My first message. My second message. ^D where kafkainfo is a comma-separated list of the Kafka brokers in host:port format. Using more than one makes sure that the command can find a running broker. Consume messages The following command can be used to subscribe to a message from the Kafka cluster. kafka-console-consumer --bootstrap-server kafkainfo --topic test --from-beginning The output shows the same messages that you entered during your producer. 
Set a ZooKeeper root node

It’s possible to use a root node (chroot) for all Kafka nodes in ZooKeeper by setting a value for zookeeper.chroot in Cloudera Manager. Append this value to the end of your ZooKeeper connect string.

Set chroot in Cloudera Manager:

zookeeper.chroot=/kafka

Form the ZooKeeper connect string as follows:

--zookeeper zkinfo/kafka

If you set chroot and then use only the host and port in the connect string, you'll see the following exception:

InvalidReplicationFactorException: replication factor: 3 larger than available brokers: 0

Kafka Producers

Kafka producers are the publishers responsible for writing records to topics. Typically, this means writing a program using the KafkaProducer API. To instantiate a producer:

KafkaProducer<String, String> producer = new KafkaProducer<>(producerConfig);

Most of the important producer settings, several of which are mentioned below, are part of the configuration passed to this constructor.

Serialization of Keys and Values

For each producer, there are two serialization properties that must be set: key.serializer (for the key) and value.serializer (for the value). You can write custom serialization code or use one of the serializers already provided by Kafka. Some of the more commonly used ones are:
• ByteArraySerializer: Binary data
• StringSerializer: String representations

Managing Record Throughput

There are several settings to control how many records a producer accumulates before actually sending the data to the cluster. This tuning is highly dependent on the data source. Some possibilities include:
• batch.size: Combine records into batches of up to this many bytes per partition before sending data to the cluster.
• linger.ms: Always wait at least this amount of time before sending data to the cluster; then send however many records have accumulated in that time.
• max.request.size: Put an absolute limit on data size sent. This technique prevents network congestion caused by a single transfer request containing a large amount of data relative to the network speed.
• compression.type: Enable compression of data being sent.
• retries: Enable the client to retry sending after transient network errors. Used for reliability.

Acknowledgments

The full write path for records from a producer is to the leader partition and then to all of the follower replicas. The producer can control which point in the path triggers an acknowledgment. Depending on the acks setting, the producer may wait for the write to propagate all the way through the system or only wait for the earliest success point. Valid acks values are:
• 0: Do not wait for any acknowledgment from the partition (fastest throughput).
• 1: Wait only for the leader partition response.
• all: Wait until the follower partition responses meet the configured minimum (slowest throughput).

Partitioning

In Kafka, the partitioner determines how records map to partitions. Use the mapping to ensure the order of records within a partition and to manage the balance of messages across partitions. The default partitioner uses the entire key to determine which partition a message corresponds to. Records with the same key are always mapped to the same partition (assuming the number of partitions does not change for a topic). Consider writing a custom partitioner if you have information about how your records are distributed that can produce more efficient load balancing across partitions. A custom partitioner lets you take advantage of the other data in the record to control partitioning.
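To tie the settings above together, the following is a minimal, hypothetical sketch (not part of the original examples in this guide) that configures a producer with the properties discussed in this section and registers a custom partitioner through the partitioner.class property. The broker addresses, the PrefixPartitioner class, and the key naming scheme are illustrative assumptions; the topic name test matches the earlier command-line examples.

import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.utils.Utils;

public class ProducerSketch {

    // Hypothetical partitioner: routes records by the prefix of the key
    // (for example, "emea" in "emea-sensor42"), so all records sharing a
    // prefix are appended to the same partition.
    public static class PrefixPartitioner implements Partitioner {
        @Override
        public void configure(Map<String, ?> configs) { }

        @Override
        public int partition(String topic, Object key, byte[] keyBytes,
                             Object value, byte[] valueBytes, Cluster cluster) {
            int numPartitions = cluster.partitionsForTopic(topic).size();
            String prefix = (key == null) ? "" : key.toString().split("-", 2)[0];
            // Same hashing helpers that the default partitioner uses.
            return Utils.toPositive(Utils.murmur2(prefix.getBytes())) % numPartitions;
        }

        @Override
        public void close() { }
    }

    public static void main(String[] args) {
        Properties producerConfig = new Properties();
        producerConfig.put("bootstrap.servers", "broker1:9092,broker2:9092"); // assumed addresses
        producerConfig.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        producerConfig.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        producerConfig.put("acks", "all");                 // wait for the in-sync replicas
        producerConfig.put("batch.size", 16384);           // batch up to 16 KB per partition
        producerConfig.put("linger.ms", 5);                // wait up to 5 ms for more records
        producerConfig.put("compression.type", "snappy");  // compress batches
        producerConfig.put("retries", 3);                  // retry transient send failures
        producerConfig.put("partitioner.class", PrefixPartitioner.class.getName());

        KafkaProducer<String, String> producer = new KafkaProducer<>(producerConfig);
        producer.send(new ProducerRecord<>("test", "emea-sensor42", "my record value"));
        producer.close();
    }
}

Registering the partitioner through configuration keeps the routing decision out of the code that calls send(), so it can be changed without touching the producing logic.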
If a partitioner is not provided to the KafkaProducer, Kafka uses a default partitioner.

The ProducerRecord class is the actual object processed by the KafkaProducer. It takes the following parameters:
• Kafka Record: The key and value to be stored.
• Intended Destination: The destination topic and the specific partition (optional).

Kafka Consumers

Kafka consumers are the subscribers responsible for reading records from one or more topics and one or more partitions of a topic. Consumers can subscribe to topics manually or automatically; typically, this means writing a program using the KafkaConsumer API. To instantiate a consumer:

KafkaConsumer<String, String> kafkaConsumer = new KafkaConsumer<>(consumerConfig);

The KafkaConsumer class has two generic type parameters. Just as producers can send data (the values) with keys, the consumer can read data by keys. In this example both the keys and values are strings. If you define different types, you need to define a deserializer to accommodate the alternate types. For deserializers you need to implement the org.apache.kafka.common.serialization.Deserializer interface.

The most important configuration parameters that we need to specify are:
• bootstrap.servers: A list of brokers to initially connect to. List 2 to 3 brokers; you do not need to list the full cluster.
• group.id: Every consumer belongs to a group. Consumers in the same group share the partitions of a topic.
• key.deserializer/value.deserializer: Specify how to convert the sequence of bytes received through the Kafka protocol back into the Java representation.

Subscribing to a topic

Subscribe to a topic using the subscribe() method call:

kafkaConsumer.subscribe(Collections.singletonList(topic), rebalanceListener);

Here we specify a list of topics that we want to consume from and a 'rebalance listener.' Rebalancing is an important part of the consumer's life. Whenever the cluster or the consumers’ state changes, a rebalance will be issued. This will ensure that all the partitions are assigned to a consumer.

After subscribing to a topic, the consumer polls to see if there are new records:

while (true) {
    ConsumerRecords<String, String> records = kafkaConsumer.poll(100);
    // do something with 'records'
}

The poll returns multiple records that can be processed by the client. After processing the records, the client commits offsets synchronously, waiting until the commit completes before continuing to poll.

The last important point is to save the progress. This can be done with the commitSync() and commitAsync() methods. For example, after processing a batch the application can call kafkaConsumer.commitSync() to block until the offsets are committed, or kafkaConsumer.commitAsync() to keep processing and be notified of the result through an optional callback.

Auto commit is not recommended; manual commit is appropriate in the majority of use cases.

Groups and Fetching

Kafka consumers are usually assigned to a group. This happens statically by setting the group.id configuration property in the consumer configuration. Consuming with groups will result in the consumers balancing the load in the group. That means each consumer will have their fair share of the partitions. Note that there is no point in having more consumers in a group than partitions, because the extra consumers would be idle. As shown in the figure below, both consumer groups share the partitions and each partition multicasts messages to both consumer groups.

The consumers pull messages from the broker instead of the broker periodically pushing what is available. This helps the consumer as it won’t be overloaded and it can query the broker at its own speed.
Furthermore, to avoid tight looping, it uses a so called “long-poll”. The consumer sends a fetch request to poll for data and receives a reply only when enough data accumulates on the broker. Figure 9: Consumer Groups and Fetching from Partitions Protocol between Consumer and Broker This section details how the protocol works, what messages are going on the wire and how that contributes to the overall behavior of the consumer. When discussing the internals of the consumers, there are a couple of basic terms to know: Heartbeat When the consumer is alive and is part of the consumer group, it sends heartbeats. These are short periodic messages that tell the brokers that the consumer is alive and everything is fine. Session Often one missing heartbeat is not a big deal, but how do you know if a consumer is not sending heartbeats for long enough to indicate a problem? A session is such a time interval. If the consumer didn’t send any heartbeats for longer than the session, the broker can consider the consumer dead and remove it from the group. Coordinator The special broker which manages the group on the broker side is called the coordinator. The coordinator handles heartbeats and assigns the leader. Every group has a coordinator that organizes the startup of a consumer group and assist whenever a consumer leaves the group. Leader The leader consumer is elected by the coordinator. Its job is to assign partitions to every consumer in the group at startup or whenever a consumer leaves or joins the group. The leader holds the assignment strategy, it is decoupled from the broker. That means consumers can reconfigure the partition assignment strategy without restarting the brokers. 20 | Apache Kafka Guide Kafka Clients Startup Protocol As mentioned before, the consumers are working usually in groups. So a major part of the startup process is spent with figuring out the consumer group. At startup, the first step is to match protocol versions. It is possible that the broker and the consumer are of different versions (the broker is older and the consumer is newer, or vice versa). This matching is done by the API_VERSIONS request. Figure 10: Startup Protocol The next step is to collect cluster information, such as the addresses of all the brokers (prior to this point we used the bootstrap server as a reference), partition counts, and partition leaders. This is done in the METADATA request. After acquiring the metadata, the consumer has the information needed to join the group. By this time on the broker side, a coordinator has been selected per consumer group. The consumers must find their coordinator with the FIND_COORDINATOR request. After finding the coordinator, the consumer(s) are ready to join the group. Every consumer in the group sends their own member-specific metadata to the coordinator in the JOIN_GROUP request. The coordinator waits until all the consumers have sent their request, then assigns a leader for the group. At the response plus the collected metadata are sent to the leader, so it knows about its group. The remaining step is to assign partitions to consumers and propagate this state. Similar to the previous request, all consumers send a SYNC_GROUP request to the coordinator; the leader provides the assignments in this request. After it receives the sync request from each group member, the coordinator propagates this member state in the response. By the end of this step, the consumers are ready and can start consuming. 
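To connect these protocol steps to client code, here is a minimal sketch; the broker address and group name are assumptions, and the topic name test matches the earlier examples. The comments indicate roughly which of the requests described above each call triggers; the client handles the exact sequence internally.

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class StartupProtocolSketch {
    public static void main(String[] args) {
        Properties consumerConfig = new Properties();
        consumerConfig.put("bootstrap.servers", "broker1:9092"); // assumed broker address
        consumerConfig.put("group.id", "sketch-group");          // hypothetical group name
        consumerConfig.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        consumerConfig.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerConfig);

        // subscribe() only records the topic list locally; no group activity happens yet.
        consumer.subscribe(Collections.singletonList("test"));

        // The first poll() drives the startup protocol described above: API_VERSIONS and
        // METADATA against a bootstrap broker, then FIND_COORDINATOR, JOIN_GROUP, and
        // SYNC_GROUP, before any records are fetched.
        ConsumerRecords<String, String> records = consumer.poll(100);
        System.out.println("Assigned partitions: " + consumer.assignment());
        System.out.println("Records in first batch: " + records.count());

        // A graceful close() leaves the group (the LEAVE_GROUP step described below).
        consumer.close();
    }
}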
Consumption Protocol When consuming, the first step is to query where should the consumer start. This is done in the OFFSET_FETCH request. This is not mandatory: the consumer can also provide the offset manually. After this, the consumer is free to pull data from the broker. Data consumption happens in the FETCH requests. These are the long-pull requests. They are answered only when the broker has enough data; the request can be outstanding for a longer period of time. Apache Kafka Guide | 21 Kafka Clients Figure 11: Consumption Protocol From time to time, the application has to either manually or automatically save the offsets in an OFFSET_COMMIT request and send heartbeats too in the HEARTBEAT requests. The first ensures that the position is saved while the latter ensures that the coordinator knows that the consumer is alive. Shutdown Protocol The last step when the consumption is done is to shut down the consumer gracefully. This is done in one single step, called the LEAVE_GROUP protocol. Figure 12: Shutdown Protocol Rebalancing Partitions You may notice that there are multiple points in the protocol between consumers and brokers where failures can occur. There are points in the normal operation of the system where you need to change the consumer group assignments. For example, to consume a new partition or to respond to a consumer going offline. The process or responding to cluster information changing is called rebalance. It can occur in the following cases: • A consumer leaves. It can be a software failure where the session times out or a connection stalls for too long, but it can also be a graceful shutdown. • A consumer joins. It can be a new consumer but an old one that just recovered from a software failure (automatically or manually). • Partition is adjusted. A partition can simply go offline because of a broker failure or a partition coming back online. Alternatively an administrator can add or remove partitions to/from the broker. In these cases the consumers must reassign who is consuming. • The cluster is adjusted. When a broker goes offline, the partitions that are lead by this broker will be reassigned. In turn the consumers must rebalance so that they consume from the new leader. When a broker comes back, then eventually a preferred leader election happens which restores the original leadership. The consumers must follow this change as well. 22 | Apache Kafka Guide Kafka Clients On the consumer side, this rebalance is propagated to the client via the ConsumerRebalanceListener interface. It has two methods. The first, onPartitionsRevoked, will be invoked when any partition goes offline. This call happens before the changes would reflect in any of the consumers, so this is the chance to save offsets if manual offset commit is used. On the other hand onPartitionsAssigned is invoked after partition reassignment. This would allow for the programmer to detect which partitions are currently assigned to the current consumer. Complete examples can be found in the development section. Consumer Configuration Properties There are some very important configurations that any user of Kafka must know: • heartbeat.interval.ms: The interval of the heartbeats. For example, if the heartbeat interval is set to 3 seconds, the consumer sends a short heartbeat message to the broker every 3 seconds to indicate that it is alive. • session.timeout.ms: The consumer tells this timeout to the coordinator. This is used to control the heartbeats and remove the dead consumers. 
If it’s set to 10 seconds, the consumer can miss sending 2 heartbeats, assuming the previous heartbeat setting. If we increase the timeout, the consumer has more room for delays but the broker notices lagging consumers later. • max.poll.interval.ms: It is a very important detail: the consumers must maintain polling and should never do long-running processing. If a consumer is taking too much time between two polls, it will be detached from the consumer group. We can tune this configuration according to our needs. Note that if a consumer is stuck in processing, it will be noticed later if the value is increased. • request.timeout.ms: Generally every request has a timeout. This is an upper bound that the client waits for the server’s response. If this timeout elapses, then retries might happen if the number of retries are not exhausted. Retries In Kafka retries typically happen on only for certain kinds of errors. When a retriable error is returned, the clients are constrained by two facts: the timeout period and the backoff period. The timeout period tells how long the consumer can retry the operation. The backoff period how often the consumer should retry. There is no generic approach for "number of retries." Number of retries are usually controlled by timeout periods. Kafka Clients and ZooKeeper The default consumer model provides the metadata for offsets in the Kafka cluster. There is a topic named __consumer_offsets that the Kafka Consumers write their offsets to. Figure 13: Kafka Consumer Dependencies In releases before version 2.0 of CDK Powered by Apache Kafka, the same metadata was located in ZooKeeper. The new model removes the dependency and load from Zookeeper. In the old approach: • The consumers save their offsets in a "consumer metadata" section of ZooKeeper. • With most Kafka setups, there are often a large number of Kafka consumers. The resulting client load on ZooKeeper can be significant, therefore this solution is discouraged. Apache Kafka Guide | 23 Kafka Clients Figure 14: Kafka Consumer Dependencies (Old Approach) 24 | Apache Kafka Guide Kafka Brokers Kafka Brokers This section covers some of how a broker operates in greater detail. As we go over some of these details, we will illustrate how these pieces can cause brokers to have issues. Single Cluster Scenarios The figure below shows a simplified version of a Kafka cluster in steady state. There are N brokers, two topics with nine partitions each. The M replicated partitions are not shown for simplicity. This is going to be the baseline for discussions in later sections. Figure 15: Kafka Cluster in Steady State Leader Positions In the baseline example, each broker shown has three partitions per topic. In the figure above, the Kafka cluster has well balanced leader partitions. Recall the following: • Producer writes and consumer reads occur at the partition level • Leader partitions are responsible for ensuring that the follower partitions keep their records in sync In the baseline example, since the leader partitions were evenly distributed, most of the time the load to the overall Kafka cluster will be relatively balanced. In the example below, since a large chunk of the leaders for Topic A and Topic B are on Broker 1, a lot more of the overall Kafka workload will occur at Broker 1. This will cause a backlog of work, which slows down the cluster throughput, which will worsen the backlog. 
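One way to see whether leader partitions are spread evenly is to ask the cluster directly. The sketch below is a hypothetical example using the AdminClient; the broker address is an assumption, and topic_a and topic_b stand in for the Topic A and Topic B of the figures.

import java.util.Arrays;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.TopicDescription;

public class LeaderBalanceSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            Map<String, TopicDescription> topics =
                    admin.describeTopics(Arrays.asList("topic_a", "topic_b")).all().get();
            // Print the leader broker and in-sync replica count for every partition,
            // which makes an uneven leader distribution easy to spot.
            topics.forEach((name, description) ->
                    description.partitions().forEach(p ->
                            System.out.println(name + "-" + p.partition()
                                    + " leader on broker "
                                    + (p.leader() == null ? "none" : String.valueOf(p.leader().id()))
                                    + ", ISR size " + p.isr().size())));
        }
    }
}

If one broker shows up as the leader for most partitions, the partition reassignment and preferred leader election mechanisms described elsewhere in this guide can be used to restore balance.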
Apache Kafka Guide | 25 Kafka Brokers Figure 16: Kafka Cluster with Leader Partition Imbalance Even if a cluster starts with perfectly balanced topics, failures of brokers can cause these imbalances: if leader of a partition goes down one of the replicas will become the leader. When the original (preferred) leader comes back, it will get back leadership only if automatic leader rebalancing is enabled; otherwise the node will become a replica and the cluster gets imbalanced. In-Sync Replicas Let’s look at Topic A from the previous example with follower partitions: • Broker 1 has six leader partitions, broker 2 has two leader partitions, and broker 3 has one leader partition. • Assuming a replication factor of 3. Assuming all replicas are in-sync, then any leader partition can be moved from Broker 1 to another broker without issue. However, in the case where some of the follower partitions have not caught up, then the ability to change leaders or have a leader election will be hampered. Figure 17: Kafka Topic with Leader and Follower Partitions Topic Configuration We already introduced the concept of topics. When managing a Kafka cluster, configuring a topic can require some planning. For small clusters or low record throughput, topic planning isn’t particularly tricky, but as you scale to the large clusters and high record throughput, such planning becomes critical. 26 | Apache Kafka Guide Kafka Brokers Topic Creation To be able to use a topic, it has to be created. This can happen automatically or manually. When enabled, the Kafka cluster creates topics on demand. Automatic Topic Creation If auto.create.topics.enable is set to true and a client is requesting metadata about a non-existent topic, then the broker will create a topic with the given name. In this case, its replication factor and partition count is derived from the broker configuration. Corresponding configuration entries are default.replication.factor and num.partitions. The default value for each these properties is 1. This means the topic will not scale well and will not be tolerant to broker failure. It is recommended to set these default value higher or even better switching off this feature and creating topics manually with configuration suited for the use case at hand. Manual topic creation You can create topics from the command line with kafka-topics tool. Specify a replication factor and partition count, with optional configuration overrides. Alternatively, for each partition, you can specify which brokers have to hold a copy of that partition. Topic Properties There are numerous properties that influence how topics are handled by the cluster. These can be set with kafka-topics tool on topic creation or later on with kafka-configs. The most commonly used properties are: • min.insync.replicas: specifies how many brokers have to replicate the records before the leader sends back an acknowledgment to the producer (if producer property acks is set to all). With a replication factor of 3, a minimum in-sync replicas of 2 guarantees a higher level of durability. It is not recommended that you set this value equal to the replication factor as it makes producing to the topic impossible if one of the brokers is temporarily down. • retention.bytes and retention.ms: determines when a record is considered outdated. When data stored in one partition exceeds given limits, broker starts a cleanup to save disk space. • segment.bytes and segment.ms: determines how much data is stored in the same log segment (that is, in the same file). 
If any of these limits is reached, a new log segment is created. • unclean.leader.election.enable: if true, replicas that are not in-sync may be elected as new leaders. This only happens when there are no live replicas in-sync. As enabling this feature may result in data loss, it should be switched on only if availability is more important than durability. • cleanup.policy: either delete or compact. If delete is set, old log segments will be deleted. Otherwise, only the latest record is retained. This process is called log compaction. This is covered in greater detail in the Record Management on page 59 section. If you do not specify these properties, the prevailing broker level configuration will take effect. A complete list of properties can be found in the Topic-Level Configs section of the Apache Kafka documentation. Partition Management Partitions are at the heart of how Kafka scales performance. Some of the administrative issues around partitions can be some of the biggest challenges in sustaining high performance. When creating a topic, you specify which brokers should have a copy of which partition or you specify replication factor and number of partitions and the controller generates a replica assignment for you. If there are multiple brokers that are assigned a partition, the first one in the list is always the preferred leader. Whenever the leader of a partition goes down, Kafka moves leadership to another broker. Whether this is possible depends on the current set of in-sync replicas and the value of unclean.leader.election.enable. However, no new Kafka broker will start to replicate the partition to reach replication factor again. This is to avoid unnecessary load on brokers when one of them is temporarily down. Kafka will regularly try to balance leadership between brokers by electing the preferred leader. But this balance is based on number of leaderships and not throughput. Apache Kafka Guide | 27 Kafka Brokers Partition Reassignment In some cases require manual reassignment of partitions: • If the initial distribution of partitions and leaderships creates an uneven load on brokers. • If you want to add or remove brokers from the cluster. Use kafka-reassign-partitions tool to move partitions between brokers. The typical workflow consist of the following: • Generate a reassignment file by specifying topics to move and which brokers to move to (by setting --topic-to-move-json-file and --broker-list to --generate command). • Optionally edit the reassignment file and verify it with the tool. • Actually re-assigning partitions (with option --execute). • Verify if the process has finished as intended (with option --verify). Note: When specifying throttles for inter broker communication, make sure you use the command with --verify option to remove limitations on replication speed. Adding Partitions You can use kafka-topics tool to increase the number of partitions in a given topic. However, note that adding partitions will in most cases break the guarantee preserving the order of records with the same key, because it changes which partition a record key is produced to. Although order of records is preserved for both the old partition the key was produced to and the new one, it still might happen that records from the new partition are consumed before records from the old one. Choosing the Number of Partitions When choosing the number of partitions for a topic, you have to consider the following: • More partitions mean higher throughput. 
• You should not have more than a few tens of thousands of partitions in a Kafka cluster. • In case of an unclean shutdown of one of the brokers, the partitions it was leader for have to be led by other brokers and moving leadership for a few thousand partitions one by one can take seconds of unavailability. • Kafka keeps all log segment files open at all times. More partitions can mean more file handles to have open. • More partitions can cause higher latency. Controller The controller is one of the brokers that has additional partition and replica management responsibilities. It will control / be involved whenever partition metadata or state is changed, such as when: • Topics or partitions are created or deleted. • Brokers join or leave the cluster and partition leader or replica reassignment is needed. It also tracks the list of in sync replicas (ISRs) and maintains broker, partition, and ISR data in Zookeeper. Controller Election Any of the brokers can play the role of the controller, but in a healthy cluster there is exactly one controller. Normally this is the broker that started first, but there are certain situations when a re-election is needed: • If the controller is shut down or crashes. • If it loses connection to Zookeeper. When a broker starts or participates in controller reelection, it will attempt to create an ephemeral node (“/controller”) in ZooKeeper. If it succeeds, the broker becomes the controller. If it fails, there is already a controller, but the broker will watch the node. 28 | Apache Kafka Guide Kafka Brokers If the controller loses connection to ZooKeeper or stops ZooKeeper will remove the ephemeral node and the brokers will get a notification to start a controller election. Every controller election will increase the “controller epoch”. The controller epoch is used to detect situations when there are multiple active controllers: if a broker gets a message with a lower epoch than the result of the last election, it can be safely ignored. It is also used to detect a “split brain” situation when multiple nodes believe that they are in the controller role. Having 0 or 2+ controllers means the cluster is in a critical state, as broker and partition state changes are blocked. Therefore it’s important to ensure that the controller has a stable connection to ZooKeeper to avoid controller elections as much as possible. Apache Kafka Guide | 29 Kafka Integration Kafka Integration Kafka Security Client-Broker Security with TLS Kafka allows clients to connect over TLS. By default, TLS is disabled, but can be turned on as needed. Step 1: Generating Keys and Certificates for Kafka Brokers Generate the key and the certificate for each machine in the cluster using the Java keytool utility. See Generate TLS Certificates. Make sure that the common name (CN) matches the fully qualified domain name (FQDN) of your server. The client compares the CN with the DNS domain name to ensure that it is connecting to the correct server. Step 2: Creating Your Own Certificate Authority You have generated a public-private key pair for each machine and a certificate to identify the machine. However, the certificate is unsigned, so an attacker can create a certificate and pretend to be any machine. Sign certificates for each machine in the cluster to prevent unauthorized access. A Certificate Authority (CA) is responsible for signing certificates. A CA is similar to a government that issues passports. A government stamps (signs) each passport so that the passport becomes difficult to forge. 
Similarly, the CA signs the certificates, and the cryptography guarantees that a signed certificate is computationally difficult to forge. If the CA is a genuine and trusted authority, the clients have high assurance that they are connecting to the authentic machines. openssl req -new -x509 -keyout ca-key -out ca-cert -days 365 The generated CA is a public-private key pair and certificate used to sign other certificates. Add the generated CA to the client truststores so that clients can trust this CA: keytool -keystore {client.truststore.jks} -alias CARoot -import -file {ca-cert} Note: If you configure Kafka brokers to require client authentication by setting ssl.client.auth to be requested or required on the Kafka brokers config, you must provide a truststore for the Kafka brokers as well. The truststore must have all the CA certificates by which the clients keys are signed. The keystore created in step 1 stores each machine’s own identity. In contrast, the truststore of a client stores all the certificates that the client should trust. Importing a certificate into a truststore means trusting all certificates that are signed by that certificate. This attribute is called the chain of trust. It is particularly useful when deploying SSL on a large Kafka cluster. You can sign all certificates in the cluster with a single CA, and have all machines share the same truststore that trusts the CA. That way, all machines can authenticate all other machines. Step 3: Signing the Certificate Now you can sign all certificates generated by step 1 with the CA generated in step 2. 1. Create a certificate request from the keystore: keytool -keystore server.keystore.jks -alias localhost -certreq -file cert-file where: • keystore: the location of the keystore 30 | Apache Kafka Guide Kafka Integration • cert-file: the exported, unsigned certificate of the server 2. Sign the resulting certificate with the CA (in the real world, this can be done using a real CA): openssl x509 -req -CA ca-cert -CAkey ca-key -in cert-file -out cert-signed -days validity -CAcreateserial -passin pass:ca-password where: • • • • ca-cert: the certificate of the CA ca-key: the private key of the CA cert-signed: the signed certificate of the server ca-password: the passphrase of the CA 3. Import both the certificate of the CA and the signed certificate into the keystore: keytool -keystore server.keystore.jks -alias CARoot -import -file ca-cert keytool -keystore server.keystore.jks -alias localhost -import -file cert-signed The following Bash script demonstrates the steps described above. One of the commands assumes a password of SamplePassword123, so either use that password or edit the command before running it. #!/bin/bash #Step 1 keytool -keystore server.keystore.jks -alias localhost -validity 365 -genkey #Step 2 openssl req -new -x509 -keyout ca-key -out ca-cert -days 365 keytool -keystore server.truststore.jks -alias CARoot -import -file ca-cert keytool -keystore client.truststore.jks -alias CARoot -import -file ca-cert #Step 3 keytool -keystore server.keystore.jks -alias localhost -certreq -file cert-file openssl x509 -req -CA ca-cert -CAkey ca-key -in cert-file -out cert-signed -days 365 -CAcreateserial -passin pass:SamplePassword123 keytool -keystore server.keystore.jks -alias CARoot -import -file ca-cert keytool -keystore server.keystore.jks -alias localhost -import -file cert-signed Step 4: Configuring Kafka Brokers Kafka Brokers support listening for connections on multiple ports. 
If SSL is enabled for inter-broker communication (see below for how to enable it), both PLAINTEXT and SSL ports are required. To configure the listeners from Cloudera Manager, perform the following steps:
1. In Cloudera Manager, go to Kafka > Instances.
2. Go to Kafka Broker > Configurations.
3. In the Kafka Broker Advanced Configuration Snippet (Safety Valve) for Kafka Properties, enter the following information:
listeners=PLAINTEXT://kafka-broker-host-name:9092,SSL://kafka-broker-host-name:9093
advertised.listeners=PLAINTEXT://kafka-broker-host-name:9092,SSL://kafka-broker-host-name:9093
where kafka-broker-host-name is the FQDN of the broker that you selected from the Instances page in Cloudera Manager. The sample configuration above uses the PLAINTEXT and SSL protocols for the SSL-enabled brokers. For information about other supported security protocols, see Using Kafka’s Inter-Broker Security on page 33.
4. Repeat the previous step for each broker. The advertised.listeners configuration is needed so that external clients can connect to the brokers.
5. Deploy the client configuration and perform a rolling restart of the Kafka service from Cloudera Manager.
Apache Kafka Guide | 31 Kafka Integration
The Kafka CSD auto-generates listeners for Kafka brokers, depending on your SSL and Kerberos configuration. To enable SSL for Kafka installations, do the following:
1. Turn on SSL for the Kafka service by turning on the ssl_enabled configuration for the Kafka CSD.
2. Set security.inter.broker.protocol to SSL if Kerberos is disabled; otherwise, set it to SASL_SSL.
The following SSL configurations are required on each broker. Each of these values can be set in Cloudera Manager. Be sure to replace the example passwords with your actual keystore and truststore passwords. For instructions, see Changing the Configuration of a Service or Role Instance.
ssl.keystore.location=/var/private/ssl/kafka.server.keystore.jks
ssl.keystore.password=SamplePassword123
ssl.key.password=SamplePassword123
ssl.truststore.location=/var/private/ssl/server.truststore.jks
ssl.truststore.password=SamplePassword123
Other configuration settings may also be needed, depending on your requirements:
• ssl.client.auth=none: The other options for client authentication are required and requested; with requested, clients without certificates can still connect. The use of requested is discouraged, as it provides a false sense of security and misconfigured clients can still connect.
• ssl.cipher.suites: A cipher suite is a named combination of authentication, encryption, MAC, and key exchange algorithm used to negotiate the security settings for a network connection using the TLS or SSL network protocol. This list is empty by default.
• ssl.enabled.protocols=TLSv1.2,TLSv1.1,TLSv1: Provide a list of SSL/TLS protocols that your brokers accept from clients.
• ssl.keystore.type=JKS
• ssl.truststore.type=JKS
Communication between Kafka brokers defaults to PLAINTEXT. To enable secured communication, modify the broker properties file by adding security.inter.broker.protocol=SSL. For a list of the supported communication protocols, see Using Kafka’s Inter-Broker Security on page 33.
Note: Due to import regulations in some countries, the Oracle implementation of JCA limits the strength of cryptographic algorithms. If you need stronger algorithms, you must obtain the JCE Unlimited Strength Jurisdiction Policy Files and install them in the JDK/JRE as described in the JCA Providers Documentation.
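Putting the pieces of this step together, the safety valve entries for an SSL-enabled broker might resemble the following sketch. The host name, file paths, and passwords are the placeholder values used above and must be replaced with your own; which of the optional properties you include depends on your client authentication and protocol requirements.

listeners=PLAINTEXT://kafka-broker-host-name:9092,SSL://kafka-broker-host-name:9093
advertised.listeners=PLAINTEXT://kafka-broker-host-name:9092,SSL://kafka-broker-host-name:9093
security.inter.broker.protocol=SSL
ssl.keystore.location=/var/private/ssl/kafka.server.keystore.jks
ssl.keystore.password=SamplePassword123
ssl.key.password=SamplePassword123
ssl.truststore.location=/var/private/ssl/server.truststore.jks
ssl.truststore.password=SamplePassword123
ssl.client.auth=none
ssl.enabled.protocols=TLSv1.2,TLSv1.1,TLSv1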
After SSL is configured your broker, logs should show an endpoint for SSL communication: with addresses: PLAINTEXT -> EndPoint(192.168.1.1,9092,PLAINTEXT),SSL -> EndPoint(192.168.1.1,9093,SSL) You can also check the SSL communication to the broker by running the following command: openssl s_client -debug -connect localhost:9093 -tls1 This check can indicate that the server keystore and truststore are set up properly. Note: ssl.enabled.protocols should include TLSv1. 32 | Apache Kafka Guide Kafka Integration The output of this command should show the server certificate: -----BEGIN CERTIFICATE----{variable sized random bytes} -----END CERTIFICATE----subject=/C=US/ST=CA/L=Palo Alto/O=org/OU=org/CN=Franz Kafka issuer=/C=US/ST=CA/L=Palo Alto /O=org/OU=org/CN=kafka/emailAddress=kafka@your-domain.com If the certificate does not appear, or if there are any other error messages, your keystore is not set up properly. Step 5: Configuring Kafka Clients SSL is supported only for the new Kafka producer and consumer APIs. The configurations for SSL are the same for both the producer and consumer. If client authentication is not required in the broker, the following example shows a minimal configuration: security.protocol=SSL ssl.truststore.location=/var/private/ssl/kafka.client.truststore.jks ssl.truststore.password=SamplePassword123 If client authentication is required, a keystore must be created as in step 1, it needs to be signed by the CA as in step 3, and you must also configure the following properties: ssl.keystore.location=/var/private/ssl/kafka.client.keystore.jks ssl.keystore.password=SamplePassword123 ssl.key.password=SamplePassword123 Other configuration settings might also be needed, depending on your requirements and the broker configuration: • ssl.provider (Optional). The name of the security provider used for SSL connections. Default is the default security provider of the JVM. • ssl.cipher.suites (Optional). A cipher suite is a named combination of authentication, encryption, MAC, and a key exchange algorithm used to negotiate the security settings for a network connection using TLS or SSL network protocol. • ssl.enabled.protocols=TLSv1.2,TLSv1.1,TLSv1. This property should list at least one of the protocols configured on the broker side. • ssl.truststore.type=JKS • ssl.keystore.type=JKS Using Kafka’s Inter-Broker Security Kafka can expose multiple communication endpoints, each supporting a different protocol. Supporting multiple communication endpoints enables you to use different communication protocols for client-to-broker communications and broker-to-broker communications. Set the Kafka inter-broker communication protocol using the security.inter.broker.protocol property. Use this property primarily for the following scenarios: • Enabling SSL encryption for client-broker communication but keeping broker-broker communication as PLAINTEXT. Because SSL has performance overhead, you might want to keep inter-broker communication as PLAINTEXT if your Kafka brokers are behind a firewall and not susceptible to network snooping. • Migrating from a non-secure Kafka configuration to a secure Kafka configuration without requiring downtime. Use a rolling restart and keep security.inter.broker.protocol set to a protocol that is supported by all brokers until all brokers are updated to support the new protocol. Apache Kafka Guide | 33 Kafka Integration For example, if you have a Kafka cluster that needs to be configured to enable Kerberos without downtime, follow these steps: 1. 2. 3. 4. 
Set security.inter.broker.protocol to PLAINTEXT. Update the Kafka service configuration to enable Kerberos. Perform a rolling restart. Set security.inter.broker.protocol to SASL_PLAINTEXT. Kafka 2.0 and higher supports the combinations of protocols listed here. SSL Kerberos PLAINTEXT No No SSL Yes No SASL_PLAINTEXT No Yes SASL_SSL Yes Yes These protocols can be defined for broker-to-client interaction and for broker-to-broker interaction. The property security.inter.broker.protocol allows the broker-to-broker communication protocol to be different than the broker-to-client protocol, allowing rolling upgrades from non-secure to secure clusters. In most cases, set security.inter.broker.protocol to the protocol you are using for broker-to-client communication. Set security.inter.broker.protocol to a protocol different than the broker-to-client protocol only when you are performing a rolling upgrade from a non-secure to a secure Kafka cluster. Enabling Kerberos Authentication Apache Kafka supports Kerberos authentication, but it is supported only for the new Kafka Producer and Consumer APIs. If you already have a Kerberos server, you can add Kafka to your current configuration. If you do not have a Kerberos server, install it before proceeding. See Enabling Kerberos Authentication for CDH. If you already have configured the mapping from Kerberos principals to short names using the hadoop.security.auth_to_local HDFS configuration property, configure the same rules for Kafka by adding the sasl.kerberos.principal.to.local.rules property to the Advanced Configuration Snippet for Kafka Broker Advanced Configuration Snippet using Cloudera Manager. Specify the rules as a comma separated list. To enable Kerberos authentication for Kafka: 1. 2. 3. 4. 5. 6. In Cloudera Manager, navigate to Kafka > Configuration. Set SSL Client Authentication to none. Set Inter Broker Protocol to SASL_PLAINTEXT. Click Save Changes. Restart the Kafka service (Action > Restart). Make sure that listeners = SASL_PLAINTEXT is present in the Kafka broker logs, by default in /var/log/kafka/server.log. 7. Create a jaas.conf file with either cached credentials or keytabs. To use cached Kerberos credentials, where you use kinit first, use this configuration. KafkaClient { com.sun.security.auth.module.Krb5LoginModule required useTicketCache=true; }; 34 | Apache Kafka Guide Kafka Integration If you use a keytab, use this configuration. To generate keytabs, see Step 6: Get or Create a Kerberos Principal for Each User Account). KafkaClient { com.sun.security.auth.module.Krb5LoginModule required useKeyTab=true keyTab="/etc/security/keytabs/mykafkaclient.keytab" principal="mykafkaclient/clients.hostname.com@EXAMPLE.COM"; }; 8. Create the client.properties file containing the following properties. security.protocol=SASL_PLAINTEXT sasl.kerberos.service.name=kafka 9. Test with the Kafka console producer and consumer. To obtain a Kerberos ticket-granting ticket (TGT): kinit user 10. Verify that your topic exists. This does not use security features, but it is a best practice. kafka-topics --list --zookeeper zkhost:2181 11. Verify that the jaas.conf file is used by setting the environment. export KAFKA_OPTS="-Djava.security.auth.login.config=/home/user/jaas.conf" 12. Run a Kafka console producer. kafka-console-producer --broker-list anybroker:9092 --topic test1 --producer.config client.properties 13. Run a Kafka console consumer. 
kafka-console-consumer --new-consumer --topic test1 --from-beginning --bootstrap-server anybroker:9092 --consumer.config client.properties Enabling Encryption at Rest Data encryption is increasingly recognized as an optimal method for protecting data at rest. You can encrypt Kafka data using Cloudera Navigator Encrypt. Perform the following steps to encrypt Kafka data that is not in active use. 1. 2. 3. 4. Stop the Kafka service. Archive the Kafka data to an alternate location, using TAR or another archive tool. Unmount the affected drives. Install and configure Navigator Encrypt. See Installing Cloudera Navigator Encrypt. 5. Expand the TAR archive into the encrypted directories. Apache Kafka Guide | 35 Kafka Integration 6. Restart the Kafka service. Topic Authorization with Kerberos and Sentry Apache Sentry includes a Kafka binding you can use to enable authorization in Kafka with Sentry. For more information, see Authorization With Apache Sentry. Configuring Kafka to Use Sentry Authorization The following steps describe how to configure Kafka to use Sentry authorization. These steps assume you have installed Kafka and Sentry on your cluster. Sentry requires that your cluster include HDFS. After you install and start Sentry with the correct configuration, you can stop the HDFS service. For more information, see Installing and Upgrading the Sentry Service. Note: Cloudera's distribution of Kafka can make use of LDAP-based user groups when the LDAP directory is synchronized to Linux via tools such as SSSD. CDK does not support direct integration with LDAP, either through direct Kafka's LDAP authentication, or via Hadoop's group mapping (when hadoop.group.mapping is set to LdapGroupMapping). For more information, see Configuring LDAP Group Mappings. To configure Sentry authentication for Kafka: 1. 2. 3. 4. In Cloudera Manager, go to Kafka > Configuration. Select Enable Kerberos Authentication. Select a Sentry service in the Kafka service configuration. Add superusers. Superusers can perform any action on any resource in the Kafka cluster. The kafka user is added as a superuser by default. Superuser requests are authorized without going through Sentry, which provides enhanced performance. 5. Select Enable Sentry Privileges Caching to enhance performance. 6. Restart the Sentry services. Authorizable Resources Authorizable resources are resources or entities in a Kafka cluster that require special permissions for a user to be able to perform actions on them. Kafka has four authorizable resources. • Cluster: controls who can perform cluster-level operations such as creating or deleting a topic. This resource can only have one value, kafka-cluster, as one Kafka cluster cannot have more than one cluster resource. • Topic: controls who can perform topic-level operations such as producing and consuming topics. Its value must match exactly the topic name in the Kafka cluster. With CDH 5.15.0 and CDK 3.1 and later, wildcards (*) can be used to refer to any topic in the privilege. • Consumergroup: controls who can perform consumergroup-level operations such as joining or describing a consumergroup. Its value must exactly match the group.id of a consumergroup. With CDH 5.14.1 and later, you can use a wildcard (*) to refer to any consumer groups in the privilege. This resource is useful when used with Spark Streaming, where a generated group.id may be needed. • Host: controls from where specific operations can be performed. Think of this as a way to achieve IP filtering in Kafka. 
You can set the value of this resource to the wildcard (*), which represents all hosts. Note: Only IP addresses should be specified in the host component of Kafka Sentry privileges, hostnames are not supported. 36 | Apache Kafka Guide Kafka Integration Authorized Actions You can perform multiple actions on each resource. The following operations are supported by Kafka, though not all actions are valid on all resources. • • • • • • • • ALL is a wildcard action, and represents all possible actions on a resource. read write create delete alter describe clusteraction Authorizing Privileges Privileges define what actions are allowed on a resource. A privilege is represented as a string in Sentry. The following rules apply to a valid privilege. • Can have at most one Host resource. If you do not specify a Host resource in your privilege string, Host=* is assumed. • Must have exactly one non-Host resource. • Must have exactly one action specified at the end of the privilege string. For example, the following are valid privilege strings: Host=*->Topic=myTopic->action=ALL Topic=test->action=ALL Granting Privileges to a Role The following examples grant privileges to the role test, so that users in testGroup can create a topic named testTopic and produce to it. The user executing these commands must be added to the Sentry parameter sentry.service.allow.connect and also be a member of a group defined in sentry.service.admin.group. Before you can assign the test role, you must first create it. To create the test role: kafka-sentry -cr -r test To confirm that the role was created, list the roles: kafka-sentry -lr If Sentry privileges caching is enabled, as recommended, the new privileges you assign take some time to appear in the system. The time is the time-to-live interval of the Sentry privileges cache, which is set using sentry.kafka.caching.ttl.ms. By default, this interval is 30 seconds. For test clusters, it is beneficial to have changes appear within the system as fast as possible, therefore, Cloudera recommends that you either use a lower time interval, or disable caching with sentry.kafka.caching.enable. 1. Allow users in testGroup to write to testTopic from localhost, which allows users to produce to testTopic. Users need both write and describe permissions. kafka-sentry -gpr -r test -p "Host=127.0.0.1->Topic=testTopic->action=write" kafka-sentry -gpr -r test -p "Host=127.0.0.1->Topic=testTopic->action=describe" Apache Kafka Guide | 37 Kafka Integration 2. Assign the test role to the group testGroup: kafka-sentry -arg -r test -g testGroup 3. Verify that the test role is part of the group testGroup: kafka-sentry -lr -g testGroup 4. Create testTopic. $ kafka-topics --create --zookeeper localhost:2181 \ --replication-factor 1 \ --partitions 1 --topic testTopic kafka-topics --list --zookeeper localhost:2181 testTopic Now you can produce to and consume from the Kafka cluster. 1. Produce to testTopic. Note that you have to pass a configuration file, producer.properties, with information on JAAS configuration and other Kerberos authentication related information. See SASL Configuration for Kafka Clients. $ kafka-console-producer --broker-list localhost:9092 \ --topic testTopic --producer.config producer.properties This is a message This is another message 2. Grant the create privilege to the test role. $ kafka-sentry -gpr -r test -p "Host=127.0.0.1->Cluster=kafka-cluster->action=create" 3. Allow users in testGroup to describe testTopic from localhost, which the user creates and uses. 
$ kafka-sentry -gpr -r test -p "Host=127.0.0.1->Topic=testTopic->action=describe" 4. Grant the describe privilege to the test role. $ kafka-sentry -gpr -r test -p "Host=127.0.0.1->Consumergroup=testconsumergroup->action=describe" 5. Allow users in testGroup to read from a consumer group, testconsumergroup, that it will start and join. $ kafka-sentry -gpr -r test -p "Host=127.0.0.1->Consumergroup=testconsumergroup->action=read" 6. Allow users in testGroup to read from testTopic from localhost and to consume from testTopic. $ kafka-sentry -gpr -r test -p "Host=127.0.0.1->Topic=testTopic->action=read" 7. Consume from testTopic. Note that you have to pass a configuration file, consumer.properties, with information on JAAS configuration and other Kerberos authentication related information. The configuration file must also specify group.id as testconsumergroup. kafka-console-consumer --new-consumer --topic testTopic \ --from-beginning --bootstrap-server anybroker-host:9092 \ --consumer.config consumer.properties 38 | Apache Kafka Guide Kafka Integration This is a message This is another message Troubleshooting Kafka with Sentry If Kafka requests are failing due to authorization, the following steps can provide insight into the error: • Make sure you have run kinit as a user who has privileges to perform an operation. • Identify which broker is hosting the leader of the partition you are trying to produce to or consume from, as this leader is going to authorize your request against Sentry. One easy way of debugging is to just have one Kafka broker. Change log level of the Kafka broker by adding the following entry to the Kafka Broker in Logging Advanced Configuration Snippet (Safety Valve) and restart the broker: log4j.logger.org.apache.sentry=DEBUG Setting just Sentry to DEBUG mode avoids the debug output from undesired dependencies, such as Jetty. • Run the Kafka client or Kafka CLI with the required arguments and capture the Kafka log, which should be similar to: /var/log/kafka/kafka-broker-host-name.log • Look for the following information in the filtered logs: – – – – Groups that the Kafka client user or CLI is running as. Required privileges for the operation. Retrieved privileges from Sentry. Required and retrieved privileges comparison result. This log information can provide insight into which privilege is not assigned to a user, causing a particular operation to fail. Managing Multiple Kafka Versions Kafka Feature Support in Cloudera Manager and CDH Using the latest features of Kafka sometimes requires a newer version of Cloudera Manager and/or CDH that supports the feature. For the purpose of understanding upgrades, the table below includes several earlier versions where Kafka was known as CDK Powered by Apache Kafka (CDK in the table). 
Table 1: Kafka and CM/CDH Supported Versions Matrix Minimum Supported Version CDH Kafka Version Apache Kafka Version Cloudera Manager CDH (No Security) CDH (with Sentry) CDH 6.1.0 2.0.0 CM 6.1.0 CDH 6.1.0 CDH 6.1.0 CDH 6.0.0 1.0.1 CM 6.0.0 CDH 6.0.0 CDH 6.0.0 CDK 3.1.0 1.0.1 CM 5.13.0 CDH 5.13.0 CDH 5.13.0 Sentry-HA supported CDK 3.0.0 0.11.0 CM 5.13.0 CDH 5.13.0 CDH 5.13.0 Sentry-HA not supported CDK 2.2 0.10.2 CM 5.9.0 CDH 5.4.0 CDH 5.9.0 CDK 2.1 0.10.0 CM 5.9.0 CDH 5.4.0 CDH 5.9.0 Kafka Feature Notes Sentry Authorization Apache Kafka Guide | 39 Kafka Integration Minimum Supported Version CDH Kafka Version Apache Kafka Version Cloudera Manager CDH (No Security) CDH (with Sentry) CDK 2.0 0.9.0 CM5.5.3 CDH 5.4.0 CDH 5.4.0 Kafka Feature Notes Enhanced Security Client/Broker Compatibility Across Kafka Versions Maintaining compatibility across different Kafka clients and brokers is a common issue. Mismatches among client and broker versions can occur as part of any of the following scenarios: • Upgrading your Kafka cluster without upgrading your Kafka clients. • Using a third-party application that produces to or consumes from your Kafka cluster. • Having a client program communicating with two or more Kafka clusters (such as consuming from one cluster and producing to a different cluster). • Using Flume or Spark as a Kafka consumer. In these cases, it is important to understand client/broker compatibility across Kafka versions. Here are general rules that apply: • Newer Kafka brokers can talk to older Kafka clients. The reverse is not true: older Kafka brokers cannot talk to newer Kafka clients. • Changes in either the major part or the minor part of the upstream version major.minor determines whether the client and broker are compatible. Differences among the maintenance versions are not considered when determining compatibility. As a result, the general pattern for upgrading Kafka from version A to version B is: 1. 2. 3. 4. Change Kafka server.properties to refer to version A. Upgrade the brokers to version B and restart the cluster. Upgrade the clients to version B. After all the clients are upgraded, change the properties from the first step to version B and restart the cluster. Upgrading your Kafka Cluster At some point, you will want to upgrade your Kafka cluster. Or in some cases, when you upgrade to a newer CDH distribution, then Kafka will be upgraded along with the full CDH upgrade. In either case, you may wish to read the sections below before upgrading. General Upgrade Information Previously, Cloudera distributed Kafka as a parcel (CDK) separate from CDH. Installation of the separate Kafka required Cloudera Manager 5.4 or higher. For a list of available parcels and packages, see CDK Powered By Apache Kafka® Version and Packaging Information Kafka bundled along with CDH. As of CDH 6, Cloudera distributes Kafka bundled along with CDH. Kafka is in the parcel itself. Installation requires Cloudera Manager 6.0 or higher. For installation instructions for Kafka using Cloudera Manager, see Cloudera Installation Guide. Cloudera recommends that you deploy Kafka on dedicated hosts that are not used for other cluster roles. Important: You cannot install an old Kafka parcel on a new CDH 6.x cluster. Upgrading Kafka from CDH 6.0.0 to other CDH 6 versions To ensure there is no downtime during an upgrade, these instructions describe performing a rolling upgrade. 
40 | Apache Kafka Guide Kafka Integration Before upgrading, ensure that you set inter.broker.protocol.version and log.message.format.version to the current Kafka version (see Table 1: Kafka and CM/CDH Supported Versions Matrix on page 39), and then unset them after the upgrade. This is a good practice because the newer broker versions can write log entries that the older brokers cannot read. If you need to rollback to the older version, and you have not set inter.broker.protocol.version and log.message.format.version, data loss can occur. From Cloudera Manager on the cluster to upgrade: 1. Explicitly set the Kafka protocol version to match what's being used currently among the brokers and clients. Update server.properties on all brokers as follows: a. Choose the Kafka service. b. Click Configuration. c. Use the Search field to find the Kafka Broker Advanced Configuration Snippet (Safety Valve) configuration property. d. Add the following properties to the safety valve: inter.broker.protocol.version = current_Kafka_version log.message.format.version = current_Kafka_version Make sure you enter full Kafka version numbers with three values, such as 0.10.0. Otherwise, you'll see an error similar to the following: 2018-06-14 14:25:47,818 FATAL kafka.Kafka$: java.lang.IllegalArgumentException: Version `0.10` is not a valid version at kafka.api.ApiVersion$$anonfun$apply$1.apply(ApiVersion.scala:72) at kafka.api.ApiVersion$$anonfun$apply$1.apply(ApiVersion.scala:72) at scala.collection.MapLike$class.getOrElse(MapLike.scala:128) e. Save your changes. The information is automatically copied to each broker. 2. Upgrade CDH. See Upgrading the CDH Cluster. Do not restart the Kafka service, select Activate Only and click OK. 3. Perform a rolling restart. Select Rolling Restart or Restart based on the downtime that you can afford. At this point, the brokers are running in compatibility mode with older clients. It can run in this mode indefinitely. If you do upgrade clients, after all clients are upgraded, remove the Safety Valve properties and restart the cluster. Upstream Upgrade Instructions The table below links to the upstream Apache Kafka documentation for detailed upgrade instructions. It is recommended that you read the instructions for your specific upgrade to identify any additional steps that apply to your specific upgrade path. Table 2: Version-Specific Upgrade Instructions CDH Kafka Version Upstream Kafka Version Detailed Upstream Instructions CDH 6.1.0 2.0.0 Upgrading from 0.8.x through 1.1.x to 2.0.0 CDH 6.0.0 1.0.1 Upgrading from 0.8.x through 0.11.0.x to 1.0.0 Kafka 3.1.0 1.0.1 Upgrading from 0.8.x through 0.11.0.x to 1.0.0 Kafka 3.0.0 0.11.0 Upgrading from 0.8.x through 0.10.2.x to 0.11.0.0 Apache Kafka Guide | 41 Kafka Integration CDH Kafka Version Upstream Kafka Version Detailed Upstream Instructions Kafka 2.2 0.10.2 Upgrading from 0.8.x through 0.10.1.x to 0.10.2.0 Kafka 2.1 0.10.0 Upgrading from 0.8.x through 0.9.x.to 0.10.0.0 Kafka 2.0 0.9.0 Upgrading from 0.8.0 through 0.8.2.x to 0.9.0.0 Managing Topics across Multiple Kafka Clusters You may have more than one Kafka cluster to support: • Geographic distribution • Disaster recovery • Organizational requirements You can distribute messages across multiple clusters. It can be handy to have a copy of one or more topics from other Kafka clusters available to a client on one cluster. Mirror Maker is a tool that comes bundled with Kafka to help automate the process of mirroring or publishing messages from one cluster to another. 
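Although the remainder of this section configures Mirror Maker through Cloudera Manager, it can help to see roughly how the tool is driven. The following is only a sketch of a direct invocation of the kafka-mirror-maker command (kafka-mirror-maker.sh in upstream Kafka), with placeholder file and topic names: consumer.properties would point bootstrap.servers (and group.id) at the source cluster, while producer.properties would point bootstrap.servers at the destination cluster.

kafka-mirror-maker --consumer.config consumer.properties \
  --producer.config producer.properties \
  --num.streams 2 \
  --whitelist "weblogs"

The --whitelist option selects which topics to copy and accepts a regular expression, so a single Mirror Maker instance can mirror many topics at once.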
"Mirroring" occurs between clusters where "replication" distributes message within a cluster. Figure 18: Mirror Maker Makes Topics Available on Multiple Clusters While the diagram shows copying to one topic, Mirror Maker’s main mode of operation is running continuously, copying one or more topics from the source cluster to the destination cluster. Keep in mind the following design notes when configuring Mirror Maker: • Mirror Maker runs as a single process. • Mirror Maker can run with multiple consumers that read from multiple partitions in the source cluster. • Mirror Maker uses a single producer to copy messages to the matching topic in the destination cluster. Consumer/Producer Compatibility The Mirror Maker consumer needs to be client compatible with the source cluster. The Mirror Maker producer needs to be client compatible with the destination cluster. 42 | Apache Kafka Guide Kafka Integration See Client/Broker Compatibility Across Kafka Versions on page 40 for more details about what it means to be "compatible." Topic Differences between Clusters Because messages are copied from the source cluster to the destination cluster—potentially through many consumers funneling into a single producer—there is no guarantee of having identical offsets or timestamps between the two clusters. In addition, as these copies occur over the network, there can be some mismatching due to retries or dropped messages. Optimize Mirror Maker Producer Location Because Mirror Maker uses a single producer and since producers typically have more difficulty with high latency and/or unreliable connections, it is better to have the producer run “closer” to the destination cluster, meaning in the same data center or on the same rack. Destination Cluster Configuration Before starting Mirror Maker, make sure that the destination cluster is configured correctly: • Make sure there is sufficient disk space to copy the topic from the source cluster to the destination cluster. • Make sure the topic exists in the destination cluster or use the kafka-configs command to set the property auto.create.topics.enable=true. See Kafka Administration Using Command Line Tools on page 62. Kerberos and Mirror Maker As mentioned earlier, Mirror Maker runs as a single process. The resulting consumers and producers rely on a single configuration setup. Mirror Maker requires that the source cluster and the destination cluster belong to the same Kerberos realm. Setting up Mirror Maker in Cloudera Manager Where Cloudera Manager is managing the destination cluster: 1. 2. 3. 4. 5. In Cloudera Manager, select the Kafka service. Choose Action > Add Role Instances. Under Kafka Mirror Maker, click Select hosts. Select the host where Mirror Maker will run and click Continue. Fill in the Destination Broker List and Source Broker List with your source and destination Kafka clusters. Use host name, IP address, or fully qualified domain name. 6. Fill out the Topic Whitelist. The whitelist is required. 7. Fill out the TLS/SSL sections if security needs to be enabled. 8. Start the Mirror Maker instance. Settings to Avoid Data Loss The Avoid Data Loss option from earlier releases has been removed in favor of automatically setting the following properties. Also note that MirrorMaker starts correctly if you enter the numeric values in the configuration snippet (rather than using "max integer" for retries and "max long" for max.block.ms). 
Producer settings • acks=all • retries=2147483647 • max.block.ms=9223372036854775807 Apache Kafka Guide | 43 Kafka Integration Consumer setting • auto.commit.enable=false MirrorMaker setting • abort.on.send.failure=true Setting up an End-to-End Data Streaming Pipeline Data Streaming Pipeline The data streaming pipeline as shown here is the most common usage of Kafka. Figure 19: Data Streaming Pipeline Architecture Some things to note about the data streaming pipeline model: • Most systems have multiple data sources sending data over the Internet, such as per store or per device. • The ingestion service usually saves older data to some form of long-term data storage. • The stream processing service can perform near real-time computation on the data extracted from the message service, such as processing transactions, detecting fraud, or alerting systems. • The results of stream processing can be sent directly to another service (such as for reporting) or can be streamed back into Kafka for one or more other services to do further real time processing. The following sections show how other components in CDH map into the data streaming model. Ingest Using Kafka with Apache Flume Apache Flume is commonly used to collect Kafka topics into a long-term data store. 44 | Apache Kafka Guide Kafka Integration Figure 20: Collecting Kafka Topics using Flume Note: Do not configure a Kafka source to send data to a Kafka sink. If you do, the Kafka source sets the topic in the event header, overriding the sink configuration and creating an infinite loop, sending messages back and forth between the source and sink. If you need to use both a Kafka source and a sink, use an interceptor to modify the event header and set a different topic. For information on configuring Kafka to securely communicate with Flume, see Configuring Flume Security with Kafka. The following sections describe how to configure Kafka sub-components for directing topics to long-term storage: Sources Use the Kafka source to stream data in Kafka topics to Hadoop. The Kafka source can be combined with any Flume sink, making it easy to write Kafka data to HDFS, HBase, and Solr. The following Flume configuration example uses a Kafka source to send data to an HDFS sink: tier1.sources = source1 tier1.channels = channel1 tier1.sinks = sink1 tier1.sources.source1.type = org.apache.flume.source.kafka.KafkaSource tier1.sources.source1.zookeeperConnect = zk01.example.com:2181 tier1.sources.source1.topic = weblogs tier1.sources.source1.groupId = flume tier1.sources.source1.channels = channel1 tier1.sources.source1.interceptors = i1 tier1.sources.source1.interceptors.i1.type = timestamp tier1.sources.source1.kafka.consumer.timeout.ms = 100 tier1.channels.channel1.type = memory tier1.channels.channel1.capacity = 10000 tier1.channels.channel1.transactionCapacity = 1000 tier1.sinks.sink1.type = hdfs tier1.sinks.sink1.hdfs.path = /tmp/kafka/%{topic}/%y-%m-%d tier1.sinks.sink1.hdfs.rollInterval = 5 tier1.sinks.sink1.hdfs.rollSize = 0 tier1.sinks.sink1.hdfs.rollCount = 0 tier1.sinks.sink1.hdfs.fileType = DataStream Apache Kafka Guide | 45 Kafka Integration tier1.sinks.sink1.channel = channel1 For higher throughput, configure multiple Kafka sources to read from the same topic. If you configure all the sources with the same groupID, and the topic contains multiple partitions, each source reads data from a different set of partitions, improving the ingest rate. The following table describes parameters that the Kafka source supports. 
Required properties are listed in bold. Property Name Default Value Description type Must be set to org.apache. flume.source.kafka. KafkaSource. zookeeperConnect The URI of the ZooKeeper server or quorum used by Kafka. This can be a single host (for example, zk01. example.com:2181) or a comma-separated list of hosts in a ZooKeeper quorum (for example, zk01.example. com:2181,zk02.example. com:2181, zk03.example. com:2181). topic The Kafka topic from which this source reads messages. Flume supports only one topic per source. groupID flume The unique identifier of the Kafka consumer group. Set the same groupID in all sources to indicate that they belong to the same consumer group. batchSize 1000 The maximum number of messages that can be written to a channel in a single batch. batchDurationMillis 1000 The maximum time (in ms) before a batch is written to the channel. The batch is written when the batchSize limit or batchDurationMillis limit is reached, whichever comes first. Other properties supported by the Kafka consumer Source Tuning Notes The Kafka source overrides two Kafka consumer parameters: 46 | Apache Kafka Guide Used to configure the Kafka consumer used by the Kafka source. You can use any consumer properties supported by Kafka. Prepend the consumer property name with the prefix kafka. (for example, kafka.fetch.min.bytes). See the Apache Kafka documentation topic Consumer Configs for the full list of Kafka consumer properties. Kafka Integration 1. auto.commit.enable is set to false by the source, committing every batch. For improved performance, set this parameter to true using the kafka.auto.commit.enable setting. Note that this change can lead to data loss if the source goes down before committing. 2. consumer.timeout.ms is set to 10, so when Flume polls Kafka for new data, it waits no more than 10 ms for the data to be available. Setting this parameter to a higher value can reduce CPU utilization due to less frequent polling, but the trade-off is that it introduces latency in writing batches to the channel. Kafka Sinks Use the Kafka sink to send data to Kafka from a Flume source. You can use the Kafka sink in addition to Flume sinks, such as HBase or HDFS. The following Flume configuration example uses a Kafka sink with an exec source: tier1.sources = source1 tier1.channels = channel1 tier1.sinks = sink1 tier1.sources.source1.type = exec tier1.sources.source1.command = /usr/bin/vmstat 1 tier1.sources.source1.channels = channel1 tier1.channels.channel1.type = memory tier1.channels.channel1.capacity = 10000 tier1.channels.channel1.transactionCapacity = 1000 tier1.sinks.sink1.type = org.apache.flume.sink.kafka.KafkaSink tier1.sinks.sink1.topic = sink1 tier1.sinks.sink1.brokerList = kafka01.example.com:9092,kafka02.example.com:9092 tier1.sinks.sink1.channel = channel1 tier1.sinks.sink1.batchSize = 20 The following table describes parameters the Kafka sink supports. Required properties are listed in bold. Property Name Default Value Description type Must be set to org.apache. flume.sink.kafka.KafkaSink. brokerList The brokers the Kafka sink uses to discover topic partitions, formatted as a comma-separated list of hostname:port entries. You do not need to specify the entire list of brokers, but you specify at least two for high availability. topic default-flume-topic The Kafka topic to which messages are published by default. If the event header contains a topic field, the event is published to the designated topic, overriding the configured topic. 
batchSize 100 The number of messages to process in a single batch. Specifying a larger batchSize can improve throughput and increase latency. request.required.acks 0 The number of replicas that must acknowledge a message before it is written successfully. Possible values are: Apache Kafka Guide | 47 Kafka Integration Property Name Default Value Description 0 do not wait for an acknowledgment 1 wait for the leader to acknowledge only -1 wait for all replicas to acknowledge To avoid potential loss of data in case of a leader failure, set this to -1. Other properties supported by the Kafka producer Used to configure the Kafka producer used by the Kafka sink. You can use any producer properties supported by Kafka. Prepend the producer property name with the prefix kafka (for example, kafka.compression.codec). See the Apache Kafka documentation topic Producer Configs for the full list of Kafka producer properties. The Kafka sink uses the topic and key properties from the FlumeEvent headers to determine where to send events in Kafka. If the header contains the topic property, that event is sent to the designated topic, overriding the configured topic. If the header contains the key property, that key is used to partition events within the topic. Events with the same key are sent to the same partition. If the key parameter is not specified, events are distributed randomly to partitions. Use these properties to control the topics and partitions to which events are sent through the Flume source or interceptor. Kafka Channels CDH includes a Kafka channel to Flume in addition to the existing memory and file channels. You can use the Kafka channel: • To write to Hadoop directly from Kafka without using a source. • To write to Kafka directly from Flume sources without additional buffering. • As a reliable and highly available channel for any source/sink combination. The following Flume configuration uses a Kafka channel with an exec source and HDFS sink: tier1.sources = source1 tier1.channels = channel1 tier1.sinks = sink1 tier1.sources.source1.type = exec tier1.sources.source1.command = /usr/bin/vmstat 1 tier1.sources.source1.channels = channel1 tier1.channels.channel1.type = org.apache.flume.channel.kafka.KafkaChannel tier1.channels.channel1.capacity = 10000 tier1.channels.channel1.zookeeperConnect = zk01.example.com:2181 tier1.channels.channel1.parseAsFlumeEvent = false tier1.channels.channel1.topic = channel2 tier1.channels.channel1.consumer.group.id = channel2-grp tier1.channels.channel1.auto.offset.reset = earliest tier1.channels.channel1.kafka.bootstrap.servers = 48 | Apache Kafka Guide Kafka Integration kafka02.example.com:9092,kafka03.example.com:9092 tier1.channels.channel1.transactionCapacity = 1000 tier1.channels.channel1.kafka.consumer.max.partition.fetch.bytes=2097152 tier1.sinks.sink1.type = hdfs tier1.sinks.sink1.hdfs.path = /tmp/kafka/channel tier1.sinks.sink1.hdfs.rollInterval = 5 tier1.sinks.sink1.hdfs.rollSize = 0 tier1.sinks.sink1.hdfs.rollCount = 0 tier1.sinks.sink1.hdfs.fileType = DataStream tier1.sinks.sink1.channel = channel1 The following table describes parameters the Kafka channel supports. Required properties are listed in bold. Property Name Default Value Description type Must be set to org.apache. flume.channel.kafka. KafkaChannel. brokerList The brokers the Kafka channel uses to discover topic partitions, formatted as a comma-separated list of hostname:port entries. 
You do not need to specify the entire list of brokers, but you should specify at least two for high availability. zookeeperConnect The URI of the ZooKeeper server or quorum used by Kafka. This can be a single host (for example, zk01.example.com:2181) or a comma-separated list of hosts in a ZooKeeper quorum (for example, zk01.example. com:2181,zk02.example. com:2181, zk03.example. com:2181). topic flume-channel The Kafka topic the channel will use. groupID flume The unique identifier of the Kafka consumer group the channel uses to register with Kafka. parseAsFlumeEvent true Set to true if a Flume source is writing to the channel and expects AvroDataums with the FlumeEvent schema (org.apache.flume. source.avro.AvroFlumeEvent) in the channel. Set to false if other producers are writing to the topic that the channel is using. auto.offset.reset latest What to do when there is no initial offset in Kafka or if the current offset does not exist on the server (for example, because the data is deleted). • earliest: automatically reset the offset to the earliest offset Apache Kafka Guide | 49 Kafka Integration Property Name Default Value Description • latest: automatically reset the offset to the latest offset • none: throw exception to the consumer if no previous offset is found for the consumer's group • anything else: throw exception to the consumer. kafka.consumer.timeout.ms 100 Polling interval when writing to the sink. consumer.max.partition.fetch.bytes 1048576 The maximum amount of data per-partition the server will return. Other properties supported by the Kafka producer Used to configure the Kafka producer. You can use any producer properties supported by Kafka. Prepend the producer property name with the prefix kafka. (for example, kafka.compression.codec). See the Apache Kafka documentation topic Producer Configs for the full list of Kafka producer properties. CDH Flume Kafka Compatibility The section Client/Broker Compatibility Across Kafka Versions on page 40 covered the basics of Kafka client/broker compatibility. Flume has an embedded Kafka client which it uses to talk to Kafka clusters. Since the generally accepted practice is to have the broker running the same or newer version as the client, a CDH Flume version requires being matched to a minimum Kafka version. This is illustrated in the table below. Table 3: Flume Embedded Client and Kafka Compatibility CDH Flume Version Embedded Kafka Client Version Minimum Supported CDH Kafka Version (Remote or Local) CDH 6.0.0 1.0.1 CDH 6.0.0 CDH 5.14.x 0.10.2-kafka-2.2.0 Kafka 2.2 CDH 5.13.x 0.9.0-kafka-2.0.2 Kafka 2.0 CDH 5.12.x 0.9.0-kafka-2.0.2 Kafka 2.0 CDH 5.11.x 0.9.0-kafka-2.0.2 Kafka 2.0 CDH 5.10.x 0.9.0-kafka-2.0.2 Kafka 2.0 CDH 5.9.x 0.9.0-kafka-2.0.2 Kafka 2.0 CDH 5.8.x 0.9.0-kafka-2.0.0 Kafka 2.0 CDH 5.7.x 0.9.0-kafka-2.0.0 Kafka 2.0 Securing Flume with Kafka When using Flume with a secured Kafka service, you can use Cloudera Manager to generate security related Flume agent configuration. 50 | Apache Kafka Guide Kafka Integration In Cloudera Manager, on the Flume Configuration page, select the Kafka service you want to connect to. This generates the following files: • flume.keytab • jaas.conf It also generates security protocol and Kerberos service name properties for the Flume agent configuration. If TLS/SSL is also configured for Kafka brokers, the setting also adds SSL truststore properties to the beginning of the Flume agent configuration. 
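As a purely hypothetical illustration of what such generated settings can look like (the exact property names and values in your deployed configuration may differ; the agent, sink, path, and password values here are placeholders, following the kafka.-prefix convention this guide uses for passing Kafka client properties through Flume):

tier1.sinks.sink1.kafka.security.protocol = SASL_SSL
tier1.sinks.sink1.kafka.sasl.kerberos.service.name = kafka
tier1.sinks.sink1.kafka.ssl.truststore.location = /var/private/ssl/kafka.client.truststore.jks
tier1.sinks.sink1.kafka.ssl.truststore.password = SamplePassword123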
Review the deployed agent configuration and if the defaults do not match your environment (such as the truststore password), you can override the settings by adding the same property to the agent configuration. Using Kafka with Apache Spark Streaming for Stream Processing For real-time stream computation, Apache Spark is the tool of choice in CDH. Figure 21: Data Streaming Pipeline with Spark CDH Spark/Kafka Compatibility The section Client/Broker Compatibility Across Kafka Versions on page 40 covered the basics of Kafka client/broker compatibility. Spark maintains two embedded Kafka clients and can be configured to use either one. This table in the Apache Spark documentation illustrates the two clients that are available. For information on how to configure Spark Streaming to receive data from Kafka, refer to the Kafka version you are using in the following table. Embedded Kafka Client CDH Spark (Choose based on Kafka Version cluster) Minimum Minimum CDH Kafka Apache Version Kafka (Remote Version or Local) Upstream Integration Guide API Stability CDH 6.0 spark-streaming-kafka-0-10 0.10.0 Kafka 2.1 Spark 2.2 + Kafka 0.10 Stable CDH 6.0 spark-streaming-kafka-0-8 0.8.2.1 Kafka 2.0 Spark 2.2 + Kafka 0.8 Deprecated Spark 2.2 spark-streaming-kafka-0-10 0.10.0 Kafka 2.1 Spark 2.2 + Kafka 0.10 Experimental Spark 2.2 spark-streaming-kafka-0-8 0.8.2.1 Kafka 2.0 Spark 2.2 + Kafka 0.8 Stable Apache Kafka Guide | 51 Kafka Integration Embedded Kafka Client CDH Spark (Choose based on Kafka Version cluster) Minimum Minimum CDH Kafka Apache Version Kafka (Remote Version or Local) Upstream Integration Guide API Stability Spark 2.1 spark-streaming-kafka-0-10 0.10.0 Kafka 2.1 Spark 2.1 + Kafka 0.10 Experimental Spark 2.1 spark-streaming-kafka-0-8 0.8.2.1 Kafka 2.0 Spark 2.1 + Kafka 0.8 Stable Spark 2.0 spark-streaming-kafka-0-8 0.8.2.? Kafka 2.0 Spark 2.0 + Kafka Stable Validating Kafka Integration with Spark Streaming To validate your Kafka integration with Spark Streaming, run the KafkaWordCount example in Spark. If you installed Spark using parcels, use the following command: /opt/cloudera/parcels/CDH/lib/spark/bin/run-example streaming.KafkaWordCount zkQuorum group topics numThreads If you installed Spark using packages, use the following command: /usr/lib/spark/bin/run-example streaming.KafkaWordCount zkQuorum group topics numThreads Replace the variables as follows: • zkQuorum: ZooKeeper quorum URI used by Kafka For example: zk01.example.com:2181,zk02.example.com:2181,zk03.example.com:2181 • group: Consumer group used by the application. • topics: Kafka topic containing the data for the application. • numThreads: Number of consumer threads reading the data. If this is higher than the number of partitions in the Kafka topic, some threads will be idle. Note: If multiple applications use the same group and topic, each application receives a subset of the data. Securing Spark with Kafka Using Spark Streaming with a Kafka service that’s already secured requires configuration changes on the Spark side. You can find a nice description of the required changes in the spark-dstream-secure-kafka-app sample project on GitHub. Developing Kafka Clients Previously, examples were provided for producing messages to and consuming messages from a Kafka cluster using the command line. For most cases, running Kafka producers and consumers using shell scripts and Kafka’s command line scripts cannot be used in practice. In those cases, native Kafka client development is the generally accepted option. 
Simple Client Examples

Let's start with a simple working example of a producer/consumer program. This section includes the following code examples:

pom.xml

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.cloudera.kafkaexamples</groupId>
    <artifactId>kafka-examples</artifactId>
    <packaging>jar</packaging>
    <version>1.0</version>
    <name>kafkadev</name>
    <url>http://maven.apache.org</url>
    <repositories>
        <repository>
            <id>cloudera</id>
            <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
        </repository>
    </repositories>
    <dependencies>
        <dependency>
            <groupId>org.apache.kafka</groupId>
            <artifactId>kafka-clients</artifactId>
            <version>1.0.1-cdh6.0.0</version>
            <scope>compile</scope>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.7.0</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>

SimpleProducer.java

The example includes Java properties for setting up the client, identified in the comments; the rest of the code is the functional part of the producer. This code is compatible with versions as old as the 0.9.0-kafka-2.0.0 version of Kafka.

package com.cloudera.kafkaexamples;

import java.util.Date;
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SimpleProducer {

    public static void main(String[] args) {

        // Generate total consecutive events starting with ufoId
        long total = Long.parseLong("10");
        long ufoId = Math.round(Math.random() * Integer.MAX_VALUE);

        // Set up client Java properties
        Properties props = new Properties();
        props.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG,
                "host1:9092,host2:9092,host3:9092");
        props.setProperty(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                StringSerializer.class.getName());
        props.setProperty(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                StringSerializer.class.getName());
        props.setProperty(ProducerConfig.ACKS_CONFIG, "1");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (long i = 0; i < total; i++) {
                String key = Long.toString(ufoId++);
                long runtime = new Date().getTime();
                double latitude = (Math.random() * (2 * 85.05112878)) - 85.05112878;
                double longitude = (Math.random() * 360.0) - 180.0;
                String msg = runtime + "," + latitude + "," + longitude;
                try {
                    ProducerRecord<String, String> data =
                            new ProducerRecord<String, String>("ufo_sightings", key, msg);
                    producer.send(data);
                    long wait = Math.round(Math.random() * 25);
                    Thread.sleep(wait);
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }
    }
}

SimpleConsumer.java

Note that this consumer is designed as an infinite loop. In normal operation of Kafka, all the producers could be idle while consumers are likely to be still running. The example includes Java properties for setting up the client, identified in the comments; the rest of the code is the functional part of the consumer. This code is compatible with versions as old as the 0.9.0-kafka-2.0.0 version of Kafka.
package com.cloudera.kafkaexamples;

import java.util.Arrays;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SimpleConsumer {

    public static void main(String[] args) {

        // Set up client Java properties
        Properties props = new Properties();
        props.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG,
                "host1:9092,host2:9092,host3:9092");
        // Just a user-defined string to identify the consumer group
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "test");
        // Enable auto offset commit
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "true");
        props.put(ConsumerConfig.AUTO_COMMIT_INTERVAL_MS_CONFIG, "1000");
        props.setProperty(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                StringDeserializer.class.getName());
        props.setProperty(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // List of topics to subscribe to
            consumer.subscribe(Arrays.asList("ufo_sightings"));
            while (true) {
                try {
                    ConsumerRecords<String, String> records = consumer.poll(100);
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.printf("Offset = %d\n", record.offset());
                        System.out.printf("Key = %s\n", record.key());
                        System.out.printf("Value = %s\n", record.value());
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }
    }
}

Moving Kafka Clients to Production

Now that you've seen the basic examples of a producer and consumer, prototyping your own designs shouldn't be too difficult. However, your code will likely undergo several iterations that improve on scalability, debuggability, robustness, and maintainability. This section presents recommendations in the form of code snippets that illustrate some of the important ways to use the producer and consumer APIs.

Reuse your Producer/Consumer object

In these examples, the consumer constructor should be called once and the poll() method called within a loop. If this object is not reused, then a new connection to the broker is opened with each new KafkaConsumer object created.

Recommended

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(100);
}

Not Recommended

while (true) {
    KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
    ConsumerRecords<String, String> records = consumer.poll(100);
}

Similarly, it is recommended that you use one KafkaConsumer and/or KafkaProducer object per thread. Creating more objects opens multiple ports per broker connection. Overusing ephemeral ports can cause performance issues. In addition, Cloudera recommends setting and using a fixed client.id for producers and consumers when they connect to the brokers. If this is not done, Kafka assigns a new client ID every time a new connection is established, which can severely increase resource utilization (memory) on the broker side.

Each KafkaConsumer object requires calling poll() frequently

As explained in the Apache Kafka documentation topic New Consumer Configs, any consumer connected to a partition will time out if poll() is not called within the period defined by max.poll.interval.ms.
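A minimal sketch of how a fixed client.id and an explicit poll-interval budget can be expressed in the consumer properties follows. The class name, client ID, and the 5-minute value are illustrative assumptions, not required values:

package com.cloudera.kafkaexamples;

import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class TunedConsumerConfig {

    public static KafkaConsumer<String, String> build() {
        Properties props = new Properties();
        props.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG,
                "host1:9092,host2:9092,host3:9092");
        props.setProperty(ConsumerConfig.GROUP_ID_CONFIG, "test");
        // Fixed client.id, as recommended above, so the broker does not have to
        // track a stream of generated client IDs for this application.
        props.setProperty(ConsumerConfig.CLIENT_ID_CONFIG, "test-consumer-1");
        // Upper bound on the time between poll() calls before the consumer is
        // considered failed and its partitions are rebalanced.
        // 300000 ms (5 minutes) is shown purely as an illustration.
        props.setProperty(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "300000");
        props.setProperty(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                StringDeserializer.class.getName());
        props.setProperty(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                StringDeserializer.class.getName());
        return new KafkaConsumer<>(props);
    }
}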
In the example below, the call to myDataProcess.doStuff(records) can cause poll() to be called infrequently. This could be due to a combination of reasons: • • • • Being a blocking method call. Doing work on a remote machine. Having highly variable processing time. Saving to storage that has highly variable I/O throughput. In such cases, consider having another thread or process doing the actual work and making the handoff as lightweight as possible. Apache Kafka Guide | 55 Kafka Integration Example: poll() gets KafkaException due to session timeout while (true) { KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props); ConsumerRecords<String, String> records = consumer.poll(100); // the call below should return quickly in all cases myDataProcess.doStuff(records); } Catch all exceptions from poll() From the poll() Javadoc page, you can see that the poll() method throws several exceptions. If the catch statements (bold in the example) are not complete, then any uncaught exception will end up in the finally statement calling KafkaConsumer#close(). This will not be the desired behavior in many cases. while (true) { try { ConsumerRecords<String, String> records = consumer.poll(100); } catch (Exception e) { e.printStackTrace(); } finally { consumer.close(); } Callback#onCompletion() should always exit without errors The interface org.apache.kafka.clients.producer.Callback (Javadoc) is used to define a class that can be used upon completion of a KafkaProducer#send() call. It allows for tracking, clean up, or other administrative code to be called. An example of unintended usage is to call KafkaProducer#send() within the Callback#onCompletion() method, essentially mimicking a retry. Because the onCompletion() method is always expected to return cleanly and the send() method makes no such guarantees, calling send() within the callback could end up hanging the code in case of network or broker issues. Check your API usage against the latest API The documentation for the latest upstream release of Apache Kafka indicates if there have been any changes to how the APIs are used (setup, read, write). Reviewing the latest information could help avoid upgrade-related changes to your producer or consumer. Some examples from past versions include: Old Class or Package New Class or Package kafka.producer.ProducerConfig java.util.Properties kafka.javaapi.* kafka.api.* kafka.producer.KeyedMessage kafka.clients.producer.ProducerRecord Hidden Dependency on Network Availability Network dependency is one of the more subtle issues. Given the consumer dependencies on Sentry and Zookeeper, having a combination of frequent or prolonged DNS or network outages can also cause various session timeouts to occur. Such timeouts will force partition rebalancing on the brokers, which will worsen general Kafka reliability. Should these issues be common in your network, you may need to have a less straightforward design that can handle such reliability issues outside of the Kafka client. Read the Details Carefully in the Apache Kafka Javadoc The following pages have additional details about Kafka client programming: • KafkaConsumer Javadoc 56 | Apache Kafka Guide Kafka Integration • KafkaProducer Javadoc These Javadoc pages are quite dense with information. They assume you have sufficient background in reliable computing, networking, multithreading, and distributed systems to use the APIs correctly. 
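As a concrete illustration of the earlier point about Callback#onCompletion(), a callback that only records the outcome, and never retries or throws, might look like the following sketch; the class name and the use of standard output are assumptions for illustration:

package com.cloudera.kafkaexamples;

import org.apache.kafka.clients.producer.Callback;
import org.apache.kafka.clients.producer.RecordMetadata;

public class LoggingCallback implements Callback {

    @Override
    public void onCompletion(RecordMetadata metadata, Exception exception) {
        // Record the outcome only; do not call send() again and do not throw.
        if (exception != null) {
            System.err.println("Send failed: " + exception);
        } else {
            System.out.println("Stored in " + metadata.topic() + "-"
                    + metadata.partition() + " at offset " + metadata.offset());
        }
    }
}

Such a callback is passed as the second argument to KafkaProducer#send(), for example producer.send(data, new LoggingCallback()).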
While the previous sections point out many caveats in using the client APIs, the Javadoc (and ultimately the source code) provides a more detailed explanation. Kafka Metrics Kafka uses Yammer metrics to record internal performance measurements. The metrics are exposed via Java Management Extensions (JMX) and can be read with a JMX console. Metrics Categories There are metrics available in the various components of Kafka. In addition, there are some metrics specific to how Cloudera Manager and Kafka interact. This table has pointers to both the Apache Kafka metrics names and the Cloudera Manager metric names. Table 4: Metrics by Category Category Cloudera Manager Metrics Doc Cloudera Manager Base Metrics on page 86 Apache Kafka Metrics Doc Kafka Service Broker Broker Metrics on page 87 Broker Broker Topic Metrics on page 155 Replica Metrics on page 160 Common Client Producer/Consumer Client-to-Broker Producer Producer Producer Sender Consumer Consumer Group Consumer Fetch Mirror Maker Mirror Maker Metrics on page 159 Same as Producer or Consumer tables Viewing Metrics Cloudera Manager records most of these metrics and makes them available via Chart Builder. Because Cloudera Manager cannot track metrics on any clients (that is, producer or consumer), you may wish to use an alternative JMX console program to check metrics. There are several JMX console options: • The JDK comes with the jconsole utility. • VisualVM has a MBeans plugin. Apache Kafka Guide | 57 Kafka Integration Building Cloudera Manager Charts with Kafka Metrics The Charts edit menu looks like a small pencil icon in the Charts page of the Cloudera Manager console. From there, choose Add from Chart Builder and enter a query for the appropriate metric. SELECT metric WHERE filter Some specific examples of queries for Cloudera Metrics are: Controllers across all brokers This chart shows the active controller across all brokers. It is useful for checking active controller status (should be one at any given time, transitions should be fast). SELECT kafka_active_controller WHERE roleType=KAFKA_BROKER Network idle rate >Chart showing the network processor idle rate across all brokers. If idle time is always zero, then probably the num.network.threads property may need to be increased. SELECT kafka_network_processor_avg_idle_rate WHERE roleType=KAFKA_BROKER Partitions per broker Chart showing the number of partitions per broker. It is useful for detecting partition imbalances early. SELECT kafka_partitions WHERE roleType=KAFKA_BROKER Partition activity Chart tracking partition activity on a single broker. SELECT kafka_partitions, kafka_under_replicated_partitions WHERE hostname=host1.domain.com Mirror Maker activity Chart for tracking Mirror Maker behavior. Since Mirror Maker has one or more consumers and a single producer, most consumer or metrics should be usable with this query. SELECT producer or consumer metric WHERE roleType=KAFKA_MIRROR_MAKER 58 | Apache Kafka Guide Kafka Administration Kafka Administration This section describes managing a Kafka cluster in production, including: Kafka Administration Basics Broker Log Management Kafka brokers save their data as log segments in a directory. The logs are rotated depending on the size and time settings. The most common log retention settings to adjust for your cluster are shown below. These are accessible in Cloudera Manager via the Kafka > Configuration tab. • log.dirs: The location for the Kafka data (that is, topic directories and log segments). 
• log.retention.{ms|minutes|hours}: The retention period for the entire log. Any older log segments are removed. • log.retention.bytes: The retention size for the entire log. There are many more variables available for fine-tuning broker log management. For more detailed information, look at the relevant variables in the Apache Kafka documentation topic Broker Configs. • • • • • log.dirs log.flush.* log.retention.* log.roll.* log.segment.* Record Management There are two pieces to record management, log segments and log cleaner. As part of the general data storage, Kafka rolls logs periodically based on size or time limits. Once either limit is hit, a new log segment is created with the all new data being placed there, while older log segments should generally no longer change. This helps limit the risk of data loss or corruption to a single segment instead of the entire log. • log.roll.{ms|hours}: The time period for each log segment. Once the current segment is older than this value, it goes through log segment rotation. • log.segment.bytes: The maximum size for a single log segment. There is an alternative to simply removing log segments for a partition. There is another feature based on the log cleaner. When the log cleaner is enabled, individual records in older log segments can be managed differently: • log.cleaner.enable: This is a global setting in Kafka to enable the log cleaner. • cleanup.policy: This is a per-topic property that is usually set at topic creation time. There are two valid values for this property, delete and compact. • log.cleaner.min.compaction.lag.ms: This is the retention period for the “head” of the log. Only records outside of this retention period will be compacted by the log cleaner. The compact policy, also called log compaction, assumes that the "most recent Kafka record is important." Some examples include tracking a current email address or tracking a current mailing address. With log compaction, older records with the same key are removed from a log segment and the latest one is kept. This effectively removes some offsets from the partition. Apache Kafka Guide | 59 Kafka Administration Broker Garbage Log Collection and Log Rotation Both broker JVM garbage collection and JVM garbage log rotation is enabled by default in the Kafka version delivered with CDH. Garbage collection logs are written in the agent process directory by default. Example path: /run/cloudera-scm-agent/process/99-kafka-KAFKA_BROKER/kafkaServer-gc.log Changing the default directory of garbage collection logs is currently not supported. However, you can configure properties related garbage log rotation with the Kafka Broker Environment Advanced Configuration Snippet (Safety Valve) property. 1. In Cloudera Manager, go to the Kafka service and click Configuration. 2. Find the Kafka Broker Environment Advanced Configuration Snippet (Safety Valve) property. 3. Add the following line to the property: Modify the values of as required. KAFKA_GC_LOG_OPTS="-XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=100M" The flags used are as follows: • +UseGCLogFileRotation: Enables garbage log rotation. • -XX:NumberOfGCLogFiles: Specifies the number of files to use when rotating logs. • -XX:GCLogFileSize: Specifies the size when the log will be rotated. 4. Click on Save Changes. 5. Restart the Kafka service to apply the changes. Adding Users as Kafka Administrators In some cases, additional users besides the kafka account need administrator access. 
This can be done in Cloudera Manager by going to Kafka > Configuration > Super users. Migrating Brokers in a Cluster Brokers can be moved to a new host in a Kafka cluster. This might be needed in the case of catastrophic hardware failure. Make sure the following are true before starting: • Make sure the cluster is healthy. • Make sure all replicas are in sync. • Perform the migration when there is minimal load on the cluster. Brokers need to be moved one-by-one. There are two techniques available: Using kafka-reassign-partitions tool This method involves more manual work to modify JSON, but does not require manual edits to configuration files. For more information, see kafka-reassign-partitions on page 65. Modify the broker IDs in meta.properties This technique involves less manual work, but requires modifying an internal configuration file. 1. Start up the new broker as a member of the old cluster. This creates files in the data directory. 2. Stop both the new broker and the old broker that it is replacing. 60 | Apache Kafka Guide Kafka Administration 3. Change broker.id of the new broker to the broker.id of the old one both in Cloudera Manager and in data directory/meta.properties. 4. (Optional) Run rsync to copy files from one broker to another. See Using rsync to Copy Files from One Broker to Another on page 61. 5. Start up the new broker. It re-replicates data from the other nodes. Note that data intensive administration operations such as rebalancing partitions, adding a broker, removing a broker, or bootstrapping a new machine can cause significant additional load on the cluster. To avoid performance degradation of business workloads, you can limit the resources that these background processes can consume by specifying the -throttleparameter when running kafka-reassign-partitions. Using rsync to Copy Files from One Broker to Another You can run rsync command to copy over all data from an old broker to a new broker, preserving modification times and permissions. Using rsync allows you to avoid having to re-replicate the data from the leader. You have to ensure that the disk structures match between the two brokers, or you have to verify the meta.properties file between the source and destination brokers (because there is one meta.properties file for each data directory). Run the following command on destination broker: rsync -avz src_broker:src_data_dir dest_data_dir If you plan to change the broker ID, edit dest_data_dir/meta.properties. Setting User Limits for Kafka Kafka opens many files at the same time. The default setting of 1024 for the maximum number of open files on most Unix-like systems is insufficient. Any significant load can result in failures and cause error messages such as java.io.IOException...(Too many open files) to be logged in the Kafka or HDFS log files. You might also notice errors such as this: ERROR Error in acceptor (kafka.network.Acceptor) java.io.IOException: Too many open files Cloudera recommends setting the value to a relatively high starting point, such as 32,768. You can monitor the number of file descriptors in use on the Kafka Broker dashboard. In Cloudera Manager: 1. Go to the Kafka service. 2. Select a Kafka Broker. 3. Open Charts Library > Process Resources and scroll down to the File Descriptors chart. See Viewing Charts for Cluster, Service, Role, and Host Instances. Quotas For a quick video introduction to quotas, see Quotas. In CDK 2.0 Powered by Apache Kafka and higher, Kafka can enforce quotas on produce and fetch requests. 
Producers and consumers can use very high volumes of data. This can monopolize broker resources, cause network saturation, and generally deny service to other clients and the brokers themselves. Quotas protect against these issues and are important for large, multi-tenant clusters where a small set of clients using high volumes of data can degrade the user experience. Apache Kafka Guide | 61 Kafka Administration Quotas are byte-rate thresholds, defined per client ID. A client ID logically identifies an application making a request. A single client ID can span multiple producer and consumer instances. The quota is applied for all instances as a single entity. For example, if a client ID has a produce quota of 10 MB/s, that quota is shared across all instances with that same ID. When running Kafka as a service, quotas can enforce API limits. By default, each unique client ID receives a fixed quota in bytes per second, as configured by the cluster (quota.producer.default, quota.consumer.default). This quota is defined on a per-broker basis. Each client can publish or fetch a maximum of X bytes per second per broker before it gets throttled. The broker does not return an error when a client exceeds its quota, but instead attempts to slow the client down. The broker computes the amount of delay needed to bring a client under its quota and delays the response for that amount of time. This approach keeps the quota violation transparent to clients (outside of client-side metrics). This also prevents clients from having to implement special backoff and retry behavior. Setting Quotas You can override the default quota for client IDs that need a higher or lower quota. The mechanism is similar to per-topic log configuration overrides. Write your client ID overrides to ZooKeeper under /config/clients. All brokers read the overrides, which are effective immediately. You can change quotas without having to do a rolling restart of the entire cluster. By default, each client ID receives an unlimited quota. The following configuration sets the default quota per producer and consumer client ID to 10 MB/s. quota.producer.default=10485760 quota.consumer.default=10485760 To set quotas using Cloudera Manager, open the Kafka Configuration page and search for Quota. Use the fields provided to set the Default Consumer Quota or Default Producer Quota. For more information, see Modifying Configuration Properties Using Cloudera Manager. Kafka Administration Using Command Line Tools In some situations, it is convenient to use the command line tools available in Kafka to administer your cluster. However, it is important to note that not all tools available for Kafka are supported by Cloudera. Moreover, certain administration tasks can be carried more easily and conveniently using Cloudera Manager. Therefore, before you continue, make sure to review Unsupported Command Line Tools on page 62 and Notes on Kafka CLI Administration on page 63. Note: Output examples in this document are cleaned and formatted for easier readability. Unsupported Command Line Tools The following tools can be found as part of the Kafka distribution, but their use is generally discouraged for various reasons as documented here. Tool Notes connect-distributed Kafka Connect is currently not supported. connect-standalone kafka-acls Cloudera recommends using Sentry to manage ACLs instead of this tool. kafka-broker-api-versions Primarily useful for Client-to-Broker protocol related development. 
62 | Apache Kafka Guide Kafka Administration Tool Notes kafka-configs Use Cloudera Manager to adjust any broker or security properties instead of the kafka-configs tool. This tool should only be used to modify topic properties. kafka-delete-records Do not use with CDH. kafka-mirror-maker Use Cloudera Manager to create any CDH Mirror Maker instance. kafka-preferred-replica-election This tool causes leadership for each partition to be transferred back to the 'preferred replica'. It can be used to balance leadership among the servers. It is recommended to use kafka-reassign-partitions instead of kafka-preferred-replica-election. kafka-replay-log-producer Can be used to “rename” a topic. kafka-replica-verification Validates that all replicas for a set of topics have the same data. This tool is a “heavy duty” version of the ISR column of kafka-topics tool. kafka-server-start Use Cloudera Manager to manage any Kafka host. kafka-server-stop kafka-simple-consumer-shell Deprecated in Apache Kafka. kafka-streams-application-reset Kafka Streams is currently not supported. kafka-verifiable-consumer These scripts are intended for system testing. kafka-verifiable-producer zookeeper-security-migration Use Cloudera Manager to manage any Zookeeper host. zookeeper-server-start zookeeper-server-stop zookeeper-shell Limit usage of this script to reading information from Zookeeper. Notes on Kafka CLI Administration Here are some additional points to be aware of regarding Kafka administration: • Use Cloudera Manager to start and stop Kafka and Zookeeper services. Do not use the kafka-server-start, kafka-server-stop, zookeeper-server-start, or zookeeper-server-stop commands. • For a parcel installation, all Kafka command line tools are located in /opt/cloudera/parcels/KAFKA/lib/kafka/bin/. For a package installation, all such tools can be found in /usr/bin/. • Ensure that the JAVA_HOME environment variable is set to your JDK installation directory before using the command-line tools. For example: export JAVA_HOME=/usr/java/jdk1.8.0_144-cloudera • Using any Zookeeper command manually can be very difficult to get right when it comes to interaction with Kafka. Cloudera recommends that you avoid doing any write operations or ACL modifications in Zookeeper. Apache Kafka Guide | 63 Kafka Administration kafka-topics Use the kafka-topics tool to generate a snapshot of topics in the Kafka cluster. kafka-topics --zookeeper zkhost --describe Topic: topic-a1 Topic: Isr: 64,62,63 Topic: Isr: 62,63,64 Topic: Isr: 63,64,62 Topic: topic-a2 Topic: Isr: 64,62,63 PartitionCount:3 ReplicationFactor:3 topic-a1 Partition: 0 Leader: 64 Configs: Replicas: 64,62,63 topic-a1 Partition: 1 Leader: 62 Replicas: 62,63,64 topic-a1 Partition: 2 Leader: 63 Replicas: 63,64,62 PartitionCount:1 ReplicationFactor:3 topic-a2 Partition: 0 Leader: 64 Configs: Replicas: 64,62,63 The output lists each topic and basic partition information. Note the following about the output: • Partition count: The more partitions, the higher the possible parallelism among consumers and producers. • Replication factor: Shows 1 for no redundancy and higher for more redundancy. • Replicas and in-sync replicas (ISR): Shows which broker ID’s have the partitions and which replicas are current. There are situations where this tool shows an invalid value for the leader broker ID or the number of ISRs is fewer than the number of replicas. In those cases, there may be something wrong with those specific topics. 
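To narrow the output down to potentially problematic topics, the describe action can be combined with filter flags. The following invocations are a sketch that reuses the zkhost placeholder from above and assumes your Kafka release ships these flags:

kafka-topics --zookeeper zkhost --describe --under-replicated-partitions
kafka-topics --zookeeper zkhost --describe --unavailable-partitions

The first command lists only partitions whose ISR is smaller than the replica set; the second lists partitions whose leader is currently unavailable.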
It is possible to change topic configuration properties using this tool. Increasing the partition count, the replication factor or both is not recommended. kafka-configs The kafka-configs tool allows you to set and unset properties to topics. Cloudera recommends that you use Cloudera Manager instead of this tool to change properties on brokers, because this tool bypasses any Cloudera Manager safety checks. Setting a topic property: kafka-configs --zookeeper zkhost --entity-type topics --entity-name topic --alter --add-config property=value Checking a topic property: $ kafka-configs --zookeeper zkhost --entity-type topics --entity-name topic --describe Unsetting a topic property: $ kafka-configs --zookeeper zkhost --entity-type topics --entity-name topic --alter --delete-config property The Apache Kafka documentation includes a complete list of topic properties. kafka-console-consumer The kafka-console-consumer tool can be useful in a couple of ways: • Acting as an independent consumer of particular topics. This can be useful to compare results against a consumer program that you’ve written. • To test general topic consumption without the need to write any consumer code. 64 | Apache Kafka Guide Kafka Administration Examples of usage: $ kafka-console-consumer --bootstrap-server <broker1>,<broker2>... --topic <topic> --from-beginning <record-earliest-offset> <record-earliest-offset+1> Note the following about the tool: • This tool prints all records and keeps outputting as more records are written to the topic. • If the kafka-console-consumer tool is given no flags, it displays the full help message. • In older versions of Kafka, it may have been necessary to use the --new-consumer flag. As of Apache Kafka version 0.10.2, this is no longer necessary. kafka-console-producer This tool is used to write messages to a topic. It is typically not as useful as the console consumer, but it can be useful when the messages are in a text based format. In general, the usage will be something like: cat file | kafka-console-producer args kafka-consumer-groups The basic usage of the kafka-consumer-groups tool is: kafka-consumer-groups --bootstrap-server broker1,broker2... --describe --group GROUP_ID This tool is primarily useful for debugging consumer offset issues. The output from the tool shows the log and consumer offsets for each partition connected to the consumer group corresponding to GROUP_ID. You can see at a glance which consumers are current with their partition and which ones are behind. From there, you can determine which partitions (and likely the corresponding brokers) are slow. Beyond this debugging usage, there are other more advanced options to this tool: • --execute --reset-offsets SCENARIO_OPTION: Resets the offsets for a consumer group to a particular value based on the SCENARIO_OPTION flag given. Valid flags for SCENARIO_OPTION are: – – – – – – – --to-datetime --by-period --to-earliest --to-latest --shift-by --from-file --to-current You will likely want to set the --topic flag to restrict this change to a specific topic or a specific set of partitions within that topic. This tool can be used to reset all offsets on all topics. This is something you probably won’t ever want to do. It is highly recommended that you use this command carefully. kafka-reassign-partitions This tool provides substantial control over partitions in a Kafka cluster. 
It is mainly used to balance storage loads across brokers through the following reassignment actions: Apache Kafka Guide | 65 Kafka Administration • Change the ordering of the partition assignment list. Used to control leader imbalances between brokers. • Reassign partitions from one broker to another. Used to expand existing clusters. • Reassign partitions between log directories on the same broker. Used to resolve storage load imbalance among available disks in the broker. • Reassign partitions between log directories across multiple brokers. Used to resolve storage load imbalance across multiple brokers. The tool uses two JSON files for input. Both of these are created by the user. The two files are the following: • Topics-to-Move JSON on page 66 • Reassignment Configuration JSON on page 66 Topics-to-Move JSON This JSON file specifies the topics that you want to reassign. This a simple file that tells the kafka-reassign-partitions tool which partitions it should look at when generating a proposal for the reassignment configuration. The user has to create the topics-to-move JSON file from scratch. The format of the file is the following: {"topics": [{"topic": "mytopic1"}, {"topic": "mytopic2"}], "version":1 } Reassignment Configuration JSON This JSON file is a configuration file that contains the parameters used in the reassignment process. This file is created by the user, however, a proposal for its contents is generated by the tool. When the kafka-reasssign-partitions tool is executed with the --generate option, it generates a proposed configuration which can be fine-tuned and saved as a JSON file. The file created this way is the reassignment configuration JSON. To generate a proposal, the tool requires a topics-to-move file as input. The format of the file is the following: {"version":1, "partitions": [{"topic":"mytopic1","partition":3,"replicas":[4,5],"log_dirs":["any","any"]}, {"topic":"mytopic1","partition":1,"replicas":[5,4],"log_dirs":["any","any"]}, {"topic":"mytopic2","partition":2,"replicas":[6,5],"log_dirs":["any","any"]}] } The reassignment configuration contains multiple properties that each control and specify an aspect of the configuration. The Reassignment Configuration Properties table lists each property and its description. Table 5: Reassignment Configuration Properties Property Description topic Specifies the topic. partition Specifies the partition. replicas Specifies the brokers that the selected partition is assigned to. The brokers are listed in order, which means that the first broker in the list is always the leader for that partition. Change the order of brokers to resolve any leader balancing issues among brokers. Change the broker IDs to reassign partitions to different brokers. log_dirs Specifies the log directory of the brokers. The log directories are listed in the same order as the brokers. By 66 | Apache Kafka Guide Kafka Administration Property Description default any is specified as the log directory, which means that the broker is free to choose where it places the replica. By default, the current broker implementation selects the log directory using a round-robin algorithm. An absolute path beginning with a / can be used to explicitly set where to store the partition replica. Notes and Recommendations: • Cloudera recommends that you minimize the volume of replica changes per command instance. Instead of moving 10 replicas with a single command, move two at a time in order to save cluster resources. 
• This tool cannot be used to make an out-of-sync replica into the leader partition. • Use this tool only when all brokers and topics are healthy. • Anticipate system growth. Redistribute the load when the system is at 70% capacity. Waiting until redistribution becomes necessary due to reaching resource limits can make the redistribution process extremely time consuming. Tool Usage To reassign partitions, complete the following steps: 1. Create a topics-to-move JSON file that specifies the topics you want to reassign. Use the following format: {"topics": [{"topic": "mytopic1"}, {"topic": "mytopic2"}], "version":1 } 2. Generate the content for the reassignment configuration JSON with the following command: kafka-reassign-partitions --zookeeper hostname:port --topics-to-move-json-file topics to move.json --broker-list broker 1, broker 2 --generate Running the command lists the distribution of partition replicas on your current brokers followed by a proposed partition reassignment configuration. Example output: Current partition replica assignment {"version":1, "partitions": [{"topic":"mytopic2","partition":1,"replicas":[2,3],"log_dirs":["any","any"]}, {"topic":"mytopic1","partition":0,"replicas":[1,2],"log_dirs":["any","any"]}, {"topic":"mytopic2","partition":0,"replicas":[1,2],"log_dirs":["any","any"]}, {"topic":"mytopic1","partition":2,"replicas":[3,1],"log_dirs":["any","any"]}, {"topic":"mytopic1","partition":1,"replicas":[2,3],"log_dirs":["any","any"]}] } Proposed partition reassignment configuration {"version":1, "partitions": [{"topic":"mytopic1","partition":0,"replicas":[4,5],"log_dirs":["any","any"]}, {"topic":"mytopic1","partition":2,"replicas":[4,5],"log_dirs":["any","any"]}, {"topic":"mytopic2","partition":1,"replicas":[4,5],"log_dirs":["any","any"]}, {"topic":"mytopic1","partition":1,"replicas":[5,4],"log_dirs":["any","any"]}, {"topic":"mytopic2","partition":0,"replicas":[5,4],"log_dirs":["any","any"]}] } In this example, the tool proposed a configuration which reassigns existing partitions on broker 1, 2, and 3 to brokers 4 and 5. Apache Kafka Guide | 67 Kafka Administration 3. 4. 5. 6. Copy and paste the proposed partition reassignment configuration into an empty JSON file. Review, and if required, modify the suggested reassignment configuration. Save the file. Start the redistribution process with the following command: kafka-reassign-partitions --zookeeper hostname:port --reassignment-json-file reassignment configuration.json --bootstrap-server hostname:port --execute Note: Specifying a bootstrap server with the --bootstrap-server option is only required when an absolute log directory path is specified for a replica in the reassignment configuration JSON file. The tool prints a list containing the original replica assignment and a message that reassignment has started. Example output: Current partition replica assignment {"version":1, "partitions": [{"topic":"mytopic2","partition":1,"replicas":[2,3],"log_dirs":["any","any"]}, {"topic":"mytopic1","partition":0,"replicas":[1,2],"log_dirs":["any","any"]}, {"topic":"mytopic2","partition":0,"replicas":[1,2],"log_dirs":["any","any"]}, {"topic":"mytopic1","partition":2,"replicas":[3,1],"log_dirs":["any","any"]}, {"topic":"mytopic1","partition":1,"replicas":[2,3],"log_dirs":["any","any"]}] } Save this to use as the --reassignment-json-file option during rollback Successfully started reassignment of partitions. 7. 
Verify the status of the reassignment with the following command: kafka-reassign-partitions --zookeeper hostname:port --reassignment-json-file reassignment configuration.json --bootstrap-server hostname:port --verify The tool prints the reassignment status of all partitions. Example output: Status of partition reassignment: Reassignment of partition mytopic2-1 Reassignment of partition mytopic1-0 Reassignment of partition mytopic2-0 Reassignment of partition mytopic1-2 Reassignment of partition mytopic1-1 completed completed completed completed completed successfully successfully successfully successfully successfully Examples There are multiple ways to modify the configuration file. The following list of examples shows how a user can modify a proposed configuration and what these changes do. Changes to the original example are marked in bold. Suppose that the kafka-reassign-partitions tool generated the following proposed reassignment configuration: {"version":1, "partitions": [{"topic":"mytopic1","partition":0,"replicas":[1,2],"log_dirs":["any","any"]}]} Reassign partitions between brokers To reassign partitions from one broker to another, change the broker ID specified in replicas. For example: {"topic":"mytopic1","partition":0,"replicas":[5,2],"log_dirs":["any","any"]} This reassignment configuration moves partition mytopic1-0 from broker 1 to broker 5. 68 | Apache Kafka Guide Kafka Administration Reassign partitions to another log directory on the same broker To reassign partitions between log directories on the same broker, change the appropriate any entry to an absolute path. For example: {"topic":"mytopic1","partition":0,"replicas":[1,2],"log_dirs":["/log/directory1","any"]} This reassignment configuration moves partition mytopic1-0 to the /log/directory1 log directory. Reassign partitions between log directories across multiple brokers To reassign partitions between log directories across multiple brokers, change the broker ID specified in replicas and the appropriate any entry to an absolute path. For example: {"topic":"mytopic1","partition":0,"replicas":[5,2],"log_dirs":["/log/directory1","any"]} This reassignment configuration moves partition mytopic1-0 to /log/directory1 on broker 5. Change partition assignment order (elect a new leader) To change the ordering of the partition assignment list, change the order of the brokers in replicas. For example: {"topic":"mytopic1","partition":0,"replicas":[2,1],"log_dirs":["any","any"]} This reassignment configuration elects broker 2 as the new leader. kafka-log-dirs The kafka-log-dirs tool allows user to query a list of replicas per log directory on a broker. The tool provides information that is required for optimizing replica assignment across brokers. On successful execution, the tool prints a list of partitions per log directory for the specified topics and brokers. The list contains information on topic partition, size, offset lag, and reassignment state. Example output: { "brokers": [ { "broker": 86, "logDirs": [ { "error": null, "logDir": "/var/local/kafka/data", "partitions": [ { "isFuture": false, "offsetLag": 0, "partition": "mytopic1-2", "size": 0 } ] } ] }, ... ], "version": 1 } The Contents of the kafka-log-dirs Output table gives an overview of the information provided by the kafka-log-dirs tool. Table 6: Contents of the kafka-log-dirs Output Property Description broker Displays the ID of the broker. 
Apache Kafka Guide | 69 Kafka Administration Property Description error Indicates if there is a problem with the disk that hosts the topic partition. If an error is detected, org.apache.kafka.common.errors.KafkaStorageException is displayed. If no error is detected, the value is null. logDir Specifies the location of the log directory. Returns an absolute path. isfuture The reassignment state of the partition. This property shows whether there is currently replica movement underway between the log directories. offsetLag Displays the offset lag of the partition. partition Displays the name of the partition. size Displays the size of the partition in bytes. Tool Usage To retrieve replica assignment information, run the following command: kafka-log-dirs --describe --bootstrap-server hostname:port --broker-list broker 1, broker 2 --topic-list topic 1, topic 2 Important: On secure clusters the admin client config property file has to be specified with the --command-config option. Otherwise, the tool fails to execute. If no topic is specified with the --topic-list option, then all topics are queried. If no broker is specified with the --broker-list option, then all brokers are queried. If a log directory is offline, the log directory will be marked offline in the script output. Error example: "error":"org.apache.kafka.common.errors.KafkaStorageException" kafka-*-perf-test The kafka-*-perf-test tool can be used in several ways. In general, it is expected that these tools should be used on a test or development cluster. • Measuring read and/or write throughput. • Stress testing the cluster based on specific parameters (such as message size). • Load testing for the purpose of evaluating specific metrics or determining the impact of cluster configuration changes. The kafka-producer-perf-test script can either create a randomly generated byte record: kafka-producer-perf-test --topic TOPIC --record-size SIZE_IN_BYTES or randomly read from a set of provided records: kafka-producer-perf-test --topic TOPIC --payload-delimiter DELIMITER --payload-file INPUT_FILE where the INPUT_FILE is a concatenated set of pre-generated messages separated by DELIMITER. This script keeps producing messages or limited based on the --num-records flag. 70 | Apache Kafka Guide Kafka Administration The kafka-consumer-perf-test is: kafka-consumer-perf-test --broker-list host1:port1,host2:port2,... --zookeeper zk1:port1,zk2:port2,... --topic TOPIC The flags of most interest for this command are: • --group gid: If you run more than one instance of this test, you will want to set different ids for each instance. • --num-fetch-threads: Defaults to 1. Increase if higher throughput testing is needed. • --from-latest: To start consuming from the latest offset. May be needed for certain types of testing. Enabling DEBUG or TRACE in command line scripts In some cases, you may find it useful to produce extra debugging output from the Kafka client API. The DEBUG (shown below) or TRACE levels can be set by replacing the setting in the log4j properties file as follows: cp /etc/kafka/conf/tools-log4j.properties /var/tmp sed -i -e 's/WARN/DEBUG/g' /var/tmp/tools-log4j.properties export KAFKA_OPTS="-Dlog4j.configuration=file:/var/tmp/tools-log4j.properties" Understanding the kafka-run-class Bash Script Almost all the provided Kafka tools eventually call the kafka-run-class script. This script is generally not called directly. 
However, if you are proficient with bash and want to understand certain features available in all Kafka scripts as well as some potential debugging scenarios, familiarity with the kafka-run-class script can prove highly beneficial. For example, there are some useful environment variables that affect all the command line scripts: • KAFKA_DEBUG allows a Java debugger to attach to the JVM launched by the particular script. Setting KAFKA_DEBUG also allows some further debugging customization: – JAVA_DEBUG_PORT sets the JVM debugging port. – JAVA_DEBUG_OPTS can be used to override the default debugging arguments being passed to the JVM. • KAFKA_HEAP_OPTS can be used to pass memory setting arguments to the JVM. • KAFKA_JVM_PERFORMANCE_OPTS can be used to pass garbage collection flags to the JVM. JBOD As of CDH 6.1.0, Kafka clusters with nodes using JBOD configurations are supported by Cloudera. JBOD refers to a system configuration where disks are used independently rather than organizing them into redundant arrays (RAID). Using RAID usually results in more reliable hard disk configurations even if the individual disks are not reliable. RAID setups like these are common in large scale big data environments built on top of commodity hardware. RAID enabled configurations are more expensive and more complicated to set up. In a large number of environments, JBOD configurations are preferred for the following reasons: • Reduced storage cost: RAID-10 is recommended to protect against disk failures. However, scaling RAID-10 configurations can become excessively expensive. Storing the data redundantly on each node means that storage space requirements have to be multiplied because the data is also replicated across nodes. • Improved performance: Just like HDFS, the slowest disk in RAID-10 configuration limits overall throughput. Writes need to go through a RAID controller. On the other hand, when using JBOD, IO performance is increased as a result of isolated writes across disks without a controller. JBOD Setup and Migration Consider the following before using JBOD support in Kafka: Apache Kafka Guide | 71 Kafka Administration • Manual operation and administration: Monitoring offline directories and JBOD related metrics is done through Cloudera Manager. However, identifying failed disks and rebalancing partitions between disks is done manually. • Manual load balancing between disks: Unlike with RAID-10, JBOD does not automatically distribute data across disks. The process is fully manual. To provide robust JBOD support in Kafka, changes in the Kafka protocol have been made. When performing an upgrade to a new version of Kafka, make sure that you follow the recommended rolling upgrade process. For more information, see Upgrading the CDH Cluster. For more information regarding the JBOD related Kafka protocol changes, see KIP-112 and KIP-113. Setup To set up JBOD in your Kafka environment, perform the following steps: 1. Mount the required number of disks on your system. 2. In Cloudera Manager, set up log directories for all Kafka brokers. a. Go to the Kafka service, select Instances and select the broker. b. Go to Configuration and find the Data Directories property. c. Modify the path of the log directories so that they correspond with the newly mounted disks. Note: Depending on your, setup you may need to add or remove multiple data directories. d. Enter a Reason for change, and then click Save Changes to commit the changes. 3. Go to the Kafka service and select Configuration. 4. 
Find and configure the following properties depending on your system and use case. • • • • Number of I/O Threads Number of Network Threads Number of Replica Fetchers Minimum Number of Replicas in ISR 5. Set replication factor to at least 3. Important: If you set replication factor to less than 3, your data will be at risk. In addition, in case of a disk failure, disk maintenance cannot be carried out without system downtime. 6. Restart the service. a. b. c. d. Return to the Home page by clicking the Cloudera Manager logo. Go to the Kafka service and select Actions > Rolling Restart. Check the Restart roles with stale configurations only checkbox and click Rolling restart. Click Close when the restart has finished. Migration Migrating data from one disk to another is achieved with the kafka-reassign-partitions tool. The following instructions focus on migrating existing Kafka partitions to JBOD configured disks. For a full tool description, see kafka-reassign-partitions on page 65. Note: Cloudera recommends that you minimize the volume of replica changes per command instance. Instead of moving 10 replicas with a single command, move two at a time in order to save cluster resources. 72 | Apache Kafka Guide Kafka Administration Prerequisites • • • • Set up JBOD in your Kafka environment. For more information, see Setup on page 72. Collect the log directory paths on the JBOD disks where you want to migrate existing data. Collect the broker IDs of the brokers you want to migrate data to. Collect the name of the topics you want to migrate partitions from. Steps Note: Output examples in these instructions are cleaned and formatted to make them easily readable. To migrate data to JBOD configured disks, perform the following steps: 1. Create a topics-to-move JSON file that specifies the topics you want to reassign. Use the following format: {"topics": [{"topic": "mytopic1"}, {"topic": "mytopic2"}], "version":1 } 2. Generate the content for the reassignment configuration JSON with the following command: kafka-reassign-partitions --zookeeper hostname:port --topics-to-move-json-file topics to move.json --broker-list broker 1, broker 2 --generate Running the command lists the distribution of partition replicas on your current brokers followed by a proposed partition reassignment configuration. Example output: Current partition replica assignment {"version":1, "partitions": [{"topic":"mytopic2","partition":1,"replicas":[2,3],"log_dirs":["any","any"]}, {"topic":"mytopic1","partition":0,"replicas":[1,2],"log_dirs":["any","any"]}, {"topic":"mytopic2","partition":0,"replicas":[1,2],"log_dirs":["any","any"]}, {"topic":"mytopic1","partition":2,"replicas":[3,1],"log_dirs":["any","any"]}, {"topic":"mytopic1","partition":1,"replicas":[2,3],"log_dirs":["any","any"]}] } Proposed partition reassignment configuration {"version":1, "partitions": [{"topic":"mytopic1","partition":0,"replicas":[4,5],"log_dirs":["any","any"]}, {"topic":"mytopic1","partition":2,"replicas":[4,5],"log_dirs":["any","any"]}, {"topic":"mytopic2","partition":1,"replicas":[4,5],"log_dirs":["any","any"]}, {"topic":"mytopic1","partition":1,"replicas":[5,4],"log_dirs":["any","any"]}, {"topic":"mytopic2","partition":0,"replicas":[5,4],"log_dirs":["any","any"]}] } In this example, the tool proposed a configuration which reassigns existing partitions on broker 1, 2, and 3 to brokers 4 and 5. 3. Copy and paste the proposed partition reassignment configuration into an empty JSON file. 4. Modify the suggested reassignment configuration. 
When migrating data you have two choices. You can move partitions to a different log directory on the same broker, or move it to a different log directory on another broker. Apache Kafka Guide | 73 Kafka Administration a. To reassign partitions between log directories on the same broker, change the appropriate any entry to an absolute path. For example: {"topic":"mytopic1","partition":0,"replicas":[4,5],"log_dirs":["/JBOD-disk/directory1","any"]} b. To reassign partitions between log directories across different brokers, change the broker ID specified in replicas and the appropriate any entry to an absolute path. For example: {"topic":"mytopic1","partition":0,"replicas":[6,5],"log_dirs":["/JBOD-disk/directory1","any"]} 5. Save the file. 6. Start the redistribution process with the following command: kafka-reassign-partitions --zookeeper hostname:port --reassignment-json-file reassignment configuration.json --bootstrap-server hostname:port --execute Important: The bootstrap server has to be specified with the --bootstrap-server option if an absolute log directory path is specified for a replica in the reassignment configuration JSON file. The tool prints a list containing the original replica assignment and a message that reassignment has started. Example output: Current partition replica assignment {"version":1, "partitions": [{"topic":"mytopic2","partition":1,"replicas":[2,3],"log_dirs":["any","any"]}, {"topic":"mytopic1","partition":0,"replicas":[1,2],"log_dirs":["any","any"]}, {"topic":"mytopic2","partition":0,"replicas":[1,2],"log_dirs":["any","any"]}, {"topic":"mytopic1","partition":2,"replicas":[3,1],"log_dirs":["any","any"]}, {"topic":"mytopic1","partition":1,"replicas":[2,3],"log_dirs":["any","any"]}] } Save this to use as the --reassignment-json-file option during rollback Successfully started reassignment of partitions. 7. Verify the status of the reassignment with the following command: kafka-reassign-partitions --zookeeper hostname:port --reassignment-json-file reassignment configuration.json --bootstrap-server hostname:port --verify The tool prints the reassignment status of all partitions. Example output: Status of partition reassignment: Reassignment of partition mytopic2-1 Reassignment of partition mytopic1-0 Reassignment of partition mytopic2-0 Reassignment of partition mytopic1-2 Reassignment of partition mytopic1-1 completed completed completed completed completed successfully successfully successfully successfully successfully JBOD Operational Procedures Monitoring Cloudera recommends that administrators continuously monitor the following on a cluster: 74 | Apache Kafka Guide Kafka Administration Replication Status Monitor replication status using Cloudera Manager Health Tests. Cloudera Manager automatically and continuously monitors both the OfflineLogDirectoryCount and OfflineReplicaCount metrics. Alters are raised when failures are detected. For more information, see Cloudera Manager Health Tests. Disk Capacity Monitor free space on mounted disks and open file descriptors. For more information, see Useful Shell Command Reference on page 161. Reassign partitions or move log files around if necessary. For more information, see kafka-reassign-partitions on page 65. Handling Disk Failures Cloudera Manager has built in monitoring functionalities that automatically trigger alerts when disk failures are detected. When a log directory fails, Kafka also detects the failure and takes the partitions stored in that directory offline. 
Important: If there are no healthy log directories present in the system, the broker stops working.
The cause of disk failures can be analyzed with the help of the kafka-log-dirs on page 69 tool, or by reviewing the error messages of KafkaStorageException entries in the Kafka broker log file. To view the Kafka broker log file, complete the following steps:
1. In Cloudera Manager, go to the Kafka service, select Instances, and select the broker.
2. Go to Log Files > Role Log File.
In case of a disk failure, a Kafka administrator can carry out either of the following actions. The action taken depends on the failure type and system environment:
• Replace the faulty disk with a new one.
• Remove the disk and redistribute data across the remaining disks to restore the desired replication factor.
Note: Disk replacement and disk removal both require stopping the broker. Therefore, Cloudera recommends that you perform these actions during a maintenance window.
Disk Replacement
To replace a disk, complete the following steps:
1. Stop the broker that has a faulty disk.
a. In Cloudera Manager, go to the Kafka service, select Instances, and select the broker.
b. Go to Actions > Gracefully stop this Kafka Broker.
2. Replace the disk.
3. Mount the disk.
4. Set up the directory structure on the new disk the same way as it was set up on the previous disk.
Note: You can find the directory paths for the old disk in the Data Directories property of the broker.
5. Start the broker.
a. In Cloudera Manager, go to the Kafka service, select Instances, and select the broker.
b. Go to Actions > Start this Kafka Broker.
The Kafka broker re-creates topic partitions in the same directory by replicating data from other brokers.
Disk Removal
To remove a disk from the configuration, complete the following steps:
1. Stop the broker that has a faulty disk.
a. In Cloudera Manager, go to the Kafka service, select Instances, and select the broker.
b. Go to Actions > Gracefully stop this Kafka Broker.
2. Remove the log directories on the faulty disk from the broker.
a. Go to Configuration and find the Data Directories property.
b. Remove the affected log directories with the Remove button.
c. Enter a Reason for change, and then click Save Changes to commit the changes.
3. Start the broker.
a. In Cloudera Manager, go to the Kafka service, select Instances, and select the broker.
b. Go to Actions > Start this Kafka Broker.
The Kafka broker redistributes data across the cluster.
Reassigning Replicas Between Log Directories
Reassigning replicas between log directories can prove useful when you have multiple disks available, but one or more of them is nearing capacity. Moving a replica from one disk to another ensures that the service does not go down because a disk reaches capacity. To balance storage loads, the Kafka administrator has to continuously monitor the system and reassign replicas between log directories on the same broker or across different brokers. These actions can be carried out with the kafka-reassign-partitions tool. For more information on tool usage, see the documentation for the kafka-reassign-partitions on page 65 tool.
Retrieving Log Directory Replica Assignment Information
To optimize replica assignment across log directories, the list of partitions per log directory and the size of each partition are required. This information can be exposed with the kafka-log-dirs tool. For more information on tool usage, see the documentation for the kafka-log-dirs on page 69 tool.
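For example, a query along the following lines asks the brokers for log directory information about a given topic; the host name, port, and topic name are placeholders that you adjust to your environment:
kafka-log-dirs --describe --bootstrap-server hostname:9092 --topic-list mytopic1
The tool returns a JSON description that lists, for each broker, its log directories, the partitions stored in each directory, and the size of each partition, which is the input needed for planning a reassignment.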
Kafka Performance Tuning
Performance tuning involves two important metrics:
• Latency measures how long it takes to process one event.
• Throughput measures how many events arrive within a specific amount of time.
Most systems are optimized for either latency or throughput. Kafka is balanced for both. A well-tuned Kafka system has just enough brokers to handle topic throughput, given the latency required to process information as it is received.
Tuning your producers, brokers, and consumers to send, process, and receive the largest possible batches within a manageable amount of time results in the best balance of latency and throughput for your Kafka cluster.
The following sections introduce the concepts you'll need to be able to balance your Kafka workload and then provide practical tuning configuration to address specific circumstances. For a quick video introduction to tuning Kafka, see Tuning Your Apache Kafka Cluster.
There are a few concepts described here that will help you focus your tuning efforts. Additional topics in this section provide practical tuning guidelines:
Tuning Brokers
Topics are divided into partitions. Each partition has a leader. A topic that is properly configured for reliability has, for each partition, a leader replica and two or more follower replicas. When the leaders are not balanced properly, one broker might be overworked compared to others. Depending on your system and how critical your data is, you want to be sure that you have sufficient replication to preserve your data. For each topic, Cloudera recommends starting with one partition per physical storage disk and one consumer per partition.
Tuning Producers
Kafka uses an asynchronous publish/subscribe model. When your producer calls send(), the result returned is a future. The future provides methods to let you check the status of the information in process. When the batch is ready, the producer sends it to the broker. The Kafka broker waits for an event, receives the result, and then responds that the transaction is complete.
If you do not use a future, you could get just one record, wait for the result, and then send a response. Latency is very low, but so is throughput. If each transaction takes 5 ms, throughput is 200 events per second, far slower than the expected 100,000 events per second.
When you use Producer.send(), you fill up buffers on the producer. When a buffer is full, the producer sends the buffer to the Kafka broker and begins to refill the buffer.
Two parameters are particularly important for latency and throughput: batch size and linger time.
Batch Size
batch.size measures batch size in total bytes instead of the number of messages. It controls how many bytes of data to collect before sending messages to the Kafka broker. Set this as high as possible, without exceeding available memory. The default value is 16384.
If you increase the size of your buffer, it might never get full. The producer sends the information eventually, based on other triggers, such as linger time in milliseconds. Although setting the batch size too high can waste memory, it does not impact latency.
If your producer is sending all the time, you are probably getting the best throughput possible. If the producer is often idle, you might not be writing enough data to warrant the current allocation of resources.
Linger Time
linger.ms sets the maximum time to buffer data in asynchronous mode.
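Both parameters are set in the producer configuration. A minimal sketch of the relevant entries in a producer properties file is shown below; the values are purely illustrative, not Cloudera recommendations:
# collect up to 64 KB per partition before sending (the default is 16384)
batch.size=65536
# wait up to 10 ms for a batch to fill before sending
linger.ms=10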
For example, a linger.ms setting of 100 means that the producer batches up to 100 ms of messages and sends them at once. This improves throughput, but the buffering adds message delivery latency.
By default, the producer does not wait. It sends the buffer any time data is available. Instead of sending immediately, you can set linger.ms to 5 and send more messages in one batch. This would reduce the number of requests sent, but would add up to 5 milliseconds of latency to records sent, even if the load on the system does not warrant the delay.
The farther away the broker is from the producer, the more overhead is required to send messages. Increase linger.ms for higher latency and higher throughput in your producer.
Tuning Consumers
Consumers can create throughput issues on the other side of the pipeline. The maximum number of consumers in a consumer group for a topic is equal to the number of partitions. You need enough partitions to handle all the consumers needed to keep up with the producers. Consumers in the same consumer group split the partitions among them. Adding more consumers to a group can enhance performance (up to the number of partitions). Adding more consumer groups does not affect performance.
Mirror Maker Performance
Kafka Mirror Maker is a tool to replicate topics between data centers. It is best to run Mirror Maker at the destination data center. Consuming messages from a distant cluster and writing them into a local cluster tends to be safer than producing over a long-distance network. Deploying Mirror Maker in the source data center and producing remotely carries a higher risk of losing data. However, if you need this setup, make sure that you configure acks=all with an appropriate number of retries and minimum ISR.
• Encrypting data in transit with SSL has an impact on the performance of Kafka brokers.
• To reduce lag between clusters, you can improve performance by deploying multiple Mirror Maker instances using the same consumer group ID.
• Measure CPU utilization.
• Consider using compression for consumers and producers when mirroring topics between data centers, as bandwidth can be a bottleneck.
• Monitor lag and metrics of Mirror Maker. To properly size Mirror Maker, take the expected throughput and the maximum allowed lag between data centers into account.
• The num.streams parameter controls the number of consumer threads in Mirror Maker.
• kafka-producer-perf-test can be used to generate load on the source cluster (see the example command after this list). You can test and measure the performance of Mirror Maker with different num.streams values (start from 1 and increase it gradually).
Good performance can be achieved with proper consumer and producer settings and properly tuned OS properties, such as networking and I/O related kernel settings.
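For example, a load-generation run against the source cluster could look like the following; the topic name, record counts, and bootstrap server are placeholders, and any of the producer settings discussed in this chapter (such as acks, compression.type, batch.size, or linger.ms) can be passed through --producer-props:
kafka-producer-perf-test --topic mytopic1 --num-records 1000000 --record-size 1000 --throughput -1 --producer-props bootstrap.servers=source-hostname:9092 acks=all compression.type=lz4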
Kafka Tuning: Handling Large Messages
Before configuring Kafka to handle large messages, first consider the following options to reduce message size:
• The Kafka producer can compress messages. For example, if the original message is a text-based format (such as XML), in most cases the compressed message will be sufficiently small. Use the compression.type producer configuration parameter to enable compression. gzip, lz4, and Snappy are supported.
• If shared storage (such as NAS, HDFS, or S3) is available, consider placing large files on the shared storage and using Kafka to send a message with the file location. In many cases, this can be much faster than using Kafka to send the large file itself.
• Split large messages into 1 KB segments with the producing client, using partition keys to ensure that all segments are sent to the same Kafka partition in the correct order. The consuming client can then reconstruct the original large message.
If you still need to send large messages with Kafka, modify the configuration parameters presented in the following sections to match your requirements.
Table 7: Broker Configuration Properties
• message.max.bytes (default: 1000000, 1 MB): Maximum message size the broker accepts.
• log.segment.bytes (default: 1073741824, 1 GiB): Size of a Kafka data file. Must be larger than any single message.
• replica.fetch.max.bytes (default: 1048576, 1 MiB): Maximum message size a broker can replicate. Must be larger than message.max.bytes, or a broker can accept messages it cannot replicate, potentially resulting in data loss.
Table 8: Consumer Configuration Properties
• max.partition.fetch.bytes (default: 1048576, 1 MiB): The maximum amount of data per partition the server will return.
• fetch.max.bytes (default: 52428800, 50 MiB): The maximum amount of data the server should return for a fetch request.
Note: The consumer is able to consume a message batch that is larger than the default value of the max.partition.fetch.bytes or fetch.max.bytes property. However, the batch will be sent alone, which can cause performance degradation.
Kafka Cluster Sizing
Cluster Sizing - Network and Disk Message Throughput
There are many variables that go into determining the correct hardware footprint for a Kafka cluster. The most accurate way to model your use case is to simulate the load you expect on your own hardware. You can do this using the load generation tools that ship with Kafka, kafka-producer-perf-test and kafka-consumer-perf-test. For more information, see Kafka Administration Using Command Line Tools on page 62.
However, if you want to size a cluster without simulation, a very simple rule could be to size the cluster based on the amount of disk space required (which can be computed from the estimated rate at which you get data multiplied by the required data retention period). A slightly more sophisticated estimation can be done based on network and disk throughput requirements. To make this estimation, let's plan for a use case with the following characteristics:
• W - MB/sec of data that will be written
• R - Replication factor
• C - Number of consumer groups, that is, the number of readers for each write
Kafka is mostly limited by disk and network throughput. The volume of writing expected is W * R (that is, each replica writes each message). Data is read by replicas as part of the internal cluster replication and also by consumers. Because every replica except the leader reads each write, the read volume of replication is (R-1) * W. In addition, each of the C consumers reads each write, so there will be a read volume of C * W. This gives the following:
• Writes: W * R
• Reads: (R + C - 1) * W
However, note that reads may actually be cached, in which case no actual disk I/O happens. We can model the effect of caching fairly easily. If the cluster has M MB of memory, then a write rate of W MB/second allows M/(W * R) seconds of writes to be cached. So a server with 32 GB of memory taking writes at 50 MB/second serves roughly the last 10 minutes of data from cache. Readers may fall out of cache for a variety of reasons: a slow consumer or a failed server that recovers and needs to catch up.
An easy way to model this is to assume a number of lagging readers that you budget for. To model this, let's call the number of lagging readers L. A very pessimistic assumption would be that L = R + C - 1, that is, that all consumers are lagging all the time. A more realistic assumption might be to assume no more than two consumers are lagging at any given time.
Based on this, we can calculate our cluster-wide I/O requirements:
• Disk Throughput (Read + Write): W * R + L * W
• Network Read Throughput: (R + C - 1) * W
• Network Write Throughput: W * R
A single server provides a given disk throughput as well as network throughput. For example, if you have a 1 Gigabit Ethernet card with full duplex, then that would give 125 MB/sec read and 125 MB/sec write; likewise, six 7200 RPM SATA drives might give roughly 300 MB/sec of combined read and write throughput.
Once we know the total requirements, as well as what is provided by one machine, you can divide to get the total number of machines needed. This gives a machine count running at maximum capacity, assuming no overhead for network protocols, as well as perfect balance of data and load. Since there is protocol overhead as well as imbalance, you want to have at least 2x this ideal capacity to ensure sufficient capacity.
Choosing the Number of Partitions for a Topic
Choosing the proper number of partitions for a topic is key to achieving a high degree of parallelism with respect to writes and reads, and to distributing load. Evenly distributed load over partitions is a key factor for good throughput (avoid hot spots). Making a good decision requires estimation based on the desired throughput of producers and consumers per partition.
For example, if you want to be able to read 1 GB/sec, but your consumer is only able to process 50 MB/sec, then you need at least 20 partitions and 20 consumers in the consumer group. Similarly, if you want to achieve the same for producers, and 1 producer can only write at 100 MB/sec, you need 10 partitions. In this case, if you have 20 partitions, you can maintain 1 GB/sec for producing and consuming messages. You should adjust the exact number of partitions to the number of consumers or producers, so that each consumer and producer achieves their target throughput.
So a simple formula could be:
#Partitions = max(NP, NC)
where:
• NP is the number of required producers determined by calculating: TT/TP
• NC is the number of required consumers determined by calculating: TT/TC
• TT is the total expected throughput for our system
• TP is the max throughput of a single producer to a single partition
• TC is the max throughput of a single consumer from a single partition
This calculation gives you a rough indication of the number of partitions. It's a good place to start. Keep in mind the following considerations for improving the number of partitions after you have your system in place:
• The number of partitions can be specified at topic creation time or later.
• Increasing the number of partitions also affects the number of open file descriptors. So make sure you set the file descriptor limit properly.
• Reassigning partitions can be very expensive, and therefore it's better to over-provision than under-provision.
• Changing the number of partitions that are based on keys is challenging and involves manual copying (see Kafka Administration on page 59).
• Reducing the number of partitions is not currently supported. Instead, create a new topic with a lower number of partitions and copy over existing data.
• Metadata about partitions is stored in ZooKeeper in the form of znodes. Having a large number of partitions has effects on ZooKeeper and on client resources:
– Unneeded partitions put extra pressure on ZooKeeper (more network requests), and might introduce delay in controller and/or partition leader election if a broker goes down.
– Producer and consumer clients need more memory, because they need to keep track of more partitions and also buffer data for all partitions.
• As a guideline for optimal performance, you should not have more than 4000 partitions per broker and not more than 200,000 partitions in a cluster.
Make sure consumers don't lag behind producers by monitoring consumer lag. To check consumers' position in a consumer group (that is, how far behind the end of the log they are), use the following command:
$ kafka-consumer-groups --bootstrap-server BROKER_ADDRESS --describe --group CONSUMER_GROUP --new-consumer
Kafka Performance: Broker Configuration
JVM and Garbage Collection
Garbage collection has a huge impact on the performance of JVM-based applications. It is recommended to use the Garbage-First (G1) garbage collector for Kafka brokers. In Cloudera Manager, specify the following under Additional Broker Java Options in the Kafka service configuration:
-server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+DisableExplicitGC -Djava.awt.headless=true -Djava.net.preferIPv4Stack=true
Cloudera recommends setting 4-8 GB of JVM heap size for the brokers, depending on your use case. As Kafka's performance depends heavily on the operating system's page cache, it is not recommended to collocate Kafka brokers with other memory-hungry applications.
• Large messages can cause longer garbage collection (GC) pauses as brokers allocate large chunks. Monitor the GC log and the server log. Add this to Broker Java Options:
-XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:</path/to/file.txt>
• If long GC pauses cause Kafka to abandon the ZooKeeper session, you may need to configure longer timeout values. See Kafka-ZooKeeper Performance Tuning on page 85 for details.
Network and I/O Threads
Kafka brokers use network threads to handle client requests. Incoming requests (such as produce and fetch requests) are placed into a request queue, from which I/O threads pick them up and process them. After a request is processed, the response is placed into an internal response queue, from which a network thread picks it up and sends the response back to the client.
• num.network.threads is an important cluster-wide setting that determines the number of threads used for handling network requests (that is, receiving requests and sending responses). Set this value mainly based on the number of producers, consumers, and replica fetchers.
• queued.max.requests controls how many requests are allowed in the request queue before network threads are blocked.
• num.io.threads specifies the number of threads that a broker uses for processing requests from the request queue (which might include disk I/O).
ISR Management
An in-sync replica (ISR) set for a topic partition contains all follower replicas that are caught up with the leader partition and are situated on a broker that is alive.
• If a replica lags too far behind the partition leader, it is removed from the ISR set. The definition of what is too far is controlled by the configuration setting replica.lag.time.max.ms.
If a follower hasn't sent any fetch requests or hasn't consumed up to the leader's log end offset for at least this time, the leader removes the follower from the ISR set.
• num.replica.fetchers is a cluster-wide configuration setting that controls how many fetcher threads are in a broker. These threads are responsible for replicating messages from a source broker (that is, the broker where the partition leader resides). Increasing this value results in higher I/O parallelism and fetcher throughput. Of course, there is a trade-off: brokers use more CPU and network.
• replica.fetch.min.bytes controls the minimum number of bytes to fetch from a follower replica. If there are not enough bytes, the fetcher waits up to replica.fetch.wait.max.ms.
• replica.fetch.wait.max.ms controls how long to sleep before checking for new messages from a fetcher replica. This value should be less than replica.lag.time.max.ms, otherwise the replica is kicked out of the ISR set.
• To check the ISR set for topic partitions, run the following command:
kafka-topics --zookeeper ${ZOOKEEPER_HOSTNAME}:2181/kafka --describe --topic ${TOPIC}
• If a partition leader dies, a new leader is selected from the ISR set. There will be no data loss. If there is no ISR, unclean leader election can be used with the risk of data loss.
• Unclean leader election occurs if unclean.leader.election.enable is set to true. By default, this is set to false.
Log Cleaner
As discussed in Record Management on page 59, the log cleaner implements log compaction. The following cluster-wide configuration settings can be used to fine-tune log compaction:
• log.cleaner.threads controls how many background threads are responsible for log compaction. Increasing this value improves the performance of log compaction at the cost of increased I/O activity.
• log.cleaner.io.max.bytes.per.second throttles the log cleaner's I/O activity so that the sum of its read and write activity is less than this value on average.
• log.cleaner.dedupe.buffer.size specifies the memory used for log compaction across all cleaner threads.
• log.cleaner.io.buffer.size controls the total memory used for log cleaner I/O buffers across all cleaner threads.
• log.cleaner.min.compaction.lag.ms controls how long messages are left uncompacted.
• log.cleaner.io.buffer.load.factor controls the log cleaner's load factor for the dedupe buffer. Increasing this value allows the system to clean more logs at once but increases hash collisions.
• log.cleaner.backoff.ms controls how long to wait until the next check if there is no log to compact.
Kafka Performance: System-Level Broker Tuning
Operating system related kernel parameters affect the overall performance of Kafka. These parameters can be configured via sysctl at runtime. To make kernel configuration changes persistent (that is, use adjusted parameters after a reboot), edit /etc/sysctl.conf. The following sections describe some important kernel settings.
File Descriptor Limits
As Kafka works with many log segment files and network connections, the Maximum Process File Descriptors setting may need to be increased in some cases in production deployments, if a broker hosts many partitions. For example, a Kafka broker needs at least the following number of file descriptors just to track log segment files:
(number of partitions)*(partition size / segment size)
The broker needs additional file descriptors to communicate via network sockets with external parties (such as clients, other brokers, ZooKeeper, Sentry, and Kerberos).
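To illustrate the formula with purely hypothetical numbers: a broker hosting 1000 partitions, each storing roughly 10 GiB of data in 1 GiB segments, needs at least 1000 * (10 GiB / 1 GiB) = 10,000 file descriptors for segment files alone, before any network connections are counted.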
The Maximum Process File Descriptors setting can be monitored in Cloudera Manager and increased if usage requires a larger value than the default ulimit (often 64K). It should be reviewed for use case suitability.
• To review the FD limit currently set for a running Kafka broker, run cat /proc/KAFKA_BROKER_PID/limits and look for Max open files.
• To see open file descriptors, run:
lsof -p KAFKA_BROKER_PID
Filesystems
Linux records when a file was created (ctime), modified (mtime), and accessed (atime). The value noatime is a special mount option for filesystems (such as EXT4) in Linux that tells the kernel not to update inode information every time a file is accessed (that is, when it was last read). Using this option can result in a write performance gain. Kafka does not rely on atime. The value relatime is another mount option that optimizes how atime is persisted. Access time is only updated if the previous atime was earlier than the current modified time. To view mount options, run the mount -l or cat /etc/fstab command.
Virtual Memory Handling
Kafka uses the system page cache extensively for producing and consuming messages. The Linux kernel parameter, vm.swappiness, is a value from 0-100 that controls the swapping of application data (as anonymous pages) from physical memory to virtual memory on disk. The higher the value, the more aggressively inactive processes are swapped out from physical memory. The lower the value, the less they are swapped, forcing filesystem buffers to be emptied. It is an important kernel parameter for Kafka because the more memory allocated to the swap space, the less memory can be allocated to the page cache. Cloudera recommends setting vm.swappiness to 1.
• To check memory swapped to disk, run vmstat and look for the swap columns.
Kafka relies heavily on disk I/O performance. vm.dirty_ratio and vm.dirty_background_ratio are kernel parameters that control how often dirty pages are flushed to disk. A higher vm.dirty_ratio results in less frequent flushes to disk.
• To display the actual number of dirty pages in the system, run egrep "dirty|writeback" /proc/vmstat
Networking Parameters
Kafka is designed to handle a huge amount of network traffic. By default, the Linux kernel is not tuned for this scenario. The following kernel settings may need to be tuned based on your use case or specific Kafka workload:
• net.core.wmem_default: Default send socket buffer size.
• net.core.rmem_default: Default receive socket buffer size.
• net.core.wmem_max: Maximum send socket buffer size.
• net.core.rmem_max: Maximum receive socket buffer size.
• net.ipv4.tcp_wmem: Memory reserved for TCP send buffers.
• net.ipv4.tcp_rmem: Memory reserved for TCP receive buffers.
• net.ipv4.tcp_window_scaling: TCP Window Scaling option.
• net.ipv4.tcp_max_syn_backlog: Maximum number of outstanding TCP SYN requests (connection requests).
• net.core.netdev_max_backlog: Maximum number of queued packets on the kernel input side (useful to deal with spikes of network requests).
To specify the parameters, you can use the Cloudera Enterprise Reference Architecture as a guideline.
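As an illustration of the general pattern (the value shown is the vm.swappiness recommendation mentioned above; the same pattern applies to the other kernel parameters in this section), a setting can be changed at runtime and made persistent as follows, running as root:
sysctl -w vm.swappiness=1
echo "vm.swappiness=1" >> /etc/sysctl.conf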
Configuring JMX Ephemeral Ports
Kafka uses two high-numbered ephemeral ports for JMX. These ports are listed when you view netstat -anp information for the Kafka broker process.
• You can change the number for the first port by adding a command similar to the following to the Additional Broker Java Options (broker_java_opts) field in Cloudera Manager:
-Dcom.sun.management.jmxremote.rmi.port=port
• The JMX_PORT configuration maps to com.sun.management.jmxremote.port by default. To access JMX via JConsole, run jconsole ${BROKER_HOST}:9393
• The second ephemeral port used for JMX communication is implemented for the JRMP protocol and cannot be changed.
Kafka-ZooKeeper Performance Tuning
Kafka uses ZooKeeper to store metadata information about topics, partitions, and brokers, and for system coordination (such as membership statuses). Unavailability or slowness of ZooKeeper makes the Kafka cluster unstable, and Kafka brokers do not automatically recover from it. Cloudera recommends using a ZooKeeper ensemble of 3-5 machines solely dedicated to Kafka (co-locating other applications can cause unwanted service disturbances).
• zookeeper.session.timeout.ms is a setting for Kafka that specifies how long ZooKeeper waits for heartbeat messages before it considers the client (the Kafka broker) unavailable. If this happens, metadata information about partition leadership owned by the broker is reassigned. If this setting is too high, it might take a long time for the system to detect a broker failure. If it is set too low, it might result in frequent leadership reassignments.
• jute.maxbuffer is a crucial Java system property for both Kafka and ZooKeeper. It controls the maximum size of the data a znode can contain. The default value, one megabyte, might be increased for certain production use cases.
• There are cases where ZooKeeper can require more connections. In those cases, it is recommended to increase the maxClientCnxns parameter in ZooKeeper.
• Note that old Kafka consumers store consumer offset commits in ZooKeeper (deprecated). It is recommended to use new consumers that store offsets in internal Kafka topics, which reduces the load on ZooKeeper.
Kafka Reference
Metrics Reference
In addition to these metrics, many aggregate metrics are available. If an entity type has parents defined, you can formulate all possible aggregate metrics using the formula base_metric_across_parents. In addition, metrics for aggregate totals can be formed by adding the prefix total_ to the front of the metric name. Use the type-ahead feature in the Cloudera Manager chart browser to find the exact aggregate metric name, in case the plural form does not end in "s". For example, the following metric names may be valid for Kafka:
• alerts_rate_across_clusters
• total_alerts_rate_across_clusters
Some metrics, such as alerts_rate, apply to nearly every metric context. Others only apply to a certain service or role. For more information about metrics in Cloudera Manager, see Cloudera Manager Metrics and Metric Aggregation.
Note: The following sections are identical to the metrics listed with Cloudera Manager. Be sure to scroll horizontally to see the full content in each table.
Base Metrics
Metric Name Description Unit Parents CDH Version alerts_rate The number of alerts. events per second cluster CDH 5, CDH 6 events_critical_rate The number of critical events. events per second cluster CDH 5, CDH 6 events_important_rate The number of important events. events per second cluster CDH 5, CDH 6 events_informational_ The number of informational events.
rate events per second cluster CDH 5, CDH 6 seconds per second cluster CDH 5, CDH 6 health_concerning_rate Percentage of Time with Concerning Health seconds per second cluster CDH 5, CDH 6 health_disabled_rate Percentage of Time with Disabled Health seconds per second cluster CDH 5, CDH 6 health_good_rate Percentage of Time with Good Health seconds per second cluster CDH 5, CDH 6 health_unknown_rate Percentage of Time with Unknown Health seconds per second cluster CDH 5, CDH 6 health_bad_rate 86 | Apache Kafka Guide Percentage of Time with Bad Health Kafka Reference Broker Metrics Metric Name Description Unit Parents CDH Version alerts_rate The number of alerts. events per second cluster, kafka, rack CDH 5, CDH 6 seconds per second cluster, kafka, rack CDH 5, CDH 6 seconds per second cluster, kafka, rack CDH 5, CDH 6 cgroup_mem_page_cache Page cache usage of the role's cgroup bytes cluster, kafka, rack CDH 5, CDH 6 cgroup_mem_rss Resident memory of the role's cgroup bytes cluster, kafka, rack CDH 5, CDH 6 cgroup_mem_swap Swap usage of the role's cgroup bytes cluster, kafka, rack CDH 5, CDH 6 cluster, kafka, rack CDH 5, CDH 6 cgroup_read_ios_rate Number of read I/O operations ios per second cluster, kafka, from all disks by the role's rack cgroup CDH 5, CDH 6 cgroup_write_bytes_ rate Bytes written to all disks by the bytes per role's cgroup second cluster, kafka, rack CDH 5, CDH 6 cgroup_write_ios_rate Number of write I/O operations ios per second cluster, kafka, to all disks by the role's cgroup rack CDH 5, CDH 6 cgroup_cpu_system_rate CPU usage of the role's cgroup cgroup_cpu_user_rate User Space CPU usage of the role's cgroup cgroup_read_bytes_rate Bytes read from all disks by the bytes per role's cgroup second cpu_system_rate Total System CPU seconds per second cluster, kafka, rack CDH 5, CDH 6 cpu_user_rate Total CPU user time seconds per second cluster, kafka, rack CDH 5, CDH 6 events_critical_rate The number of critical events. events per second cluster, kafka, rack CDH 5, CDH 6 events_important_rate The number of important events. events per second cluster, kafka, rack CDH 5, CDH 6 events_informational_ The number of informational events. rate cluster, kafka, rack CDH 5, CDH 6 events per second fd_max Maximum number of file descriptors file descriptors cluster, kafka, rack CDH 5, CDH 6 fd_open Open file descriptors. file descriptors cluster, kafka, rack CDH 5, CDH 6 health_bad_rate Percentage of Time with Bad Health seconds per second cluster, kafka, rack CDH 5, CDH 6 health_concerning_rate Percentage of Time with Concerning Health seconds per second cluster, kafka, rack CDH 5, CDH 6 Percentage of Time with Disabled Health seconds per second cluster, kafka, rack CDH 5, CDH 6 health_disabled_rate Apache Kafka Guide | 87 Kafka Reference Metric Name Description Unit Parents CDH Version health_good_rate Percentage of Time with Good Health seconds per second cluster, kafka, rack CDH 5, CDH 6 health_unknown_rate Percentage of Time with Unknown Health seconds per second cluster, kafka, rack CDH 5, CDH 6 kafka_active_ controller Will be 1 if this broker is the active controller, 0 otherwise message.units. cluster, kafka, controller rack CDH 5, CDH 6 kafka_broker_state The state the broker is in. message.units. 
cluster, kafka, state rack CDH 5, CDH 6 • 0 = NotRunning • 1 = Starting • 2 = RecoveringFrom UncleanShutdown • 3 = RunningAsBroker • 4= RunningAsController • 6= PendingControlled Shutdown • 7= BrokerShuttingDown kafka_bytes_fetched_ 15min_rate Amount of data consumers fetched from this topic on this broker: 15 Min Rate bytes per cluster, kafka, message.units. rack singular.second CDH 5, CDH 6 kafka_bytes_fetched_ 1min_rate Amount of data consumers fetched from this topic on this broker: 1 Min Rate bytes per cluster, kafka, message.units. rack singular.second CDH 5, CDH 6 kafka_bytes_fetched_ 5min_rate Amount of data consumers fetched from this topic on this broker: 5 Min Rate bytes per cluster, kafka, message.units. rack singular.second CDH 5, CDH 6 kafka_bytes_fetched_ avg_rate Amount of data consumers fetched from this topic on this broker: Avg Rate bytes per cluster, kafka, message.units. rack singular.second CDH 5, CDH 6 kafka_bytes_fetched_ rate Amount of data consumers fetched from this topic on this broker bytes per second cluster, kafka, rack CDH 5, CDH 6 kafka_bytes_received_ Amount of data written to topic bytes per cluster, kafka, on this broker: 15 Min Rate message.units. rack 15min_rate singular.second CDH 5, CDH 6 kafka_bytes_received_ Amount of data written to topic bytes per cluster, kafka, on this broker: 1 Min Rate message.units. rack 1min_rate singular.second CDH 5, CDH 6 kafka_bytes_received_ Amount of data written to topic bytes per cluster, kafka, on this broker: 5 Min Rate message.units. rack 5min_rate singular.second CDH 5, CDH 6 kafka_bytes_received_ Amount of data written to topic bytes per cluster, kafka, on this broker: Avg Rate message.units. rack avg_rate singular.second CDH 5, CDH 6 88 | Apache Kafka Guide Kafka Reference Metric Name Description Unit Parents CDH Version cluster, kafka, rack CDH 5, CDH 6 kafka_bytes_rejected_ Amount of data in messages bytes per cluster, kafka, rejected by broker for this topic: message.units. rack 15min_rate 15 Min Rate singular.second CDH 5, CDH 6 kafka_bytes_rejected_ Amount of data in messages bytes per cluster, kafka, rejected by broker for this topic: message.units. rack 1min_rate 1 Min Rate singular.second CDH 5, CDH 6 kafka_bytes_rejected_ Amount of data in messages bytes per cluster, kafka, rejected by broker for this topic: message.units. rack 5min_rate 5 Min Rate singular.second CDH 5, CDH 6 kafka_bytes_rejected_ Amount of data in messages bytes per cluster, kafka, rejected by broker for this topic: message.units. rack avg_rate Avg Rate singular.second CDH 5, CDH 6 kafka_bytes_rejected_ Amount of data in messages bytes per rejected by broker for this topic second rate cluster, kafka, rack CDH 5, CDH 6 kafka_consumer_ expires_15min_rate Number of expired delayed requests per cluster, kafka, consumer fetch requests: 15 Min message.units. rack Rate singular.second CDH 5, CDH 6 kafka_consumer_ expires_1min_rate Number of expired delayed requests per cluster, kafka, consumer fetch requests: 1 Min message.units. rack Rate singular.second CDH 5, CDH 6 kafka_consumer_ expires_5min_rate Number of expired delayed requests per cluster, kafka, consumer fetch requests: 5 Min message.units. rack Rate singular.second CDH 5, CDH 6 kafka_consumer_ expires_avg_rate Number of expired delayed consumer fetch requests: Avg Rate requests per cluster, kafka, message.units. 
rack singular.second CDH 5, CDH 6 kafka_consumer_ expires_rate Number of expired delayed consumer fetch requests requests per second cluster, kafka, rack CDH 5, CDH 6 kafka_consumer_ metadata_local_time_ 75th_percentile Local Time spent in responding ms to ConsumerMetadata requests: 75th Percentile cluster, kafka, rack CDH 5, CDH 6 kafka_consumer_ metadata_local_time_ 999th_percentile Local Time spent in responding ms to ConsumerMetadata requests: 999th Percentile cluster, kafka, rack CDH 5, CDH 6 kafka_consumer_ metadata_local_time_ 99th_percentile Local Time spent in responding ms to ConsumerMetadata requests: 99th Percentile cluster, kafka, rack CDH 5, CDH 6 kafka_consumer_ metadata_local_time_ avg Local Time spent in responding ms to ConsumerMetadata requests: Avg cluster, kafka, rack CDH 5, CDH 6 kafka_consumer_ metadata_local_time_ max Local Time spent in responding ms to ConsumerMetadata requests: Max cluster, kafka, rack CDH 5, CDH 6 kafka_bytes_received_ Amount of data written to topic bytes per on this broker second rate Apache Kafka Guide | 89 Kafka Reference Metric Name Description kafka_consumer_ metadata_local_time_ median Unit Parents CDH Version Local Time spent in responding ms to ConsumerMetadata requests: 50th Percentile cluster, kafka, rack CDH 5, CDH 6 kafka_consumer_ metadata_local_time_ min Local Time spent in responding ms to ConsumerMetadata requests: Min cluster, kafka, rack CDH 5, CDH 6 kafka_consumer_ metadata_local_time_ rate Local Time spent in responding requests per to ConsumerMetadata requests: second Samples cluster, kafka, rack CDH 5, CDH 6 kafka_consumer_ metadata_local_time_ stddev Local Time spent in responding ms to ConsumerMetadata requests: Standard Deviation cluster, kafka, rack CDH 5, CDH 6 kafka_consumer_ Remote Time spent in metadata_remote_time_ responding to ConsumerMetadata requests: 75th_percentile 75th Percentile ms cluster, kafka, rack CDH 5, CDH 6 kafka_consumer_ Remote Time spent in metadata_remote_time_ responding to ConsumerMetadata requests: 999th_percentile 999th Percentile ms cluster, kafka, rack CDH 5, CDH 6 kafka_consumer_ Remote Time spent in metadata_remote_time_ responding to ConsumerMetadata requests: 99th_percentile 99th Percentile ms cluster, kafka, rack CDH 5, CDH 6 kafka_consumer_ Remote Time spent in metadata_remote_time_ responding to ConsumerMetadata requests: avg Avg ms cluster, kafka, rack CDH 5, CDH 6 kafka_consumer_ Remote Time spent in metadata_remote_time_ responding to ConsumerMetadata requests: max Max ms cluster, kafka, rack CDH 5, CDH 6 kafka_consumer_ Remote Time spent in metadata_remote_time_ responding to ConsumerMetadata requests: median 50th Percentile ms cluster, kafka, rack CDH 5, CDH 6 kafka_consumer_ Remote Time spent in metadata_remote_time_ responding to ConsumerMetadata requests: min Min ms cluster, kafka, rack CDH 5, CDH 6 kafka_consumer_ Remote Time spent in metadata_remote_time_ responding to ConsumerMetadata requests: rate Samples requests per second cluster, kafka, rack CDH 5, CDH 6 kafka_consumer_ Remote Time spent in metadata_remote_time_ responding to ConsumerMetadata requests: stddev Standard Deviation ms cluster, kafka, rack CDH 5, CDH 6 90 | Apache Kafka Guide Kafka Reference Metric Name Description Unit Parents CDH Version kafka_consumer_ metadata_request_ queue_time_75th_ percentile Request Queue Time spent in responding to ConsumerMetadata requests: 75th Percentile ms cluster, kafka, rack CDH 5, CDH 6 kafka_consumer_ metadata_request_ queue_time_999th_ percentile Request Queue Time 
spent in responding to ConsumerMetadata requests: 999th Percentile ms cluster, kafka, rack CDH 5, CDH 6 kafka_consumer_ metadata_request_ queue_time_99th_ percentile Request Queue Time spent in responding to ConsumerMetadata requests: 99th Percentile ms cluster, kafka, rack CDH 5, CDH 6 kafka_consumer_ metadata_request_ queue_time_avg Request Queue Time spent in responding to ConsumerMetadata requests: Avg ms cluster, kafka, rack CDH 5, CDH 6 kafka_consumer_ metadata_request_ queue_time_max Request Queue Time spent in responding to ConsumerMetadata requests: Max ms cluster, kafka, rack CDH 5, CDH 6 kafka_consumer_ metadata_request_ queue_time_median Request Queue Time spent in responding to ConsumerMetadata requests: 50th Percentile ms cluster, kafka, rack CDH 5, CDH 6 kafka_consumer_ metadata_request_ queue_time_min Request Queue Time spent in responding to ConsumerMetadata requests: Min ms cluster, kafka, rack CDH 5, CDH 6 kafka_consumer_ metadata_request_ queue_time_rate Request Queue Time spent in responding to ConsumerMetadata requests: Samples requests per second cluster, kafka, rack CDH 5, CDH 6 kafka_consumer_ metadata_request_ queue_time_stddev Request Queue Time spent in responding to ConsumerMetadata requests: Standard Deviation ms cluster, kafka, rack CDH 5, CDH 6 kafka_consumer_ metadata_requests_ 15min_rate Number of ConsumerMetadata requests per cluster, kafka, requests: 15 Min Rate message.units. rack singular.second CDH 5, CDH 6 kafka_consumer_ metadata_requests_ 1min_rate Number of ConsumerMetadata requests per cluster, kafka, requests: 1 Min Rate message.units. rack singular.second CDH 5, CDH 6 kafka_consumer_ metadata_requests_ 5min_rate Number of ConsumerMetadata requests per cluster, kafka, requests: 5 Min Rate message.units. rack singular.second CDH 5, CDH 6 Apache Kafka Guide | 91 Kafka Reference Metric Name Description Unit Parents CDH Version kafka_consumer_ Number of ConsumerMetadata requests per cluster, kafka, message.units. 
rack metadata_requests_avg_ requests: Avg Rate singular.second rate CDH 5, CDH 6 kafka_consumer_ Number of ConsumerMetadata requests per second metadata_requests_rate requests cluster, kafka, rack CDH 5, CDH 6 kafka_consumer_ metadata_response_ queue_time_75th_ percentile Response Queue Time spent in responding to ConsumerMetadata requests: 75th Percentile ms cluster, kafka, rack CDH 5, CDH 6 kafka_consumer_ metadata_response_ queue_time_999th_ percentile Response Queue Time spent in responding to ConsumerMetadata requests: 999th Percentile ms cluster, kafka, rack CDH 5, CDH 6 kafka_consumer_ metadata_response_ queue_time_99th_ percentile Response Queue Time spent in responding to ConsumerMetadata requests: 99th Percentile ms cluster, kafka, rack CDH 5, CDH 6 kafka_consumer_ metadata_response_ queue_time_avg Response Queue Time spent in responding to ConsumerMetadata requests: Avg ms cluster, kafka, rack CDH 5, CDH 6 kafka_consumer_ metadata_response_ queue_time_max Response Queue Time spent in responding to ConsumerMetadata requests: Max ms cluster, kafka, rack CDH 5, CDH 6 kafka_consumer_ metadata_response_ queue_time_median Response Queue Time spent in responding to ConsumerMetadata requests: 50th Percentile ms cluster, kafka, rack CDH 5, CDH 6 kafka_consumer_ metadata_response_ queue_time_min Response Queue Time spent in responding to ConsumerMetadata requests: Min ms cluster, kafka, rack CDH 5, CDH 6 kafka_consumer_ metadata_response_ queue_time_rate Response Queue Time spent in responding to ConsumerMetadata requests: Samples requests per second cluster, kafka, rack CDH 5, CDH 6 kafka_consumer_ metadata_response_ queue_time_stddev Response Queue Time spent in responding to ConsumerMetadata requests: Standard Deviation ms cluster, kafka, rack CDH 5, CDH 6 kafka_consumer_ metadata_response_ send_time_75th_ percentile Response Send Time spent in responding to ConsumerMetadata requests: 75th Percentile ms cluster, kafka, rack CDH 5, CDH 6 92 | Apache Kafka Guide Kafka Reference Metric Name Description Unit Parents CDH Version kafka_consumer_ metadata_response_ send_time_999th_ percentile Response Send Time spent in responding to ConsumerMetadata requests: 999th Percentile ms cluster, kafka, rack CDH 5, CDH 6 kafka_consumer_ metadata_response_ send_time_99th_ percentile Response Send Time spent in responding to ConsumerMetadata requests: 99th Percentile ms cluster, kafka, rack CDH 5, CDH 6 kafka_consumer_ metadata_response_ send_time_avg Response Send Time spent in responding to ConsumerMetadata requests: Avg ms cluster, kafka, rack CDH 5, CDH 6 kafka_consumer_ metadata_response_ send_time_max Response Send Time spent in responding to ConsumerMetadata requests: Max ms cluster, kafka, rack CDH 5, CDH 6 kafka_consumer_ metadata_response_ send_time_median Response Send Time spent in responding to ConsumerMetadata requests: 50th Percentile ms cluster, kafka, rack CDH 5, CDH 6 kafka_consumer_ metadata_response_ send_time_min Response Send Time spent in responding to ConsumerMetadata requests: Min ms cluster, kafka, rack CDH 5, CDH 6 kafka_consumer_ metadata_response_ send_time_rate Response Send Time spent in responding to ConsumerMetadata requests: Samples requests per second cluster, kafka, rack CDH 5, CDH 6 kafka_consumer_ metadata_response_ send_time_stddev Response Send Time spent in responding to ConsumerMetadata requests: Standard Deviation ms cluster, kafka, rack CDH 5, CDH 6 kafka_consumer_ metadata_total_time_ 75th_percentile Total Time spent in responding ms to ConsumerMetadata 
requests: 75th Percentile cluster, kafka, rack CDH 5, CDH 6 kafka_consumer_ metadata_total_time_ 999th_percentile Total Time spent in responding ms to ConsumerMetadata requests: 999th Percentile cluster, kafka, rack CDH 5, CDH 6 kafka_consumer_ metadata_total_time_ 99th_percentile Total Time spent in responding ms to ConsumerMetadata requests: 99th Percentile cluster, kafka, rack CDH 5, CDH 6 kafka_consumer_ metadata_total_time_ avg Total Time spent in responding ms to ConsumerMetadata requests: Avg cluster, kafka, rack CDH 5, CDH 6 kafka_consumer_ metadata_total_time_ max Total Time spent in responding ms to ConsumerMetadata requests: Max cluster, kafka, rack CDH 5, CDH 6 Apache Kafka Guide | 93 Kafka Reference Metric Name Description kafka_consumer_ metadata_total_time_ median Parents CDH Version Total Time spent in responding ms to ConsumerMetadata requests: 50th Percentile cluster, kafka, rack CDH 5, CDH 6 kafka_consumer_ metadata_total_time_ min Total Time spent in responding ms to ConsumerMetadata requests: Min cluster, kafka, rack CDH 5, CDH 6 kafka_consumer_ metadata_total_time_ rate Total Time spent in responding requests per to ConsumerMetadata requests: second Samples cluster, kafka, rack CDH 5, CDH 6 kafka_consumer_ metadata_total_time_ stddev Total Time spent in responding ms to ConsumerMetadata requests: Standard Deviation cluster, kafka, rack CDH 5, CDH 6 kafka_controlled_ shutdown_local_time_ 75th_percentile Local Time spent in responding ms to ControlledShutdown requests: 75th Percentile cluster, kafka, rack CDH 5, CDH 6 kafka_controlled_ shutdown_local_time_ 999th_percentile Local Time spent in responding ms to ControlledShutdown requests: 999th Percentile cluster, kafka, rack CDH 5, CDH 6 kafka_controlled_ shutdown_local_time_ 99th_percentile Local Time spent in responding ms to ControlledShutdown requests: 99th Percentile cluster, kafka, rack CDH 5, CDH 6 kafka_controlled_ shutdown_local_time_ avg Local Time spent in responding ms to ControlledShutdown requests: Avg cluster, kafka, rack CDH 5, CDH 6 kafka_controlled_ shutdown_local_time_ max Local Time spent in responding ms to ControlledShutdown requests: Max cluster, kafka, rack CDH 5, CDH 6 kafka_controlled_ shutdown_local_time_ median Local Time spent in responding ms to ControlledShutdown requests: 50th Percentile cluster, kafka, rack CDH 5, CDH 6 kafka_controlled_ shutdown_local_time_ min Local Time spent in responding ms to ControlledShutdown requests: Min cluster, kafka, rack CDH 5, CDH 6 kafka_controlled_ shutdown_local_time_ rate Local Time spent in responding requests per to ControlledShutdown requests: second Samples cluster, kafka, rack CDH 5, CDH 6 kafka_controlled_ shutdown_local_time_ stddev Local Time spent in responding ms to ControlledShutdown requests: Standard Deviation cluster, kafka, rack CDH 5, CDH 6 cluster, kafka, rack CDH 5, CDH 6 kafka_controlled_ Remote Time spent in shutdown_remote_time_ responding to ControlledShutdown requests: 75th_percentile 75th Percentile 94 | Apache Kafka Guide Unit ms Kafka Reference Metric Name Description Unit Parents CDH Version kafka_controlled_ Remote Time spent in shutdown_remote_time_ responding to ControlledShutdown requests: 999th_percentile 999th Percentile ms cluster, kafka, rack CDH 5, CDH 6 kafka_controlled_ Remote Time spent in shutdown_remote_time_ responding to ControlledShutdown requests: 99th_percentile 99th Percentile ms cluster, kafka, rack CDH 5, CDH 6 kafka_controlled_ Remote Time spent in shutdown_remote_time_ responding to ControlledShutdown 
Metric Name | Description | Unit | Parents | CDH Version
kafka_controlled_shutdown_remote_time_{avg, max, median, min, stddev} | Remote Time spent in responding to ControlledShutdown requests: Avg, Max, 50th Percentile, Min, Standard Deviation | ms | cluster, kafka, rack | CDH 5, CDH 6
kafka_controlled_shutdown_remote_time_rate | Remote Time spent in responding to ControlledShutdown requests: Samples | requests per second | cluster, kafka, rack | CDH 5, CDH 6
kafka_controlled_shutdown_request_queue_time_{75th_percentile, 999th_percentile, 99th_percentile, avg, max, median, min, stddev} | Request Queue Time spent in responding to ControlledShutdown requests: 75th/999th/99th Percentile, Avg, Max, 50th Percentile, Min, Standard Deviation | ms | cluster, kafka, rack | CDH 5, CDH 6
kafka_controlled_shutdown_request_queue_time_rate | Request Queue Time spent in responding to ControlledShutdown requests: Samples | requests per second | cluster, kafka, rack | CDH 5, CDH 6
kafka_controlled_shutdown_requests_{15min_rate, 1min_rate, 5min_rate, avg_rate} | Number of ControlledShutdown requests: 15 Min Rate, 1 Min Rate, 5 Min Rate, Avg Rate | requests per message.units.singular.second | cluster, kafka, rack | CDH 5, CDH 6
kafka_controlled_shutdown_requests_rate | Number of ControlledShutdown requests | requests per second | cluster, kafka, rack | CDH 5, CDH 6
kafka_controlled_shutdown_response_queue_time_{75th_percentile, 999th_percentile, 99th_percentile, avg, max, median, min, stddev} | Response Queue Time spent in responding to ControlledShutdown requests: 75th/999th/99th Percentile, Avg, Max, 50th Percentile, Min, Standard Deviation | ms | cluster, kafka, rack | CDH 5, CDH 6
kafka_controlled_shutdown_response_queue_time_rate | Response Queue Time spent in responding to ControlledShutdown requests: Samples | requests per second | cluster, kafka, rack | CDH 5, CDH 6
kafka_controlled_shutdown_response_send_time_{75th_percentile, 999th_percentile, 99th_percentile, avg, max, median, min, stddev} | Response Send Time spent in responding to ControlledShutdown requests: 75th/999th/99th Percentile, Avg, Max, 50th Percentile, Min, Standard Deviation | ms | cluster, kafka, rack | CDH 5, CDH 6
kafka_controlled_shutdown_response_send_time_rate | Response Send Time spent in responding to ControlledShutdown requests: Samples | requests per second | cluster, kafka, rack | CDH 5, CDH 6
kafka_controlled_shutdown_total_time_{75th_percentile, 999th_percentile, 99th_percentile, avg, max, median, min, stddev} | Total Time spent in responding to ControlledShutdown requests: 75th/999th/99th Percentile, Avg, Max, 50th Percentile, Min, Standard Deviation | ms | cluster, kafka, rack | CDH 5, CDH 6
kafka_controlled_shutdown_total_time_rate | Total Time spent in responding to ControlledShutdown requests: Samples | requests per second | cluster, kafka, rack | CDH 5, CDH 6
kafka_daemon_thread_count | JVM daemon thread count | threads | cluster, kafka, rack | CDH 5, CDH 6
kafka_fetch_consumer_local_time_{75th_percentile, 999th_percentile, 99th_percentile, avg, max, median, min, stddev} | Local Time spent in responding to FetchConsumer requests: 75th/999th/99th Percentile, Avg, Max, 50th Percentile, Min, Standard Deviation | ms | cluster, kafka, rack | CDH 5, CDH 6
kafka_fetch_consumer_local_time_rate | Local Time spent in responding to FetchConsumer requests: Samples | requests per second | cluster, kafka, rack | CDH 5, CDH 6
kafka_fetch_consumer_remote_time_{75th_percentile, 999th_percentile, 99th_percentile, avg, max, median, min, stddev} | Remote Time spent in responding to FetchConsumer requests: 75th/999th/99th Percentile, Avg, Max, 50th Percentile, Min, Standard Deviation | ms | cluster, kafka, rack | CDH 5, CDH 6
kafka_fetch_consumer_remote_time_rate | Remote Time spent in responding to FetchConsumer requests: Samples | requests per second | cluster, kafka, rack | CDH 5, CDH 6
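The per-request latency rows above (Local, Remote, Request Queue, Response Queue, Response Send, and Total Time) are derived from the broker's request metrics. As a minimal sketch, the following reads the same figures directly over JMX; the broker host and JMX port are placeholders, and the MBean and attribute names assume Kafka's standard kafka.network RequestMetrics naming rather than anything specific to this guide.

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Spot-check the request-latency histogram behind metrics such as
// kafka_fetch_consumer_total_time_avg / _99th_percentile.
// "broker-host:9393" is a placeholder JMX endpoint.
public class RequestLatencyProbe {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://broker-host:9393/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            // Assumed standard Kafka MBean for FetchConsumer total request time.
            ObjectName totalTime = new ObjectName(
                    "kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchConsumer");
            double avg = (Double) mbs.getAttribute(totalTime, "Mean");
            double p99 = (Double) mbs.getAttribute(totalTime, "99thPercentile");
            System.out.printf("FetchConsumer total time: avg=%.2f ms, p99=%.2f ms%n", avg, p99);
        }
    }
}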
kafka_fetch_consumer_request_queue_time_{75th_percentile, 999th_percentile, 99th_percentile, avg, max, median, min, stddev} | Request Queue Time spent in responding to FetchConsumer requests: 75th/999th/99th Percentile, Avg, Max, 50th Percentile, Min, Standard Deviation | ms | cluster, kafka, rack | CDH 5, CDH 6
kafka_fetch_consumer_request_queue_time_rate | Request Queue Time spent in responding to FetchConsumer requests: Samples | requests per second | cluster, kafka, rack | CDH 5, CDH 6
kafka_fetch_consumer_requests_{15min_rate, 1min_rate, 5min_rate, avg_rate} | Number of FetchConsumer requests: 15 Min Rate, 1 Min Rate, 5 Min Rate, Avg Rate | requests per message.units.singular.second | cluster, kafka, rack | CDH 5, CDH 6
kafka_fetch_consumer_requests_rate | Number of FetchConsumer requests | requests per second | cluster, kafka, rack | CDH 5, CDH 6
kafka_fetch_consumer_response_queue_time_{75th_percentile, 999th_percentile, 99th_percentile, avg, max, median, min, stddev} | Response Queue Time spent in responding to FetchConsumer requests: 75th/999th/99th Percentile, Avg, Max, 50th Percentile, Min, Standard Deviation | ms | cluster, kafka, rack | CDH 5, CDH 6
kafka_fetch_consumer_response_queue_time_rate | Response Queue Time spent in responding to FetchConsumer requests: Samples | requests per second | cluster, kafka, rack | CDH 5, CDH 6
kafka_fetch_consumer_response_send_time_{75th_percentile, 999th_percentile, 99th_percentile, avg, max, median, min, stddev} | Response Send Time spent in responding to FetchConsumer requests: 75th/999th/99th Percentile, Avg, Max, 50th Percentile, Min, Standard Deviation | ms | cluster, kafka, rack | CDH 5, CDH 6
kafka_fetch_consumer_response_send_time_rate | Response Send Time spent in responding to FetchConsumer requests: Samples | requests per second | cluster, kafka, rack | CDH 5, CDH 6
kafka_fetch_consumer_total_time_{75th_percentile, 999th_percentile, 99th_percentile, avg, max, median, min, stddev} | Total Time spent in responding to FetchConsumer requests: 75th/999th/99th Percentile, Avg, Max, 50th Percentile, Min, Standard Deviation | ms | cluster, kafka, rack | CDH 5, CDH 6
kafka_fetch_consumer_total_time_rate | Total Time spent in responding to FetchConsumer requests: Samples | requests per second | cluster, kafka, rack | CDH 5, CDH 6
kafka_fetch_follower_local_time_{75th_percentile, 999th_percentile, 99th_percentile, avg, max, median, min, stddev} | Local Time spent in responding to FetchFollower requests: 75th/999th/99th Percentile, Avg, Max, 50th Percentile, Min, Standard Deviation | ms | cluster, kafka, rack | CDH 5, CDH 6
kafka_fetch_follower_local_time_rate | Local Time spent in responding to FetchFollower requests: Samples | requests per second | cluster, kafka, rack | CDH 5, CDH 6
kafka_fetch_follower_remote_time_{75th_percentile, 999th_percentile, 99th_percentile, avg, max, median, min, stddev} | Remote Time spent in responding to FetchFollower requests: 75th/999th/99th Percentile, Avg, Max, 50th Percentile, Min, Standard Deviation | ms | cluster, kafka, rack | CDH 5, CDH 6
kafka_fetch_follower_remote_time_rate | Remote Time spent in responding to FetchFollower requests: Samples | requests per second | cluster, kafka, rack | CDH 5, CDH 6
kafka_fetch_follower_request_queue_time_{75th_percentile, 999th_percentile, 99th_percentile, avg, max, median, min, stddev} | Request Queue Time spent in responding to FetchFollower requests: 75th/999th/99th Percentile, Avg, Max, 50th Percentile, Min, Standard Deviation | ms | cluster, kafka, rack | CDH 5, CDH 6
kafka_fetch_follower_request_queue_time_rate | Request Queue Time spent in responding to FetchFollower requests: Samples | requests per second | cluster, kafka, rack | CDH 5, CDH 6
kafka_fetch_follower_requests_{15min_rate, 1min_rate, 5min_rate, avg_rate} | Number of FetchFollower requests: 15 Min Rate, 1 Min Rate, 5 Min Rate, Avg Rate | requests per message.units.singular.second | cluster, kafka, rack | CDH 5, CDH 6
kafka_fetch_follower_requests_rate | Number of FetchFollower requests | requests per second | cluster, kafka, rack | CDH 5, CDH 6
kafka_fetch_follower_response_queue_time_{75th_percentile, 999th_percentile, 99th_percentile, avg, max, median, min, stddev} | Response Queue Time spent in responding to FetchFollower requests: 75th/999th/99th Percentile, Avg, Max, 50th Percentile, Min, Standard Deviation | ms | cluster, kafka, rack | CDH 5, CDH 6
kafka_fetch_follower_response_queue_time_rate | Response Queue Time spent in responding to FetchFollower requests: Samples | requests per second | cluster, kafka, rack | CDH 5, CDH 6
kafka_fetch_follower_response_send_time_{75th_percentile, 999th_percentile, 99th_percentile, avg, max, median, min, stddev} | Response Send Time spent in responding to FetchFollower requests: 75th/999th/99th Percentile, Avg, Max, 50th Percentile, Min, Standard Deviation | ms | cluster, kafka, rack | CDH 5, CDH 6
kafka_fetch_follower_response_send_time_rate | Response Send Time spent in responding to FetchFollower requests: Samples | requests per second | cluster, kafka, rack | CDH 5, CDH 6
kafka_fetch_follower_total_time_{75th_percentile, 999th_percentile, 99th_percentile, avg, max, median, min, stddev} | Total Time spent in responding to FetchFollower requests: 75th/999th/99th Percentile, Avg, Max, 50th Percentile, Min, Standard Deviation | ms | cluster, kafka, rack | CDH 5, CDH 6
kafka_fetch_follower_total_time_rate | Total Time spent in responding to FetchFollower requests: Samples | requests per second | cluster, kafka, rack | CDH 5, CDH 6
kafka_fetch_local_time_{75th_percentile, 999th_percentile, 99th_percentile, avg, max, median, min, stddev} | Local Time spent in responding to Fetch requests: 75th/999th/99th Percentile, Avg, Max, 50th Percentile, Min, Standard Deviation | ms | cluster, kafka, rack | CDH 5, CDH 6
kafka_fetch_local_time_rate | Local Time spent in responding to Fetch requests: Samples | requests per second | cluster, kafka, rack | CDH 5, CDH 6
kafka_fetch_purgatory_size | Requests waiting in the fetch purgatory. This depends on the value of fetch.wait.max.ms in the consumer | requests | cluster, kafka, rack | CDH 5, CDH 6
kafka_fetch_purgatory_delayed_requests | Number of requests delayed in the fetch purgatory | requests | cluster, kafka, rack | CDH 5, CDH 6
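The fetch purgatory entries above refer to the consumer-side fetch wait setting (fetch.wait.max.ms in the older consumer; the modern Java consumer exposes the equivalent as fetch.max.wait.ms). A minimal sketch of the relevant consumer properties follows; the broker address, group id, and the byte/millisecond values are illustrative placeholders, not recommendations.

import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

// Illustrative settings only; "broker-host:9092" and "fetch-tuning-demo" are placeholders.
public class FetchTuningExample {
    public static KafkaConsumer<String, String> buildConsumer() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-host:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "fetch-tuning-demo");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // A fetch request parks in the broker's fetch purgatory until either
        // fetch.min.bytes of data is available or fetch.max.wait.ms expires,
        // which is the behavior the kafka_fetch_purgatory_* metrics describe.
        props.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, "1048576"); // wait for up to 1 MB of data...
        props.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, "500");   // ...but no longer than 500 ms
        return new KafkaConsumer<>(props);
    }
}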
kafka_fetch_remote_time_{75th_percentile, 999th_percentile, 99th_percentile, avg, max, median, min, stddev} | Remote Time spent in responding to Fetch requests: 75th/999th/99th Percentile, Avg, Max, 50th Percentile, Min, Standard Deviation | ms | cluster, kafka, rack | CDH 5, CDH 6
kafka_fetch_remote_time_rate | Remote Time spent in responding to Fetch requests: Samples | requests per second | cluster, kafka, rack | CDH 5, CDH 6
kafka_fetch_request_failures_{15min_rate, 1min_rate, 5min_rate, avg_rate} | Number of data read requests from consumers that brokers failed to process for this topic: 15 Min Rate, 1 Min Rate, 5 Min Rate, Avg Rate | message.units.fetch_requests per message.units.singular.second | cluster, kafka, rack | CDH 5, CDH 6
kafka_fetch_request_failures_rate | Number of data read requests from consumers that brokers failed to process for this topic | message.units.fetch_requests per second | cluster, kafka, rack | CDH 5, CDH 6
kafka_fetch_request_queue_time_{75th_percentile, 999th_percentile, 99th_percentile, avg, max, median, min, stddev} | Request Queue Time spent in responding to Fetch requests: 75th/999th/99th Percentile, Avg, Max, 50th Percentile, Min, Standard Deviation | ms | cluster, kafka, rack | CDH 5, CDH 6
kafka_fetch_request_queue_time_rate | Request Queue Time spent in responding to Fetch requests: Samples | requests per second | cluster, kafka, rack | CDH 5, CDH 6
kafka_fetch_requests_{15min_rate, 1min_rate, 5min_rate, avg_rate} | Number of Fetch requests: 15 Min Rate, 1 Min Rate, 5 Min Rate, Avg Rate | requests per message.units.singular.second | cluster, kafka, rack | CDH 5, CDH 6
kafka_fetch_requests_rate | Number of Fetch requests | requests per second | cluster, kafka, rack | CDH 5, CDH 6
kafka_fetch_response_queue_time_{75th_percentile, 999th_percentile, 99th_percentile, avg, max, median, min, stddev} | Response Queue Time spent in responding to Fetch requests: 75th/999th/99th Percentile, Avg, Max, 50th Percentile, Min, Standard Deviation | ms | cluster, kafka, rack | CDH 5, CDH 6
kafka_fetch_response_queue_time_rate | Response Queue Time spent in responding to Fetch requests: Samples | requests per second | cluster, kafka, rack | CDH 5, CDH 6
kafka_fetch_response_send_time_{75th_percentile, 999th_percentile, 99th_percentile, avg, max, median, min, stddev} | Response Send Time spent in responding to Fetch requests: 75th/999th/99th Percentile, Avg, Max, 50th Percentile, Min, Standard Deviation | ms | cluster, kafka, rack | CDH 5, CDH 6
kafka_fetch_response_send_time_rate | Response Send Time spent in responding to Fetch requests: Samples | requests per second | cluster, kafka, rack | CDH 5, CDH 6
kafka_fetch_total_time_{75th_percentile, 999th_percentile, 99th_percentile, avg, max, median, min, stddev} | Total Time spent in responding to Fetch requests: 75th/999th/99th Percentile, Avg, Max, 50th Percentile, Min, Standard Deviation | ms | cluster, kafka, rack | CDH 5, CDH 6
kafka_fetch_total_time_rate | Total Time spent in responding to Fetch requests: Samples | requests per second | cluster, kafka, rack | CDH 5, CDH 6
kafka_follower_expires_{15min_rate, 1min_rate, 5min_rate, avg_rate} | Number of expired delayed follower fetch requests: 15 Min Rate, 1 Min Rate, 5 Min Rate, Avg Rate | requests per message.units.singular.second | cluster, kafka, rack | CDH 5, CDH 6
kafka_follower_expires_rate | Number of expired delayed follower fetch requests | requests per second | cluster, kafka, rack | CDH 5, CDH 6
kafka_heartbeat_local_time_{75th_percentile, 999th_percentile, 99th_percentile, avg, max, median, min, stddev} | Local Time spent in responding to Heartbeat requests: 75th/999th/99th Percentile, Avg, Max, 50th Percentile, Min, Standard Deviation | ms | cluster, kafka, rack | CDH 5, CDH 6
kafka_heartbeat_local_time_rate | Local Time spent in responding to Heartbeat requests: Samples | requests per second | cluster, kafka, rack | CDH 5, CDH 6
kafka_heartbeat_remote_time_{75th_percentile, 999th_percentile, 99th_percentile, avg, max, median, min, stddev} | Remote Time spent in responding to Heartbeat requests: 75th/999th/99th Percentile, Avg, Max, 50th Percentile, Min, Standard Deviation | ms | cluster, kafka, rack | CDH 5, CDH 6
kafka_heartbeat_remote_time_rate | Remote Time spent in responding to Heartbeat requests: Samples | requests per second | cluster, kafka, rack | CDH 5, CDH 6
kafka_heartbeat_request_queue_time_{75th_percentile, 999th_percentile, 99th_percentile, avg, max, median, min, stddev} | Request Queue Time spent in responding to Heartbeat requests: 75th/999th/99th Percentile, Avg, Max, 50th Percentile, Min, Standard Deviation | ms | cluster, kafka, rack | CDH 5, CDH 6
kafka_heartbeat_request_queue_time_rate | Request Queue Time spent in responding to Heartbeat requests: Samples | requests per second | cluster, kafka, rack | CDH 5, CDH 6
kafka_heartbeat_requests_{15min_rate, 1min_rate, 5min_rate, avg_rate} | Number of Heartbeat requests: 15 Min Rate, 1 Min Rate, 5 Min Rate, Avg Rate | requests per message.units.singular.second | cluster, kafka, rack | CDH 5, CDH 6
kafka_heartbeat_requests_rate | Number of Heartbeat requests | requests per second | cluster, kafka, rack | CDH 5, CDH 6
kafka_heartbeat_response_queue_time_{75th_percentile, 999th_percentile, 99th_percentile, avg, max, median, min, stddev} | Response Queue Time spent in responding to Heartbeat requests: 75th/999th/99th Percentile, Avg, Max, 50th Percentile, Min, Standard Deviation | ms | cluster, kafka, rack | CDH 5, CDH 6
kafka_heartbeat_response_queue_time_rate | Response Queue Time spent in responding to Heartbeat requests: Samples | requests per second | cluster, kafka, rack | CDH 5, CDH 6
kafka_heartbeat_response_send_time_{75th_percentile, 999th_percentile, 99th_percentile, avg, max, median, min, stddev} | Response Send Time spent in responding to Heartbeat requests: 75th/999th/99th Percentile, Avg, Max, 50th Percentile, Min, Standard Deviation | ms | cluster, kafka, rack | CDH 5, CDH 6
kafka_heartbeat_response_send_time_rate | Response Send Time spent in responding to Heartbeat requests: Samples | requests per second | cluster, kafka, rack | CDH 5, CDH 6
kafka_heartbeat_total_time_{75th_percentile, 999th_percentile, 99th_percentile, avg, max, median, min, stddev} | Total Time spent in responding to Heartbeat requests: 75th/999th/99th Percentile, Avg, Max, 50th Percentile, Min, Standard Deviation | ms | cluster, kafka, rack | CDH 5, CDH 6
kafka_heartbeat_total_time_rate | Total Time spent in responding to Heartbeat requests: Samples | requests per second | cluster, kafka, rack | CDH 5, CDH 6
kafka_isr_expands_{15min_rate, 1min_rate, 5min_rate, avg_rate} | Number of times ISR for a partition expanded: 15 Min Rate, 1 Min Rate, 5 Min Rate, Avg Rate | message.units.expansions per message.units.singular.second | cluster, kafka, rack | CDH 5, CDH 6
kafka_isr_expands_rate | Number of times ISR for a partition expanded | message.units.expansions per second | cluster, kafka, rack | CDH 5, CDH 6
kafka_isr_shrinks_{15min_rate, 1min_rate, 5min_rate, avg_rate} | Number of times ISR for a partition shrank: 15 Min Rate, 1 Min Rate, 5 Min Rate, Avg Rate | message.units.shrinks per message.units.singular.second | cluster, kafka, rack | CDH 5, CDH 6
kafka_isr_shrinks_rate | Number of times ISR for a partition shrank | message.units.shrinks per second | cluster, kafka, rack | CDH 5, CDH 6
kafka_join_group_local_time_{75th_percentile, 999th_percentile, 99th_percentile, avg, max, median, min, stddev} | Local Time spent in responding to JoinGroup requests: 75th/999th/99th Percentile, Avg, Max, 50th Percentile, Min, Standard Deviation | ms | cluster, kafka, rack | CDH 5, CDH 6
kafka_join_group_local_time_rate | Local Time spent in responding to JoinGroup requests: Samples | requests per second | cluster, kafka, rack | CDH 5, CDH 6
kafka_join_group_remote_time_{75th_percentile, 999th_percentile, 99th_percentile, avg, max, median, min, stddev} | Remote Time spent in responding to JoinGroup requests: 75th/999th/99th Percentile, Avg, Max, 50th Percentile, Min, Standard Deviation | ms | cluster, kafka, rack | CDH 5, CDH 6
kafka_join_group_remote_time_rate | Remote Time spent in responding to JoinGroup requests: Samples | requests per second | cluster, kafka, rack | CDH 5, CDH 6
kafka_join_group_request_queue_time_{75th_percentile, 999th_percentile, 99th_percentile, avg, max, median, min, stddev} | Request Queue Time spent in responding to JoinGroup requests: 75th/999th/99th Percentile, Avg, Max, 50th Percentile, Min, Standard Deviation | ms | cluster, kafka, rack | CDH 5, CDH 6
kafka_join_group_request_queue_time_rate | Request Queue Time spent in responding to JoinGroup requests: Samples | requests per second | cluster, kafka, rack | CDH 5, CDH 6
kafka_join_group_requests_{15min_rate, 1min_rate, 5min_rate, avg_rate} | Number of JoinGroup requests: 15 Min Rate, 1 Min Rate, 5 Min Rate, Avg Rate | requests per message.units.singular.second | cluster, kafka, rack | CDH 5, CDH 6
kafka_join_group_requests_rate | Number of JoinGroup requests | requests per second | cluster, kafka, rack | CDH 5, CDH 6
kafka_join_group_response_queue_time_{75th_percentile, 999th_percentile, 99th_percentile, avg, max, median, min, stddev} | Response Queue Time spent in responding to JoinGroup requests: 75th/999th/99th Percentile, Avg, Max, 50th Percentile, Min, Standard Deviation | ms | cluster, kafka, rack | CDH 5, CDH 6
kafka_join_group_response_queue_time_rate | Response Queue Time spent in responding to JoinGroup requests: Samples | requests per second | cluster, kafka, rack | CDH 5, CDH 6
kafka_join_group_response_send_time_{75th_percentile, 999th_percentile, 99th_percentile, avg, max, median, min, stddev} | Response Send Time spent in responding to JoinGroup requests: 75th/999th/99th Percentile, Avg, Max, 50th Percentile, Min, Standard Deviation | ms | cluster, kafka, rack | CDH 5, CDH 6
kafka_join_group_response_send_time_rate | Response Send Time spent in responding to JoinGroup requests: Samples | requests per second | cluster, kafka, rack | CDH 5, CDH 6
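The kafka_isr_shrinks_* and kafka_isr_expands_* counters listed above are a common health signal: a sustained non-zero shrink rate usually means followers are falling out of sync with partition leaders. The sketch below polls the underlying broker meters over JMX; the host, port, and the simple threshold are placeholders, and the MBean names assume Kafka's standard kafka.server ReplicaManager metrics.

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Minimal check of ISR churn; "broker-host:9393" is a placeholder JMX endpoint.
public class IsrChurnCheck {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://broker-host:9393/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            double shrinks = (Double) mbs.getAttribute(
                    new ObjectName("kafka.server:type=ReplicaManager,name=IsrShrinksPerSec"),
                    "OneMinuteRate");
            double expands = (Double) mbs.getAttribute(
                    new ObjectName("kafka.server:type=ReplicaManager,name=IsrExpandsPerSec"),
                    "OneMinuteRate");
            System.out.printf("ISR shrinks/sec=%.3f expands/sec=%.3f%n", shrinks, expands);
            if (shrinks > 0.0) {
                // Placeholder alerting logic: investigate follower lag and broker health.
                System.out.println("Non-zero ISR shrink rate detected.");
            }
        }
    }
}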
kafka_join_group_total_time_{75th_percentile, 999th_percentile, 99th_percentile, avg, max, median, min, stddev} | Total Time spent in responding to JoinGroup requests: 75th/999th/99th Percentile, Avg, Max, 50th Percentile, Min, Standard Deviation | ms | cluster, kafka, rack | CDH 5, CDH 6
kafka_join_group_total_time_rate | Total Time spent in responding to JoinGroup requests: Samples | requests per second | cluster, kafka, rack | CDH 5, CDH 6
kafka_leader_and_isr_local_time_{75th_percentile, 999th_percentile, 99th_percentile, avg, max, median, min, stddev} | Local Time spent in responding to LeaderAndIsr requests: 75th/999th/99th Percentile, Avg, Max, 50th Percentile, Min, Standard Deviation | ms | cluster, kafka, rack | CDH 5, CDH 6
kafka_leader_and_isr_local_time_rate | Local Time spent in responding to LeaderAndIsr requests: Samples | requests per second | cluster, kafka, rack | CDH 5, CDH 6
kafka_leader_and_isr_remote_time_{75th_percentile, 999th_percentile, 99th_percentile, avg, max, median, min, stddev} | Remote Time spent in responding to LeaderAndIsr requests: 75th/999th/99th Percentile, Avg, Max, 50th Percentile, Min, Standard Deviation | ms | cluster, kafka, rack | CDH 5, CDH 6
kafka_leader_and_isr_remote_time_rate | Remote Time spent in responding to LeaderAndIsr requests: Samples | requests per second | cluster, kafka, rack | CDH 5, CDH 6
kafka_leader_and_isr_request_queue_time_{75th_percentile, 999th_percentile, 99th_percentile, avg, max, median, min, stddev} | Request Queue Time spent in responding to LeaderAndIsr requests: 75th/999th/99th Percentile, Avg, Max, 50th Percentile, Min, Standard Deviation | ms | cluster, kafka, rack | CDH 5, CDH 6
kafka_leader_and_isr_request_queue_time_rate | Request Queue Time spent in responding to LeaderAndIsr requests: Samples | requests per second | cluster, kafka, rack | CDH 5, CDH 6
kafka_leader_and_isr_requests_{15min_rate, 1min_rate, 5min_rate, avg_rate} | Number of LeaderAndIsr requests: 15 Min Rate, 1 Min Rate, 5 Min Rate, Avg Rate | requests per message.units.singular.second | cluster, kafka, rack | CDH 5, CDH 6
kafka_leader_and_isr_requests_rate | Number of LeaderAndIsr requests | requests per second | cluster, kafka, rack | CDH 5, CDH 6
kafka_leader_and_isr_response_queue_time_{75th_percentile, 999th_percentile, 99th_percentile, avg, max, median, min, stddev} | Response Queue Time spent in responding to LeaderAndIsr requests: 75th/999th/99th Percentile, Avg, Max, 50th Percentile, Min, Standard Deviation | ms | cluster, kafka, rack | CDH 5, CDH 6
kafka_leader_and_isr_response_queue_time_rate | Response Queue Time spent in responding to LeaderAndIsr requests: Samples | requests per second | cluster, kafka, rack | CDH 5, CDH 6
kafka_leader_and_isr_response_send_time_{75th_percentile, 999th_percentile, 99th_percentile, avg, max, median, min, stddev} | Response Send Time spent in responding to LeaderAndIsr requests: 75th/999th/99th Percentile, Avg, Max, 50th Percentile, Min, Standard Deviation | ms | cluster, kafka, rack | CDH 5, CDH 6
kafka_leader_and_isr_response_send_time_rate | Response Send Time spent in responding to LeaderAndIsr requests: Samples | requests per second | cluster, kafka, rack | CDH 5, CDH 6
kafka_leader_and_isr_total_time_75th_percentile | Total Time spent in responding to LeaderAndIsr requests: 75th Percentile | ms | cluster, kafka, rack | CDH
5, CDH 6 122 | Apache Kafka Guide Description Kafka Reference Metric Name Description Unit Parents CDH Version kafka_leader_and_isr_ Total Time spent in responding ms to LeaderAndIsr requests: 999th total_time_999th_ Percentile percentile cluster, kafka, rack CDH 5, CDH 6 kafka_leader_and_isr_ Total Time spent in responding ms to LeaderAndIsr requests: 99th total_time_99th_ Percentile percentile cluster, kafka, rack CDH 5, CDH 6 kafka_leader_and_isr_ Total Time spent in responding to LeaderAndIsr requests: Avg total_time_avg ms cluster, kafka, rack CDH 5, CDH 6 kafka_leader_and_isr_ Total Time spent in responding to LeaderAndIsr requests: Max total_time_max ms cluster, kafka, rack CDH 5, CDH 6 kafka_leader_and_isr_ Total Time spent in responding ms to LeaderAndIsr requests: 50th total_time_median Percentile cluster, kafka, rack CDH 5, CDH 6 kafka_leader_and_isr_ Total Time spent in responding to LeaderAndIsr requests: Min total_time_min ms cluster, kafka, rack CDH 5, CDH 6 kafka_leader_and_isr_ Total Time spent in responding to LeaderAndIsr requests: total_time_rate Samples requests per second cluster, kafka, rack CDH 5, CDH 6 kafka_leader_and_isr_ Total Time spent in responding to LeaderAndIsr requests: total_time_stddev Standard Deviation ms cluster, kafka, rack CDH 5, CDH 6 kafka_leader_election_ Leader elections: 15 Min Rate 15min_rate message.units. cluster, kafka, elections per rack message.units. singular.second CDH 5, CDH 6 kafka_leader_election_ Leader elections: 1 Min Rate 1min_rate message.units. cluster, kafka, elections per rack message.units. singular.second CDH 5, CDH 6 kafka_leader_election_ Leader elections: 5 Min Rate 5min_rate message.units. cluster, kafka, elections per rack message.units. singular.second CDH 5, CDH 6 kafka_leader_election_ Leader elections: 75th Percentile ms 75th_percentile cluster, kafka, rack CDH 5, CDH 6 kafka_leader_election_ Leader elections: 999th Percentile 999th_percentile ms cluster, kafka, rack CDH 5, CDH 6 kafka_leader_election_ Leader elections: 99th Percentile ms 99th_percentile cluster, kafka, rack CDH 5, CDH 6 kafka_leader_election_ Leader elections: Avg avg ms cluster, kafka, rack CDH 5, CDH 6 kafka_leader_election_ Leader elections: Max max ms cluster, kafka, rack CDH 5, CDH 6 kafka_leader_election_ Leader elections: 50th Percentile ms median cluster, kafka, rack CDH 5, CDH 6 Apache Kafka Guide | 123 Kafka Reference Metric Name Description Unit Parents CDH Version kafka_leader_election_ Leader elections: Min min ms cluster, kafka, rack CDH 5, CDH 6 kafka_leader_election_ Leader elections: Samples rate message.units. cluster, kafka, elections per rack second CDH 5, CDH 6 kafka_leader_election_ Leader elections: Standard Deviation stddev ms cluster, kafka, rack CDH 5, CDH 6 kafka_leader_replicas Number of leader replicas on broker replicas cluster, kafka, rack CDH 5, CDH 6 kafka_log_flush_15min_ Rate of flushing Kafka logs to disk: 15 Min Rate rate message.units. cluster, kafka, flushes per rack message.units. singular.second CDH 5, CDH 6 kafka_log_flush_1min_ Rate of flushing Kafka logs to disk: 1 Min Rate rate message.units. cluster, kafka, flushes per rack message.units. singular.second CDH 5, CDH 6 kafka_log_flush_5min_ Rate of flushing Kafka logs to disk: 5 Min Rate rate message.units. cluster, kafka, flushes per rack message.units. 
kafka_log_flush_15min_rate | Rate of flushing Kafka logs to disk: 15 Min Rate | flushes per second
kafka_log_flush_1min_rate | Rate of flushing Kafka logs to disk: 1 Min Rate | flushes per second
kafka_log_flush_5min_rate | Rate of flushing Kafka logs to disk: 5 Min Rate | flushes per second
kafka_log_flush_* | Rate of flushing Kafka logs to disk: <statistic> | ms (rate: flushes per second)
kafka_max_replication_lag | Maximum replication lag on broker, across all fetchers, topics and partitions | messages
kafka_memory_heap_committed | JVM heap committed memory | bytes
kafka_memory_heap_init | JVM heap initial memory | bytes
kafka_memory_heap_max | JVM heap max used memory | bytes
kafka_memory_heap_used | JVM heap used memory | bytes
kafka_memory_total_committed | JVM heap and non-heap committed memory | bytes
kafka_memory_total_init | JVM heap and non-heap initial memory | bytes
kafka_memory_total_max | JVM heap and non-heap max initial memory | bytes
kafka_memory_total_used | JVM heap and non-heap used memory | bytes
kafka_messages_received_(rates) | Number of messages written to topic on this broker | messages per second
kafka_metadata_local_time_* | Local Time spent in responding to Metadata requests: <statistic> | ms (rate: requests per second)
kafka_metadata_remote_time_* | Remote Time spent in responding to Metadata requests: <statistic> | ms (rate: requests per second)
kafka_metadata_request_queue_time_* | Request Queue Time spent in responding to Metadata requests: <statistic> | ms (rate: requests per second)
kafka_metadata_requests_(rates) | Number of Metadata requests | requests per second
kafka_metadata_response_queue_time_* | Response Queue Time spent in responding to Metadata requests: <statistic> | ms (rate: requests per second)
kafka_metadata_response_send_time_* | Response Send Time spent in responding to Metadata requests: <statistic> | ms (rate: requests per second)
kafka_metadata_total_time_* | Total Time spent in responding to Metadata requests: <statistic> | ms (rate: requests per second)
kafka_network_processor_avg_idle_(rates) | The average free capacity of the network processors | percent_idle per nanosecond (rate: percent_idle per second)
kafka_offline_partitions | Number of unavailable partitions | partitions
kafka_min_replication_rate | Minimum replication rate, across all fetchers, topics and partitions. Measured in average fetch requests per sec in the last minute | fetch_requests per second
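Most of the broker metrics above map onto Kafka's standard JMX MBeans, so they can also be inspected directly over JMX. The snippet below is a minimal sketch, not Cloudera Manager's own collection path: the broker host broker-1.example.com and JMX port 9393 are placeholder assumptions (use whatever JMX port is configured for the broker role), and it reads the standard kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec bean, whose one-minute rate roughly corresponds to kafka_messages_received_1min_rate.

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class BrokerMetricProbe {
    public static void main(String[] args) throws Exception {
        // Placeholder broker host and JMX port; adjust to the broker's configured JMX endpoint.
        String url = "service:jmx:rmi:///jndi/rmi://broker-1.example.com:9393/jmxrmi";
        JMXConnector connector = JMXConnectorFactory.connect(new JMXServiceURL(url));
        try {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();
            // Broker-wide messages-in meter; its OneMinuteRate attribute is the
            // 1-minute rate of messages written to topics on this broker.
            ObjectName messagesIn = new ObjectName(
                    "kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec");
            Object oneMinuteRate = mbsc.getAttribute(messagesIn, "OneMinuteRate");
            System.out.println("MessagesInPerSec (1 min rate): " + oneMinuteRate);
        } finally {
            connector.close();
        }
    }
}

The same pattern works for the request-latency beans, for example kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce, which expose percentile, mean, and standard-deviation attributes matching the statistic families listed in this table.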
kafka_offset_commit_local_time_* | Local Time spent in responding to OffsetCommit requests: <statistic> | ms (rate: requests per second)
kafka_offset_commit_remote_time_* | Remote Time spent in responding to OffsetCommit requests: <statistic> | ms (rate: requests per second)
kafka_offset_commit_request_queue_time_* | Request Queue Time spent in responding to OffsetCommit requests: <statistic> | ms (rate: requests per second)
kafka_offset_commit_requests_(rates) | Number of OffsetCommit requests | requests per second
kafka_offset_commit_response_queue_time_* | Response Queue Time spent in responding to OffsetCommit requests: <statistic> | ms (rate: requests per second)
kafka_offset_commit_response_send_time_* | Response Send Time spent in responding to OffsetCommit requests: <statistic> | ms (rate: requests per second)
kafka_offset_commit_total_time_* | Total Time spent in responding to OffsetCommit requests: <statistic> | ms (rate: requests per second)
kafka_offset_fetch_local_time_* | Local Time spent in responding to OffsetFetch requests: <statistic> | ms (rate: requests per second)
kafka_offset_fetch_remote_time_* | Remote Time spent in responding to OffsetFetch requests: <statistic> | ms (rate: requests per second)
kafka_offset_fetch_request_queue_time_* | Request Queue Time spent in responding to OffsetFetch requests: <statistic> | ms (rate: requests per second)
kafka_offset_fetch_requests_(rates) | Number of OffsetFetch requests | requests per second
kafka_offset_fetch_response_queue_time_* | Response Queue Time spent in responding to OffsetFetch requests: <statistic> | ms (rate: requests per second)
kafka_offset_fetch_response_send_time_* | Response Send Time spent in responding to OffsetFetch requests: <statistic> | ms (rate: requests per second)
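The OffsetCommit and OffsetFetch request metrics above are driven by consumers committing and fetching their positions. As an illustrative sketch only (the broker address, group id, and topic name are placeholder assumptions), the consumer below disables automatic commits and calls commitSync(), so each commit arrives at the broker as an OffsetCommit request and is reflected in the kafka_offset_commit_* latencies.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class CommittingConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker-1.example.com:9092"); // placeholder host:port
        props.put("group.id", "metrics-demo");                       // placeholder group
        props.put("enable.auto.commit", "false");                    // commit explicitly
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("test-topic")); // placeholder topic
            for (int i = 0; i < 10; i++) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                }
                // Each synchronous commit is an OffsetCommit request on the broker.
                consumer.commitSync();
            }
        }
    }
}

With automatic commits enabled instead, commits happen on the poll cadence, but they are still OffsetCommit requests as far as these metrics are concerned.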
kafka_offset_fetch_total_time_* | Total Time spent in responding to OffsetFetch requests: <statistic> | ms (rate: requests per second)
kafka_offsets | The size of the offsets cache | groups
kafka_offsets_groups | The number of consumer groups in the offsets cache | groups
kafka_offsets_local_time_* | Local Time spent in responding to Offsets requests: <statistic> | ms (rate: requests per second)
kafka_offsets_remote_time_* | Remote Time spent in responding to Offsets requests: <statistic> | ms (rate: requests per second)
kafka_offsets_request_queue_time_* | Request Queue Time spent in responding to Offsets requests: <statistic> | ms (rate: requests per second)
kafka_offsets_requests_(rates) | Number of Offsets requests | requests per second
kafka_offsets_response_queue_time_* | Response Queue Time spent in responding to Offsets requests: <statistic> | ms (rate: requests per second)
kafka_offsets_response_send_time_* | Response Send Time spent in responding to Offsets requests: <statistic> | ms (rate: requests per second)
kafka_offsets_total_time_* | Total Time spent in responding to Offsets requests: <statistic> | ms (rate: requests per second)
kafka_partitions | Number of partitions (lead or follower replicas) on broker | partitions
kafka_preferred_replica_imbalance | Number of partitions where the lead replica is not the preferred replica | partitions
kafka_produce_local_time_* | Local Time spent in responding to Produce requests: <statistic> | ms (rate: requests per second)
kafka_produce_remote_time_* | Remote Time spent in responding to Produce requests: <statistic> | ms (rate: requests per second)
kafka_produce_request_queue_time_* | Request Queue Time spent in responding to Produce requests: <statistic> | ms (rate: requests per second)
kafka_produce_requests_(rates) | Number of Produce requests | requests per second
kafka_produce_response_queue_time_* | Response Queue Time spent in responding to Produce requests: <statistic> | ms (rate: requests per second)
kafka_produce_response_send_time_* | Response Send Time spent in responding to Produce requests: <statistic> | ms (rate: requests per second)
kafka_produce_total_time_* | Total Time spent in responding to Produce requests: <statistic> | ms (rate: requests per second)
kafka_producer_expires_(rates) | Number of expired delayed producer requests | requests per second
kafka_producer_purgatory_delayed_requests | Number of requests delayed in the producer purgatory | requests
kafka_producer_purgatory_size | Requests waiting in the producer purgatory. This should be non-zero when acks = -1 is used in producers | requests
kafka_rejected_message_batches_(rates) | Number of message batches sent by producers that the broker rejected for this topic | message_batches per second
kafka_request_handler_avg_idle_(rates) | The average free capacity of the request handler | percent_idle per nanosecond (rate: percent_idle per second)
kafka_request_queue_size | Request Queue Size | requests
kafka_response_queue_size | Response Queue Size | responses
kafka_responses_being_sent | The number of responses being sent by the network processors | responses
kafka_stop_replica_local_time_* | Local Time spent in responding to StopReplica requests: <statistic> | ms (rate: requests per second)
kafka_stop_replica_remote_time_75th_percentile | Remote Time spent in responding to StopReplica requests: 75th Percentile | ms
kafka_stop_replica_remote_time_999th_percentile | Remote Time spent in responding to StopReplica requests: 999th Percentile | ms
kafka_stop_replica_remote_time_99th_percentile | Remote Time spent in responding to StopReplica requests: 99th Percentile | ms
kafka_stop_replica_remote_time_avg | Remote Time spent in responding to StopReplica requests: Avg | ms
kafka_stop_replica_remote_time_max | Remote Time spent in responding to StopReplica requests: Max | ms
kafka_stop_replica_remote_time_median | Remote Time spent in responding to StopReplica requests: 50th Percentile | ms
kafka_stop_replica_remote_time_min | Remote Time spent in responding to StopReplica requests: Min | ms
kafka_stop_replica_remote_time_rate | Remote Time spent in responding to StopReplica requests: Samples | requests per second
147 Kafka Reference Metric Name Description Unit Parents CDH Version kafka_stop_replica_ remote_time_stddev Remote Time spent in responding to StopReplica requests: Standard Deviation ms cluster, kafka, rack CDH 5, CDH 6 kafka_stop_replica_ request_queue_time_ 75th_percentile Request Queue Time spent in responding to StopReplica requests: 75th Percentile ms cluster, kafka, rack CDH 5, CDH 6 kafka_stop_replica_ request_queue_time_ 999th_percentile Request Queue Time spent in responding to StopReplica requests: 999th Percentile ms cluster, kafka, rack CDH 5, CDH 6 kafka_stop_replica_ request_queue_time_ 99th_percentile Request Queue Time spent in responding to StopReplica requests: 99th Percentile ms cluster, kafka, rack CDH 5, CDH 6 kafka_stop_replica_ Request Queue Time spent in request_queue_time_avg responding to StopReplica requests: Avg ms cluster, kafka, rack CDH 5, CDH 6 kafka_stop_replica_ Request Queue Time spent in request_queue_time_max responding to StopReplica requests: Max ms cluster, kafka, rack CDH 5, CDH 6 Request Queue Time spent in responding to StopReplica requests: 50th Percentile ms cluster, kafka, rack CDH 5, CDH 6 kafka_stop_replica_ Request Queue Time spent in request_queue_time_min responding to StopReplica requests: Min ms cluster, kafka, rack CDH 5, CDH 6 kafka_stop_replica_ request_queue_time_ median kafka_stop_replica_ request_queue_time_ rate Request Queue Time spent in responding to StopReplica requests: Samples requests per second cluster, kafka, rack CDH 5, CDH 6 kafka_stop_replica_ request_queue_time_ stddev Request Queue Time spent in responding to StopReplica requests: Standard Deviation ms cluster, kafka, rack CDH 5, CDH 6 kafka_stop_replica_ requests_15min_rate Number of StopReplica requests: requests per cluster, kafka, 15 Min Rate message.units. rack singular.second CDH 5, CDH 6 kafka_stop_replica_ requests_1min_rate Number of StopReplica requests: requests per cluster, kafka, 1 Min Rate message.units. rack singular.second CDH 5, CDH 6 kafka_stop_replica_ requests_5min_rate Number of StopReplica requests: requests per cluster, kafka, 5 Min Rate message.units. rack singular.second CDH 5, CDH 6 kafka_stop_replica_ requests_avg_rate Number of StopReplica requests: requests per cluster, kafka, Avg Rate message.units. 
rack singular.second CDH 5, CDH 6 kafka_stop_replica_ requests_rate Number of StopReplica requests requests per second CDH 5, CDH 6 148 | Apache Kafka Guide cluster, kafka, rack Kafka Reference Metric Name Description Unit Parents CDH Version kafka_stop_replica_ response_queue_time_ 75th_percentile Response Queue Time spent in responding to StopReplica requests: 75th Percentile ms cluster, kafka, rack CDH 5, CDH 6 kafka_stop_replica_ response_queue_time_ 999th_percentile Response Queue Time spent in responding to StopReplica requests: 999th Percentile ms cluster, kafka, rack CDH 5, CDH 6 kafka_stop_replica_ response_queue_time_ 99th_percentile Response Queue Time spent in responding to StopReplica requests: 99th Percentile ms cluster, kafka, rack CDH 5, CDH 6 kafka_stop_replica_ response_queue_time_ avg Response Queue Time spent in responding to StopReplica requests: Avg ms cluster, kafka, rack CDH 5, CDH 6 kafka_stop_replica_ response_queue_time_ max Response Queue Time spent in responding to StopReplica requests: Max ms cluster, kafka, rack CDH 5, CDH 6 kafka_stop_replica_ response_queue_time_ median Response Queue Time spent in responding to StopReplica requests: 50th Percentile ms cluster, kafka, rack CDH 5, CDH 6 kafka_stop_replica_ response_queue_time_ min Response Queue Time spent in responding to StopReplica requests: Min ms cluster, kafka, rack CDH 5, CDH 6 kafka_stop_replica_ response_queue_time_ rate Response Queue Time spent in responding to StopReplica requests: Samples requests per second cluster, kafka, rack CDH 5, CDH 6 kafka_stop_replica_ response_queue_time_ stddev Response Queue Time spent in responding to StopReplica requests: Standard Deviation ms cluster, kafka, rack CDH 5, CDH 6 kafka_stop_replica_ response_send_time_ 75th_percentile Response Send Time spent in responding to StopReplica requests: 75th Percentile ms cluster, kafka, rack CDH 5, CDH 6 kafka_stop_replica_ response_send_time_ 999th_percentile Response Send Time spent in responding to StopReplica requests: 999th Percentile ms cluster, kafka, rack CDH 5, CDH 6 kafka_stop_replica_ response_send_time_ 99th_percentile Response Send Time spent in responding to StopReplica requests: 99th Percentile ms cluster, kafka, rack CDH 5, CDH 6 kafka_stop_replica_ Response Send Time spent in response_send_time_avg responding to StopReplica requests: Avg ms cluster, kafka, rack CDH 5, CDH 6 kafka_stop_replica_ Response Send Time spent in response_send_time_max responding to StopReplica requests: Max ms cluster, kafka, rack CDH 5, CDH 6 Response Send Time spent in responding to StopReplica requests: 50th Percentile ms cluster, kafka, rack CDH 5, CDH 6 kafka_stop_replica_ response_send_time_ median Apache Kafka Guide | 149 Kafka Reference Metric Name Description kafka_stop_replica_ Response Send Time spent in response_send_time_min responding to StopReplica requests: Min Unit Parents CDH Version ms cluster, kafka, rack CDH 5, CDH 6 kafka_stop_replica_ response_send_time_ rate Response Send Time spent in responding to StopReplica requests: Samples requests per second cluster, kafka, rack CDH 5, CDH 6 kafka_stop_replica_ response_send_time_ stddev Response Send Time spent in responding to StopReplica requests: Standard Deviation ms cluster, kafka, rack CDH 5, CDH 6 kafka_stop_replica_ total_time_75th_ percentile Total Time spent in responding to StopReplica requests: 75th Percentile ms cluster, kafka, rack CDH 5, CDH 6 kafka_stop_replica_ total_time_999th_ percentile Total Time spent in responding to StopReplica requests: 
999th Percentile ms cluster, kafka, rack CDH 5, CDH 6 kafka_stop_replica_ total_time_99th_ percentile Total Time spent in responding to StopReplica requests: 99th Percentile ms cluster, kafka, rack CDH 5, CDH 6 kafka_stop_replica_ total_time_avg Total Time spent in responding to StopReplica requests: Avg ms cluster, kafka, rack CDH 5, CDH 6 kafka_stop_replica_ total_time_max Total Time spent in responding to StopReplica requests: Max ms cluster, kafka, rack CDH 5, CDH 6 kafka_stop_replica_ total_time_median Total Time spent in responding to StopReplica requests: 50th Percentile ms cluster, kafka, rack CDH 5, CDH 6 kafka_stop_replica_ total_time_min Total Time spent in responding to StopReplica requests: Min ms cluster, kafka, rack CDH 5, CDH 6 kafka_stop_replica_ total_time_rate Total Time spent in responding requests per to StopReplica requests: Samples second cluster, kafka, rack CDH 5, CDH 6 kafka_stop_replica_ total_time_stddev Total Time spent in responding to StopReplica requests: Standard Deviation cluster, kafka, rack CDH 5, CDH 6 kafka_thread_count JVM daemon and non-daemon threads thread count cluster, kafka, rack CDH 5, CDH 6 kafka_unclean_leader_ Unclean leader elections. Cloudera recommends disabling elections_15min_rate unclean leader elections, to avoid potential data loss, so this should be 0: 15 Min Rate message.units. cluster, kafka, elections per rack message.units. singular.second CDH 5, CDH 6 kafka_unclean_leader_ Unclean leader elections. Cloudera recommends disabling elections_1min_rate unclean leader elections, to avoid potential data loss, so this should be 0: 1 Min Rate message.units. cluster, kafka, elections per rack message.units. singular.second CDH 5, CDH 6 150 | Apache Kafka Guide ms Kafka Reference Metric Name Description Unit Parents CDH Version kafka_unclean_leader_ Unclean leader elections. Cloudera recommends disabling elections_5min_rate unclean leader elections, to avoid potential data loss, so this should be 0: 5 Min Rate message.units. cluster, kafka, elections per rack message.units. singular.second CDH 5, CDH 6 kafka_unclean_leader_ Unclean leader elections. Cloudera recommends disabling elections_avg_rate unclean leader elections, to avoid potential data loss, so this should be 0: Avg Rate message.units. cluster, kafka, elections per rack message.units. singular.second CDH 5, CDH 6 kafka_unclean_leader_ Unclean leader elections. message.units. 
cluster, kafka, Cloudera recommends disabling elections per rack elections_rate unclean leader elections, to second avoid potential data loss, so this should be 0 CDH 5, CDH 6 kafka_under_ Number of partitions with replicated_partitions unavailable replicas cluster, kafka, rack CDH 5, CDH 6 kafka_update_metadata_ Local Time spent in responding ms to UpdateMetadata requests: local_time_75th_ 75th Percentile percentile cluster, kafka, rack CDH 5, CDH 6 kafka_update_metadata_ Local Time spent in responding ms to UpdateMetadata requests: local_time_999th_ 999th Percentile percentile cluster, kafka, rack CDH 5, CDH 6 kafka_update_metadata_ Local Time spent in responding ms to UpdateMetadata requests: local_time_99th_ 99th Percentile percentile cluster, kafka, rack CDH 5, CDH 6 kafka_update_metadata_ Local Time spent in responding ms to UpdateMetadata requests: local_time_avg Avg cluster, kafka, rack CDH 5, CDH 6 kafka_update_metadata_ Local Time spent in responding ms to UpdateMetadata requests: local_time_max Max cluster, kafka, rack CDH 5, CDH 6 kafka_update_metadata_ Local Time spent in responding ms to UpdateMetadata requests: local_time_median 50th Percentile cluster, kafka, rack CDH 5, CDH 6 kafka_update_metadata_ Local Time spent in responding ms to UpdateMetadata requests: local_time_min Min cluster, kafka, rack CDH 5, CDH 6 kafka_update_metadata_ Local Time spent in responding requests per to UpdateMetadata requests: second local_time_rate Samples cluster, kafka, rack CDH 5, CDH 6 kafka_update_metadata_ Local Time spent in responding ms to UpdateMetadata requests: local_time_stddev Standard Deviation cluster, kafka, rack CDH 5, CDH 6 kafka_update_metadata_ Remote Time spent in ms responding to UpdateMetadata remote_time_75th_ requests: 75th Percentile percentile cluster, kafka, rack CDH 5, CDH 6 partitions Apache Kafka Guide | 151 Kafka Reference Metric Name Parents CDH Version kafka_update_metadata_ Remote Time spent in ms responding to UpdateMetadata remote_time_999th_ requests: 999th Percentile percentile cluster, kafka, rack CDH 5, CDH 6 kafka_update_metadata_ Remote Time spent in ms responding to UpdateMetadata remote_time_99th_ requests: 99th Percentile percentile cluster, kafka, rack CDH 5, CDH 6 kafka_update_metadata_ Remote Time spent in ms responding to UpdateMetadata remote_time_avg requests: Avg cluster, kafka, rack CDH 5, CDH 6 kafka_update_metadata_ Remote Time spent in ms responding to UpdateMetadata remote_time_max requests: Max cluster, kafka, rack CDH 5, CDH 6 kafka_update_metadata_ Remote Time spent in ms responding to UpdateMetadata remote_time_median requests: 50th Percentile cluster, kafka, rack CDH 5, CDH 6 kafka_update_metadata_ Remote Time spent in ms responding to UpdateMetadata remote_time_min requests: Min cluster, kafka, rack CDH 5, CDH 6 kafka_update_metadata_ Remote Time spent in requests per responding to UpdateMetadata second remote_time_rate requests: Samples cluster, kafka, rack CDH 5, CDH 6 kafka_update_metadata_ Remote Time spent in ms responding to UpdateMetadata remote_time_stddev requests: Standard Deviation cluster, kafka, rack CDH 5, CDH 6 kafka_update_metadata_ Request Queue Time spent in ms responding to UpdateMetadata request_queue_time_ requests: 75th Percentile 75th_percentile cluster, kafka, rack CDH 5, CDH 6 kafka_update_metadata_ Request Queue Time spent in ms responding to UpdateMetadata request_queue_time_ requests: 999th Percentile 999th_percentile cluster, kafka, rack CDH 5, CDH 6 kafka_update_metadata_ Request Queue Time spent in 
ms responding to UpdateMetadata request_queue_time_ requests: 99th Percentile 99th_percentile cluster, kafka, rack CDH 5, CDH 6 kafka_update_metadata_ Request Queue Time spent in ms request_queue_time_avg responding to UpdateMetadata requests: Avg cluster, kafka, rack CDH 5, CDH 6 kafka_update_metadata_ Request Queue Time spent in ms request_queue_time_max responding to UpdateMetadata requests: Max cluster, kafka, rack CDH 5, CDH 6 kafka_update_metadata_ Request Queue Time spent in ms responding to UpdateMetadata request_queue_time_ requests: 50th Percentile median cluster, kafka, rack CDH 5, CDH 6 kafka_update_metadata_ Request Queue Time spent in ms request_queue_time_min responding to UpdateMetadata requests: Min cluster, kafka, rack CDH 5, CDH 6 152 | Apache Kafka Guide Description Unit Kafka Reference Metric Name Description Unit Parents CDH Version kafka_update_metadata_ Request Queue Time spent in requests per responding to UpdateMetadata second request_queue_time_ requests: Samples rate cluster, kafka, rack CDH 5, CDH 6 kafka_update_metadata_ Request Queue Time spent in ms responding to UpdateMetadata request_queue_time_ requests: Standard Deviation stddev cluster, kafka, rack CDH 5, CDH 6 kafka_update_metadata_ Number of UpdateMetadata requests: 15 Min Rate requests_15min_rate requests per cluster, kafka, message.units. rack singular.second CDH 5, CDH 6 kafka_update_metadata_ Number of UpdateMetadata requests: 1 Min Rate requests_1min_rate requests per cluster, kafka, message.units. rack singular.second CDH 5, CDH 6 kafka_update_metadata_ Number of UpdateMetadata requests: 5 Min Rate requests_5min_rate requests per cluster, kafka, message.units. rack singular.second CDH 5, CDH 6 kafka_update_metadata_ Number of UpdateMetadata requests: Avg Rate requests_avg_rate requests per cluster, kafka, message.units. 
rack singular.second CDH 5, CDH 6 kafka_update_metadata_ Number of UpdateMetadata requests requests_rate requests per second cluster, kafka, rack CDH 5, CDH 6 kafka_update_metadata_ Response Queue Time spent in ms responding to UpdateMetadata response_queue_time_ requests: 75th Percentile 75th_percentile cluster, kafka, rack CDH 5, CDH 6 kafka_update_metadata_ Response Queue Time spent in ms responding to UpdateMetadata response_queue_time_ requests: 999th Percentile 999th_percentile cluster, kafka, rack CDH 5, CDH 6 kafka_update_metadata_ Response Queue Time spent in ms responding to UpdateMetadata response_queue_time_ requests: 99th Percentile 99th_percentile cluster, kafka, rack CDH 5, CDH 6 kafka_update_metadata_ Response Queue Time spent in ms responding to UpdateMetadata response_queue_time_ requests: Avg avg cluster, kafka, rack CDH 5, CDH 6 kafka_update_metadata_ Response Queue Time spent in ms responding to UpdateMetadata response_queue_time_ requests: Max max cluster, kafka, rack CDH 5, CDH 6 kafka_update_metadata_ Response Queue Time spent in ms responding to UpdateMetadata response_queue_time_ requests: 50th Percentile median cluster, kafka, rack CDH 5, CDH 6 kafka_update_metadata_ Response Queue Time spent in ms responding to UpdateMetadata response_queue_time_ requests: Min min cluster, kafka, rack CDH 5, CDH 6 kafka_update_metadata_ Response Queue Time spent in requests per responding to UpdateMetadata second response_queue_time_ requests: Samples rate cluster, kafka, rack CDH 5, CDH 6 Apache Kafka Guide | 153 Kafka Reference Metric Name Parents CDH Version kafka_update_metadata_ Response Queue Time spent in ms responding to UpdateMetadata response_queue_time_ requests: Standard Deviation stddev cluster, kafka, rack CDH 5, CDH 6 kafka_update_metadata_ Response Send Time spent in ms responding to UpdateMetadata response_send_time_ requests: 75th Percentile 75th_percentile cluster, kafka, rack CDH 5, CDH 6 kafka_update_metadata_ Response Send Time spent in ms responding to UpdateMetadata response_send_time_ requests: 999th Percentile 999th_percentile cluster, kafka, rack CDH 5, CDH 6 kafka_update_metadata_ Response Send Time spent in ms responding to UpdateMetadata response_send_time_ requests: 99th Percentile 99th_percentile cluster, kafka, rack CDH 5, CDH 6 kafka_update_metadata_ Response Send Time spent in ms response_send_time_avg responding to UpdateMetadata requests: Avg cluster, kafka, rack CDH 5, CDH 6 kafka_update_metadata_ Response Send Time spent in ms response_send_time_max responding to UpdateMetadata requests: Max cluster, kafka, rack CDH 5, CDH 6 kafka_update_metadata_ Response Send Time spent in ms responding to UpdateMetadata response_send_time_ requests: 50th Percentile median cluster, kafka, rack CDH 5, CDH 6 kafka_update_metadata_ Response Send Time spent in ms response_send_time_min responding to UpdateMetadata requests: Min cluster, kafka, rack CDH 5, CDH 6 kafka_update_metadata_ Response Send Time spent in requests per responding to UpdateMetadata second response_send_time_ requests: Samples rate cluster, kafka, rack CDH 5, CDH 6 kafka_update_metadata_ Response Send Time spent in ms responding to UpdateMetadata response_send_time_ requests: Standard Deviation stddev cluster, kafka, rack CDH 5, CDH 6 kafka_update_metadata_ Total Time spent in responding to UpdateMetadata requests: total_time_75th_ 75th Percentile percentile ms cluster, kafka, rack CDH 5, CDH 6 kafka_update_metadata_ Total Time spent in responding to UpdateMetadata requests: 
total_time_999th_ 999th Percentile percentile ms cluster, kafka, rack CDH 5, CDH 6 kafka_update_metadata_ Total Time spent in responding to UpdateMetadata requests: total_time_99th_ 99th Percentile percentile ms cluster, kafka, rack CDH 5, CDH 6 kafka_update_metadata_ Total Time spent in responding to UpdateMetadata requests: total_time_avg Avg ms cluster, kafka, rack CDH 5, CDH 6 kafka_update_metadata_ Total Time spent in responding to UpdateMetadata requests: total_time_max Max ms cluster, kafka, rack CDH 5, CDH 6 154 | Apache Kafka Guide Description Unit Kafka Reference Metric Name Description Unit Parents CDH Version kafka_update_metadata_ Total Time spent in responding to UpdateMetadata requests: total_time_median 50th Percentile ms cluster, kafka, rack CDH 5, CDH 6 kafka_update_metadata_ Total Time spent in responding to UpdateMetadata requests: total_time_min Min ms cluster, kafka, rack CDH 5, CDH 6 kafka_update_metadata_ Total Time spent in responding to UpdateMetadata requests: total_time_rate Samples requests per second cluster, kafka, rack CDH 5, CDH 6 kafka_update_metadata_ Total Time spent in responding to UpdateMetadata requests: total_time_stddev Standard Deviation ms cluster, kafka, rack CDH 5, CDH 6 mem_rss Resident memory used bytes cluster, kafka, rack CDH 5, CDH 6 mem_swap Amount of swap memory used by this role's process. bytes cluster, kafka, rack CDH 5, CDH 6 mem_virtual Virtual memory used bytes cluster, kafka, rack CDH 5, CDH 6 oom_exits_rate The number of times the role's exits per backing process was killed due second to an OutOfMemory error. This counter is only incremented if the Cloudera Manager "Kill When Out of Memory" option is enabled. cluster, kafka, rack CDH 5, CDH 6 read_bytes_rate The number of bytes read from bytes per the device second cluster, kafka, rack CDH 5, CDH 6 exits per second cluster, kafka, rack CDH 5, CDH 6 uptime For a host, the amount of time seconds since the host was booted. For a role, the uptime of the backing process. cluster, kafka, rack CDH 5, CDH 6 write_bytes_rate The number of bytes written to bytes per the device second cluster, kafka, rack CDH 5, CDH 6 unexpected_exits_rate The number of times the role's backing process exited unexpectedly. Broker Topic Metrics Metric Name Description Unit kafka_bytes_ fetched_15min_rate Amount of data consumers bytes per fetched from this topic on message. this broker: 15 Min Rate units. singular. second Parents CDH Version cluster, kafka, kafka-kafka_broker, kafka_topic, rack CDH 5, CDH 6 Apache Kafka Guide | 155 Kafka Reference Metric Name Description kafka_bytes_ fetched_1min_rate Unit Parents CDH Version Amount of data consumers bytes per fetched from this topic on message. this broker: 1 Min Rate units. singular. second cluster, kafka, kafka-kafka_broker, kafka_topic, rack CDH 5, CDH 6 kafka_bytes_ fetched_5min_rate Amount of data consumers bytes per fetched from this topic on message. this broker: 5 Min Rate units. singular. second cluster, kafka, kafka-kafka_broker, kafka_topic, rack CDH 5, CDH 6 kafka_bytes_ fetched_avg_rate Amount of data consumers bytes per fetched from this topic on message. this broker: Avg Rate units. singular. second cluster, kafka, kafka-kafka_broker, kafka_topic, rack CDH 5, CDH 6 kafka_bytes_ fetched_rate Amount of data consumers bytes per fetched from this topic on second this broker cluster, kafka, kafka-kafka_broker, kafka_topic, rack CDH 5, CDH 6 kafka_bytes_ Amount of data written to bytes per received_15min_rate topic on this broker: 15 Min message. 
Rate units. singular. second cluster, kafka, kafka-kafka_broker, kafka_topic, rack CDH 5, CDH 6 kafka_bytes_ received_1min_rate Amount of data written to topic on this broker: 1 Min Rate bytes per message. units. singular. second cluster, kafka, kafka-kafka_broker, kafka_topic, rack CDH 5, CDH 6 kafka_bytes_ received_5min_rate Amount of data written to topic on this broker: 5 Min Rate bytes per message. units. singular. second cluster, kafka, kafka-kafka_broker, kafka_topic, rack CDH 5, CDH 6 kafka_bytes_ received_avg_rate Amount of data written to topic on this broker: Avg Rate bytes per message. units. singular. second cluster, kafka, kafka-kafka_broker, kafka_topic, rack CDH 5, CDH 6 kafka_bytes_ received_rate Amount of data written to topic on this broker bytes per second cluster, kafka, kafka-kafka_broker, kafka_topic, rack CDH 5, CDH 6 kafka_bytes_ Amount of data in messages bytes per rejected_15min_rate rejected by broker for this message. topic: 15 Min Rate units. singular. second cluster, kafka, kafka-kafka_broker, kafka_topic, rack CDH 5, CDH 6 156 | Apache Kafka Guide Kafka Reference Metric Name Description kafka_bytes_ rejected_1min_rate Unit Parents CDH Version Amount of data in messages bytes per rejected by broker for this message. topic: 1 Min Rate units. singular. second cluster, kafka, kafka-kafka_broker, kafka_topic, rack CDH 5, CDH 6 kafka_bytes_ rejected_5min_rate Amount of data in messages bytes per rejected by broker for this message. topic: 5 Min Rate units. singular. second cluster, kafka, kafka-kafka_broker, kafka_topic, rack CDH 5, CDH 6 kafka_bytes_ rejected_avg_rate Amount of data in messages bytes per rejected by broker for this message. topic: Avg Rate units. singular. second cluster, kafka, kafka-kafka_broker, kafka_topic, rack CDH 5, CDH 6 kafka_bytes_ rejected_rate Amount of data in messages bytes per rejected by broker for this second topic cluster, kafka, kafka-kafka_broker, kafka_topic, rack CDH 5, CDH 6 kafka_fetch_ request_failures_ 15min_rate Number of data read requests from consumers that brokers failed to process for this topic: 15 Min Rate message. cluster, kafka, units. kafka-kafka_broker, fetch_requests kafka_topic, rack per message. units. singular. second CDH 5, CDH 6 kafka_fetch_ request_failures_ 1min_rate Number of data read requests from consumers that brokers failed to process for this topic: 1 Min Rate message. cluster, kafka, units. kafka-kafka_broker, fetch_requests kafka_topic, rack per message. units. singular. second CDH 5, CDH 6 kafka_fetch_ request_failures_ 5min_rate Number of data read requests from consumers that brokers failed to process for this topic: 5 Min Rate message. cluster, kafka, units. kafka-kafka_broker, fetch_requests kafka_topic, rack per message. units. singular. second CDH 5, CDH 6 kafka_fetch_ request_failures_ avg_rate Number of data read requests from consumers that brokers failed to process for this topic: Avg Rate message. cluster, kafka, units. kafka-kafka_broker, fetch_requests kafka_topic, rack per message. units. singular. second CDH 5, CDH 6 kafka_fetch_ request_failures_ rate Number of data read requests from consumers that brokers failed to process for this topic message. cluster, kafka, units. 
kafka-kafka_broker, fetch_requests kafka_topic, rack per second CDH 5, CDH 6 Apache Kafka Guide | 157 Kafka Reference Metric Name Description Unit Parents CDH Version kafka_messages_ Number of messages written messages per cluster, kafka, kafka-kafka_broker, received_15min_rate to topic on this broker: 15 message. Min Rate units. kafka_topic, rack singular. second CDH 5, CDH 6 kafka_messages_ received_1min_rate Number of messages written messages per cluster, kafka, to topic on this broker: 1 message. kafka-kafka_broker, Min Rate units. kafka_topic, rack singular. second CDH 5, CDH 6 kafka_messages_ received_5min_rate Number of messages written messages per cluster, kafka, to topic on this broker: 5 message. kafka-kafka_broker, Min Rate units. kafka_topic, rack singular. second CDH 5, CDH 6 kafka_messages_ received_avg_rate Number of messages written messages per cluster, kafka, to topic on this broker: Avg message. kafka-kafka_broker, Rate units. kafka_topic, rack singular. second CDH 5, CDH 6 kafka_messages_ received_rate Number of messages written messages per cluster, kafka, to topic on this broker second kafka-kafka_broker, kafka_topic, rack CDH 5, CDH 6 kafka_rejected_ message_batches_ 15min_rate Number of message batches sent by producers that the broker rejected for this topic: 15 Min Rate message. cluster, kafka, units. kafka-kafka_broker, message_batches kafka_topic, rack per message. units. singular. second CDH 5, CDH 6 kafka_rejected_ message_batches_ 1min_rate Number of message batches sent by producers that the broker rejected for this topic: 1 Min Rate message. cluster, kafka, units. kafka-kafka_broker, message_batches kafka_topic, rack per message. units. singular. second CDH 5, CDH 6 kafka_rejected_ message_batches_ 5min_rate Number of message batches sent by producers that the broker rejected for this topic: 5 Min Rate message. cluster, kafka, units. kafka-kafka_broker, message_batches kafka_topic, rack per message. units. singular. second CDH 5, CDH 6 kafka_rejected_ message_batches_ avg_rate Number of message batches sent by producers that the broker rejected for this topic: Avg Rate message. cluster, kafka, units. kafka-kafka_broker, message_batches kafka_topic, rack per message. units. CDH 5, CDH 6 158 | Apache Kafka Guide Kafka Reference Metric Name Description Unit Parents CDH Version singular. second kafka_rejected_ message_batches_ rate Number of message batches message. cluster, kafka, sent by producers that the units. kafka-kafka_broker, broker rejected for this topic message_batches kafka_topic, rack per second CDH 5, CDH 6 Mirror Maker Metrics Metric Name Description Unit Parents CDH Version alerts_rate The number of alerts. 
events per second cluster, kafka, rack CDH 5, CDH 6 seconds per second cluster, kafka, rack CDH 5, CDH 6 seconds per second cluster, kafka, rack CDH 5, CDH 6 cgroup_mem_page_cache Page cache usage of the role's cgroup bytes cluster, kafka, rack CDH 5, CDH 6 cgroup_mem_rss Resident memory of the role's cgroup bytes cluster, kafka, rack CDH 5, CDH 6 cgroup_mem_swap Swap usage of the role's cgroup bytes cluster, kafka, rack CDH 5, CDH 6 cluster, kafka, rack CDH 5, CDH 6 cgroup_read_ios_rate Number of read I/O operations ios per second cluster, kafka, from all disks by the role's rack cgroup CDH 5, CDH 6 cgroup_write_bytes_ rate Bytes written to all disks by the bytes per role's cgroup second cluster, kafka, rack CDH 5, CDH 6 cgroup_write_ios_rate Number of write I/O operations ios per second cluster, kafka, to all disks by the role's cgroup rack CDH 5, CDH 6 cgroup_cpu_system_rate CPU usage of the role's cgroup cgroup_cpu_user_rate User Space CPU usage of the role's cgroup cgroup_read_bytes_rate Bytes read from all disks by the bytes per role's cgroup second cpu_system_rate Total System CPU seconds per second cluster, kafka, rack CDH 5, CDH 6 cpu_user_rate Total CPU user time seconds per second cluster, kafka, rack CDH 5, CDH 6 events_critical_rate The number of critical events. events per second cluster, kafka, rack CDH 5, CDH 6 events_important_rate The number of important events. events per second cluster, kafka, rack CDH 5, CDH 6 events_informational_ The number of informational events. rate cluster, kafka, rack CDH 5, CDH 6 file descriptors cluster, kafka, rack CDH 5, CDH 6 fd_max Maximum number of file descriptors events per second Apache Kafka Guide | 159 Kafka Reference Metric Name Description Unit fd_open Open file descriptors. file descriptors cluster, kafka, rack CDH 5, CDH 6 health_bad_rate Percentage of Time with Bad Health seconds per second cluster, kafka, rack CDH 5, CDH 6 health_concerning_rate Percentage of Time with Concerning Health seconds per second cluster, kafka, rack CDH 5, CDH 6 health_disabled_rate Percentage of Time with Disabled Health seconds per second cluster, kafka, rack CDH 5, CDH 6 health_good_rate Percentage of Time with Good Health seconds per second cluster, kafka, rack CDH 5, CDH 6 health_unknown_rate Percentage of Time with Unknown Health seconds per second cluster, kafka, rack CDH 5, CDH 6 mem_rss Resident memory used bytes cluster, kafka, rack CDH 5, CDH 6 mem_swap Amount of swap memory used by this role's process. bytes cluster, kafka, rack CDH 5, CDH 6 mem_virtual Virtual memory used bytes cluster, kafka, rack CDH 5, CDH 6 oom_exits_rate The number of times the role's exits per backing process was killed due second to an OutOfMemory error. This counter is only incremented if the Cloudera Manager "Kill When Out of Memory" option is enabled. cluster, kafka, rack CDH 5, CDH 6 read_bytes_rate The number of bytes read from bytes per the device second cluster, kafka, rack CDH 5, CDH 6 exits per second cluster, kafka, rack CDH 5, CDH 6 uptime For a host, the amount of time seconds since the host was booted. For a role, the uptime of the backing process. cluster, kafka, rack CDH 5, CDH 6 write_bytes_rate The number of bytes written to bytes per the device second cluster, kafka, rack CDH 5, CDH 6 unexpected_exits_rate The number of times the role's backing process exited unexpectedly. 
Replica Metrics

kafka_log_end_offset: The offset of the next message that will be appended to the log. Unit: message.units.offset. Parents: cluster, kafka, kafka-kafka_broker, kafka_broker_topic, kafka_topic, rack. CDH Version: CDH 5, CDH 6.
kafka_log_start_offset: The earliest message offset in the log. Unit: message.units.offset. Parents: cluster, kafka, kafka-kafka_broker, kafka_broker_topic, kafka_topic, rack. CDH Version: CDH 5, CDH 6.
kafka_num_log_segments: The number of segments in the log. Unit: message.units.segments. Parents: cluster, kafka, kafka-kafka_broker, kafka_broker_topic, kafka_topic, rack. CDH Version: CDH 5, CDH 6.
kafka_size: The size of the log. Unit: bytes. Parents: cluster, kafka, kafka-kafka_broker, kafka_broker_topic, kafka_topic, rack. CDH Version: CDH 5, CDH 6.

Useful Shell Command Reference

Hardware Information
There are many ways to get information about hardware:
• CPU: cat /proc/cpuinfo, lscpu
• Memory: cat /proc/meminfo, vmstat -s, free -m or free -g
• Network interface: ip link show, netstat -i
• I/O device (hard drive): lsblk -d, fdisk -l
• Virtual environment: virt-what

Disk Space
df or mount shows the disks that are mounted and can be used to check disk space. On a file or directory level, the du command is useful for seeing how much disk space is in use.

I/O Activity and Utilization
iostat and sar come with the Linux package sysstat-9.0.4-27. iostat is used for tracking I/O performance. The recommended options are:
• -d for disk utilization
• -m for calculations in MB/sec
• -x for extended report
• -t sec to repeat statistics every sec seconds
sar has several forms of output:
• -b for I/O and transfer rate statistics
• -d for block device activity
• -n for network statistics
• -v for various file system statistics

File Descriptor Usage
lsof is used to identify the mapping between processes and open files. By passing multiple arguments, lsof can be used to help isolate the output:
• lsof <filename> to list processes that have <filename> open.
• lsof -p <pid> to list all files opened by the process corresponding to <pid>.
• lsof -r <secs> to keep producing output with a period of <secs>.

Network Ports, States, and Connections
• nc (netcat) or ss (socket statistics) are good for showing network activity.
• netstat is the Swiss-army knife tool for network interfaces.
• tcpdump or Wireshark with a Kafka filter is good for packet sniffing.

Process Information
• top shows a sorted list of processes.
• ps shows a snapshot list of processes. Arguments can be used to filter the output.
• ps -o min_flt,maj_flt <pid> shows page fault information.

Kernel Configuration
• ulimit -a displays kernel limits and shows which flags affect which kernel settings.
• ulimit -n <FD> sets a limit on the number of open file descriptors.

Kafka Public APIs

What is a Public API
The following parts of Apache Kafka in CDH are considered public APIs:
• Kafka wire protocol format: the format itself might change, but brokers will be able to use the old format as long as documentation and upgrade instructions are followed properly.
• Binary log format: the format itself might change, but brokers will be able to use the old format as long as documentation and upgrade instructions are followed properly.
• Interfaces and classes in the following packages:
– org/apache/kafka/common/serialization
– org/apache/kafka/common/errors
– org/apache/kafka/clients/producer
– org/apache/kafka/clients/consumer
• Command-line admin tools: tool arguments are public, except for ZooKeeper-related options, which are subject to change and/or removal.
• HttpMetricsReporter: existing fields will stay backward compatible, but new fields may be introduced. The only public API of HttpMetricsReporter is the /api/metrics REST endpoint. For a list of supported metrics, see Kafka Metrics.
• Properties, excluding their default values
• Config file content and format, and the effect of configuration attributes
• Endpoints

What is NOT a public API
There are structures that third parties might regard as an interface but that Cloudera Kafka distributions do not consider public APIs. In general, any API that is not listed as public in the What is a Public API section should be considered private, and client code should not rely on its behavior, data content, or format. Some examples are:
• Data structures in ZooKeeper: the content and format of what Kafka stores in ZooKeeper are internal implementation details.
• Authorizer interface: the only supported authorizer in CDH is the Sentry one.
• AdminClient: it is a new and rapidly evolving part of Kafka, so Cloudera cannot provide the same guarantees as for other interfaces.
• Interfaces marked with the @Evolving or @Unstable annotations in the Kafka source code
• Index files generated by Kafka
• Application log file content and format (for example, what Log4J/SLF4J/… produces)
• Any classes used for testing
• Relying on transitive dependencies: any dependency pulled in by Kafka
• Any other interfaces not listed above
• Anything that Cloudera does not support, even if it fits the definition of a public API

Kafka Frequently Asked Questions

This is intended to be an easy-to-understand FAQ on the topic of Kafka. One part is for beginners, one for advanced users and use cases. We hope you find it fruitful. If you are missing a question, please send it to your favorite Cloudera representative and we'll populate this FAQ over time.

Basics

What is Kafka?
Kafka is a streaming message platform. Breaking it down a bit further:
"Streaming": Lots of messages (think tens or hundreds of thousands) being sent frequently by publishers ("producers"), with messages polled frequently by lots of subscribers ("consumers").
"Message": From a technical standpoint, a key-value pair. From a non-technical standpoint, a relatively small number of bytes (think hundreds to a few thousand bytes).
If this isn't your planned use case, Kafka may not be the solution you are looking for. Contact your favorite Cloudera representative to discuss and find out. It is better to understand what you can and cannot do up front than to go ahead on the strength of enthusiastic vendor messaging and end up with a solution that does not meet your expectations.

What is Kafka designed for?
Kafka was designed at LinkedIn to be a horizontally scaling publish-subscribe system. It offers a great deal of configurability at the system and message level to achieve these performance goals. There are well-documented cases (Uber and LinkedIn) that showcase how well Kafka can scale when everything is done right.

What is Kafka not well fitted for (or what are the tradeoffs)?
It's very easy to get caught up in all the things that Kafka can be used for without considering the tradeoffs.
Kafka configuration is also not automatic. You need to understand each of your use cases to determine which configuration properties can be used to tune (and retune!) Kafka for each use case. Some more specific examples where you need to be deeply knowledgeable and careful when configuring are:
• Using Kafka as your microservices communication hub
Kafka can replace both the message queue and the service discovery parts of your software infrastructure. However, this is generally at the cost of some added latency as well as the need to monitor a new, complex system (that is, your Kafka cluster).
• Using Kafka as long-term storage
While Kafka does have a way to configure message retention, it is primarily designed for low-latency message delivery. Kafka does not have any support for the features that are usually associated with filesystems (such as metadata or backups). As such, ingesting into some form of long-term storage, such as HDFS, is recommended instead.
• Using Kafka as an end-to-end solution
Kafka is only part of a solution. There are a lot of best practices to follow and support tools to build before you can get the most out of it (see this wise LinkedIn post).
• Deploying Kafka without the right support
Uber has given some numbers for their engineering organization. These numbers could help give you an idea of what it takes to reach that kind of scale: 1300 microservices, 2000 engineers.

Where can I get a general Kafka overview?
The first four sections (Introduction, Setup, Clients, Brokers) of the CDH 6 Kafka Documentation cover the basics and design of Kafka. This should serve as a good starting point. If you have any remaining questions after reading that documentation, come to this FAQ or talk to your favorite Cloudera representative about training or a best-practices deep dive.

Where does Kafka fit well into an Analytic DB solution?
ADB deployments benefit from Kafka by using it for data ingest. Data can then populate tables for various analytics workloads. For ad hoc BI, the real-time aspect is less critical, but the ability to use the same data in real-time applications and in BI and analytics is a benefit that Cloudera's platform provides, as you will have Kafka for both purposes, already integrated, secured, governed, and centrally managed.

Where does Kafka fit well into an Operational DB solution?
Kafka is commonly used in the real-time, mission-critical world of Operational DB deployments. It is used to ingest data and allow immediate serving to other applications and services via Kudu or HBase. The benefit of using Kafka in the Cloudera platform for ODB is the integration, security, governance, and central management. You avoid the risks and costs of a siloed architecture and "yet another solution" to support.

What is a Kafka consumer?
If Kafka is the system that stores messages, then a consumer is the part of your system that reads those messages from Kafka. While Kafka does come with a command line tool that can act as a consumer, practically speaking, you will most likely write Java code using the KafkaConsumer API for your production system.

What is a Kafka producer?
While consumers read from a Kafka cluster, producers write to a Kafka cluster. Similar to the consumer (see the previous question), your producer is also custom Java code for your particular use case.
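As a minimal illustration (a sketch only; the broker address and topic name below are placeholders, not values from this guide), a simple producer written against the KafkaProducer API might look like this:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // placeholder broker list
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        // try-with-resources closes the producer, which also flushes buffered records
        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("my-topic", "key1", "value1")); // placeholder topic
        }
    }
}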
Your producer may need some tuning for write performance and SLA guarantees, but will generally be simpler (fewer error cases) to tune than your consumer.

What functionality can I call in my Kafka Java code?
The best way to get more information on what functionality you can call in your Kafka Java code is to look at the Java docs. And read very carefully!

What's a good size of a Kafka record if I care about performance and stability?
There is an older blog post from 2014 from LinkedIn titled Benchmarking Apache Kafka: 2 Million Writes Per Second (On Three Cheap Machines). In the "Effect of Message Size" section, you can see two charts which indicate that Kafka throughput starts to be affected at record sizes from 100 bytes through 1000 bytes, bottoming out around 10,000 bytes. In general, keeping topics specific and keeping message sizes deliberately small helps you get the most out of Kafka.
Excerpting from Deploying Apache Kafka: A Practical FAQ:

How to send large messages or payloads through Kafka?
Cloudera benchmarks indicate that Kafka reaches maximum throughput with message sizes of around 10 KB. Larger messages show decreased throughput. However, in certain cases, users need to send messages much larger than 10 KB. If the message payload sizes are on the order of 100s of MB, consider exploring the following alternatives:
• If shared storage is available (HDFS, S3, NAS), place the large payload on shared storage and use Kafka just to send a message with the payload location.
• Handle large messages by chopping them into smaller parts before writing into Kafka, using a message key to make sure all the parts are written to the same partition so that they are consumed by the same consumer, and re-assembling the large message from its parts when consuming.

Where can I get Kafka training?
You have many options. Cloudera provides training as listed in the next two questions. You can also ask your Resident Solution Architect to do a deep dive on Kafka architecture and best practices. And you can always engage in the community to get insight and expertise on specific topics.

Where can I get basic Kafka training?
Cloudera training offers a basic on-demand training for Kafka. This covers the basics of Kafka architecture, messages, ordering, and a few slides (code examples) of what appears to be an older version of the Java API. It also covers using Flume + Kafka.

Where can I get Kafka developer training?
Kafka developer training is included in Cloudera's Developer Training for Apache Spark and Hadoop.

Use Cases
Like most Open Source projects, Kafka provides a lot of configuration options to maximize performance. In some cases, it is not obvious how best to map your specific use case to those configuration options. We attempt to address some of those situations below.

What can I do to ensure that I never lose a Kafka event?
This is a simple question which has lots of far-reaching implications for your entire Kafka setup. A complete answer includes the next few related FAQs and their answers.

What is the recommended node hardware for best reliability?
Operationally, you need to make sure your Kafka cluster meets the following hardware setup:
1. Have a 3- or 5-node cluster only running ZooKeeper (a higher count is necessary only at the largest scales).
2. Have at least a 3-node cluster only running Kafka.
3. Have the disks on the Kafka cluster running in RAID 10. (Required for resiliency against disk failure.)
4. Have sufficient memory for both the Kafka and ZooKeeper roles in the cluster. (Recommended: 4 GB for the broker; the rest of memory is automatically used by the kernel as file cache.)
5. Have sufficient disk space on the Kafka cluster.
6. Have a sufficient number of disks to handle the bandwidth requirements for Kafka and ZooKeeper.
7. You need a number of nodes greater than or equal to the highest replication factor you expect to use.

What are the network requirements for best reliability?
Kafka expects a reliable, low-latency connection between the brokers and the ZooKeeper nodes.
1. Keep the number of network hops between the Kafka cluster and the ZooKeeper cluster relatively low.
2. Have highly reliable network services (such as DNS).

What are the system software requirements for best reliability?
Assuming you're following the recommendations of the previous two questions, the actual system outside of Kafka must be configured properly.
1. The kernel must be configured for the maximum I/O usage that Kafka requires:
a. Large page cache
b. Maximum file descriptors
c. Maximum file memory map limits
2. Kafka JVM configuration settings:
a. Brokers generally don't need more than 4 GB to 8 GB of heap space.
b. Run with the G1 garbage collector (-XX:+UseG1GC) using Java 8 or later.

How can I configure Kafka to ensure that events are stored reliably?
The following recommendations for Kafka configuration settings make it extremely difficult for data loss to occur.
1. Producer
a. block.on.buffer.full=true
b. retries=Long.MAX_VALUE
c. acks=all
d. max.in.flight.requests.per.connection=1
e. Remember to close the producer when it is finished or when there is a long pause.
2. Broker
a. Topic replication.factor >= 3
b. min.insync.replicas = 2
c. Disable unclean leader election
3. Consumer
a. Disable enable.auto.commit
b. Commit offsets after messages are processed by your consumer client(s).
If you have more than 3 hosts, you can increase the broker settings appropriately on topics that need more protection against data loss.

Once I've followed all the previous recommendations, my cluster should never lose data, right?
Kafka does not ensure that data loss never occurs. There are the following tradeoffs:
1. Throughput vs. reliability. For example, the higher the replication factor, the more resilient your setup will be against data loss. However, making those extra copies takes time and can affect throughput.
2. Reliability vs. free disk space. Extra copies due to replication use up disk space that would otherwise be used for storing events.
Beyond the above design tradeoffs, there are also the following issues:
• To ensure events are consumed, you need to monitor your Kafka brokers and topics to verify that sufficient consumption rates are sustained to meet your ingestion requirements.
• Ensure that replication is enabled on any topic that requires consumption guarantees. This protects against Kafka broker failure and host failure.
• Kafka is designed to store events for a defined duration, after which the events are deleted. You can increase the duration that events are retained up to the amount of supporting storage space.
• You will always run out of disk space unless you add more nodes to the cluster.

My Kafka events must be processed in order. How can I accomplish this?
After your topic is configured with partitions, Kafka sends each record (a key/value pair) to a particular partition based on its key.
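To make the mapping concrete: the producer's default partitioner hashes the serialized key (a murmur2 hash) and takes the result modulo the number of partitions. The sketch below uses a plain Java hash purely to illustrate the idea; it is not Kafka's exact algorithm.

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class KeyToPartition {
    // Illustration only: records that share a key always map to the same partition,
    // because the partition is a deterministic function of the key.
    static int partitionFor(String key, int numPartitions) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        return (Arrays.hashCode(keyBytes) & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        // With 6 partitions, "user-42" is always assigned the same partition number.
        System.out.println(partitionFor("user-42", 6));
        System.out.println(partitionFor("user-42", 6)); // same result every time
    }
}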
So, for any given key, the corresponding records are "in order" within a partition.
For global ordering, you have two options:
1. Your topic must consist of one partition (but a higher replication factor could be useful for redundancy and failover). However, this will result in very limited message throughput.
2. You configure your topic with a small number of partitions and perform the ordering after the consumer has pulled the data. This does not result in guaranteed ordering, but, given a large enough time window, will likely be equivalent.
Conversely, it is best to take Kafka's partitioning design into consideration when designing your Kafka setup rather than rely on global ordering of events.

How do I size my topic? Alternatively: What is the "right" number of partitions for a topic?
Choosing the proper number of partitions for a topic is key to achieving a high degree of parallelism with respect to writes and reads and to distributing load. Evenly distributed load over partitions is a key factor for good throughput (avoiding hot spots). Making a good decision requires estimation based on the desired throughput of producers and consumers per partition.
For example, if you want to be able to read 1 GB/sec, but your consumer is only able to process 50 MB/sec, then you need at least 20 partitions and 20 consumers in the consumer group. Similarly, if you want to achieve the same for producers, and 1 producer can only write at 100 MB/sec, you need 10 partitions. In this case, if you have 20 partitions, you can maintain 1 GB/sec for producing and consuming messages. You should adjust the exact number of partitions to the number of consumers or producers, so that each consumer and producer achieve their target throughput.
So a simple formula could be:
#Partitions = max(NP, NC)
where:
• NP is the number of required producers, determined by calculating TT/TP
• NC is the number of required consumers, determined by calculating TT/TC
• TT is the total expected throughput for our system
• TP is the max throughput of a single producer to a single partition
• TC is the max throughput of a single consumer from a single partition
This calculation gives you a rough indication of the number of partitions. It's a good place to start. Keep in mind the following considerations for improving the number of partitions after you have your system in place:
• The number of partitions can be specified at topic creation time or later.
• Increasing the number of partitions also affects the number of open file descriptors, so make sure you set the file descriptor limit properly.
• Reassigning partitions can be very expensive, and therefore it's better to over-provision than under-provision.
• Changing the number of partitions that are based on keys is challenging and involves manual copying (see Kafka Administration on page 59).
Keep in mind the following considerations for changing the number of partitions after you have your system in place:
• The number of partitions can be specified at topic creation time or later.
• Increasing the number of partitions also increases the number of open file descriptors, so make sure you set the file descriptor limit properly.
• Reassigning partitions can be very expensive, and therefore it is better to over-provision than to under-provision.
• Changing the number of partitions for topics that are partitioned by key is challenging and involves manual copying (see Kafka Administration on page 59).
• Reducing the number of partitions is not currently supported. Instead, create a new topic with a lower number of partitions and copy over the existing data.
• Metadata about partitions is stored in ZooKeeper in the form of znodes. Having a large number of partitions has effects on ZooKeeper and on client resources:
– Unneeded partitions put extra pressure on ZooKeeper (more network requests), and might introduce delay in controller and/or partition leader election if a broker goes down.
– Producer and consumer clients need more memory, because they need to keep track of more partitions and also buffer data for all partitions.
• As a guideline for optimal performance, you should have no more than 4000 partitions per broker and no more than 200,000 partitions in a cluster.
Make sure consumers do not lag behind producers by monitoring consumer lag. To check consumers' position in a consumer group (that is, how far behind the end of the log they are), use the following command:
$ kafka-consumer-groups --bootstrap-server BROKER_ADDRESS --describe --group CONSUMER_GROUP --new-consumer
How can I scale a topic that's already deployed in production?
Recall the following facts about Kafka:
• When you create a topic, you set the number of partitions. The higher the partition count, the better the parallelism and the more evenly events are spread across the cluster.
• In most cases, as events go to the Kafka cluster, events with the same key go to the same partition. This is a consequence of using a hash function to determine which partition a key maps to.
Now, you might assume that scaling means increasing the number of partitions in a topic. However, due to the way hashing works, simply increasing the number of partitions means that you lose the "events with the same key go to the same partition" property.
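To see why, the following simplified sketch shows how the default Java partitioner chooses a partition for a keyed record; the class name and example key are ours, and the real partitioner also handles records without keys and partition availability.

import java.nio.charset.StandardCharsets;
import org.apache.kafka.common.utils.Utils;

public class KeyPartitionSketch {

    // Simplified view of the default partitioner for keyed records: hash the key
    // bytes and take the result modulo the current partition count.
    static int partitionFor(String key, int numPartitions) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    public static void main(String[] args) {
        // Because the result depends on the partition count, the same key usually
        // lands on a different partition after partitions are added.
        System.out.println(partitionFor("user-42", 6));    // partition chosen with 6 partitions
        System.out.println(partitionFor("user-42", 12));   // often a different partition with 12
    }
}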
Given this behavior, there are two options:
1. Your cluster may not be scaling well because the partition loads are not balanced properly (for example, one broker has four very active partitions, while another has none). In those cases, you can use the kafka-reassign-partitions script to manually balance the partitions.
2. Create a new topic with more partitions, pause the producers, copy data over from the old topic, and then move the producers and consumers over to the new topic. This can be a bit tricky operationally.
How do I rebalance my Kafka cluster?
This question comes up when new nodes are added to the cluster or new disks are added to existing nodes. Partitions are not balanced automatically. If a topic already has a number of brokers equal to the replication factor (typically 3), then adding disks does not help with rebalancing.
Using the kafka-reassign-partitions command after adding new hosts is the recommended method.
Caveats
There are several caveats to using this command:
• It is highly recommended that you minimize the volume of replica changes to make sure the cluster remains healthy. For example, instead of moving ten replicas with a single command, move two at a time.
• It is not possible to use this command to make an out-of-sync replica into the leader partition.
• If too many replicas are moved, there could be serious performance impact on the cluster. When using the kafka-reassign-partitions command, look at the partition counts and sizes. From there, you can test various partition sizes along with the --throttle flag to determine what volume of data can be copied without significantly affecting broker performance.
• Given the earlier restrictions, it is best to use this command only when all brokers and topics are healthy.
How do I monitor my Kafka cluster?
As of Cloudera Enterprise 5.14, Cloudera Manager has monitoring for a Kafka cluster. Currently, there are also three GitHub projects that provide additional monitoring functionality (see the References at the end of this section for links):
• Doctor Kafka (Pinterest, Apache 2.0 License)
• Kafka Manager (Yahoo, Apache 2.0 License)
• Cruise Control (LinkedIn, BSD 2-clause License)
These projects carry Apache-compatible licenses, but they are not run as open community projects (limited community, bug filing, and transparency).
What are the best practices concerning consumer group.id?
The group.id is just a string that helps Kafka track which consumers are related (by having the same group id).
• In general, timestamps as part of group.id are not useful. Because each group.id corresponds to multiple consumers, you cannot have a unique timestamp for each consumer.
• Add any helpful identifiers. These could be related to a group (for example, transactions, marketing), a purpose (fraud, alerts), or a technology (Flume, Spark).
How do I monitor consumer group lag?
This is typically done using the kafka-consumer-groups command line tool. Copying directly from the upstream documentation (see References), we have this example output (reformatted for readability):
$ bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group my-group
TOPIC     PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID      HOST        CLIENT-ID
my-topic  0          2               4               2    consumer-1-69d6  /127.0.0.1  consumer-1
my-topic  1          2               3               1    consumer-1-69d6  /127.0.0.1  consumer-1
my-topic  2          2               3               1    consumer-2-9bb2  /127.0.0.1  consumer-2
In general, if everything is going well with a particular topic, each consumer's CURRENT-OFFSET should be up-to-date or nearly up-to-date with the LOG-END-OFFSET. From this command, you can determine whether a particular host or a particular partition is having issues keeping up with the data rate.
How do I reset the consumer offset to an arbitrary value?
This is also done using the kafka-consumer-groups command line tool. It is generally an administration feature used to get around corrupted records, data loss, or recovery from a broker or host failure. Aside from those special cases, using the command line tool for this purpose is not recommended.
By using the --execute --reset-offsets flags, you can change the consumer offsets for a consumer group (or even all groups) to a specific setting based on each partition log's beginning or end, or on a fixed timestamp. Typing the kafka-consumer-groups command with no arguments gives you the complete help output.
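If the reset has to happen from inside an application rather than from the command line, a rough programmatic equivalent can be sketched with the Java consumer, as shown below. The broker address, group, topic, and partition numbers are placeholders, and, just as with the tool, all other consumers in the group must be stopped first.

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class OffsetResetSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");    // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");                 // group being reset
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Manually assign the partitions whose offsets should be rewritten.
            List<TopicPartition> partitions = Arrays.asList(
                new TopicPartition("my-topic", 0),
                new TopicPartition("my-topic", 1));
            consumer.assign(partitions);

            // Look up the earliest available offset for each partition and seek to it.
            // (consumer.endOffsets(...) or consumer.offsetsForTimes(...) can be used
            // instead to reset to the log end or to a fixed timestamp.)
            Map<TopicPartition, Long> beginnings = consumer.beginningOffsets(partitions);
            for (Map.Entry<TopicPartition, Long> entry : beginnings.entrySet()) {
                consumer.seek(entry.getKey(), entry.getValue());
            }

            // Commit the new positions so they become the group's stored offsets.
            consumer.commitSync();
        }
    }
}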
How do I configure MirrorMaker for bi-directional replication across DCs?
MirrorMaker performs a one-way copy of one or more topics from a source Kafka cluster to a destination Kafka cluster. Given this restriction, you need to run two MirrorMaker instances, one to copy from A to B and another to copy from B to A. In addition, consider the following:
• Cloudera recommends using the "pull" model for MirrorMaker, meaning that the MirrorMaker instance that writes to the destination runs on a host "near" the destination cluster.
• The topic names must be unique across the two clusters being copied.
• On secure clusters, the source cluster and destination cluster must be in the same Kerberos realm.
How does the consumer max retries vs timeout work?
With the newer versions of Kafka, there are two mechanisms that govern how consumers communicate with brokers.
• Retries: This is generally related to reading data. When a consumer reads from a broker, it is possible for that attempt to fail because of problems such as intermittent network outages or I/O issues on the broker. To improve reliability, the consumer retries (up to the configured max.retries value) before actually failing to read a log offset.
• Timeout: This term is a bit vague because there are two timeouts related to consumers:
– Poll timeout: This is the timeout between calls to KafkaConsumer.poll(). Set it based on the read latency requirements of your particular use case.
– Heartbeat timeout: The newer consumer has a "heartbeat thread" that sends a heartbeat to the broker (actually to the group coordinator within a broker) to let the broker know that the consumer is still alive. This happens on a regular basis, and if the broker does not receive at least one heartbeat within the timeout period, it assumes the consumer is dead and disconnects it.
How do I size my Kafka cluster?
There are several considerations for sizing your Kafka cluster.
• Disk space
Disk space primarily consists of your Kafka data and the broker logs. When in debug mode, the broker logs can get quite large (tens to hundreds of GB), so reserving a significant amount of space could save you some future headaches. For Kafka data, you need to estimate message size, number of topics, and redundancy. Also remember that you will be using RAID 10 for Kafka's data, so half of your hard drives go toward redundancy. From there, you can calculate how many drives are needed. In general, you will want more hosts than the minimum suggested by the number of drives. This leaves room for growth and some scalability headroom.
• ZooKeeper nodes
One node is fine for a test cluster. Three is standard for most Kafka clusters. At large scale, five nodes is fairly common for reliability.
• Leader partition count and bandwidth usage
This is likely the metric with the highest variability. Any Kafka broker will be overloaded if it has too many leader partitions. In the worst cases, each leader partition requires high bandwidth, high message rates, or both. For other topics, leader partitions will be a tiny fraction of what a broker can handle (limited by software and hardware). To estimate a workable per-host average, try grouping topics by partition data throughput requirements, such as 2 high-bandwidth data partitions, 4 medium-bandwidth data partitions, and 20 small-bandwidth data partitions. From there, you can determine how many hosts are needed.
How can I combine Kafka with Flume to ingest into HDFS?
We have two blog posts on using Kafka with Flume:
• The original post: Flafka: Apache Flume Meets Apache Kafka for Event Processing
• The updated version for CDH 5.8/Apache Kafka 0.9/Apache Flume 1.7: New in Cloudera Enterprise 5.8: Flafka Improvements for Real-Time Data Ingest
How can I build a Spark streaming application that consumes data from Kafka?
You need to set up your development environment to use both Spark libraries and Kafka libraries:
• Building Spark Applications
• The kafka-examples directory on Cloudera's public GitHub site has an example pom.xml.
From there, you should be able to read data using the KafkaConsumer class and use Spark libraries for real-time data processing. The blog post Reading data securely from Apache Kafka to Apache Spark includes a pointer to a GitHub repository that contains a word count example.
For further background, read the blog post Architectural Patterns for Near Real-Time Data Processing with Apache Hadoop.
References
Kafka basic training: https://ondemand.cloudera.com/courses/course-v1:Cloudera+Kafka+201601/info Kafka developer training: https://ondemand.cloudera.com/courses/course-v1:Cloudera+DevSH+201709/info Doctor Kafka: http://github.com/pinterest/doctorkafka Kafka manager: http://github.com/yahoo/kafka-manager Cruise control: http://github.com/linkedin/cruise-control Upstream documentation: http://kafka.apache.org/documentation/#basic_ops_consumer_lag 172 | Apache Kafka Guide Appendix: Apache License, Version 2.0 Appendix: Apache License, Version 2.0 SPDX short identifier: Apache-2.0 Apache License Version 2.0, January 2004 http://www.apache.org/licenses/ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 1. Definitions. "License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document. "Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License. "Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity. "You" (or "Your") shall mean an individual or Legal Entity exercising permissions granted by this License. "Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files. "Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types. "Work" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below). "Derivative Works" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof. "Contribution" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution." 
"Contributor" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work. 2. Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form. 3. Grant of Patent License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims Cloudera | 173 Appendix: Apache License, Version 2.0 licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed. 4. Redistribution. You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions: 1. You must give any other recipients of the Work or Derivative Works a copy of this License; and 2. You must cause any modified files to carry prominent notices stating that You changed the files; and 3. You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and 4. If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License. 
You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License. 5. Submission of Contributions. Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions. 6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file. 7. Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License. 8. Limitation of Liability. In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages. 9. Accepting Warranty or Additional Liability. 174 | Cloudera Appendix: Apache License, Version 2.0 While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability. END OF TERMS AND CONDITIONS APPENDIX: How to apply the Apache License to your work To apply the Apache License to your work, attach the following boilerplate notice, with the fields enclosed by brackets "[]" replaced with your own identifying information. (Don't include the brackets!) The text should be enclosed in the appropriate comment syntax for the file format. 
We also recommend that a file or class name and description of purpose be included on the same "printed page" as the copyright notice for easier identification within third-party archives. Copyright [yyyy] [name of copyright owner] Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.