Apache Kafka Reference
About Apache Kafka Reference
The Apache Kafka Reference is a comprehensive, searchable guide to Kafka's core APIs, CLI tools, and configuration, organized across six essential categories:
- Producer: kafka-console-producer for CLI messaging, ProducerRecord with key/value/topic construction in Java, critical producer configurations (acks=all for durability, retries, serializer settings), and producer.flush() for flushing the send buffer.
- Consumer: kafka-console-consumer with --from-beginning, Java consumer.subscribe() with a polling loop, group.id and offset management configuration, manual consumer.commitSync() for at-least-once semantics, and kafka-consumer-groups --describe for lag monitoring.
- Topic: kafka-topics --create with partition and replication-factor options, --list, --describe for partition and leader information, and --alter for expanding partition count.
- Partition: partitioner.class for custom partitioning strategies, kafka-reassign-partitions for partition rebalancing, and min.insync.replicas for durability guarantees.
- Streams: KStream creation with StreamsBuilder, KTable for latest-state materialization, mapValues for transformations, and stream.join() with time windows.
- Connect: standalone and distributed mode startup, the REST API for connector lifecycle (POST/GET/DELETE /connectors), and a JDBC source connector example.
Apache Kafka is the backbone of event-driven architectures, real-time data pipelines, and stream processing systems at companies of every scale. Backend engineers use Kafka to decouple microservices and guarantee message delivery. Data engineers rely on Kafka as the central hub connecting databases, data warehouses, and stream processors like Apache Flink and Apache Spark Structured Streaming. Platform teams use Kafka Connect to integrate dozens of source and sink systems without writing custom code. This reference provides the exact commands, Java code snippets, and configuration properties you need to work effectively with Kafka without memorizing the full documentation.
The reference is organized around the six core Kafka concepts that every practitioner needs to understand: how data gets into Kafka (Producer), how it gets consumed (Consumer), how it is organized (Topic and Partition), how it can be processed as streams (Streams), and how external systems connect to it (Connect). Whether you are running a local development Kafka cluster, debugging consumer lag, or setting up a production multi-broker deployment, this guide gives you the precise syntax and examples for each task.
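As a taste of the Producer entries, here is a minimal Java send, sketched under the assumption of a local broker on localhost:9092 and the kafka-clients dependency on the classpath; the topic name "events" and key are illustrative:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class QuickSend {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumes a local broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("acks", "all"); // wait for all in-sync replicas before treating the send as done

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The key "user-42" pins this record to one partition of the (illustrative) "events" topic
            producer.send(new ProducerRecord<>("events", "user-42", "signed_up"));
            producer.flush(); // block until the send buffer is drained
        }
    }
}
```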
Key Features
- Producer CLI and Java API: kafka-console-producer, ProducerRecord with key/topic/value, acks=all durability config, retries, serializer properties, and producer.flush()/close()
- Consumer CLI and Java API: kafka-console-consumer with --from-beginning, consumer.subscribe() polling loop, group.id, auto.offset.reset, enable.auto.commit=false, and commitSync()
- Consumer group monitoring: kafka-consumer-groups --describe to check partition assignment, current offset, log end offset, and consumer lag per partition
- Topic lifecycle management: kafka-topics --create with --partitions and --replication-factor, --list, --describe for leader and ISR details, --alter to expand partition count
- Partition management: partitioner.class for round-robin or custom partitioning, kafka-reassign-partitions for rebalancing, min.insync.replicas for write durability guarantees
- Kafka Streams API: KStream with filter and to(), KTable for word count aggregation, mapValues for stateless transformation, and stream join with JoinWindows for time-windowed enrichment
- Kafka Connect: standalone and distributed mode startup commands, REST API for connector CRUD (POST, GET, DELETE /connectors), and JDBC source connector configuration example
- Instant search and category filtering across all Kafka CLI, Java API, and configuration entries with working code examples
Frequently Asked Questions
What does acks=all mean in Kafka producer configuration?
acks=all (equivalent to acks=-1) means the producer waits for acknowledgment from all in-sync replicas (ISRs) before considering a message successfully sent. This provides the strongest durability guarantee — a message will not be lost even if the leader broker fails immediately after acknowledging. Combined with min.insync.replicas=2, it ensures at least two replicas have confirmed the write.
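The durability combination described above can be sketched as producer properties; note that min.insync.replicas is a topic/broker-side setting, shown here as a matching CLI command with an illustrative topic name:

```java
// Producer-side durability settings (sketch)
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("acks", "all");                        // wait for every in-sync replica (same as acks=-1)
props.put("retries", Integer.MAX_VALUE);         // keep retrying transient failures
props.put("enable.idempotence", "true");         // prevent duplicates introduced by those retries

// Topic/broker side: pair with min.insync.replicas=2, e.g.
// kafka-configs --bootstrap-server localhost:9092 --alter \
//   --entity-type topics --entity-name events --add-config min.insync.replicas=2
```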
What is the difference between auto.offset.reset=earliest and latest?
When a consumer group has no committed offset (first start or after offset expiry), earliest starts reading from the very beginning of the topic partition, while latest starts reading only from new messages produced after the consumer starts. Use earliest in batch processing or catch-up scenarios, and latest for real-time event processing where you only care about new events.
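The two reset modes differ only in what happens when no committed offset exists, which a consumer configuration sketch makes concrete (the group name is hypothetical):

```java
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "reporting-batch");      // hypothetical group; reset only applies when it has no committed offset
props.put("auto.offset.reset", "earliest");    // replay the partition from the beginning (catch-up / batch style)
// props.put("auto.offset.reset", "latest");   // alternative: skip history, read only records produced after startup
```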
Why should I set enable.auto.commit=false in Kafka consumers?
Auto-commit periodically commits offsets at a fixed interval (default 5 seconds) regardless of whether your application has successfully processed the messages. If your process crashes after auto-commit but before finishing processing, messages are lost. By setting enable.auto.commit=false and calling commitSync() only after successful processing, you achieve at-least-once delivery semantics and prevent data loss.
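The commit-after-processing pattern can be sketched as follows, assuming the kafka-clients library; the topic and group names are illustrative and process() stands in for your own logic:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// At-least-once consumption: commit offsets only after the batch is fully processed.
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "order-processor");            // hypothetical group
props.put("enable.auto.commit", "false");            // take over offset management
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    consumer.subscribe(List.of("orders"));           // topic name is illustrative
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        for (ConsumerRecord<String, String> record : records) {
            process(record);                         // your processing logic; may throw
        }
        consumer.commitSync();                       // commit only after every record succeeded
    }
}
```

If process() throws before commitSync(), the uncommitted records are redelivered on restart, which is exactly the at-least-once guarantee described above.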
What is the difference between KStream and KTable in Kafka Streams?
A KStream represents an unbounded stream of individual events — every record is an independent event. A KTable represents the latest state for each key — it is a changelog stream where each new record for a key updates (upserts) the current value. Use KStream for event-by-event processing (click events, transactions) and KTable for aggregations, joins, or maintaining current state (user profiles, running totals).
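The event-vs-state distinction shows up directly in the classic word count, sketched here with illustrative topic names and default serdes assumed:

```java
import java.util.Arrays;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

// A KStream of text lines (every record is an event) becomes a
// KTable of running counts (latest state per word, upserted on each update).
StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> lines = builder.stream("text-lines");   // topic name is illustrative
KTable<String, Long> counts = lines
    .flatMapValues(value -> Arrays.asList(value.toLowerCase().split("\\s+")))
    .groupBy((key, word) -> word)   // re-key each record by the word itself
    .count();                        // KTable: current count per word
counts.toStream().to("word-counts"); // emit the changelog back to a topic
```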
How do I check consumer lag in Kafka?
Use kafka-consumer-groups --bootstrap-server localhost:9092 --describe --group my-group. The output shows each partition's current offset (where the consumer is), log end offset (latest message in the partition), and the lag (difference). High lag means consumers are falling behind producers. Monitor lag with Kafka's JMX metrics or tools like Kafka Exporter for Prometheus.
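The lag check above looks like this in practice; the group, topic, and offset values in the commented output are illustrative:

```shell
kafka-consumer-groups --bootstrap-server localhost:9092 --describe --group my-group

# GROUP     TOPIC   PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID  ...
# my-group  orders  0          1500            1620            120  consumer-1-...
# my-group  orders  1          2000            2000            0    consumer-2-...
```

A steadily growing LAG column on one partition usually means that consumer is slower than its peers or that the partition receives a disproportionate share of keys.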
Can I decrease the number of partitions in a Kafka topic?
No, Kafka does not support decreasing partition count. You can only increase partitions with kafka-topics --alter --partitions N. Increasing partitions changes the mapping of keys to partitions, which can break ordering guarantees for key-based producers. The only way to truly reduce partitions is to create a new topic with fewer partitions and migrate data using kafka-streams or a custom migration job.
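A sketch of the one direction that is supported, with an illustrative topic name and target count:

```shell
# Expand my-topic to 12 partitions (the count can never be reduced afterwards)
kafka-topics --bootstrap-server localhost:9092 --alter --topic my-topic --partitions 12

# Verify the new partition layout, leaders, and ISR
kafka-topics --bootstrap-server localhost:9092 --describe --topic my-topic
```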
What is Kafka Connect and when should I use it?
Kafka Connect is a framework for reliably streaming data between Kafka and external systems (databases, S3, Elasticsearch, etc.) without writing custom code. Source connectors pull data into Kafka from external systems; sink connectors push data from Kafka to external systems. Use Connect when you need standard integrations — it handles offset management, error handling, and scaling automatically. Common connectors include JDBC Source/Sink, Debezium CDC, and S3 Sink.
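The connector lifecycle over the REST API can be sketched like this, assuming a distributed Connect worker on the default port 8083 and the Confluent JDBC source connector installed; the connector name, database URL, and column are hypothetical:

```shell
# Register a JDBC source connector that polls new rows by an incrementing id column
curl -X POST -H "Content-Type: application/json" http://localhost:8083/connectors -d '{
  "name": "pg-users-source",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:postgresql://localhost:5432/app",
    "mode": "incrementing",
    "incrementing.column.name": "id",
    "topic.prefix": "pg-"
  }
}'

# Inspect and remove it
curl http://localhost:8083/connectors/pg-users-source/status
curl -X DELETE http://localhost:8083/connectors/pg-users-source
```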
How do I set up a Kafka consumer group for parallel processing?
Create multiple consumer instances in the same group (same group.id). Kafka automatically distributes partitions among group members — each partition is assigned to exactly one consumer in the group at a time. For maximum parallelism, ensure your topic has at least as many partitions as consumers. Consumers in excess of the partition count will be idle. Use kafka-consumer-groups --describe to verify partition assignment and monitor lag per consumer.
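A minimal sketch of that fan-out, assuming the kafka-clients library; the group and topic names are illustrative, and each thread hosts one consumer (KafkaConsumer is not thread-safe, so never share one instance across threads):

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Three consumers with the same group.id: Kafka splits the topic's
// partitions among them; a fourth consumer beyond the partition count would idle.
for (int i = 0; i < 3; i++) {
    final int workerId = i;
    new Thread(() -> {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "parallel-workers");   // same group.id => shared assignment
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("tasks"));    // topic name is illustrative
            while (true) {
                consumer.poll(Duration.ofMillis(500)).forEach(r ->
                    System.out.printf("worker-%d got %s%n", workerId, r.value()));
            }
        }
    }, "worker-" + i).start();
}
```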