Distributed Logs: From Theory to Production

Advanced Producer Patterns

Throughput Without Latency Spikes: Batching, Framing, and Compression #

High-performance producer throughput is not about sending faster — it is about sending less often with more per send. When application code calls send(), the record does not immediately cross the network. It enters a client-side pipeline designed to amortize the overhead of network I/O and protocol framing across many records. Understanding this pipeline is the prerequisite for tuning a producer that delivers wire-speed throughput without tail latency spikes.

The RecordBatch: The Actual Unit of Transmission #

The Kafka protocol does not send individual records. It sends RecordBatches — sequences of records destined for the same topic partition, encapsulated with a shared header containing the base offset, timestamp, and compression attributes. A producer sending 100 records in 100 separate requests pays 100× the protocol framing overhead. The same 100 records in one batch pay it once.

The efficiency of the entire system depends on batch density. A sparse batch is expensive: the 61-byte RecordBatch header represents ~6% overhead on a 1KB batch and is negligible on a 1MB batch. Compression compounds this — most algorithms need sufficient data volume to build an effective dictionary. A 1KB batch might compress 10%. The same data structure at 16KB might compress 50%.

Record Accumulator: application thread → per-partition buffers → drain conditions → broker

The Record Accumulator: batch.size and linger.ms #

The producer’s internal accumulator buffers records per partition. A buffer drains when one of two conditions is met:

  • The buffer reaches batch.size — the producer is throughput-bound and fills buffers faster than the timer expires.
  • The wait exceeds linger.ms — the producer is latency-bound and times out waiting for more data.

linger.ms is commonly misread as a mandatory delay added to every send. It is an upper bound on waiting. If the application generates data fast enough to fill batch.size before linger.ms expires, the batch sends immediately. linger.ms only governs the floor of produce latency for low-throughput producers.

The control loop can be modeled as a race between production rate and the accumulator timer. Let R be the record production rate in bytes/sec and L be linger.ms. The theoretical maximum batch size without stalling is R × L. If R × L is significantly smaller than batch.size, the producer is latency-bound — it constantly times out waiting for data. If R × L exceeds batch.size, the producer is throughput-bound — it fills buffers and sends immediately.
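The R × L race can be sketched numerically. `AccumulatorModel` and its method names are hypothetical helpers for illustration, not part of the Kafka client API:

```java
// Sketch of the accumulator race: is the producer latency-bound or throughput-bound?
// Hypothetical helper, not part of the Kafka client API.
public class AccumulatorModel {
    /** Maximum bytes a batch can accumulate before linger.ms expires: R × L. */
    public static long maxBatchBytes(long bytesPerSec, long lingerMs) {
        return bytesPerSec * lingerMs / 1000;
    }

    /** A producer is throughput-bound when it fills batch.size before the timer fires. */
    public static boolean isThroughputBound(long bytesPerSec, long lingerMs, long batchSize) {
        return maxBatchBytes(bytesPerSec, lingerMs) >= batchSize;
    }

    public static void main(String[] args) {
        // 1 MB/s with linger.ms=5 accumulates only ~5 KB: latency-bound vs a 16 KB batch.size.
        System.out.println(isThroughputBound(1_000_000, 5, 16_384));  // false
        // 10 MB/s fills 16 KB well inside 5 ms: throughput-bound.
        System.out.println(isThroughputBound(10_000_000, 5, 16_384)); // true
    }
}
```

The latency-bound case is the signal to raise linger.ms or improve batch cohesion; the throughput-bound case is the signal to raise batch.size.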

A subtle pathology arises with high partition counts and sparse uniform traffic. A producer emitting 10,000 records/sec across 100 partitions sends only 100 records/sec per partition. Even with linger.ms=50, each batch captures only 5 records. The solution is the Sticky Partitioner: for records without keys, it temporarily routes everything to one partition until the batch is full, then rotates. This intentionally creates short-term imbalance to maximize batch cohesion.

Sticky Partitioner: full batches vs round-robin micro-batches

Compression: Codecs and Their Tradeoffs #

Compression is applied at the RecordBatch level. The client serializes the records, compresses the entire set, and stores the codec in the batch header. The choice of codec shifts the bottleneck:

  • Gzip — highest compression ratio, CPU-intensive. Often becomes the bottleneck on the producer itself at high throughput.
  • Snappy / LZ4 — optimized for speed, lower ratios, minimal CPU impact. Suitable for latency-sensitive, high-throughput workloads.
  • Zstd — comparable ratios to Gzip with decompression speeds closer to LZ4. The best default for most workloads. Compression speed can still consume significant CPU at high levels.

The network environment matters. In cloud deployments where cross-zone data transfer is metered, spending CPU cycles on Zstd to reduce bytes on the wire is economically rational. In a local 100GbE datacenter, LZ4 or no compression may yield higher end-to-end throughput.

Compression codec tradeoffs: ratio vs CPU vs speed

Backpressure: When the Broker Cannot Keep Up #

The accumulator allocates batch.size bytes for every active partition. A producer writing to 1,000 partitions with batch.size=1MB may attempt to reserve 1GB of heap for buffers alone. If this exceeds buffer.memory, send() blocks. This is the moment where latency jumps from milliseconds to seconds.

Increasing buffer.memory delays blocking but does not eliminate it. The right signal is record-queue-time-max: if this metric rises, the producer is waiting on the broker. If compression-time-avg rises, the producer is CPU-bound. These two metrics together locate the bottleneck before it becomes a user-visible incident.
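The two-metric diagnosis can be expressed as a small classifier. The class name and the threshold values below are illustrative assumptions, not client defaults; tune them to your own latency SLOs:

```java
// Sketch: turn the two producer metrics into a bottleneck diagnosis.
// Thresholds are illustrative assumptions, not client defaults.
public class BottleneckClassifier {
    public enum Bottleneck { BROKER_BACKPRESSURE, PRODUCER_CPU, HEALTHY }

    public static Bottleneck classify(double recordQueueTimeMaxMs, double compressionTimeAvgMs) {
        if (recordQueueTimeMaxMs > 100.0) return Bottleneck.BROKER_BACKPRESSURE; // waiting on the broker
        if (compressionTimeAvgMs > 5.0)   return Bottleneck.PRODUCER_CPU;        // codec is eating CPU
        return Bottleneck.HEALTHY;
    }

    public static void main(String[] args) {
        System.out.println(classify(450.0, 1.0)); // BROKER_BACKPRESSURE
        System.out.println(classify(10.0, 9.0));  // PRODUCER_CPU
    }
}
```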

Tuning Playbook #

  1. Baseline — start with batch.size=16KB, linger.ms=5ms, compression.type=lz4.
  2. Maximize batching — if batch-size-avg < batch.size, increase linger.ms (10–20ms) or enable Sticky Partitioning.
  3. Optimize throughput — increase batch.size (64KB, 128KB) until batch-size-avg plateaus.
  4. Manage latency — if tail latency is high, check compression-time-avg. If high, switch to Snappy/LZ4. If record-queue-time-max is high, the bottleneck is downstream.
  5. Safety — ensure buffer.memory > batch.size × partition count to avoid immediate blocking.
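Step 1 of the playbook, expressed as Java client configuration — a starting point to tune from, assuming the standard `ProducerConfig` keys (the partition count of 100 is a placeholder for your topic's actual count):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

// Playbook step 1: a baseline to measure against before tuning upward.
Properties props = new Properties();
props.put(ProducerConfig.BATCH_SIZE_CONFIG, 16_384);          // 16 KB
props.put(ProducerConfig.LINGER_MS_CONFIG, 5);
props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");
// Playbook step 5: keep buffer.memory above batch.size × active partitions
// (here assumed 100 partitions, with 2× headroom) to avoid immediate blocking.
props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, 2L * 16_384 * 100);
```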

Durability at Wire Speed: Idempotent Producers, Acknowledgment Semantics, and Raft Replication Reality #

The guarantee that a write survived is the fundamental contract between a producer and a storage system. In Redpanda, this contract is negotiated per request via acks, but the mechanism that enforces it runs deep in the cluster: Raft consensus. Misunderstanding this interaction produces two failure modes — producers configured for high throughput that silently risk data loss during leader failovers, and producers paying the latency penalty of extreme durability without actually achieving it due to misconfigured replication constraints.

The acks Ladder: Three Points on the Durability Spectrum #

acks ladder: fire-and-forget → leader ack → quorum ack

  • acks=0 — the producer serializes the record to the TCP socket and considers the request complete. No broker confirmation. Zero durability. Use only for ephemeral metrics where gaps are acceptable.
  • acks=1 — the leader writes to its local log and acknowledges. A leader crash before replication loses the data permanently. This is not durability — it is confirmation that one machine accepted a write the cluster has not committed.
  • acks=all — the leader waits for the full replication quorum to confirm before acknowledging (every ISR member in Kafka; a Raft majority in Redpanda). The only setting that provides durability against node failure. This is the setting that means something.

Under acks=all, the producer’s latency budget changes fundamentally. The request is no longer bound by the leader’s local disk I/O — it is bound by the speed of light between the leader and its followers. The leader cannot acknowledge until a majority of the Raft group has persisted the entry. This introduces a replication tax on every write: the minimum latency is determined by the slowest follower required to satisfy the quorum.

\[\text{Latency}_{acks=all} \geq \max_{i \in \text{quorum}} \text{RTT}(\text{leader}, \text{follower}_i) + \text{fsync}\]

Replication tax: quorum write timeline showing slowest follower as the bottleneck

This explains why tail latency spikes during cluster maintenance or partial network degradation. If a follower becomes slow, the leader waits. If the number of healthy followers drops below the quorum size, the partition becomes unavailable for writes entirely. min.insync.replicas is the safety valve — it rejects writes if the available quorum is smaller than the configured minimum.
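The quorum-latency floor can be sketched as a model. `ReplicationTax` is a hypothetical helper, and it also shows the nuance hidden in the max: a slow follower only raises the floor once it is actually needed to complete the majority:

```java
import java.util.Arrays;

// Sketch of the acks=all latency floor for a Raft group. Hypothetical model class.
public class ReplicationTax {
    /**
     * The leader commits once (majority - 1) followers have persisted the entry,
     * so the floor is the RTT of the slowest follower *inside* that quorum, plus fsync.
     */
    public static double minLatencyMs(double[] followerRttMs, int replicas, double fsyncMs) {
        double[] sorted = followerRttMs.clone();
        Arrays.sort(sorted);
        int needed = replicas / 2; // followers needed for a majority (leader counts as one vote)
        return sorted[needed - 1] + fsyncMs;
    }

    public static void main(String[] args) {
        // 3 replicas: leader + 1 follower is a majority, so the faster follower (0.5 ms) gates commit.
        System.out.println(minLatencyMs(new double[]{0.5, 8.0}, 3, 1.0)); // 1.5
        // If every remaining follower is slow, the floor jumps with them.
        System.out.println(minLatencyMs(new double[]{8.0, 9.0}, 3, 1.0)); // 9.0
    }
}
```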

Idempotent Production: Moving Deduplication into the Broker #

The sequence diagram for acks=all reveals a specific failure mode: the write is successfully replicated and committed to the Raft log, but the acknowledgment is lost on the wire. The producer times out, treats it as a failure, and retries. Without deduplication, the broker appends the same record again — corrupting the log with a duplicate.

Idempotent production solves this by moving deduplication into the broker’s ingest path. When enable.idempotence=true is set, the producer requests a Producer ID (PID) from the cluster on initialization. Every batch sent subsequently includes this PID and a monotonically increasing Sequence Number (SeqNum), starting at zero. The broker maintains the highest SeqNum committed for each active PID-partition pair.

When the broker receives a batch, it evaluates the sequence number against its local state:

\[\text{Action}(SeqNum) = \begin{cases} \text{accept} & \text{if } SeqNum = LastSeqNum + 1 \\ \text{deduplicate} & \text{if } SeqNum \leq LastSeqNum \\ \text{reject} & \text{if } SeqNum > LastSeqNum + 1 \end{cases}\]

  • Expected (SeqNum = LastSeqNum + 1) — the batch is new and strictly ordered. Accepted, appended, replicated.
  • Duplicate (SeqNum ≤ LastSeqNum) — already processed. The broker acknowledges immediately without writing. This is exactly the retry scenario.
  • Gap (SeqNum > LastSeqNum + 1) — an intermediate batch was lost or reordered. The broker rejects with OutOfOrderSequenceException, forcing the producer to rewind or fail.
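The decision rule above is a three-way state machine. A sketch (hypothetical class; the real broker tracks this state per PID-partition pair):

```java
// Sketch of the broker-side sequence-number check for idempotent producers.
// Hypothetical class; the broker keeps this state per PID-partition pair.
public class SequenceCheck {
    public enum Action { ACCEPT, DEDUPLICATE, REJECT }

    public static Action evaluate(long lastSeqNum, long incomingSeqNum) {
        if (incomingSeqNum == lastSeqNum + 1) return Action.ACCEPT;      // strictly next: append and replicate
        if (incomingSeqNum <= lastSeqNum)     return Action.DEDUPLICATE; // retry of a committed batch: ack, no write
        return Action.REJECT;                                            // gap: OutOfOrderSequenceException
    }

    public static void main(String[] args) {
        System.out.println(evaluate(4, 5)); // ACCEPT
        System.out.println(evaluate(5, 5)); // DEDUPLICATE (the lost-ACK retry)
        System.out.println(evaluate(4, 7)); // REJECT
    }
}
```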

This mechanism provides an exactly-once guarantee for the duration of a producer session:

Idempotent retry: broker deduplicates on PID+SeqNum match

  1. Producer sends batch PID=101, SeqNum=5.
  2. Broker writes to offset 500, replicates to ISR, sends ACK — which the network drops.
  3. Producer times out. Retries PID=101, SeqNum=5.
  4. Broker checks state: SeqNum=5 for PID=101 already committed.
  5. Broker acknowledges immediately. Nothing written.
  6. Consumer reads offset 500. One record. Once.

Session Scope and the Transactional ID #

The PID is ephemeral. If the producer restarts, it gets a new PID and the broker’s deduplication state for the old PID no longer applies. A restart followed by a replay of application-level data is outside the scope of idempotent production — the broker cannot distinguish a legitimate retry from a duplicate application-level event sent from a new producer instance.

For cross-restart exactly-once, use Transactional IDs. A transactional ID persists the PID across restarts by binding the producer identity to a stable application-level string. The broker fences any previous producer instance with the same transactional ID, ensuring only one active producer writes to a given set of partitions at any time.
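A minimal transactional-producer sketch using the standard Java client; the bootstrap address, topic name, key/value, and transactional ID below are placeholders:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");      // placeholder address
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
// Stable application-level identity: setting this also implies idempotence.
props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "orders-ingest-1");      // placeholder ID

try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
    producer.initTransactions();  // fences any older instance with the same transactional.id
    producer.beginTransaction();
    producer.send(new ProducerRecord<>("orders", "order-42", "created"));
    producer.commitTransaction(); // atomically visible to read_committed consumers
}
```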

Throughput Ceiling: The Bandwidth-Delay Product #

A producer can keep at most max_in_flight batches on the wire per connection, so sustained throughput is capped by how quickly acknowledgments return:

\[\text{Throughput} \leq \frac{\text{max\_in\_flight} \times \text{batch.size}}{\text{RTT}}\]
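Plugging illustrative numbers into this bound makes the ceiling concrete (hypothetical helper, not a client API):

```java
// Sketch: bandwidth-delay product ceiling for a pipelined producer connection.
public class ThroughputCeiling {
    /** Upper bound on sustained throughput in bytes/sec. */
    public static double maxBytesPerSec(int maxInFlight, long batchSizeBytes, double rttSec) {
        return maxInFlight * batchSizeBytes / rttSec;
    }

    public static void main(String[] args) {
        // 5 in-flight 64 KB batches over a 10 ms RTT link: ≈ 32.8 MB/s ceiling,
        // no matter how fast the application generates data.
        System.out.println(maxBytesPerSec(5, 65_536, 0.010));
    }
}
```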

Idempotence and In-Flight Requests #

Historically, enabling idempotence required max.in.flight.requests.per.connection=1 — one outstanding request at a time — to prevent reordering during retries. This destroyed throughput on any link with non-trivial latency.

With idempotence enabled, the broker’s sequence number validation allows up to 5 requests in flight simultaneously. Even if the network reorders packets, the broker detects the out-of-order arrival and rejects the missequenced batch, prompting the client to retry in the correct order. This allows the producer to saturate the TCP link while guaranteeing records are committed in the exact order they were enqueued.

Redpanda’s Durability Default #

In standard Kafka, acks=all confirms replication to ISR members but relies on the OS page cache for actual disk persistence. A power failure on all ISR nodes simultaneously can lose acknowledged writes.

Redpanda defaults to syncing writes to disk before acknowledging. acks=all on Redpanda means disk-fsynced on a majority of nodes — not just page-cached. This aligns the logical guarantee with the hardware reality. The latency cost is real: the commit point now includes physical disk I/O on a majority of nodes. Amortizing this cost requires batching — a producer sending single records with acks=all pays the full fsync latency and network RTT on every write.

Baseline Configuration for Critical Pipelines #

import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

Properties props = new Properties();

// Enforce strong consistency
props.put(ProducerConfig.ACKS_CONFIG, "all");

// Enable idempotence — defaults to true in modern clients, but explicit is safer
props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");

// Allow retries to handle transient Raft leader elections
props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);
props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 120000);

// Allow pipelining while preserving order via sequence numbers
props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 5);

Wire-speed durability is a tradeoff managed by pipeline depth and batch size — not by compromising safety. acks=all with idempotence and max.in.flight=5 treats the cluster as a reliable, ordered log even as nodes crash and leaders rotate in the background.

Failure as a Load Multiplier: Error Classification, Backoff Discipline, and Retry-Storm Containment #

A producer operating under normal conditions is a predictable data source — bounded by network bandwidth and application generation rate. When the downstream cluster degrades, the producer undergoes a phase transition. It ceases to be a data source and becomes a load multiplier. A producer attempting to guarantee delivery can inadvertently sustain the very outage it is trying to survive. This is the retry storm.

Retry storm: positive feedback loop between producer retries and broker degradation

The Positive Feedback Loop #

When a broker is degraded, requests fail. The producer retries. Retries add load to an already saturated broker. More requests fail. More retries. The system enters a metastable failure state: throughput drops to near zero, yet network utilization remains high from the constant cycling of failed and retried requests.

The load on a cluster can be modeled as:

\[\frac{dL}{dt} \propto \lambda_{new} + f(L(t))\]

where $\lambda_{new}$ is new ingress traffic and $f(L(t))$ is the retry traffic generated by the current load. In a healthy state, $f(L(t)) \approx 0$. Once $L(t)$ exceeds the broker’s saturation point, $f(L(t))$ dominates — and the only way out is intervention at the source: the producer.

Error Classification: Three Archetypes #

Error classification: transient → retry, definitive → DLQ, semantic → discard

Not all errors warrant the same response. The Kafka protocol exposes dozens of error codes, but from the producer’s perspective they collapse into three operational archetypes:

  • Transient (retriable) — LEADER_NOT_AVAILABLE, NOT_LEADER_FOR_PARTITION, NETWORK_EXCEPTION. The system is temporarily unable to fulfill the request but the request itself is valid. These are the only candidates for automatic retry. However, blindly retrying all transient errors is dangerous. A TIMEOUT implies the request may or may not have been persisted — retrying without idempotence risks duplication.

  • Definitive (fatal) — TOPIC_AUTHORIZATION_FAILED, INVALID_CONFIG. The state of the world prevents success regardless of time. Retrying is a waste of cycles. The client must immediately fail the batch and notify the application.

  • Semantic (data-dependent) — MESSAGE_TOO_LARGE, INVALID_RECORD. These are properties of the payload. Particularly insidious: a single oversized record in a batch causes the entire batch to fail. If the producer naively retries, it enters a tight loop of rejection. The producer must support record-level inspection or circuit breaking to discard the poisoning record and allow the rest of the stream to proceed.

Backoff with Jitter: Decoupling Retry Timing #

The second line of defense is disciplined retry timing. Exponential backoff limits the rate at which retries escalate:

\[W_n = \min(C,\ B \cdot 2^n)\]

where $B$ is the base delay, $n$ is the retry attempt number, and $C$ is the maximum cap. Without the cap, backoff grows without bound; without a positive base delay, the first retry at $n=0$ fires immediately.

Exponential backoff alone is insufficient in distributed systems. If all producers back off on the same schedule — because they all hit the same error at the same time — they all retry simultaneously, creating a thundering herd. The solution is jitter: randomizing the wait within the backoff window.

\[W_n = \text{random}(0,\ \min(C,\ B \cdot 2^n))\]

Backoff with jitter: synchronized retries vs spread retries

Jitter decouples retry timing across producers. Instead of a synchronized wave of retries hitting the broker at $t + W_n$, retries are spread across the window, giving the broker time to recover between them.
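A "full jitter" implementation of the formula above, as a sketch — a hypothetical class, not part of the Kafka client, which handles its own retry.backoff.ms internally:

```java
import java.util.concurrent.ThreadLocalRandom;

// Sketch of full-jitter exponential backoff: W_n = random(0, min(C, B * 2^n)).
// Hypothetical helper class, not part of the Kafka client.
public class JitterBackoff {
    private final long baseMs; // B
    private final long capMs;  // C

    public JitterBackoff(long baseMs, long capMs) {
        this.baseMs = baseMs;
        this.capMs = capMs;
    }

    /** Uniform draw from [0, min(cap, base * 2^attempt)], clamping the shift to avoid overflow. */
    public long nextWaitMs(int attempt) {
        long ceiling = Math.min(capMs, baseMs << Math.min(attempt, 30));
        return ThreadLocalRandom.current().nextLong(ceiling + 1);
    }

    public static void main(String[] args) {
        JitterBackoff backoff = new JitterBackoff(100, 10_000);
        for (int n = 0; n < 5; n++) {
            System.out.printf("attempt %d: wait %d ms%n", n, backoff.nextWaitMs(n));
        }
    }
}
```

Because each producer draws independently, retries that would have landed at the same instant are spread across the window.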

Circuit Breakers: Enforcing Upper Bounds on Total Effort #

Backoff discipline controls the rate of retries. Circuit breakers enforce an upper bound on total effort. A circuit breaker tracks error rates and trips when they exceed a threshold — opening the circuit and rejecting all further attempts without even trying. This prevents a single misbehaving producer from continuously hammering a degraded broker.

The pattern maps directly onto error classification:

  1. On transient error — increment retry counter, apply jitter backoff, retry.
  2. On definitive error — increment fatal counter, trip circuit breaker, send to dead-letter queue (DLQ), do not retry.
  3. On semantic error — discard the poisoning record, log for inspection, continue with remaining records.

// Producer Callback: classify the error and route it per the three archetypes
@Override
public void onCompletion(RecordMetadata metadata, Exception exception) {
    if (exception == null) return; // success path

    if (isFatal(exception)) {
        fatalErrorCounter.increment();
        // Do not retry. Log, alert, or send to DLQ.
        breaker.trip(); // prevent future attempts
        return;
    }

    if (isRetriable(exception)) {
        retryCounter.increment();
        // Let the client's internal retry loop handle it,
        // but monitor for retry storms via retryCounter rate.
        return;
    }

    // Semantic error — discard record, continue stream
    deadLetterQueue.send(metadata, exception);
}

The Containment Playbook #

When a retry storm is detected — record-queue-time-max rising, broker throughput near zero, network utilization high — the containment steps are:

  1. Identify — check retry-rate metric on producers. A rising retry rate against falling broker throughput confirms the storm.
  2. Throttle — reduce producer send rate at the application level or via client-side rate limiting. This reduces $\lambda_{new}$ and gives the broker room to drain its backlog.
  3. Isolate — if a specific topic or partition is the source, circuit-break producers writing to it. Route to DLQ until the partition recovers.
  4. Recover — once the broker stabilizes, re-enable producers with a slow ramp-up. Do not restore full send rate immediately — the broker’s backlog may still be draining.
  5. Prevent — set delivery.timeout.ms as a hard upper bound on total retry duration. A producer that cannot deliver within the timeout fails fast rather than retrying indefinitely.
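Step 2 (Throttle) can be sketched as a client-side token bucket gating calls to send(). `SendThrottle` is a hypothetical helper, not part of the Kafka client:

```java
// Sketch: client-side token bucket that caps the producer's send rate while
// the broker drains its backlog. Hypothetical helper, not part of the client.
public class SendThrottle {
    private final double tokensPerSec; // sustained send rate
    private final double burst;        // maximum bucket size
    private double tokens;
    private long lastNanos;

    public SendThrottle(double tokensPerSec, double burst) {
        this.tokensPerSec = tokensPerSec;
        this.burst = burst;
        this.tokens = burst;
        this.lastNanos = System.nanoTime();
    }

    /** Returns true if a send is allowed now; refills tokens from elapsed time. */
    public synchronized boolean tryAcquire() {
        long now = System.nanoTime();
        tokens = Math.min(burst, tokens + (now - lastNanos) / 1e9 * tokensPerSec);
        lastNanos = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```

During recovery (step 4), the ramp-up is simply a gradually increasing tokensPerSec rather than an immediate return to full rate.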