The Distributed Log
A user places an order. The payment service debits the account. The inventory service needs to reserve the item. The notification service needs to send a confirmation. Each runs on a different machine, each has its own database, and they talk over a network that can drop packets, reorder messages, or go silent without warning. The payment succeeds but the inventory update times out. Did it happen or not? The notification fires anyway. Now the customer has a confirmation for an item that may or may not be reserved. The payment service retries and debits twice. Each service, acting rationally on the information it has, contributes to a system that is collectively wrong. There is no shared truth — only a collection of local states that have drifted apart.
The ideal is a single, ordered history that every service reads from. Not a shared database with its own consistency problems, not point-to-point messages that can be lost or duplicated — a permanent record where every event has a position, and that position never changes. Any service can read forward from any point and reconstruct exactly what happened, in the order it happened.
The distributed log is that ideal. Every event is appended once, gets a permanent offset, and stays there. The payment service writes “order placed” at offset 1042. The inventory service reads offset 1042 and reserves the item. The notification service reads offset 1042 and sends the confirmation. If the inventory service crashes mid-read, it restarts and reads from 1042 again — the record is still there. If the notification service is slow, it catches up in sequence. No service needs to coordinate with another in real time; they share a history, not a connection. The chaos of distributed state collapses into a single question: how far have you read?
Durability refers to the persistence of a record on disk across a sufficient number of nodes to survive failure. A record is durable when it has been written to the leader’s storage and replicated to a quorum of followers. The durability guarantee is probabilistic, defined by the replication factor and the independence of failure domains.
When a producer writes a record, it can ask the broker for different levels of confirmation — controlled by the acks setting. At one extreme, the producer sends and moves on without waiting. At the other, it waits until the write is confirmed by multiple brokers. What counts as “confirmed by multiple brokers” is defined by the In-Sync Replica set (ISR): the subset of replicas that are caught up with the leader within a configurable lag threshold. A replica falls out of ISR if it falls too far behind — due to network slowness, disk contention, or a slow follower. The ISR is not static; it shrinks and grows as cluster conditions change. Durability guarantees are only as strong as the current ISR membership.
The log makes three guarantees against this backdrop:
Ordering — guaranteed within a partition, not across them. Records appended to the same partition are always read in the order they were written. This guarantee does not extend across partitions. Partition count is not just a throughput dial; it is a constraint imposed by ordering requirements. Every additional partition is a declaration that data going there has no ordering relationship with data elsewhere.
Durability — a function of `acks` and ISR membership, not the write acknowledgment alone. `acks=0` is fire and forget. `acks=1` confirms the leader received the write — a leader crash before replication loses it permanently. Only `acks=all` means the write survived to the full In-Sync Replica set. The acknowledgment from `acks=1` feels like durability. It is not.
Visibility — governed by the High Water Mark, not the write acknowledgment. Consumers read only up to the HWM: the offset of the last record confirmed across all ISR members. The gap between the leader’s Log End Offset and the HWM is the uncommitted replication window — records on the leader’s disk that the ISR has not yet confirmed. Consumers are deliberately excluded from this window. If the leader crashes with records in it, the elected follower may not have them. A consumer that had already read those records would have processed events the cluster later decided never happened.
Visibility determines when a written record crosses from the broker’s internal state into what consumers can actually see. A record can be sitting on the leader’s disk — ordered, assigned an offset — and even copied to some followers, yet remain completely invisible to consumers. What unlocks readability is not the write, not partial replication, but a specific commitment signal from the broker.
That boundary is the High Water Mark (HWM). Consumers can only read records below it. The HWM advances when the last record has been successfully replicated to all In-Sync Replicas — not before. Everything above it exists in an intermediate state: written, possibly replicated to some nodes, but not committed.
Consider what happens without this boundary. The leader writes an order event at offset 1042 and a consumer reads it immediately. The consumer marks the order as confirmed and triggers fulfillment. Before the record replicates, the leader crashes. The new leader has no record of offset 1042 — it was never in the ISR. The order is gone. The fulfillment system is processing a ghost.
With the HWM, that sequence cannot happen. The leader writes offset 1042. The consumer asks to read but the HWM is at 1041 — the record hasn’t replicated yet. The leader replicates to the ISR, advances the HWM to 1042, then crashes. The new leader inherited offset 1042 — it was in the ISR. The consumer reads it, the order is confirmed, fulfillment proceeds on a record the cluster has guaranteed will not disappear.
The HWM is the line between what exists and what is safe to act on.
Consumers advance their position only up to the HWM — the offset of the last record the ISR has collectively confirmed. Nothing beyond that offset is readable, regardless of what the leader has locally. This gap between what the leader has written and what the ISR has confirmed is intentional. It is the mechanism that ensures a consumer never acts on a record the cluster might later discard. Without it, a leader election could erase records that consumers have already processed — events that, from the cluster’s perspective, never happened.
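The visibility rule can be sketched as a toy single-partition log. All names here are illustrative, not a real broker API: the point is that a record gets an offset at append time but only becomes fetchable once the full ISR has confirmed it.

```python
# Minimal sketch of a leader's log with a High Water Mark (HWM).
# Records above the HWM exist on the leader but are invisible to consumers.
class Partition:
    def __init__(self, isr_size):
        self.log = []            # leader's local log
        self.isr_size = isr_size
        self.acked = []          # acked[i] = ISR members holding offset i
        self.hwm = 0             # consumers may read offsets < hwm

    def append(self, record):
        self.log.append(record)
        self.acked.append(1)     # the leader itself holds it
        return len(self.log) - 1 # the record's permanent offset

    def replica_ack(self, offset):
        self.acked[offset] += 1
        # advance the HWM past every offset the full ISR has confirmed
        while self.hwm < len(self.log) and self.acked[self.hwm] == self.isr_size:
            self.hwm += 1

    def fetch(self, offset):
        # consumers never see past the HWM, even though the leader has more
        return self.log[offset] if offset < self.hwm else None

p = Partition(isr_size=3)
off = p.append("order placed")   # written, offset assigned...
assert p.fetch(off) is None      # ...but not yet readable
p.replica_ack(off)               # follower 1 confirms
p.replica_ack(off)               # follower 2 confirms: full ISR
assert p.fetch(off) == "order placed"
```

If the leader crashed before the two follower acknowledgments, the record would vanish with it, and no consumer would ever have seen it; that is exactly the ghost-order scenario the HWM prevents.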
The contract of the log rests on three properties:
- Ordered by the leader — the leader assigns offsets, establishing the canonical sequence. Every record has a permanent position.
- Hardened by replication — a record is not safe until the ISR has confirmed it. Replication is what converts a local write into a cluster-wide commitment.
- Revealed by the High Water Mark — consumers see only what the cluster has committed. The HWM is the line between internal state and consumer reality.
The guarantees you hear most about in streaming — at-most-once, at-least-once, exactly-once — are not settings you flip on a broker. They are not features. They are outcomes that emerge from how the producer sends, how the broker replicates, and how the consumer commits its position. Each one requires a specific combination of behaviors across all three. Get one wrong and the label no longer applies, regardless of what the configuration says. Treating them as product tiers rather than engineering contracts is how systems end up with the name of a guarantee and none of the substance.
At-Most-Once #
At-most-once is fire and forget. The producer sends a record and moves on — no waiting for acknowledgment, no retry on failure. If the network drops the packet or the broker crashes before writing to disk, the record is gone. The consumer never sees it, never knows it existed. In exchange, the producer is never blocked waiting for confirmation, which gives this mode the lowest latency and highest throughput of the three.
Consider the failure case. A metrics collector emits a CPU reading at offset 5510. The packet is dropped in transit. The broker never receives it. The consumer’s view of CPU usage has a gap at that timestamp — but with millions of readings per hour, the gap is statistically invisible. No alert fires incorrectly, no dashboard breaks.
Now the success case. The same collector emits the next reading. The packet arrives, the broker writes it, the consumer reads it. No acknowledgment was requested, no handshake occurred. The record moved from producer to consumer in the minimum number of steps the protocol allows.
At-most-once is the right choice when the cost of a gap is lower than the cost of the coordination required to prevent it. For high-volume telemetry, operational metrics, or any stream where occasional loss is acceptable by design, it is not a compromise — it is the correct engineering decision.
At-Least-Once #
At-least-once is the default foundation for data integrity. The producer sends a record and waits for an acknowledgment. If the acknowledgment doesn’t arrive — due to a network timeout, a broker crash, or a leader election mid-write — the producer retries. The guarantee is that the record will eventually land. The cost is that it may land more than once.
The failure case makes this concrete. A payment service sends a charge event. The broker writes it to offset 500 and replicates it. The broker sends the acknowledgment back, but the network drops it. The producer times out, sees no confirmation, and retries. The broker writes the same event again at offset 501. The acknowledgment arrives this time. The log now has two identical charge records. The consumer processes both. If the downstream payment processor is not idempotent, the customer is charged twice.
The success case is simpler. The same charge event is sent. The broker writes it to offset 500, replicates it, and the acknowledgment reaches the producer cleanly. No retry occurs. The consumer reads offset 500 once and processes it once. At-least-once delivered exactly-once behavior — not by guarantee, but by luck of a clean network path.
At-least-once effectively means guaranteed delivery with probable duplication during failure recovery. It is the right default for any system where losing a record is worse than processing it twice — which is most systems. The downstream responsibility is to handle duplicates gracefully, either through idempotent operations or explicit deduplication logic.
Exactly-Once #
Exactly-once is not a stronger version of at-least-once — it is a different mechanism. The producer still retries. The broker is what changes: it deduplicates. When a producer initializes with enable.idempotence=true, the broker assigns it a Producer ID (PID). Every batch the producer sends includes this PID and a monotonically increasing Sequence Number. The broker tracks the last sequence number it committed for each PID-partition pair. When a retry arrives, the broker checks: if the sequence number has already been committed, it acknowledges without writing again.
The failure case is the at-least-once scenario, resolved:
- Payment service sends charge event — PID=101, SeqNum=5.
- Broker writes it to offset 500 and replicates to the ISR.
- Broker sends acknowledgment. Network drops it.
- Producer times out — no ACK received. Retries with PID=101, SeqNum=5.
- Broker checks its state: SeqNum=5 for PID=101 is already committed.
- Broker acknowledges immediately. Nothing is written.
- Consumer reads offset 500. One charge event. One charge processed.
The success case is identical to at-least-once when no failure occurs. The charge event is sent, acknowledged cleanly, no retry triggered. The deduplication machinery never activates. The overhead exists in the broker’s state tracking, but the happy path is unchanged.
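The broker-side check reduces to a small amount of state. A sketch, with invented names, of deduplication keyed on the (PID, partition) pair:

```python
# Sketch of broker-side deduplication: the broker remembers the last
# committed sequence number per (PID, partition) and turns a retry of an
# already-committed batch into a pure re-acknowledgment.
class Broker:
    def __init__(self):
        self.log = []
        self.last_seq = {}   # (pid, partition) -> last committed SeqNum

    def produce(self, pid, partition, seq, record):
        key = (pid, partition)
        if self.last_seq.get(key, -1) >= seq:
            return "ack (duplicate, nothing written)"
        self.log.append(record)
        self.last_seq[key] = seq
        return "ack"

b = Broker()
assert b.produce(pid=101, partition=0, seq=5, record="charge $10") == "ack"
# the ack is lost in transit; the producer retries the identical batch
assert b.produce(pid=101, partition=0, seq=5, record="charge $10") \
    == "ack (duplicate, nothing written)"
assert len(b.log) == 1   # one charge event, not two
```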
The producer’s guarantee ends at the broker. What happens next depends entirely on when the consumer commits its offset — and that decision is where end-to-end correctness is won or lost.
- Commit after processing — if the consumer crashes between processing and committing, it restarts and reprocesses the same records. At-least-once processing.
- Commit before processing — if the consumer crashes between committing and processing, those records are skipped on restart. At-most-once processing.
Neither is safe by default. True end-to-end correctness requires coupling the offset commit with the side effect of processing — committing both atomically. In practice this means storing the offset in the same transaction as the derived state: in a transactional database alongside the write, or back into a Kafka topic within a transaction. If those two things don’t commit together, the guarantee breaks.
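The pattern of committing the offset and the derived state together can be sketched with an embedded database. Table names and record shapes here are illustrative, not a real consumer API; the load-bearing part is that both writes sit in one transaction.

```python
import sqlite3

# Sketch of coupling the offset commit to the processing side effect:
# the derived state and the consumer's position commit in one transaction,
# so a crash between "process" and "commit" cannot split them.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE balances (account TEXT PRIMARY KEY, total INTEGER)")
db.execute("CREATE TABLE offsets (part INTEGER PRIMARY KEY, next_offset INTEGER)")
db.execute("INSERT INTO offsets VALUES (0, 0)")

def process(part, offset, account, amount):
    (expected,) = db.execute(
        "SELECT next_offset FROM offsets WHERE part=?", (part,)).fetchone()
    if offset < expected:
        return              # already applied before a crash: skip the duplicate
    with db:                # BEGIN ... COMMIT: both writes succeed or neither does
        db.execute(
            "INSERT INTO balances VALUES (?, ?) "
            "ON CONFLICT(account) DO UPDATE SET total = total + excluded.total",
            (account, amount))
        db.execute("UPDATE offsets SET next_offset=? WHERE part=?",
                   (offset + 1, part))

process(0, 0, "alice", 10)
process(0, 0, "alice", 10)   # at-least-once redelivery: deduplicated
assert db.execute("SELECT total FROM balances").fetchone() == (10,)
```

Restart recovery follows for free: the consumer reads `next_offset` from the same database it writes state to, so it resumes exactly where the last committed transaction left off.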
The boundary of this guarantee is the producer session. The PID is ephemeral — if the producer process restarts, it gets a new PID and the broker’s deduplication state for the old PID no longer applies. A restart followed by a replay is outside the scope of idempotent production. For cross-restart exactly-once, you need Transactional IDs, which persist the PID across sessions and introduce a two-phase commit between producer and broker.
A log that grows indefinitely is operationally impossible. Retention policies are how the log abstraction survives finite storage — they define what the log forgets and what it keeps.
Time and size-based retention are the simplest form: keep the last 7 days, or the last 1TB, then delete the oldest segments. The log becomes a bounded buffer of recent events rather than a complete history. What’s important is that deletion does not renumber what remains. If offsets 0 through 999 are deleted, the next readable record is still at offset 1000. The Log Start Offset (LSO) advances permanently. Consumers that try to fetch below the LSO receive an OFFSET_OUT_OF_RANGE error and must reset — either to the new beginning or the end of the log.
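A sketch of that behavior, with invented names: deletion advances the Log Start Offset, surviving records keep their offsets, and a fetch below the LSO fails rather than silently returning different data.

```python
# Toy log with size/time-based retention: expiring old records advances
# the Log Start Offset (LSO) without renumbering what remains.
class Log:
    def __init__(self):
        self.records = {}       # offset -> record
        self.lso = 0            # Log Start Offset
        self.next = 0           # Log End Offset

    def append(self, r):
        self.records[self.next] = r
        self.next += 1

    def expire_before(self, offset):
        for o in range(self.lso, offset):
            self.records.pop(o, None)
        self.lso = offset       # advances permanently

    def fetch(self, offset):
        if offset < self.lso:
            raise LookupError("OFFSET_OUT_OF_RANGE")  # consumer must reset
        return self.records[offset]

log = Log()
for i in range(5):
    log.append(f"event-{i}")
log.expire_before(3)               # offsets 0-2 deleted
assert log.fetch(3) == "event-3"   # offset 3 is still offset 3
try:
    log.fetch(1)
except LookupError as e:
    assert str(e) == "OFFSET_OUT_OF_RANGE"
```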
Log compaction is the alternative for stateful applications where throwing away old data breaks the ability to reconstruct current state. Instead of deleting by age, compaction retains only the latest value for each unique key — semantically equivalent to a database table where each key has exactly one current row:
```
# Before compaction
0: key=A val=100
1: key=B val=200
2: key=A val=101   ← supersedes offset 0
3: key=C val=300
4: key=B val=NULL  ← tombstone: delete B

# After compaction
2: key=A val=101
3: key=C val=300
4: key=B val=NULL  ← tombstone retained until next compaction pass
```
Offsets are preserved. Superseded values are removed. A new consumer replaying the compacted log from the beginning gets a complete snapshot of current state without needing the full history of every intermediate update.
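The core of a compaction pass is a single rule: keep only the last occurrence of each key. A sketch, modeling the log as an offset-to-(key, value) mapping with `None` as a tombstone:

```python
# Sketch of one compaction pass: retain the newest record per key,
# preserving offsets. Tombstones (value None) survive this pass so that
# consumers replaying the log still observe the deletion.
def compact(log):
    latest = {}                       # key -> offset of its newest value
    for offset, (key, value) in log.items():
        latest[key] = offset
    return {off: kv for off, kv in log.items() if latest[kv[0]] == off}

log = {0: ("A", 100), 1: ("B", 200), 2: ("A", 101),
       3: ("C", 300), 4: ("B", None)}          # None = tombstone for B
assert compact(log) == {2: ("A", 101), 3: ("C", 300), 4: ("B", None)}
```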
| Guarantee | Producer retries? | Broker deduplicates? | Consumer sees |
|---|---|---|---|
| At-most-once | No | No | 0 or 1 records |
| At-least-once | Yes | No | 1 or more records |
| Exactly-once | Yes | Yes (PID + SeqNum) | Exactly 1 record |
The log has a deeper property that makes it more than a transport layer: it records changes, and changes are state in disguise.
A stream is a sequence of events. State is what you get when you apply those events in order.
\[\text{State}(t) = \text{State}(0) + \sum_{i=1}^{t} \text{Apply}(\text{Event}_i)\]
The log holds every Event(i). State(0) is the empty baseline. Apply is your application logic. Replay the log from the beginning, apply each event in sequence, and you reconstruct exactly where the system is — not approximately, not eventually, but deterministically. This is the principle behind event sourcing, and it reframes what a database actually is: not the system of record, but a materialized view of it. The log is the system of record.
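That replay rule is a left fold over the log. A sketch, assuming a toy event shape of `(kind, key, value)` tuples:

```python
# Sketch of event-sourced state: a deterministic fold of Apply over the
# ordered event log, starting from the empty baseline State(0).
def apply(state, event):
    kind, key, value = event
    if kind == "set":
        state[key] = value
    elif kind == "delete":
        state.pop(key, None)
    return state

def replay(events):
    state = {}                       # State(0): the empty baseline
    for e in events:                 # same order in, same state out
        state = apply(state, e)
    return state

log = [("set", "user:1", "alice"), ("set", "user:2", "bob"),
       ("set", "user:1", "alice2"), ("delete", "user:2", None)]
assert replay(log) == {"user:1": "alice2"}
assert replay(log) == replay(log)    # replay is repeatable
```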
This has practical consequences. A search index becomes corrupted — delete it and rebuild from the log. A revenue calculation had a bug for six months — reprocess the historical events with the corrected logic and the derived state corrects itself. Any downstream store is disposable as long as the log is intact. The log is the only thing that needs to be durable.
This reliability depends entirely on the two invariants established earlier. Ordering ensures determinism — applying events in the same sequence always produces the same state. Durability ensures the history has no holes — a gap in the log means a gap in the reconstructed state that no amount of reprocessing can fill.
But a single linear log cannot scale indefinitely. When throughput exceeds what a single disk or network interface can handle, the log must be partitioned — split into independent segments that can be written and read in parallel. Partitioning trades total order for throughput. Each partition is its own ordered sequence, but there is no guaranteed ordering across partitions. This introduces two new problems: routing (which partition receives a given event?) and coordination (how does the cluster agree on what is committed?). Solving coordination at scale requires a consensus mechanism — and consensus requires replicas, leaders, and followers. The abstract log becomes a physical architecture.
Topics and Partitions #
In distributed streaming systems, the mental model provided by the API often diverges from the physical reality of the engine. While the Kafka protocol presents a hierarchy of topics, partitions, and replicas, Redpanda implements these concepts through a distinct architectural lens: the Raft consensus group. Understanding this mapping is essential for reasoning about performance, concurrency limits, and failure modes. A topic is an administrative abstraction; a partition is a physical unit of parallelism and consistency; and leadership is the dynamic state that governs availability.
The Topic as a Policy Envelope #
A topic in Redpanda functions primarily as a namespace and a container for configuration policy. It defines what the data is — schema, retention rules, replication factor — but does not dictate how the data is processed or stored physically. When you create a topic, you are defining a contract for durability and visibility.
Operational policies such as retention (time-based or size-based) and compaction strategies apply at the topic level but are enforced at the partition level. The topic serves as a template; the actual work of storage, replication, and retrieval occurs within the partitions. Consequently, topic-level metrics are often aggregates of partition-level behaviors. A topic with high throughput is simply a collection of hot partitions. A topic with high latency is often a symptom of a single degraded partition leader.
Partitions: The Unit of Ordering and Parallelism #
The partition is the atomic unit of the system. It represents the intersection of three critical constraints:
- Ordering — a partition is the largest scope in which strict ordering is guaranteed. Records appended to a partition are assigned monotonically increasing offsets (0, 1, 2, …).
- Concurrency — a partition can be actively written to by exactly one leader at a time. While a topic can scale horizontally across hundreds of cores, a single partition is bound by the sequential processing capacity of its leader.
- State Ownership — in Redpanda, a partition maps one-to-one with a Raft consensus group. Every partition has its own independent leader election, transaction log, and failure domain.
This architecture implies a direct cost model. Increasing partition count increases potential parallelism, allowing more concurrent producers and consumers. However, each partition introduces overhead: metadata management, heartbeat traffic, and memory for indices. Partition counts should be chosen based on required throughput and consumer parallelism, not set arbitrarily high.
The choice of partition key determines data placement and causal boundaries. Because ordering is guaranteed only within a partition, the partition key defines the scope of consistency. If two events must be processed in order — “create user” followed by “update user” — they must resolve to the same partition key. If no key is provided, the system defaults to round-robin or sticky partitioning, maximizing throughput at the expense of semantic ordering between related events.
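Keyed routing is a stable hash of the key modulo the partition count. Real Kafka clients use murmur2 for this; the hash below is illustrative, and what matters is only that equal keys always resolve to the same partition:

```python
import hashlib

# Sketch of keyed partition routing. The hash function is an assumption
# (real clients use murmur2); the invariant is key-stability.
def partition_for(key: bytes, num_partitions: int) -> int:
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

p1 = partition_for(b"user-42", 6)
p2 = partition_for(b"user-42", 6)
assert p1 == p2          # "create user" and "update user" for user-42
assert 0 <= p1 < 6       # share a partition, hence share an order
```

Note the corollary: changing the partition count changes the modulus, which remaps keys to different partitions and breaks ordering continuity for existing keys.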
Replication: The Write Path and Commit Semantics #
Replication transforms a volatile write into a durable record. In Redpanda, each partition has one leader and multiple followers. The leader handles all produce and fetch requests; followers replicate to remain in sync. Redpanda enforces this through the Raft consensus algorithm.
When a producer sends a record to a partition leader:
- Append — the leader writes the record to its local log and assigns it an offset.
- Replicate — the leader broadcasts the `AppendEntries` RPC to its followers.
- Commit — once a majority of replicas have acknowledged the write, the record is considered committed.
- Acknowledge — the leader acknowledges the write to the producer (assuming `acks=all`).
This majority commit rule is the hard architectural invariant. No committed data is lost as long as a majority of the replica set remains intact.
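The commit step reduces to a small computation. A sketch, assuming each replica (leader included) reports the highest offset it has stored: the committed offset is the median of those match indexes when ranked.

```python
# Sketch of the Raft majority commit rule: the committed offset is the
# largest offset held by at least a majority of the replica set.
def committed_offset(match_index):
    """match_index: highest offset stored on each replica, leader included."""
    ranked = sorted(match_index, reverse=True)
    majority = len(match_index) // 2 + 1
    return ranked[majority - 1]      # offset held by >= a majority

# 3 replicas: leader at 1042, one follower caught up, one lagging
assert committed_offset([1042, 1042, 998]) == 1042
# only the leader has 1042 -> not committed yet
assert committed_offset([1042, 998, 998]) == 998
```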
The ISR acts as a safety valve. If a follower falls too far behind or becomes unresponsive, it is removed from the ISR. If the ISR drops below min.insync.replicas, the leader rejects writes with NOT_ENOUGH_REPLICAS — prioritizing consistency over availability.
Leadership as a Control Mechanism #
Leadership is not a static property; it is a leased role that can move between nodes. The leader is the serialization point for all state transitions within the partition — it decides the order of writes and gates the visibility of records to consumers. Leadership movement is the primary source of transient unavailability in the system.
When a leader fails, the remaining followers detect the absence of heartbeats and trigger an election:
- Detection — followers notice the leader is silent.
- Election — a candidate requests votes. If it has a sufficiently up-to-date log, it wins.
- Term Change — the logical clock (term) advances, invalidating the old leader.
- Catch-up — the new leader asserts authority and ensures all followers align with its log.
During this window — typically milliseconds — the partition is unavailable for writes. Producers may observe timeout errors or leader-epoch mismatches, forcing them to refresh metadata and retry. This behavior is why client-side retries are part of the contract, not just error handling.
Redpanda employs leadership balancing to distribute this load. If one broker holds leadership for a disproportionate number of partitions, it creates a hot spot on that node’s CPU and network. The background balancer proactively transfers leadership to less loaded nodes, spreading the cost of serialization evenly across available cores.
Consumer Groups: Mapping Partitions to Readers #
While partitions define the supply of parallelism, consumer groups define the demand. A consumer group is a logical subscriber that coordinates the reading of a topic. The group protocol assigns each partition to exactly one consumer process within the group.
This exclusive assignment is what allows consumers to process data without locking. Consumer A knows it is the sole reader of Partition 0 — it can buffer, batch, and commit offsets without coordinating with Consumer B on Partition 1. If the number of consumers exceeds the number of partitions, the surplus consumers remain idle. If there are fewer consumers than partitions, individual consumers are assigned multiple partitions.
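The assignment rule can be sketched as a round-robin-style distribution (real group coordinators support several assignment strategies; this illustrates only the exclusivity invariant):

```python
# Sketch of consumer-group assignment: every partition is owned by exactly
# one consumer; surplus consumers idle, scarce consumers own several.
def assign(partitions, consumers):
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# fewer consumers than partitions: A and B each own two
assert assign([0, 1, 2, 3], ["A", "B"]) == {"A": [0, 2], "B": [1, 3]}
# more consumers than partitions: C sits idle
assert assign([0, 1], ["A", "B", "C"]) == {"A": [0], "B": [1], "C": []}
```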
The consumer group coordinator — a broker-side component — manages this assignment. When a consumer joins or leaves the group, or when topic metadata changes, the group performs a rebalance: partition ownership is revoked and reassigned. Minimizing the frequency and duration of rebalances is a key operational concern covered in depth in the consumer reliability chapter.
Redpanda’s Architecture #
Most Kafka-compatible brokers share the same operational DNA: a JVM process, ZooKeeper for coordination, and a thread pool that multiplexes I/O across partitions. Redpanda replaces all three. It is a single statically-linked C++ binary with no external dependencies, no garbage collector, and no separate coordination service. The architectural choices are not cosmetic — they change the performance envelope, the failure model, and the operational surface area. Understanding why requires looking at what each eliminated component was actually doing.
The Single Binary: No ZooKeeper, No JVM #
Traditional Kafka separates concerns across processes. ZooKeeper manages cluster metadata and leader elections. The broker manages data. The two must stay in sync. This split creates a coordination dependency that adds latency on every metadata operation and a failure domain that has nothing to do with data.
Redpanda eliminates this split. Metadata coordination, leader election, and data replication all run inside a single process. There is no ZooKeeper to bootstrap, no separate controller to manage, no cross-process metadata lag. The cluster is the broker. Operationally, this means one binary to deploy, one process to monitor, and one failure domain to reason about.
Thread-Per-Core: No Locks, No GC #
Redpanda is built on Seastar, a C++ framework that assigns each CPU core its own dedicated thread and event loop — called a reactor. Each reactor owns a fixed set of partitions. Partition I/O never crosses thread boundaries. There is no shared mutable state between reactors, so there are no locks, no context switches, and no contention.
This has three direct consequences:
- Predictable tail latency — no GC pauses, no lock contention spikes. P99 latency tracks P50 latency closely. In JVM-based brokers, GC pauses can add hundreds of milliseconds to tail latency at random intervals.
- CPU-bound scaling — throughput scales linearly with core count. Adding cores adds capacity without coordination overhead.
- Partition isolation — a hot partition on core 3 does not degrade partitions on core 7. Each reactor’s load is independent.
The cost model is explicit: one reactor per core means one thread per core means one partition set per core. If a partition’s workload exceeds what a single core can handle, the only solution is to split the partition — not add threads.
Raft as the Single Consensus Primitive #
In traditional Kafka, ZooKeeper handles cluster-wide consensus. Raft handles nothing — replication is a custom ISR-based protocol, not a general consensus algorithm. Redpanda uses Raft for everything: per-partition replication, controller elections, and cluster metadata management. Every partition is a Raft group. The controller itself is a Raft group.
This unification has direct consequences:
- No split-brain on metadata — metadata changes go through Raft consensus, not an eventually-consistent ZooKeeper watch. A leader election and the metadata update propagating to clients are part of the same protocol.
- Consistent failure semantics — a partition leader crash and a controller crash are both handled by Raft election. The recovery path is the same mechanism, not two different codepaths.
- Deterministic durability — Redpanda syncs writes to disk before acknowledging by default. In standard Kafka, `acks=all` confirms replication to ISR members but relies on OS page cache flushing for actual disk persistence. Redpanda’s default aligns the logical guarantee with the hardware reality.
What This Means for You as a Client #
From a client perspective, the Kafka wire protocol is unchanged. Producers, consumers, and admin clients connect the same way. The behavioral differences are in the performance envelope:
- Lower and more consistent latency — no GC pauses means no unexpected spikes at the tail.
- Higher throughput per node — zero-copy networking and thread-per-core eliminate the overhead that JVM brokers pay on every I/O operation.
- Stronger durability defaults — `acks=all` on Redpanda means disk-fsynced on a majority of nodes, not just page-cached.
The protocol is Kafka. The guarantees are stronger. The operational footprint is smaller.
Protocol Compatibility #
The Kafka wire protocol is not Kafka. It is a binary RPC specification — a set of API keys, versioned request/response schemas, and behavioral contracts that any broker can implement. Redpanda implements this specification without sharing a line of Kafka’s codebase. This distinction matters: protocol compatibility means clients connect and operate identically. It does not mean the broker behaves identically under the hood. Understanding where the protocol ends and the implementation begins is what separates correct assumptions from subtle bugs.
The Wire Format: Binary RPC over TCP #
Every Kafka client interaction is a request/response pair over a persistent TCP connection. Each request carries a fixed header:
- api_key — identifies the operation (Produce=0, Fetch=1, Metadata=3, etc.)
- api_version — the version of the schema for this request
- correlation_id — a client-assigned integer echoed back in the response for matching async replies
- client_id — an optional string identifying the client application
The broker reads the header, dispatches to the correct handler, and returns a response with the same correlation_id. There is no session state beyond what the client tracks. The protocol is stateless at the transport layer.
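The header fields listed above can be packed directly. A sketch of one common layout (the v1 request header: big-endian int16 `api_key`, int16 `api_version`, int32 `correlation_id`, then `client_id` as an int16-length-prefixed string); the full request is additionally size-prefixed on the wire, which is omitted here:

```python
import struct

# Sketch of packing a Kafka-style request header. Field widths follow the
# v1 header layout; treat this as illustrative, not a full codec.
def request_header(api_key, api_version, correlation_id, client_id):
    cid = client_id.encode()
    return struct.pack(">hhih", api_key, api_version, correlation_id,
                       len(cid)) + cid

header = request_header(api_key=0, api_version=9,     # Produce=0
                        correlation_id=7, client_id="orders-svc")
# echoing the correlation_id back is all the broker needs to match
# an async response to its request
api_key, api_version, correlation_id = struct.unpack(">hhi", header[:8])
assert (api_key, api_version, correlation_id) == (0, 9, 7)
```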
ApiVersions Negotiation: How Compatibility is Enforced #
On first connect, every client sends an ApiVersions request before any other operation. The broker responds with a list of every API key it supports and the minimum and maximum version it accepts for each. The client selects the highest mutually supported version per API key and uses that version for all subsequent requests.
This negotiation is what makes rolling upgrades safe. A client on version 2.8 connecting to a Redpanda broker negotiates down to the versions both understand. Neither side needs to know the other’s exact version — only the intersection of their supported ranges. When Redpanda adds a new API version, old clients continue using the old version. New clients use the new version. Both work simultaneously.
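The selection logic is pure range intersection. A sketch, representing each side's support as `{api_key: (min_version, max_version)}`:

```python
# Sketch of ApiVersions negotiation: for each API key, the client uses the
# highest version that falls inside both its own range and the broker's.
def select_versions(client, broker):
    chosen = {}
    for api_key, (c_min, c_max) in client.items():
        if api_key not in broker:
            continue                      # broker doesn't support this API
        b_min, b_max = broker[api_key]
        high = min(c_max, b_max)
        if high >= max(c_min, b_min):     # the ranges actually overlap
            chosen[api_key] = high
    return chosen

client = {0: (0, 9), 1: (0, 13)}          # Produce, Fetch
broker = {0: (3, 8), 1: (4, 16)}
assert select_versions(client, broker) == {0: 8, 1: 13}
```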
The practical consequence: never hardcode API versions in client configuration. Let the negotiation happen. Pinning to a specific version defeats the compatibility mechanism and will break silently when either side evolves.
Leader Epoch Fencing: Detecting Stale Leaders #
Leader elections create a window where a client may be talking to a broker that was the leader but no longer is. Without a mechanism to detect this, a producer could write to a stale leader — a broker that accepted the write but whose log will be discarded when the new leader takes over.
The Kafka protocol handles this with leader epochs: a monotonically increasing integer that advances every time a new leader is elected for a partition. Every produce and fetch request includes the client’s current leader epoch. If the broker receives a request with a stale epoch, it rejects it with NOT_LEADER_FOR_PARTITION or FENCED_LEADER_EPOCH. The client must refresh its metadata and retry against the new leader.
This fencing is what makes acks=all with idempotent producers safe across leader elections:
- Client sends Produce with leader_epoch=3.
- Leader crashes. New leader elected with epoch=4.
- Client retries with leader_epoch=3.
- New leader rejects: `FENCED_LEADER_EPOCH`.
- Client refreshes metadata. Discovers new leader, epoch=4.
- Client retries with leader_epoch=4. Write succeeds.
Protocol vs. Behavioral Compatibility #
Redpanda passes the Kafka protocol test suite. A producer that works against Kafka works against Redpanda — same API keys, same versions, same error codes. This is protocol compatibility.
Behavioral compatibility is narrower. Some Kafka configuration knobs have no equivalent in Redpanda’s architecture. Some defaults differ. Some guarantees are stronger:
- `acks=all` in Kafka confirms replication to the ISR but may rely on OS page cache for disk persistence. In Redpanda, it confirms an fsync on a majority of nodes.
- Kafka’s controller uses ZooKeeper for elections. Redpanda’s controller uses Raft. The visible behavior is the same; the failure semantics differ.
- Kafka supports `unclean.leader.election.enable=true` — electing an out-of-sync replica to restore availability at the cost of data loss. Redpanda does not support this; it prioritizes consistency.
The rule: if your client code interacts only with the Kafka wire protocol, it is fully compatible. If your operational tooling relies on ZooKeeper APIs, JMX metrics, or specific Kafka broker internals, it will need adjustment.