Kafka-Class Event Streaming Platform Analysis Note
Table of Contents
Kafka-Class Event Streaming Platform Analysis Note #
This note captures the full step-by-step analysis for a Kafka-class event streaming platform: partitioned immutable log, consumer-group offset management, leader-based partition ownership, replication, and replayable consumption.
Step 1 — Normalize #
Assume the baseline prompt is:
- design a Kafka-class event streaming platform
- producers publish to topics
- topics are partitioned
- consumers read in order from partitions
- consumers manage offsets/progress
- system should support replay and scalable fanout
Normalize into state-affecting paths.
| Requirement | Actor | Operation | State touched | Priority |
|---|---|---|---|---|
| Producer publishes event to topic partition | Client | append event | S1create targetPartitionLog | C1 |
| Consumer reads events from partition | Client | read source | S1read source targetPartitionLog | R1 |
| Consumer group commits offset/progress | Client | overwrite state | S1update targetConsumerOffset | C1 |
| System routes topic partition to current leader | System | read source | S1read source targetPartitionMap | C1 |
| System reassigns partition leadership after node failure | System | state transition | S1update targetPartitionOwnership | C1 |
| System replicates partition log to followers | System | async process | S1hidden write targetReplicaState | C1 |
| Consumer group rebalances partition assignment | System | async process | S1hidden write targetConsumerAssignment | C1 |
| Client reads topic/lag metrics | Client | read projection | S1read projection targetTopicMetricsView | R2 |
Notes on normalization:
Important choices:
- publish is
append event- event log is immutable
- consumer read is
read source- consumers read authoritative log
- offset commit is
overwrite state- current progress is the main truth for a group/partition
- partition ownership and replication are explicit because this is distributed infra
- consumer rebalance is an internal coordination flow
Likely C1:
- publish
- offset commit
- partition routing
- partition leadership reassignment
- replication
- consumer-group assignment changes
Likely R1:
- consumer read from partition
This is already clearly:
- log + consumer progress not:
- queue with in-flight visibility timeout
Step 2 — Critical Path Selection #
| Requirement | Priority class | Why |
|---|---|---|
| Publish event to partition | C1 | event durability and ordered append are core truth |
| Consumer reads events from partition | R1 | core serving path |
| Consumer group commits offset/progress | C1 | delivery/progress semantics depend on correct offset state |
| Route partition to current leader | C1 | wrong routing breaks authoritative append order |
| Reassign partition leadership after node failure | C1 | failover must preserve ordered committed log and authority |
| Replicate partition log to followers | C1 | durability and failover correctness depend on it |
| Rebalance consumer assignments | C1 | one group member should own a partition at a time for ordered processing in group semantics |
| Read topic/lag metrics | R2 | operational only |
Baseline critical paths:
Main C1 paths:
P1publish eventP2commit consumer offsetP3route to partition leaderP4reassign partition leadershipP5replicate partition logP6rebalance consumer assignments
Main R1 path:
P7read events from partition
What drives the design:
- one authoritative append order per partition
- one current leader per partition
- durable replication of committed log
- monotonic consumer-group progress per partition
- controlled consumer-group assignment
Step 3 — Primary State Extraction #
For a Kafka-class platform, the minimal primary state is the partition log, consumer progress, partition ownership/routing, and replica state.
| Candidate object label | Candidate source | Candidate needed for C1/R1? | Candidate decomposition action | Class | Primary? | Owner | Evolution | Scope kind | Scope value |
|---|---|---|---|---|---|---|---|---|---|
| PartitionLog | direct noun | Yes | keep as candidate | event | Yes | service | append-only | collection | topic_id + partition_id |
| ConsumerOffset | hidden write target | Yes | keep as candidate | relationship | Yes | service | overwrite | relation | group_id + topic_id + partition_id |
| PartitionOwnership | hidden write target | Yes | keep as candidate | process | Yes | service | state machine | instance | topic_id + partition_id |
| PartitionMap | hidden write target | Yes | keep as candidate | entity | Yes | service | overwrite | collection | topic partitions |
| ReplicaState | hidden write target | Yes | keep as candidate | process | Yes | service | state machine | relation | partition_id + replica_id |
| ConsumerAssignment | hidden write target | Yes | keep as candidate | relationship | Yes | service | overwrite | relation | group_id + partition_id |
| TopicMetricsView | derived read model | No | reject as UI artifact | projection | No | derived | overwrite | collection | topic_id |
Important modeling choices:
PartitionLog #
This is the core truth of the system:
- immutable ordered event sequence per partition
ConsumerOffset #
This is also primary:
- it records current group progress through a partition
- replay behavior and delivery semantics depend on it
PartitionOwnership #
Needed because:
- one leader/owner must serialize authoritative appends for a partition
ReplicaState #
Needed because:
- failover must know which replicas are sufficiently current
ConsumerAssignment #
Needed because:
- within a consumer group, partition ownership by consumers is meaningful for ordered processing and rebalance correctness
Minimal strict primary set:
PartitionLogConsumerOffsetPartitionOwnershipPartitionMapReplicaStateConsumerAssignment
Step 4 — Hard Invariants #
For a Kafka-class platform, the hard invariants are about authoritative append order, valid partition leadership, durable replication, and monotonic consumer progress.
| Path | Tier | Type | Invariant template | Invariant statement |
|---|---|---|---|---|
P1write pathPublish event | HARD | ordering | ordering template | Instances partition log appends are ordered by authoritative commit order within topic_id + partition_id. |
P1write pathPublish event | HARD | eligibility | eligibility template | Action append_event is valid only if request is routed to current partition leader and append preconditions, if any, hold on topic_id + partition_id at decision time. |
P2write pathCommit consumer offset | HARD | ordering | ordering template | Instances consumer offsets are ordered by monotonic progress within group_id + topic_id + partition_id. |
P2write pathCommit consumer offset | HARD | eligibility | eligibility template | Action commit_offset is valid only if new_offset is not behind current committed offset for group-partition scope at decision time. |
P3write pathRoute to partition leader | HARD | uniqueness | uniqueness template | Key topic_id + partition_id maps to at most one logical outcome current authoritative leader within that partition scope. |
P4write pathReassign partition leadership | HARD | eligibility | eligibility template | Action reassign_partition_leader is valid only if current leader failed or relinquished and candidate replica is eligible and sufficiently current on topic_id + partition_id at decision time. |
P5write pathReplicate partition log | HARD | accounting | accounting template | Replica state for topic_id + partition_id equals authoritative committed partition log modulo bounded replication lag. |
P6write pathRebalance consumer assignments | HARD | uniqueness | uniqueness template | Key group_id + topic_id + partition_id maps to at most one logical outcome current assigned consumer within that group-partition scope. |
P7read pathRead events from partition | HARD or SOFT | freshness | freshness template | Read path consume_partition reflects authoritative partition log within configured consistency bound. |
What matters most:
1. One authoritative append order per partition #
This is the central Kafka-like invariant.
2. Monotonic consumer progress #
Offsets should not go backward accidentally for the same group/partition scope unless the API explicitly supports reset semantics.
3. Valid failover only to sufficiently current replica #
Otherwise committed events may be lost or reordered.
4. Assignment uniqueness within group #
One consumer group member should own a partition at one time for ordered processing.
Step 5 — Execution Context #
For the Kafka-class baseline:
| Field | Value | Why |
|---|---|---|
| Topology | single service distributed | one logical streaming platform across many nodes |
| Write coordination scope | per object scope | correctness is per partition, per group-partition offset, and per assignment scope |
| Read consistency target | strong only | safest baseline for publish/offset correctness; read relaxation can be added later |
| Holder model | node | partition leadership is held by broker nodes |
| Compensation acceptable? | No | lost committed events or split leaders cannot be safely repaired later |
Derived implications:
holder_may_crash = true- brokers can fail while leading a partition
cross_service_write = false- baseline keeps log, offsets, assignment, and routing within one logical service
bounded_staleness_allowed = false- for authoritative publish and offset paths in the baseline
cross_service_atomicity_required = false- no multi-service transaction required in baseline
exclusive_claim_required = true- partition leadership and group-partition assignment require exclusivity
guarded_by_current_state = true- offset commits, failover, and assignment changes depend on current state
This pushes us toward:
- one leader per partition
- append-only authoritative partition log
- durable follower replication
- monotonic offset updates
- fenced partition leadership
Step 6 — Deterministic Mechanism Selection #
6A. Write Shape #
| Path | Why | Write shape |
|---|---|---|
P1 publish event | immutable append to ordered partition log | append-only event |
P2 commit consumer offset | replace current progress value subject to monotonicity | guarded state transition |
P3 route to partition leader | one current authoritative leader per partition | exclusive claim |
P4 reassign partition leadership | valid only if ownership/failover state allows it | guarded state transition |
P5 replicate partition log | followers consume authoritative append stream | append-only event |
P6 rebalance consumer assignments | one current consumer per group-partition scope | exclusive claim |
6B. Base Mechanism #
| Path | Write shape | Base mechanism | Required companions |
|---|---|---|---|
P1 publish event | append-only event | append log | commit index, producer idempotency if required |
P2 commit consumer offset | guarded state transition | CAS on version or leader-applied monotonic overwrite | offset version |
P3 route to partition leader | exclusive claim | lease | fencing token, heartbeat |
P4 reassign partition leadership | guarded state transition | CAS on (state, version) | fencing token, replica catch-up check |
P5 replicate partition log | append-only event | append log | replica ack/progress |
P6 rebalance consumer assignments | exclusive claim | lease or coordinator-managed assignment epoch | assignment epoch |
Why these fit:
Publish #
Kafka-class logs are fundamentally:
- immutable append-only event streams
Offset commit #
Offset is current progress state, so it is not itself append-only truth. It is:
- current value
- but constrained by monotonicity
So a guarded overwrite/transition is the best classification.
Leadership and assignment #
Both are exclusive-ownership problems:
- one current partition leader
- one current consumer per group-partition assignment scope
Canonical substrate implied:
- partitioned broker cluster
- one leader per partition
- follower replication per partition
- dedicated or embedded group coordinator for assignments/offsets
Step 7 — Read Model / Source of Truth #
For a Kafka-class platform, truth is mostly direct source state. Operational metrics are projections.
| Concept | Truth | Read path | Rebuild path |
|---|---|---|---|
C1source conceptPartition event stream | PartitionLog | read source directly | authoritative partition log segments |
C2source conceptConsumer-group progress | ConsumerOffset | read source directly | authoritative offset store |
C3source conceptPartition leadership | PartitionOwnership | read source directly | authoritative ownership store |
C4source conceptPartition routing map | PartitionMap | read source directly | authoritative routing metadata |
C5source conceptReplica catch-up state | ReplicaState | read source directly | recompute from commit index and replica progress |
C6projection conceptLag / topic metrics | derived from logs, offsets, and replica state | materialized view | recompute from primary state |
Important point:
For the correctness-critical baseline:
- producers append to authoritative partition log via leader
- consumers read from authoritative log
- offsets are read and written against authoritative group-partition progress
- metrics and lag dashboards are projections only
Step 8 — Failure Handling #
| Path | Retry | Competing writers | Crash after commit | Publish failure | Stale holder |
|---|---|---|---|---|---|
P1 publish event | retry with producer idempotency key/sequence if required | partition leader serializes appends; stale leader loses authority | committed event survives leader crash if replicated past commit point | producer may retry; idempotent producer avoids duplicate append | stale leader fenced by epoch/term |
P2 commit consumer offset | retry safe with monotonic offset/version checks | stale/lower offset loses guarded transition | committed offset survives crash if persisted | n/a | stale coordinator/owner fenced by assignment epoch/version |
P3 route to partition leader | retry after refreshing partition map | only one valid leader should exist | refreshed map points to new leader after failover | n/a | stale owner rejected by fencing token |
P4 reassign partition leadership | retry failover safely | only one reassignment wins current ownership state | promoted leader crash triggers later failover | n/a | old leader fenced and must not serve writes |
P5 replicate partition log | follower retries from last known offset/index | followers independently catch up from committed log | leader crash stops new replication until failover; committed log remains truth | missing append retried from commit index | stale replica cannot become leader unless sufficiently current |
P6 rebalance consumer assignments | retry rebalance with assignment epoch | only one assignment per group-partition scope should be active | rebalance coordinator crash delays reassignment but next epoch retries | n/a | stale consumer assignment fenced by generation/assignment epoch |
What matters most:
1. Partition leader fencing #
An old leader must not continue accepting appends after failover.
2. Producer idempotency #
If producer retry happens after response loss:
- duplicate append is possible unless producer idempotency is used
3. Offset monotonicity #
Client retries or misordered commits must not silently move offsets backward unless explicit reset APIs exist.
4. Replica catch-up before leadership #
A stale replica becoming leader can lose acknowledged events.
Step 9 — Scale Adjustments #
| Hotspot | Type | First response |
|---|---|---|
| hot partition with too much producer traffic | contention hotspot | increase partition count and rebalance producers across partitions |
| hot consumer group coordinator / offset store | write throughput hotspot | shard consumer-group metadata and isolate busy groups |
| replication lag on busy partitions | write throughput hotspot | batch log replication, tune follower IO, and isolate hot partitions |
| leader-heavy read/write pressure | contention hotspot | spread partition leadership and add more brokers |
| rebalance storms in consumer groups | contention hotspot | reduce eager rebalances, use cooperative/sticky assignment, and slow membership churn |
| log storage growth | storage growth hotspot | segment retention, compaction where applicable, and tiered/archival storage |
| lag/metrics fanout | read hotspot | keep lag and topic metrics as derived views only |
What scales well:
This system scales by:
- partitioning topics
- making each partition an independent ordered append stream
- replicating partitions independently
- keeping offsets and assignments scoped by group-partition
Throughput grows mainly with:
- number of partitions
- number of brokers
- replication and IO efficiency
What fails first:
Usually:
- hot partitions
- replication lag
- rebalance churn
- metadata bottlenecks around consumer groups
Canonical design conclusion:
The mechanical outcome is:
- primary state:
PartitionLogConsumerOffsetPartitionOwnershipPartitionMapReplicaStateConsumerAssignment
- critical invariants:
- one authoritative append order per partition
- monotonic consumer progress per group-partition
- one leader per partition
- failover only to sufficiently current replica
- one current assignment per group-partition scope
- mechanisms:
append loglease- guarded offset commit
- fenced leadership and assignment epochs
- reads:
- direct source reads for log, offsets, and leadership
- projections only for lag/metrics
Polished interview answer:
“I’d design the event streaming platform as a partitioned immutable log system. Each partition has one leader that serializes appends and replicates them to followers. Consumers read directly from partition logs, while consumer-group progress is tracked separately as current offset state per group and partition. Leadership is lease- or epoch-backed and fenced, failover promotes only a sufficiently current replica, and consumer-group assignments are coordinated separately to ensure one active consumer per partition within a group. The main scaling levers are more partitions, balanced leadership, efficient replication, and controlling rebalance churn.”
Concrete Substrate #
I’ll choose a broker-owned partitioned log with per-partition leaders as the concrete baseline, because that fits the mechanical derivation:
- immutable partitioned log
- one leader per partition
- follower replication
- separate offset and assignment state
- fenced leadership
Concrete substrate:
- broker cluster
- topics split into partitions
- each partition stored as append-only segment files on leader and followers
- leader appends events and replicates to followers
- offset store and consumer-group coordinator track progress and assignment
- metadata layer tracks partition leadership and routing
Concrete tech family:
- broker service in Java, Go, or Scala
- partition logs stored as append-only segment files on local disk
- replication via Raft-like or Kafka-style leader-follower log replication
- metadata/control via etcd, internal quorum, or controller nodes
- optional RocksDB/state store for offsets/metadata, though offset state can also live in internal compacted topics
Operation Layer #
1. Publish event #
API
Produce(topic_id, partition_key?, records, producer_id?, sequence?)
Initiator
- producer/client
Entry point
- gateway or any broker
Authoritative decider
- current leader for target partition
Precondition
- valid partition routing
- producer idempotency sequence, if enabled, must match expected order
Transition
- append records to
PartitionLog - advance leader log end offset
Response
{partition_id, base_offset, high_watermark?}
Failure cases
- stale routing -> redirect/retry
- response loss -> duplicate append unless producer idempotency is enabled
Sequence
- producer sends
Produce - entry broker resolves partition
- forwards to partition leader
- leader appends records locally
- leader replicates to followers
- on required ack condition, leader commits
- leader replies with offsets
2. Read events from partition #
API
Fetch(topic_id, partition_id, offset, max_bytes)
Initiator
- consumer/client
Entry point
- broker
Authoritative decider
- partition leader in the strict baseline
Precondition
- requested offset exists or is within retention bounds
Transition
- none
Response
{records, next_offset, high_watermark}
Failure cases
- offset out of range due to retention
- stale routing
Sequence
- consumer sends
Fetch - broker routes to partition leader
- leader reads segment files from requested offset
- returns records and next offset
3. Commit consumer offset #
API
CommitOffset(group_id, topic_id, partition_id, offset, expected_version?)
Initiator
- consumer/client
Entry point
- group coordinator or offset store endpoint
Authoritative decider
- offset owner/coordinator
Precondition
- offset commit must satisfy monotonicity/version rules for that group-partition scope
Transition
- overwrite
ConsumerOffset
Response
{committed_offset, version}
Failure cases
- stale commit/version conflict
- coordinator change during rebalance
Sequence
- consumer sends
CommitOffset - coordinator loads current offset/version
- validates monotonicity/version
- persists new offset
- returns success
4. Reassign partition leadership #
API
- internal election/failover flow
Initiator
- system
Entry point
- controller/quorum or replica group
Authoritative decider
- metadata quorum / controller / replica-group protocol
Precondition
- current leader failed or relinquished
- candidate replica sufficiently current
Transition
PartitionOwnershipupdated to new leader epoch- old leader fenced
Response
- updated routing metadata
Failure cases
- split brain if stale leader not fenced
- stale replica chosen as leader
5. Replicate partition log #
API
- internal replication RPC
Initiator
- partition leader
Entry point
- follower replicas
Authoritative decider
- leader for proposed entries; quorum/high-watermark logic for commit
Precondition
- follower log matches previous offset/epoch boundary
Transition
- follower appends missing entries
- commit/high watermark advances after required acks
Response
- follower ack/progress
6. Rebalance consumer assignments #
API
- internal group-coordinator flow
Initiator
- system / group coordinator
Entry point
- group coordinator
Authoritative decider
- group coordinator with assignment epoch
Precondition
- current group membership known
Transition
- overwrite
ConsumerAssignmentfor group-partition scopes - increment assignment generation/epoch
Response
- assignment map to consumers
Failure cases
- rebalance storm
- stale consumer acting on old assignment
Entry Point vs Decider vs Responder #
| Path | Entry point | Authoritative decider | Physical responder | Logical responder |
|---|---|---|---|---|
Produce | gateway / any broker | partition leader | leader or front broker | streaming platform |
Fetch | broker | partition leader | leader or front broker | streaming platform |
CommitOffset | group coordinator endpoint | offset owner/coordinator | coordinator | streaming platform |
| leader failover | controller/quorum | metadata quorum / replica-group protocol | new leader / controller | streaming platform |
| replication RPC | leader | follower + commit logic | follower ack | partition replica group |
| assignment rebalance | group coordinator | group coordinator | coordinator | streaming platform |
Concrete HLD #
Main components:
- gateway/router
- resolves topic partition and forwards producer/consumer traffic
- partition leader
- authoritative append owner for one or more partitions
- partition followers
- replicate partition logs and can be promoted on failover
- group coordinator / offset service
- stores consumer-group offsets and assignments
- metadata/controller layer
- tracks partition routing and current leaders
Concrete Technology Realizations #
Stronger infra-native answer #
- broker service in Java, Scala, or Go
- append-only segment files on local disk for partition logs
- leader-follower partition replication
- controller/quorum metadata service for partition leadership
- offset and assignment metadata in compacted internal topics or durable metadata store
Short interview version #
“I’d build the platform as a broker cluster with partitioned immutable logs. Each partition has one leader that serializes appends and replicates to followers. Consumers fetch by offset directly from partition logs, and consumer-group progress is stored separately as current offset state. Partition leadership is fenced by epochs, failover promotes only a sufficiently current replica, and group coordinators manage partition assignments and rebalance generations.”