Skip to main content
  1. System Design Components/

Kafka-Class Event Streaming Platform Analysis Note

Kafka-Class Event Streaming Platform Analysis Note #

This note captures the full step-by-step analysis for a Kafka-class event streaming platform: partitioned immutable log, consumer-group offset management, leader-based partition ownership, replication, and replayable consumption.

Step 1 — Normalize #

Assume the baseline prompt is:

  • design a Kafka-class event streaming platform
  • producers publish to topics
  • topics are partitioned
  • consumers read in order from partitions
  • consumers manage offsets/progress
  • system should support replay and scalable fanout

Normalize into state-affecting paths.

RequirementActorOperationState touchedPriority
Producer publishes event to topic partitionClientappend eventS1
create target
PartitionLog
C1
Consumer reads events from partitionClientread sourceS1
read source target
PartitionLog
R1
Consumer group commits offset/progressClientoverwrite stateS1
update target
ConsumerOffset
C1
System routes topic partition to current leaderSystemread sourceS1
read source target
PartitionMap
C1
System reassigns partition leadership after node failureSystemstate transitionS1
update target
PartitionOwnership
C1
System replicates partition log to followersSystemasync processS1
hidden write target
ReplicaState
C1
Consumer group rebalances partition assignmentSystemasync processS1
hidden write target
ConsumerAssignment
C1
Client reads topic/lag metricsClientread projectionS1
read projection target
TopicMetricsView
R2

Notes on normalization:

Important choices:

  • publish is append event
    • event log is immutable
  • consumer read is read source
    • consumers read authoritative log
  • offset commit is overwrite state
    • current progress is the main truth for a group/partition
  • partition ownership and replication are explicit because this is distributed infra
  • consumer rebalance is an internal coordination flow

Likely C1:

  • publish
  • offset commit
  • partition routing
  • partition leadership reassignment
  • replication
  • consumer-group assignment changes

Likely R1:

  • consumer read from partition

This is already clearly:

  • log + consumer progress not:
  • queue with in-flight visibility timeout

Step 2 — Critical Path Selection #

RequirementPriority classWhy
Publish event to partitionC1event durability and ordered append are core truth
Consumer reads events from partitionR1core serving path
Consumer group commits offset/progressC1delivery/progress semantics depend on correct offset state
Route partition to current leaderC1wrong routing breaks authoritative append order
Reassign partition leadership after node failureC1failover must preserve ordered committed log and authority
Replicate partition log to followersC1durability and failover correctness depend on it
Rebalance consumer assignmentsC1one group member should own a partition at a time for ordered processing in group semantics
Read topic/lag metricsR2operational only

Baseline critical paths:

Main C1 paths:

  • P1 publish event
  • P2 commit consumer offset
  • P3 route to partition leader
  • P4 reassign partition leadership
  • P5 replicate partition log
  • P6 rebalance consumer assignments

Main R1 path:

  • P7 read events from partition

What drives the design:

  • one authoritative append order per partition
  • one current leader per partition
  • durable replication of committed log
  • monotonic consumer-group progress per partition
  • controlled consumer-group assignment

Step 3 — Primary State Extraction #

For a Kafka-class platform, the minimal primary state is the partition log, consumer progress, partition ownership/routing, and replica state.

Candidate object labelCandidate sourceCandidate needed for C1/R1?Candidate decomposition actionClassPrimary?OwnerEvolutionScope kindScope value
PartitionLogdirect nounYeskeep as candidateeventYesserviceappend-onlycollectiontopic_id + partition_id
ConsumerOffsethidden write targetYeskeep as candidaterelationshipYesserviceoverwriterelationgroup_id + topic_id + partition_id
PartitionOwnershiphidden write targetYeskeep as candidateprocessYesservicestate machineinstancetopic_id + partition_id
PartitionMaphidden write targetYeskeep as candidateentityYesserviceoverwritecollectiontopic partitions
ReplicaStatehidden write targetYeskeep as candidateprocessYesservicestate machinerelationpartition_id + replica_id
ConsumerAssignmenthidden write targetYeskeep as candidaterelationshipYesserviceoverwriterelationgroup_id + partition_id
TopicMetricsViewderived read modelNoreject as UI artifactprojectionNoderivedoverwritecollectiontopic_id

Important modeling choices:

PartitionLog #

This is the core truth of the system:

  • immutable ordered event sequence per partition

ConsumerOffset #

This is also primary:

  • it records current group progress through a partition
  • replay behavior and delivery semantics depend on it

PartitionOwnership #

Needed because:

  • one leader/owner must serialize authoritative appends for a partition

ReplicaState #

Needed because:

  • failover must know which replicas are sufficiently current

ConsumerAssignment #

Needed because:

  • within a consumer group, partition ownership by consumers is meaningful for ordered processing and rebalance correctness

Minimal strict primary set:

  • PartitionLog
  • ConsumerOffset
  • PartitionOwnership
  • PartitionMap
  • ReplicaState
  • ConsumerAssignment

Step 4 — Hard Invariants #

For a Kafka-class platform, the hard invariants are about authoritative append order, valid partition leadership, durable replication, and monotonic consumer progress.

PathTierTypeInvariant templateInvariant statement
P1
write path
Publish event
HARDorderingordering templateInstances partition log appends are ordered by authoritative commit order within topic_id + partition_id.
P1
write path
Publish event
HARDeligibilityeligibility templateAction append_event is valid only if request is routed to current partition leader and append preconditions, if any, hold on topic_id + partition_id at decision time.
P2
write path
Commit consumer offset
HARDorderingordering templateInstances consumer offsets are ordered by monotonic progress within group_id + topic_id + partition_id.
P2
write path
Commit consumer offset
HARDeligibilityeligibility templateAction commit_offset is valid only if new_offset is not behind current committed offset for group-partition scope at decision time.
P3
write path
Route to partition leader
HARDuniquenessuniqueness templateKey topic_id + partition_id maps to at most one logical outcome current authoritative leader within that partition scope.
P4
write path
Reassign partition leadership
HARDeligibilityeligibility templateAction reassign_partition_leader is valid only if current leader failed or relinquished and candidate replica is eligible and sufficiently current on topic_id + partition_id at decision time.
P5
write path
Replicate partition log
HARDaccountingaccounting templateReplica state for topic_id + partition_id equals authoritative committed partition log modulo bounded replication lag.
P6
write path
Rebalance consumer assignments
HARDuniquenessuniqueness templateKey group_id + topic_id + partition_id maps to at most one logical outcome current assigned consumer within that group-partition scope.
P7
read path
Read events from partition
HARD or SOFTfreshnessfreshness templateRead path consume_partition reflects authoritative partition log within configured consistency bound.

What matters most:

1. One authoritative append order per partition #

This is the central Kafka-like invariant.

2. Monotonic consumer progress #

Offsets should not go backward accidentally for the same group/partition scope unless the API explicitly supports reset semantics.

3. Valid failover only to sufficiently current replica #

Otherwise committed events may be lost or reordered.

4. Assignment uniqueness within group #

One consumer group member should own a partition at one time for ordered processing.

Step 5 — Execution Context #

For the Kafka-class baseline:

FieldValueWhy
Topologysingle service distributedone logical streaming platform across many nodes
Write coordination scopeper object scopecorrectness is per partition, per group-partition offset, and per assignment scope
Read consistency targetstrong onlysafest baseline for publish/offset correctness; read relaxation can be added later
Holder modelnodepartition leadership is held by broker nodes
Compensation acceptable?Nolost committed events or split leaders cannot be safely repaired later

Derived implications:

  • holder_may_crash = true

    • brokers can fail while leading a partition
  • cross_service_write = false

    • baseline keeps log, offsets, assignment, and routing within one logical service
  • bounded_staleness_allowed = false

    • for authoritative publish and offset paths in the baseline
  • cross_service_atomicity_required = false

    • no multi-service transaction required in baseline
  • exclusive_claim_required = true

    • partition leadership and group-partition assignment require exclusivity
  • guarded_by_current_state = true

    • offset commits, failover, and assignment changes depend on current state

This pushes us toward:

  • one leader per partition
  • append-only authoritative partition log
  • durable follower replication
  • monotonic offset updates
  • fenced partition leadership

Step 6 — Deterministic Mechanism Selection #

6A. Write Shape #

PathWhyWrite shape
P1 publish eventimmutable append to ordered partition logappend-only event
P2 commit consumer offsetreplace current progress value subject to monotonicityguarded state transition
P3 route to partition leaderone current authoritative leader per partitionexclusive claim
P4 reassign partition leadershipvalid only if ownership/failover state allows itguarded state transition
P5 replicate partition logfollowers consume authoritative append streamappend-only event
P6 rebalance consumer assignmentsone current consumer per group-partition scopeexclusive claim

6B. Base Mechanism #

PathWrite shapeBase mechanismRequired companions
P1 publish eventappend-only eventappend logcommit index, producer idempotency if required
P2 commit consumer offsetguarded state transitionCAS on version or leader-applied monotonic overwriteoffset version
P3 route to partition leaderexclusive claimleasefencing token, heartbeat
P4 reassign partition leadershipguarded state transitionCAS on (state, version)fencing token, replica catch-up check
P5 replicate partition logappend-only eventappend logreplica ack/progress
P6 rebalance consumer assignmentsexclusive claimlease or coordinator-managed assignment epochassignment epoch

Why these fit:

Publish #

Kafka-class logs are fundamentally:

  • immutable append-only event streams

Offset commit #

Offset is current progress state, so it is not itself append-only truth. It is:

  • current value
  • but constrained by monotonicity

So a guarded overwrite/transition is the best classification.

Leadership and assignment #

Both are exclusive-ownership problems:

  • one current partition leader
  • one current consumer per group-partition assignment scope

Canonical substrate implied:

  • partitioned broker cluster
  • one leader per partition
  • follower replication per partition
  • dedicated or embedded group coordinator for assignments/offsets

Step 7 — Read Model / Source of Truth #

For a Kafka-class platform, truth is mostly direct source state. Operational metrics are projections.

ConceptTruthRead pathRebuild path
C1
source concept
Partition event stream
PartitionLogread source directlyauthoritative partition log segments
C2
source concept
Consumer-group progress
ConsumerOffsetread source directlyauthoritative offset store
C3
source concept
Partition leadership
PartitionOwnershipread source directlyauthoritative ownership store
C4
source concept
Partition routing map
PartitionMapread source directlyauthoritative routing metadata
C5
source concept
Replica catch-up state
ReplicaStateread source directlyrecompute from commit index and replica progress
C6
projection concept
Lag / topic metrics
derived from logs, offsets, and replica statematerialized viewrecompute from primary state

Important point:

For the correctness-critical baseline:

  • producers append to authoritative partition log via leader
  • consumers read from authoritative log
  • offsets are read and written against authoritative group-partition progress
  • metrics and lag dashboards are projections only

Step 8 — Failure Handling #

PathRetryCompeting writersCrash after commitPublish failureStale holder
P1 publish eventretry with producer idempotency key/sequence if requiredpartition leader serializes appends; stale leader loses authoritycommitted event survives leader crash if replicated past commit pointproducer may retry; idempotent producer avoids duplicate appendstale leader fenced by epoch/term
P2 commit consumer offsetretry safe with monotonic offset/version checksstale/lower offset loses guarded transitioncommitted offset survives crash if persistedn/astale coordinator/owner fenced by assignment epoch/version
P3 route to partition leaderretry after refreshing partition maponly one valid leader should existrefreshed map points to new leader after failovern/astale owner rejected by fencing token
P4 reassign partition leadershipretry failover safelyonly one reassignment wins current ownership statepromoted leader crash triggers later failovern/aold leader fenced and must not serve writes
P5 replicate partition logfollower retries from last known offset/indexfollowers independently catch up from committed logleader crash stops new replication until failover; committed log remains truthmissing append retried from commit indexstale replica cannot become leader unless sufficiently current
P6 rebalance consumer assignmentsretry rebalance with assignment epochonly one assignment per group-partition scope should be activerebalance coordinator crash delays reassignment but next epoch retriesn/astale consumer assignment fenced by generation/assignment epoch

What matters most:

1. Partition leader fencing #

An old leader must not continue accepting appends after failover.

2. Producer idempotency #

If producer retry happens after response loss:

  • duplicate append is possible unless producer idempotency is used

3. Offset monotonicity #

Client retries or misordered commits must not silently move offsets backward unless explicit reset APIs exist.

4. Replica catch-up before leadership #

A stale replica becoming leader can lose acknowledged events.

Step 9 — Scale Adjustments #

HotspotTypeFirst response
hot partition with too much producer trafficcontention hotspotincrease partition count and rebalance producers across partitions
hot consumer group coordinator / offset storewrite throughput hotspotshard consumer-group metadata and isolate busy groups
replication lag on busy partitionswrite throughput hotspotbatch log replication, tune follower IO, and isolate hot partitions
leader-heavy read/write pressurecontention hotspotspread partition leadership and add more brokers
rebalance storms in consumer groupscontention hotspotreduce eager rebalances, use cooperative/sticky assignment, and slow membership churn
log storage growthstorage growth hotspotsegment retention, compaction where applicable, and tiered/archival storage
lag/metrics fanoutread hotspotkeep lag and topic metrics as derived views only

What scales well:

This system scales by:

  • partitioning topics
  • making each partition an independent ordered append stream
  • replicating partitions independently
  • keeping offsets and assignments scoped by group-partition

Throughput grows mainly with:

  • number of partitions
  • number of brokers
  • replication and IO efficiency

What fails first:

Usually:

  • hot partitions
  • replication lag
  • rebalance churn
  • metadata bottlenecks around consumer groups

Canonical design conclusion:

The mechanical outcome is:

  • primary state:
    • PartitionLog
    • ConsumerOffset
    • PartitionOwnership
    • PartitionMap
    • ReplicaState
    • ConsumerAssignment
  • critical invariants:
    • one authoritative append order per partition
    • monotonic consumer progress per group-partition
    • one leader per partition
    • failover only to sufficiently current replica
    • one current assignment per group-partition scope
  • mechanisms:
    • append log
    • lease
    • guarded offset commit
    • fenced leadership and assignment epochs
  • reads:
    • direct source reads for log, offsets, and leadership
    • projections only for lag/metrics

Polished interview answer:

“I’d design the event streaming platform as a partitioned immutable log system. Each partition has one leader that serializes appends and replicates them to followers. Consumers read directly from partition logs, while consumer-group progress is tracked separately as current offset state per group and partition. Leadership is lease- or epoch-backed and fenced, failover promotes only a sufficiently current replica, and consumer-group assignments are coordinated separately to ensure one active consumer per partition within a group. The main scaling levers are more partitions, balanced leadership, efficient replication, and controlling rebalance churn.”

Concrete Substrate #

I’ll choose a broker-owned partitioned log with per-partition leaders as the concrete baseline, because that fits the mechanical derivation:

  • immutable partitioned log
  • one leader per partition
  • follower replication
  • separate offset and assignment state
  • fenced leadership

Concrete substrate:

  • broker cluster
  • topics split into partitions
  • each partition stored as append-only segment files on leader and followers
  • leader appends events and replicates to followers
  • offset store and consumer-group coordinator track progress and assignment
  • metadata layer tracks partition leadership and routing

Concrete tech family:

  • broker service in Java, Go, or Scala
  • partition logs stored as append-only segment files on local disk
  • replication via Raft-like or Kafka-style leader-follower log replication
  • metadata/control via etcd, internal quorum, or controller nodes
  • optional RocksDB/state store for offsets/metadata, though offset state can also live in internal compacted topics

Operation Layer #

1. Publish event #

API

  • Produce(topic_id, partition_key?, records, producer_id?, sequence?)

Initiator

  • producer/client

Entry point

  • gateway or any broker

Authoritative decider

  • current leader for target partition

Precondition

  • valid partition routing
  • producer idempotency sequence, if enabled, must match expected order

Transition

  • append records to PartitionLog
  • advance leader log end offset

Response

  • {partition_id, base_offset, high_watermark?}

Failure cases

  • stale routing -> redirect/retry
  • response loss -> duplicate append unless producer idempotency is enabled

Sequence

  1. producer sends Produce
  2. entry broker resolves partition
  3. forwards to partition leader
  4. leader appends records locally
  5. leader replicates to followers
  6. on required ack condition, leader commits
  7. leader replies with offsets

2. Read events from partition #

API

  • Fetch(topic_id, partition_id, offset, max_bytes)

Initiator

  • consumer/client

Entry point

  • broker

Authoritative decider

  • partition leader in the strict baseline

Precondition

  • requested offset exists or is within retention bounds

Transition

  • none

Response

  • {records, next_offset, high_watermark}

Failure cases

  • offset out of range due to retention
  • stale routing

Sequence

  1. consumer sends Fetch
  2. broker routes to partition leader
  3. leader reads segment files from requested offset
  4. returns records and next offset

3. Commit consumer offset #

API

  • CommitOffset(group_id, topic_id, partition_id, offset, expected_version?)

Initiator

  • consumer/client

Entry point

  • group coordinator or offset store endpoint

Authoritative decider

  • offset owner/coordinator

Precondition

  • offset commit must satisfy monotonicity/version rules for that group-partition scope

Transition

  • overwrite ConsumerOffset

Response

  • {committed_offset, version}

Failure cases

  • stale commit/version conflict
  • coordinator change during rebalance

Sequence

  1. consumer sends CommitOffset
  2. coordinator loads current offset/version
  3. validates monotonicity/version
  4. persists new offset
  5. returns success

4. Reassign partition leadership #

API

  • internal election/failover flow

Initiator

  • system

Entry point

  • controller/quorum or replica group

Authoritative decider

  • metadata quorum / controller / replica-group protocol

Precondition

  • current leader failed or relinquished
  • candidate replica sufficiently current

Transition

  • PartitionOwnership updated to new leader epoch
  • old leader fenced

Response

  • updated routing metadata

Failure cases

  • split brain if stale leader not fenced
  • stale replica chosen as leader

5. Replicate partition log #

API

  • internal replication RPC

Initiator

  • partition leader

Entry point

  • follower replicas

Authoritative decider

  • leader for proposed entries; quorum/high-watermark logic for commit

Precondition

  • follower log matches previous offset/epoch boundary

Transition

  • follower appends missing entries
  • commit/high watermark advances after required acks

Response

  • follower ack/progress

6. Rebalance consumer assignments #

API

  • internal group-coordinator flow

Initiator

  • system / group coordinator

Entry point

  • group coordinator

Authoritative decider

  • group coordinator with assignment epoch

Precondition

  • current group membership known

Transition

  • overwrite ConsumerAssignment for group-partition scopes
  • increment assignment generation/epoch

Response

  • assignment map to consumers

Failure cases

  • rebalance storm
  • stale consumer acting on old assignment

Entry Point vs Decider vs Responder #

PathEntry pointAuthoritative deciderPhysical responderLogical responder
Producegateway / any brokerpartition leaderleader or front brokerstreaming platform
Fetchbrokerpartition leaderleader or front brokerstreaming platform
CommitOffsetgroup coordinator endpointoffset owner/coordinatorcoordinatorstreaming platform
leader failovercontroller/quorummetadata quorum / replica-group protocolnew leader / controllerstreaming platform
replication RPCleaderfollower + commit logicfollower ackpartition replica group
assignment rebalancegroup coordinatorgroup coordinatorcoordinatorstreaming platform

Concrete HLD #

Main components:

  • gateway/router
    • resolves topic partition and forwards producer/consumer traffic
  • partition leader
    • authoritative append owner for one or more partitions
  • partition followers
    • replicate partition logs and can be promoted on failover
  • group coordinator / offset service
    • stores consumer-group offsets and assignments
  • metadata/controller layer
    • tracks partition routing and current leaders

Concrete Technology Realizations #

Stronger infra-native answer #

  • broker service in Java, Scala, or Go
  • append-only segment files on local disk for partition logs
  • leader-follower partition replication
  • controller/quorum metadata service for partition leadership
  • offset and assignment metadata in compacted internal topics or durable metadata store

Short interview version #

“I’d build the platform as a broker cluster with partitioned immutable logs. Each partition has one leader that serializes appends and replicates to followers. Consumers fetch by offset directly from partition logs, and consumer-group progress is stored separately as current offset state. Partition leadership is fenced by epochs, failover promotes only a sufficiently current replica, and group coordinators manage partition assignments and rebalance generations.”