Load Balancer Analysis Note #

This note captures the full step-by-step analysis for a load balancer service: backend membership, health state, effective routing state, control-plane propagation, and data-plane request routing.

Step 1 — Normalize #

Assume the baseline prompt is:

design a load balancer service
clients send requests to a virtual endpoint
requests should be distributed across healthy backends
unhealthy backends should stop receiving traffic
system should scale across nodes

Normalize into state-affecting paths.

Requirement	Actor	Operation	State touched	Priority
Client request is routed to a backend	Client	read source	`S1` `read source target` `RoutingState`	C1
Backend registers/unregisters with balancer	Backend/service agent	state transition	`S1` `update target` `BackendMembership`	C1
System records backend health	System	overwrite state	`S1` `update target` `HealthState`	C1
System updates effective routing set/policy	System	state transition	`S1` `update target` `RoutingState`	C1
System propagates control-plane config to serving nodes	System	async process	`S1` `hidden write target` `ConfigSnapshot`	C1
Admin updates balancing policy/config	Admin	overwrite state	`S1` `update target` `LoadBalancingPolicy`	C1
Client reads balancer/backend status	Client	read projection	`S1` `read projection target` `StatusView`	R2

Notes on normalization:

Important choices:

request routing is read source
- the hot path mostly reads current routing state and makes a routing decision
backend membership is state transition
- backend joins/leaves active pool
health update is overwrite state
- current health is the main truth
effective routing set update is state transition
- backend eligibility in routing changes over time
config propagation is async process
- control-plane to data-plane dissemination

This is clearly a:

Control Plane + Data Plane system

not:

a queue/log/store problem

Step 2 — Critical Path Selection #

Requirement	Priority class	Why
Route client request to backend	C1	wrong routing can send traffic to unhealthy/wrong backends
Register/unregister backend	C1	active backend pool truth depends on it
Record backend health	C1	health truth determines safe routing eligibility
Update effective routing set/policy	C1	this is the control-to-data-plane correctness bridge
Propagate config to serving nodes	C1	stale serving nodes can route incorrectly
Update balancing policy/config	C1	changes future routing behavior and safety
Read balancer/backend status	R2	operational only

Baseline critical paths:

Main C1 paths:

P1 route client request
P2 register/unregister backend
P3 record backend health
P4 update effective routing state
P5 propagate config snapshot
P6 update balancing policy

This system is driven by:

authoritative backend membership
current backend health
effective routing eligibility
propagation of control-plane state to data-plane nodes

Step 3 — Primary State Extraction #

For a load balancer, the minimal primary state is backend membership, health, routing state, and policy/config.

Candidate object label	Candidate source	Candidate needed for C1/R1?	Candidate decomposition action	Class	Primary?	Owner	Evolution	Scope kind	Scope value
BackendMembership	direct noun	Yes	keep as candidate	relationship	Yes	service	state machine	relation	service_id + backend_id
HealthState	hidden write target	Yes	keep as candidate	entity	Yes	service	overwrite	instance	backend_id
RoutingState	hidden write target	Yes	keep as candidate	process	Yes	service	state machine	instance	service_id or listener_id
LoadBalancingPolicy	direct noun	Yes	keep as candidate	entity	Yes	service	overwrite	instance	service_id or listener_id
ConfigSnapshot	hidden write target	Yes	keep as candidate	projection	Yes	service	overwrite	instance	data_plane_node_id
StatusView	derived read model	No	reject as UI artifact	projection	No	derived	overwrite	collection	service_id

Important modeling choices:

`BackendMembership` #

Primary because:

defines which backends belong to the pool
registration/unregistration changes future routing

`HealthState` #

Primary because:

current health is authoritative input for routing eligibility

`RoutingState` #

Primary because:

it is the effective routing-ready set derived from membership + health + policy
data plane reads it to make forwarding decisions

`LoadBalancingPolicy` #

Primary because:

changes backend selection semantics

`ConfigSnapshot` #

For a control-plane/data-plane system, I treat this as primary and model it explicitly because:

data-plane nodes need a propagated current snapshot
propagation lag and versioning matter

Minimal strict primary set:

BackendMembership
HealthState
RoutingState
LoadBalancingPolicy
ConfigSnapshot
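
As a rough illustration, the five primary objects could be written down as immutable records keyed by the scopes from the table above. This is only a sketch: the field names beyond the scope keys, the `Health` enum, and the example state labels are assumptions, not part of the analysis.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class Health(Enum):
    HEALTHY = "healthy"
    UNHEALTHY = "unhealthy"


@dataclass(frozen=True)
class BackendMembership:
    service_id: str
    backend_id: str
    state: str          # assumed labels, e.g. "registered" / "unregistered"
    version: int        # membership version used for guarded transitions


@dataclass(frozen=True)
class HealthState:
    backend_id: str
    health: Health
    revision: int       # monotonic observation revision


@dataclass(frozen=True)
class LoadBalancingPolicy:
    service_id: str
    algorithm: str      # assumed field, e.g. "round_robin"
    version: int


@dataclass(frozen=True)
class RoutingState:
    service_id: str
    eligible_backends: Tuple[str, ...]
    version: int


@dataclass(frozen=True)
class ConfigSnapshot:
    node_id: str        # scope: one snapshot per data-plane node
    routing: RoutingState
    policy: LoadBalancingPolicy
    config_version: int
```

Freezing the records reflects the overwrite-by-new-version style used later: a change produces a new record with a higher version rather than mutating the old one.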

Step 4 — Hard Invariants #

For a load balancer, the hard invariants are about correct backend eligibility and safe propagation of routing state to serving nodes.

Path	Tier	Type	Invariant template	Invariant statement
`P1` `read path` `Route request`	HARD	eligibility	eligibility template	Action `route_request` is valid only if `selected backend is currently eligible under authoritative RoutingState and policy for that service/listener scope` at decision time.
`P2` `write path` `Register/unregister backend`	HARD	eligibility	eligibility template	Action `change_membership` is valid only if `backend identity and service binding rules hold` on `service_id + backend_id` at decision time.
`P2` `write path` `Register/unregister backend`	HARD	uniqueness	uniqueness template	Key `service_id + backend_id` maps to at most one logical outcome `current membership state` within that relation scope.
`P3` `write path` `Record health`	HARD	ordering	ordering template	Instances `health updates` are ordered by `monotonic health revision or observation timestamp` within `backend_id`.
`P4` `write path` `Update effective routing state`	HARD	accounting	accounting template	Effective `RoutingState(service_id)` equals function `eligible(backends, health, policy)` over authoritative membership, health, and policy state.
`P5` `write path` `Propagate config snapshot`	HARD	freshness	freshness template	Data-plane `ConfigSnapshot(node)` reflects authoritative routing/config state within `configured propagation bound`.
`P6` `write path` `Update balancing policy`	HARD	ordering	ordering template	Instances `policy revisions` are ordered by `monotonic policy version` within `service_id or listener_id`.

What matters most:

1. Route only to eligible backends #

The core safety property is:

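a data-plane node must not route to a backend that its current snapshot marks unhealthy or ineligible
how far that snapshot may trail authoritative state is bounded separately by invariant 3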

2. Effective routing state is derived from authoritative inputs #

Membership, health, and policy feed the routing-ready set.
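
A minimal sketch of the `eligible(backends, health, policy)` derivation, assuming that policy contributes only a set of excluded (for example, drained) backends and that a backend with no recorded health is treated as ineligible. Both of those choices are assumptions made for illustration.

```python
from typing import Dict, FrozenSet, List, Set


def eligible(
    members: Set[str],
    health: Dict[str, bool],
    excluded_by_policy: FrozenSet[str] = frozenset(),
) -> List[str]:
    """Return the routing-ready set as a pure function of its inputs.

    A backend is eligible only if it is a registered member, its latest
    health observation is healthy, and policy does not exclude it.
    Unknown health fails closed (assumption).
    """
    return sorted(
        b
        for b in members
        if health.get(b, False) and b not in excluded_by_policy
    )


routing_set = eligible({"b1", "b2", "b3"}, {"b1": True, "b2": False})
# b2 is unhealthy and b3 has no health record, so only b1 remains.
```

Because the function has no side effects, the routing state can be rebuilt at any time from the three authoritative inputs, which is what makes the recompute path safe to retry.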

3. Propagation lag is bounded #

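Each node's `ConfigSnapshot` must reflect authoritative state within the configured propagation bound. This bound is the key correctness interface between control plane and data plane.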
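
One way to make the bound checkable is to ask, for a given node, whether any config version it has not yet applied was published longer ago than the bound. The sketch below assumes the control plane records a publish time per config version; the function name and arguments are hypothetical.

```python
from typing import Dict


def violates_propagation_bound(
    node_version: int,
    published_at: Dict[int, float],
    now: float,
    bound_seconds: float,
) -> bool:
    """True if the node is missing a version that is older than the bound.

    published_at maps config version -> publish time in seconds
    (an assumed bookkeeping structure on the control plane).
    """
    return any(
        now - publish_time > bound_seconds
        for version, publish_time in published_at.items()
        if version > node_version
    )


published = {1: 0.0, 2: 10.0, 3: 14.0}
late = violates_propagation_bound(2, published, now=20.0, bound_seconds=5.0)
```

A node that is behind only on versions published within the last `bound_seconds` is still inside the contract, which is the sense in which stale routing is bounded rather than forbidden.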

Step 5 — Execution Context #

For the load balancer baseline:

Field	Value	Why
Topology	single service distributed	one logical load-balancing system with distributed serving nodes
Write coordination scope	per object scope	correctness is per service/listener, backend, and data-plane node snapshot
Read consistency target	bounded stale allowed	serving nodes usually use recent config snapshots rather than synchronous strong control-plane reads on each request
Holder model	none	request routing does not rely on lease-like item ownership in the hot path
Compensation acceptable?	No	routing to an unhealthy backend is a serving correctness failure, not a compensable workflow

Derived implications:

holder_may_crash = false
- no temporary client-held ownership primitive is central to the hot path
cross_service_write = false
- baseline keeps control-plane state and data-plane config within one logical system
bounded_staleness_allowed = true
- data plane may route using recent snapshot, not necessarily latest control-plane write
cross_service_atomicity_required = false
- no multi-service transaction needed in baseline
exclusive_claim_required = false
- not a lease/claim problem in the hot request path
guarded_by_current_state = true
- routing-state derivation and membership changes depend on current state

This pushes us toward:

authoritative control plane
versioned snapshots to data-plane nodes
hot path reads from local or near-local routing snapshot

Step 6 — Deterministic Mechanism Selection #

6A. Write Shape #

Path	Why	Write shape
`P1` route request	read-only serving decision against current snapshot	n/a read path
`P2` register/unregister backend	membership lifecycle changes current relation state	`guarded state transition`
`P3` record health	replace current backend health view	`overwrite current value`
`P4` update effective routing state	recompute current routing-eligible set	`overwrite current value`
`P5` propagate config snapshot	push/refresh current serving snapshot	`overwrite current value`
`P6` update balancing policy	replace current policy revision	`overwrite current value`

6B. Base Mechanism #

Path	Write shape	Base mechanism	Required companions
`P2` register/unregister backend	`guarded state transition`	`CAS on (state, version)`	membership version
`P3` record health	`overwrite current value`	`CAS on version` or monotonic observation overwrite	health revision/timestamp
`P4` update effective routing state	`overwrite current value`	`single writer` control-plane recompute	routing version
`P5` propagate config snapshot	`overwrite current value`	`single writer` snapshot publication	config version
`P6` update balancing policy	`overwrite current value`	`CAS on version`	policy version

Why these fit:

Membership #

Backend pool membership has lifecycle semantics and should not flap via blind overwrites.
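
The guarded transition can be sketched as a compare-and-set on `(state, version)`. The in-memory store, the state names, and the allowed-transition table below are assumptions for illustration; a real store would need to perform the compare and the write atomically.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# Assumed lifecycle: a backend may register, unregister, and re-register.
ALLOWED = {
    ("absent", "registered"),
    ("registered", "unregistered"),
    ("unregistered", "registered"),
}


@dataclass
class Membership:
    state: str = "absent"
    version: int = 0


class MembershipStore:
    def __init__(self) -> None:
        self._rows: Dict[Tuple[str, str], Membership] = {}

    def get(self, service_id: str, backend_id: str) -> Membership:
        return self._rows.get((service_id, backend_id), Membership())

    def transition(
        self,
        service_id: str,
        backend_id: str,
        expected_state: str,
        expected_version: int,
        new_state: str,
    ) -> bool:
        """Apply the change only if the caller saw the current row."""
        current = self.get(service_id, backend_id)
        if (current.state, current.version) != (expected_state, expected_version):
            return False  # stale writer loses the CAS
        if (current.state, new_state) not in ALLOWED:
            return False  # not a legal lifecycle step
        self._rows[(service_id, backend_id)] = Membership(
            new_state, current.version + 1
        )
        return True
```

A caller whose CAS fails re-reads the row and decides whether its change still makes sense, which is what prevents two agents from flapping a backend in and out of the pool with blind writes.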

Health, routing, and policy #

These are current-state config/status values:

latest version matters
overwrite semantics are natural
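
For health in particular, "latest version matters" can be sketched as a monotonic overwrite: an observation replaces the stored value only if its revision is strictly newer. The `HealthStore` class and its string health labels are hypothetical.

```python
from typing import Dict, Optional, Tuple


class HealthStore:
    """Keeps only the newest health observation per backend."""

    def __init__(self) -> None:
        self._state: Dict[str, Tuple[str, int]] = {}

    def report(self, backend_id: str, health: str, revision: int) -> bool:
        """Overwrite if the revision is newer; return whether it was applied."""
        current = self._state.get(backend_id)
        if current is not None and revision <= current[1]:
            return False  # late or duplicate observation is dropped
        self._state[backend_id] = (health, revision)
        return True

    def get(self, backend_id: str) -> Optional[Tuple[str, int]]:
        return self._state.get(backend_id)
```

Dropping out-of-order observations makes retries safe: resending the same report is a no-op, and a delayed report cannot roll health back to an older view.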

Config propagation #

This is a classic control-plane publish of versioned snapshots.

Canonical substrate implied:

control plane owns membership, health, policy, and routing state
data plane consumes versioned snapshots
request routing is local snapshot read + backend selection algorithm

Step 7 — Read Model / Source of Truth #

For a load balancer, authoritative truth lives in the control plane, while the data plane uses propagated snapshots.

Concept	Truth	Read path	Rebuild path
`C1` `source concept` `Backend membership`	`BackendMembership`	`read source directly`	authoritative control-plane store
`C2` `source concept` `Backend health`	`HealthState`	`read source directly`	authoritative health store
`C3` `source concept` `Balancing policy`	`LoadBalancingPolicy`	`read source directly`	authoritative policy store
`C4` `source concept` `Effective routing state`	`RoutingState`	`read source directly`	recompute from membership + health + policy
`C5` `projection concept` `Data-plane config snapshot`	`ConfigSnapshot` derived from authoritative control state	`materialized view`	rebuild from latest control-plane version
`C6` `projection concept` `Status/dashboard`	derived from control-plane and serving-node observations	`materialized view`	recompute from primary state

Important point:

For the hot path:

data-plane routing typically does not query control-plane state synchronously per request
instead, it uses ConfigSnapshot

So the serving read path is:

local materialized view / config snapshot

not:

source read on every request

That is the core control-plane/data-plane distinction.

Step 8 — Failure Handling #

Path	Retry	Competing writers	Crash after commit	Publish failure	Stale holder
`P2` register/unregister backend	retry with membership version	stale membership change loses CAS	committed membership survives control-plane crash if persisted	config propagation may lag to serving nodes	n/a
`P3` record health	retry with monotonic observation time/revision	latest valid health view wins by version/timestamp policy	committed health update survives crash if persisted	snapshot propagation may lag within bounded window	n/a
`P4` update effective routing state	recompute retry safe from authoritative inputs	single writer or versioned recompute wins	recompute can be rerun after crash from primary state	snapshot publication may lag	n/a
`P5` propagate config snapshot	retry snapshot push/pull using versioned config	older snapshot version loses to newer version	serving node may keep last good snapshot until refreshed	failed push retried or pulled later	n/a
`P6` update balancing policy	retry with policy version	stale policy update loses CAS	committed policy survives crash if persisted	routing snapshot may lag briefly	n/a
`P1` route request	no mutation retry issue on hot path	multiple serving nodes can route concurrently using local snapshots	node crash just drops its in-flight traffic, not control-plane truth	n/a	stale serving node snapshot must age out or be refreshed

What matters most:

1. Versioned snapshot propagation #

Serving nodes must reject older snapshots and move monotonically forward by config version.
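
A minimal sketch of that rule on the serving node, assuming each snapshot carries an integer `config_version`. The class and field names are illustrative only.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Snapshot:
    config_version: int
    eligible_backends: Tuple[str, ...]


class ServingNode:
    def __init__(self) -> None:
        self.snapshot: Optional[Snapshot] = None

    def apply(self, incoming: Snapshot) -> int:
        """Install the snapshot only if strictly newer; return the acked version."""
        if (
            self.snapshot is None
            or incoming.config_version > self.snapshot.config_version
        ):
            self.snapshot = incoming
        # Ack whatever version the node now serves, so the publisher can
        # tell an applied snapshot from a rejected one.
        return self.snapshot.config_version
```

With this check in place, retried or reordered pushes are harmless: a node that already holds version 3 simply acks 3 again when version 2 arrives late.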

2. Bounded stale routing #

The hot path catches up with control-plane changes within the propagation bound; it is not synchronized with every control-plane write.

3. Health flapping #

Health-state overwrite policy must handle noisy observations without causing excessive routing churn.
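
One common way to absorb noisy observations is consecutive-result thresholds: the published health flips only after several probes in a row contradict it. The note does not prescribe a specific damping scheme, so the `rise`/`fall` thresholds and the class below are assumptions.

```python
class HealthDamper:
    """Flip published health only after a run of contradicting probes."""

    def __init__(self, rise: int = 2, fall: int = 3, initial: bool = False) -> None:
        self.rise = rise      # consecutive successes needed to become healthy
        self.fall = fall      # consecutive failures needed to become unhealthy
        self.healthy = initial
        self._streak = 0      # consecutive probes contradicting current state

    def observe(self, ok: bool) -> bool:
        """Feed one probe result; return True if the published state changed."""
        if ok == self.healthy:
            self._streak = 0  # agreement resets the run
            return False
        self._streak += 1
        needed = self.rise if ok else self.fall
        if self._streak >= needed:
            self.healthy = ok
            self._streak = 0
            return True
        return False


damper = HealthDamper(rise=2, fall=3, initial=True)
changes = [damper.observe(ok) for ok in (False, True, False, False, False)]
```

Only the last probe in that sequence changes the published state, so a single failed check does not trigger a routing recompute or a snapshot push.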

Failure summary:

The load balancer stays correct if:

authoritative membership/health/policy live in control plane
effective routing state is recomputable
data-plane nodes use versioned snapshots
stale snapshots are bounded and replaced

Step 9 — Scale Adjustments #

Hotspot	Type	First response
very high request volume on data plane	read hotspot	add more serving nodes; keep routing decision local to snapshot
control-plane churn from health flaps	contention hotspot	dampen health transitions and rate-limit config recompute
hot listener/service with huge backend set	fan-out hotspot	shard backend set and snapshot propagation by listener/service scope
snapshot propagation storms	fan-out hotspot	use versioned incremental updates and pull-on-miss fallback
status/dashboard queries	read hotspot	keep them as derived views only
membership/config update bursts	write throughput hotspot	batch updates and recompute routing state incrementally

What scales well:

The system scales well if:

the hot path is local snapshot read + selection
control plane is narrow and versioned
propagation is incremental
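
Incremental propagation with a full-pull fallback might look like the sketch below. It assumes each delta names the version it was computed against (`base_version`); the `Delta` shape and the returned "needs full pull" flag are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class Snapshot:
    version: int
    backends: FrozenSet[str]


@dataclass(frozen=True)
class Delta:
    base_version: int
    new_version: int
    added: FrozenSet[str] = frozenset()
    removed: FrozenSet[str] = frozenset()


def apply_delta(current: Snapshot, delta: Delta) -> Tuple[Snapshot, bool]:
    """Return (snapshot, needs_full_pull)."""
    if delta.new_version <= current.version:
        return current, False   # already applied or older: ignore
    if delta.base_version != current.version:
        return current, True    # gap in the chain: request a full snapshot
    backends = (current.backends | delta.added) - delta.removed
    return Snapshot(delta.new_version, backends), False
```

The node keeps serving its last good snapshot while the full pull is in flight, so a missed delta costs extra staleness rather than an outage.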

What fails first:

Usually:

health-flap storms
config propagation bursts
giant backend pools for one service

Canonical design conclusion:

The mechanical outcome is:

primary state:
- BackendMembership
- HealthState
- RoutingState
- LoadBalancingPolicy
- ConfigSnapshot
critical invariants:
- only eligible backends receive traffic
- routing state equals function of membership, health, and policy
- data plane uses bounded-stale versioned snapshots
mechanisms:
- guarded membership changes
- overwrite current health/policy/routing state
- single-writer control-plane recompute
reads:
- source reads in control plane
- materialized config snapshots in data plane

Polished interview answer:

“I’d design the load balancer as a control-plane/data-plane system. The control plane owns backend membership, current health, balancing policy, and the effective routing-eligible set. Data-plane nodes don’t synchronously query the control plane on every request; instead they route using versioned config snapshots propagated from the control plane. Membership changes are guarded, health and policy are versioned current-state updates, and routing state is recomputed from those authoritative inputs. The main scaling levers are keeping routing local to the data plane, damping health flaps, and using incremental snapshot propagation.”

Concrete Substrate #

I’ll choose a versioned control-plane plus stateless data-plane balancer fleet as the concrete baseline, because that matches the mechanics we derived:

authoritative control-plane state
versioned propagated config snapshots
local routing decisions in data plane

Concrete substrate:

control plane
- authoritative store for backend membership, health, and policy
- recomputes effective routing state
- publishes versioned config snapshots
data plane
- many balancer nodes terminate client connections
- each holds current snapshot for one or more listeners/services
- selects healthy backend locally

Concrete tech family:

control plane in Go or Java
authoritative metadata in etcd or a small strongly consistent store
snapshot propagation via watch streams, pub/sub, or pull-on-version-miss
data-plane balancer in Envoy-like or custom L4/L7 proxy fleet

Operation Layer #

1. Route request #

API

client request to virtual IP / listener

Initiator

client

Entry point

data-plane load balancer node

Authoritative decider

local ConfigSnapshot on that serving node

Precondition

snapshot is fresh enough to satisfy the bounded-stale routing contract (within the configured propagation bound)
selected backend eligible in snapshot

Transition

none on control-plane truth
local backend selection for this request

Response

forwarded request / response relay
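
The local selection step can be sketched as follows. Round-robin is used purely as an example algorithm, since the note leaves the choice to `LoadBalancingPolicy`; the class name and the decision to return `None` when nothing is eligible are assumptions.

```python
import itertools
from typing import Optional, Sequence


class RoundRobinRouter:
    """Chooses a backend from the node's local snapshot, with no control-plane call."""

    def __init__(self) -> None:
        self._counter = itertools.count()

    def pick(self, eligible_backends: Sequence[str]) -> Optional[str]:
        if not eligible_backends:
            # Nothing eligible in the snapshot; the caller decides how to
            # fail the request (assumption: it is not forwarded anywhere).
            return None
        index = next(self._counter) % len(eligible_backends)
        return eligible_backends[index]


router = RoundRobinRouter()
snapshot_backends = ("b1", "b2", "b3")
picks = [router.pick(snapshot_backends) for _ in range(4)]
```

Because `pick` only ever sees the snapshot's eligible list, the "selected backend eligible in snapshot" precondition holds by construction.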

2. Register / unregister backend #

API

RegisterBackend(service_id, backend_id, metadata)
UnregisterBackend(service_id, backend_id, expected_version?)

Initiator

backend/service agent

Entry point

control-plane API

Authoritative decider

control-plane membership store

Precondition

valid service/backend identity

Transition

update BackendMembership
bump membership version

Response

{membership_version}

3. Record health #

API

internal ReportHealth(backend_id, health_state, observation_revision)

Initiator

system / health checker

Entry point

control plane

Authoritative decider

health-state owner in control plane

Precondition

observation revision/timestamp is newer than the stored one under the ordering policy

Transition

overwrite HealthState

Response

internal success

4. Update routing state #

API

internal recompute flow

Initiator

system/control plane

Entry point

routing-state owner

Authoritative decider

control-plane recompute worker

Precondition

latest membership/health/policy versions available

Transition

overwrite RoutingState

Response

internal success with new routing version

5. Propagate config snapshot #

API

internal snapshot push/pull stream

Initiator

control plane

Entry point

data-plane node

Authoritative decider

control-plane snapshot publisher

Precondition

newer config version available

Transition

overwrite local ConfigSnapshot(node)

Response

ack current config version

6. Update balancing policy #

API

PutPolicy(service_id, config, expected_version?)

Initiator

admin

Entry point

control-plane API

Authoritative decider

policy store owner

Precondition

expected version matches if supplied

Transition

overwrite LoadBalancingPolicy
bump version

Response

{policy_version}
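
A sketch of `PutPolicy` with the optional `expected_version` check, using an in-memory dictionary as a stand-in for the policy store. The `VersionConflict` exception and the convention that an unwritten policy has version 0 are assumptions.

```python
from typing import Any, Dict, Optional, Tuple


class VersionConflict(Exception):
    """Raised when the caller's expected_version is stale."""


class PolicyStore:
    def __init__(self) -> None:
        self._policies: Dict[str, Tuple[Any, int]] = {}

    def put_policy(
        self,
        service_id: str,
        config: Any,
        expected_version: Optional[int] = None,
    ) -> int:
        """Overwrite the policy and return the new policy_version."""
        _, current_version = self._policies.get(service_id, (None, 0))
        if expected_version is not None and expected_version != current_version:
            raise VersionConflict(
                f"expected {expected_version}, found {current_version}"
            )
        new_version = current_version + 1
        self._policies[service_id] = (config, new_version)
        return new_version
```

Omitting `expected_version` gives last-writer-wins behavior, while supplying it turns the call into the `CAS on version` mechanism chosen in Step 6B.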

Entry Point vs Decider vs Responder #

Path	Entry point	Authoritative decider	Physical responder	Logical responder
route request	data-plane node	local current config snapshot	data-plane node	load balancer service
register/unregister backend	control-plane API	control-plane membership store	control-plane node	load balancer service
record health	control-plane API	control-plane health owner	control-plane node	load balancer service
recompute routing state	control plane	control-plane recompute worker	internal	load balancer service
propagate snapshot	data-plane node	control-plane snapshot publisher	control-plane/data-plane nodes	load balancer service
update policy	control-plane API	policy store owner	control-plane node	load balancer service

Concrete HLD #

Main components:

control-plane API
- backend registration
- policy updates
- health ingestion
control-plane state store
- membership, health, policy, routing versions
routing recompute worker
- derives effective routing state
snapshot distribution layer
- pushes/pulls versioned config to serving nodes
data-plane balancer fleet
- handles client traffic using local snapshots

Concrete Technology Realizations #

Stronger infra-native answer #

control plane in Go or Java
authoritative metadata in etcd
watch-stream or pub/sub snapshot propagation
data-plane proxy fleet using Envoy-like process or custom proxy nodes

Short interview version #

“I’d build the load balancer as a versioned control-plane/data-plane system. The control plane owns backend membership, health, policy, and effective routing state, then publishes versioned snapshots to balancer nodes. The data plane uses only local snapshots on the hot path to choose a healthy backend, while the control plane handles registration, health updates, and incremental config propagation.”
