Load Balancer Analysis Note #

This note captures the full step-by-step analysis for a load balancer service: backend membership, health state, effective routing state, control-plane propagation, and data-plane request routing.

Step 1 — Normalize #

Assume the baseline prompt is:

  • design a load balancer service
  • clients send requests to a virtual endpoint
  • requests should be distributed across healthy backends
  • unhealthy backends should stop receiving traffic
  • system should scale across nodes

Normalize into state-affecting paths.

| Requirement | Actor | Operation | State touched | Priority |
| --- | --- | --- | --- | --- |
| Client request is routed to a backend | Client | read source | RoutingState (S1, read source target) | C1 |
| Backend registers/unregisters with balancer | Client | state transition | BackendMembership (S1, update target) | C1 |
| System records backend health | System | overwrite state | HealthState (S1, update target) | C1 |
| System updates effective routing set/policy | System | state transition | RoutingState (S1, update target) | C1 |
| System propagates control-plane config to serving nodes | System | async process | ConfigSnapshot (S1, hidden write target) | C1 |
| Admin updates balancing policy/config | Admin | overwrite state | LoadBalancingPolicy (S1, update target) | C1 |
| Client reads balancer/backend status | Client | read projection | StatusView (S1, read projection target) | R2 |

Notes on normalization:

Important choices:

  • request routing is read source
    • the hot path mostly reads current routing state and makes a routing decision
  • backend membership is state transition
    • backend joins/leaves active pool
  • health update is overwrite state
    • current health is the main truth
  • effective routing set update is state transition
    • backend eligibility in routing changes over time
  • config propagation is async process
    • control-plane to data-plane dissemination

This is clearly a:

  • Control Plane + Data Plane system

not:

  • a queue/log/store problem

Step 2 — Critical Path Selection #

| Requirement | Priority class | Why |
| --- | --- | --- |
| Route client request to backend | C1 | wrong routing can send traffic to unhealthy/wrong backends |
| Register/unregister backend | C1 | active backend pool truth depends on it |
| Record backend health | C1 | health truth determines safe routing eligibility |
| Update effective routing set/policy | C1 | this is the control-to-data-plane correctness bridge |
| Propagate config to serving nodes | C1 | stale serving nodes can route incorrectly |
| Update balancing policy/config | C1 | changes future routing behavior and safety |
| Read balancer/backend status | R2 | operational only |

Baseline critical paths:

Main C1 paths:

  • P1 route client request
  • P2 register/unregister backend
  • P3 record backend health
  • P4 update effective routing state
  • P5 propagate config snapshot
  • P6 update balancing policy

This system is driven by:

  • authoritative backend membership
  • current backend health
  • effective routing eligibility
  • propagation of control-plane state to data-plane nodes

Step 3 — Primary State Extraction #

For a load balancer, the minimal primary state is backend membership, health, routing state, and policy/config.

| Candidate object label | Candidate source | Needed for C1/R1? | Decomposition action | Class | Primary? | Owner | Evolution | Scope kind | Scope value |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BackendMembership | direct noun | Yes | keep as candidate | relationship | Yes | service | state machine | relation | service_id + backend_id |
| HealthState | hidden write target | Yes | keep as candidate | entity | Yes | service | overwrite | instance | backend_id |
| RoutingState | hidden write target | Yes | keep as candidate | process | Yes | service | state machine | instance | service_id or listener_id |
| LoadBalancingPolicy | direct noun | Yes | keep as candidate | entity | Yes | service | overwrite | instance | service_id or listener_id |
| ConfigSnapshot | hidden write target | Yes | keep as candidate | projection | Yes | service | overwrite | instance | data_plane_node_id |
| StatusView | derived read model | No | reject as UI artifact | projection | No | derived | overwrite | collection | service_id |

Important modeling choices:

BackendMembership #

Primary because:

  • defines which backends belong to the pool
  • registration/unregistration changes future routing

HealthState #

Primary because:

  • current health is authoritative input for routing eligibility

RoutingState #

Primary because:

  • it is the effective routing-ready set derived from membership + health + policy
  • data plane reads it to make forwarding decisions

LoadBalancingPolicy #

Primary because:

  • changes backend selection semantics

ConfigSnapshot #

ConfigSnapshot is derived from control-plane state, but in a control-plane/data-plane system I still treat it as primary and model it explicitly because:

  • data-plane nodes need a propagated current snapshot
  • propagation lag and versioning matter

Minimal strict primary set:

  • BackendMembership
  • HealthState
  • RoutingState
  • LoadBalancingPolicy
  • ConfigSnapshot
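
To make the scopes concrete, here is a minimal sketch of the five primary objects as immutable records. The field names beyond the scope keys (`status`, `revision`, `algorithm`, `eligible_backends`, `config_version`) and the enum values are illustrative assumptions, not part of the analysis above.

```python
from dataclasses import dataclass
from enum import Enum


class MembershipStatus(Enum):  # assumed two-state lifecycle
    REGISTERED = "registered"
    UNREGISTERED = "unregistered"


class Health(Enum):  # assumed binary health
    HEALTHY = "healthy"
    UNHEALTHY = "unhealthy"


@dataclass(frozen=True)
class BackendMembership:  # scope: service_id + backend_id
    service_id: str
    backend_id: str
    status: MembershipStatus
    version: int  # membership version, bumped on every transition


@dataclass(frozen=True)
class HealthState:  # scope: backend_id
    backend_id: str
    health: Health
    revision: int  # monotonic observation revision


@dataclass(frozen=True)
class LoadBalancingPolicy:  # scope: service_id (or listener_id)
    service_id: str
    algorithm: str  # e.g. "round_robin"; the value set is an assumption
    version: int


@dataclass(frozen=True)
class RoutingState:  # scope: service_id (or listener_id)
    service_id: str
    eligible_backends: tuple[str, ...]
    version: int


@dataclass(frozen=True)
class ConfigSnapshot:  # scope: data_plane_node_id
    node_id: str
    config_version: int
    routing: dict[str, RoutingState]  # keyed by service_id
```

Every record carries its own version or revision, which is what the later mechanism steps rely on.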

Step 4 — Hard Invariants #

For a load balancer, the hard invariants are about correct backend eligibility and safe propagation of routing state to serving nodes.

| Path | Tier | Type | Invariant template | Invariant statement |
| --- | --- | --- | --- | --- |
| P1 (read path): Route request | HARD | eligibility | eligibility template | Action route_request is valid only if the selected backend is currently eligible under the authoritative RoutingState and policy for that service/listener scope at decision time. |
| P2 (write path): Register/unregister backend | HARD | eligibility | eligibility template | Action change_membership is valid only if backend identity and service binding rules hold for service_id + backend_id at decision time. |
| P2 (write path): Register/unregister backend | HARD | uniqueness | uniqueness template | Key service_id + backend_id maps to at most one current membership state within that relation scope. |
| P3 (write path): Record health | HARD | ordering | ordering template | Health updates are ordered by monotonic health revision or observation timestamp within backend_id. |
| P4 (write path): Update effective routing state | HARD | accounting | accounting template | Effective RoutingState(service_id) equals eligible(backends, health, policy) evaluated over authoritative membership, health, and policy state. |
| P5 (write path): Propagate config snapshot | HARD | freshness | freshness template | Data-plane ConfigSnapshot(node) reflects authoritative routing/config state within the configured propagation bound. |
| P6 (write path): Update balancing policy | HARD | ordering | ordering template | Policy revisions are ordered by monotonic policy version within service_id or listener_id. |

What matters most:

1. Route only to eligible backends #

The core safety property is:

  • data plane must not route to unhealthy or ineligible backends

2. Effective routing state is derived from authoritative inputs #

Membership, health, and policy feed the routing-ready set.
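
As a rough illustration of that derivation, the routing-ready set can be written as a pure function of the three inputs. This is only a sketch: the `excluded_backends` policy key is a hypothetical stand-in for whatever eligibility rules a real policy carries, and treating a backend with no recorded health as ineligible is an assumption (fail closed).

```python
def eligible(registered, healthy, policy):
    """Return the sorted backend ids that may receive traffic.

    registered: set of backend ids currently in the pool
    healthy:    dict backend_id -> bool (current HealthState)
    policy:     dict; "excluded_backends" is a hypothetical policy key
    """
    excluded = set(policy.get("excluded_backends", ()))
    return sorted(
        b for b in registered
        # Unknown health counts as unhealthy (assumption: fail closed).
        if healthy.get(b, False) and b not in excluded
    )


routing_ready = eligible(
    registered={"a", "b", "c"},
    healthy={"a": True, "b": False, "c": True},
    policy={"excluded_backends": ["c"]},
)
print(routing_ready)  # ['a']
```

Because the function has no side effects, the routing state can always be recomputed from membership, health, and policy after a crash.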

3. Propagation lag is bounded #

This is the key control-plane/data-plane correctness interface.

Step 5 — Execution Context #

For the load balancer baseline:

| Field | Value | Why |
| --- | --- | --- |
| Topology | single service distributed | one logical load-balancing system with distributed serving nodes |
| Write coordination scope | per object scope | correctness is per service/listener, backend, and data-plane node snapshot |
| Read consistency target | bounded stale allowed | serving nodes usually use recent config snapshots rather than synchronous strong control-plane reads on each request |
| Holder model | none | request routing does not rely on lease-like item ownership in the hot path |
| Compensation acceptable? | No | routing to an unhealthy backend is a serving correctness failure, not a compensable workflow |

Derived implications:

  • holder_may_crash = false
    • no temporary client-held ownership primitive is central to the hot path
  • cross_service_write = false
    • baseline keeps control-plane state and data-plane config within one logical system
  • bounded_staleness_allowed = true
    • data plane may route using recent snapshot, not necessarily latest control-plane write
  • cross_service_atomicity_required = false
    • no multi-service transaction needed in baseline
  • exclusive_claim_required = false
    • not a lease/claim problem in the hot request path
  • guarded_by_current_state = true
    • routing-state derivation and membership changes depend on current state

This pushes us toward:

  • authoritative control plane
  • versioned snapshots to data-plane nodes
  • hot path reads from local or near-local routing snapshot

Step 6 — Deterministic Mechanism Selection #

6A. Write Shape #

| Path | Why | Write shape |
| --- | --- | --- |
| P1 route request | read-only serving decision against current snapshot | n/a (read path) |
| P2 register/unregister backend | membership lifecycle changes current relation state | guarded state transition |
| P3 record health | replace current backend health view | overwrite current value |
| P4 update effective routing state | recompute current routing-eligible set | overwrite current value |
| P5 propagate config snapshot | push/refresh current serving snapshot | overwrite current value |
| P6 update balancing policy | replace current policy revision | overwrite current value |

6B. Base Mechanism #

| Path | Write shape | Base mechanism | Required companions |
| --- | --- | --- | --- |
| P2 register/unregister backend | guarded state transition | CAS on (state, version) | membership version |
| P3 record health | overwrite current value | CAS on version or monotonic observation overwrite | health revision/timestamp |
| P4 update effective routing state | overwrite current value | single writer control-plane recompute | routing version |
| P5 propagate config snapshot | overwrite current value | single writer snapshot publication | config version |
| P6 update balancing policy | overwrite current value | CAS on version | policy version |

Why these fit:

Membership #

Backend pool membership has lifecycle semantics and should not flap via blind overwrites.
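
A guarded transition of this kind could look roughly like the sketch below: an in-memory stand-in for the membership store that applies a change only when both the caller's expected version and the state-machine rule hold. The state names, the `ALLOWED` transition table, and the `MembershipStore` class are all hypothetical; a real deployment would run the same check as a conditional write in the authoritative store.

```python
import threading


class VersionConflict(Exception):
    """Raised when the caller's expected version is stale."""


# Assumed lifecycle: absent -> registered <-> unregistered.
ALLOWED = {
    ("absent", "registered"),
    ("registered", "unregistered"),
    ("unregistered", "registered"),
}


class MembershipStore:
    def __init__(self):
        self._lock = threading.Lock()
        self._rows = {}  # (service_id, backend_id) -> (state, version)

    def get(self, service_id, backend_id):
        with self._lock:
            return self._rows.get((service_id, backend_id), ("absent", 0))

    def transition(self, service_id, backend_id, new_state, expected_version):
        """CAS on (state, version); returns the new membership version."""
        key = (service_id, backend_id)
        with self._lock:
            state, version = self._rows.get(key, ("absent", 0))
            if version != expected_version:
                raise VersionConflict(f"expected {expected_version}, found {version}")
            if (state, new_state) not in ALLOWED:
                raise ValueError(f"illegal transition {state} -> {new_state}")
            self._rows[key] = (new_state, version + 1)
            return version + 1


store = MembershipStore()
v1 = store.transition("svc", "b1", "registered", expected_version=0)
```

A second writer that still holds version 0 gets `VersionConflict` instead of silently undoing the registration, which is the flap protection the paragraph above asks for.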

Health, routing, and policy #

These are current-state config/status values:

  • latest version matters
  • overwrite semantics are natural

Config propagation #

This is a classic control-plane publish of versioned snapshots.

Canonical substrate implied:

  • control plane owns membership, health, policy, and routing state
  • data plane consumes versioned snapshots
  • request routing is local snapshot read + backend selection algorithm

Step 7 — Read Model / Source of Truth #

For a load balancer, authoritative truth lives in the control plane, while the data plane uses propagated snapshots.

| Concept | Truth | Read path | Rebuild path |
| --- | --- | --- | --- |
| C1 (source concept): Backend membership | BackendMembership | read source directly | authoritative control-plane store |
| C2 (source concept): Backend health | HealthState | read source directly | authoritative health store |
| C3 (source concept): Balancing policy | LoadBalancingPolicy | read source directly | authoritative policy store |
| C4 (source concept): Effective routing state | RoutingState | read source directly | recompute from membership + health + policy |
| C5 (projection concept): Data-plane config snapshot | ConfigSnapshot derived from authoritative control state | materialized view | rebuild from latest control-plane version |
| C6 (projection concept): Status/dashboard | derived from control-plane and serving-node observations | materialized view | recompute from primary state |

Important point:

For the hot path:

  • data-plane routing typically does not query control-plane state synchronously per request
  • instead, it uses ConfigSnapshot

So the serving read path is:

  • local materialized view / config snapshot

not:

  • source read on every request

That is the core control-plane/data-plane distinction.

Step 8 — Failure Handling #

| Path | Retry | Competing writers | Crash after commit | Publish failure | Stale holder |
| --- | --- | --- | --- | --- | --- |
| P2 register/unregister backend | retry with membership version | stale membership change loses CAS | committed membership survives control-plane crash if persisted | config propagation may lag to serving nodes | n/a |
| P3 record health | retry with monotonic observation time/revision | latest valid health view wins by version/timestamp policy | committed health update survives crash if persisted | snapshot propagation may lag within bounded window | n/a |
| P4 update effective routing state | recompute retry safe from authoritative inputs | single writer or versioned recompute wins | recompute can be rerun after crash from primary state | snapshot publication may lag | n/a |
| P5 propagate config snapshot | retry snapshot push/pull using versioned config | older snapshot version loses to newer version | serving node may keep last good snapshot until refreshed | failed push retried or pulled later | n/a |
| P6 update balancing policy | retry with policy version | stale policy update loses CAS | committed policy survives crash if persisted | routing snapshot may lag briefly | n/a |
| P1 route request | no mutation retry issue on hot path | multiple serving nodes can route concurrently using local snapshots | node crash just drops its in-flight traffic, not control-plane truth | n/a | stale serving node snapshot must age out or be refreshed |

What matters most:

1. Versioned snapshot propagation #

Serving nodes must reject older snapshots and move monotonically forward by config version.
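
On the serving node, that rule reduces to a version comparison before the swap. The sketch below assumes a hypothetical `DataPlaneNode` with an `offer` method whose return value doubles as the ack of the version the node now holds. It makes duplicate and out-of-order deliveries harmless.

```python
class DataPlaneNode:
    """Holds the node-local ConfigSnapshot (hypothetical shape)."""

    def __init__(self):
        self.config_version = 0  # 0 means "nothing installed yet"
        self.snapshot = {}

    def offer(self, version, snapshot):
        """Install the snapshot only if it is strictly newer.

        Always returns the version the node holds afterwards, so the
        publisher can use it as the ack.
        """
        if version > self.config_version:
            self.snapshot = dict(snapshot)  # copy, then swap in one step
            self.config_version = version
        return self.config_version


node = DataPlaneNode()
ack_new = node.offer(2, {"svc": ["a", "b"]})  # installed
ack_old = node.offer(1, {"svc": ["a"]})       # late delivery, ignored
print(ack_new, ack_old, node.snapshot)
```

Because the ack reports the held version rather than the offered one, a publisher retrying an old push learns immediately that the node is already ahead.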

2. Bounded stale routing #

The hot path converges on control-plane changes within the propagation bound; it is not synchronously updated on every control-plane write.

3. Health flapping #

Health-state overwrite policy must handle noisy observations without causing excessive routing churn.
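
One common way to handle this is hysteresis: the published health only flips after several consecutive contrary observations. The sketch below is one possible shape, not a prescribed policy; the `fall` and `rise` parameter names and their default values are assumptions chosen for illustration.

```python
class DampenedHealth:
    """Publishes a health flip only after a run of contrary probes."""

    def __init__(self, fall=3, rise=2, healthy=True):
        self.fall = fall        # consecutive failures needed to mark unhealthy
        self.rise = rise        # consecutive successes needed to mark healthy
        self.healthy = healthy  # currently published state
        self._streak = 0        # length of the current contrary run

    def observe(self, ok):
        """Feed one probe result; return True if the published state flipped."""
        if ok == self.healthy:
            self._streak = 0    # observation agrees, so the contrary run resets
            return False
        self._streak += 1
        needed = self.rise if ok else self.fall
        if self._streak >= needed:
            self.healthy = ok
            self._streak = 0
            return True
        return False


h = DampenedHealth(fall=3, rise=2)
flips = [h.observe(ok) for ok in (False, False, True, False, False, False)]
print(flips, h.healthy)
```

In this run the single success in the middle resets the failure streak, so only the final probe flips the state. Only flips would need to trigger a routing recompute, which keeps noisy probes from turning into config churn.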

Failure summary:

The load balancer stays correct if:

  • authoritative membership/health/policy live in control plane
  • effective routing state is recomputable
  • data-plane nodes use versioned snapshots
  • stale snapshots are bounded and replaced

Step 9 — Scale Adjustments #

| Hotspot | Type | First response |
| --- | --- | --- |
| very high request volume on data plane | read hotspot | add more serving nodes; keep routing decision local to snapshot |
| control-plane churn from health flaps | contention hotspot | dampen health transitions and rate-limit config recompute |
| hot listener/service with huge backend set | fan-out hotspot | shard backend set and snapshot propagation by listener/service scope |
| snapshot propagation storms | fan-out hotspot | use versioned incremental updates and pull-on-miss fallback |
| status/dashboard queries | read hotspot | keep them as derived views only |
| membership/config update bursts | write throughput hotspot | batch updates and recompute routing state incrementally |

What scales well:

The system scales well if:

  • the hot path is local snapshot read + selection
  • control plane is narrow and versioned
  • propagation is incremental
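
Incremental propagation with a pull-on-miss fallback might be sketched as follows. Each delta names the version it was computed against; a node that has missed an update detects the gap and pulls a full snapshot instead of applying the delta. The delta shape (`base_version`, `new_version`, `added`, `removed`) and the `fetch_full` callback are hypothetical.

```python
class IncrementalNode:
    def __init__(self, fetch_full):
        # fetch_full: callable returning (version, backends); stands in for
        # a pull from the control plane (hypothetical interface).
        self._fetch_full = fetch_full
        self.version = 0
        self.backends = set()

    def apply_delta(self, base_version, new_version, added, removed):
        """Apply one delta; return the version held afterwards."""
        if new_version <= self.version:
            return self.version  # duplicate or stale delta: ignore
        if base_version != self.version:
            # Gap detected: fall back to a full pull.
            full_version, full_backends = self._fetch_full()
            if full_version > self.version:
                self.version = full_version
                self.backends = set(full_backends)
            return self.version
        self.backends |= set(added)
        self.backends -= set(removed)
        self.version = new_version
        return self.version


node = IncrementalNode(fetch_full=lambda: (5, {"a", "c"}))
node.apply_delta(0, 1, added=["a", "b"], removed=[])  # in order: applied
node.apply_delta(3, 4, added=["z"], removed=[])       # gap: full pull
print(node.version, sorted(node.backends))
```

The full pull is the expensive path, so it only runs when a node has actually fallen behind; in-order nodes pay for the delta alone.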

What fails first:

Usually:

  • health-flap storms
  • config propagation bursts
  • giant backend pools for one service

Canonical design conclusion:

The mechanical outcome is:

  • primary state:
    • BackendMembership
    • HealthState
    • RoutingState
    • LoadBalancingPolicy
    • ConfigSnapshot
  • critical invariants:
    • only eligible backends receive traffic
    • routing state equals function of membership, health, and policy
    • data plane uses bounded-stale versioned snapshots
  • mechanisms:
    • guarded membership changes
    • overwrite current health/policy/routing state
    • single-writer control-plane recompute
  • reads:
    • source reads in control plane
    • materialized config snapshots in data plane

Polished interview answer:

“I’d design the load balancer as a control-plane/data-plane system. The control plane owns backend membership, current health, balancing policy, and the effective routing-eligible set. Data-plane nodes don’t synchronously query control plane on every request; instead they route using versioned config snapshots propagated from control plane. Membership changes are guarded, health and policy are versioned current-state updates, and routing state is recomputed from those authoritative inputs. The main scaling levers are keeping routing local to the data plane, damping health flaps, and using incremental snapshot propagation.”

Concrete Substrate #

I’ll choose a versioned control plane plus a stateless data-plane balancer fleet (stateless apart from its local snapshot) as the concrete baseline, because that matches the mechanics we derived:

  • authoritative control-plane state
  • versioned propagated config snapshots
  • local routing decisions in data plane

Concrete substrate:

  • control plane
    • authoritative store for backend membership, health, and policy
    • recomputes effective routing state
    • publishes versioned config snapshots
  • data plane
    • many balancer nodes terminate client connections
    • each holds current snapshot for one or more listeners/services
    • selects healthy backend locally

Concrete tech family:

  • control plane in Go or Java
  • authoritative metadata in etcd or a small strongly consistent store
  • snapshot propagation via watch streams, pub/sub, or pull-on-version-miss
  • data-plane balancer in Envoy-like or custom L4/L7 proxy fleet

Operation Layer #

1. Route request #

API

  • client request to virtual IP / listener

Initiator

  • client

Entry point

  • data-plane load balancer node

Authoritative decider

  • local ConfigSnapshot on that serving node

Precondition

  • snapshot version current enough for bounded-stale routing contract
  • selected backend eligible in snapshot

Transition

  • none on control-plane truth
  • local backend selection for this request

Response

  • forwarded request / response relay
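
A minimal sketch of this hot path, under stated assumptions: the node checks the age of its local snapshot against the staleness bound, then picks a backend by round-robin. The `LocalRouter` class, the 30-second default bound, round-robin as the selection algorithm, and raising on a stale snapshot are all illustrative choices; a real fleet might instead keep serving from the last good snapshot while it refreshes.

```python
import itertools
import time


class StaleSnapshot(Exception):
    pass


class NoEligibleBackend(Exception):
    pass


class LocalRouter:
    def __init__(self, max_snapshot_age_s=30.0, clock=time.monotonic):
        self._max_age = max_snapshot_age_s  # assumed staleness bound
        self._clock = clock
        self._backends = ()
        self._refreshed_at = None
        self._counter = itertools.count()
        self.config_version = 0

    def install_snapshot(self, version, eligible_backends):
        self._backends = tuple(eligible_backends)
        self.config_version = version
        self._refreshed_at = self._clock()

    def route(self):
        """Pick a backend from the local snapshot; no control-plane call."""
        if self._refreshed_at is None:
            raise StaleSnapshot("no snapshot installed")
        if self._clock() - self._refreshed_at > self._max_age:
            raise StaleSnapshot("snapshot older than staleness bound")
        if not self._backends:
            raise NoEligibleBackend("snapshot has no eligible backend")
        return self._backends[next(self._counter) % len(self._backends)]


now = [0.0]  # fake clock so the example is deterministic
router = LocalRouter(max_snapshot_age_s=10.0, clock=lambda: now[0])
router.install_snapshot(1, ["a", "b"])
picks = [router.route() for _ in range(4)]
print(picks)
```

Nothing in `route` touches control-plane state, which is what lets the data plane scale by adding nodes.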

2. Register / unregister backend #

API

  • RegisterBackend(service_id, backend_id, metadata)
  • UnregisterBackend(service_id, backend_id, expected_version?)

Initiator

  • backend/service agent

Entry point

  • control-plane API

Authoritative decider

  • control-plane membership store

Precondition

  • valid service/backend identity

Transition

  • update BackendMembership
  • bump membership version

Response

  • {membership_version}

3. Record health #

API

  • internal ReportHealth(backend_id, health_state, observation_revision)

Initiator

  • system / health checker

Entry point

  • control plane

Authoritative decider

  • health-state owner in control plane

Precondition

  • observation revision/timestamp is not older than the stored one, per the configured ordering policy

Transition

  • overwrite HealthState

Response

  • internal success

4. Update routing state #

API

  • internal recompute flow

Initiator

  • system/control plane

Entry point

  • routing-state owner

Authoritative decider

  • control-plane recompute worker

Precondition

  • latest membership/health/policy versions available

Transition

  • overwrite RoutingState

Response

  • internal success with new routing version

5. Propagate config snapshot #

API

  • internal snapshot push/pull stream

Initiator

  • control plane

Entry point

  • data-plane node

Authoritative decider

  • control-plane snapshot publisher

Precondition

  • newer config version available

Transition

  • overwrite local ConfigSnapshot(node)

Response

  • ack current config version

6. Update balancing policy #

API

  • PutPolicy(service_id, config, expected_version?)

Initiator

  • admin

Entry point

  • control-plane API

Authoritative decider

  • policy store owner

Precondition

  • expected version matches if supplied

Transition

  • overwrite LoadBalancingPolicy
  • bump version

Response

  • {policy_version}

Entry Point vs Decider vs Responder #

| Path | Entry point | Authoritative decider | Physical responder | Logical responder |
| --- | --- | --- | --- | --- |
| route request | data-plane node | local current config snapshot | data-plane node | load balancer service |
| register/unregister backend | control-plane API | control-plane membership store | control-plane node | load balancer service |
| record health | control-plane API | control-plane health owner | control-plane node | load balancer service |
| recompute routing state | control plane | control-plane recompute worker | internal | load balancer service |
| propagate snapshot | data-plane node | control-plane snapshot publisher | control-plane/data-plane nodes | load balancer service |
| update policy | control-plane API | policy store owner | control-plane node | load balancer service |

Concrete HLD #

Main components:

  • control-plane API
    • backend registration
    • policy updates
    • health ingestion
  • control-plane state store
    • membership, health, policy, routing versions
  • routing recompute worker
    • derives effective routing state
  • snapshot distribution layer
    • pushes/pulls versioned config to serving nodes
  • data-plane balancer fleet
    • handles client traffic using local snapshots

Concrete Technology Realizations #

Stronger infra-native answer #

  • control plane in Go or Java
  • authoritative metadata in etcd
  • watch-stream or pub/sub snapshot propagation
  • data-plane proxy fleet using Envoy-like process or custom proxy nodes

Short interview version #

“I’d build the load balancer as a versioned control-plane/data-plane system. Control plane owns backend membership, health, policy, and effective routing state, then publishes versioned snapshots to balancer nodes. The data plane uses only local snapshots on the hot path to choose a healthy backend, while control plane handles registration, health updates, and incremental config propagation.”