Load Balancer Analysis Note
This note captures the full step-by-step analysis for a load balancer service: backend membership, health state, effective routing state, control-plane propagation, and data-plane request routing.
Step 1 — Normalize #
Assume the baseline prompt is:
- design a load balancer service
- clients send requests to a virtual endpoint
- requests should be distributed across healthy backends
- unhealthy backends should stop receiving traffic
- system should scale across nodes
Normalize into state-affecting paths.
| Requirement | Actor | Operation | State touched | Priority |
|---|---|---|---|---|
| Client request is routed to a backend | Client | read source | S1 read source target: RoutingState | C1 |
| Backend registers/unregisters with balancer | Client | state transition | S1 update target: BackendMembership | C1 |
| System records backend health | System | overwrite state | S1 update target: HealthState | C1 |
| System updates effective routing set/policy | System | state transition | S1 update target: RoutingState | C1 |
| System propagates control-plane config to serving nodes | System | async process | S1 hidden write target: ConfigSnapshot | C1 |
| Admin updates balancing policy/config | Admin | overwrite state | S1 update target: LoadBalancingPolicy | C1 |
| Client reads balancer/backend status | Client | read projection | S1 read projection target: StatusView | R2 |
Notes on normalization:
Important choices:
- request routing is read source: the hot path mostly reads current routing state and makes a routing decision
- backend membership is state transition: a backend joins or leaves the active pool
- health update is overwrite state: current health is the main truth
- effective routing set update is state transition: backend eligibility in routing changes over time
- config propagation is async process: control-plane to data-plane dissemination
This is clearly a Control Plane + Data Plane system, not a queue/log/store problem.
Step 2 — Critical Path Selection #
| Requirement | Priority class | Why |
|---|---|---|
| Route client request to backend | C1 | wrong routing can send traffic to unhealthy/wrong backends |
| Register/unregister backend | C1 | active backend pool truth depends on it |
| Record backend health | C1 | health truth determines safe routing eligibility |
| Update effective routing set/policy | C1 | this is the control-to-data-plane correctness bridge |
| Propagate config to serving nodes | C1 | stale serving nodes can route incorrectly |
| Update balancing policy/config | C1 | changes future routing behavior and safety |
| Read balancer/backend status | R2 | operational only |
Baseline critical paths:
Main C1 paths:
- P1: route client request
- P2: register/unregister backend
- P3: record backend health
- P4: update effective routing state
- P5: propagate config snapshot
- P6: update balancing policy
This system is driven by:
- authoritative backend membership
- current backend health
- effective routing eligibility
- propagation of control-plane state to data-plane nodes
Step 3 — Primary State Extraction #
For a load balancer, the minimal primary state is backend membership, health, routing state, and policy/config.
| Candidate object label | Candidate source | Candidate needed for C1/R1? | Candidate decomposition action | Class | Primary? | Owner | Evolution | Scope kind | Scope value |
|---|---|---|---|---|---|---|---|---|---|
| BackendMembership | direct noun | Yes | keep as candidate | relationship | Yes | service | state machine | relation | service_id + backend_id |
| HealthState | hidden write target | Yes | keep as candidate | entity | Yes | service | overwrite | instance | backend_id |
| RoutingState | hidden write target | Yes | keep as candidate | process | Yes | service | state machine | instance | service_id or listener_id |
| LoadBalancingPolicy | direct noun | Yes | keep as candidate | entity | Yes | service | overwrite | instance | service_id or listener_id |
| ConfigSnapshot | hidden write target | Yes | keep as candidate | projection | Yes | service | overwrite | instance | data_plane_node_id |
| StatusView | derived read model | No | reject as UI artifact | projection | No | derived | overwrite | collection | service_id |
Important modeling choices:
BackendMembership #
Primary because:
- defines which backends belong to the pool
- registration/unregistration changes future routing
HealthState #
Primary because:
- current health is authoritative input for routing eligibility
RoutingState #
Primary because:
- it is the effective routing-ready set derived from membership + health + policy
- data plane reads it to make forwarding decisions
LoadBalancingPolicy #
Primary because:
- changes backend selection semantics
ConfigSnapshot #
For a control-plane/data-plane system, I keep this primary enough to model explicitly because:
- data-plane nodes need a propagated current snapshot
- propagation lag and versioning matter
Minimal strict primary set:
- BackendMembership
- HealthState
- RoutingState
- LoadBalancingPolicy
- ConfigSnapshot
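As a rough illustration of the primary set, the five objects could be modeled as immutable records keyed by the scope values from the table above. This is only a sketch: the field names (`state`, `revision`, `algorithm`, `eligible_backends`, `config_version`) and the enum values are assumptions made for the example, not part of the analysis.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class MembershipState(Enum):
    # Hypothetical lifecycle states for the membership state machine.
    REGISTERED = "registered"
    UNREGISTERED = "unregistered"


class Health(Enum):
    HEALTHY = "healthy"
    UNHEALTHY = "unhealthy"


@dataclass(frozen=True)
class BackendMembership:
    service_id: str          # scope: service_id + backend_id
    backend_id: str
    state: MembershipState
    version: int


@dataclass(frozen=True)
class HealthState:
    backend_id: str          # scope: backend_id
    health: Health
    revision: int            # monotonic observation revision


@dataclass(frozen=True)
class LoadBalancingPolicy:
    service_id: str          # scope: service_id or listener_id
    algorithm: str           # hypothetical field, e.g. "round_robin"
    version: int


@dataclass(frozen=True)
class RoutingState:
    service_id: str
    eligible_backends: Tuple[str, ...]
    version: int


@dataclass(frozen=True)
class ConfigSnapshot:
    node_id: str             # scope: data_plane_node_id
    config_version: int
    routing: Tuple[RoutingState, ...]
```

Every record carries its own version or revision, which is what the later CAS and snapshot-ordering mechanisms rely on.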
Step 4 — Hard Invariants #
For a load balancer, the hard invariants are about correct backend eligibility and safe propagation of routing state to serving nodes.
| Path | Tier | Type | Invariant template | Invariant statement |
|---|---|---|---|---|
| P1 (read path): Route request | HARD | eligibility | eligibility template | Action route_request is valid only if the selected backend is currently eligible under the authoritative RoutingState and policy for that service/listener scope at decision time. |
| P2 (write path): Register/unregister backend | HARD | eligibility | eligibility template | Action change_membership is valid only if backend identity and service binding rules hold on service_id + backend_id at decision time. |
| P2 (write path): Register/unregister backend | HARD | uniqueness | uniqueness template | Key service_id + backend_id maps to at most one current membership state within that relation scope. |
| P3 (write path): Record health | HARD | ordering | ordering template | Health updates are ordered by monotonic health revision or observation timestamp within backend_id. |
| P4 (write path): Update effective routing state | HARD | accounting | accounting template | Effective RoutingState(service_id) equals eligible(backends, health, policy) computed over authoritative membership, health, and policy state. |
| P5 (write path): Propagate config snapshot | HARD | freshness | freshness template | Data-plane ConfigSnapshot(node) reflects authoritative routing/config state within the configured propagation bound. |
| P6 (write path): Update balancing policy | HARD | ordering | ordering template | Policy revisions are ordered by monotonic policy version within service_id or listener_id. |
What matters most:
1. Route only to eligible backends #
The core safety property is:
- data plane must not route to unhealthy or ineligible backends
2. Effective routing state is derived from authoritative inputs #
Membership, health, and policy feed the routing-ready set.
3. Propagation lag is bounded #
This is the key control-plane/data-plane correctness interface.
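The first two properties can be illustrated with a small pure function. The sketch below assumes a hypothetical `Policy` with a `drained` set (backends an admin has taken out of rotation) and treats a backend with no health record as unhealthy; neither choice comes from the analysis itself.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class Policy:
    # Hypothetical policy field: backends administratively removed from rotation.
    drained: FrozenSet[str] = frozenset()


def eligible(members: FrozenSet[str], health: Dict[str, bool], policy: Policy) -> List[str]:
    """Derive the routing-eligible set purely from authoritative inputs (P4)."""
    return sorted(
        b for b in members
        if health.get(b, False)        # assumption: unknown health counts as unhealthy
        and b not in policy.drained
    )


def check_route(selected: str, routing_state: List[str]) -> bool:
    """P1 invariant: a routing decision is valid only for an eligible backend."""
    return selected in routing_state


routing_state = eligible(
    frozenset({"b1", "b2", "b3", "b4"}),
    {"b1": True, "b2": False, "b3": True},   # b4 has no health record yet
    Policy(drained=frozenset({"b3"})),
)
```

Because `eligible` has no hidden inputs, the routing state can be recomputed at any time from membership, health, and policy, which is what makes the accounting invariant checkable.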
Step 5 — Execution Context #
For the load balancer baseline:
| Field | Value | Why |
|---|---|---|
| Topology | single service distributed | one logical load-balancing system with distributed serving nodes |
| Write coordination scope | per object scope | correctness is per service/listener, backend, and data-plane node snapshot |
| Read consistency target | bounded stale allowed | serving nodes usually use recent config snapshots rather than synchronous strong control-plane reads on each request |
| Holder model | none | request routing does not rely on lease-like item ownership in the hot path |
| Compensation acceptable? | No | routing to an unhealthy backend is a serving correctness failure, not a compensable workflow |
Derived implications:
- holder_may_crash = false: no temporary client-held ownership primitive is central to the hot path
- cross_service_write = false: baseline keeps control-plane state and data-plane config within one logical system
- bounded_staleness_allowed = true: data plane may route using a recent snapshot, not necessarily the latest control-plane write
- cross_service_atomicity_required = false: no multi-service transaction needed in baseline
- exclusive_claim_required = false: not a lease/claim problem in the hot request path
- guarded_by_current_state = true: routing-state derivation and membership changes depend on current state
This pushes us toward:
- authoritative control plane
- versioned snapshots to data-plane nodes
- hot path reads from local or near-local routing snapshot
Step 6 — Deterministic Mechanism Selection #
6A. Write Shape #
| Path | Why | Write shape |
|---|---|---|
| P1 route request | read-only serving decision against current snapshot | n/a read path |
| P2 register/unregister backend | membership lifecycle changes current relation state | guarded state transition |
| P3 record health | replace current backend health view | overwrite current value |
| P4 update effective routing state | recompute current routing-eligible set | overwrite current value |
| P5 propagate config snapshot | push/refresh current serving snapshot | overwrite current value |
| P6 update balancing policy | replace current policy revision | overwrite current value |
6B. Base Mechanism #
| Path | Write shape | Base mechanism | Required companions |
|---|---|---|---|
| P2 register/unregister backend | guarded state transition | CAS on (state, version) | membership version |
| P3 record health | overwrite current value | CAS on version or monotonic observation overwrite | health revision/timestamp |
| P4 update effective routing state | overwrite current value | single writer control-plane recompute | routing version |
| P5 propagate config snapshot | overwrite current value | single writer snapshot publication | config version |
| P6 update balancing policy | overwrite current value | CAS on version | policy version |
Why these fit:
Membership #
Backend pool membership has lifecycle semantics and should not flap via blind overwrites.
Health, routing, and policy #
These are current-state config/status values:
- latest version matters
- overwrite semantics are natural
Config propagation #
This is a classic control-plane publish of versioned snapshots.
Canonical substrate implied:
- control plane owns membership, health, policy, and routing state
- data plane consumes versioned snapshots
- request routing is local snapshot read + backend selection algorithm
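To make the guarded membership transition concrete, here is a minimal in-memory sketch of CAS on (state, version). The lifecycle states and the `ALLOWED` transition table are hypothetical, and a real control plane would need the compare and the write to be atomic in its store (for example inside a transaction); this sketch is single-threaded.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# Hypothetical lifecycle: permitted (from_state, to_state) pairs.
ALLOWED = {
    ("absent", "registered"),
    ("registered", "draining"),
    ("draining", "unregistered"),
    ("registered", "unregistered"),
    ("unregistered", "registered"),
}


@dataclass
class Record:
    state: str = "absent"
    version: int = 0


class MembershipStore:
    def __init__(self) -> None:
        self._records: Dict[Tuple[str, str], Record] = {}

    def get(self, service_id: str, backend_id: str) -> Record:
        return self._records.get((service_id, backend_id), Record())

    def transition(self, service_id: str, backend_id: str,
                   expected_state: str, expected_version: int,
                   new_state: str) -> bool:
        """Apply the change only if the caller saw the current (state, version)."""
        key = (service_id, backend_id)
        current = self._records.get(key, Record())
        if (current.state, current.version) != (expected_state, expected_version):
            return False                     # stale writer loses the CAS
        if (current.state, new_state) not in ALLOWED:
            return False                     # not a legal lifecycle step
        self._records[key] = Record(new_state, current.version + 1)
        return True


store = MembershipStore()
first = store.transition("svc", "b1", "absent", 0, "registered")
stale = store.transition("svc", "b1", "absent", 0, "registered")
```

The second call fails because it still names version 0, so a retried or racing registration cannot silently overwrite a newer membership state.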
Step 7 — Read Model / Source of Truth #
For a load balancer, authoritative truth lives in the control plane, while the data plane uses propagated snapshots.
| Concept | Truth | Read path | Rebuild path |
|---|---|---|---|
| C1 (source concept): Backend membership | BackendMembership | read source directly | authoritative control-plane store |
| C2 (source concept): Backend health | HealthState | read source directly | authoritative health store |
| C3 (source concept): Balancing policy | LoadBalancingPolicy | read source directly | authoritative policy store |
| C4 (source concept): Effective routing state | RoutingState | read source directly | recompute from membership + health + policy |
| C5 (projection concept): Data-plane config snapshot | ConfigSnapshot derived from authoritative control state | materialized view | rebuild from latest control-plane version |
| C6 (projection concept): Status/dashboard | derived from control-plane and serving-node observations | materialized view | recompute from primary state |
Important point:
For the hot path:
- data-plane routing typically does not query control-plane state synchronously per request
- instead, it uses ConfigSnapshot
So the serving read path is:
- local materialized view / config snapshot
not:
- source read on every request
That is the core control-plane/data-plane distinction.
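A sketch of what that serving read path might look like: the node keeps the installed snapshot in memory and selects a backend without any control-plane call. Round-robin is used purely as an example selection algorithm, and the class and method names are assumptions.

```python
from typing import Dict, List, Optional


class DataPlaneNode:
    """Serves requests from a locally installed snapshot only."""

    def __init__(self) -> None:
        self.config_version = 0
        self._backends: Dict[str, List[str]] = {}
        self._cursors: Dict[str, int] = {}

    def install(self, config_version: int, backends_by_service: Dict[str, List[str]]) -> None:
        # Called by the propagation layer, never by the request path.
        self.config_version = config_version
        self._backends = {s: list(b) for s, b in backends_by_service.items()}

    def route(self, service_id: str) -> Optional[str]:
        # Hot path: a local dictionary read plus a round-robin cursor bump.
        backends = self._backends.get(service_id, [])
        if not backends:
            return None                      # nothing eligible in this snapshot
        i = self._cursors.get(service_id, 0)
        self._cursors[service_id] = i + 1
        return backends[i % len(backends)]


node = DataPlaneNode()
node.install(1, {"svc": ["b1", "b2"]})
picks = [node.route("svc") for _ in range(3)]
```

Installing a newer snapshot swaps the backend list, and the next request simply sees the new list.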
Step 8 — Failure Handling #
| Path | Retry | Competing writers | Crash after commit | Publish failure | Stale holder |
|---|---|---|---|---|---|
| P2 register/unregister backend | retry with membership version | stale membership change loses CAS | committed membership survives control-plane crash if persisted | config propagation may lag to serving nodes | n/a |
| P3 record health | retry with monotonic observation time/revision | latest valid health view wins by version/timestamp policy | committed health update survives crash if persisted | snapshot propagation may lag within bounded window | n/a |
| P4 update effective routing state | recompute retry safe from authoritative inputs | single writer or versioned recompute wins | recompute can be rerun after crash from primary state | snapshot publication may lag | n/a |
| P5 propagate config snapshot | retry snapshot push/pull using versioned config | older snapshot version loses to newer version | serving node may keep last good snapshot until refreshed | failed push retried or pulled later | n/a |
| P6 update balancing policy | retry with policy version | stale policy update loses CAS | committed policy survives crash if persisted | routing snapshot may lag briefly | n/a |
| P1 route request | no mutation retry issue on hot path | multiple serving nodes can route concurrently using local snapshots | node crash just drops its in-flight traffic, not control-plane truth | n/a | stale serving node snapshot must age out or be refreshed |
What matters most:
1. Versioned snapshot propagation #
Serving nodes must reject older snapshots and move monotonically forward by config version.
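A minimal sketch of that rule, assuming a hypothetical `Snapshot` record and a holder on each serving node: an incoming snapshot is installed only when its version is strictly newer, and the node always acks the version it actually holds.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Snapshot:
    config_version: int
    eligible_backends: Tuple[str, ...]


class SnapshotHolder:
    def __init__(self) -> None:
        self.current = Snapshot(0, ())

    def apply(self, incoming: Snapshot) -> int:
        """Install only strictly newer snapshots; return the version to ack."""
        if incoming.config_version > self.current.config_version:
            self.current = incoming
        # Duplicate or reordered deliveries fall through unchanged.
        return self.current.config_version


holder = SnapshotHolder()
ack_new = holder.apply(Snapshot(2, ("b1",)))
ack_old = holder.apply(Snapshot(1, ("b1", "b2")))   # late, older delivery
```

This makes retried pushes and out-of-order deliveries harmless, since applying the same or an older snapshot is a no-op.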
2. Bounded stale routing #
The hot path converges on control-plane changes within the propagation bound; it is not strongly synchronized with every control-plane change.
3. Health flapping #
Health-state overwrite policy must handle noisy observations without causing excessive routing churn.
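One common way to damp noisy observations is hysteresis: require several consecutive contradicting checks before the recorded state flips. The sketch below assumes hypothetical `rise` and `fall` thresholds; the values and the class name are illustrative only.

```python
class DampedHealth:
    """Flip recorded health only after N consecutive contradicting checks."""

    def __init__(self, rise: int = 2, fall: int = 3, healthy: bool = False) -> None:
        self.rise = rise        # consecutive passes needed to become healthy
        self.fall = fall        # consecutive failures needed to become unhealthy
        self.healthy = healthy
        self._streak = 0        # consecutive observations contradicting current state

    def observe(self, ok: bool) -> bool:
        """Record one check result; return True only when the state flipped."""
        if ok == self.healthy:
            self._streak = 0    # an agreeing observation resets the streak
            return False
        self._streak += 1
        needed = self.rise if ok else self.fall
        if self._streak >= needed:
            self.healthy = ok
            self._streak = 0
            return True
        return False


backend = DampedHealth(rise=2, fall=3, healthy=True)
flips = [backend.observe(ok) for ok in [False, True, False, False]]
```

Only a flip needs to reach the HealthState write and the routing recompute, so a backend that alternates between passing and failing does not churn the routing state.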
Failure summary:
The load balancer stays correct if:
- authoritative membership/health/policy live in control plane
- effective routing state is recomputable
- data-plane nodes use versioned snapshots
- stale snapshots are bounded and replaced
Step 9 — Scale Adjustments #
| Hotspot | Type | First response |
|---|---|---|
| very high request volume on data plane | read hotspot | add more serving nodes; keep routing decision local to snapshot |
| control-plane churn from health flaps | contention hotspot | dampen health transitions and rate-limit config recompute |
| hot listener/service with huge backend set | fan-out hotspot | shard backend set and snapshot propagation by listener/service scope |
| snapshot propagation storms | fan-out hotspot | use versioned incremental updates and pull-on-miss fallback |
| status/dashboard queries | read hotspot | keep them as derived views only |
| membership/config update bursts | write throughput hotspot | batch updates and recompute routing state incrementally |
What scales well:
The system scales well if:
- the hot path is local snapshot read + selection
- control plane is narrow and versioned
- propagation is incremental
What fails first:
Usually:
- health-flap storms
- config propagation bursts
- giant backend pools for one service
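The "versioned incremental updates and pull-on-miss fallback" response from the hotspot table could look roughly like this. The `Delta` shape and the `fetch_full` callback are assumptions for the sketch: a delta applies only on top of its base version, and any gap triggers a full snapshot pull.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Set, Tuple


@dataclass(frozen=True)
class Delta:
    base_version: int                 # version this delta must be applied on top of
    new_version: int
    added: FrozenSet[str] = frozenset()
    removed: FrozenSet[str] = frozenset()


class Node:
    def __init__(self, fetch_full: Callable[[], Tuple[int, Set[str]]]) -> None:
        self.version = 0
        self.backends: Set[str] = set()
        self._fetch_full = fetch_full  # hypothetical pull of a full snapshot
        self.full_pulls = 0

    def on_delta(self, delta: Delta) -> None:
        if delta.new_version <= self.version:
            return                     # already at or past this version
        if delta.base_version == self.version:
            self.backends = (self.backends | delta.added) - delta.removed
            self.version = delta.new_version
            return
        # Version gap: a delta was missed, so fall back to a full pull.
        version, backends = self._fetch_full()
        self.full_pulls += 1
        if version > self.version:
            self.version, self.backends = version, set(backends)


node = Node(fetch_full=lambda: (5, {"b2", "b3"}))
node.on_delta(Delta(0, 1, added=frozenset({"b1"})))
after_first = (node.version, set(node.backends))
node.on_delta(Delta(4, 5, added=frozenset({"b3"})))   # versions 2 to 4 were missed
```

Deltas keep steady-state fan-out small, while the full pull bounds how long a node that missed updates can stay behind.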
Canonical design conclusion:
The mechanical outcome is:
- primary state:
  - BackendMembership
  - HealthState
  - RoutingState
  - LoadBalancingPolicy
  - ConfigSnapshot
- critical invariants:
  - only eligible backends receive traffic
  - routing state equals function of membership, health, and policy
  - data plane uses bounded-stale versioned snapshots
- mechanisms:
  - guarded membership changes
  - overwrite current health/policy/routing state
  - single-writer control-plane recompute
- reads:
  - source reads in control plane
  - materialized config snapshots in data plane
Polished interview answer:
“I’d design the load balancer as a control-plane/data-plane system. The control plane owns backend membership, current health, balancing policy, and the effective routing-eligible set. Data-plane nodes don’t synchronously query control plane on every request; instead they route using versioned config snapshots propagated from control plane. Membership changes are guarded, health and policy are versioned current-state updates, and routing state is recomputed from those authoritative inputs. The main scaling levers are keeping routing local to the data plane, damping health flaps, and using incremental snapshot propagation.”
Concrete Substrate #
I’ll choose a versioned control-plane plus stateless data-plane balancer fleet as the concrete baseline, because that matches the mechanics we derived:
- authoritative control-plane state
- versioned propagated config snapshots
- local routing decisions in data plane
Concrete substrate:
- control plane
  - authoritative store for backend membership, health, and policy
  - recomputes effective routing state
  - publishes versioned config snapshots
- data plane
  - many balancer nodes terminate client connections
  - each holds current snapshot for one or more listeners/services
  - selects healthy backend locally
Concrete tech family:
- control plane in Go or Java
- authoritative metadata in etcd or a small strongly consistent store
- snapshot propagation via watch streams, pub/sub, or pull-on-version-miss
- data-plane balancer in Envoy-like or custom L4/L7 proxy fleet
Operation Layer #
1. Route request #
API
- client request to virtual IP / listener
Initiator
- client
Entry point
- data-plane load balancer node
Authoritative decider
- local ConfigSnapshot on that serving node
Precondition
- snapshot version current enough for bounded-stale routing contract
- selected backend eligible in snapshot
Transition
- none on control-plane truth
- local backend selection for this request
Response
- forwarded request / response relay
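The two preconditions above can be sketched as a guard in front of backend selection. Everything named here is hypothetical: `refreshed_at` is the time the node last confirmed its snapshot, `max_staleness_s` stands in for the bounded-stale contract, and `request_key` is some integer derived from the request (for example a connection hash). Refusing to route on an over-age snapshot is one possible choice for this sketch; a deployment might instead keep serving the last good snapshot while it refreshes.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class LocalSnapshot:
    config_version: int
    refreshed_at: float            # seconds; when the node last confirmed this snapshot
    backends: Tuple[str, ...]      # eligible backends according to the snapshot


def route_request(snapshot: LocalSnapshot, now: float,
                  max_staleness_s: float, request_key: int) -> Optional[str]:
    """Return a backend, or None if the preconditions do not hold."""
    if now - snapshot.refreshed_at > max_staleness_s:
        return None                # snapshot is outside the staleness bound
    if not snapshot.backends:
        return None                # no eligible backend in the snapshot
    # Selection reads only the local snapshot; control-plane truth is untouched.
    return snapshot.backends[request_key % len(snapshot.backends)]


snap = LocalSnapshot(config_version=7, refreshed_at=100.0, backends=("b1", "b2"))
fresh_pick = route_request(snap, now=103.0, max_staleness_s=5.0, request_key=3)
stale_pick = route_request(snap, now=110.0, max_staleness_s=5.0, request_key=3)
```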
2. Register / unregister backend #
API
- RegisterBackend(service_id, backend_id, metadata)
- UnregisterBackend(service_id, backend_id, expected_version?)
Initiator
- backend/service agent
Entry point
- control-plane API
Authoritative decider
- control-plane membership store
Precondition
- valid service/backend identity
Transition
- update BackendMembership
- bump membership version
Response
- {membership_version}
3. Record health #
API
- internal ReportHealth(backend_id, health_state, observation_revision)
Initiator
- system / health checker
Entry point
- control plane
Authoritative decider
- health-state owner in control plane
Precondition
- observation revision/timestamp monotonic enough under policy
Transition
- overwrite HealthState
Response
- internal success
4. Update routing state #
API
- internal recompute flow
Initiator
- system/control plane
Entry point
- routing-state owner
Authoritative decider
- control-plane recompute worker
Precondition
- latest membership/health/policy versions available
Transition
- overwrite RoutingState
Response
- internal success with new routing version
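A sketch of the single-writer recompute flow, under simplifying assumptions: eligibility is reduced to "member and healthy", and the policy is tracked only through its version. The worker records which input versions it last consumed and bumps the routing version only when the eligible set actually changes.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class RoutingState:
    routing_version: int
    eligible: Tuple[str, ...]
    input_versions: Tuple[int, int, int]   # (membership, health, policy)


class RecomputeWorker:
    """Single writer of RoutingState for one service scope."""

    def __init__(self) -> None:
        self.state = RoutingState(0, (), (0, 0, 0))

    def recompute(self, members: FrozenSet[str], healthy: FrozenSet[str],
                  input_versions: Tuple[int, int, int]) -> RoutingState:
        if input_versions == self.state.input_versions:
            return self.state               # same inputs: rerun is a no-op
        eligible = tuple(sorted(members & healthy))
        if eligible == self.state.eligible:
            # Inputs moved but the result did not: keep the routing version.
            version = self.state.routing_version
        else:
            version = self.state.routing_version + 1
        self.state = RoutingState(version, eligible, input_versions)
        return self.state


worker = RecomputeWorker()
s1 = worker.recompute(frozenset({"b1", "b2"}), frozenset({"b1"}), (1, 1, 1))
s2 = worker.recompute(frozenset({"b1", "b2", "b3"}), frozenset({"b1"}), (2, 1, 1))
```

Here `s2` keeps routing version 1 because adding an unhealthy backend does not change the eligible set, so no new snapshot needs to be published.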
5. Propagate config snapshot #
API
- internal snapshot push/pull stream
Initiator
- control plane
Entry point
- data-plane node
Authoritative decider
- control-plane snapshot publisher
Precondition
- newer config version available
Transition
- overwrite local ConfigSnapshot(node)
Response
- ack current config version
6. Update balancing policy #
API
- PutPolicy(service_id, config, expected_version?)
Initiator
- admin
Entry point
- control-plane API
Authoritative decider
- policy store owner
Precondition
- expected version matches if supplied
Transition
- overwrite LoadBalancingPolicy
- bump version
Response
- {policy_version}
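The optional expected_version behaves like a conditional write: when supplied it must match the stored version, and when omitted the update is an unconditional overwrite. A minimal in-memory sketch, with a hypothetical `VersionConflict` error and `PolicyStore` class:

```python
from typing import Dict, Optional, Tuple


class VersionConflict(Exception):
    """Raised when the caller's expected_version is stale."""


class PolicyStore:
    def __init__(self) -> None:
        self._policies: Dict[str, Tuple[dict, int]] = {}

    def put_policy(self, service_id: str, config: dict,
                   expected_version: Optional[int] = None) -> dict:
        _, current_version = self._policies.get(service_id, ({}, 0))
        # Precondition: expected version matches, but only if one was supplied.
        if expected_version is not None and expected_version != current_version:
            raise VersionConflict(
                f"expected {expected_version}, current is {current_version}"
            )
        new_version = current_version + 1
        self._policies[service_id] = (dict(config), new_version)
        return {"policy_version": new_version}


policies = PolicyStore()
r1 = policies.put_policy("svc", {"algorithm": "round_robin"})
r2 = policies.put_policy("svc", {"algorithm": "least_conn"}, expected_version=1)
```

A caller that read version 1 and writes after someone else has already moved the policy to version 2 gets a `VersionConflict` instead of silently clobbering the newer policy.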
Entry Point vs Decider vs Responder #
| Path | Entry point | Authoritative decider | Physical responder | Logical responder |
|---|---|---|---|---|
| route request | data-plane node | local current config snapshot | data-plane node | load balancer service |
| register/unregister backend | control-plane API | control-plane membership store | control-plane node | load balancer service |
| record health | control-plane API | control-plane health owner | control-plane node | load balancer service |
| recompute routing state | control plane | control-plane recompute worker | internal | load balancer service |
| propagate snapshot | data-plane node | control-plane snapshot publisher | control-plane/data-plane nodes | load balancer service |
| update policy | control-plane API | policy store owner | control-plane node | load balancer service |
Concrete HLD #
Main components:
- control-plane API
  - backend registration
  - policy updates
  - health ingestion
- control-plane state store
  - membership, health, policy, routing versions
- routing recompute worker
  - derives effective routing state
- snapshot distribution layer
  - pushes/pulls versioned config to serving nodes
- data-plane balancer fleet
  - handles client traffic using local snapshots
Concrete Technology Realizations #
Stronger infra-native answer #
- control plane in Go or Java
- authoritative metadata in etcd
- watch-stream or pub/sub snapshot propagation
- data-plane proxy fleet using Envoy-like process or custom proxy nodes
Short interview version #
“I’d build the load balancer as a versioned control-plane/data-plane system. Control plane owns backend membership, health, policy, and effective routing state, then publishes versioned snapshots to balancer nodes. The data plane uses only local snapshots on the hot path to choose a healthy backend, while control plane handles registration, health updates, and incremental config propagation.”