API Gateway Analysis Note
This note captures the full step-by-step analysis for an API gateway: route config, auth/policy config, backend health, effective gateway state, and versioned snapshot propagation to serving nodes.
Step 1 — Normalize #
Assume the baseline prompt is:
- design an API gateway
- clients send API requests to one endpoint
- gateway authenticates, authorizes, rate-limits, and routes to backend services
- policies can change over time
- system scales across nodes
| Requirement | Actor | Operation | State touched | Priority |
|---|---|---|---|---|
| Client request is authenticated and routed | Client | read source | S1 read source target `GatewayRoutingState` | C1 |
| Admin updates route config | Admin | overwrite state | S1 update target `RouteConfig` | C1 |
| Admin updates auth/policy config | Admin | overwrite state | S1 update target `PolicyConfig` | C1 |
| System records backend health | System | overwrite state | S1 update target `HealthState` | C1 |
| System updates effective gateway config snapshot | System | state transition | S1 update target `GatewayRoutingState` | C1 |
| System propagates config snapshot to serving nodes | System | async process | S1 hidden write target `ConfigSnapshot` | C1 |
| Client reads gateway status/metrics | Client | read projection | S1 read projection target `GatewayStatusView` | R2 |
Notes on normalization:
Important choices:
- request routing is `read source`: the hot path mostly reads current routing state and makes a routing decision
- route config and policy config are `overwrite state`
- backend health is `overwrite state`: current health is the main truth
- effective gateway state update is a control-plane recompute transition
- config propagation is `async process`: control-plane to data-plane dissemination

This is clearly a `Control Plane + Data Plane` system.
Step 2 — Critical Path Selection #
| Requirement | Priority class | Why |
|---|---|---|
| Request auth + routing | C1 | wrong policy/routing breaks correctness and security |
| Update route config | C1 | changes future request routing |
| Update auth/policy config | C1 | changes future enforcement |
| Record backend health | C1 | bad health state can route traffic to failing backend |
| Update effective gateway state | C1 | control-plane to data-plane correctness bridge |
| Propagate config snapshot | C1 | stale serving nodes can enforce wrong policy |
| Status/metrics reads | R2 | operational only |
Critical paths:
- P1 request handling
- P2 update route config
- P3 update policy config
- P4 record health
- P5 update effective gateway state
- P6 propagate config snapshot
Step 3 — Primary State Extraction #
| Candidate object label | Candidate source | Candidate needed for C1/R1? | Candidate decomposition action | Class | Primary? | Owner | Evolution | Scope kind | Scope value |
|---|---|---|---|---|---|---|---|---|---|
| RouteConfig | direct noun | Yes | keep as candidate | entity | Yes | service | overwrite | instance | route_id or listener_id |
| PolicyConfig | direct noun | Yes | keep as candidate | entity | Yes | service | overwrite | instance | policy_scope |
| HealthState | hidden write target | Yes | keep as candidate | entity | Yes | service | overwrite | instance | backend_id |
| GatewayRoutingState | hidden write target | Yes | keep as candidate | process | Yes | service | overwrite | instance | listener_id or service_id |
| ConfigSnapshot | hidden write target | Yes | keep as candidate | projection | Yes | service | overwrite | instance | gateway_node_id |
| GatewayStatusView | derived read model | No | reject as UI artifact | projection | No | derived | overwrite | collection | gateway cluster |
Minimal primary set:
- `RouteConfig`
- `PolicyConfig`
- `HealthState`
- `GatewayRoutingState`
- `ConfigSnapshot`
Step 4 — Hard Invariants #
| Path | Tier | Type | Invariant statement |
|---|---|---|---|
| P1 request handling | HARD | eligibility | route_request is valid only if the selected route and backend are eligible under current policy, auth result, and health state for the request scope. |
| P2 update route config | HARD | ordering | Route config revisions are ordered by monotonic config version within route scope. |
| P3 update policy config | HARD | ordering | Policy config revisions are ordered by monotonic config version within policy scope. |
| P4 record health | HARD | ordering | Health updates are ordered by monotonic observation revision/timestamp within backend scope. |
| P5 update effective gateway state | HARD | accounting | Effective gateway state is a function of route config, policy config, and health state. |
| P6 propagate config snapshot | HARD | freshness | Serving-node config snapshot reflects authoritative gateway state within the configured propagation bound. |
What matters most:
1. Route only under valid policy and health #
The core safety property is:
- gateway must not send requests using stale/invalid route, auth, or backend-eligibility state beyond the allowed bound
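The eligibility invariant for P1 can be sketched as a single decision function over a local snapshot. This is a minimal illustration, not the note's prescribed implementation: the `Snapshot` fields, the longest-prefix matching rule, and the reason strings are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical serving-node view of effective gateway state."""
    version: int
    routes: dict            # path prefix -> backend_id
    required_scopes: dict   # path prefix -> scope a caller must hold
    healthy_backends: frozenset


def route_request(snapshot, path, caller_scopes):
    """Return (backend_id, reason); backend_id is None unless the route,
    the policy, and the backend health all allow the request."""
    # Assumed matching rule: the longest matching prefix wins.
    matches = [p for p in snapshot.routes if path.startswith(p)]
    if not matches:
        return None, "no_route"
    prefix = max(matches, key=len)

    required = snapshot.required_scopes.get(prefix)
    if required is not None and required not in caller_scopes:
        return None, "forbidden"

    backend = snapshot.routes[prefix]
    if backend not in snapshot.healthy_backends:
        return None, "backend_unhealthy"
    return backend, "ok"
```

All three checks read the same snapshot object, so a single request never mixes route data from one version with policy or health data from another.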
2. Effective gateway state is derived from authoritative inputs #
Route config, policy config, and health feed the hot-path serving state.
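One way to read this is that the effective state is a pure function of the three inputs, so it can be recomputed at any time. The sketch below assumes simple dictionary shapes and, for brevity, keys policies by route id; it also assumes fail-closed handling of missing policies and unknown backend health. None of these details come from the note.

```python
def compute_effective_state(routes, policies, health):
    """Derive serving state from the three authoritative inputs.

    Assumed shapes (illustrative only):
      routes:   route_id -> {"version": int, "backends": [backend_id, ...]}
      policies: route_id -> {"version": int, "required_scope": str}
      health:   backend_id -> {"revision": int, "healthy": bool}
    """
    effective = {}
    for route_id in sorted(routes):
        route = routes[route_id]
        policy = policies.get(route_id)
        if policy is None:
            # Assumption: a route without a policy is not served (fail closed).
            continue
        # Backends with no recorded health are treated as unhealthy.
        eligible = [b for b in route["backends"]
                    if health.get(b, {}).get("healthy", False)]
        effective[route_id] = {
            "backends": eligible,
            "required_scope": policy["required_scope"],
            # Input versions make the derived state traceable to its sources.
            "inputs": (route["version"], policy["version"]),
        }
    return effective
```

Because the function has no side effects, rerunning it after a crash or a retry yields the same result for the same inputs.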
3. Propagation lag is bounded #
This is the key control-plane/data-plane correctness interface.
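A serving node cannot tell on its own whether its snapshot is stale, so one possible approach is to track when the control plane last confirmed that the node's version was current. The sketch below assumes such a confirmation signal exists (for example a heartbeat); the class name, the `confirm` call, and the injected clock are hypothetical.

```python
import time


class FreshnessTracker:
    """Tracks whether the node's snapshot was confirmed current recently."""

    def __init__(self, max_lag_seconds, clock=time.monotonic):
        self.max_lag_seconds = max_lag_seconds
        self._clock = clock
        self._confirmed_at = None  # no confirmation received yet

    def confirm(self):
        """Call when the control plane confirms this node holds the latest version."""
        self._confirmed_at = self._clock()

    def within_bound(self):
        """True if the last confirmation is no older than the propagation bound."""
        if self._confirmed_at is None:
            return False
        return self._clock() - self._confirmed_at <= self.max_lag_seconds
```

What a node does once `within_bound()` turns false (keep serving the last good snapshot, alert, or fail closed) is a policy decision this sketch leaves open.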
Step 5 — Execution Context #
For the API gateway baseline:
| Field | Value | Why |
|---|---|---|
| Topology | single service distributed | one logical gateway system with many serving nodes |
| Write coordination scope | per object scope | correctness is per route/policy/backend/listener scope |
| Read consistency target | bounded stale allowed | hot path usually uses recent snapshots, not synchronous control-plane reads |
| Holder model | none | no lease-like per-request ownership is central |
| Compensation acceptable? | No | wrong auth/routing decisions are not compensable |
Derived:
- `bounded_staleness_allowed = true`
- `exclusive_claim_required = false`
- `guarded_by_current_state = true`
This pushes us toward:
- authoritative control plane
- versioned snapshots to data-plane nodes
- hot path reads from local or near-local routing snapshot
Step 6 — Deterministic Mechanism Selection #
| Path | Write shape | Base mechanism | Required companions |
|---|---|---|---|
| P2 update route config | overwrite current value | CAS on version | config version |
| P3 update policy config | overwrite current value | CAS on version | config version |
| P4 record health | overwrite current value | CAS on version or monotonic overwrite | health revision/timestamp |
| P5 update effective gateway state | overwrite current value | single writer control-plane recompute | routing version |
| P6 propagate config snapshot | overwrite current value | single writer snapshot publication | config version |
Hot request path P1 is a read path, not a write-shape path.
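The "CAS on version" mechanism for P2 and P3 can be illustrated with an in-memory store. This is a sketch only: `VersionedStore` and `VersionConflict` are hypothetical names, a lock stands in for the atomic compare-and-set a real strongly consistent store would provide, and treating a missing `expected_version` as an unconditional overwrite is an assumption.

```python
import threading


class VersionConflict(Exception):
    """Raised when the caller's expected version is stale."""


class VersionedStore:
    def __init__(self):
        self._lock = threading.Lock()
        self._items = {}  # key -> (version, value)

    def get(self, key):
        """Return (version, value); version 0 means the key does not exist."""
        with self._lock:
            return self._items.get(key, (0, None))

    def put(self, key, value, expected_version=None):
        """Overwrite the value and bump the version.

        If expected_version is given, the write succeeds only when it matches
        the current version, so a writer holding stale state loses.
        """
        with self._lock:
            current_version, _ = self._items.get(key, (0, None))
            if expected_version is not None and expected_version != current_version:
                raise VersionConflict(
                    f"{key}: expected {expected_version}, found {current_version}")
            new_version = current_version + 1
            self._items[key] = (new_version, value)
            return new_version
```

A retry after a conflict would typically re-read the current version, reapply the intended change, and call `put` again.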
Step 7 — Read Model / Source of Truth #
| Concept | Truth | Read path | Rebuild path |
|---|---|---|---|
| C1 route config | RouteConfig | read source directly | authoritative config store |
| C2 policy config | PolicyConfig | read source directly | authoritative config store |
| C3 health state | HealthState | read source directly | authoritative health store |
| C4 effective gateway state | GatewayRoutingState | read source directly | recompute from route + policy + health |
| C5 serving-node snapshot | ConfigSnapshot | materialized view | rebuild from latest gateway state |
| C6 status/metrics | derived | materialized view | recompute from primary state |
Important point:
For the hot path:
- data-plane node reads local `ConfigSnapshot`
- not control-plane state per request
So the serving read path is:
- local materialized view / config snapshot
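The local read path can be pictured as a per-node holder whose snapshot reference is swapped as a whole. The sketch below is an assumption about how such a cache might look; `LocalSnapshotCache` is a hypothetical name, and the lock-free read relies on attribute assignment being atomic in CPython.

```python
import threading


class LocalSnapshotCache:
    """Per-node holder for the current snapshot (hypothetical)."""

    def __init__(self, initial):
        self._current = initial
        self._write_lock = threading.Lock()

    def current(self):
        # Hot path: a plain attribute read, with no lock and no network call.
        return self._current

    def replace(self, snapshot):
        # Propagation side: swap the whole snapshot object at once so a
        # request never observes a half-updated routing table.
        with self._write_lock:
            self._current = snapshot
```

A request handler would call `current()` once at the start and use that object for the whole request, so a concurrent swap does not change the rules mid-request.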
Step 8 — Failure Handling #
| Path | Retry | Competing writers | Crash after commit | Publish failure | Stale holder |
|---|---|---|---|---|---|
| route/policy update | retry with config version | stale update loses CAS | committed config survives crash if persisted | snapshot propagation may lag | n/a |
| health update | retry with monotonic observation revision | latest valid observation wins | committed health survives crash if persisted | snapshot propagation may lag | n/a |
| effective-state recompute | retry safe from primary inputs | single recompute/version wins | recompute reruns after crash | snapshot propagation may lag | n/a |
| snapshot propagation | retry with versioned snapshot | older snapshot loses to newer version | node keeps last good snapshot until refresh | failed push retried or pulled | n/a |
| request handling | retries are application-level | serving nodes can handle concurrently with same snapshot version | node crash drops in-flight requests only | n/a | stale snapshot bounded by version/TTL refresh |
What matters most:
1. Versioned snapshot propagation #
Serving nodes must reject older snapshots and move monotonically forward by config version.
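A minimal sketch of that monotonic rule, assuming snapshots carry an integer version and may arrive duplicated or out of order. The `SnapshotReceiver` name and its `offer` method are illustrative, not part of the note.

```python
class SnapshotReceiver:
    """Node-side state that only ever moves forward by version."""

    def __init__(self):
        self.version = 0      # 0 means no snapshot applied yet
        self.payload = None

    def offer(self, version, payload):
        """Apply the snapshot only if it is strictly newer; return True if applied."""
        if version <= self.version:
            # Duplicate or out-of-order delivery: keep the current snapshot.
            return False
        self.version = version
        self.payload = payload
        return True
```

Because stale and duplicate deliveries are ignored, the publisher can retry pushes freely and nodes can also pull, without either path moving a node backwards.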
2. Bounded stale routing #
The hot path converges on control-plane changes asynchronously, within the propagation bound; it is not synchronized with every control-plane change.
3. Health flapping #
Health-state overwrite policy must handle noisy observations without causing excessive routing churn.
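One common way to dampen noisy observations is to require several consecutive contradicting probes before flipping the published state. The sketch below shows that idea under assumed parameters; the `DampenedHealth` name and the default threshold of 3 are illustrative, not values from the note.

```python
class DampenedHealth:
    """Flip the published health only after `threshold` consecutive
    probe results that disagree with it."""

    def __init__(self, threshold=3, healthy=True):
        self.threshold = threshold
        self.healthy = healthy
        self._streak = 0  # consecutive probes contradicting `healthy`

    def observe(self, probe_ok):
        """Feed one probe result; return True if the published state flipped."""
        if probe_ok == self.healthy:
            self._streak = 0      # agreement resets the streak
            return False
        self._streak += 1
        if self._streak < self.threshold:
            return False
        self.healthy = probe_ok
        self._streak = 0
        return True
```

Only a flip needs to trigger a recompute of the effective gateway state, so isolated probe failures cause no routing churn.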
Step 9 — Scale Adjustments #
| Hotspot | Type | First response |
|---|---|---|
| very high request volume | read hotspot | add more gateway nodes; keep hot path local to snapshots |
| config churn | fan-out hotspot | incremental snapshot propagation; batch updates |
| health-flap storms | contention hotspot | dampen health transitions and recompute cadence |
| large route/policy tables | read hotspot | shard config by listener/service and compress snapshots |
| metrics/status load | read hotspot | derived views only |
What scales well:
The system scales well if:
- the hot path is local snapshot read + policy/routing evaluation
- control plane is narrow and versioned
- propagation is incremental
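Incremental propagation can be sketched as shipping only the difference between two snapshot versions. The functions below assume a snapshot is a flat dictionary of route id to config; `snapshot_delta` and `apply_delta` are hypothetical helpers introduced for illustration.

```python
def snapshot_delta(old, new):
    """Return (upserts, deletes) that turn `old` into `new`."""
    upserts = {k: v for k, v in new.items() if k not in old or old[k] != v}
    deletes = sorted(k for k in old if k not in new)
    return upserts, deletes


def apply_delta(old, upserts, deletes):
    """Build the next snapshot without mutating the one in use."""
    removed = set(deletes)
    merged = {k: v for k, v in old.items() if k not in removed}
    merged.update(upserts)
    return merged
```

A delta is only valid against the exact base version it was computed from, so in practice it would carry both the base and target versions, and a node on a different base would fall back to a full snapshot.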
What fails first:
Usually:
- health-flap storms
- config propagation bursts
- giant route/policy tables
Canonical design conclusion:
- primary truth:
  - `RouteConfig`, `PolicyConfig`, `HealthState`, `GatewayRoutingState`, `ConfigSnapshot`
- hot path:
  - local snapshot read + auth/policy check + backend selection
- control plane:
  - versioned config + recomputed effective state + snapshot publication
Concrete Substrate #
- control plane in `Go/Java`
- authoritative config/health store in `etcd` or similar strongly consistent store
- snapshot propagation via watch streams/pub-sub
- data-plane fleet using `Envoy`-like or custom proxy processes
- local snapshot cache per node for hot path
Operation Layer #
1. HandleRequest(listener, request) #
- entry point: data-plane node
- authoritative decider: local `ConfigSnapshot`
- transition: none on source truth
- response: proxied backend response or policy rejection
2. PutRouteConfig(route_id, config, expected_version?) #
- entry point: control-plane API
- authoritative decider: config store
- transition: overwrite `RouteConfig`, bump version
3. PutPolicy(policy_scope, config, expected_version?) #
- same shape as route config
4. ReportHealth(backend_id, health, observation_revision) #
- entry point: control plane
- authoritative decider: health-state owner
- transition: overwrite `HealthState`
5. internal recompute #
- recompute `GatewayRoutingState` from config + policy + health
6. snapshot propagation #
- push/pull latest `ConfigSnapshot(version)` to data-plane nodes
Entry Point vs Decider vs Responder #
| Path | Entry point | Authoritative decider | Physical responder | Logical responder |
|---|---|---|---|---|
| request handling | data-plane node | local config snapshot | data-plane node | API gateway |
| route/policy update | control-plane API | config store owner | control-plane node | API gateway |
| health update | control-plane API | health-state owner | control-plane node | API gateway |
| snapshot propagation | serving node / control plane | snapshot publisher | control/data-plane | API gateway |
Concrete HLD #
Main components:
- control-plane API
  - route config
  - policy config
  - health ingestion
- control-plane state store
  - route, policy, health, effective routing versions
- routing recompute worker
  - derives effective gateway state
- snapshot distribution layer
  - pushes/pulls versioned config to serving nodes
- data-plane gateway fleet
  - handles client traffic using local snapshots
Short interview version #
“I’d design the API gateway as a control-plane/data-plane system. Control plane owns route config, auth/policy config, backend health, and effective gateway state. Data-plane nodes don’t query control plane per request; they use versioned local snapshots to authenticate, enforce policy, and route to healthy backends. Config and health are updated in the control plane, effective state is recomputed there, and snapshots are propagated incrementally to the serving fleet.”