Infra HLD Diagram Discipline Cheat Sheet #

Use this note as the infra-side counterpart to the product HLD (high-level design) diagram discipline note.

The main rule:

Draw the diagram around the contract, invariants, and ownership of critical state. Then attach the execution paths, workers, and projections.

Do not start with:

Kafka
Redis
Raft
Elasticsearch
generic cloud boxes

Start with:

the contract
the critical state
the hard path

1. Start from the contract #

Before drawing anything, write the system promise.

Examples:

Lock service #

one valid holder at a time
lease-based ownership
stale holders must be fenced

Queue #

durable enqueue
at-least-once delivery
visibility timeout

Scheduler #

jobs become runnable at or after run_at
one worker should own a run at a time
retry is supported

Why:

infra diagrams are driven by semantics more than user flow
the contract tells you which boxes must exist

2. Write the invariants next to the diagram #

For infra systems, invariants should be visible.

Examples:

only one holder may act
no stale config may override newer config
offset must not advance before durable processing
frontier must not skip uncovered work

Why:

these invariants explain why you need certain stores and checks
they help the interviewer follow the design
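
To make the offset invariant concrete, here is a minimal in-memory sketch of a consumer that advances its offset only after the effect is durable. The `Consumer` class, its fields, and the `fail` flag are hypothetical names made up for this illustration, and a plain Python list stands in for a durable sink.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Consumer:
    """Hypothetical consumer: commit the offset only after the write is durable."""
    log: List[str]
    committed_offset: int = 0  # index of the next record to process
    durable_sink: List[str] = field(default_factory=list)  # stand-in for durable storage

    def poll_once(self, fail: bool = False) -> bool:
        if self.committed_offset >= len(self.log):
            return False  # nothing left to process
        record = self.log[self.committed_offset]
        if fail:
            # Simulated crash before the write: the offset does not move,
            # so the same record is processed again on the next poll.
            return False
        self.durable_sink.append(record)  # 1. make the effect durable
        self.committed_offset += 1        # 2. only then advance the offset
        return True
```

The ordering of the two numbered steps is the invariant. Swapping them would let a crash between the steps lose a record.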

3. Draw the canonical state owner first #

The first real box should usually be the thing that owns critical truth.

Examples:

Metadata Store
Lease Store
Queue Log
Schedule Store
Config Store
Frontier Store

Rule:

the source of truth should be obvious
derived views should come later

4. Draw the control path before the data path #

In infra interviews, correctness often lives in the control path.

Examples:

Lock service #

client
lock API
metadata store
watch / renew path

Config service #

admin
config API
config store
snapshot publisher
local evaluators

Scheduler #

scheduler API
schedule store
due scanner
runnable queue
workers

Why:

infra systems often have one small critical control path and one large execution path
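
As an illustration of how small the control path can be, the scheduler's due scanner can be sketched in a few lines. This assumes the schedule store is modeled as an in-memory min-heap of `(run_at, job_id)` tuples; `scan_due` and that layout are hypothetical choices for this sketch, not something the note prescribes.

```python
import heapq
from typing import List, Tuple


def scan_due(schedule: List[Tuple[float, str]], now: float) -> List[str]:
    """Pop every job whose run_at is at or before `now` from a min-heap.

    `schedule` must already be a valid heap of (run_at, job_id) tuples.
    Jobs with run_at > now stay in the heap for a later scan.
    """
    runnable = []
    while schedule and schedule[0][0] <= now:
        _, job_id = heapq.heappop(schedule)
        runnable.append(job_id)  # hand off to the runnable queue
    return runnable
```

Everything downstream of the returned list (runnable queue, workers) belongs to the execution path and can scale independently of this scan.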

5. Put the execution plane in a separate lane #

Separate:

control plane
execution plane

Examples:

Queue #

control-ish state:
- message durability
- visibility timeout
- consumer progress
execution plane:
- consumers
- handlers

Scheduler #

control:
- schedule metadata
- due scan
- claim state
execution:
- workers
- downstream job handlers

This keeps the diagram readable.

6. Draw ownership, leases, and epochs explicitly #

For claim/lease systems, always show:

who owns what
where lease state lives
where fencing token / epoch comes from

Examples:

LeaseState
ShardOwnership
OwnerEpoch

If stale-actor risk matters, the fencing token must appear in the diagram.
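
The sketch below shows one way the token could flow from the lease store to the protected resource. It is a simplified in-memory model: `LeaseStore`, `ProtectedResource`, and their methods are hypothetical names, time is passed in explicitly, and a real store would need durable, linearizable updates.

```python
class LeaseStore:
    """Hypothetical lease store: each successful acquire gets a larger epoch."""

    def __init__(self):
        self.epoch = 0
        self.holder = None
        self.expires_at = 0.0

    def acquire(self, client, now, ttl):
        if self.holder is not None and now < self.expires_at:
            return None  # an unexpired lease exists (no renewal in this sketch)
        self.epoch += 1
        self.holder = client
        self.expires_at = now + ttl
        return self.epoch  # the fencing token / owner epoch


class ProtectedResource:
    """Rejects writes that carry an epoch older than the highest one seen."""

    def __init__(self):
        self.highest_epoch = 0
        self.value = None

    def write(self, epoch, value):
        if epoch < self.highest_epoch:
            return False  # stale holder is fenced
        self.highest_epoch = epoch
        self.value = value
        return True
```

The check lives in the downstream resource, not in the lock service, because a paused holder cannot know that its lease has expired.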

7. Show async boundaries clearly #

Use different arrows or labels for:

sync metadata write
async scan
async publish
worker claim
replay / watch

Examples:

config store -> snapshot publisher
schedule store -> due scanner
queue log -> consumer
metadata store -> watch clients

Do not blur sync and async arrows.
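
If you keep diagrams as text, one way to keep the two arrow kinds distinct is to record the mode on every edge and render it differently. This is only a sketch: the `Edge` type, the `render` helper, and the `-->` / `~~>` arrow styles are made-up conventions for this example.

```python
from typing import NamedTuple


class Edge(NamedTuple):
    src: str
    dst: str
    mode: str   # "sync" or "async"
    label: str


def render(edge: Edge) -> str:
    """Render sync edges with a solid arrow and async edges with a wavy one."""
    if edge.mode not in ("sync", "async"):
        raise ValueError(f"unknown mode: {edge.mode}")
    arrow = "-->" if edge.mode == "sync" else "~~>"
    return f"{edge.src} {arrow} {edge.dst}  [{edge.mode}: {edge.label}]"


edges = [
    Edge("config API", "config store", "sync", "metadata write"),
    Edge("config store", "snapshot publisher", "async", "publish"),
]
lines = [render(e) for e in edges]
```

Forcing every edge to declare a mode makes an unlabeled arrow impossible, which is the point of the rule.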

8. Draw replay and recovery paths if they matter #

Infra designs often need repair paths in the HLD.

Examples:

reconciliation worker
reindexer
lease expiry scanner
replay from log
snapshot rebuild

If the main correctness story depends on repair, show the repair box.
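
A reconciliation worker can be sketched as a pass that forces a derived view back into agreement with the canonical store. The example below assumes both are plain dictionaries; `reconcile` and the shape of its report are hypothetical, and a real worker would page through the store rather than load it whole.

```python
from typing import Dict, List


def reconcile(canonical: Dict[str, object], derived: Dict[str, object]) -> Dict[str, List[str]]:
    """Repair `derived` in place so that it matches `canonical`.

    Returns a report of which keys were upserted or deleted.
    """
    report: Dict[str, List[str]] = {"upserted": [], "deleted": []}
    for key, value in canonical.items():
        if key not in derived or derived[key] != value:
            derived[key] = value  # missing or drifted entry
            report["upserted"].append(key)
    for key in list(derived):
        if key not in canonical:
            del derived[key]  # orphan left behind by a missed delete
            report["deleted"].append(key)
    return report
```

The canonical store is only read here, never written. That one-way direction is what makes the repair safe to run repeatedly.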

9. Separate canonical truth from derived state #

Examples:

Config Store vs Local Snapshot
Queue Log vs Consumer Cache
Source Metrics vs Aggregated Dashboard
Membership Truth vs Presence View

Reason:

infra questions often hinge on showing the interviewer that you know which state is authoritative
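
For the Config Store vs Local Snapshot pair, the distinction can be sketched as a snapshot that only accepts strictly newer versions and can always be rebuilt from the canonical log. `LocalSnapshot`, `rebuild`, and the `(key, version, value)` record layout are assumptions made for this illustration.

```python
from typing import Dict, Iterable, Tuple


class LocalSnapshot:
    """Derived view: can be discarded and rebuilt, never treated as truth."""

    def __init__(self):
        self.entries: Dict[str, Tuple[int, str]] = {}  # key -> (version, value)

    def apply(self, key: str, version: int, value: str) -> bool:
        current = self.entries.get(key)
        if current is not None and version <= current[0]:
            return False  # stale or duplicate update must not override newer config
        self.entries[key] = (version, value)
        return True


def rebuild(log: Iterable[Tuple[str, int, str]]) -> LocalSnapshot:
    """Rebuild a snapshot from scratch by replaying canonical records."""
    snapshot = LocalSnapshot()
    for key, version, value in log:
        snapshot.apply(key, version, value)
    return snapshot
```

Because `apply` ignores older versions, replaying records out of order or more than once yields the same snapshot.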

10. Draw hot partitions or hot keys if they are central #

Examples:

due-time bucket in scheduler
hot tenant in rate limiter
hot lock id
hot queue partition
the infra equivalent of celebrity fanout: a hot config rollout or a hot topic partition

If scale is a major deep dive, the hotspot should appear in the diagram.
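
A quick way to argue that a hotspot exists is to show the skew in a sample of request keys. The helper below is a rough sketch; `hot_keys` and the default threshold of 0.5 are arbitrary choices for illustration, not a recommended cutoff.

```python
from collections import Counter
from typing import Iterable, List


def hot_keys(keys: Iterable[str], threshold: float = 0.5) -> List[str]:
    """Return keys that account for at least `threshold` of all observed requests."""
    counts = Counter(keys)
    total = sum(counts.values())
    # With no observations the generator yields nothing, so there is no division by zero.
    return sorted(k for k, c in counts.items() if c / total >= threshold)


sample = ["tenant-1"] * 8 + ["tenant-2", "tenant-3"]
hot = hot_keys(sample)
```

In this sample `tenant-1` carries 8 of 10 requests, so it is the key to mark on the diagram.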

11. Canonical drawing sequence #

Use this order every time:

write contract and invariants
draw client / caller
draw API / control service
draw canonical metadata or source-truth store
draw async scanner / publisher / worker
draw execution plane or downstream handlers
draw derived views or local snapshots
draw repair / reconciliation path
mark the hard spot for deep dive

This prevents infra diagrams from becoming tool soup.

12. Default skeletons by infra system type #

Coordination / lock service #

Client
Lock API
Metadata Store
Watch / Renew Path
Downstream Protected Resource

Optional:

Fencing Token Validator

Queue #

Producer
Broker API
Message Log
Visibility Timeout Manager
Consumer
DLQ
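
The interaction between the visibility timeout manager, at-least-once delivery, and the DLQ can be sketched as below. This is an in-memory toy, so it does not model durable enqueue; `VisibilityQueue`, `max_receives`, and the method names are hypothetical, and time is passed in explicitly to keep the behavior deterministic.

```python
class VisibilityQueue:
    """Toy queue: a received message is hidden until its visibility timeout passes."""

    def __init__(self, timeout, max_receives):
        self.timeout = timeout
        self.max_receives = max_receives
        self.messages = {}  # msg_id -> [body, visible_at, receive_count]
        self.dlq = []
        self.next_id = 0

    def enqueue(self, body):
        self.next_id += 1
        self.messages[self.next_id] = [body, 0.0, 0]
        return self.next_id

    def receive(self, now):
        for msg_id, entry in list(self.messages.items()):
            body, visible_at, count = entry
            if visible_at > now:
                continue  # still invisible: another consumer holds it
            if count >= self.max_receives:
                self.dlq.append(body)  # too many unacked deliveries
                del self.messages[msg_id]
                continue
            entry[1] = now + self.timeout
            entry[2] = count + 1
            return msg_id, body
        return None

    def ack(self, msg_id):
        # Only an explicit ack removes the message; otherwise it is redelivered.
        return self.messages.pop(msg_id, None) is not None
```

A consumer that crashes after `receive` but before `ack` causes a redelivery, which is why handlers behind an at-least-once queue need to be idempotent.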

Scheduler #

Scheduler API
Schedule Store
Due Scanner
Runnable Queue
Worker
Run State Store

Config / feature flag / policy #

Admin / Control Client
Config API
Config Store
Snapshot Publisher
Agents / SDKs
Local Snapshot Cache

Rate limiter #

Client
Decision Service
Budget / Token Store
Local Token Cache (optional)
Reconciliation / Refill Path
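
The skeleton does not prescribe an algorithm. As one common option, the budget store and refill path could be sketched as a token bucket. `TokenBucket` and its parameters are hypothetical names for this example, and the state here is local to one process, not a shared store.

```python
class TokenBucket:
    """Token bucket with lazy refill: tokens are topped up on each decision."""

    def __init__(self, capacity, refill_per_sec, now=0.0):
        self.capacity = float(capacity)
        self.refill_per_sec = float(refill_per_sec)
        self.tokens = float(capacity)  # start full
        self.last = now

    def allow(self, now, cost=1.0):
        elapsed = max(0.0, now - self.last)  # ignore clock going backwards
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last = max(self.last, now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


bucket = TokenBucket(capacity=2, refill_per_sec=1.0)
decisions = [bucket.allow(0.0), bucket.allow(0.0), bucket.allow(0.0), bucket.allow(1.0)]
```

Here the first two requests drain the bucket, the third is rejected, and the fourth succeeds after one second of refill.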

Metrics / tracing #

Agent / Emitter
Ingest Service
Durable Log / Time-Series Store
Aggregation Pipeline
Query Service
Dashboard / Alert Engine

13. Questions to ask for every box #

For each box, answer:

what contract does this box enforce?
what state does it own?
is it source truth or derived state?
is it sync or async?
what invariant would break if this box disappeared?

If you cannot answer these, simplify.
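
The checklist can also be run mechanically during practice. The sketch below records the five answers per box and reports which questions are still open; the `Box` dataclass, its field names, and `unanswered` are invented for this illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

# (attribute on Box, question it answers)
QUESTIONS = [
    ("contract", "what contract does this box enforce?"),
    ("owned_state", "what state does it own?"),
    ("is_source_of_truth", "is it source truth or derived state?"),
    ("is_sync", "is it sync or async?"),
    ("invariant_at_risk", "what invariant would break if this box disappeared?"),
]


@dataclass
class Box:
    name: str
    contract: Optional[str] = None
    owned_state: Optional[str] = None
    is_source_of_truth: Optional[bool] = None
    is_sync: Optional[bool] = None
    invariant_at_risk: Optional[str] = None


def unanswered(box: Box) -> List[str]:
    """Return the checklist questions that have no answer yet for this box."""
    return [question for attr, question in QUESTIONS if getattr(box, attr) is None]
```

A box with a long `unanswered` list is a candidate for removal or merging.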

14. What to say while drawing #

Use lines like:

This box owns the canonical lease state.
This path is synchronous because correctness depends on it.
This worker is asynchronous because bounded lag is acceptable here.
These local snapshots are derived; the control store remains canonical.
I’m showing the reconciliation loop because repair is part of the design, not an afterthought.

15. Common mistakes #

drawing Raft/Kafka/Redis before defining the contract
not showing the canonical state owner
hiding lease / epoch / offset state
mixing control truth with execution state
skipping reconciliation or replay paths
drawing every operational detail instead of the correctness-critical pieces
not distinguishing sync control path from async execution path

16. Interview one-liner #

For infra HLDs I start from the contract and invariants, then draw the canonical state owner, then the control path, then the execution path, and finally the repair or replay path. That keeps the diagram focused on correctness instead of tools.