Infra HLD Diagram Discipline Cheat Sheet

Use this note as the infra-side counterpart to the product HLD diagram discipline note.

The main rule:

Draw the diagram around the contract, invariants, and ownership of critical state. Then attach the execution paths, workers, and projections.

Do not start with:

  1. Kafka
  2. Redis
  3. Raft
  4. Elasticsearch
  5. generic cloud boxes

Start with:

  1. the contract
  2. the critical state
  3. the hard path

1. Start from the contract #

Before drawing anything, write the system promise.

Examples:

Lock service #

  1. one valid holder at a time
  2. lease-based ownership
  3. stale holders must be fenced

Queue #

  1. durable enqueue
  2. at-least-once delivery
  3. visibility timeout

Scheduler #

  1. jobs become runnable at or after run_at
  2. one worker should own a run at a time
  3. retry is supported

Why:

  1. infra diagrams are driven by semantics more than user flow
  2. the contract tells you which boxes must exist
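
As an illustration of how a contract dictates the boxes, the queue promises above (durable enqueue, at-least-once delivery, visibility timeout) can be sketched as a toy in-memory class. The name `VisibilityQueue` and its methods are hypothetical, time is passed in explicitly to keep the sketch deterministic, and the dictionary stands in for what would be a durable log in a real broker.

```python
class VisibilityQueue:
    """Toy queue: at-least-once delivery enforced by a visibility timeout."""

    def __init__(self, timeout):
        self.timeout = timeout      # seconds a delivered message stays hidden
        self.messages = {}          # msg_id -> body (stand-in for a durable log)
        self.visible_at = {}        # msg_id -> time the message becomes visible
        self.next_id = 0

    def enqueue(self, body):
        msg_id = self.next_id
        self.next_id += 1
        # Assumption: a real broker would persist the message before acking.
        self.messages[msg_id] = body
        self.visible_at[msg_id] = 0.0
        return msg_id

    def receive(self, now):
        """Return (msg_id, body) for the first visible message, else None."""
        for msg_id, body in self.messages.items():
            if self.visible_at[msg_id] <= now:
                # Hide the message; it reappears if the consumer never acks.
                self.visible_at[msg_id] = now + self.timeout
                return msg_id, body
        return None

    def ack(self, msg_id):
        """Consumer finished processing: the message is removed for good."""
        self.messages.pop(msg_id, None)
        self.visible_at.pop(msg_id, None)
```

A message that is received but never acked becomes visible again once the timeout elapses, which is why the contract is at-least-once rather than exactly-once, and why a visibility-timeout component has to appear in the diagram.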

2. Write the invariants next to the diagram #

For infra systems, invariants should be visible.

Examples:

  1. only one holder may act
  2. no stale config may override newer config
  3. offset must not advance before durable processing
  4. frontier must not skip uncovered work

Why:

  1. these invariants explain why you need certain stores and checks
  2. they help the interviewer follow the design
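
One way to see why an invariant forces a check into the design is to write it as code. The sketch below encodes "offset must not advance before durable processing" (and, by the same logic, "frontier must not skip uncovered work"). `OffsetTracker` and its method names are assumptions made for illustration, not the API of any real consumer library.

```python
class OffsetTracker:
    """Commits an offset only when all earlier work is durably processed."""

    def __init__(self):
        self.committed = 0      # next offset to process; all below are done
        self.durable = set()    # offsets whose effects are durably stored

    def mark_durable(self, offset):
        self.durable.add(offset)

    def try_commit(self, new_offset):
        """Advance to new_offset only if [committed, new_offset) is covered."""
        if new_offset < self.committed:
            return False        # never move the frontier backwards
        pending = range(self.committed, new_offset)
        if any(o not in self.durable for o in pending):
            return False        # a gap exists: advancing would skip work
        self.durable -= set(pending)
        self.committed = new_offset
        return True
```

In a diagram, this check usually lives at the arrow between the consumer and the offset or frontier store.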

3. Draw the canonical state owner first #

The first real box should usually be the thing that owns critical truth.

Examples:

  1. Metadata Store
  2. Lease Store
  3. Queue Log
  4. Schedule Store
  5. Config Store
  6. Frontier Store

Rule:

  1. the source of truth should be obvious
  2. derived views should come later

4. Draw the control path before the data path #

In infra interviews, correctness often lives in the control path.

Examples:

Lock service #

  1. client
  2. lock API
  3. metadata store
  4. watch / renew path

Config service #

  1. admin
  2. config API
  3. config store
  4. snapshot publisher
  5. local evaluators

Scheduler #

  1. scheduler API
  2. schedule store
  3. due scanner
  4. runnable queue
  5. workers

Why:

  1. infra systems often have one small critical control path and one large execution path
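
The scheduler control path above (schedule store, due scanner, runnable queue) can be sketched in a few lines. This is a minimal single-process illustration; `ScheduleStore`, `pop_due`, and `scan_once` are hypothetical names, and a real design would persist the schedule and claim jobs atomically.

```python
import heapq
from collections import deque


class ScheduleStore:
    """Canonical owner of (run_at, job_id) pairs, ordered by run_at."""

    def __init__(self):
        self._heap = []

    def add(self, job_id, run_at):
        heapq.heappush(self._heap, (run_at, job_id))

    def pop_due(self, now):
        """Remove and return every job whose run_at is at or before now."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            due.append(heapq.heappop(self._heap)[1])
        return due


def scan_once(store, runnable, now):
    """One pass of the due scanner: move due jobs into the runnable queue."""
    moved = store.pop_due(now)
    runnable.extend(moved)
    return len(moved)


store = ScheduleStore()
store.add("a", run_at=10)
store.add("b", run_at=5)
store.add("c", run_at=20)
runnable = deque()
scan_once(store, runnable, now=10)   # "b" then "a" become runnable; "c" waits
```

The small scanner is where the "at or after run_at" promise is enforced; the workers that drain `runnable` are the larger execution path.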

5. Put the execution plane in a separate lane #

Separate:

  1. control plane
  2. execution plane

Examples:

Queue #

  1. control-ish state:
    • message durability
    • visibility timeout
    • consumer progress
  2. execution plane:
    • consumers
    • handlers

Scheduler #

  1. control:
    • schedule metadata
    • due scan
    • claim state
  2. execution:
    • workers
    • downstream job handlers

This keeps the diagram readable.


6. Draw ownership, leases, and epochs explicitly #

For claim/lease systems, always show:

  1. who owns what
  2. where lease state lives
  3. where fencing token / epoch comes from

Examples:

  1. LeaseState
  2. ShardOwnership
  3. OwnerEpoch

If stale actor risk matters, the token must appear in the diagram.
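
A minimal sketch of why the token belongs in the diagram: the lease store issues a monotonically increasing epoch, and the protected resource rejects writes that carry an older one. `LeaseStore` and `ProtectedResource` are hypothetical classes, and time is passed in explicitly rather than read from a clock.

```python
class LeaseStore:
    """Owns lease state and is the only source of epochs (fencing tokens)."""

    def __init__(self):
        self.epoch = 0
        self.holder = None
        self.expires_at = 0.0

    def acquire(self, client, now, ttl):
        """Grant the lease if it is free or expired; return the new epoch."""
        if self.holder is not None and now < self.expires_at:
            return None
        self.epoch += 1
        self.holder = client
        self.expires_at = now + ttl
        return self.epoch


class ProtectedResource:
    """Downstream resource that fences out stale holders by epoch."""

    def __init__(self):
        self.highest_epoch = 0
        self.value = None

    def write(self, value, epoch):
        if epoch < self.highest_epoch:
            return False            # stale holder: reject the write
        self.highest_epoch = epoch
        self.value = value
        return True


leases = LeaseStore()
resource = ProtectedResource()
e1 = leases.acquire("A", now=0, ttl=10)    # A holds the lease with epoch 1
e2 = leases.acquire("B", now=12, ttl=10)   # A's lease expired; B gets epoch 2
resource.write("from-B", e2)               # accepted
resource.write("from-A", e1)               # rejected: A is a stale holder
```

Note that the check happens at the resource, not at the lock service, which is the reason the token has to be drawn as flowing through to the protected resource.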


7. Show async boundaries clearly #

Use different arrows or labels for:

  1. sync metadata write
  2. async scan
  3. async publish
  4. worker claim
  5. replay / watch

Examples:

  1. config store -> snapshot publisher
  2. schedule store -> due scanner
  3. queue log -> consumer
  4. metadata store -> watch clients

Do not blur sync and async arrows.
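
If the diagram is produced as text, the sync/async distinction can be carried by edge style. The sketch below emits Graphviz DOT, drawing async edges dashed and sync edges solid. The `Edge` record and `to_dot` helper are assumptions made for illustration; only the emitted `label` and `style` edge attributes are standard DOT.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    src: str
    dst: str
    label: str
    is_async: bool


def to_dot(edges):
    """Render edges as a DOT digraph: dashed means async, solid means sync."""
    lines = ["digraph hld {"]
    for e in edges:
        style = "dashed" if e.is_async else "solid"
        lines.append(
            f'  "{e.src}" -> "{e.dst}" [label="{e.label}", style={style}];'
        )
    lines.append("}")
    return "\n".join(lines)


edges = [
    Edge("Config API", "Config Store", "sync metadata write", False),
    Edge("Config Store", "Snapshot Publisher", "async publish", True),
]
dot = to_dot(edges)
```

On a whiteboard the same rule applies: pick one visual convention for async arrows and use it everywhere.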


8. Draw replay and recovery paths if they matter #

Infra designs often need repair paths in the HLD.

Examples:

  1. reconciliation worker
  2. reindexer
  3. lease expiry scanner
  4. replay from log
  5. snapshot rebuild

If the main correctness story depends on repair, show the repair box.
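
As a rough picture of what a reconciliation worker does, the sketch below compares a canonical map against a derived one and repairs the derived side. Plain dictionaries stand in for the real stores, and the `reconcile` function is hypothetical.

```python
def reconcile(canonical, derived):
    """Make `derived` match `canonical`; return the repairs that were applied."""
    repairs = []
    for key, value in canonical.items():
        if key not in derived or derived[key] != value:
            derived[key] = value            # missing or drifted entry
            repairs.append(("upsert", key))
    for key in list(derived):
        if key not in canonical:
            del derived[key]                # orphan with no canonical source
            repairs.append(("delete", key))
    return repairs


canonical = {"a": 1, "b": 2}
derived = {"a": 1, "b": 9, "c": 3}
repairs = reconcile(canonical, derived)     # fixes "b", removes "c"
```

Repairs only ever flow from the canonical store to the derived one, so the repair arrow in the diagram should point the same way.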


9. Separate canonical truth from derived state #

Examples:

  1. Config Store vs Local Snapshot
  2. Queue Log vs Consumer Cache
  3. Source Metrics vs Aggregated Dashboard
  4. Membership Truth vs Presence View

Reason:

  1. infra questions often hinge on whether you understand which state is authoritative
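
The Config Store vs Local Snapshot pair can be sketched as follows: the store owns a version counter, and the local snapshot only accepts strictly newer versions, which also enforces "no stale config may override newer config". Class and method names here are assumptions for illustration.

```python
class ConfigStore:
    """Canonical truth: every write bumps a monotonically increasing version."""

    def __init__(self):
        self.version = 0
        self.data = {}

    def put(self, key, value):
        self.data[key] = value
        self.version += 1
        return self.version

    def snapshot(self):
        return self.version, dict(self.data)    # copy, so snapshots are frozen


class LocalSnapshot:
    """Derived state held by an agent or SDK; never the source of truth."""

    def __init__(self):
        self.version = 0
        self.data = {}

    def apply(self, version, data):
        if version <= self.version:
            return False                        # stale or duplicate: ignore
        self.version, self.data = version, data
        return True


store = ConfigStore()
store.put("flag", "off")
old = store.snapshot()
store.put("flag", "on")
new = store.snapshot()

local = LocalSnapshot()
local.apply(*new)    # accepted
local.apply(*old)    # rejected: an older snapshot arrived late
```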

10. Draw hot partitions or hot keys if they are central #

Examples:

  1. due-time bucket in scheduler
  2. hot tenant in rate limiter
  3. hot lock id
  4. hot queue partition
  5. the infra equivalent of celebrity fanout: a hot config rollout or a hot topic partition

If scale is a major deep dive, the hotspot should appear in the diagram.
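
For the due-time bucket case, one common mitigation is to split each time bucket into several shards so that a popular minute does not land on a single key. The sketch below is one possible keying scheme; the `bucket_key` function, the `due:` key format, and the default shard count are all assumptions.

```python
import zlib


def bucket_key(run_at, job_id, bucket_seconds=60, shards=8):
    """Map a job to a sharded due-time bucket key.

    run_at is an integer Unix timestamp in seconds. CRC32 is used because it
    is stable across processes, unlike Python's built-in hash() for strings.
    """
    bucket_start = run_at - (run_at % bucket_seconds)
    shard = zlib.crc32(job_id.encode("utf-8")) % shards
    return f"due:{bucket_start}:{shard}"


# Jobs due in the same minute share a bucket prefix but spread across shards.
keys = {bucket_key(125, f"job-{i}") for i in range(100)}
```

The trade-off is that the due scanner now has to read every shard of a bucket, so the shard count is worth calling out next to the hotspot in the diagram.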


11. Canonical drawing sequence #

Use this order every time:

  1. write contract and invariants
  2. draw client / caller
  3. draw API / control service
  4. draw canonical metadata or source-truth store
  5. draw async scanner / publisher / worker
  6. draw execution plane or downstream handlers
  7. draw derived views or local snapshots
  8. draw repair / reconciliation path
  9. mark the hard spot for deep dive

This prevents infra diagrams from becoming tool soup.


12. Default skeletons by infra system type #

Coordination / lock service #

  1. Client
  2. Lock API
  3. Metadata Store
  4. Watch / Renew Path
  5. Downstream Protected Resource

Optional:

  1. Fencing Token Validator

Queue #

  1. Producer
  2. Broker API
  3. Message Log
  4. Visibility Timeout Manager
  5. Consumer
  6. DLQ

Scheduler #

  1. Scheduler API
  2. Schedule Store
  3. Due Scanner
  4. Runnable Queue
  5. Worker
  6. Run State Store

Config / feature flag / policy #

  1. Admin / Control Client
  2. Config API
  3. Config Store
  4. Snapshot Publisher
  5. Agents / SDKs
  6. Local Snapshot Cache

Rate limiter #

  1. Client
  2. Decision Service
  3. Budget / Token Store
  4. Local Token Cache (optional)
  5. Reconciliation / Refill Path

Metrics / tracing #

  1. Agent / Emitter
  2. Ingest Service
  3. Durable Log / Time-Series Store
  4. Aggregation Pipeline
  5. Query Service
  6. Dashboard / Alert Engine

13. Questions to ask for every box #

For each box, answer:

  1. what contract does this box enforce?
  2. what state does it own?
  3. is it source truth or derived state?
  4. is it sync or async?
  5. what invariant would break if this box disappeared?

If you cannot answer these, simplify.
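
The five questions can be treated as a checklist per box. The sketch below is a hypothetical helper, not part of any tool: it records the answers in a `Box` record and lists the boxes that still have an unanswered question and are therefore candidates for removal or merging.

```python
from dataclasses import dataclass

QUESTIONS = ("contract", "owned_state", "truth", "mode", "invariant_at_risk")


@dataclass
class Box:
    name: str
    contract: str = ""            # what contract does this box enforce?
    owned_state: str = ""         # what state does it own?
    truth: str = ""               # "source" or "derived"
    mode: str = ""                # "sync" or "async"
    invariant_at_risk: str = ""   # what breaks if this box disappears?


def boxes_to_simplify(boxes):
    """Return the names of boxes with at least one unanswered question."""
    return [b.name for b in boxes if any(not getattr(b, q) for q in QUESTIONS)]


boxes = [
    Box("Lease Store", "one valid holder", "lease + epoch", "source", "sync",
        "only one holder may act"),
    Box("Cache Layer", owned_state="copies of leases", truth="derived"),
]
todo = boxes_to_simplify(boxes)   # only "Cache Layer" has gaps
```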


14. What to say while drawing #

Use lines like:

  1. This box owns the canonical lease state.
  2. This path is synchronous because correctness depends on it.
  3. This worker is asynchronous because bounded lag is acceptable here.
  4. These local snapshots are derived; the control store remains canonical.
  5. I’m showing the reconciliation loop because repair is part of the design, not an afterthought.

15. Common mistakes #

  1. drawing Raft/Kafka/Redis before defining the contract
  2. not showing the canonical state owner
  3. hiding lease / epoch / offset state
  4. mixing control truth with execution state
  5. skipping reconciliation or replay paths
  6. drawing every operational detail instead of the correctness-critical pieces
  7. not distinguishing sync control path from async execution path

16. Interview one-liner #

For infra HLDs I start from the contract and invariants, then draw the canonical state owner, then the control path, then the execution path, and finally the repair or replay path. That keeps the diagram focused on correctness instead of tools.