HLD Diagram Discipline Cheat Sheet
Status: Archive candidate. Keep as historical reference; prefer system-design-core-index.md and the core notes for day-to-day use.
Prefer the infra-specific or core-index-guided notes instead of treating this as primary.
Use this note to draw cleaner interview diagrams.
The main rule:
Draw services around responsibilities for critical paths, then attach the stores and async pipelines they need.
Do not start from random boxes like User Service, DB, Cache, Kafka.
1. Start from paths, not boxes #
Before drawing anything, identify:
- the 2-4 most important user paths
- any hidden system paths that matter for correctness
Examples:
- hold seat
- buy seat
- view seat map
- expire stale hold
Why:
- paths tell you what the system must do
- services exist to own responsibilities on those paths
2. Group paths into service responsibilities #
Turn paths into a small set of responsibilities.
Examples:
- Inventory Service
- Booking Service
- Search Service
- Feed Service
- Notification Worker
- Scheduler / Expiry Worker
Good box test:
- which path hits this box?
- what truth does it own?
- is it sync or async?
If you cannot answer those, the box is probably fake.
3. Draw the sync path first #
Always draw the request-response path before async machinery.
Typical order:
- Client
- API / App Service
- Core Domain Service
- Primary Store
Example:
- Client
- Booking API
- Inventory/Booking Service
- Primary DB
Why:
- interviews care first about correctness on the main path
- async details are easier once the sync path is clear
4. Put source-of-truth stores directly under the owning service #
For each service, attach the canonical store it owns.
Examples:
- Follow Service -> Relation Store
- Booking Service -> Booking DB
- Inventory Service -> Inventory DB
- Document Service -> Document Store
Rule:
- source truth should be obvious in the diagram
- derived stores should be drawn later and separately
5. Add async edges only after the source truth path is clear #
Now add:
- outbox
- queue
- event bus
- worker
- scheduler
- fanout pipeline
Typical flow:
- source write commits
- event or outbox record is produced
- worker consumes event
- worker updates projections / sends notifications / triggers side effects
This is where:
- feed fanout
- search indexing
- counter maintenance
- expiry scanning
- notification delivery
usually appear.
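The typical flow above can be sketched as a transactional outbox. This is a minimal illustration using an in-memory SQLite database. The table names, the `confirm_booking` and `drain_outbox` functions, and the event shape are all assumptions made for the example, not part of any particular system.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bookings (id TEXT PRIMARY KEY, status TEXT NOT NULL);
    CREATE TABLE outbox (
        seq INTEGER PRIMARY KEY AUTOINCREMENT,
        event_type TEXT NOT NULL,
        payload TEXT NOT NULL,
        published INTEGER NOT NULL DEFAULT 0
    );
""")


def confirm_booking(booking_id: str) -> None:
    """Source write and outbox record commit in the same transaction."""
    with conn:  # commits on success, rolls back on exception
        conn.execute(
            "INSERT INTO bookings (id, status) VALUES (?, 'CONFIRMED')",
            (booking_id,),
        )
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("booking_confirmed", json.dumps({"booking_id": booking_id})),
        )


def drain_outbox(handle) -> int:
    """Worker: deliver unpublished events in order, then mark them published.

    A crash between handle() and the UPDATE redelivers the event, so the
    handler (hypothetical here) has to be idempotent.
    """
    rows = conn.execute(
        "SELECT seq, event_type, payload FROM outbox "
        "WHERE published = 0 ORDER BY seq"
    ).fetchall()
    for seq, event_type, payload in rows:
        handle(event_type, json.loads(payload))
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE seq = ?", (seq,))
    return len(rows)
```

On the diagram this corresponds to the edge from the source store to the worker: the event cannot exist without the committed write, and delivery is at-least-once.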
6. Draw derived stores separately from source truth #
Keep these in a separate visual lane:
- cache
- search index
- feed store
- leaderboard store
- analytics store
- dashboard store
Reason:
- source truth and projection should not be visually conflated
- this makes consistency tradeoffs easier to explain
Good pattern:
- source truth in middle / left
- queue in middle
- projections on right
7. Add external systems last #
Only after the internal flow is stable, add:
- payment provider
- email provider
- push notification provider
- CDN
- object storage
- auth provider
Why:
- external systems are side effects or dependencies
- they are rarely the starting point of the design
8. Keep service boxes responsibility-oriented #
Good service names:
- Inventory Service
- Booking Service
- Feed Service
- Search Indexer
- Notification Service
- Scheduler
Bad service names:
- Everything Service
- Database Service
- User Service, if it owns unrelated responsibilities
- one service box per entity with no behavior
Use verbs or responsibility nouns, not schema nouns.
9. Use one diagram lane per role #
A clean interview diagram usually has 4 lanes:
- clients / entrypoints
- synchronous domain services
- source-of-truth stores
- async workers + projections + external systems
This keeps the board readable.
10. Show ownership, not every network hop #
You do not need to draw:
- service mesh
- load balancer internals
- every internal RPC
- every cache invalidation call
Only show what matters for:
- correctness
- performance
- failure handling
- scaling
11. Use arrows to show semantics, not decoration #
Different arrows should imply something:
- sync request/response
- async event or queue
- projection update
- external side effect
If speaking, say it clearly:
- this write is synchronous
- this projection update is asynchronous
- this external notification is retried
12. Annotate API paths directly on the diagram #
Do not leave the main arrows unnamed.
Every important sync path should be annotated with:
- method + endpoint or operation name
- short intent
- optional consistency or idempotency note
Good examples:
- POST /holds [idem key]
- POST /payments/{id}/capture
- GET /feed/home [eventual]
- PUT /docs/{id}/ops [versioned]
If you do not want full HTTP detail, use operation labels:
- create_hold
- confirm_booking
- append_message
- search_nearby
Best practice:
- put the path label above the arrow
- put the consistency / idempotency hint in brackets
- label only the 2-4 paths that actually matter
Do not annotate every edge.
The goal is:
- make the critical request paths visible
- tie APIs to boxes
- make later deep dives easier
13. Annotate stores and boxes with ownership and truth #
Each important box should answer one of these visually:
- source of truth
- derived projection
- cache
- async worker
- external side effect
Useful suffixes:
- Booking DB [truth]
- Feed Store [projection]
- Search Index [projection]
- Redis Cache [cache]
- Notification Worker [async]
- PSP [external]
Useful box notes:
- partitioned by user_id
- strong write path
- eventual read model
- idempotent consumer
This prevents the classic interview confusion:
- what is authoritative?
- what can be stale?
- what can be rebuilt?
14. Annotate the mechanism on the diagram #
For mechanism-bearing archetypes, the coordination primitive is the most important thing to make visible. Without it, an auction diagram and a CRUD diagram look identical.
Add a small annotation next to the relevant store or arrow:
| Mechanism | What to annotate |
|---|---|
| CAS on (state, version) | label the store: DB [CAS on status+version]; label the write arrow: UPDATE WHERE status=? AND version=? |
| Lease | label the claim store: Redis [SETNX + TTL]; add fencing token on the arrow from claim to downstream write |
| Idempotency key | label the service: [idempotency store: (client_id, request_id) -> result] |
| Outbox | draw outbox as a table inside the source DB box, not as a separate box |
| Saga compensation | draw compensation arrows as dashed lines from worker back to each service that must roll back |
| CRDT / OT | label the coordinator: [operation log; state vector per client] |
Rule: if the mechanism is the load-bearing part of the design, it must appear on the diagram. A box labeled Booking Service -> DB with no annotation says nothing about how double booking is prevented.
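To make the `[CAS on status+version]` annotation concrete, here is a minimal sketch of a guarded update against an in-memory SQLite table. The `seats` table, its columns, and the `try_hold` function are hypothetical names chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE seats ("
    "id TEXT PRIMARY KEY, status TEXT NOT NULL, version INTEGER NOT NULL)"
)
conn.execute("INSERT INTO seats VALUES ('A1', 'AVAILABLE', 1)")
conn.commit()


def try_hold(seat_id: str, expected_version: int) -> bool:
    """Compare-and-set: the UPDATE only matches if nobody changed the row.

    The WHERE clause carries the precondition (status + version), so two
    racing callers cannot both succeed. rowcount tells us who won.
    """
    with conn:
        cur = conn.execute(
            "UPDATE seats SET status = 'HELD', version = version + 1 "
            "WHERE id = ? AND status = 'AVAILABLE' AND version = ?",
            (seat_id, expected_version),
        )
    return cur.rowcount == 1
```

The loser of the race sees zero affected rows and must re-read before retrying, which is the behavior worth stating aloud when pointing at the annotated arrow.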
15. Draw the read path explicitly #
The write path and read path are often different shapes. Draw them separately if they diverge.
Cache hit/miss path #
Only draw the miss path if it matters for the design:
Client -> Cache [HIT: return] [MISS: -> DB -> populate cache -> return]
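The hit/miss path above can be sketched as a small read-through cache. This is an in-memory illustration only. `ReadThroughCache` and the `load_from_db` callback are assumed names, and a real cache would also need TTLs and invalidation on source writes.

```python
from typing import Callable, Dict, Optional


class ReadThroughCache:
    """HIT: return the cached value. MISS: load from DB, populate, return."""

    def __init__(self, load_from_db: Callable[[str], Optional[str]]) -> None:
        self._load = load_from_db
        self._entries: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> Optional[str]:
        if key in self._entries:
            self.hits += 1
            return self._entries[key]
        self.misses += 1
        value = self._load(key)  # the miss path reaches the source store
        if value is not None:
            self._entries[key] = value  # populate so the next read is a hit
        return value
```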
Read replica #
Label which queries go to replica vs primary:
Write path -> Primary DB
Read path -> Read Replica [bounded lag]
Fan-out on write vs fan-out on read #
These produce fundamentally different diagram shapes:
Fan-out on write:
Write Service -> DB -> Event Bus -> Fanout Worker -> Feed Store (per user)
Read path: Client -> Feed Store [precomputed]
Fan-out on read:
Write Service -> DB
Read path: Client -> Feed Service -> scatter-gather(Follow Store + Post Store) -> merge -> return
State which model you are using and why. Fan-out on write is cheap to read but expensive to write, and the write cost amplifies for authors with very high follower counts. Fan-out on read is cheap to write but expensive to read, and the read cost grows with the number of accounts the reader follows.
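A toy in-memory sketch of the two shapes, to show where the cost lands in each. All names (`followers`, `following`, `publish`, `fan_out`, and so on) are hypothetical, and posts are modeled as `(timestamp, text)` tuples purely for illustration.

```python
from collections import defaultdict

followers = {"alice": ["bob", "carol"]}             # author -> followers
following = {"bob": ["alice"], "carol": ["alice"]}  # reader -> followed authors

posts_by_author = defaultdict(list)  # source of truth
feed_store = defaultdict(list)       # projection, one feed per user


def publish(author, post):
    """Source write, common to both models."""
    posts_by_author[author].append(post)


def fan_out(author, post):
    """Fan-out on write: work per post grows with the author's followers."""
    for follower in followers.get(author, []):
        feed_store[follower].append(post)


def read_precomputed(user):
    """Fan-out on write read path: a single lookup."""
    return list(feed_store[user])


def read_fan_out_on_read(user):
    """Fan-out on read: work per read grows with accounts the user follows."""
    merged = []
    for author in following.get(user, []):
        merged.extend(posts_by_author[author])
    return sorted(merged, key=lambda p: p[0], reverse=True)  # newest first
```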
Scatter-gather #
When a query must fan out across N shards:
Client -> Query Service -> [Shard 1, Shard 2, ... Shard N] -> merge -> return
Label the merge step and state its latency dependency.
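A minimal sketch of the scatter-gather shape, using threads over in-memory "shards". The shard contents, `query_shard`, and `scatter_gather` are assumptions for illustration. The point is that the merge cannot finish until the slowest shard responds.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

# Each hypothetical shard holds (item, score) rows.
SHARDS = [
    [("a", 9), ("b", 4)],
    [("c", 7)],
    [("d", 8), ("e", 1)],
]


def query_shard(shard, limit):
    """Per-shard top-k, so the merge step handles at most N * limit rows."""
    return sorted(shard, key=lambda row: row[1], reverse=True)[:limit]


def scatter_gather(limit):
    # Scatter: one request per shard, issued concurrently.
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        partials = list(pool.map(lambda shard: query_shard(shard, limit), SHARDS))
    # Gather: pool.map waits for every shard, so latency follows the slowest.
    rows = (row for partial in partials for row in partial)
    return heapq.nlargest(limit, rows, key=lambda row: row[1])
```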
16. Write failure modes next to the path where they occur #
Do not keep all failure talk verbal.
Annotate the diagram with short failure tags near the boxes or edges that own the risk.
Good annotation style:
- F1 duplicate hold
- F2 stale cache read
- F3 retry after timeout
- F4 webhook side effect succeeds but ack lost
Then write a one-line mitigation nearby:
- idem key + unique constraint
- TTL + invalidate on source write
- retry with backoff
- outbox + reconciliation
Good places to mark failures:
- before commit
- after commit before publish
- during external side effect
- during async projection update
- during lease / hold expiry
This makes the diagram defendable under questioning.
A strong pattern is:
- F1 near the edge
- M1 near the mitigation box
Example:
- Booking API -> Booking DB : POST /book
- F1 duplicate submit
- M1 idem key
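The `F1 duplicate submit` / `M1 idem key` pairing can be sketched with a unique constraint on the idempotency key. This is an illustrative in-memory SQLite example. The `bookings` schema and the `book` function are hypothetical, and the `(client_id, request_id)` key shape is borrowed from the idempotency annotation used elsewhere in this note.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bookings ("
    "client_id TEXT NOT NULL, request_id TEXT NOT NULL, seat_id TEXT NOT NULL, "
    "PRIMARY KEY (client_id, request_id))"  # the unique constraint is the guard
)
conn.commit()


def book(client_id: str, request_id: str, seat_id: str) -> str:
    """A retried request with the same key hits the constraint, not the seat."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO bookings VALUES (?, ?, ?)",
                (client_id, request_id, seat_id),
            )
        return "created"
    except sqlite3.IntegrityError:
        return "duplicate"
```

A fuller version would return the stored result of the first attempt on `duplicate`, so the client sees the same response for both submissions.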
17. Annotate scalability bottlenecks where they originate #
Do not say only “this scales horizontally.”
Mark the stress point on the diagram.
Use short scale tags:
- S1 hot inventory row
- S2 fanout amplification
- S3 hot search shard
- S4 projection lag
- S5 cold-start pressure
- S6 origin read miss storm
Then add the first mitigation beside it:
- bucket by event_id
- partition by user_id
- precompute feed
- bounded freshness SLA
- warm pool
- request coalescing
Useful targeted annotations:
| Annotation | When to use |
|---|---|
| [sharded by user_id, 32 partitions] | when partition key choice is the design decision |
| [replicated 3x, async] | when replication factor and durability are in scope |
| [p99 < 10ms] | when latency SLA is driving the read path choice |
| [~100K writes/sec] | when throughput is driving partitioning |
| [hot key: celebrity users] | when skew is the specific problem |
Best rule:
- annotate the bottleneck on the edge or box where load concentrates
- annotate the mitigation on the box that absorbs or spreads the load
Never annotate a box with scale numbers just to seem thorough. Only annotate when the number explains a design choice.
18. Use a small, explicit diagram legend #
If the diagram is non-trivial, spend 10 seconds defining notation.
Suggested legend:
- solid arrow = sync request
- dashed arrow = async event / queue
- [truth] = source of truth
- [projection] = derived read model
- F# = failure mode
- S# = scale hotspot
This prevents the interviewer from guessing what your arrows mean.
19. Canonical drawing sequence #
Use this order every time:
- write down main paths
- draw client
- draw main synchronous service
- draw primary store
- draw second core service if the path truly splits responsibilities
- annotate the mechanism on the write path
- draw async queue/outbox
- draw workers
- draw projection stores
- draw read path if it diverges from write path
- draw external systems
- add scale annotation on the bottleneck component only
- circle the part you will deep dive into
This sequence prevents messy diagrams.
20. Default skeletons - product systems #
CRUD / entity system #
Client -> API Service -> Primary DB
-> Cache [read path]
-> Search Indexer -> Search Index [async]
Social / relation system #
Client -> Relation Service -> Relation Store [forward + reverse index]
-> Counter Worker [async] -> Counter Store
-> Feed/Recommendation Service [read path: scatter-gather]
Workflow / transaction system #
Client -> Workflow Service -> Primary DB [CAS on status+version]
-> Outbox [inside DB] -> Queue -> Worker -> External Provider
-> Reconciliation Job
Inventory / hold / booking system #
Client -> Inventory Service -> Inventory DB [CAS on availability]
-> Hold Store [Redis SETNX + TTL] -> Expiry Worker
-> Booking DB [guarded confirm]
Search / feed / ranking system #
Write Service -> Source DB -> Event Bus -> Indexer/Fanout Worker -> Projection Store / Search Index
-> Feed Store (fan-out on write)
Query Service <- Projection Store [read path]
Collaborative / realtime system #
Client A -+
Client B -+-> Operation Coordinator [per document] -> Operation Log DB [append]
Client C -+ -> broadcast
Connected Clients [WebSocket]
Snapshot Store [periodic; rebuilt from log]
File sync system #
Client -> Sync Service -> Namespace DB [CAS on version] -> Conflict Object [on version mismatch]
-> Block Store [S3; content-addressed by SHA-256]
-> Sync Cursor [derived; rebuilt from namespace history]
Delta sync read path: Client -> Sync Service -> Namespace DB [delta from last_version]
Matching / assignment system #
Request State DB [guarded transitions] <-> Assignment Service
Candidate Pool [Redis; eligibility index] <-> Assignment Service
Assignment Record [Lease: SETNX + TTL + fencing token]
Execution State DB [state machine] <- Worker
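The `[Lease: SETNX + TTL + fencing token]` annotation packs three ideas together. Below is a minimal in-memory sketch of how they interact. `LeaseStore` and `FencedStore` are hypothetical classes standing in for the claim store and the downstream store, and the injectable clock exists only so the example is deterministic.

```python
import time


class LeaseStore:
    """Set-if-absent with a TTL, handing out a monotonically increasing token."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._leases = {}  # key -> (owner, expires_at, token)
        self._next_token = 0

    def acquire(self, key, owner, ttl):
        now = self._clock()
        current = self._leases.get(key)
        if current is not None and current[1] > now:
            return None  # lease still held by someone else
        self._next_token += 1
        self._leases[key] = (owner, now + ttl, self._next_token)
        return self._next_token


class FencedStore:
    """Downstream store that rejects writes carrying a stale fencing token."""

    def __init__(self):
        self.highest_token = 0
        self.value = None

    def write(self, token, value):
        if token < self.highest_token:
            return False  # writer lost its lease and a newer holder has written
        self.highest_token = token
        self.value = value
        return True
```

The fencing check is what protects against a paused worker that still believes it holds an expired lease, which is why the token belongs on the arrow from the claim to the downstream write.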
Crawler / frontier system #
Seed URLs -> Frontier Store [PostgreSQL; dedup by canonical URL]
Worker Pool -> claim via FOR UPDATE SKIP LOCKED -> fetch -> parse
-> discovered URLs -> Frontier Store [dedup]
-> Result Store [content-addressed]
Critical transaction with saga #
Client -> Orchestrator Service -> Local DB [CAS on status+version] + Outbox
-> Service A [idempotent step] <- compensation -+
-> Service B [idempotent step] <- compensation -+
-> External Provider |
Compensation Worker [on failure: replay compensations in reverse order] ------+
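The "replay compensations in reverse order" note can be sketched as a small in-process orchestrator. This is illustrative only: `run_saga` and the step tuple shape are assumptions, and a real orchestrator would persist progress (for example via the outbox) so compensation survives a crash.

```python
def run_saga(steps):
    """Run steps in order; on failure, undo completed steps in reverse.

    steps: list of (name, action, compensation) tuples, where action and
    compensation are zero-argument callables. Both are assumed idempotent.
    Returns (succeeded, log).
    """
    completed = []
    log = []
    for name, action, compensation in steps:
        try:
            action()
        except Exception:
            log.append(f"failed {name}")
            # Walk back through what already succeeded, newest first.
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"undid {done_name}")
            return False, log
        log.append(f"did {name}")
        completed.append((name, compensation))
    return True, log
```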
21. Default skeletons - infrastructure systems #
Messaging / streaming system #
Producer -> Topic/Partition [append log; partitioned by key]
<- Consumer Group [pull; offset per partition per group]
-> Offset Store [committed offsets]
-> Dead Letter Queue [unprocessable messages]
Admin API -> Topic Metadata [partition count, retention, replication]
Key-value / cache infrastructure #
Client -> Router [consistent hash on key] -> Partition Node [primary]
-> Replica Nodes [async replication]
Eviction Worker [LRU/TTL sweep per node]
Control Plane [partition map; rebalance on node join/leave]
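The `Router [consistent hash on key]` box can be sketched as a hash ring with virtual nodes. This is one common way to build it, not necessarily what any given system uses. `HashRing`, the `vnodes` count, and the node names are hypothetical.

```python
import bisect
import hashlib


class HashRing:
    """Consistent hashing: each node owns many points on a 64-bit ring."""

    def __init__(self, nodes, vnodes=64):
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        digest = hashlib.sha256(value.encode()).digest()
        return int.from_bytes(digest[:8], "big")

    def node_for(self, key: str) -> str:
        # First ring point clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect_right(self._points, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]
```

The property worth naming on the diagram is that when a node leaves, only the keys it owned move. Keys on the surviving nodes stay where they were, which is what keeps a rebalance bounded.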
Rate limiter #
Client -> Rate Limiter Service -> Counter Store [Redis: atomic counters with TTL; sliding window or token bucket]
-> Policy Store [limits per key/tenant]
[ALLOW: pass through] [DENY: 429 + retry-after header]
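A minimal token bucket sketch of the ALLOW / DENY decision. In the skeleton the counters live in a shared store. This version keeps state in process and takes the current time as an argument so the behavior is deterministic. `TokenBucket` and its parameters are assumed names for illustration.

```python
class TokenBucket:
    """Allow a request if a token is available; tokens refill at a fixed rate."""

    def __init__(self, capacity: int, refill_per_sec: float, now: float) -> None:
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)
        self.updated_at = now

    def check(self, now: float):
        """Return (http_status, retry_after_seconds)."""
        elapsed = now - self.updated_at
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated_at = now
        if self.tokens >= 1:
            self.tokens -= 1
            return 200, 0.0  # ALLOW: pass through
        # DENY: tell the client how long until one full token is available.
        return 429, (1 - self.tokens) / self.refill_per_sec
```

The bucket capacity sets the allowed burst and the refill rate sets the sustained limit, which are the two numbers a policy store entry would need to carry per key or tenant.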
Coordination / consensus store #
Client -> Leader Node [Raft; all writes go to leader]
-> Follower Nodes [replicated; reads allowed with staleness flag]
Watch Registration -> Leader -> Watch Fanout [notify all watchers on key change]
Lease Store [TTL leases; auto-delete keys on lease expiry]
CDN / edge delivery #
Origin -> Origin Shield [single PoP; shields the origin from direct edge hits]
-> Regional PoP [mid-tier cache]
-> Edge PoP [serves client]
Client -> Edge PoP [HIT: return] [MISS: -> Regional -> Shield -> Origin -> populate down]
Purge API -> Origin -> propagate invalidation to Shield -> Regional -> Edge [bounded lag]
Control plane + data plane #
Admin API -> Control Plane Store [versioned configs; CAS on version]
-> Propagation Layer [push delta to agents or agents long-poll]
-> Agent [local snapshot; applied_version tracked]
Data Plane [serves traffic using local snapshot; never calls control plane on hot path]
Health Reports -> Control Plane [agents report applied_version + health signals]
Rollout Controller [advance % or rollback based on health signals]
22. Diagram evolution over the interview #
Do not draw everything at once. Evolve the diagram in three passes:
Pass 1 - baseline (first 10 minutes): Draw the sync write path only. One client, one service, one primary store. Annotate the mechanism. This establishes correctness before complexity.
Pass 2 - read path and async (next 10 minutes): Add the read path if it diverges. Add outbox, queue, workers, projection stores. Label async boundaries explicitly.
Pass 3 - deep dive expansion (remaining time): Expand only the part the interviewer wants to go deeper on. Add scale annotations, replicas, caches, external systems. Do not expand the parts not being discussed.
Rule: every box added in pass 2 and pass 3 must be justified by a path or a failure mode. If you cannot state why a box was added, remove it.
23. When to split a service #
A common failure is either splitting too early or too late.
Split a monolith into two services only when at least one of these is true:
- different correctness scopes
- different scaling axes
- different failure tolerance
- different ownership
Do not split because:
- the names are different
- you want to seem thorough
- microservices feel more modern
When you do split, state which of the four reasons applies. This turns a diagram decision into a derived choice.
24. Questions to ask for every box #
For each box in your diagram, be able to answer:
- which path uses this box?
- what data does it own or serve?
- is it source truth or projection?
- is it sync or async?
- why can this not be merged into another box at this scale?
If you cannot answer these, simplify the diagram.
25. What to say while drawing #
Use lines like:
- I'll start with the main write path.
- This service owns the canonical booking state.
- After the source write commits, I publish to an async pipeline.
- These read-heavy queries come from projections, not the source store.
- This worker exists because freshness can lag but correctness cannot.
- I'm annotating CAS here because this is where the exclusivity invariant is enforced.
- I'll split this into two services because the read path and write path have different scaling requirements.
- I'm adding this box now because the deep dive is on fanout, not because every design needs it.
This makes the diagram feel intentional and derived, not assembled from memory.
26. Common mistakes #
- drawing Kafka, Redis, and Elasticsearch before naming the path
- drawing one box per entity with no behavior
- mixing source truth and projections in the same store box
- drawing too many microservices too early without stating why
- failing to show async boundaries
- failing to show which store is canonical
- drawing infrastructure instead of responsibilities
- no mechanism annotation on CAS or lease paths
- drawing the write path and read path as the same arrow
- adding scale annotations everywhere instead of only on the bottleneck
- splitting services speculatively
27. Interview one-liner #
I draw HLDs in three passes: sync write path first with mechanism annotated, then read path and async pipeline, then deep dive expansion on the bottleneck. Every box must answer which path uses it, what truth it owns, and why it cannot be merged. Service splits require an explicit reason.