HLD Diagram Discipline Cheat Sheet #

Status: Archive candidate. Keep as a historical reference. For day-to-day use, prefer system-design-core-index.md, the core notes, and the infra-specific notes rather than treating this note as primary.

Use this note to draw cleaner interview diagrams.

The main rule:

Draw services around responsibilities for critical paths, then attach the stores and async pipelines they need.

Do not start from random boxes like User Service, DB, Cache, Kafka.

1. Start from paths, not boxes #

Before drawing anything, identify:

the 2-4 most important user paths
any hidden system paths that matter for correctness

Examples:

hold seat
buy seat
view seat map
expire stale hold

Why:

paths tell you what the system must do
services exist to own responsibilities on those paths

2. Group paths into service responsibilities #

Turn paths into a small set of responsibilities.

Examples:

Inventory Service
Booking Service
Search Service
Feed Service
Notification Worker
Scheduler / Expiry Worker

Good box test:

which path hits this box?
what truth does it own?
is it sync or async?

If you cannot answer those, the box is probably fake.

3. Draw the sync path first #

Always draw the request-response path before async machinery.

Typical order:

Client
API / App Service
Core Domain Service
Primary Store

Example:

Client
Booking API
Inventory/Booking Service
Primary DB

Why:

interviewers care first about correctness on the main path
async details are easier once the sync path is clear

4. Put source-of-truth stores directly under the owning service #

For each service, attach the canonical store it owns.

Examples:

Follow Service -> Relation Store
Booking Service -> Booking DB
Inventory Service -> Inventory DB
Document Service -> Document Store

Rule:

source truth should be obvious in the diagram
derived stores should be drawn later and separately

5. Add async edges only after the source truth path is clear #

Now add:

outbox
queue
event bus
worker
scheduler
fanout pipeline

Typical flow:

source write commits
event or outbox record is produced
worker consumes event
worker updates projections / sends notifications / triggers side effects

This is where:

feed fanout
search indexing
counter maintenance
expiry scanning
notification delivery

usually appear.
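
The typical flow above can be sketched with an outbox table that commits in the same transaction as the source write. This is a minimal illustration using Python's built-in sqlite3 module; the table names, the `create_booking` and `drain_outbox` helpers, and the event shape are assumptions made for the example, not part of the original note.

```python
import json
import sqlite3


def open_outbox_db() -> sqlite3.Connection:
    """Create an in-memory DB with a source table and an outbox table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE bookings (id INTEGER PRIMARY KEY, seat TEXT NOT NULL);
        CREATE TABLE outbox (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            payload TEXT NOT NULL,
            published INTEGER NOT NULL DEFAULT 0
        );
        """
    )
    return conn


def create_booking(conn: sqlite3.Connection, booking_id: int, seat: str) -> None:
    # Source write and outbox record commit (or roll back) together.
    event = {"type": "booking_created", "id": booking_id, "seat": seat}
    with conn:
        conn.execute("INSERT INTO bookings (id, seat) VALUES (?, ?)", (booking_id, seat))
        conn.execute("INSERT INTO outbox (payload) VALUES (?)", (json.dumps(event),))


def drain_outbox(conn: sqlite3.Connection, handle) -> int:
    """Worker step: deliver unpublished events in order, then mark them published."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, payload in rows:
        # A crash between handle() and the UPDATE causes redelivery,
        # so the handler should be an idempotent consumer.
        handle(json.loads(payload))
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

The worker here is at-least-once by construction, which is why the consumer side is usually annotated as idempotent on the diagram.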

6. Draw derived stores separately from source truth #

Keep these in a separate visual lane:

cache
search index
feed store
leaderboard store
analytics store
dashboard store

Reason:

source truth and projection should not be visually conflated
this makes consistency tradeoffs easier to explain

Good pattern:

source truth in middle / left
queue in middle
projections on right

7. Add external systems last #

Only after the internal flow is stable, add:

payment provider
email provider
push notification provider
CDN
object storage
auth provider

Why:

external systems are side effects or dependencies
they are rarely the starting point of the design

8. Keep service boxes responsibility-oriented #

Good service names:

Inventory Service
Booking Service
Feed Service
Search Indexer
Notification Service
Scheduler

Bad service names:

Everything Service
Database Service
User Service if it owns unrelated responsibilities
one service box per entity with no behavior

Use verbs or responsibility nouns, not schema nouns.

9. Use one diagram lane per role #

A clean interview diagram usually has 4 lanes:

clients / entrypoints
synchronous domain services
source-of-truth stores
async workers + projections + external systems

This keeps the board readable.

10. Show ownership, not every network hop #

You do not need to draw:

service mesh
load balancer internals
every internal RPC
every cache invalidation call

Only show what matters for:

correctness
performance
failure handling
scaling

11. Use arrows to show semantics, not decoration #

Different arrows should imply something:

sync request/response
async event or queue
projection update
external side effect

When talking through the diagram, state the semantics out loud:

this write is synchronous
this projection update is asynchronous
this external notification is retried

12. Annotate API paths directly on the diagram #

Do not leave the main arrows unnamed.

Every important sync path should be annotated with:

method + endpoint or operation name
short intent
optional consistency or idempotency note

Good examples:

POST /holds [idem key]
POST /payments/{id}/capture
GET /feed/home [eventual]
PUT /docs/{id}/ops [versioned]

If you do not want full HTTP detail, use operation labels:

create_hold
confirm_booking
append_message
search_nearby

Best practice:

put the path label above the arrow
put the consistency / idempotency hint in brackets
label only the 2-4 paths that actually matter

Do not annotate every edge.

The goal is:

make the critical request paths visible
tie APIs to boxes
make later deep dives easier

13. Annotate stores and boxes with ownership and truth #

Each important box should answer one of these visually:

source of truth
derived projection
cache
async worker
external side effect

Useful suffixes:

Booking DB [truth]
Feed Store [projection]
Search Index [projection]
Redis Cache [cache]
Notification Worker [async]
PSP [external]

Useful box notes:

partitioned by user_id
strong write path
eventual read model
idempotent consumer

This prevents the classic interview confusion:

what is authoritative?
what can be stale?
what can be rebuilt?

14. Annotate the mechanism on the diagram #

For mechanism-bearing archetypes, the coordination primitive is the most important thing to make visible. Without it, an auction diagram and a CRUD diagram look identical.

Add a small annotation next to the relevant store or arrow:

| Mechanism | What to annotate |
| --- | --- |
| CAS on `(state, version)` | label the store: `DB [CAS on status+version]`; label the write arrow: `UPDATE WHERE status=? AND version=?` |
| Lease | label the claim store: `Redis [SETNX + TTL]`; add fencing token on the arrow from claim to downstream write |
| Idempotency key | label the service: `[idempotency store: (client_id, request_id) -> result]` |
| Outbox | draw outbox as a table inside the source DB box, not as a separate box |
| Saga compensation | draw compensation arrows as dashed lines from worker back to each service that must roll back |
| CRDT / OT | label the coordinator: `[operation log; state vector per client]` |

Rule: if the mechanism is the load-bearing part of the design, it must appear on the diagram. A bare Booking Service -> DB arrow with no annotation says nothing about how double booking is prevented.
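
To make the `UPDATE WHERE status=? AND version=?` annotation concrete, here is a minimal sketch of a guarded write using sqlite3. The `seats` table, the `AVAILABLE`/`HELD` status values, and the `try_hold` helper are hypothetical names chosen for the example.

```python
import sqlite3


def open_seat_db() -> sqlite3.Connection:
    """In-memory DB with one seat row at version 1."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE seats (id TEXT PRIMARY KEY, status TEXT NOT NULL, version INTEGER NOT NULL)"
    )
    conn.execute("INSERT INTO seats VALUES ('A1', 'AVAILABLE', 1)")
    conn.commit()
    return conn


def try_hold(conn: sqlite3.Connection, seat_id: str, expected_version: int) -> bool:
    """Compare-and-set: the write applies only if status and version still match."""
    with conn:
        cur = conn.execute(
            "UPDATE seats SET status = 'HELD', version = version + 1 "
            "WHERE id = ? AND status = 'AVAILABLE' AND version = ?",
            (seat_id, expected_version),
        )
    # rowcount == 0 means another writer won the race; the caller re-reads and decides.
    return cur.rowcount == 1
```

Two callers that both read version 1 can both attempt the hold, but only one UPDATE matches a row. That single guarded statement is what the diagram annotation is pointing at.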

15. Draw the read path explicitly #

The write path and read path are often different shapes. Draw them separately if they diverge.

Cache hit/miss path #

Only draw the miss path if it matters for the design:

Client -> Cache [HIT: return] [MISS: -> DB -> populate cache -> return]
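
The hit/miss path in the line above can be sketched as a small in-process reader. `CachedReader` and its `load_from_db` callback are hypothetical names; a real cache would also need TTLs and size limits, which are left out here.

```python
from typing import Callable, Dict, Optional


class CachedReader:
    """HIT: return from cache. MISS: load from the DB, populate the cache, return."""

    def __init__(self, load_from_db: Callable[[str], Optional[str]]):
        self._load = load_from_db
        self._cache: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> Optional[str]:
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        self.misses += 1
        value = self._load(key)
        if value is not None:
            self._cache[key] = value  # populate on miss
        return value

    def invalidate(self, key: str) -> None:
        # Called on source write; without it, reads stay stale until eviction.
        self._cache.pop(key, None)
```

The stale window between a source write and `invalidate` is the part worth a failure tag on the diagram.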

Read replica #

Label which queries go to replica vs primary:

Write path -> Primary DB
Read path  -> Read Replica [bounded lag]

Fan-out on write vs fan-out on read #

These produce fundamentally different diagram shapes:

Fan-out on write:

Write Service -> DB -> Event Bus -> Fanout Worker -> Feed Store (per user)
Read path: Client -> Feed Store [precomputed]

Fan-out on read:

Write Service -> DB
Read path: Client -> Feed Service -> scatter-gather(Follow Store + Post Store) -> merge -> return

State which model you are using and why. Fan-out on write is cheap to read but expensive to write, and the write cost is amplified for authors with very high follower counts. Fan-out on read is cheap to write but expensive to read, and the read cost grows with the number of accounts the reader follows.
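
A toy in-memory model makes the cost asymmetry visible. The class names and the `writes` / `reads` counters are illustrative assumptions; post ids are plain integers so that sorting stands in for time ordering.

```python
from collections import defaultdict
from typing import Dict, List, Set


class FanoutOnWrite:
    """One post costs one feed-store write per follower; reads are a single lookup."""

    def __init__(self, followers: Dict[str, Set[str]]):
        self.followers = followers            # author -> follower ids
        self.feed_store = defaultdict(list)   # user -> precomputed feed
        self.writes = 0

    def post(self, author: str, post_id: int) -> None:
        for follower in self.followers.get(author, ()):
            self.feed_store[follower].append(post_id)
            self.writes += 1

    def read_feed(self, user: str) -> List[int]:
        return list(self.feed_store[user])


class FanoutOnRead:
    """One post costs one write; a read gathers from every followed author."""

    def __init__(self, following: Dict[str, Set[str]]):
        self.following = following            # user -> followed authors
        self.posts = defaultdict(list)        # author -> post ids
        self.reads = 0

    def post(self, author: str, post_id: int) -> None:
        self.posts[author].append(post_id)

    def read_feed(self, user: str) -> List[int]:
        feed: List[int] = []
        for author in self.following.get(user, ()):
            self.reads += 1                   # one lookup per followed author
            feed.extend(self.posts[author])
        return sorted(feed)
```

In the first model the counter grows with the author's follower count; in the second it grows with the reader's following count. That is the number to annotate on the diagram.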

Scatter-gather #

When a query must fan out across N shards:

Client -> Query Service -> [Shard 1, Shard 2, ... Shard N] -> merge -> return

Label the merge step and state its latency dependency: the response is only as fast as the slowest shard it waits for.
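
A minimal sketch of the shape above, assuming each shard is a callable that returns its own top results sorted in descending order. `make_shard` and `scatter_gather` are hypothetical helpers, and the "query" is just a minimum score to keep the example small.

```python
import heapq
import itertools
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

Shard = Callable[[int, int], List[int]]


def make_shard(scores: List[int]) -> Shard:
    """Toy shard: return up to `limit` scores >= min_score, highest first."""
    def search(min_score: int, limit: int) -> List[int]:
        hits = sorted((s for s in scores if s >= min_score), reverse=True)
        return hits[:limit]
    return search


def scatter_gather(shards: List[Shard], min_score: int, limit: int) -> List[int]:
    """Query every shard in parallel, then merge the sorted partial results."""
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        # pool.map yields results for all shards, so this step waits for the slowest one.
        partials = list(pool.map(lambda shard: shard(min_score, limit), shards))
    merged = heapq.merge(*partials, reverse=True)  # inputs are already sorted descending
    return list(itertools.islice(merged, limit))
```

Each shard only needs to return its local top `limit`, because the global top `limit` cannot contain more than that many items from any one shard.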

16. Write failure modes next to the path where they occur #

Do not keep all failure talk verbal.

Annotate the diagram with short failure tags near the boxes or edges that own the risk.

Good annotation style:

F1 duplicate hold
F2 stale cache read
F3 retry after timeout
F4 webhook side effect succeeds but ack lost

Then write a one-line mitigation nearby:

idem key + unique constraint
TTL + invalidate on source write
retry with backoff
outbox + reconciliation

Good places to mark failures:

before commit
after commit before publish
during external side effect
during async projection update
during lease / hold expiry

This makes the diagram defensible under questioning.

A strong pattern is:

F1 near the edge
M1 near the mitigation box

Example:

Booking API -> Booking DB : POST /book
F1 duplicate submit
M1 idem key
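
The F1 / M1 pair above can be sketched as follows. This is an in-memory stand-in, and `BookingApi` is a hypothetical name. In a real system the idempotency record would typically live in the same database as the booking, protected by a unique constraint, so that the check and the write cannot be separated by a crash.

```python
from typing import Dict, List, Tuple


class BookingApi:
    """POST /book with an idempotency key of (client_id, request_id)."""

    def __init__(self) -> None:
        self._results: Dict[Tuple[str, str], int] = {}  # idempotency store
        self.bookings: List[Tuple[int, str]] = []       # (booking_id, seat)

    def book(self, client_id: str, request_id: str, seat: str) -> int:
        key = (client_id, request_id)
        if key in self._results:
            # Duplicate submit or retry after timeout: replay the stored result.
            return self._results[key]
        booking_id = len(self.bookings) + 1
        self.bookings.append((booking_id, seat))
        self._results[key] = booking_id
        return booking_id
```

A retry with the same key returns the original booking id and creates no second booking.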

17. Annotate scalability bottlenecks where they originate #

Do not say only “this scales horizontally.”

Mark the stress point on the diagram.

Use short scale tags:

S1 hot inventory row
S2 fanout amplification
S3 hot search shard
S4 projection lag
S5 cold-start pressure
S6 origin read miss storm

Then add the first mitigation beside it:

bucket by event_id
partition by user_id
precompute feed
bounded freshness SLA
warm pool
request coalescing

Useful targeted annotations:

| Annotation | When to use |
| --- | --- |
| `[sharded by user_id, 32 partitions]` | when partition key choice is the design decision |
| `[replicated 3x, async]` | when replication factor and durability are in scope |
| `[p99 < 10ms]` | when latency SLA is driving the read path choice |
| `[~100K writes/sec]` | when throughput is driving partitioning |
| `[hot key: celebrity users]` | when skew is the specific problem |

Best rule:

annotate the bottleneck on the edge or box where load concentrates
annotate the mitigation on the box that absorbs or spreads the load

Never annotate a box with scale numbers just to seem thorough. Only annotate when the number explains a design choice.

18. Use a small, explicit diagram legend #

If the diagram is non-trivial, spend 10 seconds defining notation.

Suggested legend:

solid arrow = sync request
dashed arrow = async event / queue
[truth] = source of truth
[projection] = derived read model
F# = failure mode
S# = scale hotspot

This prevents the interviewer from guessing what your arrows mean.

19. Canonical drawing sequence #

Use this order every time:

write down main paths
draw client
draw main synchronous service
draw primary store
draw second core service if the path truly splits responsibilities
annotate the mechanism on the write path
draw async queue/outbox
draw workers
draw projection stores
draw read path if it diverges from write path
draw external systems
add scale annotation on the bottleneck component only
circle the part you will deep dive into

This sequence prevents messy diagrams.

20. Default skeletons - product systems #

CRUD / entity system #

Client -> API Service -> Primary DB
                     -> Cache [read path]
                     -> Search Indexer -> Search Index [async]

Social / relation system #

Client -> Relation Service -> Relation Store [forward + reverse index]
                          -> Counter Worker [async] -> Counter Store
                          -> Feed/Recommendation Service [read path: scatter-gather]

Workflow / transaction system #

Client -> Workflow Service -> Primary DB [CAS on status+version]
                          -> Outbox [inside DB] -> Queue -> Worker -> External Provider
                                                         -> Reconciliation Job

Inventory / hold / booking system #

Client -> Inventory Service -> Inventory DB [CAS on availability]
                           -> Hold Store [Redis SETNX + TTL] -> Expiry Worker
                           -> Booking DB [guarded confirm]

Search / feed / ranking system #

Write Service -> Source DB -> Event Bus -> Indexer/Fanout Worker -> Projection Store / Search Index
                                                                -> Feed Store (fan-out on write)
Query Service <- Projection Store [read path]

Collaborative / realtime system #

Client A -+
Client B -+-> Operation Coordinator [per document] -> Operation Log DB [append]
Client C -+                                        -> broadcast -> Connected Clients [WebSocket]
Snapshot Store [periodic; rebuilt from log]

File sync system #

Client -> Sync Service -> Namespace DB [CAS on version] -> Conflict Object [on version mismatch]
                      -> Block Store [S3; content-addressed by SHA-256]
                      -> Sync Cursor [derived; rebuilt from namespace history]
Delta sync read path: Client -> Sync Service -> Namespace DB [delta from last_version]
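
The `[content-addressed by SHA-256]` annotation on the block store can be illustrated with a few lines. `BlockStore` is a hypothetical in-memory stand-in for an object store; only `hashlib.sha256` is a real library call.

```python
import hashlib
from typing import Dict


class BlockStore:
    """Blocks are keyed by the SHA-256 of their bytes, so identical content is stored once."""

    def __init__(self) -> None:
        self._blocks: Dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        self._blocks.setdefault(digest, data)  # re-upload of the same block is a no-op
        return digest

    def get(self, digest: str) -> bytes:
        return self._blocks[digest]

    def __len__(self) -> int:
        return len(self._blocks)
```

Because the key is derived from the content, the namespace entry only needs to store a list of digests, and an upload retry cannot create a duplicate block.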

Matching / assignment system #

Request State DB [guarded transitions] <-> Assignment Service
Candidate Pool [Redis; eligibility index] <-> Assignment Service
Assignment Record [Lease: SETNX + TTL + fencing token]
Execution State DB [state machine] <- Worker
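
The `[Lease: SETNX + TTL + fencing token]` annotation combines two checks: the lease decides who may act, and the fencing token lets the downstream store reject a holder whose lease has already expired. The sketch below is an in-memory model with an explicit clock argument. `LeaseStore` and `FencedStore` are hypothetical names, not a Redis client.

```python
from typing import Optional


class LeaseStore:
    """Grants a lease if none is live; each grant carries a strictly increasing token."""

    def __init__(self) -> None:
        self._holder: Optional[str] = None
        self._expires_at = 0.0
        self._token = 0

    def acquire(self, worker: str, now: float, ttl: float) -> Optional[int]:
        if self._holder is not None and now < self._expires_at:
            return None  # someone else still holds a live lease
        self._token += 1
        self._holder, self._expires_at = worker, now + ttl
        return self._token


class FencedStore:
    """Downstream write target that rejects tokens older than the newest one seen."""

    def __init__(self) -> None:
        self.highest_token = 0
        self.value: Optional[str] = None

    def write(self, token: int, value: str) -> bool:
        if token < self.highest_token:
            return False  # stale holder: its lease expired and was re-granted
        self.highest_token = token
        self.value = value
        return True
```

Without the token check, a worker that paused past its TTL could still overwrite the result of the worker that legitimately took over.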

Crawler / frontier system #

Seed URLs -> Frontier Store [PostgreSQL; dedup by canonical URL]
Worker Pool -> claim via FOR UPDATE SKIP LOCKED -> fetch -> parse
           -> discovered URLs -> Frontier Store [dedup]
           -> Result Store [content-addressed]

Critical transaction with saga #

Client -> Orchestrator Service -> Local DB [CAS on status+version] + Outbox
                              -> Service A [idempotent step] <- compensation -+
                              -> Service B [idempotent step] <- compensation -+
                              -> External Provider                            |
Compensation Worker [on failure: replay compensations in reverse order] ------+
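
The "replay compensations in reverse order" note can be sketched as a small orchestrator loop. `run_saga` and the step tuple shape are assumptions for the example. Persistence, retries, and the outbox are deliberately left out.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None], Callable[[], None]]  # (name, action, compensation)


def run_saga(steps: List[Step]) -> Tuple[bool, List[str]]:
    """Run steps in order; on failure, compensate completed steps in reverse order."""
    log: List[str] = []
    completed: List[Tuple[str, Callable[[], None]]] = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            log.append(f"{name}:failed")
            # Undo only what actually completed, newest first.
            for done_name, done_compensate in reversed(completed):
                done_compensate()
                log.append(f"{done_name}:compensated")
            return False, log
        completed.append((name, compensate))
        log.append(f"{name}:done")
    return True, log
```

In a real orchestrator both the steps and the compensations need to be idempotent, since either may be retried after a crash.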

21. Default skeletons - infrastructure systems #

Messaging / streaming system #

Producer -> Topic/Partition [append log; partitioned by key]
         <- Consumer Group [pull; offset per partition per group]
         -> Offset Store [committed offsets]
         -> Dead Letter Queue [unprocessable messages]
Admin API -> Topic Metadata [partition count, retention, replication]

Key-value / cache infrastructure #

Client -> Router [consistent hash on key] -> Partition Node [primary]
                                         -> Replica Nodes [async replication]
Eviction Worker [LRU/TTL sweep per node]
Control Plane [partition map; rebalance on node join/leave]
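
The `Router [consistent hash on key]` box can be sketched as a hash ring with virtual nodes. `HashRing`, the `vnodes` default of 64, and the use of SHA-256 for placement are illustrative choices, not a description of any particular system.

```python
import bisect
import hashlib
from typing import Iterable, List, Tuple


def _hash(value: str) -> int:
    """Stable 64-bit position on the ring."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes: Iterable[str], vnodes: int = 64):
        self._vnodes = vnodes
        self._ring: List[Tuple[int, str]] = []  # sorted (position, node)
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        for i in range(self._vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove(self, node: str) -> None:
        self._ring = [entry for entry in self._ring if entry[1] != node]

    def route(self, key: str) -> str:
        """Return the node owning the first ring position clockwise from the key."""
        if not self._ring:
            raise LookupError("no nodes in ring")
        idx = bisect.bisect_right(self._ring, (_hash(key), "")) % len(self._ring)
        return self._ring[idx][1]
```

The property worth stating in the interview is that removing a node only remaps the keys that node owned. Keys on other nodes keep their placement, which is what keeps a rebalance on node join/leave bounded.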

Rate limiter #

Client -> Rate Limiter Service -> Counter Store [Redis: atomic INCR + EXPIRE or Lua script; sliding window or token bucket]
                              -> Policy Store [limits per key/tenant]
[ALLOW: pass through] [DENY: 429 + retry-after header]
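
A single-key token bucket shows where the ALLOW / DENY decision and the retry-after value come from. This is an in-process sketch with an explicit clock. `TokenBucket` and its parameters are hypothetical, and a shared counter store would need the same update to happen atomically.

```python
from typing import Tuple


class TokenBucket:
    def __init__(self, capacity: int, refill_per_sec: float, now: float = 0.0):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)  # start full
        self.updated_at = now

    def allow(self, now: float) -> Tuple[bool, float]:
        """Return (allowed, retry_after_seconds)."""
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated_at = max(self.updated_at, now)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True, 0.0
        # DENY: tell the client how long until one full token is available.
        return False, (1.0 - self.tokens) / self.refill_per_sec
```

The second element of the result is what would populate the retry-after header on a 429 response.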

Coordination / consensus store #

Client -> Leader Node [Raft; all writes go to leader]
        -> Follower Nodes [replicated; reads allowed with staleness flag]
Watch Registration -> Leader -> Watch Fanout [notify all watchers on key change]
Lease Store [TTL leases; auto-delete keys on lease expiry]

CDN / edge delivery #

Origin -> Origin Shield [single PoP; shields the origin from direct hits by edge PoPs]
       -> Regional PoP [mid-tier cache]
       -> Edge PoP [serves client]
Client -> Edge PoP [HIT: return] [MISS: -> Regional -> Shield -> Origin -> populate down]
Purge API -> Origin -> propagate invalidation to Shield -> Regional -> Edge [bounded lag]

Control plane + data plane #

Admin API -> Control Plane Store [versioned configs; CAS on version]
          -> Propagation Layer [push delta to agents or agents long-poll]
          -> Agent [local snapshot; applied_version tracked]
Data Plane [serves traffic using local snapshot; never calls control plane on hot path]
Health Reports -> Control Plane [agents report applied_version + health signals]
Rollout Controller [advance % or rollback based on health signals]

22. Diagram evolution over the interview #

Do not draw everything at once. Evolve the diagram in three passes:

Pass 1 - baseline (first 10 minutes): Draw the sync write path only. One client, one service, one primary store. Annotate the mechanism. This establishes correctness before complexity.

Pass 2 - read path and async (next 10 minutes): Add the read path if it diverges. Add outbox, queue, workers, projection stores. Label async boundaries explicitly.

Pass 3 - deep dive expansion (remaining time): Expand only the part the interviewer wants to go deeper on. Add scale annotations, replicas, caches, external systems. Do not expand the parts not being discussed.

Rule: every box added in pass 2 and pass 3 must be justified by a path or a failure mode. If you cannot state why a box was added, remove it.

23. When to split a service #

A common failure is splitting either too early or too late.

Split a monolith into two services only when at least one of these is true:

different correctness scopes
different scaling axes
different failure tolerance
different ownership

Do not split because:

the names are different
you want to seem thorough
microservices feel more modern

When you do split, state which of the four reasons applies. This turns a diagram decision into a derived choice.

24. Questions to ask for every box #

For each box in your diagram, be able to answer:

which path uses this box?
what data does it own or serve?
is it source truth or projection?
is it sync or async?
why can this not be merged into another box at this scale?

If you cannot answer these, simplify the diagram.
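
As a practice aid, the five questions can be turned into a small checklist that reports which ones a box leaves unanswered. The `Box` dataclass and its field names are hypothetical; the question text is taken from the list above.

```python
from dataclasses import dataclass
from typing import List, Optional

# Field name -> the question it answers.
QUESTIONS = {
    "path": "which path uses this box?",
    "data": "what data does it own or serve?",
    "truth": "is it source truth or projection?",
    "mode": "is it sync or async?",
    "why_separate": "why can this not be merged into another box at this scale?",
}


@dataclass
class Box:
    name: str
    path: Optional[str] = None
    data: Optional[str] = None
    truth: Optional[str] = None
    mode: Optional[str] = None
    why_separate: Optional[str] = None


def unanswered(box: Box) -> List[str]:
    """Return the questions this box cannot answer yet, in checklist order."""
    return [question for field, question in QUESTIONS.items() if not getattr(box, field)]
```

A box that returns a non-empty list is a candidate for merging into a neighbour or removing from the diagram.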

25. What to say while drawing #

Use lines like:

I'll start with the main write path.
This service owns the canonical booking state.
After the source write commits, I publish to an async pipeline.
These read-heavy queries come from projections, not the source store.
This worker exists because freshness can lag but correctness cannot.
I'm annotating CAS here because this is where the exclusivity invariant is enforced.
I'll split this into two services because the read path and write path have different scaling requirements.
I'm adding this box now because the deep dive is on fanout, not because every design needs it.

This makes the diagram feel intentional and derived, not assembled from memory.

26. Common mistakes #

drawing Kafka, Redis, and Elasticsearch before naming the path
drawing one box per entity with no behavior
mixing source truth and projections in the same store box
drawing too many microservices too early without stating why
failing to show async boundaries
failing to show which store is canonical
drawing infrastructure instead of responsibilities
no mechanism annotation on CAS or lease paths
drawing the write path and read path as the same arrow
adding scale annotations everywhere instead of only on the bottleneck
splitting services speculatively

27. Interview one-liner #

I draw HLDs in three passes: sync write path first with mechanism annotated, then read path and async pipeline, then deep dive expansion on the bottleneck. Every box must answer which path uses it, what truth it owns, and why it cannot be merged. Service splits require an explicit reason.
