Infra Interview Flow Cheat Sheet
Use this note for infrastructure-oriented prompts such as:
- distributed lock service
- scheduler
- queue
- cache
- rate limiter
- config service
- service discovery
- event streaming platform
- metrics or tracing system
Infra interviews focus less on product flows and more on mechanisms, so the delivery order differs slightly from product interviews.
The delivery order #
1. clarify the contract
2. define NFRs
3. identify core state and invariants
4. expose external API / interface
5. present high-level architecture
6. deep dive into correctness
7. deep dive into scale
8. discuss tradeoffs and operational concerns
1. Clarify the contract #
Before drawing anything, state what the system promises.
Examples:
Lock service #
- mutual exclusion?
- lease-based or permanent locks?
- fencing token support?
- best-effort or correctness-critical?
Queue #
- at-most-once, at-least-once, or effectively-once consumer behavior?
- ordering guarantees?
- visibility timeout?
- dead-letter queue?
Scheduler #
- exact fire time or bounded delay?
- one execution per scheduled run?
- retries?
- cron-like recurring jobs?
This step matters more for infra prompts than for product prompts, because the promised guarantees drive the rest of the design.
What to say out loud:
First I want to define the contract, because the architecture changes a lot depending on whether this is best-effort, at-least-once, or correctness-critical.
2. Define NFRs #
For infra systems, the most important NFRs are usually:
- consistency / correctness
- durability
- latency
- availability
- throughput
- ordering
- freshness
Examples:
Lock service #
- correctness beats availability
- stale holder must not act
- lease operations should be low latency
Metrics system #
- ingest throughput matters most
- bounded loss may be acceptable depending on metric class
- query freshness may lag
Config service #
- the control-plane source of truth must be strongly consistent
- local evaluators may lag, within a bounded propagation delay
What to say out loud:
I’ll separate correctness-critical paths from high-throughput or eventually consistent paths.
3. Identify core state and invariants #
This is the most important section in infra interviews.
State examples:
- LeaseState
- ClaimState
- MembershipState
- FrontierState
- Offset
- SnapshotVersion
- TokenBucketState
- JobRunState
Invariant examples:
- only one valid lock holder at a time
- no stale holder can mutate downstream state
- offset must not advance past durable processing
- config version must apply monotonically
- job run should not execute twice without idempotency or dedup
What to say out loud:
The architecture follows from the invariants. So I’ll first define the few pieces of state whose correctness really matters.
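One way to make an invariant concrete is to show the single check that enforces it. The sketch below illustrates "config version must apply monotonically". The `ConfigApplier` class and its field names are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ConfigApplier:
    """Hypothetical local evaluator that applies config snapshots."""
    applied_version: int = 0
    values: dict = field(default_factory=dict)

    def apply(self, version: int, values: dict) -> bool:
        # Invariant: config versions apply monotonically.
        # A stale or duplicate snapshot is ignored, never applied.
        if version <= self.applied_version:
            return False
        self.values = dict(values)
        self.applied_version = version
        return True


applier = ConfigApplier()
applier.apply(2, {"limit": 100})
late_applied = applier.apply(1, {"limit": 50})  # arrives late, rejected
```

The state that matters here is just `applied_version`. Everything else in the evaluator can be rebuilt from a later snapshot.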
4. Expose the API / interface #
Even infra systems have interfaces.
Examples:
Lock service #
- Acquire(lock_id, owner_id, ttl)
- Renew(lock_id, lease_id, ttl)
- Release(lock_id, lease_id)
Queue #
- Enqueue(message)
- Receive(batch_size, visibility_timeout)
- Ack(message_id)
- Nack(message_id)
Scheduler #
- Schedule(job, run_at)
- Cancel(schedule_id)
- ClaimRunnable(batch_size)
- Complete(run_id)
Config service #
- PutConfig(key, value, version)
- GetSnapshot(service, version?)
- Watch(from_version)
What to say out loud:
I’ll make the external contract concrete with a small set of operations before drawing the internals.
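As an illustration of how small such an interface can be, here is a minimal in-memory sketch of the lock service operations (Acquire, Renew, Release). Several things are assumptions made for the sketch: a plain dict stands in for the quorum metadata store, `now` is passed explicitly so the behavior is deterministic, and the lease id doubles as the fencing token.

```python
import itertools
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Lease:
    lease_id: int        # doubles as the fencing token in this sketch
    owner_id: str
    expires_at: float


class LockService:
    """In-memory stand-in for a lock API backed by a quorum store."""

    def __init__(self) -> None:
        self._leases: Dict[str, Lease] = {}
        self._next_id = itertools.count(1)  # monotonically increasing ids

    def acquire(self, lock_id: str, owner_id: str, ttl: float,
                now: float) -> Optional[Lease]:
        current = self._leases.get(lock_id)
        if current is not None and current.expires_at > now:
            return None  # still held by a live lease
        lease = Lease(next(self._next_id), owner_id, now + ttl)
        self._leases[lock_id] = lease
        return lease

    def renew(self, lock_id: str, lease_id: int, ttl: float,
              now: float) -> bool:
        current = self._leases.get(lock_id)
        if (current is None or current.lease_id != lease_id
                or current.expires_at <= now):
            return False  # expired or superseded lease
        current.expires_at = now + ttl
        return True

    def release(self, lock_id: str, lease_id: int) -> bool:
        current = self._leases.get(lock_id)
        if current is None or current.lease_id != lease_id:
            return False  # only the current holder may release
        del self._leases[lock_id]
        return True


svc = LockService()
a = svc.acquire("job-42", "worker-a", ttl=10, now=0)
b = svc.acquire("job-42", "worker-b", ttl=10, now=5)    # lease is live
c = svc.acquire("job-42", "worker-b", ttl=10, now=11)   # after expiry
```

Renew and Release take the `lease_id` rather than the `owner_id`, so a holder whose lease was superseded cannot extend or drop the new holder's lease.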
5. Present high-level architecture #
Now show the boxes.
Typical infra boxes:
- clients / SDKs / agents
- API servers
- consensus / metadata store
- queue or log
- worker / executor
- projection or cache
- background reconciler
Examples:
Lock service #
- client library
- lock API
- quorum metadata store
- watch channel
- downstream systems requiring fencing token
Scheduler #
- scheduler API
- schedule store indexed by due time
- scanner
- runnable queue
- workers
- run state store
Queue #
- producer
- broker
- append log / message store
- consumer
- visibility timeout manager
- DLQ
What to say out loud:
Writes go to the canonical coordination or metadata store first, then workers and watchers consume that state to perform the execution path.
6. Deep dive into correctness #
Infra interviews almost always require one serious correctness deep dive.
Good deep-dive topics:
- stale holder fencing
- watch gap on reconnect
- duplicate claim
- offset commit discipline
- out-of-order config apply
- exactly-once illusion vs idempotent processing
- leader failover
Use this structure:
- what can go wrong?
- what truth do I trust?
- what control prevents corruption?
- what repair fixes drift?
Examples:
Lock service #
- stale holder keeps acting after lease expiry
- trust LeaseState and epoch
- fencing token blocks stale holder
- lease expiry and watch replay repair state
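The fencing control can be sketched from the downstream side. The `FencedStore` class below is hypothetical; it assumes tokens are integers that increase with every new lease, and that the downstream resource remembers the highest token it has accepted.

```python
class FencedStore:
    """Hypothetical downstream resource that checks fencing tokens."""

    def __init__(self) -> None:
        self.highest_token = 0
        self.data = {}

    def write(self, token: int, key: str, value: str) -> bool:
        # Reject any write carrying a token older than one already seen.
        if token < self.highest_token:
            return False
        self.highest_token = token
        self.data[key] = value
        return True


store = FencedStore()
store.write(1, "owner", "worker-a")   # holder with token 1 writes
store.write(2, "owner", "worker-b")   # new holder after lease expiry
stale_ok = store.write(1, "owner", "worker-a")  # paused old holder resumes
```

The stale holder still believes it owns the lock, but the check at the resource stops its write. The lock service alone cannot provide this guarantee.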
Queue #
- consumer processes message but crashes before ack
- trust message log and visibility timeout state
- at-least-once redelivery
- idempotent consumer or dedup store repairs duplicates
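The crash-before-ack case can be walked through with a toy queue. This is a sketch under assumptions: the `Queue` class, the explicit `now` argument, and the in-memory `processed` set are all illustrative. A real consumer would need to commit the dedup record atomically with the side effect.

```python
from typing import Dict, List, Optional, Set, Tuple


class Queue:
    """Toy at-least-once queue with a visibility timeout."""

    def __init__(self, visibility_timeout: float) -> None:
        self.visibility_timeout = visibility_timeout
        self._messages: Dict[str, str] = {}          # message_id -> body
        self._invisible_until: Dict[str, float] = {}

    def enqueue(self, message_id: str, body: str) -> None:
        self._messages[message_id] = body
        self._invisible_until[message_id] = 0.0

    def receive(self, now: float) -> Optional[Tuple[str, str]]:
        for message_id, body in self._messages.items():
            if self._invisible_until[message_id] <= now:
                # Hide the message; it reappears if no ack arrives in time.
                self._invisible_until[message_id] = (
                    now + self.visibility_timeout)
                return message_id, body
        return None

    def ack(self, message_id: str) -> None:
        self._messages.pop(message_id, None)
        self._invisible_until.pop(message_id, None)


processed: Set[str] = set()   # dedup store keyed by message id
effects: List[str] = []       # stands in for downstream side effects


def handle(message_id: str, body: str) -> None:
    if message_id in processed:
        return                # duplicate delivery, skip the side effect
    effects.append(body)
    processed.add(message_id)


q = Queue(visibility_timeout=30)
q.enqueue("m1", "charge order 7")

first = q.receive(now=0)
handle(*first)                # processed, but consumer crashes before ack
hidden = q.receive(now=10)    # still inside the visibility timeout
second = q.receive(now=31)    # redelivered after the timeout
handle(*second)               # dedup store suppresses the second effect
q.ack(second[0])
```

The message is delivered twice, yet the side effect happens once. That is the "effectively-once" behavior built from at-least-once delivery plus idempotent handling.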
7. Deep dive into scale #
Infra scale questions are usually about:
- hot partitions
- control-plane fanout
- metadata bottlenecks
- large scans
- replay storms
- write amplification
Use:
- what gets hot?
- why?
- first mitigation?
Examples:
Rate limiter #
- hot tenant key
- many requests converge on same budget key
- shard counter or use local token buckets with periodic reconciliation
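A local token bucket, the building block of the second mitigation, might look like the sketch below. The class and parameter names are assumptions, `now` is passed in for determinism, and the periodic reconciliation against a shared budget is deliberately left out.

```python
class TokenBucket:
    """Local token bucket; reconciliation with a global budget is omitted."""

    def __init__(self, capacity: float, refill_per_sec: float,
                 now: float = 0.0) -> None:
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = capacity
        self.last_refill = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill lazily, based on the time elapsed since the last call.
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity,
                          self.tokens + elapsed * self.refill_per_sec)
        self.last_refill = max(self.last_refill, now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


bucket = TokenBucket(capacity=2, refill_per_sec=1)
results = [bucket.allow(0), bucket.allow(0), bucket.allow(0),
           bucket.allow(1.0)]
```

Each node decides locally, so the hot tenant key no longer funnels every request through a single counter. The cost is that the global limit is only approximately enforced between reconciliations.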
Scheduler #
- due-time bucket
- many jobs become due at the same minute
- shard by time bucket and scanner range
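The bucketing idea can be sketched as a key function. The bucket width, shard count, and function names below are assumptions chosen for illustration; `zlib.crc32` is used because it is stable across processes, unlike Python's built-in `hash` for strings.

```python
import zlib

BUCKET_SECONDS = 60   # assumed bucket width
NUM_SHARDS = 4        # assumed sub-shards per time bucket


def bucket_key(job_id: str, run_at: int) -> tuple:
    """Map a job to (time bucket start, shard) for the schedule store."""
    time_bucket = run_at - (run_at % BUCKET_SECONDS)
    shard = zlib.crc32(job_id.encode("utf-8")) % NUM_SHARDS
    return (time_bucket, shard)


def scanner_shards(scanner_index: int, num_scanners: int) -> list:
    """Shards owned by one scanner when shards are split round-robin."""
    return [s for s in range(NUM_SHARDS)
            if s % num_scanners == scanner_index]


key = bucket_key("job-1", 125)
```

Jobs due in the same minute share a time bucket but spread across shards, so several scanners can drain a busy minute in parallel.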
Config service #
- snapshot publication fanout
- one change fans out to many agents
- versioned pull + watch instead of pure push
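A minimal sketch of versioned pull plus watch follows. It assumes the store keeps a bounded change log (`max_log` of at least 1) and that an agent which has fallen behind the log falls back to a full snapshot. The class names and the `None` gap signal are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Change = Tuple[int, str, str]  # (version, key, value)


class ConfigStore:
    """Toy config store with a bounded change log (max_log >= 1)."""

    def __init__(self, max_log: int) -> None:
        self.version = 0
        self.state: Dict[str, str] = {}
        self.log: List[Change] = []
        self.max_log = max_log

    def put_config(self, key: str, value: str) -> int:
        self.version += 1
        self.state[key] = value
        self.log.append((self.version, key, value))
        self.log = self.log[-self.max_log:]  # older entries are compacted
        return self.version

    def get_snapshot(self) -> Tuple[int, Dict[str, str]]:
        return self.version, dict(self.state)

    def watch(self, from_version: int) -> Optional[List[Change]]:
        # None signals a gap: the caller is too far behind to replay.
        oldest = self.log[0][0] if self.log else self.version + 1
        if from_version + 1 < oldest:
            return None
        return [e for e in self.log if e[0] > from_version]


class Agent:
    def __init__(self) -> None:
        self.version = 0
        self.state: Dict[str, str] = {}

    def sync(self, store: ConfigStore) -> None:
        changes = store.watch(self.version)
        if changes is None:
            # Watch gap: pull a full snapshot instead of replaying.
            self.version, self.state = store.get_snapshot()
            return
        for version, key, value in changes:
            self.state[key] = value
            self.version = version


store = ConfigStore(max_log=2)
agent = Agent()
store.put_config("a", "1")
agent.sync(store)                 # incremental: applies version 1
for k, v in [("b", "2"), ("c", "3"), ("a", "4")]:
    store.put_config(k, v)
agent.sync(store)                 # gap: versions 2..4, log holds only 3..4
```

Because agents pull from their own version, the store does not need to track per-agent delivery state, and a reconnecting agent recovers on its own.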
8. Discuss tradeoffs and operations #
Infra interviews reward operational maturity.
Mention:
- observability
- reconciliation jobs
- backpressure
- capacity hotspots
- degraded mode behavior
- migrations and rollout discipline
Examples:
- alert on lease churn
- alert on consumer lag
- monitor stale config versions
- track scheduler due delay
- expose lock contention metrics
What to say out loud:
I’d add operational guardrails because these systems fail in subtle ways long before they fully go down.
How infra flow differs from product flow #
Product flow usually starts with: #
- user requirements
- entities
- APIs
Infra flow usually starts with: #
- contract
- invariants
- consistency semantics
- interface
That is the main difference.
Fast mapping table #
| Infra interview section | What you are really deriving |
|---|---|
| Contract | service semantics |
| NFRs | correctness and performance envelope |
| Core state + invariants | canonical truth |
| API / interface | public operations |
| High-level architecture (HLD) | mechanism + realization |
| Correctness deep dive | failure handling |
| Scale deep dive | hotspot mitigation |
| Tradeoffs | operational maturity |
Example compressed flow #
Example: Distributed Lock Service #
Contract
- lease-based lock
- mutual exclusion
- fencing token required
NFRs
- correctness over availability
- low latency acquire/renew
- durable lease state
Core state
- LeaseState
- OwnerEpoch
API
- Acquire
- Renew
- Release
HLD
- lock API over quorum metadata store
- watches for state changes
- downstream services validate fencing token
Correctness deep dive
- stale holder after lease expiry
- fencing token prevents corruption
Scale deep dive
- hot lock ids
- shard metadata by lock key
Tradeoffs
- strong correctness but lower availability under partition
What not to do #
- do not jump straight to Raft/Paxos before defining the contract
- do not present APIs without invariants
- do not say “exactly once” without explaining how the illusion is realized
- do not treat metrics, queues, locks, and config as the same kind of system
- do not skip operational recovery paths
Interview one-liner #
For infra designs I start from the contract and invariants, because once I know what must always be true, the state model, APIs, and architecture become much easier to justify.