Infra Interview Flow Cheat Sheet #

Use this note for infrastructure-oriented prompts such as:

distributed lock service
scheduler
queue
cache
rate limiter
config service
service discovery
event streaming platform
metrics or tracing system

Infra interviews usually focus on mechanisms rather than product flows, so the delivery order differs slightly from a product interview.

The delivery order #

1. clarify the contract
2. define NFRs
3. identify core state and invariants
4. expose external API / interface
5. present high-level architecture
6. deep dive into correctness
7. deep dive into scale
8. discuss tradeoffs and operational concerns

1. Clarify the contract #

Before drawing anything, state what the system promises.

Examples:

Lock service #

mutual exclusion?
lease-based or permanent locks?
fencing token support?
best-effort or correctness-critical?

Queue #

at-most-once, at-least-once, or effectively-once consumer behavior?
ordering guarantees?
visibility timeout?
dead-letter queue?

Scheduler #

exact fire time or bounded delay?
one execution per scheduled run?
retries?
cron-like recurring jobs?

This step matters more in infra prompts than in product prompts, because the promised guarantees drive most of the design.

What to say out loud:

First I want to define the contract, because the architecture changes a lot depending on whether this is best-effort, at-least-once, or correctness-critical.

2. Define NFRs #

For infra systems, the most important NFRs are usually:

consistency / correctness
durability
latency
availability
throughput
ordering
freshness

Examples:

Lock service #

correctness beats availability
stale holder must not act
lease operations should be low latency

Metrics system #

ingest throughput matters most
bounded loss may be acceptable depending on metric class
query freshness may lag

Config service #

the control-plane source of truth must be strongly consistent
local evaluators may lag, within a bounded propagation delay

What to say out loud:

I’ll separate correctness-critical paths from high-throughput or eventually consistent paths.

3. Identify core state and invariants #

This is the most important section in infra interviews.

State examples:

LeaseState
ClaimState
MembershipState
FrontierState
Offset
SnapshotVersion
TokenBucketState
JobRunState

Invariant examples:

only one valid lock holder at a time
no stale holder can mutate downstream state
committed offset must not advance past durably processed messages
config version must apply monotonically
job run should not execute twice without idempotency or dedup
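
An invariant is easiest to defend when it is enforced at one small choke point. As a minimal sketch of "config version must apply monotonically", assuming a hypothetical ConfigAgent that receives full snapshots tagged with an integer version (the class and its fields are illustrative, not part of any real library):

```python
from typing import Dict


class ConfigAgent:
    """Hypothetical local agent that applies versioned config snapshots."""

    def __init__(self) -> None:
        self.applied_version = 0
        self.values: Dict[str, str] = {}

    def apply(self, version: int, values: Dict[str, str]) -> bool:
        # Invariant: the applied version never moves backwards.
        # Late or duplicate deliveries are dropped instead of applied.
        if version <= self.applied_version:
            return False
        self.applied_version = version
        self.values = dict(values)
        return True
```

With this guard, delivery order stops mattering: an old snapshot that arrives after a newer one is simply ignored.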

What to say out loud:

The architecture follows from the invariants. So I’ll first define the few pieces of state whose correctness really matters.

4. Expose the API / interface #

Even infra systems have interfaces.

Examples:

Lock service #

Acquire(lock_id, owner_id, ttl)
Renew(lock_id, lease_id, ttl)
Release(lock_id, lease_id)
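
The three lock operations can be sketched as a single-process stand-in for the real service. Assumptions that are not in the note: time is passed in explicitly as `now` so the behavior is deterministic, one counter supplies both the lease id and the fencing token, and a plain dict stands in for the quorum metadata store.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Lease:
    lock_id: str
    owner_id: str
    lease_id: int       # identifies this grant; required to renew or release
    fencing_token: int  # strictly increasing across grants
    expires_at: float


class LockService:
    """In-memory sketch; a real service would persist leases in a quorum store."""

    def __init__(self) -> None:
        self._leases: Dict[str, Lease] = {}
        self._next_id = 0

    def acquire(self, lock_id: str, owner_id: str, ttl: float, now: float) -> Optional[Lease]:
        current = self._leases.get(lock_id)
        if current is not None and current.expires_at > now:
            return None  # still held by a live lease
        self._next_id += 1
        lease = Lease(lock_id, owner_id, self._next_id, self._next_id, now + ttl)
        self._leases[lock_id] = lease
        return lease

    def renew(self, lock_id: str, lease_id: int, ttl: float, now: float) -> bool:
        current = self._leases.get(lock_id)
        if current is None or current.lease_id != lease_id or current.expires_at <= now:
            return False  # expired or superseded leases cannot be revived
        current.expires_at = now + ttl
        return True

    def release(self, lock_id: str, lease_id: int) -> bool:
        current = self._leases.get(lock_id)
        if current is None or current.lease_id != lease_id:
            return False  # only the current lease may release the lock
        del self._leases[lock_id]
        return True
```

Renew and Release take the lease id rather than the owner id, so a holder whose lease was superseded cannot extend or drop someone else's lease.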

Queue #

Enqueue(message)
Receive(batch_size, visibility_timeout)
Ack(message_id)
Nack(message_id)
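
A minimal in-memory sketch of these four queue operations, assuming a single broker process and an explicit `now` argument in place of a real clock. The `receive_count` field is an added assumption; it is the kind of counter a DLQ policy could key on.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Message:
    message_id: int
    body: str
    visible_at: float = 0.0  # receivable once now >= visible_at
    receive_count: int = 0   # hypothetical field; could drive a DLQ policy


class Queue:
    def __init__(self) -> None:
        self._messages: Dict[int, Message] = {}  # dicts keep insertion order
        self._next_id = 0

    def enqueue(self, body: str) -> int:
        self._next_id += 1
        self._messages[self._next_id] = Message(self._next_id, body)
        return self._next_id

    def receive(self, batch_size: int, visibility_timeout: float, now: float) -> List[Message]:
        batch: List[Message] = []
        for msg in self._messages.values():
            if len(batch) == batch_size:
                break
            if msg.visible_at <= now:
                msg.visible_at = now + visibility_timeout  # hide while in flight
                msg.receive_count += 1
                batch.append(msg)
        return batch

    def ack(self, message_id: int) -> bool:
        # Ack is the only operation that deletes a message.
        return self._messages.pop(message_id, None) is not None

    def nack(self, message_id: int, now: float) -> bool:
        msg = self._messages.get(message_id)
        if msg is None:
            return False
        msg.visible_at = now  # make it receivable again immediately
        return True
```

A message that is received but never acked becomes visible again once its timeout passes, which is what gives the queue at-least-once behavior.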

Scheduler #

Schedule(job, run_at)
Cancel(schedule_id)
ClaimRunnable(batch_size)
Complete(run_id)
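
The scheduler operations can be sketched with a min-heap keyed by due time. This is a single-process illustration under stated assumptions: `now` is passed in explicitly, each schedule produces exactly one run (so the run id reuses the schedule id), and cancellation is handled lazily when the entry is popped.

```python
import heapq
from typing import Dict, List, Tuple


class Scheduler:
    def __init__(self) -> None:
        self._due_heap: List[Tuple[float, int]] = []  # (run_at, schedule_id)
        self._pending: Dict[int, str] = {}            # schedule_id -> job
        self._running: Dict[int, str] = {}            # run_id -> job
        self._next_id = 0

    def schedule(self, job: str, run_at: float) -> int:
        self._next_id += 1
        self._pending[self._next_id] = job
        heapq.heappush(self._due_heap, (run_at, self._next_id))
        return self._next_id

    def cancel(self, schedule_id: int) -> bool:
        # The heap entry stays behind and is skipped when it is popped.
        return self._pending.pop(schedule_id, None) is not None

    def claim_runnable(self, batch_size: int, now: float) -> List[Tuple[int, str]]:
        claimed: List[Tuple[int, str]] = []
        while (self._due_heap and len(claimed) < batch_size
               and self._due_heap[0][0] <= now):
            _, schedule_id = heapq.heappop(self._due_heap)
            job = self._pending.pop(schedule_id, None)
            if job is None:
                continue  # cancelled before it became due
            self._running[schedule_id] = job  # run_id == schedule_id in this sketch
            claimed.append((schedule_id, job))
        return claimed

    def complete(self, run_id: int) -> bool:
        return self._running.pop(run_id, None) is not None
```

Claiming moves a job from pending to running in one step, so two calls to `claim_runnable` can never hand out the same run.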

Config service #

PutConfig(key, value, version)
GetSnapshot(service, version?)
Watch(from_version)
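
One possible reading of this config interface, as a hedged sketch: the `version` argument of PutConfig is treated as the expected current version (a compare-and-set), which is an assumption rather than something the note specifies. Per-service scoping and historical snapshots are left out to keep the example short.

```python
from typing import Dict, List, Tuple

Event = Tuple[int, str, str]  # (version, key, value)


class VersionConflict(Exception):
    """Raised when a writer's expected version is not the current one."""


class ConfigService:
    def __init__(self) -> None:
        self._version = 0
        self._state: Dict[str, str] = {}
        self._log: List[Event] = []

    def put_config(self, key: str, value: str, expected_version: int) -> int:
        # Compare-and-set on the global version keeps writes totally ordered.
        if expected_version != self._version:
            raise VersionConflict(
                f"expected {expected_version}, current {self._version}")
        self._version += 1
        self._state[key] = value
        self._log.append((self._version, key, value))
        return self._version

    def get_snapshot(self) -> Tuple[int, Dict[str, str]]:
        # Returns the latest version together with a copy of the full state.
        return self._version, dict(self._state)

    def watch(self, from_version: int) -> List[Event]:
        # Returns every change strictly after from_version, in version order.
        return [event for event in self._log if event[0] > from_version]
```

A real Watch would be a long-lived stream; returning a list of events after `from_version` shows the same resumption semantics without any networking.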

What to say out loud:

I’ll make the external contract concrete with a small set of operations before drawing the internals.

5. Present high-level architecture #

Now show the boxes.

Typical infra boxes:

clients / SDKs / agents
API servers
consensus / metadata store
queue or log
worker / executor
projection or cache
background reconciler

Examples:

Lock service #

client library
lock API
quorum metadata store
watch channel
downstream systems requiring fencing token

Scheduler #

scheduler API
schedule store indexed by due time
scanner
runnable queue
workers
run state store

Queue #

producer
broker
append log / message store
consumer
visibility timeout manager
DLQ

What to say out loud:

Writes go to the canonical coordination or metadata store first, then workers and watchers consume that state to perform the execution path.

6. Deep dive into correctness #

Infra interviews almost always require one serious correctness deep dive.

Good deep-dive topics:

stale holder fencing
watch gap on reconnect
duplicate claim
offset commit discipline
out-of-order config apply
exactly-once illusion vs idempotent processing
leader failover

Use this structure:

what can go wrong?
what truth do I trust?
what control prevents corruption?
what repair fixes drift?

Examples:

Lock service #

stale holder keeps acting after lease expiry
trust LeaseState and epoch
fencing token blocks stale holder
lease expiry and watch replay repair state
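
The fencing control lives in the downstream system, not in the lock service. A minimal sketch, assuming a hypothetical FencedStore that remembers the highest token it has seen per resource (names are illustrative):

```python
from typing import Dict


class StaleTokenError(Exception):
    """Raised when a write carries a token older than one already seen."""


class FencedStore:
    """Hypothetical downstream store that validates fencing tokens."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}
        self._highest_token: Dict[str, int] = {}

    def write(self, resource: str, value: str, fencing_token: int) -> None:
        highest = self._highest_token.get(resource, 0)
        if fencing_token < highest:
            # A newer holder has already written; the caller's lease is stale.
            raise StaleTokenError(
                f"token {fencing_token} is older than {highest}")
        self._highest_token[resource] = fencing_token
        self._data[resource] = value

    def read(self, resource: str) -> str:
        return self._data[resource]
```

The stale holder may still believe it owns the lock, but its write is rejected because a holder with a larger token has already been observed. In a real store the token check and the write would have to happen atomically.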

Queue #

consumer processes message but crashes before ack
trust message log and visibility timeout state
at-least-once redelivery after the visibility timeout prevents message loss
idempotent consumer or dedup store repairs duplicates
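
The repair step can be sketched as a consumer that records processed message ids. Assumptions: the dedup store is an in-memory set here, while in practice it would need to be durable and updated in the same transaction as the side effect; the `balance` field is just a stand-in for any non-idempotent effect.

```python
from typing import Set


class IdempotentConsumer:
    """Hypothetical consumer whose side effect must not be applied twice."""

    def __init__(self) -> None:
        self.processed_ids: Set[int] = set()  # dedup store
        self.balance = 0                      # stand-in for a real side effect

    def handle(self, message_id: int, amount: int) -> bool:
        if message_id in self.processed_ids:
            return False  # redelivery: skip the effect, but still safe to ack
        self.balance += amount
        self.processed_ids.add(message_id)
        return True
```

Redelivery of the same message then changes nothing, which is how at-least-once delivery plus dedup produces effectively-once results.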

7. Deep dive into scale #

Infra scale questions are usually about:

hot partitions
control-plane fanout
metadata bottlenecks
large scans
replay storms
write amplification

Use:

what gets hot?
why?
first mitigation?

Examples:

Rate limiter #

hot tenant key
many requests converge on same budget key
shard counter or use local token buckets with periodic reconciliation
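
One way to realize local buckets with reconciliation, offered as a sketch rather than the only design: each node leases a chunk of the shared budget and serves requests locally until the chunk runs out. Here the lease call plays the role of reconciliation. CentralBudget and LocalLimiter are hypothetical names, and refill over time is omitted.

```python
class CentralBudget:
    """Shared budget for one tenant key; the hot spot being protected."""

    def __init__(self, tokens: int) -> None:
        self.tokens = tokens

    def lease(self, want: int) -> int:
        granted = min(want, self.tokens)
        self.tokens -= granted
        return granted


class LocalLimiter:
    """Per-node limiter that touches the central budget once per chunk."""

    def __init__(self, central: CentralBudget, chunk: int) -> None:
        self.central = central
        self.chunk = chunk
        self.local_tokens = 0
        self.central_calls = 0

    def allow(self) -> bool:
        if self.local_tokens == 0:
            # Reconcile with the shared budget only when the chunk is spent.
            self.local_tokens = self.central.lease(self.chunk)
            self.central_calls += 1
        if self.local_tokens == 0:
            return False
        self.local_tokens -= 1
        return True
```

The tradeoff is accuracy for load: the hot key sees roughly one central call per chunk instead of one per request, at the cost of tokens being stranded on a node that stops receiving traffic.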

Scheduler #

due-time bucket
many jobs become due at the same minute
shard by time bucket and scanner range
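
A small sketch of the partitioning idea, assuming integer schedule ids, one-minute buckets, and a static modulo assignment of shards to scanners. All function names and parameters are illustrative.

```python
from typing import List, Tuple


def bucket_key(run_at: float, schedule_id: int, num_shards: int,
               bucket_seconds: int = 60) -> Tuple[int, int]:
    """Partition key: (time bucket, shard within that bucket)."""
    time_bucket = int(run_at // bucket_seconds)
    shard = schedule_id % num_shards  # spreads jobs due in the same minute
    return time_bucket, shard


def shards_for_scanner(scanner_index: int, num_scanners: int,
                       num_shards: int) -> List[int]:
    """Static assignment: each scanner owns a disjoint slice of the shards."""
    return [s for s in range(num_shards) if s % num_scanners == scanner_index]
```

Jobs due in the same minute share a time bucket but land on different shards, so no single scanner has to drain the whole minute.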

Config service #

snapshot publication fanout
one change fans out to many agents
versioned pull + watch instead of pure push
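
Versioned pull can be sketched as a catch-up call: an agent reports the last version it applied and gets either the missing events or, if the log has been compacted past that point, a full snapshot. ConfigLog, `compact`, and `catch_up` are hypothetical names used only for this illustration.

```python
from typing import Dict, List, Tuple

Event = Tuple[int, str, str]  # (version, key, value)


class ConfigLog:
    def __init__(self) -> None:
        self.version = 0
        self.state: Dict[str, str] = {}
        self.events: List[Event] = []
        self.oldest_retained = 1  # lowest version still present in the log

    def put(self, key: str, value: str) -> int:
        self.version += 1
        self.state[key] = value
        self.events.append((self.version, key, value))
        return self.version

    def compact(self, keep_from: int) -> None:
        # Drop events older than keep_from; the snapshot still covers them.
        self.events = [e for e in self.events if e[0] >= keep_from]
        self.oldest_retained = keep_from

    def catch_up(self, from_version: int):
        if from_version + 1 < self.oldest_retained:
            # Gap: the agent missed events that no longer exist in the log.
            return ("snapshot", self.version, dict(self.state))
        missing = [e for e in self.events if e[0] > from_version]
        return ("events", self.version, missing)
```

Because agents pull from their own version, a slow or reconnecting agent never forces the server to track per-agent delivery state, and a watch gap degrades to a snapshot fetch instead of silent drift.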

8. Discuss tradeoffs and operations #

Infra interviews reward operational maturity.

Mention:

observability
reconciliation jobs
backpressure
capacity hotspots
degraded mode behavior
migrations and rollout discipline

Examples:

alert on lease churn
alert on consumer lag
monitor stale config versions
track scheduler due delay
expose lock contention metrics

What to say out loud:

I’d add operational guardrails because these systems fail in subtle ways long before they fully go down.

How infra flow differs from product flow #

Product flow usually starts with: #

user requirements
entities
APIs

Infra flow usually starts with: #

contract
invariants
consistency semantics
interface

That is the main difference.

Fast mapping table #

Infra interview section	What you are really deriving
Contract	service semantics
NFRs	correctness and performance envelope
Core state + invariants	canonical truth
API / interface	public operations
HLD	mechanism + realization
Correctness deep dive	failure handling
Scale deep dive	hotspot mitigation
Tradeoffs	operational maturity

Example compressed flow #

Example: Distributed Lock Service #

Contract
- lease-based lock
- mutual exclusion
- fencing token required
NFRs
- correctness over availability
- low latency acquire/renew
- durable lease state
Core state
- LeaseState
- OwnerEpoch
API
- Acquire
- Renew
- Release
HLD
- lock API over quorum metadata store
- watches for state changes
- downstream services validate fencing token
Correctness deep dive
- stale holder after lease expiry
- fencing token prevents corruption
Scale deep dive
- hot lock ids
- shard metadata by lock key
Tradeoffs
- strong correctness but lower availability under partition

What not to do #

do not jump straight to Raft/Paxos before defining the contract
do not present APIs without invariants
do not say “exactly once” without explaining how the illusion is realized
do not treat metrics, queues, locks, and config as the same kind of system
do not skip operational recovery paths

Interview one-liner #

For infra designs I start from the contract and invariants, because once I know what must always be true, the state model, APIs, and architecture become much easier to justify.