Infra Interview Flow Cheat Sheet #

Use this note for infrastructure-oriented prompts such as:

  1. distributed lock service
  2. scheduler
  3. queue
  4. cache
  5. rate limiter
  6. config service
  7. service discovery
  8. event streaming platform
  9. metrics or tracing system

Infra interviews are usually mechanism-oriented rather than product-flow-oriented, so the delivery order differs slightly from a product interview.


The delivery order #

  1. clarify the contract
  2. define NFRs
  3. identify core state and invariants
  4. expose external API / interface
  5. present high-level architecture
  6. deep dive into correctness
  7. deep dive into scale
  8. discuss tradeoffs and operational concerns

1. Clarify the contract #

Before drawing anything, state what the system promises.

Examples:

Lock service #

  1. mutual exclusion?
  2. lease-based or permanent locks?
  3. fencing token support?
  4. best-effort or correctness-critical?

Queue #

  1. at-most-once, at-least-once, or effectively-once consumer behavior?
  2. ordering guarantees?
  3. visibility timeout?
  4. dead-letter queue?

Scheduler #

  1. exact fire time or bounded delay?
  2. one execution per scheduled run?
  3. retries?
  4. cron-like recurring jobs?

This step matters more in infra prompts than in product prompts, because the guarantees you commit to here determine most of the design that follows.

What to say out loud:

First I want to define the contract, because the architecture changes a lot depending on whether this is best-effort, at-least-once, or correctness-critical.


2. Define NFRs #

For infra systems, the most important NFRs are usually:

  1. consistency / correctness
  2. durability
  3. latency
  4. availability
  5. throughput
  6. ordering
  7. freshness

Examples:

Lock service #

  1. correctness beats availability
  2. stale holder must not act
  3. lease operations should be low latency

Metrics system #

  1. ingest throughput matters most
  2. bounded loss may be acceptable depending on metric class
  3. query freshness may lag

Config service #

  1. the control-plane source of truth must be strongly consistent
  2. local evaluators may lag, but only within a bounded propagation delay

What to say out loud:

I’ll separate correctness-critical paths from high-throughput or eventually consistent paths.


3. Identify core state and invariants #

This is the most important section in infra interviews.

State examples:

  1. LeaseState
  2. ClaimState
  3. MembershipState
  4. FrontierState
  5. Offset
  6. SnapshotVersion
  7. TokenBucketState
  8. JobRunState

Invariant examples:

  1. only one valid lock holder at a time
  2. no stale holder can mutate downstream state
  3. offset must not advance past durable processing
  4. config version must apply monotonically
  5. job run should not execute twice without idempotency or dedup

What to say out loud:

The architecture follows from the invariants. So I’ll first define the few pieces of state whose correctness really matters.
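
As one illustration, the invariant "offset must not advance past durable processing" can be made concrete in a few lines. This is a minimal sketch, not a real client API: `OffsetState`, `mark_durable`, and `advance` are hypothetical names, and it assumes offsets are dense integers starting at 0.

```python
from dataclasses import dataclass, field


@dataclass
class OffsetState:
    """Hypothetical per-partition consumer state (illustrative names only)."""
    committed: int = -1                        # highest offset that is safe to commit
    durable: set = field(default_factory=set)  # processed durably, not yet committed

    def mark_durable(self, offset: int) -> None:
        """Record that the result of processing `offset` has been persisted."""
        if offset > self.committed:
            self.durable.add(offset)

    def advance(self) -> int:
        """Move the committed offset forward only across a contiguous durable prefix."""
        while self.committed + 1 in self.durable:
            self.committed += 1
            self.durable.remove(self.committed)
        return self.committed
```

If offsets 0 and 2 are durable but 1 is not, `advance()` stops at 0. The gap at 1 blocks the commit, which is exactly what the invariant asks for.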


4. Expose the API / interface #

Even infra systems have interfaces.

Examples:

Lock service #

  1. Acquire(lock_id, owner_id, ttl)
  2. Renew(lock_id, lease_id, ttl)
  3. Release(lock_id, lease_id)

Queue #

  1. Enqueue(message)
  2. Receive(batch_size, visibility_timeout)
  3. Ack(message_id)
  4. Nack(message_id)

Scheduler #

  1. Schedule(job, run_at)
  2. Cancel(schedule_id)
  3. ClaimRunnable(batch_size)
  4. Complete(run_id)

Config service #

  1. PutConfig(key, value, version)
  2. GetSnapshot(service, version?)
  3. Watch(from_version)

What to say out loud:

I’ll make the external contract concrete with a small set of operations before drawing the internals.
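
To make the lock-service operations above tangible, here is a single-process sketch of `Acquire`, `Renew`, and `Release`. It makes several assumptions that are not in the note: the clock is injected as a callable, leases live in a plain dict instead of a quorum store, and `lease_id` doubles as a monotonically increasing fencing token.

```python
import itertools
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Lease:
    lease_id: int       # also usable as a fencing token in this sketch
    owner_id: str
    expires_at: float


class LockService:
    """In-memory stand-in for a lock API; a real one would sit on a quorum store."""

    def __init__(self, clock: Callable[[], float]):
        self._clock = clock
        self._leases: Dict[str, Lease] = {}
        self._next_id = itertools.count(1)

    def _live(self, lock_id: str) -> Optional[Lease]:
        """Return the current lease only if it has not expired."""
        lease = self._leases.get(lock_id)
        if lease is not None and lease.expires_at > self._clock():
            return lease
        return None

    def acquire(self, lock_id: str, owner_id: str, ttl: float) -> Optional[Lease]:
        if self._live(lock_id) is not None:
            return None  # a live lease already exists (non-reentrant)
        lease = Lease(next(self._next_id), owner_id, self._clock() + ttl)
        self._leases[lock_id] = lease
        return lease

    def renew(self, lock_id: str, lease_id: int, ttl: float) -> bool:
        lease = self._live(lock_id)
        if lease is None or lease.lease_id != lease_id:
            return False  # expired or superseded: the caller must stop acting
        lease.expires_at = self._clock() + ttl
        return True

    def release(self, lock_id: str, lease_id: int) -> bool:
        lease = self._leases.get(lock_id)
        if lease is None or lease.lease_id != lease_id:
            return False  # only the current lease may release
        del self._leases[lock_id]
        return True
```

Passing `lease_id` to `Renew` and `Release`, rather than `owner_id`, means an old holder cannot renew or release a lease that a newer holder now owns.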


5. Present high-level architecture #

Now show the boxes.

Typical infra boxes:

  1. clients / SDKs / agents
  2. API servers
  3. consensus / metadata store
  4. queue or log
  5. worker / executor
  6. projection or cache
  7. background reconciler

Examples:

Lock service #

  1. client library
  2. lock API
  3. quorum metadata store
  4. watch channel
  5. downstream systems requiring fencing token

Scheduler #

  1. scheduler API
  2. schedule store indexed by due time
  3. scanner
  4. runnable queue
  5. workers
  6. run state store

Queue #

  1. producer
  2. broker
  3. append log / message store
  4. consumer
  5. visibility timeout manager
  6. DLQ

What to say out loud:

Writes go to the canonical coordination or metadata store first, then workers and watchers consume that state to perform the execution path.
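
The scheduler boxes above can be sketched as a small state machine. This is an assumption-laden toy, not a reference design: the run-state names (`SCHEDULED`, `RUNNABLE`, `CLAIMED`, `DONE`) and the class itself are hypothetical, and everything runs in one process, so the claim step stands in for what would be a compare-and-set against the run state store.

```python
import heapq
from collections import deque


class Scheduler:
    """Toy scheduler: due-time index -> scanner -> runnable queue -> workers."""

    def __init__(self):
        self._due = []            # min-heap of (run_at, schedule_id)
        self._runnable = deque()  # ids ready for workers
        self._run_state = {}      # schedule_id -> state string

    def schedule(self, schedule_id: str, run_at: float) -> None:
        # The write lands in canonical state first.
        self._run_state[schedule_id] = "SCHEDULED"
        heapq.heappush(self._due, (run_at, schedule_id))

    def scan(self, now: float) -> int:
        """Scanner: move every due job into the runnable queue."""
        moved = 0
        while self._due and self._due[0][0] <= now:
            _, sid = heapq.heappop(self._due)
            if self._run_state.get(sid) == "SCHEDULED":
                self._run_state[sid] = "RUNNABLE"
                self._runnable.append(sid)
                moved += 1
        return moved

    def claim_runnable(self, batch_size: int) -> list:
        """Worker: claim jobs; the state check prevents a duplicate claim."""
        claimed = []
        while self._runnable and len(claimed) < batch_size:
            sid = self._runnable.popleft()
            if self._run_state.get(sid) == "RUNNABLE":
                self._run_state[sid] = "CLAIMED"
                claimed.append(sid)
        return claimed

    def complete(self, schedule_id: str) -> bool:
        if self._run_state.get(schedule_id) != "CLAIMED":
            return False
        self._run_state[schedule_id] = "DONE"
        return True
```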


6. Deep dive into correctness #

Infra interviews almost always require one serious correctness deep dive.

Good deep-dive topics:

  1. stale holder fencing
  2. watch gap on reconnect
  3. duplicate claim
  4. offset commit discipline
  5. out-of-order config apply
  6. exactly-once illusion vs idempotent processing
  7. leader failover

Use this structure:

  1. what can go wrong?
  2. what truth do I trust?
  3. what control prevents corruption?
  4. what repair fixes drift?

Examples:

Lock service #

  1. stale holder keeps acting after lease expiry
  2. trust LeaseState and epoch
  3. fencing token blocks stale holder
  4. lease expiry and watch replay repair state
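
A fencing check on the downstream side might look like the sketch below. `FencedStore` is a hypothetical resource, and the sketch assumes the lock service issues strictly increasing integer tokens; the only point is that the downstream system, not the client, decides whether a write is stale.

```python
class FencedStore:
    """Hypothetical downstream resource that rejects writes carrying an old token."""

    def __init__(self):
        self.highest_token = 0   # largest fencing token seen so far
        self.value = None

    def write(self, token: int, value) -> bool:
        if token < self.highest_token:
            return False         # stale holder: its lease was superseded
        self.highest_token = token
        self.value = value
        return True
```

If holder A (token 1) pauses past its lease expiry and holder B (token 2) writes first, A's late write is rejected even though A still believes it holds the lock.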

Queue #

  1. consumer processes message but crashes before ack
  2. trust message log and visibility timeout state
  3. at-least-once redelivery
  4. idempotent consumer or dedup store repairs duplicates
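
The queue case can be walked through with a toy broker and an idempotent consumer. Both classes are hypothetical and in-memory, with an injected clock; in a real system the side effect and the dedup record would need to be written atomically, which this sketch does not attempt.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple


class Queue:
    """Toy broker with a visibility timeout (at-least-once delivery)."""

    def __init__(self, clock: Callable[[], float]):
        self._clock = clock
        self._messages: Dict[str, str] = {}          # message_id -> payload
        self._invisible_until: Dict[str, float] = {}

    def enqueue(self, message_id: str, payload: str) -> None:
        self._messages[message_id] = payload

    def receive(self, visibility_timeout: float) -> Optional[Tuple[str, str]]:
        now = self._clock()
        for mid, payload in self._messages.items():
            if self._invisible_until.get(mid, 0.0) <= now:
                # Hide the message; it reappears if no ack arrives in time.
                self._invisible_until[mid] = now + visibility_timeout
                return mid, payload
        return None

    def ack(self, message_id: str) -> None:
        self._messages.pop(message_id, None)
        self._invisible_until.pop(message_id, None)


class IdempotentConsumer:
    """Applies each message_id at most once, using a dedup set."""

    def __init__(self):
        self.seen: Set[str] = set()
        self.effects: List[str] = []

    def handle(self, message_id: str, payload: str) -> None:
        if message_id in self.seen:
            return                     # redelivery: effect already applied
        self.effects.append(payload)   # stand-in for the real side effect
        self.seen.add(message_id)
```

A consumer that handles a message and then crashes before `ack` will see the same message again once the timeout expires; the dedup set turns that redelivery into a no-op.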

7. Deep dive into scale #

Infra scale questions are usually about:

  1. hot partitions
  2. control-plane fanout
  3. metadata bottlenecks
  4. large scans
  5. replay storms
  6. write amplification

Use this structure:

  1. what gets hot?
  2. why?
  3. first mitigation?

Examples:

Rate limiter #

  1. hot tenant key
  2. many requests converge on same budget key
  3. shard counter or use local token buckets with periodic reconciliation
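
A local token bucket, the building block behind the second mitigation, can be sketched as follows. The class and parameter names are illustrative, time is passed in explicitly, and the periodic reconciliation with a global budget is deliberately left out; the sketch assumes each node has already been given its share of the tenant's rate.

```python
class TokenBucket:
    """Local token bucket; `now` is passed in so behavior is deterministic."""

    def __init__(self, capacity: float, refill_per_sec: float, now: float = 0.0):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = capacity       # start full
        self.last = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill for the elapsed time, capped at capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last = max(self.last, now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Because each node decides locally, the hot tenant key no longer funnels every request through one shared counter. The tradeoff is that the global limit is only approximately enforced between reconciliations.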

Scheduler #

  1. due-time bucket
  2. many jobs become due at the same minute
  3. shard by time bucket and scanner range
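
One way to picture "shard by time bucket and scanner range" is a composite partition key. The function names, the 60-second bucket, and the use of CRC32 as the hash are all assumptions made for this sketch.

```python
import zlib


def time_bucket(run_at_epoch_s: int, bucket_seconds: int = 60) -> int:
    """Round the due time down to the start of its bucket."""
    return run_at_epoch_s - (run_at_epoch_s % bucket_seconds)


def shard_for(job_id: str, num_shards: int) -> int:
    """Stable hash of the job id, so jobs in one bucket spread across shards."""
    return zlib.crc32(job_id.encode("utf-8")) % num_shards


def partition_key(job_id: str, run_at_epoch_s: int, num_shards: int,
                  bucket_seconds: int = 60) -> tuple:
    """(bucket, shard): each scanner owns a range of shards within a bucket."""
    return (time_bucket(run_at_epoch_s, bucket_seconds),
            shard_for(job_id, num_shards))
```

Jobs that all become due in the same minute share a bucket but land on different shards, so several scanners can drain that minute in parallel.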

Config service #

  1. snapshot publication fanout
  2. one change fans out to many agents
  3. versioned pull + watch instead of pure push
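
The "versioned pull + watch" idea, together with monotonic apply, can be sketched like this. `ConfigStore` and `Agent` are hypothetical; the sketch assumes a single global version counter and that an agent which detects a gap in pushed versions falls back to pulling everything after its last applied version.

```python
class ConfigStore:
    """Toy canonical store: an append-only log of versioned changes."""

    def __init__(self):
        self.version = 0
        self.log = []   # list of (version, key, value)

    def put(self, key: str, value: str) -> int:
        self.version += 1
        self.log.append((self.version, key, value))
        return self.version

    def changes_since(self, from_version: int) -> list:
        return [c for c in self.log if c[0] > from_version]


class Agent:
    """Local evaluator that applies config versions strictly in order."""

    def __init__(self, store: ConfigStore):
        self.store = store
        self.applied_version = 0
        self.config = {}

    def _apply(self, version: int, key: str, value: str) -> None:
        self.config[key] = value
        self.applied_version = version

    def on_push(self, version: int, key: str, value: str) -> str:
        if version <= self.applied_version:
            return "ignored"        # stale or duplicate notification
        if version == self.applied_version + 1:
            self._apply(version, key, value)
            return "applied"
        self.catch_up()             # gap detected: pull the missed versions
        return "caught_up"

    def catch_up(self) -> None:
        for version, key, value in self.store.changes_since(self.applied_version):
            self._apply(version, key, value)
```

The push only needs to tell the agent that something changed. Correctness comes from the versioned pull, so a dropped or reordered push delays propagation but does not corrupt local state.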

8. Discuss tradeoffs and operations #

Infra interviews reward operational maturity.

Mention:

  1. observability
  2. reconciliation jobs
  3. backpressure
  4. capacity hotspots
  5. degraded mode behavior
  6. migrations and rollout discipline

Examples:

  1. alert on lease churn
  2. alert on consumer lag
  3. monitor stale config versions
  4. track scheduler due delay
  5. expose lock contention metrics
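
As a small example of turning one of these alerts into something measurable, consumer lag can be computed per partition from two offset maps. The function names and the input shape are assumptions for this sketch, not any particular broker's API.

```python
def consumer_lag(log_end_offsets: dict, committed_offsets: dict) -> dict:
    """Per-partition lag: newest offset in the log minus the committed offset."""
    return {p: end - committed_offsets.get(p, 0)
            for p, end in log_end_offsets.items()}


def partitions_over_threshold(lag_by_partition: dict, threshold: int) -> list:
    """Partitions whose lag exceeds the alert threshold, in sorted order."""
    return sorted(p for p, lag in lag_by_partition.items() if lag > threshold)
```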

What to say out loud:

I’d add operational guardrails because these systems fail in subtle ways long before they fully go down.


How infra flow differs from product flow #

Product flow usually starts with: #

  1. user requirements
  2. entities
  3. APIs

Infra flow usually starts with: #

  1. contract
  2. invariants
  3. consistency semantics
  4. interface

The main difference is that infra flow pins down guarantees and invariants before it gets to the interface.


Fast mapping table #

| Infra interview section | What you are really deriving |
| --- | --- |
| Contract | service semantics |
| NFRs | correctness and performance envelope |
| Core state + invariants | canonical truth |
| API / interface | public operations |
| HLD | mechanism + realization |
| Correctness deep dive | failure handling |
| Scale deep dive | hotspot mitigation |
| Tradeoffs | operational maturity |

Example compressed flow #

Example: Distributed Lock Service #

  1. Contract

    • lease-based lock
    • mutual exclusion
    • fencing token required
  2. NFRs

    • correctness over availability
    • low latency acquire/renew
    • durable lease state
  3. Core state

    • LeaseState
    • OwnerEpoch
  4. API

    • Acquire
    • Renew
    • Release
  5. HLD

    • lock API over quorum metadata store
    • watches for state changes
    • downstream services validate fencing token
  6. Correctness deep dive

    • stale holder after lease expiry
    • fencing token prevents corruption
  7. Scale deep dive

    • hot lock ids
    • shard metadata by lock key
  8. Tradeoffs

    • strong correctness but lower availability under partition

What not to do #

  1. do not jump straight to Raft/Paxos before defining the contract
  2. do not present APIs without invariants
  3. do not say “exactly once” without explaining how the illusion is realized
  4. do not treat metrics, queues, locks, and config as the same kind of system
  5. do not skip operational recovery paths

Interview one-liner #

For infra designs I start from the contract and invariants, because once I know what must always be true, the state model, APIs, and architecture become much easier to justify.