Infra NFR Cheat Sheet #

Use this as the infrastructure-side counterpart to the product NFR notes.

For infra systems, the most important NFRs usually go beyond latency and scale. They typically include:

  1. correctness
  2. durability
  3. availability
  4. latency
  5. throughput
  6. freshness

Then, depending on the system:

  1. ordering
  2. isolation
  3. recoverability
  4. operability

1. Correctness / Consistency #

Ask:

  1. what must never be violated?
  2. what stale or conflicting state is unacceptable?
  3. what semantics are required: single-writer, no stale holder, monotonic config, per-key ordering?

Examples:

  1. only one valid lock holder at a time
  2. stale config must not override newer config
  3. no duplicate winner on a claim path
  4. offset must not advance beyond durable processing

Good phrasing:

The main correctness invariant is X, and I’m willing to trade availability or latency before violating it.
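
One common way to enforce a "no stale holder" invariant is a fencing token checked at the resource being protected. The sketch below is illustrative only: `FencedStore` and its fields are hypothetical names, and it assumes the lock service hands out strictly increasing integer tokens.

```python
class FencedStore:
    """Hypothetical storage that rejects writes carrying an older fencing token."""

    def __init__(self):
        self.highest_token = 0
        self.value = None

    def write(self, token: int, value: str) -> bool:
        # Reject any writer whose token is older than one already seen.
        if token < self.highest_token:
            return False
        self.highest_token = token
        self.value = value
        return True


store = FencedStore()
first_ok = store.write(token=1, value="from holder A")
second_ok = store.write(token=2, value="from holder B")     # lock moved to B
stale_ok = store.write(token=1, value="late write from A")  # stale holder
```

Here the late write from holder A is rejected, so the invariant holds even if A still believes it owns the lock.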


2. Durability #

Ask:

  1. what committed state must survive a crash?
  2. what can be reconstructed and what cannot?

Examples:

  1. committed queue message
  2. lock lease state
  3. config version
  4. job run state
  5. checkpoint / consumer progress

Good phrasing:

Once the system acknowledges X, that state must survive process and node failure.
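
A minimal sketch of "acknowledge only after the state is durable", assuming a single local append-only file. `durable_append` and `recover` are hypothetical helpers; this covers process crashes only, and surviving node failure would additionally need replication, which is not shown.

```python
import json
import os
import tempfile


def durable_append(path: str, record: dict) -> bool:
    """Append one record and return True (the ack) only after fsync returns."""
    line = json.dumps(record) + "\n"
    with open(path, "a", encoding="utf-8") as f:
        f.write(line)
        f.flush()              # push Python's buffer to the OS
        os.fsync(f.fileno())   # ask the OS to persist to stable storage
    return True


def recover(path: str) -> list:
    """Reload every acknowledged record after a restart."""
    if not os.path.exists(path):
        return []
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


log_path = os.path.join(tempfile.mkdtemp(), "commit.log")
acked = durable_append(log_path, {"job_id": "j1", "state": "RUNNING"})
recovered = recover(log_path)
```

The design point is the ordering: the caller sees the ack only after `os.fsync` has returned, never before.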


3. Availability #

Ask:

  1. should the system fail closed or stay available under uncertainty?
  2. which paths can degrade?
  3. is partial service acceptable?

Examples:

  1. lock service may prefer correctness over availability
  2. metrics system may prefer availability over perfect completeness
  3. cache should stay highly available even if origin is degraded

Good phrasing:

For this path I prefer correctness over availability, so under uncertainty I would rather reject than violate the invariant.
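
The fail-closed versus fail-open choice can be made explicit in code. This is a sketch under assumptions: `OnUncertainty`, `decide`, and `unreachable_backend` are hypothetical names, and "uncertainty" is modeled as a `TimeoutError` from a dependency.

```python
from enum import Enum


class OnUncertainty(Enum):
    FAIL_CLOSED = "reject"   # protect the invariant
    FAIL_OPEN = "allow"      # protect availability


def decide(check, policy: OnUncertainty) -> bool:
    """Run a dependency check; fall back to the policy if it cannot answer."""
    try:
        return check()
    except TimeoutError:
        return policy is OnUncertainty.FAIL_OPEN


def unreachable_backend() -> bool:
    # Stand-in for a coordination backend that does not respond in time.
    raise TimeoutError("backend did not respond")


lock_granted = decide(unreachable_backend, OnUncertainty.FAIL_CLOSED)
metrics_accepted = decide(unreachable_backend, OnUncertainty.FAIL_OPEN)
```

A lock path would typically pick `FAIL_CLOSED`, while a metrics ingest path might pick `FAIL_OPEN`, matching the examples above.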


4. Latency #

Ask:

  1. which operations are on the hot path?
  2. which are interactive vs background?

Examples:

  1. lock acquire
  2. rate-limit decision
  3. cache lookup
  4. config evaluation
  5. enqueue

Good phrasing:

The hot path needs low p95 latency, while the background repair and projection paths can be slower.
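
To make "low p95" concrete, here is a small sketch of a nearest-rank percentile over latency samples. The sample values are made up for illustration, and `percentile` is a hypothetical helper rather than a library function.

```python
import math


def percentile(samples_ms, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples_ms)
    rank = math.ceil(p * len(ordered) / 100)   # 1-based rank
    return ordered[max(rank, 1) - 1]


# Illustrative lock-acquire latencies in milliseconds, with one slow outlier.
acquire_latencies_ms = [2, 3, 3, 4, 4, 5, 5, 6, 7, 40]
p50 = percentile(acquire_latencies_ms, 50)
p95 = percentile(acquire_latencies_ms, 95)
mean = sum(acquire_latencies_ms) / len(acquire_latencies_ms)
```

With these samples the median is 4 ms and the mean is 7.9 ms, while p95 is 40 ms, which is why hot-path targets are usually stated as a tail percentile.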


5. Throughput #

Ask:

  1. what is the dominant volume: writes, reads, scans, fanout, ingest?
  2. where will capacity saturate first?

Examples:

  1. queue message ingest
  2. metrics sample ingest
  3. scheduler due-time scan
  4. cache QPS

Good phrasing:

The architecture should optimize the dominant throughput path first, which in this system is X.


6. Freshness / Propagation Lag #

Ask:

  1. how stale can applied or derived state be?
  2. is bounded lag acceptable?
  3. does freshness differ by path?

Examples:

  1. config propagation lag
  2. search index lag
  3. cache invalidation lag
  4. metrics query lag
  5. scheduler fire-time delay

Good phrasing:

I can tolerate bounded lag on the derived path, but not on the commit path.
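
A bounded-lag requirement can be stated as a simple check on a derived snapshot's age. The sketch below assumes timestamps in seconds from a shared clock; `Snapshot` and `is_fresh_enough` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    version: int
    fetched_at: float   # seconds, from whatever clock the caller uses


def is_fresh_enough(snapshot: Snapshot, now: float, max_lag_s: float) -> bool:
    """True while the snapshot's age is within the agreed staleness bound."""
    return (now - snapshot.fetched_at) <= max_lag_s


snap = Snapshot(version=7, fetched_at=100.0)
within_bound = is_fresh_enough(snap, now=103.0, max_lag_s=5.0)
beyond_bound = is_fresh_enough(snap, now=110.0, max_lag_s=5.0)
```

What happens when the bound is exceeded (refetch, serve stale, or reject) is a separate decision that ties back to the availability preference.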


7. Ordering #

Ask:

  1. is order important?
  2. if yes, is it global, per key, per partition, or causal?

Examples:

  1. queue per-partition ordering
  2. config version monotonicity
  3. operation log apply order
  4. per-stream message ordering

Good phrasing:

I only need ordering per key, not globally, which lets me scale the system much more easily.
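
Per-key ordering is often obtained by routing every event for a key to the same partition. This is a sketch, assuming a fixed partition count and in-memory lists standing in for partitions; `partition_for` is a hypothetical helper.

```python
import hashlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition with a hash that is stable across processes."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


partitions = {p: [] for p in range(4)}
events = [("lock-a", 1), ("lock-b", 1), ("lock-a", 2), ("lock-a", 3), ("lock-b", 2)]
for key, seq in events:
    partitions[partition_for(key, 4)].append((key, seq))

# Sequence numbers for one key, read back from its single partition.
a_seqs = [seq for key, seq in partitions[partition_for("lock-a", 4)] if key == "lock-a"]
```

Events for `lock-a` come back in append order because they all land in one partition; nothing is promised about order across keys. Note that changing the partition count remaps keys, which this sketch does not handle.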


8. Isolation / Multi-Tenancy #

Ask:

  1. can one tenant or hot key overload others?
  2. do I need quotas or fairness?

Examples:

  1. rate limiter by tenant
  2. noisy tenants in a scheduler
  3. a hot customer on one queue partition
  4. config fanout to many clients from one tenant

Good phrasing:

I need per-tenant isolation so one noisy customer does not consume the entire system budget.
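
One way to sketch per-tenant isolation is a separate token bucket per tenant. The class below is illustrative and single-process only; `TenantBuckets` and its parameters are hypothetical, and the caller passes in the current time so the example is deterministic.

```python
class TenantBuckets:
    """One token bucket per tenant, so budgets are never shared."""

    def __init__(self, capacity: float, refill_per_s: float):
        self.capacity = capacity
        self.refill_per_s = refill_per_s
        self.state = {}   # tenant -> (tokens, last_seen_time)

    def allow(self, tenant: str, now: float) -> bool:
        tokens, last = self.state.get(tenant, (self.capacity, now))
        # Refill for elapsed time, capped at the bucket capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.refill_per_s)
        if tokens < 1:
            self.state[tenant] = (tokens, now)
            return False
        self.state[tenant] = (tokens - 1, now)
        return True


buckets = TenantBuckets(capacity=3, refill_per_s=1.0)
noisy = [buckets.allow("tenant-noisy", now=0.0) for _ in range(5)]
quiet = buckets.allow("tenant-quiet", now=0.0)
noisy_later = buckets.allow("tenant-noisy", now=2.0)
```

The noisy tenant exhausts only its own bucket; the quiet tenant is still admitted at the same instant, and the noisy tenant recovers after its bucket refills.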


9. Recoverability / Replayability #

Ask:

  1. can I rebuild from source truth?
  2. can I replay from a log?
  3. can I reconcile missing side effects?

Examples:

  1. replay queue/log
  2. reindex from source truth
  3. rebuild derived counters
  4. reconciliation scan over workflow state

Good phrasing:

I want recovery to come from canonical truth or replayable history rather than manual repair.
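
Rebuilding derived state from replayable history can be sketched as a pure function over the log. The event shape (`id`, `key`, `delta`) is an assumption made for illustration, and `rebuild_counts` is a hypothetical helper.

```python
from collections import Counter


def rebuild_counts(event_log):
    """Recompute a derived per-key counter by replaying the canonical log."""
    counts = Counter()
    seen_ids = set()
    for event in event_log:
        if event["id"] in seen_ids:   # a replay may contain duplicates
            continue
        seen_ids.add(event["id"])
        counts[event["key"]] += event["delta"]
    return counts


log = [
    {"id": 1, "key": "orders", "delta": 1},
    {"id": 2, "key": "orders", "delta": 1},
    {"id": 2, "key": "orders", "delta": 1},   # duplicate delivery
    {"id": 3, "key": "refunds", "delta": 1},
]
rebuilt = rebuild_counts(log)
```

Because the rebuild deduplicates by event id, running it again over the same log produces the same counters, which is what makes replay a safe repair tool.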


10. Operability #

Ask:

  1. can I observe lag, contention, stale versions, duplicate claims, backlog, and drift?
  2. can operators repair the system safely?

Examples:

  1. consumer lag metrics
  2. lease churn metrics
  3. stale snapshot version distribution
  4. scheduler due-delay metrics
  5. DLQ size

Good phrasing:

This system needs strong observability because subtle correctness drift is more dangerous than obvious outages.
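
As an example of an operability signal, consumer lag can be derived from two offsets per partition. This is a sketch with made-up offsets; `consumer_lag` is a hypothetical helper, and a partition with no committed offset is assumed to start at 0.

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Per-partition lag: messages written but not yet committed as processed."""
    return {
        partition: end - committed_offsets.get(partition, 0)
        for partition, end in log_end_offsets.items()
    }


lag = consumer_lag(
    log_end_offsets={0: 1200, 1: 980, 2: 1500},
    committed_offsets={0: 1200, 1: 900},   # partition 2 has no commit yet
)
worst_partition = max(lag, key=lag.get)
```

Reporting lag per partition rather than as one total makes a single stuck partition visible instead of hiding it in an average.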


Default NFR subsets by infra system type #

Coordination / lock service #

Prioritize:

  1. correctness
  2. durability
  3. availability
  4. latency
  5. ordering

Typical statement:

Correctness dominates. I’d rather stall than let two holders act as leader.


Queue / log #

Prioritize:

  1. durability
  2. throughput
  3. ordering
  4. availability
  5. replayability

Typical statement:

Throughput and durability dominate; ordering is usually per partition, not global.


Scheduler / delayed jobs #

Prioritize:

  1. correctness
  2. durability
  3. freshness of due execution
  4. throughput
  5. recoverability

Typical statement:

I care that due work runs within a bounded delay and that retries do not cause uncontrolled duplicate executions.
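
One way to keep retries from producing duplicate executions is an atomic claim on the job run. The sketch below uses an in-memory SQLite database; the `job_runs` table, its columns, and `try_claim` are hypothetical names chosen for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE job_runs (run_id TEXT PRIMARY KEY, state TEXT, owner TEXT)")
db.execute("INSERT INTO job_runs VALUES ('run-1', 'DUE', NULL)")
db.commit()


def try_claim(conn, run_id: str, worker: str) -> bool:
    """Claim a due run; only the first conditional update changes a row."""
    cur = conn.execute(
        "UPDATE job_runs SET state = 'CLAIMED', owner = ? "
        "WHERE run_id = ? AND state = 'DUE'",
        (worker, run_id),
    )
    conn.commit()
    return cur.rowcount == 1


first = try_claim(db, "run-1", "worker-a")
second = try_claim(db, "run-1", "worker-b")   # retry or competing worker
owner = db.execute("SELECT owner FROM job_runs WHERE run_id = 'run-1'").fetchone()[0]
```

The second claim fails because the `state = 'DUE'` condition no longer matches. This controls duplicate claims only; a worker that crashes after claiming still needs a lease timeout or reconciliation scan, which is not shown.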


Cache / CDN / edge delivery #

Prioritize:

  1. latency
  2. availability
  3. freshness
  4. invalidation correctness
  5. throughput

Typical statement:

The main goal is low-latency serving, with bounded staleness and explicit invalidation behavior.


Config / feature flag / policy #

Prioritize:

  1. correctness
  2. freshness / propagation lag
  3. availability
  4. ordering
  5. operability

Typical statement:

Control-plane truth must be strongly consistent, and rollout/application must be monotonic even if data-plane snapshots lag slightly.
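
Monotonic application can be sketched as a client that refuses to move its version backwards. `ConfigClient` is a hypothetical name, and the sketch assumes the control plane assigns strictly increasing integer versions.

```python
class ConfigClient:
    """Applies snapshots only if they move the version forward."""

    def __init__(self):
        self.version = 0
        self.values = {}

    def apply(self, version: int, values: dict) -> bool:
        if version <= self.version:
            return False          # stale or duplicate snapshot: ignore
        self.version = version
        self.values = dict(values)
        return True


client = ConfigClient()
applied_v2 = client.apply(2, {"feature_x": True})
applied_v1_late = client.apply(1, {"feature_x": False})   # delayed delivery
```

The delayed version 1 snapshot is dropped, so a lagging or reordered delivery can make the client stale but never make it regress.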


Metrics / tracing #

Prioritize:

  1. throughput
  2. availability
  3. query latency
  4. freshness
  5. bounded loss or sampling policy

Typical statement:

Ingest throughput dominates; I may accept sampling or bounded loss for low-priority signals, but not for critical alerting paths.


Rate limiter / quota system #

Prioritize:

  1. correctness of budget enforcement
  2. latency
  3. availability
  4. tenant isolation
  5. throughput

Typical statement:

Decision latency must be very low, but I still need predictable budget enforcement, especially for hot tenants.


Simple interview template #

When asked for NFRs in an infra round, say:

  1. the correctness invariant
  2. the durability guarantee
  3. the availability preference
  4. the hot-path latency target
  5. the dominant throughput dimension
  6. any freshness or ordering requirement

Example:

The key invariant is that only one valid lock holder can act at a time. Lease state must be durable once acknowledged. I prefer correctness over availability under partition. Acquire and renew should be low latency. Throughput is modest and secondary to correctness. Ordering matters only per lock key.


What not to do #

  1. do not give one generic NFR set for all infra systems
  2. do not ignore ordering when the system is log- or version-based
  3. do not skip recoverability for projection or workflow-heavy infra systems
  4. do not say “high availability” without clarifying fail-open vs fail-closed
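  5. do not say “exactly once” unless you explain why it is an illusion (in practice, at-least-once delivery plus idempotent or deduplicated processing) and what the tradeoffs are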

Interview One-Liner #

For infra systems I start with correctness, durability, availability, latency, throughput, and freshness; then I add ordering, isolation, and replayability only when the contract requires them.