Infra NFR Cheat Sheet
Use this as the infrastructure-side counterpart to the product NFR notes.
For infra systems, the most important NFRs are usually not just latency and scale. They are often:
- correctness
- durability
- availability
- latency
- throughput
- freshness
Then, depending on the system:
- ordering
- isolation
- recoverability
- operability
1. Correctness / Consistency #
Ask:
- what must never be violated?
- what stale or conflicting state is unacceptable?
- what semantics are required: single-writer, no stale holder, monotonic config, per-key ordering?
Examples:
- only one valid lock holder at a time
- stale config must not override newer config
- no duplicate winner on a claim path
- consumer offset must not advance past work that has been durably processed
Good phrasing:
The main correctness invariant is X, and I’m willing to trade availability or latency before violating it.
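As an illustration of the "no stale holder" invariant, here is a minimal sketch of a fencing-token check. It assumes the lock service hands out a strictly increasing token with each grant; `FencedStore` and its fields are hypothetical names, not part of any specific system.

```python
class FencedStore:
    """Storage that rejects writes carrying a stale fencing token."""

    def __init__(self):
        self.highest_token = 0  # highest fencing token seen so far
        self.value = None

    def write(self, token: int, value: str) -> bool:
        # A holder whose lease expired still carries its old, smaller token.
        if token < self.highest_token:
            return False
        self.highest_token = token
        self.value = value
        return True


store = FencedStore()
store.write(token=1, value="from holder A")  # accepted
store.write(token=2, value="from holder B")  # accepted, B acquired later
stale = store.write(token=1, value="late write from A")  # rejected
```

The check lives in the resource being protected, so a paused or partitioned old holder cannot overwrite newer state even if it still believes it holds the lock.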
2. Durability #
Ask:
- what committed state must survive a crash?
- what can be reconstructed and what cannot?
Examples:
- committed queue message
- lock lease state
- config version
- job run state
- checkpoint / consumer progress
Good phrasing:
Once the system acknowledges X, that state must survive process and node failure.
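A minimal single-node sketch of "acknowledge only after the write is durable", assuming a plain append-only file as the log. `durable_append` and the log path are hypothetical. Surviving node failure additionally requires replication, which is not shown here.

```python
import os
import tempfile


def durable_append(path: str, record: bytes) -> bool:
    """Append one record and return True only after it is flushed to disk."""
    with open(path, "ab") as f:
        f.write(record + b"\n")
        f.flush()             # push Python's buffer to the OS
        os.fsync(f.fileno())  # ask the OS to push it to stable storage
    return True  # only now is it safe to acknowledge the caller


log_path = os.path.join(tempfile.mkdtemp(), "wal.log")
acked = durable_append(log_path, b"enqueue msg-1")
```

The ordering is the point: the acknowledgement is produced after `os.fsync`, never before it.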
3. Availability #
Ask:
- should the system fail closed or stay available under uncertainty?
- which paths can degrade?
- is partial service acceptable?
Examples:
- lock service may prefer correctness over availability
- metrics system may prefer availability over perfect completeness
- cache should stay highly available even if origin is degraded
Good phrasing:
For this path I prefer correctness over availability, so under uncertainty I would rather reject than violate the invariant.
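The fail-open versus fail-closed choice can be made explicit per path. The sketch below is illustrative only; `Policy` and `decide` are hypothetical names, and `None` stands in for "the backend could not be reached".

```python
from enum import Enum


class Policy(Enum):
    FAIL_CLOSED = "fail_closed"  # reject when state is uncertain
    FAIL_OPEN = "fail_open"      # allow when state is uncertain


def decide(backend_answer, policy: Policy) -> bool:
    """backend_answer is True/False, or None when the backend is unreachable."""
    if backend_answer is not None:
        return backend_answer
    return policy is Policy.FAIL_OPEN


lock_granted = decide(None, Policy.FAIL_CLOSED)   # lock path: reject under uncertainty
request_allowed = decide(None, Policy.FAIL_OPEN)  # best-effort path: stay available
```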
4. Latency #
Ask:
- which operations are on the hot path?
- which are interactive vs background?
Examples:
- lock acquire
- rate-limit decision
- cache lookup
- config evaluation
- enqueue
Good phrasing:
The hot path needs low p95 latency, while the background repair and projection paths can be slower.
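To make a p95 target concrete, here is a small sketch that computes percentiles with the nearest-rank method. The latency samples are made up for illustration, and `percentile` is a hypothetical helper rather than a library function.

```python
import math


def percentile(samples_ms, p):
    """Nearest-rank percentile: smallest sample with at least p% at or below it."""
    ordered = sorted(samples_ms)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Illustrative samples: mostly fast, with one slow outlier.
latencies_ms = [2, 3, 3, 4, 4, 5, 5, 6, 7, 250]
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
```

With these samples the median stays small while the p95 lands on the outlier, which is why hot-path targets are usually stated as tail percentiles rather than averages.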
5. Throughput #
Ask:
- what is the dominant volume: writes, reads, scans, fanout, ingest?
- where will capacity saturate first?
Examples:
- queue message ingest
- metrics sample ingest
- scheduler due-time scan
- cache QPS
Good phrasing:
The architecture should optimize the dominant throughput path first, which in this system is X.
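One rough way to answer "where will capacity saturate first?" is to compare offered load to capacity per stage. The stage names and numbers below are invented for illustration, and `first_bottleneck` is a hypothetical helper.

```python
def first_bottleneck(offered_per_s, capacity_per_s):
    """Return the stage with the highest utilization and that utilization."""
    utilization = {
        stage: offered_per_s[stage] / capacity_per_s[stage]
        for stage in offered_per_s
    }
    worst = max(utilization, key=utilization.get)
    return worst, utilization[worst]


# Illustrative numbers only.
offered = {"ingest": 50_000, "fanout": 120_000, "due_scan": 2_000}
capacity = {"ingest": 80_000, "fanout": 150_000, "due_scan": 10_000}
stage, util = first_bottleneck(offered, capacity)
```

In this made-up example the fanout stage is closest to saturation, so it is the path to optimize first even though ingest looks like the headline workload.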
6. Freshness / Propagation Lag #
Ask:
- how stale can applied or derived state be?
- is bounded lag acceptable?
- does freshness differ by path?
Examples:
- config propagation lag
- search index lag
- cache invalidation lag
- metrics query lag
- scheduler fire-time delay
Good phrasing:
I can tolerate bounded lag on the derived path, but not on the commit path.
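A bounded-lag requirement can be checked by comparing the version a reader has applied with the latest committed version. This is a sketch under assumed names (`Snapshot`, `propagation_lag`) and an assumed 5-second bound; timestamps are passed in explicitly so the example is deterministic.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    version: int
    committed_at: float  # seconds, when this version was committed at the source


def propagation_lag(latest: Snapshot, applied: Snapshot, now: float) -> float:
    """Seconds the reader has been behind; 0 if it already applied the latest."""
    if applied.version >= latest.version:
        return 0.0
    return now - latest.committed_at


latest = Snapshot(version=42, committed_at=100.0)
applied = Snapshot(version=41, committed_at=90.0)
lag_s = propagation_lag(latest, applied, now=103.5)
within_bound = lag_s <= 5.0  # assumed freshness bound for the derived path
```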
7. Ordering #
Ask:
- is order important?
- if yes, is it global, per key, per partition, or causal?
Examples:
- queue per-partition ordering
- config version monotonicity
- operation log apply order
- per-stream message ordering
Good phrasing:
I only need ordering per key, not globally, which lets me scale the system much more easily.
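Per-key ordering is commonly obtained by routing every event for a key to the same partition. The sketch below assumes a stable hash over the key; `partition_for` and the sample events are hypothetical. Python's built-in `hash()` is avoided because it is randomized per process for strings.

```python
import hashlib
from collections import defaultdict


def partition_for(key: str, num_partitions: int) -> int:
    """Stable hash so every event for a key lands on the same partition."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


# Hypothetical events: (key, sequence number assigned by the producer).
events = [("lock-a", 1), ("lock-b", 1), ("lock-a", 2), ("lock-b", 2), ("lock-a", 3)]

partitions = defaultdict(list)
for key, seq in events:
    partitions[partition_for(key, num_partitions=4)].append((key, seq))

# Collect each key's sequence numbers in the order its partition stored them.
per_key_order = {
    key: [s for part in partitions.values() for k, s in part if k == key]
    for key in ("lock-a", "lock-b")
}
```

No ordering is promised across keys, which is what allows partitions to be consumed in parallel.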
8. Isolation / Multi-Tenancy #
Ask:
- can one tenant or hot key overload others?
- do I need quotas or fairness?
Examples:
- per-tenant rate limiting
- noisy tenants in a scheduler
- a hot customer on a queue partition
- config fanout to many clients from one tenant
Good phrasing:
I need per-tenant isolation so one noisy customer does not consume the entire system budget.
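One way to sketch per-tenant isolation is a token bucket kept separately for each tenant. `TenantBuckets` is a hypothetical class; the rate and burst values are arbitrary, and the clock is passed in so the behavior is deterministic.

```python
class TenantBuckets:
    """Independent token bucket per tenant, so budgets are not shared."""

    def __init__(self, rate_per_s: float, burst: float):
        self.rate = rate_per_s
        self.burst = burst
        self.state = {}  # tenant -> (tokens, last_refill_time)

    def allow(self, tenant: str, now: float) -> bool:
        tokens, last = self.state.get(tenant, (self.burst, now))
        # Refill for the elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self.state[tenant] = (tokens - 1.0, now)
            return True
        self.state[tenant] = (tokens, now)
        return False


limiter = TenantBuckets(rate_per_s=1.0, burst=2.0)
noisy = [limiter.allow("noisy", now=0.0) for _ in range(5)]
quiet = limiter.allow("quiet", now=0.0)
```

The noisy tenant exhausts only its own bucket; the quiet tenant's first request is still admitted.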
9. Recoverability / Replayability #
Ask:
- can I rebuild from source truth?
- can I replay from a log?
- can I reconcile missing side effects?
Examples:
- replay queue/log
- reindex from source truth
- rebuild derived counters
- reconciliation scan over workflow state
Good phrasing:
I want recovery to come from canonical truth or replayable history rather than manual repair.
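Rebuilding a derived counter from replayable history might look like the sketch below. The event shape (`type`, `queue`) and `rebuild_depth` are assumptions made for illustration.

```python
from collections import Counter


def rebuild_depth(event_log):
    """Recompute per-queue depth purely from the log, ignoring any cached value."""
    depth = Counter()
    for event in event_log:
        if event["type"] == "enqueued":
            depth[event["queue"]] += 1
        elif event["type"] == "acked":
            depth[event["queue"]] -= 1
    return depth


# Hypothetical log contents.
log = [
    {"type": "enqueued", "queue": "emails"},
    {"type": "enqueued", "queue": "emails"},
    {"type": "acked", "queue": "emails"},
    {"type": "enqueued", "queue": "billing"},
]
rebuilt = rebuild_depth(log)
```

Because the function depends only on the log, running it again after a crash or a corrupted counter yields the same result, with no manual repair.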
10. Operability #
Ask:
- can I observe lag, contention, stale versions, duplicate claims, backlog, and drift?
- can operators repair the system safely?
Examples:
- consumer lag metrics
- lease churn metrics
- stale snapshot version distribution
- scheduler due-delay metrics
- DLQ size
Good phrasing:
This system needs strong observability because subtle correctness drift is more dangerous than obvious outages.
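As an example of one of these signals, consumer lag per partition can be derived from the log end offset and the committed consumer offset. The offsets and the alert threshold below are invented, and `consumer_lag` is a hypothetical helper.

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Messages written but not yet committed by the consumer, per partition."""
    return {
        partition: end - committed_offsets.get(partition, 0)
        for partition, end in log_end_offsets.items()
    }


# Illustrative offsets only.
lag = consumer_lag({0: 120, 1: 95}, {0: 118, 1: 40})
lagging = [p for p, n in lag.items() if n > 50]  # assumed alert threshold
```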
Default NFR subsets by infra system type #
Coordination / lock service #
Prioritize:
- correctness
- durability
- availability
- latency
- ordering
Typical statement:
Correctness dominates. I’d rather stall than let two holders act as leader.
Queue / log #
Prioritize:
- durability
- throughput
- ordering
- availability
- replayability
Typical statement:
Throughput and durability dominate; ordering is usually per partition, not global.
Scheduler / delayed jobs #
Prioritize:
- correctness
- durability
- freshness of due execution
- throughput
- recoverability
Typical statement:
I care that due work runs within a bounded delay and that retries do not create uncontrolled duplicate executions.
Cache / CDN / edge delivery #
Prioritize:
- latency
- availability
- freshness
- invalidation correctness
- throughput
Typical statement:
The main goal is low-latency serving, with bounded staleness and explicit invalidation behavior.
Config / feature flag / policy #
Prioritize:
- correctness
- freshness / propagation lag
- availability
- ordering
- operability
Typical statement:
The control-plane source of truth must be strongly consistent, and rollout/application must be monotonic even if data-plane snapshots lag slightly.
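Monotonic application on the data plane can be sketched as "apply a snapshot only if its version is newer than the one already applied". `ConfigClient` is a hypothetical name, and the versions are assumed to be assigned by the control plane.

```python
class ConfigClient:
    """Data-plane client that never moves backwards in config version."""

    def __init__(self):
        self.version = 0
        self.values = {}

    def apply(self, version: int, values: dict) -> bool:
        if version <= self.version:
            return False  # stale or duplicate snapshot; keep the newer state
        self.version, self.values = version, dict(values)
        return True


client = ConfigClient()
client.apply(7, {"feature_x": True})
late = client.apply(6, {"feature_x": False})  # delayed older snapshot is ignored
```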
Metrics / tracing #
Prioritize:
- throughput
- availability
- query latency
- freshness
- bounded loss or sampling policy
Typical statement:
Ingest throughput dominates; I may accept sampling or bounded loss for low-priority signals, but not for critical alerting paths.
Rate limiter / quota system #
Prioritize:
- correctness of budget enforcement
- latency
- availability
- tenant isolation
- throughput
Typical statement:
Decision latency must be very low, but I still need predictable budget enforcement, especially for hot tenants.
Simple interview template #
When asked for NFRs in an infra round, say:
- the correctness invariant
- the durability guarantee
- the availability preference
- the hot-path latency target
- the dominant throughput dimension
- any freshness or ordering requirement
Example:
The key invariant is that only one valid lock holder can act at a time. Lease state must be durable once acknowledged. I prefer correctness over availability under partition. Acquire and renew should be low latency. Throughput is modest and secondary to correctness. Ordering matters only per lock key.
What not to do #
- do not give one generic NFR set for all infra systems
- do not ignore ordering when the system is log- or version-based
- do not skip recoverability for projection or workflow-heavy infra systems
- do not say “high availability” without clarifying fail-open vs fail-closed
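- do not say “exactly once” unless you explain that it is an illusion, typically built from at-least-once delivery plus idempotent or deduplicated processing, and what that costs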
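The "exactly once" point in the last item can be illustrated with a deduplicating handler on top of at-least-once delivery. This is a sketch; `IdempotentHandler` and the message IDs are hypothetical. In a real system the dedup record and the side effect would need to be committed atomically, which this in-memory version glosses over.

```python
class IdempotentHandler:
    """Applies each message ID at most once, even if it is delivered repeatedly."""

    def __init__(self):
        self.seen = set()  # IDs whose effect has already been applied
        self.balance = 0

    def handle(self, message_id: str, amount: int) -> bool:
        if message_id in self.seen:
            return False  # redelivery: acknowledge without reapplying
        self.seen.add(message_id)
        self.balance += amount
        return True


handler = IdempotentHandler()
# At-least-once delivery: "m1" arrives twice after a retry.
results = [handler.handle(mid, 10) for mid in ["m1", "m2", "m1"]]
```

The tradeoff to name in an interview is the dedup state itself: it must be stored durably, bounded or expired somehow, and kept consistent with the effect it guards.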
Interview One-Liner #
For infra systems I start with correctness, durability, availability, latency, throughput, and freshness; then I add ordering, isolation, and replayability only when the contract requires them.