Infra NFR Cheat Sheet #

Use this as the infrastructure-side counterpart to the product NFR notes.

For infra systems, the most important NFRs are usually not just latency and scale. They are often:

correctness
durability
availability
latency
throughput
freshness

Then, depending on the system:

ordering
isolation
recoverability
operability

1. Correctness / Consistency #

Ask:

what must never be violated?
what stale or conflicting state is unacceptable?
what semantics are required: single-writer, no stale holder, monotonic config, per-key ordering?

Examples:

only one valid lock holder at a time
stale config must not override newer config
no duplicate winner on a claim path
consumer offset must not advance past work that has not been durably processed

Good phrasing:

The main correctness invariant is X, and I’m willing to trade availability or latency before violating it.
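One common way to enforce the "no stale holder" invariant is a fencing token: each lock grant carries a monotonically increasing number, and the protected resource rejects writes that carry an older one. The sketch below is a minimal illustration under that assumption; `FencedStore` and its fields are hypothetical names, not part of any specific lock service.

```python
class FencedStore:
    """Toy storage that rejects writes from stale lock holders."""

    def __init__(self):
        self.highest_token = 0  # highest fencing token accepted so far
        self.value = None

    def write(self, token: int, value: str) -> bool:
        # A lower token means the lease has since been granted to someone
        # else, so this writer is a stale holder and must be rejected.
        if token < self.highest_token:
            return False
        self.highest_token = token
        self.value = value
        return True


store = FencedStore()
store.write(1, "from holder A")                  # accepted
store.write(2, "from holder B")                  # accepted, B acquired later
stale_ok = store.write(1, "late write from A")   # rejected: token 1 < 2
```

The check lives in the resource rather than in the client, so a paused or partitioned holder cannot violate the invariant just because it still believes it owns the lock.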

2. Durability #

Ask:

what committed state must survive crash?
what can be reconstructed and what cannot?

Examples:

committed queue message
lock lease state
config version
job run state
checkpoint / consumer progress

Good phrasing:

Once the system acknowledges X, that state must survive process and node failure.
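As a rough sketch of "acknowledge only after the state is durable", the snippet below appends a record to a local log and returns the ack only after `os.fsync`. The function names and the record format are assumptions for illustration. A local fsync only covers process crash and restart of the same node; surviving node loss would additionally need replication, which is not shown.

```python
import os
import tempfile


def durable_append(path: str, record: bytes) -> bool:
    """Append one record and return True (the ack) only after fsync."""
    with open(path, "ab") as f:
        f.write(record + b"\n")
        f.flush()              # push Python's buffer to the OS
        os.fsync(f.fileno())   # ask the OS to persist to stable storage
    return True                # only now is it safe to acknowledge


def recover(path: str) -> list:
    """Re-read every acknowledged record after a restart."""
    with open(path, "rb") as f:
        return f.read().splitlines()


log_path = os.path.join(tempfile.mkdtemp(), "wal.log")
acked = durable_append(log_path, b"job-42:RUNNING")
```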

3. Availability #

Ask:

should the system fail closed or stay available under uncertainty?
which paths can degrade?
is partial service acceptable?

Examples:

lock service may prefer correctness over availability
metrics system may prefer availability over perfect completeness
cache should stay highly available even if origin is degraded

Good phrasing:

For this path I prefer correctness over availability, so under uncertainty I would rather reject than violate the invariant.
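The fail-closed versus fail-open choice can be made explicit per path. The sketch below assumes a hypothetical `decide` helper where `None` stands for "the backing store could not give a trustworthy answer"; it is an illustration of the policy, not a real client API.

```python
from typing import Optional


def decide(store_says_ok: Optional[bool], fail_closed: bool) -> bool:
    """Return True to allow the operation.

    store_says_ok is None when the answer is uncertain, for example
    because the backing store is unreachable during a partition.
    """
    if store_says_ok is None:
        # Under uncertainty: fail-closed rejects, fail-open allows.
        return not fail_closed
    return store_says_ok


lock_acquire_allowed = decide(None, fail_closed=True)     # False: reject
metrics_write_allowed = decide(None, fail_closed=False)   # True: degrade
```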

4. Latency #

Ask:

which operations are on the hot path?
which are interactive vs background?

Examples:

lock acquire
rate-limit decision
cache lookup
config evaluation
enqueue

Good phrasing:

The hot path needs low p95 latency, while the background repair and projection paths can be slower.
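To make "p95" concrete, here is a small nearest-rank percentile sketch. The sample values are made up; the point is only that a single slow request dominates the tail while barely moving the median.

```python
import math


def percentile(samples_ms, p):
    """Nearest-rank percentile: smallest sample with at least p% at or below it."""
    ordered = sorted(samples_ms)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


hot_path_ms = [2, 3, 3, 4, 4, 5, 5, 6, 7, 40]  # hypothetical latencies
p50 = percentile(hot_path_ms, 50)  # 4
p95 = percentile(hot_path_ms, 95)  # 40
```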

5. Throughput #

Ask:

what is the dominant volume: writes, reads, scans, fanout, ingest?
where will capacity saturate first?

Examples:

queue message ingest
metrics sample ingest
scheduler due-time scan
cache QPS

Good phrasing:

The architecture should optimize the dominant throughput path first, which in this system is X.
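A quick back-of-envelope calculation can show which dimension saturates first. All figures below are invented for illustration; substitute measured demand and per-node capacity for a real system.

```python
# All numbers are made-up assumptions for illustration.
demand = {"ingest_msgs_per_s": 200_000, "reads_per_s": 50_000}
capacity_per_node = {"ingest_msgs_per_s": 25_000, "reads_per_s": 40_000}


def nodes_needed(demand, capacity_per_node):
    # Ceiling division per dimension: how many nodes each path requires.
    return {k: -(-demand[k] // capacity_per_node[k]) for k in demand}


needed = nodes_needed(demand, capacity_per_node)
# The dimension that needs the most nodes is the one that saturates first.
bottleneck = max(needed, key=needed.get)
```

With these numbers ingest needs 8 nodes and reads need 2, so ingest is the dominant throughput path.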

6. Freshness / Propagation Lag #

Ask:

how stale can applied or derived state be?
is bounded lag acceptable?
does freshness differ by path?

Examples:

config propagation lag
search index lag
cache invalidation lag
metrics query lag
scheduler fire-time delay

Good phrasing:

I can tolerate bounded lag on the derived path, but not on the commit path.
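Bounded lag on a derived path can be expressed as an explicit staleness check. This is a minimal sketch with hypothetical names (`Snapshot`, `read_path`); time is passed in as a parameter so the behavior is deterministic.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    version: int
    applied_at: float  # seconds, on the same clock as `now`


def read_path(snapshot: Snapshot, now: float, max_staleness_s: float) -> str:
    """Serve from the derived snapshot only while it is within the lag bound."""
    if now - snapshot.applied_at <= max_staleness_s:
        return "serve-snapshot"
    return "refresh-from-source"


snap = Snapshot(version=7, applied_at=100.0)
within_bound = read_path(snap, now=103.0, max_staleness_s=5.0)
beyond_bound = read_path(snap, now=110.0, max_staleness_s=5.0)
```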

7. Ordering #

Ask:

is order important?
if yes, is it global, per key, per partition, or causal?

Examples:

queue per-partition ordering
config version monotonicity
operation log apply order
per-stream message ordering

Good phrasing:

I only need ordering per key, not globally, which lets me scale the system much more easily.
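Per-key ordering is typically obtained by routing every event for a key to the same partition and relying on order within that partition. The sketch below assumes a simple hash-mod scheme; `partition_for` is a hypothetical helper, and real systems may use a different partitioner.

```python
import hashlib


def partition_for(key: str, num_partitions: int) -> int:
    """Stable hash so every event for a key lands on the same partition."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


events = [("user-1", "create"), ("user-2", "create"), ("user-1", "update")]
partitions = {}
for key, op in events:
    # Appending in arrival order preserves per-key order inside a partition.
    partitions.setdefault(partition_for(key, 4), []).append((key, op))

user1_ops = [op for k, op in partitions[partition_for("user-1", 4)] if k == "user-1"]
```

Different keys may interleave arbitrarily across partitions, which is exactly the global ordering being given up in exchange for scale.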

8. Isolation / Multi-Tenancy #

Ask:

can one tenant or hot key overload others?
do I need quotas or fairness?

Examples:

rate limiter by tenant
scheduler noisy tenants
queue partition hot customer
config fanout to many clients from one tenant

Good phrasing:

I need per-tenant isolation so one noisy customer does not consume the entire system budget.
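One simple way to get per-tenant isolation is a separate token bucket per tenant. The sketch below is illustrative only: `TenantBuckets` is a hypothetical in-memory class, the rate and burst values are arbitrary, and time is passed in explicitly to keep the example deterministic.

```python
class TenantBuckets:
    """One token bucket per tenant, so budgets are not shared."""

    def __init__(self, rate_per_s: float, burst: float):
        self.rate = rate_per_s
        self.burst = burst
        self.state = {}  # tenant -> (tokens, last_seen_time)

    def allow(self, tenant: str, now: float) -> bool:
        tokens, last = self.state.get(tenant, (self.burst, now))
        # Refill for the elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self.state[tenant] = (tokens - 1, now)
            return True
        self.state[tenant] = (tokens, now)
        return False


limiter = TenantBuckets(rate_per_s=1, burst=2)
noisy = [limiter.allow("noisy", now=0.0) for _ in range(5)]
quiet = limiter.allow("quiet", now=0.0)
```

The noisy tenant exhausts its own burst after two requests, while the quiet tenant is still admitted.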

9. Recoverability / Replayability #

Ask:

can I rebuild from source truth?
can I replay from a log?
can I reconcile missing side effects?

Examples:

replay queue/log
reindex from source truth
rebuild derived counters
reconciliation scan over workflow state

Good phrasing:

I want recovery to come from canonical truth or replayable history rather than manual repair.
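Rebuilding a derived view from replayable history might look like the sketch below. The event shape (`type`, `key`) and the counter semantics are assumptions made up for the example.

```python
from collections import Counter


def rebuild_counters(event_log):
    """Recompute a derived view purely from the replayable log."""
    counts = Counter()
    for event in event_log:
        if event["type"] == "added":
            counts[event["key"]] += 1
        elif event["type"] == "removed":
            counts[event["key"]] -= 1
    return counts


log = [
    {"type": "added", "key": "queue-a"},
    {"type": "added", "key": "queue-a"},
    {"type": "removed", "key": "queue-a"},
    {"type": "added", "key": "queue-b"},
]
drifted_view = {"queue-a": 5}            # derived state that has drifted
repaired_view = rebuild_counters(log)    # discard it and replay instead
```

Because the log is the source of truth, the drifted view is simply thrown away rather than patched by hand.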

10. Operability #

Ask:

can I observe lag, contention, stale versions, duplicate claims, backlog, and drift?
can operators repair the system safely?

Examples:

consumer lag metrics
lease churn metrics
stale snapshot version distribution
scheduler due-delay metrics
DLQ size

Good phrasing:

This system needs strong observability because subtle correctness drift is more dangerous than obvious outages.
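As an example of one such signal, consumer lag can be computed per partition as the gap between the log end offset and the committed offset. The function name, the offsets, and the alert threshold below are all illustrative assumptions.

```python
def consumer_lag(log_end_offsets: dict, committed_offsets: dict) -> dict:
    """Per-partition lag = latest written offset minus committed consumer offset."""
    return {
        p: log_end_offsets[p] - committed_offsets.get(p, 0)
        for p in log_end_offsets
    }


lag = consumer_lag({0: 1200, 1: 950}, {0: 1195, 1: 400})
# The threshold of 100 is an arbitrary value chosen for the example.
alerts = [p for p, n in lag.items() if n > 100]
```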

Default NFR subsets by infra system type #

Coordination / lock service #

Prioritize:

correctness
durability
availability
latency
ordering

Typical statement:

Correctness dominates. I’d rather stall than let two holders act as leader.

Queue / log #

Prioritize:

durability
throughput
ordering
availability
replayability

Typical statement:

Throughput and durability dominate; ordering is usually per partition, not global.

Scheduler / delayed jobs #

Prioritize:

correctness
durability
freshness of due execution
throughput
recoverability

Typical statement:

I care that due work runs within a bounded delay and that retries do not cause uncontrolled duplicate executions.

Cache / CDN / edge delivery #

Prioritize:

latency
availability
freshness
invalidation correctness
throughput

Typical statement:

The main goal is low-latency serving, with bounded staleness and explicit invalidation behavior.

Config / feature flag / policy #

Prioritize:

correctness
freshness / propagation lag
availability
ordering
operability

Typical statement:

Control-plane truth must be strongly consistent, and rollout and application must be monotonic even if data-plane snapshots lag slightly.

Metrics / tracing #

Prioritize:

throughput
availability
query latency
freshness
bounded loss or sampling policy

Typical statement:

Ingest throughput dominates; I may accept sampling or bounded loss for low-priority signals, but not for critical alerting paths.

Rate limiter / quota system #

Prioritize:

correctness of budget enforcement
latency
availability
tenant isolation
throughput

Typical statement:

Decision latency must be very low, but I still need predictable budget enforcement, especially for hot tenants.

Simple interview template #

When asked for NFRs in an infra round, say:

the correctness invariant
the durability guarantee
the availability preference
the hot-path latency target
the dominant throughput dimension
any freshness or ordering requirement

Example:

The key invariant is that only one valid lock holder can act at a time. Lease state must be durable once acknowledged. I prefer correctness over availability under partition. Acquire and renew should be low latency. Throughput is modest and secondary to correctness. Ordering matters only per lock key.

What not to do #

do not give one generic NFR set for all infra systems
do not ignore ordering when the system is log- or version-based
do not skip recoverability for projection or workflow-heavy infra systems
do not say “high availability” without clarifying fail-open vs fail-closed
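do not say “exactly once” unless you explain that it is an illusion built from at-least-once delivery plus idempotent or deduplicated processing, and what that tradeoff costs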
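The "exactly once" caveat is easier to explain with a concrete shape: delivery is at-least-once, and the consumer makes the effect idempotent by remembering message IDs. The sketch below is a toy in-memory version with hypothetical names; in a real system the dedupe record and the side effect would need to be committed atomically, which is where much of the cost comes from.

```python
class IdempotentConsumer:
    """Applies each message's effect at most once, keyed by message ID."""

    def __init__(self):
        self.seen_ids = set()
        self.balance = 0

    def handle(self, message_id: str, amount: int) -> bool:
        if message_id in self.seen_ids:
            return False              # duplicate delivery: skip the effect
        self.balance += amount        # the side effect
        self.seen_ids.add(message_id)
        return True


consumer = IdempotentConsumer()
# At-least-once delivery: message "m1" arrives twice after a retry.
deliveries = [("m1", 10), ("m2", 5), ("m1", 10)]
applied = [consumer.handle(mid, amt) for mid, amt in deliveries]
```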

Interview One-Liner #

For infra systems I start with correctness, durability, availability, latency, throughput, and freshness; then I add ordering, isolation, and replayability only when the contract requires them.