Infra Archetype NFR Practice Worksheet #

Use this as the infra-first counterpart to the broader mixed prompt worksheet.

This note is intentionally organized by I01-I21, not by product prompt.

The goal is:

- start from the archetype
- calculate the first load-bearing variables
- identify the first scaling bottleneck
- then run a short cross-cutting NFR pass

This should be used with:

How To Use #

For a prompt:

- classify to the dominant infra archetype
- steal the sample-prompt row below
- calculate the listed variables before drawing boxes
- state the first pressure check out loud
- finish with the cross-cutting questions in the last column

If the system is hybrid:

- pick the dominant row first
- then import one adjacent row
- do not average them into a generic worksheet

Archetype Worksheet #

Archetype	Sample prompts	First NFRs to calculate	Starter formulas	First pressure check	Cross-cutting follow-up
`I01 Coordination / Consensus Metadata`	metadata service, leader election service, config store	quorum write TPS, watch fanout, election rate, session renew rate	`quorum_write_RPS = metadata_mutations_per_sec`; `watch_delivery_RPS = watchers*updates_per_sec`; `renew_RPS = active_sessions/lease_seconds`	quorum RTT or hot metadata key	fail closed or open under quorum loss; max stale watch gap; compaction/replay window; blast radius of leader failover
`I02 Claim / Lease / Exclusive Ownership`	lock service, lease manager, shard owner election	claim TPS, renew rate, hot-key contention, stale-holder fence checks	`claim_RPS = claim_attempts_per_sec`; `renew_RPS = active_leases/lease_seconds`; `contention_RPS = claim_RPS*hot_key_share`	hot claim key or renew storm	fence token required at which downstream write; duplicate-owner damage if fencing fails; reclaim delay budget; false-expiry tolerance
`I03 Due-Time Release + Claimable Run`	cron scheduler, delayed job queue, reminder service	due burst RPS, runnable backlog, lateness SLA, claim TPS	`due_RPS = jobs_due_in_peak_window/window_seconds`; `backlog_seconds = runnable_queue_depth/worker_claim_rate`; `claim_RPS = due_RPS*(1+retry_factor)`	due-time burstiness or claim backlog	acceptable fire-time delay; duplicate run tolerance; clock-skew budget; replay after scheduler crash
`I04 Frontier Scan + Claimable Run`	web crawler, batch scanner, ETL sweep, compliance scanner	frontier claim rate, scan coverage rate, checkpoint cadence, rediscovery rate	`claim_RPS = workers*batches_per_sec`; `coverage_seconds = frontier_items/scan_rate`; `checkpoint_WPS = workers/checkpoint_interval_sec`	frontier contention or checkpoint lag	progress cursor truth; resumability after crash; duplicate-scan budget; freshness vs scan cost
`I05 Append Log + Consumer Progress`	pub/sub, event bus, durable queue, commit log	append throughput, partition hotness, consumer lag, replay window	`append_Bps = append_RPS*avg_record_bytes`; `partition_write_RPS = append_RPS*hot_partition_share`; `lag_seconds = consumer_backlog/consume_rate`	hot partition or replay lag	ordering scope per partition or key; commit-progress durability; replay cost after consumer loss; backpressure on slow consumers
`I06 Projection / Index / Search Pipeline`	search index, materialized view, feed projector, read model builder	source ingest RPS, projection fanout, index lag, rebuild throughput	`projection_WPS = source_WPS*avg_projection_updates`; `lag_seconds = queued_updates/projector_rate`; `rebuild_seconds = corpus_bytes/rebuild_Bps`	write amplification or projector lag	rebuildability from source truth; freshness bound for queries; backfill isolation from live traffic; correctness under out-of-order events
`I07 Cache / Origin Projection / Edge Delivery`	distributed cache, CDN metadata cache, API response cache	read RPS, hit ratio, miss storm size, invalidation rate, memory footprint	`origin_RPS = read_RPS*(1-hit_ratio)`; `miss_burst_RPS = read_RPS*simultaneous_expiry_share`; `memory_bytes = keys*avg_entry_bytes`	hot key or cache miss storm	stale-read tolerance; invalidation propagation bound; stampede control; fail-open vs fail-through on origin error
`I08 Traffic Shaping / Admission Control`	rate limiter, quota manager, overload shedder, concurrency limiter	evaluator RPS, budget update rate, hot-tenant rate, queue/admit depth	`eval_RPS = requests_per_sec`; `budget_WPS = eval_RPS*enforced_fraction`; `hot_tenant_RPS = eval_RPS*top_tenant_share`; `queue_wait_seconds = queued_requests/admit_rate`	evaluator hot path or hot budget key	fairness unit per tenant/user/request class; fail-open vs fail-closed; budget propagation lag; overload behavior under partial outages
`I09 Sequence / Identifier Generation`	snowflake service, monotonic ticket dispenser, order-number allocator	ID allocation RPS, block lease rate, worker-id pool pressure, skew window	`id_RPS = ids_per_sec`; `block_lease_RPS = id_RPS/block_size`; `worker_pool_util = active_generators/worker_id_pool`; `rollback_risk_window = max_clock_skew_sec`	allocator hotspot or clock rollback	uniqueness vs monotonicity requirement; gap tolerance; allocator failover semantics; epoch rollover handling
`I10 Membership / Presence / Registry`	service registry, presence system, node registry	registration rate, heartbeat fan-in, false-death window, watch fanout	`heartbeat_RPS = members/heartbeat_interval_sec`; `expiry_scan_RPS = members/expiry_scan_interval_sec`; `watch_push_RPS = watchers*membership_changes_per_sec`	heartbeat fan-in or watch fanout	false-death budget; lookup freshness bound; ghost-member cleanup; anti-entropy after missed watches
`I11 Control Plane + Snapshot Distribution`	feature flag platform, config distribution, xDS-like control plane	config mutate RPS, publication fanout, apply lag, snapshot bytes	`fanout_RPS = targets*updates_per_sec`; `update_Bps = fanout_RPS*avg_snapshot_bytes`; `convergence_seconds = targets/apply_ack_rate`	config fanout or slow apply convergence	monotonic version rule; rollback budget; partial rollout detection; target behavior on control-plane partition
`I12 Workflow + External Side Effect`	payment workflow, webhook delivery engine, provisioning workflow	transition TPS, side-effect RPS, retry rate, stuck-workflow cardinality	`transition_RPS = active_entities*transitions_per_day/86400`; `effect_RPS = transition_RPS*(1+retry_factor)`; `stuck_items = workflows_in_nonterminal_state*stuck_fraction`	side-effect latency or retry amplification	idempotency surface; reconciliation scan cadence; exactly-once vs at-least-once effect semantics; poison workflow handling
`I13 Shared Subject Coordination`	collaborative editor, whiteboard, shared cursor state	ops per subject, concurrent editors, broadcast fanout, snapshot cadence	`op_RPS = editors_per_subject*ops_per_editor_sec`; `broadcast_RPS = op_RPS*active_subscribers`; `replay_ops = ops_since_snapshot`	single-subject coordinator or replay cost	ordering model per subject; merge/conflict semantics; late join replay budget; subject hotspot isolation
`I14 Immutable Artifact Namespace + Delivery`	artifact registry, blob/object store namespace, image/package distribution	publish RPS, fetch throughput, metadata RPS, GC backlog	`publish_RPS = artifacts_published_per_sec`; `fetch_Bps = downloads_per_sec*avg_artifact_bytes`; `gc_backlog_bytes = unreferenced_bytes_pending_collection`	metadata namespace hotspot or origin bandwidth	immutability guarantee; rollback by pointer or republish; retention/GC safety; cross-region replication lag
`I15 Execution Fleet + Worker Substrate`	serverless runtime, CI runner fleet, GPU job fleet, remote execution	arrival RPS, concurrent executions, worker-slot count, heartbeat rate, cold-start rate	`concurrency = arrival_RPS*avg_run_seconds`; `slot_count = concurrency/utilization_target`; `heartbeat_RPS = active_workers/heartbeat_interval_sec`; `cold_start_RPS = launches_per_sec*cold_start_fraction`	worker saturation, placement contention, or heartbeat fan-in	placement truth and fencing; preemption semantics; reclaim lag after worker loss; admission vs queueing under overload
`I16 Key-Scoped Mutable State / Replicated KV`	key-value store, session store, profile store, cart store	read RPS, write RPS, hot-key share, replication bandwidth, compaction/write amplification	`repl_Bps = write_RPS*avg_write_bytes*replication_factor`; `hot_key_RPS = read_RPS*top_key_share`; `storage_write_amp = physical_write_Bps/logical_write_Bps`	leader hotspot, hot key, or replication pressure	consistency level per key; failover read/write semantics; anti-entropy/repair budget; compaction impact on tail latency
`I17 Traffic Steering / Request Mediation Plane`	API gateway, load balancer, service mesh router, WAF	ingress RPS, active connections, route table size, health-check fanout, retry amplification	`conn_count = clients*avg_open_conns`; `health_RPS = backends*checks_per_sec`; `retry_RPS = ingress_RPS*retry_fraction`; `policy_eval_RPS = ingress_RPS*rules_checked_per_request`	hot VIP, connection table pressure, or retry amplification	route-config freshness; drain semantics; fail-open vs fail-closed policy checks; tail latency under retries and outlier ejection
`I18 Telemetry / Time-Series Pipeline`	metrics system, alerting platform, logs pipeline, infra monitoring	ingest RPS, active series/cardinality, rule eval fanout, query scan bytes, retention bytes	`ingest_RPS = emitters*samples_per_sec`; `active_series = emitters*metrics_per_emitter*label_cardinality_factor`; `query_scan_Bps = queried_points_per_sec*bytes_per_point`; `retention_bytes = ingest_Bps*retention_seconds`	high-cardinality blowup, ingest fan-in, or query lag	telemetry must not destabilize workload; sampling/drop policy under overload; alert delay budget; retention vs cost tradeoff
`I19 Replicated Chunk / Block / File Storage Substrate`	distributed file system, block store, chunk store, object-storage substrate	metadata ops RPS, chunk throughput, repair bandwidth, placement skew, rebuild time	`chunk_Bps = io_ops_per_sec*avg_chunk_bytes`; `repair_Bps = lost_replica_bytes/repair_window_sec`; `rebuild_seconds = lost_bytes/effective_repair_Bps`; `metadata_RPS = namespace_ops_per_sec`	metadata hotspot, repair bandwidth, or hot chunk	replica placement policy; durability target after correlated failure; degraded-read performance; background repair budget vs foreground IO
`I20 Computation / Dataflow / DAG Execution`	MapReduce/Spark/Flink-like engine, DAG scheduler, streaming dataflow engine	input throughput, task concurrency, shuffle bytes, checkpoint bytes, output commit rate, watermark lag	`task_concurrency = input_partitions*avg_parallelism_per_partition`; `shuffle_Bps = records_per_sec*avg_record_bytes*shuffle_fanout`; `checkpoint_Bps = state_bytes/checkpoint_interval_sec`; `watermark_lag_seconds = event_time_now - watermark_time`	shuffle pressure, checkpoint I/O, hot key, or scheduler bottleneck	stale attempt commit guard; exactly-once sink boundary; backpressure behavior; recovery time from latest checkpoint
`I21 Trust Boundary / Cryptographic Proof Substrate`	workload identity platform, artifact signing/provenance, trust-bundle distribution, revocation service	verification RPS, signing RPS, trust-bundle fanout, revocation freshness, audit write throughput	`verify_RPS = protected_requests_per_sec`; `sign_RPS = issued_statements_per_sec`; `trust_bundle_fanout_RPS = verifiers*bundle_updates_per_sec`; `audit_WPS = verification_events_per_sec*audit_sample_rate`	verifier hot path, stale revocation, or trust-bundle propagation lag	credential TTL; revocation freshness SLA; issuer compromise blast radius; audit retention and tamper evidence

Cross-Cutting NFR Pass #

After you do the archetype row, force this short pass.

1. Correctness / Consistency #

Ask:

- what invariant is load-bearing here?
- what stale actor, stale version, or duplicate effect is unacceptable?
- what is the minimum consistency scope: key, partition, subject, workflow, quorum?

2. Availability / Failure Policy #

Ask:

- should this path fail closed or fail open under uncertainty?
- what is allowed to degrade independently?
- what is the smallest useful partial service mode?

3. Durability / Recoverability #

Ask:

- what acknowledged state must survive a crash?
- what can be rebuilt from source truth or replay?
- what is the acceptable loss window for transient buffers, caches, snapshots, or telemetry?

4. Tail Latency / Freshness #

Ask:

- which path needs low p95 or p99?
- where is bounded staleness acceptable?
- is freshness measured in milliseconds, seconds, minutes, or rollout waves?

5. Isolation / Backpressure #

Ask:

- can one tenant, hot key, or hot subject overload others?
- where does backpressure appear first?
- what is the admission or shedding rule when capacity is exhausted?

6. Cost / Repair Budget #

Ask:

- what background work grows with success: replay, compaction, repair, scan, watch fanout, checkpointing?
- what budget do you reserve for non-foreground work?
- what gets worse first when repair falls behind?

Quick Archetype-to-Cross-Cutting Emphasis #

Use these as the first extra questions after the main row.

Archetype	Cross-cutting emphasis
`I01`	correctness, quorum availability, watch freshness
`I02`	fencing correctness, reclaim lag, fail-closed semantics
`I03`	lateness SLA, duplicate-run tolerance, backlog recovery
`I04`	checkpoint durability, coverage freshness, resumability
`I05`	ordering, lag, replay cost, slow-consumer backpressure
`I06`	freshness, rebuildability, backfill isolation
`I07`	stale tolerance, stampede control, origin protection
`I08`	fairness, overload policy, fail-open vs fail-closed
`I09`	uniqueness, monotonicity, allocator failover
`I10`	false death, lookup freshness, watch recovery
`I11`	monotonic apply, rollout safety, rollback speed
`I12`	idempotency, reconciliation, poison-work handling
`I13`	per-subject ordering, hotspot isolation, replay budget
`I14`	immutability, retention/GC safety, fetch bandwidth
`I15`	placement fencing, cold starts, reclaim lag, admission policy
`I16`	consistency level, replication lag, compaction cost
`I17`	route freshness, retry amplification, drain behavior
`I18`	cardinality control, alert delay, drop policy under overload
`I19`	durability target, repair bandwidth, degraded-mode performance

Variable Dimensions And Estimation Rules #

Use this section when the interviewer gives you only partial inputs.

For each archetype:

- variable dimensions (listed as Scale units) tell you what unit each variable should be expressed in
- estimation rules (listed as Decision rules) tell you how to derive a plausible estimate quickly

`I01 Coordination / Consensus Metadata` #

Scale units:
- quorum_write_RPS: writes/second
- watch_delivery_RPS: watch events/second
- renew_RPS: renewals/second
- election_rate: elections/hour or elections/day
Decision rules:
- start from number of operators, controllers, or automation jobs that mutate metadata
- if prompt says N clients watch config or membership, assume watch fanout is N * updates_per_sec
- if lease/session TTL is given, derive renew rate as active_sessions / TTL_seconds
- if failover is said to be rare, model election rate as low steady-state background plus burst during incidents
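
The watch-fanout and renew-rate rules above can be sketched as a quick calculation. This is a minimal sketch: the function name and the sample inputs are illustrative assumptions, not part of the worksheet.

```python
def i01_estimates(watchers, updates_per_sec, active_sessions, ttl_seconds):
    """First-cut I01 load numbers from partial prompt inputs."""
    return {
        # each update is delivered once to every watcher
        "watch_delivery_RPS": watchers * updates_per_sec,
        # assumes one renew per TTL; clients that renew earlier raise this
        "renew_RPS": active_sessions / ttl_seconds,
    }


# Hypothetical inputs: 5,000 watchers, 2 updates/s, 30,000 sessions, 10 s TTL.
est = i01_estimates(5_000, 2, 30_000, 10)
print(est)  # {'watch_delivery_RPS': 10000, 'renew_RPS': 3000.0}
```

With these made-up numbers, watch delivery is more than three times the renew load, which is the kind of asymmetry the first pressure check should call out.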

`I02 Claim / Lease / Exclusive Ownership` #

Scale units:
- claim_RPS: claim attempts/second
- renew_RPS: renewals/second
- contention_RPS: contended claim attempts/second on hottest key or shard
- reclaim_delay: seconds
Decision rules:
- estimate total claim attempts from workers, jobs, or contenders entering the system per second
- if TTL is provided, renew rate is active_leases / TTL_seconds
- if prompt says a small subset of resources are popular, apply a hot-key share to total claim volume
- set reclaim delay from lease TTL + detection lag + reassignment lag

`I03 Due-Time Release + Claimable Run` #

Scale units:
- due_RPS: newly due jobs/second
- backlog_seconds: seconds of runnable backlog
- claim_RPS: claims/second
- lateness_SLA: seconds or minutes
Decision rules:
- convert jobs/day or jobs/hour into average rate, then separately estimate the peak due bucket
- if due times cluster on minute boundaries, model a peak-to-average multiplier rather than using average only
- compute backlog in time, not just count, using queue_depth / effective_claim_or_execute_rate
- if interviewer says reminders should feel near real time, keep lateness in single-digit seconds; if it is batch, use minutes
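
As a rough illustration of the rules above, the sketch below converts a daily job count into average and peak due rates, then expresses backlog in seconds. The helper names, the peak-to-average multiplier of 20, and the queue numbers are assumptions chosen only for the example.

```python
SECONDS_PER_DAY = 86_400


def due_rates(jobs_per_day, peak_to_avg):
    """Return (average due_RPS, peak due_RPS)."""
    avg = jobs_per_day / SECONDS_PER_DAY
    return avg, avg * peak_to_avg


def backlog_seconds(queue_depth, claim_rate):
    """Backlog expressed as time to drain, not as an item count."""
    return queue_depth / claim_rate


# Hypothetical prompt: 8.64M jobs/day, due times clustered on minute boundaries.
avg_rps, peak_rps = due_rates(jobs_per_day=8_640_000, peak_to_avg=20)
lag = backlog_seconds(queue_depth=6_000, claim_rate=1_500)
print(avg_rps, peak_rps, lag)  # 100.0 2000.0 4.0
```

A 4 s backlog would fit a single-digit-seconds lateness target. Sizing from the 100 RPS average instead of the 2,000 RPS peak would miss the burst entirely.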

`I04 Frontier Scan + Claimable Run` #

Scale units:
- claim_RPS: frontier claims/second
- coverage_seconds: seconds or hours to revisit the full frontier
- checkpoint_WPS: checkpoints/second
- rediscovery_rate: items rediscovered/second or rediscovery ratio
Decision rules:
- derive claim rate from worker count and batch size/frequency
- derive coverage time from total frontier items / effective scan rate
- checkpoint frequency should scale with amount of work you can afford to replay after crash
- if prompt emphasizes freshness, push coverage interval down; if it emphasizes cost, allow longer revisit intervals

`I05 Append Log + Consumer Progress` #

Scale units:
- append_Bps: bytes/second
- append_RPS: records/second
- partition_write_RPS: writes/second on hottest partition
- lag_seconds: seconds
- replay_window: hours or days
Decision rules:
- start from producer count and per-producer event rate
- estimate bytes separately from record count because batching/compression can change bottlenecks
- if key distribution is skewed, apply a hot-partition share rather than assuming uniform spread
- use consumer backlog / consumer throughput to express lag in seconds since that maps better to SLAs
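
One way to apply these rules in order is sketched below. The function name and all input values (producer count, record size, hot-partition share, backlog) are hypothetical.

```python
def i05_estimates(producers, events_per_producer_sec, avg_record_bytes,
                  hot_partition_share, consumer_backlog, consume_rate):
    """First-cut I05 numbers: records, bytes, hottest partition, lag."""
    append_rps = producers * events_per_producer_sec
    return {
        "append_RPS": append_rps,
        # bytes are tracked separately from record count
        "append_Bps": append_rps * avg_record_bytes,
        # skewed keys: the hottest partition takes a fixed share of writes
        "partition_write_RPS": append_rps * hot_partition_share,
        # lag in seconds maps to an SLA more directly than a record count
        "lag_seconds": consumer_backlog / consume_rate,
    }


# Hypothetical inputs: 2,000 producers at 50 events/s, 512-byte records,
# 12.5% of writes on the hottest partition, 3M records of backlog.
est = i05_estimates(2_000, 50, 512, 0.125, 3_000_000, 120_000)
for name, value in est.items():
    print(name, value)
```

Under these assumptions the hottest partition sees 12,500 writes/second out of 100,000, and the backlog corresponds to 25 seconds of lag.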

`I06 Projection / Index / Search Pipeline` #

Scale units:
- source_WPS: source writes/second
- projection_WPS: derived writes/second
- lag_seconds: seconds or minutes
- rebuild_seconds: minutes, hours, or days
Decision rules:
- identify the canonical source-of-truth mutation rate first
- multiply by average number of downstream index or projection updates per source mutation
- if prompt involves fanout by follower/subscriber/tag, model projection amplification explicitly
- set rebuild time from total corpus size and realistic sustained rebuild bandwidth, not peak hardware bandwidth

`I07 Cache / Origin Projection / Edge Delivery` #

Scale units:
- read_RPS: reads/second
- origin_RPS: cache misses/second reaching origin
- miss_burst_RPS: miss burst requests/second
- invalidation_rate: invalidations/second
- memory_bytes: bytes
Decision rules:
- derive read rate from active clients times per-client request rate
- derive origin load from read_RPS * (1 - hit_ratio)
- if TTL expiry is synchronized, estimate miss burst separately from steady-state miss rate
- size memory from hot working set, not total corpus, unless prompt says full-cache mirror
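
The steady-state miss rate and the synchronized-expiry burst are worth computing side by side. The sketch below does that under assumed inputs; the function name, the 95% hit ratio, the 20% simultaneous-expiry share, and the working-set size are all illustrative.

```python
def i07_estimates(read_rps, hit_ratio, simultaneous_expiry_share,
                  hot_keys, avg_entry_bytes):
    """First-cut I07 numbers: origin load, miss burst, memory."""
    return {
        # steady-state misses that reach the origin
        "origin_RPS": read_rps * (1 - hit_ratio),
        # reads that miss together when a share of entries expires at once
        "miss_burst_RPS": read_rps * simultaneous_expiry_share,
        # sized from the hot working set, not the total corpus
        "memory_bytes": hot_keys * avg_entry_bytes,
    }


# Hypothetical inputs: 200k reads/s, 50M hot keys of about 1 KiB each.
est = i07_estimates(200_000, 0.95, 0.2, 50_000_000, 1_024)
print({k: round(v) for k, v in est.items()})
```

Here the burst (about 40,000 RPS) is roughly four times the steady-state origin load (about 10,000 RPS), so origin sizing from the average alone would be too small.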

`I08 Traffic Shaping / Admission Control` #

Scale units:
- eval_RPS: decisions/second
- budget_WPS: budget mutations/second
- hot_tenant_RPS: requests/second for hottest tenant
- queue_wait_seconds: seconds
Decision rules:
- start with total incoming requests on the guarded path
- only a fraction of requests may mutate budget state; separate evaluation from mutation
- if prompt is multi-tenant, explicitly estimate top-tenant share
- if system queues before admit/reject, express overload in queue-wait time rather than raw queue length

`I09 Sequence / Identifier Generation` #

Scale units:
- id_RPS: IDs/second
- block_lease_RPS: block leases/second
- worker_pool_util: fraction or percent
- rollback_risk_window: milliseconds or seconds
Decision rules:
- derive ID rate from operations that need new IDs, not from all requests
- if using range leasing, convert allocation rate into coordinator lease rate via id_RPS / block_size
- if generators need unique worker IDs, compare active generators to available ID slots
- if prompt requires monotonicity, ask or assume a maximum tolerable clock rollback window

`I10 Membership / Presence / Registry` #

Scale units:
- heartbeat_RPS: heartbeats/second
- expiry_scan_RPS: expiry checks/second
- watch_push_RPS: membership updates delivered/second
- false_death_window: seconds
Decision rules:
- derive heartbeat rate from member count and heartbeat interval
- if expiry scanning is centralized, estimate check volume per scan interval
- watch push load is watchers * meaningful membership changes per second
- set false-death window from heartbeat interval, missed-heartbeat threshold, and network jitter budget
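
A minimal sketch of these rules follows. The worksheet names the inputs to the false-death window but not how to combine them; the additive form used here (interval times missed threshold, plus jitter) is an assumption, as are the function name and sample values.

```python
def i10_estimates(members, heartbeat_interval_sec, missed_threshold,
                  jitter_budget_sec, watchers, changes_per_sec):
    """First-cut I10 numbers: heartbeat fan-in, watch push, false-death window."""
    return {
        "heartbeat_RPS": members / heartbeat_interval_sec,
        "watch_push_RPS": watchers * changes_per_sec,
        # assumed model: declare death after N missed beats plus jitter slack
        "false_death_window_sec": (
            heartbeat_interval_sec * missed_threshold + jitter_budget_sec
        ),
    }


# Hypothetical inputs: 50k members, 5 s heartbeats, 3 missed beats,
# 2 s jitter budget, 2k watchers, 10 meaningful changes/s.
est = i10_estimates(50_000, 5, 3, 2, 2_000, 10)
print(est)
```

Doubling the heartbeat interval in this model halves heartbeat fan-in but stretches the false-death window from 17 s to 32 s, which makes the freshness tradeoff explicit.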

`I11 Control Plane + Snapshot Distribution` #

Scale units:
- fanout_RPS: target updates/second
- update_Bps: bytes/second distributed
- convergence_seconds: seconds or minutes
- config_mutate_RPS: control writes/second
Decision rules:
- derive mutate rate from humans, deploy controllers, or automation systems changing truth
- turn rollout size into fanout by dividing targets by desired rollout interval
- multiply target update rate by average snapshot or delta size to get distribution bandwidth
- derive convergence from fleet size and realistic ack/apply throughput, not ideal broadcast speed

`I12 Workflow + External Side Effect` #

Scale units:
- transition_RPS: state transitions/second
- effect_RPS: side-effect attempts/second
- retry_rate: retries/second or retry fraction
- stuck_items: count
Decision rules:
- estimate transition rate from active entities and transitions per entity per unit time
- side-effect rate is usually transitions * (1 + retry_factor) rather than equal to transition rate
- estimate stuck-work count from timeout rate, provider failure rate, or reconciliation lag
- if prompt is payment/provisioning/booking, assume idempotency and retries are first-class, not edge cases

`I13 Shared Subject Coordination` #

Scale units:
- op_RPS: operations/second per subject
- broadcast_RPS: fanout deliveries/second
- replay_ops: operations
- snapshot_cadence: seconds, minutes, or ops between snapshots
Decision rules:
- derive per-subject op rate from concurrent editors times per-editor operation frequency
- fanout is per-subject op rate times active subscribers, not total system users
- size replay budget by the maximum join latency or reconnect cost you can tolerate
- if prompt has hot rooms/docs/canvases, model hottest-subject load separately from average subject load

`I14 Immutable Artifact Namespace + Delivery` #

Scale units:
- publish_RPS: publishes/second
- fetch_Bps: bytes/second
- metadata_RPS: metadata ops/second
- gc_backlog_bytes: bytes
Decision rules:
- estimate publish rate from build, release, or upload frequency
- estimate fetch throughput from download concurrency and average artifact size
- separate metadata ops from bulk bytes because metadata hotspots often break first
- derive GC backlog from publish churn times retention window before unreachable data can be removed

`I15 Execution Fleet + Worker Substrate` #

Scale units:
- arrival_RPS: executions/second arriving
- concurrency: concurrent running executions
- slot_count: worker slots
- heartbeat_RPS: heartbeats/second
- cold_start_RPS: cold starts/second
Decision rules:
- derive arrival rate from incoming jobs, invocations, or tasks
- derive concurrency using Little’s Law: arrival_rate * average_run_time
- derive slot count by dividing concurrency by target utilization, not by theoretical max
- estimate heartbeat fan-in from active workers and interval
- if prompt is bursty serverless or CI, estimate cold starts separately from average launches
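
The Little's Law step and the utilization adjustment can be checked with a few lines. This is a sketch under assumed inputs; the function name and the numbers are not from the worksheet.

```python
import math


def i15_estimates(arrival_rps, avg_run_seconds, utilization_target,
                  active_workers, heartbeat_interval_sec):
    """First-cut I15 numbers: concurrency, slots, heartbeat fan-in."""
    # Little's Law: items in system = arrival rate * time in system
    concurrency = arrival_rps * avg_run_seconds
    return {
        "concurrency": concurrency,
        # divide by target utilization to leave headroom
        "slot_count": math.ceil(concurrency / utilization_target),
        "heartbeat_RPS": active_workers / heartbeat_interval_sec,
    }


# Hypothetical inputs: 300 executions/s, 5 s average run, 75% target
# utilization, 600 workers heartbeating every 10 s.
est = i15_estimates(300, 5, 0.75, 600, 10)
print(est)  # {'concurrency': 1500, 'slot_count': 2000, 'heartbeat_RPS': 60.0}
```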

`I16 Key-Scoped Mutable State / Replicated KV` #

Scale units:
- read_RPS: reads/second
- write_RPS: writes/second
- hot_key_RPS: hottest-key reads or writes/second
- repl_Bps: replication bytes/second
- storage_write_amp: ratio
Decision rules:
- estimate read/write rate from active clients and per-client operation rate
- apply a skew factor for hottest keys rather than assuming uniform traffic
- derive replication bandwidth from write rate, average mutation size, and replica count
- if storage engine details matter, separate logical writes from physical storage amplification

`I17 Traffic Steering / Request Mediation Plane` #

Scale units:
- ingress_RPS: requests/second
- conn_count: active connections
- health_RPS: health checks/second
- retry_RPS: retries/second
- policy_eval_RPS: policy evaluations/second
Decision rules:
- derive ingress from clients or upstream services and per-client request rate
- derive active connections from open-session model, not from request rate alone
- derive health-check load from backend count and health-check cadence
- estimate retries as a fraction of ingress under normal and degraded conditions separately

`I18 Telemetry / Time-Series Pipeline` #

Scale units:
- ingest_RPS: samples or events/second
- active_series: count
- query_scan_Bps: bytes/second scanned
- retention_bytes: bytes
- alert_delay: seconds
Decision rules:
- start from emitter count and per-emitter metrics/log events rate
- derive active series from metric names times label combinations, not just host count
- query scan cost should be estimated from points touched per query and concurrent dashboards/alerts
- retention bytes come from ingest throughput times retention duration, after applying compression assumptions
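
To make the cardinality and retention rules concrete, here is a small sketch. The function name, the per-sample stored size (standing in for the compression assumption), and all input values are hypothetical.

```python
SECONDS_PER_DAY = 86_400


def i18_estimates(emitters, metrics_per_emitter, label_cardinality_factor,
                  samples_per_emitter_sec, stored_bytes_per_sample,
                  retention_days):
    """First-cut I18 numbers: series count, ingest rate, retained bytes."""
    ingest_rps = emitters * samples_per_emitter_sec
    # stored_bytes_per_sample is the post-compression size (an assumption)
    ingest_bps = ingest_rps * stored_bytes_per_sample
    return {
        # series grow with label combinations, not just host count
        "active_series": emitters * metrics_per_emitter * label_cardinality_factor,
        "ingest_RPS": ingest_rps,
        "retention_bytes": ingest_bps * retention_days * SECONDS_PER_DAY,
    }


# Hypothetical inputs: 10k emitters, 200 metrics each, 5x label factor,
# 100 samples/s per emitter, 2 bytes/sample stored, 30 days retained.
est = i18_estimates(10_000, 200, 5, 100, 2, 30)
print(est)
```

With these inputs the label factor alone turns 2 million metric instances into 10 million active series, and 30 days of retention comes to about 5.2 TB before replication.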

`I19 Replicated Chunk / Block / File Storage Substrate` #

Scale units:
- metadata_RPS: metadata operations/second
- chunk_Bps: chunk/block bytes/second
- repair_Bps: repair bytes/second
- rebuild_seconds: seconds, hours, or days
- placement_skew: fraction or percent imbalance
Decision rules:
- estimate metadata rate from namespace ops like create/open/list/rename/attach
- estimate chunk throughput from foreground IO volume and average block/chunk size
- derive repair bandwidth from durability target and time-to-repair requirement after replica loss
- if prompt mentions rack/AZ awareness, explicitly model placement skew and correlated failure domains
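
The repair-bandwidth rule is easiest to sanity-check by comparing the bandwidth the repair window demands against what the cluster can actually spare. The sketch below does that; the function name and every number in it are assumptions for illustration.

```python
def i19_estimates(io_ops_per_sec, avg_chunk_bytes, lost_replica_bytes,
                  repair_window_sec, effective_repair_bps):
    """First-cut I19 numbers: foreground throughput and repair budget."""
    return {
        "chunk_Bps": io_ops_per_sec * avg_chunk_bytes,
        # bandwidth needed to re-replicate within the target window
        "repair_Bps": lost_replica_bytes / repair_window_sec,
        # time repair really takes at the bandwidth left after foreground IO
        "rebuild_seconds": lost_replica_bytes / effective_repair_bps,
    }


# Hypothetical inputs: 20k IO ops/s at 64 KiB, 7.2 TB lost on one node,
# a 4 h repair window, and 300 MB/s actually available for repair.
est = i19_estimates(20_000, 65_536, 7_200_000_000_000, 14_400, 300_000_000)
print(est)
meets_window = est["rebuild_seconds"] <= 14_400
print("repair fits window:", meets_window)
```

In this example the window needs 500 MB/s but only 300 MB/s is available, so the rebuild takes 24,000 s and misses the 4 h target. The gap is the background repair budget that has to be reserved.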

Capacity Estimation From NFR Targets #

This is the sizing section.

Use it when the interviewer asks:

- how many nodes or shards do you need?
- how much bandwidth or storage do you need?
- how many workers, consumers, or replicas does the NFR imply?

For each archetype:

- capacity units tell you what you are sizing
- capacity rules tell you how to derive a first-cut number from the NFR target

`I01 Coordination / Consensus Metadata` #

Capacity units:
- quorum groups
- metadata partitions
- watch fanout replicas
- network bandwidth for watch delivery
Capacity rules:
- required_quorum_groups = ceil(quorum_write_RPS / sustainable_write_RPS_per_group)
- required_watch_replicas = ceil(watch_delivery_RPS / sustainable_watch_events_per_replica)
- if watch fanout load dominates but write rate is low, size separate fanout replicas rather than more quorum writers
- if one metadata domain exceeds per-group latency or write target, split into another partition rather than stretching one quorum group

`I02 Claim / Lease / Exclusive Ownership` #

Capacity units:
- lease-service partitions
- renew-handling replicas
- reclaim workers
Capacity rules:
- required_partitions = ceil(claim_RPS / sustainable_claim_RPS_per_partition)
- required_renew_capacity = ceil(renew_RPS / sustainable_renew_RPS_per_replica)
- required_reclaim_workers = ceil(expired_claims_per_sec / reclaims_per_worker_sec)
- if hottest-key contention exceeds single-partition capacity, no amount of replica scaling fixes it; size around smaller claim domains instead

`I03 Due-Time Release + Claimable Run` #

Capacity units:
- releaser/scanner workers
- runnable queue partitions
- execution workers
Capacity rules:
- required_releasers = ceil(due_RPS / jobs_released_per_worker_sec)
- required_workers = ceil(due_RPS / jobs_completed_per_worker_sec)
- required_queue_partitions = ceil(claim_RPS / sustainable_claim_RPS_per_partition)
- if lateness SLA is L, keep backlog_seconds < L; otherwise add releaser throughput or execution throughput depending on which stage is saturated

`I04 Frontier Scan + Claimable Run` #

Capacity units:
- scan workers
- frontier shards
- checkpoint write throughput
Capacity rules:
- required_scan_workers = ceil(frontier_items / target_coverage_seconds / items_scanned_per_worker_sec)
- required_frontier_shards = ceil(claim_RPS / sustainable_claim_RPS_per_shard)
- required_checkpoint_WPS = workers / checkpoint_interval_sec
- if target coverage interval tightens, worker count scales roughly inversely with allowed revisit time
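
The inverse relationship between coverage interval and worker count shows up directly when the first rule is evaluated at two targets. A sketch, with a hypothetical function name and made-up inputs:

```python
import math


def required_scan_workers(frontier_items, target_coverage_seconds,
                          items_scanned_per_worker_sec):
    """Workers needed to revisit the whole frontier within the target time."""
    required_scan_rate = frontier_items / target_coverage_seconds
    return math.ceil(required_scan_rate / items_scanned_per_worker_sec)


# Hypothetical inputs: 1 billion frontier items, 50 items/s per worker.
daily = required_scan_workers(1_000_000_000, 86_400, 50)
twice_daily = required_scan_workers(1_000_000_000, 43_200, 50)
print(daily, twice_daily)  # 232 463

# Checkpoint write load at one checkpoint per worker every 30 s (assumed).
checkpoint_wps = daily / 30
print(round(checkpoint_wps, 2))
```

Halving the revisit time roughly doubles the worker count (232 to 463), and checkpoint write load grows with it.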

`I05 Append Log + Consumer Progress` #

Capacity units:
- partitions
- brokers
- consumer workers
- storage bytes
Capacity rules:
- required_partitions = max(ceil(append_RPS / target_records_per_partition_sec), ceil(append_Bps / target_bytes_per_partition_sec))
- required_brokers = ceil(required_partitions / target_partitions_per_broker)
- required_consumers = ceil(append_RPS / records_processed_per_consumer_sec)
- required_storage = append_Bps * retention_seconds * replication_factor
- if lag SLA is L, consumer throughput must exceed the ingest rate by enough headroom to keep steady-state lag_seconds < L
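
The capacity rules above chain together, so a small helper makes the dependencies visible. This is a sketch under assumed per-partition, per-broker, and per-consumer limits; none of the target values come from a specific system.

```python
import math


def i05_capacity(append_rps, append_bps, records_per_partition_sec,
                 bytes_per_partition_sec, partitions_per_broker,
                 records_per_consumer_sec, retention_seconds,
                 replication_factor):
    """First-cut I05 sizing from throughput and retention targets."""
    # take whichever limit (records or bytes) demands more partitions
    partitions = max(
        math.ceil(append_rps / records_per_partition_sec),
        math.ceil(append_bps / bytes_per_partition_sec),
    )
    return {
        "partitions": partitions,
        "brokers": math.ceil(partitions / partitions_per_broker),
        "consumers": math.ceil(append_rps / records_per_consumer_sec),
        "storage_bytes": append_bps * retention_seconds * replication_factor,
    }


# Hypothetical inputs: 100k records/s, 51.2 MB/s, 7-day retention, 3 replicas.
cap = i05_capacity(100_000, 51_200_000, 5_000, 2_000_000, 10, 8_000,
                   7 * 86_400, 3)
print(cap)
```

In this example the byte limit (26 partitions) binds before the record limit (20), which is why the rule takes the max of the two.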

`I06 Projection / Index / Search Pipeline` #

Capacity units:
- projector/indexer workers
- index shards
- rebuild lanes
- query replicas
Capacity rules:
- required_projectors = ceil(projection_WPS / writes_applied_per_projector_sec)
- required_query_replicas = ceil(query_RPS / queries_served_per_replica_sec)
- required_shards = ceil(index_bytes / target_bytes_per_shard) or ceil(query_fanout / acceptable_shard_fanout)
- if freshness SLA is F, provision projector throughput so queued_updates / projector_rate < F

`I07 Cache / Origin Projection / Edge Delivery` #

Capacity units:
- cache nodes
- memory bytes
- POPs or cache tiers
- origin capacity behind misses
Capacity rules:
- required_memory = hot_working_set_bytes / target_memory_utilization
- required_cache_nodes = ceil(required_memory / usable_memory_per_node)
- required_origin_RPS = read_RPS * (1 - hit_ratio_target)
- if miss-storm peak is the real NFR, size origin and cache fill path for miss_burst_RPS, not average miss rate

`I08 Traffic Shaping / Admission Control` #

Capacity units:
- evaluator replicas
- policy distribution replicas
- budget-store partitions
- queue slots
Capacity rules:
- required_evaluators = ceil(eval_RPS / decisions_per_replica_sec)
- required_budget_partitions = ceil(budget_WPS / budget_updates_per_partition_sec)
- required_queue_slots = queue_wait_SLA_seconds * admit_rate
- if fail-closed decision latency must stay under a p99 target, size evaluator capacity from peak eval_RPS, not average

`I09 Sequence / Identifier Generation` #

Capacity units:
- allocator replicas
- worker-id space
- lease block size
Capacity rules:
- required_allocator_RPS = id_RPS / block_size
- required_allocator_replicas = ceil(required_allocator_RPS / leases_per_allocator_sec)
- required_worker_id_space >= peak_concurrent_generators
- if allocator path is hot, the first capacity move is larger block size because it lowers central lease traffic directly
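
The effect of block size on central lease traffic can be seen by evaluating the first two rules at two block sizes. A minimal sketch, with a hypothetical function name and illustrative numbers:

```python
import math


def allocator_sizing(id_rps, block_size, leases_per_allocator_sec):
    """Return (central lease RPS, allocator replicas needed)."""
    lease_rps = id_rps / block_size
    return lease_rps, math.ceil(lease_rps / leases_per_allocator_sec)


# Hypothetical inputs: 1M IDs/s, each allocator handles 2,000 leases/s.
small_blocks = allocator_sizing(1_000_000, 100, 2_000)
large_blocks = allocator_sizing(1_000_000, 1_000, 2_000)
print(small_blocks)  # (10000.0, 5)
print(large_blocks)  # (1000.0, 1)
```

A tenfold larger block cuts central lease traffic tenfold here. The cost is a larger ID gap when a generator crashes with an unused block, which ties back to the gap-tolerance question.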

`I10 Membership / Presence / Registry` #

Capacity units:
- registry write partitions
- watch fanout replicas
- read replicas or caches
Capacity rules:
- required_registry_capacity = ceil(heartbeat_RPS / heartbeats_processed_per_replica_sec)
- required_watch_replicas = ceil(watch_push_RPS / pushes_per_replica_sec)
- required_read_replicas = ceil(lookup_RPS / lookups_per_replica_sec) when lookup path is separate
- if the freshness NFR is loose, increasing the heartbeat interval directly lowers the write capacity needed

`I11 Control Plane + Snapshot Distribution` #

Capacity units:
- control-plane writers
- fanout workers
- distribution bandwidth
- rollout waves
Capacity rules:
- required_fanout_workers = ceil(fanout_RPS / updates_pushed_per_worker_sec)
- required_bandwidth_Bps = fanout_RPS * avg_snapshot_or_delta_bytes
- max_targets_per_wave = rollout_SLA_seconds * apply_ack_rate
- if convergence SLA is tight, size by peak wave rather than by average update cadence

`I12 Workflow + External Side Effect` #

Capacity units:
- workflow workers
- side-effect workers
- reconciliation scanners
- idempotency-store throughput
Capacity rules:
- required_workers = ceil(effect_RPS / effects_processed_per_worker_sec)
- required_reconcilers = ceil(stuck_items / target_reconciliation_window_seconds / items_scanned_per_reconciler_sec)
- required_idempotency_store_RPS = effect_RPS * idempotency_reads_writes_per_effect
- if retry amplification dominates, size from peak retry scenario, not clean-path effect rate
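
A sketch of sizing from the peak retry scenario. It assumes retries inflate the clean-path rate as `effect_RPS = clean_effect_RPS * (1 + retry_factor)`, mirroring the `claim_RPS` starter formula earlier in this worksheet; `retry_factor` and every number below are illustrative assumptions.

```python
import math


def workflow_sizing(clean_effect_rps, retry_factor, effects_processed_per_worker_sec,
                    idempotency_reads_writes_per_effect, stuck_items,
                    target_reconciliation_window_seconds,
                    items_scanned_per_reconciler_sec):
    # Assumption: retries multiply the clean-path effect rate.
    effect_rps = clean_effect_rps * (1 + retry_factor)
    scan_capacity = target_reconciliation_window_seconds * items_scanned_per_reconciler_sec
    return {
        "required_workers": math.ceil(effect_rps / effects_processed_per_worker_sec),
        "required_idempotency_store_RPS": effect_rps * idempotency_reads_writes_per_effect,
        "required_reconcilers": math.ceil(stuck_items / scan_capacity),
    }


clean = workflow_sizing(1_000, 0.0, 40, 2, 120_000, 600, 50)
peak_retry = workflow_sizing(1_000, 0.5, 40, 2, 120_000, 600, 50)
print(clean, peak_retry)
```

Here a 50% retry factor moves the worker count from 25 to 38 and the idempotency-store load from 2,000 to 3,000 RPS, while the reconciler count (4) depends only on the stuck backlog and the window.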

`I13 Shared Subject Coordination` #

Capacity units:
- subject coordinators
- broadcast workers
- snapshot storage/write throughput
Capacity rules:
- required_subject_capacity = hottest_subject_op_RPS / ops_ordered_per_coordinator_sec
- required_broadcast_workers = ceil(broadcast_RPS / fanout_events_per_worker_sec)
- required_snapshot_WPS = hot_subjects / snapshot_interval_sec
- size from hottest subject, not average subject, if the NFR is per-document or per-room responsiveness
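
A small sketch of the subject-coordination rules. One reading, which is an assumption here, is that `required_subject_capacity` acts as a load ratio for the hottest subject: a value above 1.0 would mean a single coordinator cannot order that subject's operations fast enough. All inputs are illustrative.

```python
import math


def subject_sizing(hottest_subject_op_rps, ops_ordered_per_coordinator_sec,
                   broadcast_rps, fanout_events_per_worker_sec,
                   hot_subjects, snapshot_interval_sec):
    return {
        "required_subject_capacity": hottest_subject_op_rps / ops_ordered_per_coordinator_sec,
        "required_broadcast_workers": math.ceil(broadcast_rps / fanout_events_per_worker_sec),
        "required_snapshot_WPS": hot_subjects / snapshot_interval_sec,
    }


# Assumed: hottest room does 800 ops/sec; a coordinator orders 2,000 ops/sec.
sizing = subject_sizing(800, 2_000, 120_000, 5_000, 3_000, 30)
print(sizing)
```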

`I14 Immutable Artifact Namespace + Delivery` #

Capacity units:
- metadata partitions
- origin storage bandwidth
- edge cache/mirror nodes
- GC workers
Capacity rules:
- required_metadata_partitions = ceil(metadata_RPS / metadata_ops_per_partition_sec)
- required_origin_Bps = fetch_Bps * miss_ratio_to_origin
- required_edge_nodes = ceil(fetch_Bps / sustainable_edge_Bps_per_node)
- required_gc_workers = ceil(gc_backlog_bytes / (target_gc_window_seconds * bytes_reclaimed_per_worker_sec))
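
The artifact-delivery rules can be sketched as follows. The byte rates, the miss ratio, and the GC window are assumed example values.

```python
import math

GB = 10**9
TB = 10**12


def artifact_sizing(metadata_rps, metadata_ops_per_partition_sec, fetch_bps,
                    miss_ratio_to_origin, sustainable_edge_bps_per_node,
                    gc_backlog_bytes, target_gc_window_seconds,
                    bytes_reclaimed_per_worker_sec):
    gc_capacity_per_worker = target_gc_window_seconds * bytes_reclaimed_per_worker_sec
    return {
        "required_metadata_partitions": math.ceil(metadata_rps / metadata_ops_per_partition_sec),
        "required_origin_Bps": fetch_bps * miss_ratio_to_origin,
        "required_edge_nodes": math.ceil(fetch_bps / sustainable_edge_bps_per_node),
        "required_gc_workers": math.ceil(gc_backlog_bytes / gc_capacity_per_worker),
    }


# Assumed: 40 GB/s of fetches, 5% miss to origin, 90 TB GC backlog cleared in one day.
sizing = artifact_sizing(30_000, 8_000, 40 * GB, 0.05, 2.5 * GB,
                         90 * TB, 86_400, 100 * 10**6)
print(sizing)
```

Under these assumptions the origin only needs about 2 GB/s even though the edge tier serves 40 GB/s, which is why the miss ratio is worth estimating before sizing origin storage.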

`I15 Execution Fleet + Worker Substrate` #

Capacity units:
- worker slots
- scheduler replicas
- warm pool instances
- heartbeat-processing capacity
Capacity rules:
- required_slots = ceil(concurrency / target_utilization)
- required_scheduler_replicas = ceil(arrival_RPS / placements_per_scheduler_sec)
- required_warm_pool = cold_start_sensitive_arrival_RPS * warm_window_seconds
- required_heartbeat_capacity = ceil(heartbeat_RPS / heartbeats_processed_per_replica_sec)
- if queueing SLA is tight, size slots from peak burst concurrency, not average concurrency
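
A sketch contrasting average and peak-burst slot sizing. It assumes concurrency follows Little's law (`concurrency = arrival_RPS * avg_run_seconds`), which this section does not state; `avg_run_seconds` and all numbers are hypothetical.

```python
import math


def required_slots(arrival_rps, avg_run_seconds, target_utilization):
    # Assumption: concurrency = arrival rate x average run time (Little's law).
    concurrency = arrival_rps * avg_run_seconds
    return math.ceil(concurrency / target_utilization)


avg_slots = required_slots(200, 3, 0.75)   # average arrival rate
peak_slots = required_slots(500, 3, 0.75)  # peak burst arrival rate

# Assumed: 40 cold-start-sensitive arrivals/sec, 5 s warm window.
required_warm_pool = 40 * 5
required_scheduler_replicas = math.ceil(500 / 150)  # 150 placements/sec each
print(avg_slots, peak_slots, required_warm_pool, required_scheduler_replicas)
```

With these inputs, sizing from the average gives 800 slots while the peak burst needs 2,000, so a tight queueing SLA would be missed by a factor of 2.5 if the average were used.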

`I16 Key-Scoped Mutable State / Replicated KV` #

Capacity units:
- shards
- replicas
- disk/network throughput
- cache capacity for hot keys
Capacity rules:
- required_shards = max(ceil(write_RPS / writes_per_shard_sec), ceil(data_bytes / target_bytes_per_shard))
- required_replicas = durability_target_implied_replica_count
- required_repl_bandwidth = repl_Bps
- required_hot_key_cache = hot_key_working_set_bytes if hot reads dominate
- if p99 write latency is capped tightly, size shard count from hottest shard write rate, not average shard rate
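
The shard rule takes the larger of a throughput constraint and a size constraint. The sketch below shows which one binds for an assumed workload; the inputs are illustrative only.

```python
import math

GB = 10**9
TB = 10**12


def required_shards(write_rps, writes_per_shard_sec, data_bytes, target_bytes_per_shard):
    by_throughput = math.ceil(write_rps / writes_per_shard_sec)
    by_size = math.ceil(data_bytes / target_bytes_per_shard)
    binding = "throughput" if by_throughput >= by_size else "size"
    return max(by_throughput, by_size), binding


# Assumed: 120k writes/sec at 5k per shard; 40 TB of data at 500 GB per shard.
shards, binding = required_shards(120_000, 5_000, 40 * TB, 500 * GB)
print(shards, binding)
```

Here throughput alone would need 24 shards, but data size needs 80, so the workload is size-bound and adding write capacity per shard would not reduce the shard count.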

`I17 Traffic Steering / Request Mediation Plane` #

Capacity units:
- proxy/gateway instances
- connection table entries
- health-check workers
- route/policy distribution replicas
Capacity rules:
- required_instances = max(ceil(ingress_RPS / requests_per_instance_sec), ceil(conn_count / connections_per_instance))
- required_health_capacity = ceil(health_RPS / checks_per_worker_sec)
- required_policy_capacity = ceil(policy_eval_RPS / policy_evals_per_replica_sec)
- if p99 latency is the main NFR, size from connection-heavy and retry-heavy peak, not clean-path request average
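
A sketch of the instance rule, which takes the larger of a request-rate constraint and a connection-count constraint. The per-instance limits are assumed values.

```python
import math


def required_instances(ingress_rps, requests_per_instance_sec,
                       conn_count, connections_per_instance):
    by_requests = math.ceil(ingress_rps / requests_per_instance_sec)
    by_connections = math.ceil(conn_count / connections_per_instance)
    return max(by_requests, by_connections)


# Assumed: 300k RPS at 20k per instance; 2M long-lived connections at 50k per instance.
instances = required_instances(300_000, 20_000, 2_000_000, 50_000)
print(instances)
```

In this example the request rate alone asks for 15 instances, while the connection table asks for 40, which is the connection-heavy case the last rule warns about.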

`I18 Telemetry / Time-Series Pipeline` #

Capacity units:
- ingest shards
- query replicas
- storage bytes
- rollup workers
Capacity rules:
- required_ingest_shards = ceil(ingest_RPS / samples_per_shard_sec)
- required_query_replicas = ceil(query_scan_Bps / bytes_scanned_per_replica_sec)
- required_storage = retention_bytes
- required_rollup_workers = ceil(active_series / series_aggregated_per_worker_sec)
- if cost ceiling is part of the NFR, solve for max ingest or retention under that storage budget before sizing hardware
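
A sketch of solving for retention or ingest under a storage budget. It assumes `retention_bytes = ingest_RPS * bytes_per_sample * retention_seconds`; `bytes_per_sample` and the budget are hypothetical inputs not defined in this section.

```python
def max_retention_seconds(storage_budget_bytes, ingest_rps, bytes_per_sample):
    # Assumption: storage grows linearly with ingest rate and retention.
    return storage_budget_bytes / (ingest_rps * bytes_per_sample)


def max_ingest_rps(storage_budget_bytes, retention_seconds, bytes_per_sample):
    return storage_budget_bytes / (retention_seconds * bytes_per_sample)


TB = 10**12
DAY = 86_400

# Assumed: 100 TB budget, 1M samples/sec, 2 bytes per compressed sample.
retention = max_retention_seconds(100 * TB, 1_000_000, 2)
ingest_at_30_days = max_ingest_rps(100 * TB, 30 * DAY, 2)
print(retention / DAY, ingest_at_30_days)
```

With these assumptions the budget allows about 579 days of retention at 1M samples/sec, or about 19M samples/sec at 30 days of retention.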

`I19 Replicated Chunk / Block / File Storage Substrate` #

Capacity units:
- metadata partitions
- storage nodes/disks
- repair workers/bandwidth
- replica bytes
Capacity rules:
- required_metadata_partitions = ceil(metadata_RPS / metadata_ops_per_partition_sec)
- required_storage_nodes = max(ceil(chunk_Bps / usable_Bps_per_node), ceil(total_stored_bytes / usable_bytes_per_node))
- required_repair_Bps = lost_replica_bytes / target_repair_window_seconds
- required_total_storage = logical_data_bytes * replication_factor / usable_storage_fraction
- if durability NFR says repair within T after one-node loss, size repair bandwidth directly from that target rather than from foreground IO average
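
A sketch of the storage-substrate rules, including repair bandwidth sized directly from the repair target. All inputs are assumed example values, and the dictionary keys are hypothetical.

```python
import math

GB = 10**9
TB = 10**12


def storage_sizing(chunk_bps, usable_bps_per_node, total_stored_bytes,
                   usable_bytes_per_node, lost_replica_bytes,
                   target_repair_window_seconds, logical_data_bytes,
                   replication_factor, usable_storage_fraction):
    nodes_for_throughput = math.ceil(chunk_bps / usable_bps_per_node)
    nodes_for_bytes = math.ceil(total_stored_bytes / usable_bytes_per_node)
    return {
        "required_storage_nodes": max(nodes_for_throughput, nodes_for_bytes),
        "required_repair_Bps": lost_replica_bytes / target_repair_window_seconds,
        "required_total_storage": (logical_data_bytes * replication_factor
                                   / usable_storage_fraction),
    }


# Assumed: 1 PB logical data, 3 replicas, 75% usable fraction;
# one lost node holds 12 TB and must be re-replicated within 1 hour.
plan = storage_sizing(60 * GB, 1 * GB, 3_000 * TB, 75 * TB,
                      12 * TB, 3_600, 1_000 * TB, 3, 0.75)
print(plan)
```

For these inputs the repair target alone demands about 3.3 GB/s of repair bandwidth, independent of foreground IO.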

`I20 Computation / Dataflow / DAG Execution` #

Capacity units:
- scheduler throughput
- worker task slots
- shuffle bandwidth/storage
- checkpoint bandwidth/storage
- sink commit throughput
Capacity rules:
- required_task_slots = ceil(task_concurrency / target_slot_utilization)
- required_scheduler_capacity = ceil(task_launch_RPS / launches_per_scheduler_sec)
- required_shuffle_Bps = shuffle_Bps
- required_checkpoint_Bps = checkpoint_Bps
- required_sink_commit_capacity = ceil(output_commit_RPS / commits_per_committer_sec)
- if correctness depends on exactly-once output, size from checkpoint and commit boundaries, not only from clean-path operator throughput
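
A sketch of the dataflow rules that need arithmetic (slots, scheduler, sink commit). The shuffle and checkpoint rules pass their inputs straight through, so they are omitted. All numbers are assumptions.

```python
import math


def dataflow_sizing(task_concurrency, target_slot_utilization,
                    task_launch_rps, launches_per_scheduler_sec,
                    output_commit_rps, commits_per_committer_sec):
    return {
        "required_task_slots": math.ceil(task_concurrency / target_slot_utilization),
        "required_scheduler_capacity": math.ceil(task_launch_rps / launches_per_scheduler_sec),
        "required_sink_commit_capacity": math.ceil(output_commit_rps / commits_per_committer_sec),
    }


# Assumed: 9,000 concurrent tasks at 75% slot utilization,
# 2,500 launches/sec, 1,200 output commits/sec.
sizing = dataflow_sizing(9_000, 0.75, 2_500, 400, 1_200, 250)
print(sizing)
```

For exactly-once output, `output_commit_rps` should be taken at checkpoint boundaries, where commits tend to bunch up, rather than from the steady operator rate.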

`I21 Trust Boundary / Cryptographic Proof Substrate` #

Capacity units:
- verifier replicas/cache capacity
- signer/HSM throughput
- trust-bundle distribution fanout
- revocation publication latency
- audit log partitions
Capacity rules:
- required_verifiers = ceil(verify_RPS / verifications_per_verifier_sec)
- required_signers = ceil(sign_RPS / signatures_per_signer_sec)
- required_bundle_fanout_workers = ceil(trust_bundle_fanout_RPS / bundle_updates_per_worker_sec)
- required_audit_partitions = ceil(audit_WPS / writes_per_partition_sec)
- if stale revocation is the critical risk, size bundle and revocation fanout from the revocation propagation SLA rather than from average issuer throughput
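
A sketch of the trust-substrate rules, with the fanout rate derived from a revocation propagation SLA. The derivation `trust_bundle_fanout_RPS = verifier_caches / revocation_sla_seconds` is an assumption made for this example, as are all the numbers.

```python
import math


def trust_sizing(verify_rps, verifications_per_verifier_sec,
                 sign_rps, signatures_per_signer_sec,
                 verifier_caches, revocation_sla_seconds,
                 bundle_updates_per_worker_sec,
                 audit_wps, writes_per_partition_sec):
    # Assumption: every verifier cache must receive a revocation within the SLA.
    trust_bundle_fanout_rps = verifier_caches / revocation_sla_seconds
    return {
        "required_verifiers": math.ceil(verify_rps / verifications_per_verifier_sec),
        "required_signers": math.ceil(sign_rps / signatures_per_signer_sec),
        "required_bundle_fanout_workers": math.ceil(
            trust_bundle_fanout_rps / bundle_updates_per_worker_sec),
        "required_audit_partitions": math.ceil(audit_wps / writes_per_partition_sec),
    }


# Assumed: 400k verifications/sec, 3k signatures/sec,
# 20k verifier caches refreshed within a 10 s revocation SLA.
sizing = trust_sizing(400_000, 25_000, 3_000, 800,
                      20_000, 10, 1_500, 403_000, 50_000)
print(sizing)
```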

Interview One-Liner #

For infra prompts, I would first classify the dominant archetype, compute the first load-bearing variables for that shape, name the first bottleneck, and then run a short cross-cutting pass over correctness, availability, durability, freshness, isolation, and repair budget.
