Infra Archetype NFR Practice Worksheet #
Use this as the infra-first counterpart to the broader mixed prompt worksheet.
This note is intentionally organized by I01-I21, not by product prompt.
The goal is:
- start from the archetype
- calculate the first load-bearing variables
- identify the first scaling bottleneck
- then run a short cross-cutting NFR pass
This should be used with:
- infra-primitive-families-for-path-generation.md
- infra-archetype-taxonomy-reference.md
- infra-nfr-cheat-sheet.md
How To Use #
For a prompt:
- classify to the dominant infra archetype
- steal the sample-prompt row below
- calculate the listed variables before drawing boxes
- state the first pressure check out loud
- finish with the cross-cutting questions in the last column
If the system is hybrid:
- pick the dominant row first
- then import one adjacent row
- do not average them into a generic worksheet
Archetype Worksheet #
| Archetype | Sample prompts | First NFRs to calculate | Starter formulas | First pressure check | Cross-cutting follow-up |
|---|---|---|---|---|---|
| I01 Coordination / Consensus Metadata | metadata service, leader election service, config store | quorum write TPS, watch fanout, election rate, session renew rate | `quorum_write_RPS = metadata_mutations_per_sec`; `watch_delivery_RPS = watchers*updates_per_sec`; `renew_RPS = active_sessions/lease_seconds` | quorum RTT or hot metadata key | fail closed or open under quorum loss; max stale watch gap; compaction/replay window; blast radius of leader failover |
| I02 Claim / Lease / Exclusive Ownership | lock service, lease manager, shard owner election | claim TPS, renew rate, hot-key contention, stale-holder fence checks | `claim_RPS = claim_attempts_per_sec`; `renew_RPS = active_leases/lease_seconds`; `contention_RPS = claim_RPS*hot_key_share` | hot claim key or renew storm | fence token required at which downstream write; duplicate-owner damage if fencing fails; reclaim delay budget; false-expiry tolerance |
| I03 Due-Time Release + Claimable Run | cron scheduler, delayed job queue, reminder service | due burst RPS, runnable backlog, lateness SLA, claim TPS | `due_RPS = jobs_due_in_peak_window/window_seconds`; `backlog_seconds = runnable_queue_depth/worker_claim_rate`; `claim_RPS = due_RPS*(1+retry_factor)` | due-time burstiness or claim backlog | acceptable fire-time delay; duplicate run tolerance; clock-skew budget; replay after scheduler crash |
| I04 Frontier Scan + Claimable Run | web crawler, batch scanner, ETL sweep, compliance scanner | frontier claim rate, scan coverage rate, checkpoint cadence, rediscovery rate | `claim_RPS = workers*batches_per_sec`; `coverage_seconds = frontier_items/scan_rate`; `checkpoint_WPS = workers/checkpoint_interval_sec` | frontier contention or checkpoint lag | progress cursor truth; resumability after crash; duplicate-scan budget; freshness vs scan cost |
| I05 Append Log + Consumer Progress | pub/sub, event bus, durable queue, commit log | append throughput, partition hotness, consumer lag, replay window | `append_Bps = append_RPS*avg_record_bytes`; `partition_write_RPS = append_RPS*hot_partition_share`; `lag_seconds = consumer_backlog/consume_rate` | hot partition or replay lag | ordering scope per partition or key; commit-progress durability; replay cost after consumer loss; backpressure on slow consumers |
| I06 Projection / Index / Search Pipeline | search index, materialized view, feed projector, read model builder | source ingest RPS, projection fanout, index lag, rebuild throughput | `projection_WPS = source_WPS*avg_projection_updates`; `lag_seconds = queued_updates/projector_rate`; `rebuild_seconds = corpus_bytes/rebuild_Bps` | write amplification or projector lag | rebuildability from source truth; freshness bound for queries; backfill isolation from live traffic; correctness under out-of-order events |
| I07 Cache / Origin Projection / Edge Delivery | distributed cache, CDN metadata cache, API response cache | read RPS, hit ratio, miss storm size, invalidation rate, memory footprint | `origin_RPS = read_RPS*(1-hit_ratio)`; `miss_burst_RPS = read_RPS*simultaneous_expiry_share`; `memory_bytes = keys*avg_entry_bytes` | hot key or cache miss storm | stale-read tolerance; invalidation propagation bound; stampede control; fail-open vs fail-through on origin error |
| I08 Traffic Shaping / Admission Control | rate limiter, quota manager, overload shedder, concurrency limiter | evaluator RPS, budget update rate, hot-tenant rate, queue/admit depth | `eval_RPS = requests_per_sec`; `budget_WPS = eval_RPS*enforced_fraction`; `hot_tenant_RPS = eval_RPS*top_tenant_share`; `queue_wait_seconds = queued_requests/admit_rate` | evaluator hot path or hot budget key | fairness unit per tenant/user/request class; fail-open vs fail-closed; budget propagation lag; overload behavior under partial outages |
| I09 Sequence / Identifier Generation | snowflake service, monotonic ticket dispenser, order-number allocator | ID allocation RPS, block lease rate, worker-id pool pressure, skew window | `id_RPS = ids_per_sec`; `block_lease_RPS = id_RPS/block_size`; `worker_pool_util = active_generators/worker_id_pool`; `rollback_risk_window = max_clock_skew_sec` | allocator hotspot or clock rollback | uniqueness vs monotonicity requirement; gap tolerance; allocator failover semantics; epoch rollover handling |
| I10 Membership / Presence / Registry | service registry, presence system, node registry | registration rate, heartbeat fan-in, false-death window, watch fanout | `heartbeat_RPS = members/heartbeat_interval_sec`; `expiry_scan_RPS = members/expiry_scan_interval_sec`; `watch_push_RPS = watchers*membership_changes_per_sec` | heartbeat fan-in or watch fanout | false-death budget; lookup freshness bound; ghost-member cleanup; anti-entropy after missed watches |
| I11 Control Plane + Snapshot Distribution | feature flag platform, config distribution, xDS-like control plane | config mutate RPS, publication fanout, apply lag, snapshot bytes | `fanout_RPS = targets*updates_per_sec`; `update_Bps = fanout_RPS*avg_snapshot_bytes`; `convergence_seconds = targets/apply_ack_rate` | config fanout or slow apply convergence | monotonic version rule; rollback budget; partial rollout detection; target behavior on control-plane partition |
| I12 Workflow + External Side Effect | payment workflow, webhook delivery engine, provisioning workflow | transition TPS, side-effect RPS, retry rate, stuck-workflow cardinality | `transition_RPS = active_entities*transitions_per_day/86400`; `effect_RPS = transition_RPS*(1+retry_factor)`; `stuck_items = workflows_in_nonterminal_state*stuck_fraction` | side-effect latency or retry amplification | idempotency surface; reconciliation scan cadence; exactly-once vs at-least-once effect semantics; poison workflow handling |
| I13 Shared Subject Coordination | collaborative editor, whiteboard, shared cursor state | ops per subject, concurrent editors, broadcast fanout, snapshot cadence | `op_RPS = editors_per_subject*ops_per_editor_sec`; `broadcast_RPS = op_RPS*active_subscribers`; `replay_ops = ops_since_snapshot` | single-subject coordinator or replay cost | ordering model per subject; merge/conflict semantics; late join replay budget; subject hotspot isolation |
| I14 Immutable Artifact Namespace + Delivery | artifact registry, blob/object store namespace, image/package distribution | publish RPS, fetch throughput, metadata RPS, GC backlog | `publish_RPS = artifacts_published_per_sec`; `fetch_Bps = downloads_per_sec*avg_artifact_bytes`; `gc_backlog_bytes = unreferenced_bytes_pending_collection` | metadata namespace hotspot or origin bandwidth | immutability guarantee; rollback by pointer or republish; retention/GC safety; cross-region replication lag |
| I15 Execution Fleet + Worker Substrate | serverless runtime, CI runner fleet, GPU job fleet, remote execution | arrival RPS, concurrent executions, worker-slot count, heartbeat rate, cold-start rate | `concurrency = arrival_RPS*avg_run_seconds`; `slot_count = concurrency/utilization_target`; `heartbeat_RPS = active_workers/heartbeat_interval_sec`; `cold_start_RPS = launches_per_sec*cold_start_fraction` | worker saturation, placement contention, or heartbeat fan-in | placement truth and fencing; preemption semantics; reclaim lag after worker loss; admission vs queueing under overload |
| I16 Key-Scoped Mutable State / Replicated KV | key-value store, session store, profile store, cart store | read RPS, write RPS, hot-key share, replication bandwidth, compaction/write amplification | `repl_Bps = write_RPS*avg_write_bytes*replication_factor`; `hot_key_RPS = read_RPS*top_key_share`; `storage_write_amp = physical_write_Bps/logical_write_Bps` | leader hotspot, hot key, or replication pressure | consistency level per key; failover read/write semantics; anti-entropy/repair budget; compaction impact on tail latency |
| I17 Traffic Steering / Request Mediation Plane | API gateway, load balancer, service mesh router, WAF | ingress RPS, active connections, route table size, health-check fanout, retry amplification | `conn_count = clients*avg_open_conns`; `health_RPS = backends*checks_per_sec`; `retry_RPS = ingress_RPS*retry_fraction`; `policy_eval_RPS = ingress_RPS*rules_checked_per_request` | hot VIP, connection table pressure, or retry amplification | route-config freshness; drain semantics; fail-open vs fail-closed policy checks; tail latency under retries and outlier ejection |
| I18 Telemetry / Time-Series Pipeline | metrics system, alerting platform, logs pipeline, infra monitoring | ingest RPS, active series/cardinality, rule eval fanout, query scan bytes, retention bytes | `ingest_RPS = emitters*samples_per_sec`; `active_series = emitters*metrics_per_emitter*label_cardinality_factor`; `query_scan_Bps = queried_points_per_sec*bytes_per_point`; `retention_bytes = ingest_Bps*retention_seconds` | high-cardinality blowup, ingest fan-in, or query lag | telemetry must not destabilize workload; sampling/drop policy under overload; alert delay budget; retention vs cost tradeoff |
| I19 Replicated Chunk / Block / File Storage Substrate | distributed file system, block store, chunk store, object-storage substrate | metadata ops RPS, chunk throughput, repair bandwidth, placement skew, rebuild time | `chunk_Bps = io_ops_per_sec*avg_chunk_bytes`; `repair_Bps = lost_replica_bytes/repair_window_sec`; `rebuild_seconds = lost_bytes/effective_repair_Bps`; `metadata_RPS = namespace_ops_per_sec` | metadata hotspot, repair bandwidth, or hot chunk | replica placement policy; durability target after correlated failure; degraded-read performance; background repair budget vs foreground IO |
| I20 Computation / Dataflow / DAG Execution | MapReduce/Spark/Flink-like engine, DAG scheduler, streaming dataflow engine | input throughput, task concurrency, shuffle bytes, checkpoint bytes, output commit rate, watermark lag | `task_concurrency = input_partitions*avg_parallelism_per_partition`; `shuffle_Bps = records_per_sec*avg_record_bytes*shuffle_fanout`; `checkpoint_Bps = state_bytes/checkpoint_interval_sec`; `watermark_lag_seconds = event_time_now - watermark_time` | shuffle pressure, checkpoint I/O, hot key, or scheduler bottleneck | stale attempt commit guard; exactly-once sink boundary; backpressure behavior; recovery time from latest checkpoint |
| I21 Trust Boundary / Cryptographic Proof Substrate | workload identity platform, artifact signing/provenance, trust-bundle distribution, revocation service | verification RPS, signing RPS, trust-bundle fanout, revocation freshness, audit write throughput | `verify_RPS = protected_requests_per_sec`; `sign_RPS = issued_statements_per_sec`; `trust_bundle_fanout_RPS = verifiers*bundle_updates_per_sec`; `audit_WPS = verification_events_per_sec*audit_sample_rate` | verifier hot path, stale revocation, or trust-bundle propagation lag | credential TTL; revocation freshness SLA; issuer compromise blast radius; audit retention and tamper evidence |
Cross-Cutting NFR Pass #
After you do the archetype row, force this short pass.
1. Correctness / Consistency #
Ask:
- what invariant is load-bearing here?
- what stale actor, stale version, or duplicate effect is unacceptable?
- what is the minimum consistency scope: key, partition, subject, workflow, quorum?
2. Availability / Failure Policy #
Ask:
- should this path fail closed or fail open under uncertainty?
- what is allowed to degrade independently?
- what is the smallest useful partial service mode?
3. Durability / Recoverability #
Ask:
- what acknowledged state must survive crash?
- what can be rebuilt from source truth or replay?
- what is the acceptable loss window for transient buffers, caches, snapshots, or telemetry?
4. Tail Latency / Freshness #
Ask:
- which path needs low `p95` or `p99`?
- where is bounded staleness acceptable?
- is freshness measured in milliseconds, seconds, minutes, or rollout waves?
5. Isolation / Backpressure #
Ask:
- can one tenant, hot key, or hot subject overload others?
- where does backpressure appear first?
- what is the admission or shedding rule when capacity is exhausted?
6. Cost / Repair Budget #
Ask:
- what background work grows with success: replay, compaction, repair, scan, watch fanout, checkpointing?
- what budget do you reserve for non-foreground work?
- what gets worse first when repair falls behind?
Quick Archetype-to-Cross-Cutting Emphasis #
Use these as the first extra questions after the main row.
| Archetype | Cross-cutting emphasis |
|---|---|
| I01 | correctness, quorum availability, watch freshness |
| I02 | fencing correctness, reclaim lag, fail-closed semantics |
| I03 | lateness SLA, duplicate-run tolerance, backlog recovery |
| I04 | checkpoint durability, coverage freshness, resumability |
| I05 | ordering, lag, replay cost, slow-consumer backpressure |
| I06 | freshness, rebuildability, backfill isolation |
| I07 | stale tolerance, stampede control, origin protection |
| I08 | fairness, overload policy, fail-open vs fail-closed |
| I09 | uniqueness, monotonicity, allocator failover |
| I10 | false death, lookup freshness, watch recovery |
| I11 | monotonic apply, rollout safety, rollback speed |
| I12 | idempotency, reconciliation, poison-work handling |
| I13 | per-subject ordering, hotspot isolation, replay budget |
| I14 | immutability, retention/GC safety, fetch bandwidth |
| I15 | placement fencing, cold starts, reclaim lag, admission policy |
| I16 | consistency level, replication lag, compaction cost |
| I17 | route freshness, retry amplification, drain behavior |
| I18 | cardinality control, alert delay, drop policy under overload |
| I19 | durability target, repair bandwidth, degraded-mode performance |
Variable Dimensions And Estimation Rules #
Use this section when the interviewer gives you only partial inputs.
For each archetype:
- `variable dimensions` tell you what dimensional form the variable should take
- `estimation rules` tell you how to derive a plausible estimate quickly
I01 Coordination / Consensus Metadata #
- Scale units:
- `quorum_write_RPS`: writes/second
- `watch_delivery_RPS`: watch events/second
- `renew_RPS`: renewals/second
- `election_rate`: elections/hour or elections/day
- Decision rules:
- start from number of operators, controllers, or automation jobs that mutate metadata
- if prompt says `N` clients watch config or membership, assume watch fanout is `N * updates_per_sec`
- if lease/session TTL is given, derive renew rate as `active_sessions / TTL_seconds`
- if failover is said to be rare, model election rate as low steady-state background plus burst during incidents
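The watch-fanout and renew-rate rules above reduce to one multiplication and one division. A minimal sketch, assuming hypothetical function names and made-up sample inputs (5,000 watchers, 2 updates/s, 30,000 sessions, 10 s TTL):

```python
def watch_delivery_rps(watchers: int, updates_per_sec: float) -> float:
    """Watch fanout: every update is delivered to every watcher."""
    return watchers * updates_per_sec


def renew_rps(active_sessions: int, ttl_seconds: float) -> float:
    """Renewals per second if each session renews once per TTL."""
    return active_sessions / ttl_seconds


# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
print(watch_delivery_rps(5_000, 2))  # 10000
print(renew_rps(30_000, 10))         # 3000.0
```

Clients often renew well before the TTL expires, so the renew figure is probably best read as a lower bound.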
I02 Claim / Lease / Exclusive Ownership #
- Scale units:
- `claim_RPS`: claim attempts/second
- `renew_RPS`: renewals/second
- `contention_RPS`: contended claim attempts/second on hottest key or shard
- `reclaim_delay`: seconds
- Decision rules:
- estimate total claim attempts from workers, jobs, or contenders entering the system per second
- if TTL is provided, renew rate is `active_leases / TTL_seconds`
- if prompt says a small subset of resources are popular, apply a hot-key share to total claim volume
- set reclaim delay from `lease TTL + detection lag + reassignment lag`
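As a quick check on the hot-key and reclaim-delay rules above, here is a small sketch. The function names and the sample numbers are assumptions for illustration, not values from any real system.

```python
def contention_rps(claim_rps: float, hot_key_share: float) -> float:
    """Contended claim attempts/second landing on the hottest key or shard."""
    return claim_rps * hot_key_share


def reclaim_delay_seconds(lease_ttl: float, detection_lag: float,
                          reassignment_lag: float) -> float:
    """Time before a dead holder's resource has a new owner."""
    return lease_ttl + detection_lag + reassignment_lag


# Hypothetical inputs: 2,000 claims/s with 25% aimed at one hot key;
# 15 s lease TTL, 3 s to detect expiry, 2 s to reassign.
print(contention_rps(2_000, 0.25))      # 500.0
print(reclaim_delay_seconds(15, 3, 2))  # 20
```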
I03 Due-Time Release + Claimable Run #
- Scale units:
- `due_RPS`: newly due jobs/second
- `backlog_seconds`: seconds of runnable backlog
- `claim_RPS`: claims/second
- `lateness_SLA`: seconds or minutes
- Decision rules:
- convert jobs/day or jobs/hour into average rate, then separately estimate the peak due bucket
- if due times cluster on minute boundaries, model a peak-to-average multiplier rather than using average only
- compute backlog in time, not just count, using `queue_depth / effective_claim_or_execute_rate`
- if interviewer says `reminders should feel near real time`, keep lateness in single-digit seconds; if it is batch, use minutes
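The gap between average and peak due rate is easy to underestimate, so a worked example may help. This is a sketch under assumed inputs (8.64M jobs/day, 60,000 jobs landing in one 60-second peak window, a 12,000-job runnable queue drained at 3,000 jobs/s); the helper names are hypothetical.

```python
def average_rps(jobs_per_day: int) -> float:
    """Average due rate, spreading the daily volume evenly."""
    return jobs_per_day / 86_400


def peak_due_rps(jobs_due_in_peak_window: int, window_seconds: float) -> float:
    """Due rate inside the busiest window."""
    return jobs_due_in_peak_window / window_seconds


def backlog_seconds(queue_depth: int, effective_rate: float) -> float:
    """Backlog expressed in time rather than count."""
    return queue_depth / effective_rate


avg = average_rps(8_640_000)      # 100.0 jobs/s
peak = peak_due_rps(60_000, 60)   # 1000.0 jobs/s
print(peak / avg)                 # 10.0 (peak-to-average multiplier)
print(backlog_seconds(12_000, 3_000))  # 4.0
```

With a single-digit-seconds lateness target, a 4-second backlog fits, but there is little headroom if the peak window is narrower than assumed.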
I04 Frontier Scan + Claimable Run #
- Scale units:
- `claim_RPS`: frontier claims/second
- `coverage_seconds`: seconds or hours to revisit the full frontier
- `checkpoint_WPS`: checkpoints/second
- `rediscovery_rate`: items rediscovered/second or rediscovery ratio
- Decision rules:
- derive claim rate from worker count and batch size/frequency
- derive coverage time from `total frontier items / effective scan rate`
- checkpoint frequency should scale with amount of work you can afford to replay after crash
- if prompt emphasizes freshness, push coverage interval down; if it emphasizes cost, allow longer revisit intervals
I05 Append Log + Consumer Progress #
- Scale units:
- `append_Bps`: bytes/second
- `append_RPS`: records/second
- `partition_write_RPS`: writes/second on hottest partition
- `lag_seconds`: seconds
- `replay_window`: hours or days
- Decision rules:
- start from producer count and per-producer event rate
- estimate bytes separately from record count because batching/compression can change bottlenecks
- if key distribution is skewed, apply a hot-partition share rather than assuming uniform spread
- use `consumer backlog / consumer throughput` to express lag in seconds since that maps better to SLAs
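A short sketch of the three log estimates above, keeping bytes, hottest-partition rate, and lag separate. The function names and inputs (50,000 records/s, 1,024-byte records, 12.5% of traffic on the hottest partition, a 1.2M-record backlog consumed at 60,000 records/s) are assumptions for illustration.

```python
def append_bps(append_rps: int, avg_record_bytes: int) -> int:
    """Append throughput in bytes/second."""
    return append_rps * avg_record_bytes


def hot_partition_rps(append_rps: int, hot_partition_share: float) -> float:
    """Writes/second landing on the hottest partition."""
    return append_rps * hot_partition_share


def lag_seconds(consumer_backlog: int, consume_rate: float) -> float:
    """Consumer lag expressed in seconds of work."""
    return consumer_backlog / consume_rate


print(append_bps(50_000, 1_024))         # 51200000
print(hot_partition_rps(50_000, 0.125))  # 6250.0
print(lag_seconds(1_200_000, 60_000))    # 20.0
```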
I06 Projection / Index / Search Pipeline #
- Scale units:
- `source_WPS`: source writes/second
- `projection_WPS`: derived writes/second
- `lag_seconds`: seconds or minutes
- `rebuild_seconds`: minutes, hours, or days
- Decision rules:
- identify the canonical source-of-truth mutation rate first
- multiply by average number of downstream index or projection updates per source mutation
- if prompt involves fanout by follower/subscriber/tag, model projection amplification explicitly
- set rebuild time from total corpus size and realistic sustained rebuild bandwidth, not peak hardware bandwidth
I07 Cache / Origin Projection / Edge Delivery #
- Scale units:
- `read_RPS`: reads/second
- `origin_RPS`: cache misses/second reaching origin
- `miss_burst_RPS`: miss burst requests/second
- `invalidation_rate`: invalidations/second
- `memory_bytes`: bytes
- Decision rules:
- derive read rate from active clients times per-client request rate
- derive origin load from `read_RPS * (1 - hit_ratio)`
- if TTL expiry is synchronized, estimate miss burst separately from steady-state miss rate
- size memory from hot working set, not total corpus, unless prompt says full-cache mirror
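To see why the miss burst deserves its own estimate, the sketch below compares steady-state origin load with a synchronized-expiry burst. The function names and inputs (200,000 reads/s, a 95% hit ratio, 20% of traffic hitting keys that expire together) are hypothetical.

```python
def origin_rps(read_rps: float, hit_ratio: float) -> float:
    """Steady-state misses/second reaching origin."""
    return read_rps * (1 - hit_ratio)


def miss_burst_rps(read_rps: float, simultaneous_expiry_share: float) -> float:
    """Misses/second when a share of keys expires at the same moment."""
    return read_rps * simultaneous_expiry_share


steady = origin_rps(200_000, 0.95)
burst = miss_burst_rps(200_000, 0.2)
print(round(steady))  # 10000
print(round(burst))   # 40000
```

Under these assumptions the burst is roughly four times the steady-state miss rate, which is the number the origin would need to survive.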
I08 Traffic Shaping / Admission Control #
- Scale units:
- `eval_RPS`: decisions/second
- `budget_WPS`: budget mutations/second
- `hot_tenant_RPS`: requests/second for hottest tenant
- `queue_wait_seconds`: seconds
- Decision rules:
- start with total incoming requests on the guarded path
- only a fraction of requests may mutate budget state; separate evaluation from mutation
- if prompt is multi-tenant, explicitly estimate top-tenant share
- if system queues before admit/reject, express overload in queue-wait time rather than raw queue length
I09 Sequence / Identifier Generation #
- Scale units:
- `id_RPS`: IDs/second
- `block_lease_RPS`: block leases/second
- `worker_pool_util`: fraction or percent
- `rollback_risk_window`: milliseconds or seconds
- Decision rules:
- derive ID rate from operations that need new IDs, not from all requests
- if using range leasing, convert allocation rate into coordinator lease rate via `id_RPS / block_size`
- if generators need unique worker IDs, compare active generators to available ID slots
- if prompt requires monotonicity, ask or assume a maximum tolerable clock rollback window
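The range-leasing and worker-ID rules above can be checked with two divisions. A minimal sketch with hypothetical names and inputs (1M IDs/s, blocks of 10,000 IDs, 768 active generators against a 1,024-slot worker-ID pool):

```python
def block_lease_rps(id_rps: float, block_size: int) -> float:
    """Lease requests/second reaching the central allocator."""
    return id_rps / block_size


def worker_pool_util(active_generators: int, worker_id_pool: int) -> float:
    """Fraction of the worker-ID space currently in use."""
    return active_generators / worker_id_pool


print(block_lease_rps(1_000_000, 10_000))  # 100.0
print(worker_pool_util(768, 1_024))        # 0.75
```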
I10 Membership / Presence / Registry #
- Scale units:
- `heartbeat_RPS`: heartbeats/second
- `expiry_scan_RPS`: expiry checks/second
- `watch_push_RPS`: membership updates delivered/second
- `false_death_window`: seconds
- Decision rules:
- derive heartbeat rate from member count and heartbeat interval
- if expiry scanning is centralized, estimate check volume per scan interval
- watch push load is `watchers * meaningful membership changes per second`
- set false-death window from heartbeat interval, missed-heartbeat threshold, and network jitter budget
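The rules above name the inputs to the false-death window but not how to combine them. The sketch below assumes one plausible combination, `interval * missed_threshold + jitter_budget`; that formula, the function names, and the sample inputs are all assumptions.

```python
def heartbeat_rps(members: int, heartbeat_interval_sec: float) -> float:
    """Heartbeat fan-in across the whole membership."""
    return members / heartbeat_interval_sec


def false_death_window_sec(heartbeat_interval_sec: float,
                           missed_threshold: int,
                           jitter_budget_sec: float) -> float:
    """Assumed model: declare death after N missed heartbeats plus jitter slack."""
    return heartbeat_interval_sec * missed_threshold + jitter_budget_sec


# Hypothetical inputs: 100,000 members, 5 s interval, 3 misses, 2 s jitter budget.
print(heartbeat_rps(100_000, 5))        # 20000.0
print(false_death_window_sec(5, 3, 2))  # 17
```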
I11 Control Plane + Snapshot Distribution #
- Scale units:
- `fanout_RPS`: target updates/second
- `update_Bps`: bytes/second distributed
- `convergence_seconds`: seconds or minutes
- `config_mutate_RPS`: control writes/second
- Decision rules:
- derive mutate rate from humans, deploy controllers, or automation systems changing truth
- turn rollout size into fanout by dividing targets by desired rollout interval
- multiply target update rate by average snapshot or delta size to get distribution bandwidth
- derive convergence from fleet size and realistic ack/apply throughput, not ideal broadcast speed
I12 Workflow + External Side Effect #
- Scale units:
- `transition_RPS`: state transitions/second
- `effect_RPS`: side-effect attempts/second
- `retry_rate`: retries/second or retry fraction
- `stuck_items`: count
- Decision rules:
- estimate transition rate from active entities and transitions per entity per unit time
- side-effect rate is usually `transitions * (1 + retry_factor)` rather than equal to transition rate
- estimate stuck-work count from timeout rate, provider failure rate, or reconciliation lag
- if prompt is payment/provisioning/booking, assume idempotency and retries are first-class, not edge cases
I13 Shared Subject Coordination #
- Scale units:
- `op_RPS`: operations/second per subject
- `broadcast_RPS`: fanout deliveries/second
- `replay_ops`: operations
- `snapshot_cadence`: seconds, minutes, or ops between snapshots
- Decision rules:
- derive per-subject op rate from concurrent editors times per-editor operation frequency
- fanout is per-subject op rate times active subscribers, not total system users
- size replay budget by the maximum join latency or reconnect cost you can tolerate
- if prompt has hot rooms/docs/canvases, model hottest-subject load separately from average subject load
I14 Immutable Artifact Namespace + Delivery #
- Scale units:
- `publish_RPS`: publishes/second
- `fetch_Bps`: bytes/second
- `metadata_RPS`: metadata ops/second
- `gc_backlog_bytes`: bytes
- Decision rules:
- estimate publish rate from build, release, or upload frequency
- estimate fetch throughput from download concurrency and average artifact size
- separate metadata ops from bulk bytes because metadata hotspots often break first
- derive GC backlog from publish churn times retention window before unreachable data can be removed
I15 Execution Fleet + Worker Substrate #
- Scale units:
- `arrival_RPS`: executions/second arriving
- `concurrency`: concurrent running executions
- `slot_count`: worker slots
- `heartbeat_RPS`: heartbeats/second
- `cold_start_RPS`: cold starts/second
- Decision rules:
- derive arrival rate from incoming jobs, invocations, or tasks
- derive concurrency using Little's Law: `arrival_rate * average_run_time`
- derive slot count by dividing concurrency by target utilization, not by theoretical max
- estimate heartbeat fan-in from active workers and interval
- if prompt is bursty serverless or CI, estimate cold starts separately from average launches
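A worked instance of the Little's Law step above, followed by the utilization adjustment. The function names and inputs (500 executions/s, 2 s average run time, a 70% utilization target, 10% of 50 launches/s being cold) are assumptions for illustration.

```python
import math


def concurrency(arrival_rps: float, avg_run_seconds: float) -> float:
    """Little's Law: average work in flight = arrival rate * time in system."""
    return arrival_rps * avg_run_seconds


def slot_count(concurrent_runs: float, utilization_target: float) -> int:
    """Worker slots needed to run at the target utilization, rounded up."""
    return math.ceil(concurrent_runs / utilization_target)


def cold_start_rps(launches_per_sec: float, cold_start_fraction: float) -> float:
    """Cold starts/second, estimated separately from total launches."""
    return launches_per_sec * cold_start_fraction


running = concurrency(500, 2)
print(running)                   # 1000
print(slot_count(running, 0.7))  # 1429
print(cold_start_rps(50, 0.1))   # 5.0
```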
I16 Key-Scoped Mutable State / Replicated KV #
- Scale units:
- `read_RPS`: reads/second
- `write_RPS`: writes/second
- `hot_key_RPS`: hottest-key reads or writes/second
- `repl_Bps`: replication bytes/second
- `storage_write_amp`: ratio
- Decision rules:
- estimate read/write rate from active clients and per-client operation rate
- apply a skew factor for hottest keys rather than assuming uniform traffic
- derive replication bandwidth from write rate, average mutation size, and replica count
- if storage engine details matter, separate logical writes from physical storage amplification
I17 Traffic Steering / Request Mediation Plane #
- Scale units:
- `ingress_RPS`: requests/second
- `conn_count`: active connections
- `health_RPS`: health checks/second
- `retry_RPS`: retries/second
- `policy_eval_RPS`: policy evaluations/second
- Decision rules:
- derive ingress from clients or upstream services and per-client request rate
- derive active connections from open-session model, not from request rate alone
- derive health-check load from backend count and health-check cadence
- estimate retries as a fraction of ingress under normal and degraded conditions separately
I18 Telemetry / Time-Series Pipeline #
- Scale units:
- `ingest_RPS`: samples or events/second
- `active_series`: count
- `query_scan_Bps`: bytes/second scanned
- `retention_bytes`: bytes
- `alert_delay`: seconds
- Decision rules:
- start from emitter count and per-emitter metrics/log events rate
- derive active series from metric names times label combinations, not just host count
- query scan cost should be estimated from points touched per query and concurrent dashboards/alerts
- retention bytes comes from ingest throughput times retention duration after compression assumptions
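Cardinality and retention are the two estimates here that tend to surprise, so the sketch below works both with integers. The function names, the 2-bytes-per-sample compressed size, and the other inputs are assumptions, not measurements.

```python
def active_series(emitters: int, metrics_per_emitter: int,
                  label_cardinality_factor: int) -> int:
    """Series count grows with label combinations, not just host count."""
    return emitters * metrics_per_emitter * label_cardinality_factor


def retention_bytes(ingest_rps: int, bytes_per_sample: int,
                    retention_days: int) -> int:
    """Stored bytes for the retention window, using post-compression sample size."""
    return ingest_rps * bytes_per_sample * retention_days * 86_400


# Hypothetical inputs: 2,000 emitters, 500 metrics each, 4 label combinations;
# 400,000 samples/s at an assumed 2 bytes/sample, kept for 30 days.
print(active_series(2_000, 500, 4))     # 4000000
print(retention_bytes(400_000, 2, 30))  # 2073600000000
```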
I19 Replicated Chunk / Block / File Storage Substrate #
- Scale units:
- `metadata_RPS`: metadata operations/second
- `chunk_Bps`: chunk/block bytes/second
- `repair_Bps`: repair bytes/second
- `rebuild_seconds`: seconds, hours, or days
- `placement_skew`: fraction or percent imbalance
- Decision rules:
- estimate metadata rate from namespace ops like create/open/list/rename/attach
- estimate chunk throughput from foreground IO volume and average block/chunk size
- derive repair bandwidth from durability target and time-to-repair requirement after replica loss
- if prompt mentions rack/AZ awareness, explicitly model placement skew and correlated failure domains
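The repair-bandwidth rule above runs in both directions: a repair window implies a bandwidth, and an available bandwidth implies a rebuild time. A sketch with hypothetical names and inputs (7.2 TB of lost replicas, a 2-hour repair target, and only 500 MB/s effectively available):

```python
def repair_bps(lost_replica_bytes: int, repair_window_sec: int) -> float:
    """Bandwidth needed to restore lost replicas inside the repair window."""
    return lost_replica_bytes / repair_window_sec


def rebuild_seconds(lost_bytes: int, effective_repair_bps: int) -> float:
    """Rebuild time at the bandwidth repair can actually sustain."""
    return lost_bytes / effective_repair_bps


lost = 7_200_000_000_000  # 7.2 TB, assumed
print(repair_bps(lost, 7_200))             # 1000000000.0
print(rebuild_seconds(lost, 500_000_000))  # 14400.0
```

Here the 2-hour target needs about 1 GB/s, while the assumed 500 MB/s budget stretches the rebuild to 4 hours.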
Capacity Estimation From NFR Targets #
This is the sizing section.
Use it when the interviewer asks:
- how many nodes or shards do you need?
- how much bandwidth or storage do you need?
- how many workers, consumers, or replicas does the NFR imply?
For each archetype:
- `capacity units` tell you what you are sizing
- `capacity rules` tell you how to derive a first-cut number from the NFR target
I01 Coordination / Consensus Metadata #
- Capacity units:
- quorum groups
- metadata partitions
- watch fanout replicas
- network bandwidth for watch delivery
- Capacity rules:
- `required_quorum_groups = ceil(quorum_write_RPS / sustainable_write_RPS_per_group)`
- `required_watch_replicas = ceil(watch_delivery_RPS / sustainable_watch_events_per_replica)`
- if watch fanout load dominates but write rate is low, size separate fanout replicas rather than more quorum writers
- if one metadata domain exceeds per-group latency or write target, split into another partition rather than stretching one quorum group
I02 Claim / Lease / Exclusive Ownership #
- Capacity units:
- lease-service partitions
- renew-handling replicas
- reclaim workers
- Capacity rules:
- `required_partitions = ceil(claim_RPS / sustainable_claim_RPS_per_partition)`
- `required_renew_capacity = ceil(renew_RPS / sustainable_renew_RPS_per_replica)`
- `required_reclaim_workers = ceil(expired_claims_per_sec / reclaims_per_worker_sec)`
- if hottest-key contention exceeds single-partition capacity, no amount of replica scaling fixes it; size around smaller claim domains instead
I03 Due-Time Release + Claimable Run #
- Capacity units:
- releaser/scanner workers
- runnable queue partitions
- execution workers
- Capacity rules:
- `required_releasers = ceil(due_RPS / jobs_released_per_worker_sec)`
- `required_workers = ceil(due_RPS / jobs_completed_per_worker_sec)`
- `required_queue_partitions = ceil(claim_RPS / sustainable_claim_RPS_per_partition)`
- if lateness SLA is `L`, keep `backlog_seconds < L`; otherwise add releaser throughput or execution throughput depending on which stage is saturated
I04 Frontier Scan + Claimable Run #
- Capacity units:
- scan workers
- frontier shards
- checkpoint write throughput
- Capacity rules:
- `required_scan_workers = ceil(frontier_items / target_coverage_seconds / items_scanned_per_worker_sec)`
- `required_frontier_shards = ceil(claim_RPS / sustainable_claim_RPS_per_shard)`
- `required_checkpoint_WPS = workers / checkpoint_interval_sec`
- if target coverage interval tightens, worker count scales roughly inversely with allowed revisit time
I05 Append Log + Consumer Progress #
- Capacity units:
- partitions
- brokers
- consumer workers
- storage bytes
- Capacity rules:
- `required_partitions = max(ceil(append_RPS / target_records_per_partition_sec), ceil(append_Bps / target_bytes_per_partition_sec))`
- `required_brokers = ceil(required_partitions / target_partitions_per_broker)`
- `required_consumers = ceil(append_RPS / records_processed_per_consumer_sec)`
- `required_storage = append_Bps * retention_seconds * replication_factor`
- if lag SLA is `L`, consumer throughput must exceed ingest enough that steady-state `lag_seconds < L`
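The partition rule above takes the larger of a record-rate bound and a byte-rate bound, which is easy to get wrong by sizing on only one. A sketch under assumed per-partition and per-broker targets; every name and number here is hypothetical.

```python
import math


def required_partitions(append_rps: int, append_bps: int,
                        target_records_per_partition_sec: int,
                        target_bytes_per_partition_sec: int) -> int:
    """Size by whichever limit (records or bytes) needs more partitions."""
    by_records = math.ceil(append_rps / target_records_per_partition_sec)
    by_bytes = math.ceil(append_bps / target_bytes_per_partition_sec)
    return max(by_records, by_bytes)


def required_brokers(partitions: int, target_partitions_per_broker: int) -> int:
    return math.ceil(partitions / target_partitions_per_broker)


def required_storage_bytes(append_bps: int, retention_seconds: int,
                           replication_factor: int) -> int:
    return append_bps * retention_seconds * replication_factor


# Assumed: 50,000 records/s, 51.2 MB/s, 5,000 records/s and 4 MB/s per partition.
partitions = required_partitions(50_000, 51_200_000, 5_000, 4_000_000)
print(partitions)                                         # 13 (byte-bound)
print(required_brokers(partitions, 4))                    # 4
print(required_storage_bytes(51_200_000, 7 * 86_400, 3))  # 92897280000000
```

In this example the record-rate bound alone would have suggested 10 partitions; the byte-rate bound pushes it to 13.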
I06 Projection / Index / Search Pipeline #
- Capacity units:
- projector/indexer workers
- index shards
- rebuild lanes
- query replicas
- Capacity rules:
- `required_projectors = ceil(projection_WPS / writes_applied_per_projector_sec)`
- `required_query_replicas = ceil(query_RPS / queries_served_per_replica_sec)`
- `required_shards = ceil(index_bytes / target_bytes_per_shard)` or `ceil(query_fanout / acceptable_shard_fanout)`
- if freshness SLA is `F`, provision projector throughput so `queued_updates / projector_rate < F`
I07 Cache / Origin Projection / Edge Delivery #
- Capacity units:
- cache nodes
- memory bytes
- POPs or cache tiers
- origin capacity behind misses
- Capacity rules:
- `required_memory = hot_working_set_bytes / target_memory_utilization`
- `required_cache_nodes = ceil(required_memory / usable_memory_per_node)`
- `required_origin_RPS = read_RPS * (1 - hit_ratio_target)`
- if miss-storm peak is the real NFR, size origin and cache fill path for `miss_burst_RPS`, not average miss rate
I08 Traffic Shaping / Admission Control #
- Capacity units:
- evaluator replicas
- policy distribution replicas
- budget-store partitions
- queue slots
- Capacity rules:
- `required_evaluators = ceil(eval_RPS / decisions_per_replica_sec)`
- `required_budget_partitions = ceil(budget_WPS / budget_updates_per_partition_sec)`
- `required_queue_slots = queue_wait_SLA_seconds * admit_rate`
- if fail-closed decision latency must stay under `p99`, evaluator capacity must be sized from peak `eval_RPS`, not average
I09 Sequence / Identifier Generation #
- Capacity units:
- allocator replicas
- worker-id space
- lease block size
- Capacity rules:
- `required_allocator_RPS = id_RPS / block_size`
- `required_allocator_replicas = ceil(required_allocator_RPS / leases_per_allocator_sec)`
- `required_worker_id_space >= peak_concurrent_generators`
- if allocator path is hot, the first capacity move is larger block size because it lowers central lease traffic directly
I10 Membership / Presence / Registry #
- Capacity units:
- registry write partitions
- watch fanout replicas
- read replicas or caches
- Capacity rules:
- `required_registry_capacity = ceil(heartbeat_RPS / heartbeats_processed_per_replica_sec)`
- `required_watch_replicas = ceil(watch_push_RPS / pushes_per_replica_sec)`
- `required_read_replicas = ceil(lookup_RPS / lookups_per_replica_sec)` when lookup path is separate
- if freshness NFR is loose, the heartbeat interval can be increased, which directly lowers the write capacity needed
I11 Control Plane + Snapshot Distribution #
- Capacity units:
- control-plane writers
- fanout workers
- distribution bandwidth
- rollout waves
- Capacity rules:
- `required_fanout_workers = ceil(fanout_RPS / updates_pushed_per_worker_sec)`
- `required_bandwidth_Bps = fanout_RPS * avg_snapshot_or_delta_bytes`
- `max_targets_per_wave = rollout_SLA_seconds * apply_ack_rate`
- if convergence SLA is tight, size by peak wave rather than by average update cadence
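A small sketch of the three control-plane sizing rules above. The function names and the inputs (20,000 target updates/s, 2,500 pushes/s per worker, 50 KB snapshots, a 120-second rollout SLA with 500 acks/s) are assumed for illustration.

```python
import math


def required_fanout_workers(fanout_rps: int,
                            updates_pushed_per_worker_sec: int) -> int:
    return math.ceil(fanout_rps / updates_pushed_per_worker_sec)


def required_bandwidth_bps(fanout_rps: int, avg_snapshot_or_delta_bytes: int) -> int:
    return fanout_rps * avg_snapshot_or_delta_bytes


def max_targets_per_wave(rollout_sla_seconds: int, apply_ack_rate: int) -> int:
    """Largest wave that can still be acknowledged inside the rollout SLA."""
    return rollout_sla_seconds * apply_ack_rate


print(required_fanout_workers(20_000, 2_500))  # 8
print(required_bandwidth_bps(20_000, 50_000))  # 1000000000
print(max_targets_per_wave(120, 500))          # 60000
```

Sending deltas instead of full snapshots shrinks the bandwidth term directly, since it scales with bytes per update.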
I12 Workflow + External Side Effect #
- Capacity units:
- workflow workers
- side-effect workers
- reconciliation scanners
- idempotency-store throughput
- Capacity rules:
- `required_workers = ceil(effect_RPS / effects_processed_per_worker_sec)`
- `required_reconcilers = ceil(stuck_items / target_reconciliation_window_seconds / items_scanned_per_reconciler_sec)`
- `required_idempotency_store_RPS = effect_RPS * idempotency_reads_writes_per_effect`
- if retry amplification dominates, size from peak retry scenario, not clean-path effect rate
I13 Shared Subject Coordination #
- Capacity units:
- subject coordinators
- broadcast workers
- snapshot storage/write throughput
- Capacity rules:
- required_subject_capacity = hottest_subject_op_RPS / ops_ordered_per_coordinator_sec
- required_broadcast_workers = ceil(broadcast_RPS / fanout_events_per_worker_sec)
- required_snapshot_WPS = hot_subjects / snapshot_interval_sec
- size from hottest subject, not average subject, if the NFR is per-document or per-room responsiveness
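A hedged sketch of these rules follows. It reads required_subject_capacity as a load ratio, where a value above 1.0 would mean a single coordinator cannot order the hottest subject; that reading, the function name, and the numbers are assumptions.

```python
import math

def size_subject_coordination(hottest_subject_op_rps, ops_ordered_per_coordinator_sec,
                              broadcast_rps, fanout_events_per_worker_sec,
                              hot_subjects, snapshot_interval_sec):
    """Apply the three rules; the first result is a ratio, not a replica count."""
    return {
        "hottest_subject_load": hottest_subject_op_rps / ops_ordered_per_coordinator_sec,
        "broadcast_workers": math.ceil(broadcast_rps / fanout_events_per_worker_sec),
        "snapshot_wps": hot_subjects / snapshot_interval_sec,
    }

# Hypothetical rooms: a normal hot room versus one viral room.
normal = size_subject_coordination(800, 2_000, 120_000, 8_000, 3_000, 30)
viral = size_subject_coordination(2_500, 2_000, 120_000, 8_000, 3_000, 30)
print(normal, viral)
```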
I14 Immutable Artifact Namespace + Delivery #
- Capacity units:
- metadata partitions
- origin storage bandwidth
- edge cache/mirror nodes
- GC workers
- Capacity rules:
- required_metadata_partitions = ceil(metadata_RPS / metadata_ops_per_partition_sec)
- required_origin_Bps = fetch_Bps * miss_ratio_to_origin
- required_edge_nodes = ceil(fetch_Bps / sustainable_edge_Bps_per_node)
- required_gc_workers = ceil(gc_backlog_bytes / target_gc_window_seconds / bytes_reclaimed_per_worker_sec)
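The delivery and GC rules above can be tried out with a short sketch. The function name and all byte rates are hypothetical placeholders.

```python
import math

def size_artifact_delivery(fetch_bps, miss_ratio_to_origin,
                           sustainable_edge_bps_per_node,
                           gc_backlog_bytes, target_gc_window_seconds,
                           bytes_reclaimed_per_worker_sec):
    """Apply the origin, edge, and GC rules to one set of inputs."""
    return {
        "origin_Bps": fetch_bps * miss_ratio_to_origin,
        "edge_nodes": math.ceil(fetch_bps / sustainable_edge_bps_per_node),
        "gc_workers": math.ceil(
            gc_backlog_bytes / target_gc_window_seconds / bytes_reclaimed_per_worker_sec),
    }

# Hypothetical: 40 GB/s of fetches, 1-in-16 misses, 90 TB GC backlog per day.
delivery = size_artifact_delivery(
    fetch_bps=40_000_000_000,
    miss_ratio_to_origin=0.0625,
    sustainable_edge_bps_per_node=2_500_000_000,
    gc_backlog_bytes=90_000_000_000_000,
    target_gc_window_seconds=86_400,
    bytes_reclaimed_per_worker_sec=200_000_000,
)
print(delivery)
```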
I15 Execution Fleet + Worker Substrate #
- Capacity units:
- worker slots
- scheduler replicas
- warm pool instances
- heartbeat-processing capacity
- Capacity rules:
- required_slots = concurrency / target_utilization
- required_scheduler_replicas = ceil(arrival_RPS / placements_per_scheduler_sec)
- required_warm_pool = cold_start_sensitive_arrival_RPS * warm_window_seconds
- required_heartbeat_capacity = ceil(heartbeat_RPS / heartbeats_processed_per_replica_sec)
- if queueing SLA is tight, size slots from peak burst concurrency, not average concurrency
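The sketch below applies the slot, scheduler, and warm-pool rules. It derives average concurrency with Little's law (arrival rate times average task duration), which this section does not spell out, and treats the peak burst concurrency as a separately measured input. All names and numbers are assumptions.

```python
import math

def size_fleet(arrival_rps, avg_task_seconds, peak_burst_concurrency,
               target_utilization, placements_per_scheduler_sec,
               cold_start_sensitive_rps, warm_window_seconds):
    """Compare slot counts sized from average versus peak burst concurrency."""
    avg_concurrency = arrival_rps * avg_task_seconds  # Little's law (assumed)
    return {
        "slots_from_average": math.ceil(avg_concurrency / target_utilization),
        "slots_from_peak": math.ceil(peak_burst_concurrency / target_utilization),
        "scheduler_replicas": math.ceil(arrival_rps / placements_per_scheduler_sec),
        "warm_pool": cold_start_sensitive_rps * warm_window_seconds,
    }

# Hypothetical workload: 400 tasks/sec, 3 s each, bursts to 2,100 concurrent.
fleet = size_fleet(400, 3, 2_100, 0.75, 150, 50, 20)
print(fleet)
```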
I16 Key-Scoped Mutable State / Replicated KV #
- Capacity units:
- shards
- replicas
- disk/network throughput
- cache capacity for hot keys
- Capacity rules:
- required_shards = max(ceil(write_RPS / writes_per_shard_sec), ceil(data_bytes / target_bytes_per_shard))
- required_replicas = durability_target_implied_replica_count
- required_repl_bandwidth = repl_Bps
- required_hot_key_cache = hot_key_working_set_bytes if hot reads dominate
- if p99 write latency is capped tightly, size shard count from hottest shard write rate, not average shard rate
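Because required_shards is a max of two ceilings, it helps to keep both terms visible so the binding constraint is obvious. A minimal sketch with hypothetical names and inputs:

```python
import math

TB = 10**12  # decimal terabyte, chosen for this example

def required_shards(write_rps, writes_per_shard_sec, data_bytes, target_bytes_per_shard):
    """Return (shards, shards_by_writes, shards_by_bytes)."""
    by_writes = math.ceil(write_rps / writes_per_shard_sec)
    by_bytes = math.ceil(data_bytes / target_bytes_per_shard)
    return max(by_writes, by_bytes), by_writes, by_bytes

# Hypothetical store: 90k writes/sec, 40 TB of data, 2 TB target per shard.
shards, by_writes, by_bytes = required_shards(90_000, 8_000, 40 * TB, 2 * TB)
print(shards, by_writes, by_bytes)
```

In this example the store is data-bound rather than write-bound, so adding write throughput per shard would not reduce the shard count.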
I17 Traffic Steering / Request Mediation Plane #
- Capacity units:
- proxy/gateway instances
- connection table entries
- health-check workers
- route/policy distribution replicas
- Capacity rules:
- required_instances = max(ceil(ingress_RPS / requests_per_instance_sec), ceil(conn_count / connections_per_instance))
- required_health_capacity = ceil(health_RPS / checks_per_worker_sec)
- required_policy_capacity = ceil(policy_eval_RPS / policy_evals_per_replica_sec)
- if p99 latency is the main NFR, size from connection-heavy and retry-heavy peak, not clean-path request average
I18 Telemetry / Time-Series Pipeline #
- Capacity units:
- ingest shards
- query replicas
- storage bytes
- rollup workers
- Capacity rules:
- required_ingest_shards = ceil(ingest_RPS / samples_per_shard_sec)
- required_query_replicas = ceil(query_scan_Bps / bytes_scanned_per_replica_sec)
- required_storage = retention_bytes
- required_rollup_workers = ceil(active_series / series_aggregated_per_worker_sec)
- if cost ceiling is part of the NFR, solve for max ingest or retention under that storage budget before sizing hardware
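Solving backwards from a storage budget might look like the sketch below. It assumes retention_bytes = ingest_RPS * bytes_per_sample * retention_seconds, ignoring replication and rollup savings; that model, the helper names, and the numbers are all illustrative.

```python
def max_retention_seconds(storage_budget_bytes, ingest_rps, bytes_per_sample):
    """Longest retention that fits the budget at a fixed ingest rate."""
    return storage_budget_bytes / (ingest_rps * bytes_per_sample)

def max_ingest_rps(storage_budget_bytes, retention_seconds, bytes_per_sample):
    """Highest ingest rate that fits the budget at a fixed retention."""
    return storage_budget_bytes / (retention_seconds * bytes_per_sample)

# Hypothetical: 100 TB budget, 2M samples/sec, 2 bytes per compressed sample.
budget = 100 * 10**12
retention = max_retention_seconds(budget, 2_000_000, 2)
ingest = max_ingest_rps(budget, retention, 2)
print(retention / 86_400, "days at", ingest, "samples/sec")
```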
I19 Replicated Chunk / Block / File Storage Substrate #
- Capacity units:
- metadata partitions
- storage nodes/disks
- repair workers/bandwidth
- replica bytes
- Capacity rules:
- required_metadata_partitions = ceil(metadata_RPS / metadata_ops_per_partition_sec)
- required_storage_nodes = ceil(chunk_Bps / usable_Bps_per_node), and separately ceil(total_stored_bytes / usable_bytes_per_node)
- required_repair_Bps = lost_replica_bytes / target_repair_window_seconds
- required_total_storage = logical_data_bytes * replication_factor / usable_storage_fraction
- if durability NFR says repair within T after one-node loss, size repair bandwidth directly from that target rather than from foreground IO average
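A small sketch of the storage-substrate rules, with repair bandwidth derived from the repair window rather than from foreground IO. It takes the larger of the two storage-node ceilings as the node count, which is one reading of "and separately"; names and figures are hypothetical.

```python
import math

TB = 10**12  # decimal terabyte, chosen for this example

def repair_bps(lost_replica_bytes, target_repair_window_seconds):
    """Bandwidth needed to re-replicate lost bytes within the window."""
    return lost_replica_bytes / target_repair_window_seconds

def total_storage(logical_data_bytes, replication_factor, usable_storage_fraction):
    """Raw bytes to provision for a given logical data size."""
    return logical_data_bytes * replication_factor / usable_storage_fraction

def storage_nodes(chunk_bps, usable_bps_per_node, total_stored_bytes, usable_bytes_per_node):
    """Node count as the larger of the throughput and capacity ceilings."""
    return max(math.ceil(chunk_bps / usable_bps_per_node),
               math.ceil(total_stored_bytes / usable_bytes_per_node))

# Hypothetical: one 12 TB node lost, repair within 6 hours.
repair = repair_bps(12 * TB, 6 * 3_600)
raw = total_storage(500 * TB, 3, 0.75)
nodes = storage_nodes(30 * 10**9, 10**9, 1_500 * TB, 30 * TB)
print(round(repair / 10**6), "MB/s repair,", raw / TB, "TB raw,", nodes, "nodes")
```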
I20 Computation / Dataflow / DAG Execution #
- Capacity units:
- scheduler throughput
- worker task slots
- shuffle bandwidth/storage
- checkpoint bandwidth/storage
- sink commit throughput
- Capacity rules:
- required_task_slots = ceil(task_concurrency / target_slot_utilization)
- required_scheduler_capacity = ceil(task_launch_RPS / launches_per_scheduler_sec)
- required_shuffle_Bps = shuffle_Bps
- required_checkpoint_Bps = checkpoint_Bps
- required_sink_commit_capacity = ceil(output_commit_RPS / commits_per_committer_sec)
- if correctness depends on exactly-once output, size from checkpoint and commit boundaries, not only from clean-path operator throughput
I21 Trust Boundary / Cryptographic Proof Substrate #
- Capacity units:
- verifier replicas/cache capacity
- signer/HSM throughput
- trust-bundle distribution fanout
- revocation publication latency
- audit log partitions
- Capacity rules:
- required_verifiers = ceil(verify_RPS / verifications_per_verifier_sec)
- required_signers = ceil(sign_RPS / signatures_per_signer_sec)
- required_bundle_fanout_workers = ceil(trust_bundle_fanout_RPS / bundle_updates_per_worker_sec)
- required_audit_partitions = ceil(audit_WPS / writes_per_partition_sec)
- if stale revocation is the critical risk, size freshness from revocation propagation SLA rather than average issuer throughput
Interview One-Liner #
For infra prompts, I would first classify the dominant archetype, compute the first load-bearing variables for that shape, name the first bottleneck, and then run a short cross-cutting pass over correctness, availability, durability, freshness, isolation, and repair budget.