Metrics Monitoring Platform (Prometheus-class: TSDB, Scrape, PromQL) #

This note models a Prometheus-class monitoring platform where scrapers pull metric samples from targets, samples are stored in a time-series database, users query with label selectors and range functions, and rules/alerts are continuously evaluated over stored series.

Step 1 - Normalize #

Assume the baseline prompt is:

design a metrics monitoring platform
scrapers pull metrics from many targets
metric samples are stored in a TSDB
users query metrics with label filters and PromQL-like functions
rules and alerts are evaluated continuously
system scales across many targets and tenants

Normalize into state-affecting paths.

Requirement	Actor	Operation	State touched	Priority
Scraper discovers or refreshes scrape target	System	overwrite state	`S1` `update target` `ScrapeTargetState`	C1
Scraper ingests metric sample batch	System	append event	`S1` `create target` `MetricSample`	C1
System updates series metadata / index	System	state transition	`S1` `update target` `SeriesMetadata`	C1
User queries time series / labels	Client	read source	`S1` `read source target` `MetricSample`	R1
Rule engine evaluates recording or alert rule	System	read source	`S1` `read source target` `MetricSample`	C1
System writes recording rule output	System	append event	`S1` `create target` `RecordedSample`	C1
System updates alert instance state	System	state transition	`S1` `update target` `AlertState`	C1
System compacts / merges TSDB blocks	System	async process	`S1` `hidden write target` `TSDBBlockState`	C1
System routes tenant/shard to current owner	System	read source	`S1` `read source target` `PartitionMap`	C1
System reassigns shard ownership after node failure	System	state transition	`S1` `update target` `PartitionOwnership`	C1
User reads dashboards / status	Client	read projection	`S1` `read projection target` `MonitoringStatusView`	R2

Notes on normalization #

Important choices:

scrape target refresh is overwrite state
- current target set/health is current-value control state
sample ingestion is append event
- metric samples are immutable points in time
series metadata update is state transition
- label/index lifecycle evolves as new series appear or become inactive
rule evaluation is a read path
- it reads TSDB source state
recording-rule output is append-only
- derived samples are still immutable time-series points
alert lifecycle is a state transition
- alert instances move through inactive/pending/firing/resolved
compaction is explicit because TSDB storage lifecycle is core

This system is a hybrid of:

time-series append store
label index
continuous rule evaluation

Step 2 - Critical Path Selection #

Requirement	Priority class	Why
Discover / refresh scrape target state	C1	stale target state can break scrape coverage
Ingest metric samples	C1	primary product truth is the sample stream
Update series metadata / index	C1	queries depend on correct label-to-series mapping
Query time series	R1	core serving path
Evaluate recording / alert rules	C1	correctness of derived metrics and alerts depends on this path
Write recording rule output	C1	derived metrics become stored truth for downstream queries
Update alert state	C1	alert firing/resolution correctness depends on current rule evaluation
Compact TSDB blocks	C1	storage lifecycle and query correctness depend on safe compaction
Route to shard owner	C1	wrong routing can split ingestion/query truth
Reassign shard ownership	C1	failover must preserve ingestion/query/storage correctness
Dashboards / status	R2	operational only

Baseline critical paths #

Main C1 paths:

P1 refresh scrape target state
P2 ingest metric samples
P3 update series metadata
P4 evaluate rules
P5 write recording-rule output
P6 update alert state
P7 compact TSDB blocks
P8 route to shard owner
P9 reassign shard ownership

Main R1 path:

P10 query time series / labels

This design is driven by:

immutable sample ingestion
correct label/index maintenance
TSDB block lifecycle
rule and alert state derived from authoritative sample history

Step 3 - Primary State Extraction #

For a Prometheus-class system, the minimal primary state is scrape target state, metric samples, series metadata/index state, alert state, TSDB block lifecycle state, and routing/ownership state.

Candidate object label	Candidate source	Candidate needed for C1/R1?	Candidate decomposition action	Class	Primary?	Owner	Evolution	Scope kind	Scope value
ScrapeTargetState	direct noun	Yes	keep as candidate	process	Yes	service	overwrite	instance	target_id
MetricSample	direct noun	Yes	keep as candidate	event	Yes	service	append-only	instance	series_id + timestamp
SeriesMetadata	hidden write target	Yes	keep as candidate	entity	Yes	service	overwrite	instance	series_id
AlertState	lifecycle object	Yes	keep as candidate	process	Yes	service	state machine	instance	alert_key
TSDBBlockState	hidden write target	Yes	keep as candidate	process	Yes	service	state machine	instance	block_id
PartitionOwnership	hidden write target	Yes	keep as candidate	process	Yes	service	state machine	instance	shard_id
PartitionMap	hidden write target	Yes	keep as candidate	entity	Yes	service	overwrite	collection	tenant/shard map
MonitoringStatusView	derived read model	No	reject as UI artifact	projection	No	derived	overwrite	collection	tenant or cluster

Important modeling choices #

`ScrapeTargetState` #

Primary because:

target discovery and health determine what gets scraped
current endpoint labels, intervals, and scrape health are control truth

`MetricSample` #

Primary because:

sample ingestion is the core immutable fact stream

`SeriesMetadata` #

Primary because:

PromQL-style reads depend on label matching and series lookup
captures label set, fingerprint/series id, retention status, and head/block placement
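
To make the label-matching dependency concrete, here is a minimal sketch of an inverted label index (postings) that resolves equality selectors to series IDs. The `LabelIndex` type, its methods, and the equality-only matching are illustrative assumptions, not the platform's actual index format.

```go
package main

import (
	"fmt"
	"sort"
)

// LabelIndex is a hypothetical inverted index: "name=value" -> set of series IDs.
type LabelIndex struct {
	postings map[string]map[uint64]struct{}
}

func NewLabelIndex() *LabelIndex {
	return &LabelIndex{postings: make(map[string]map[uint64]struct{})}
}

// Add registers every label pair of a series in the index.
func (ix *LabelIndex) Add(seriesID uint64, labels map[string]string) {
	for name, value := range labels {
		key := name + "=" + value
		if ix.postings[key] == nil {
			ix.postings[key] = make(map[uint64]struct{})
		}
		ix.postings[key][seriesID] = struct{}{}
	}
}

// Select returns the sorted IDs of series that satisfy all equality matchers.
func (ix *LabelIndex) Select(matchers map[string]string) []uint64 {
	var result map[uint64]struct{}
	for name, value := range matchers {
		set := ix.postings[name+"="+value]
		if result == nil {
			// First matcher seeds the candidate set.
			result = make(map[uint64]struct{}, len(set))
			for id := range set {
				result[id] = struct{}{}
			}
			continue
		}
		// Later matchers intersect with the candidates.
		for id := range result {
			if _, ok := set[id]; !ok {
				delete(result, id)
			}
		}
	}
	ids := make([]uint64, 0, len(result))
	for id := range result {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(i, j int) bool { return ids[i] < ids[j] })
	return ids
}

func main() {
	ix := NewLabelIndex()
	ix.Add(1, map[string]string{"__name__": "http_requests_total", "job": "api"})
	ix.Add(2, map[string]string{"__name__": "http_requests_total", "job": "web"})
	ix.Add(3, map[string]string{"__name__": "up", "job": "api"})
	fmt.Println(ix.Select(map[string]string{"__name__": "http_requests_total", "job": "api"}))
}
```

A production index would also need regex and negative matchers and a compact on-disk postings encoding; the sketch only shows why series lookup is an intersection over per-label sets.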

`AlertState` #

Primary because:

alert instances have lifecycle across evaluations

`TSDBBlockState` #

Primary because:

head/block/compaction lifecycle determines storage truth and queryability

Minimal strict primary set #

The strongest minimal set is:

ScrapeTargetState
MetricSample
SeriesMetadata
AlertState
TSDBBlockState
PartitionOwnership
PartitionMap

Step 4 - Hard Invariants #

For a Prometheus-class monitoring platform, the hard invariants are about durable sample append semantics, correct label-to-series mapping, safe TSDB compaction, and valid alert lifecycle transitions.

Path	Tier	Type	Invariant statement
`P1` refresh scrape target state	HARD	ordering	Scrape-target revisions are ordered by monotonic target-state version within target scope.
`P2` ingest metric samples	HARD	uniqueness	Key `(series_id, timestamp)` maps to at most one logical outcome `stored sample value` within series scope.
`P2` ingest metric samples	HARD	ordering	Samples for `series_id` are ordered by timestamp within the authoritative ingestion scope.
`P3` update series metadata	HARD	accounting	`SeriesMetadata(series_id)` corresponds to exactly one canonical label set and storage placement for that series scope.
`P4` evaluate rules	HARD	freshness	Rule evaluation reads authoritative sample history and metadata within configured evaluation consistency bound.
`P5` write recording-rule output	HARD	uniqueness	Key `(recorded_series_id, timestamp)` maps to at most one logical outcome `recorded sample` within recorded-series scope.
`P6` update alert state	HARD	eligibility	Action `advance_alert_state` is valid only if current rule evaluation result and current `AlertState` allow the transition at decision time.
`P7` compact TSDB blocks	HARD	accounting	Compacted block contents equal the union of authoritative source blocks modulo dedup/retention rules, and source-to-destination block transition preserves queryable truth.
`P8` route to shard owner	HARD	uniqueness	Key `shard_id` maps to at most one logical outcome `current authoritative owner` within `shard_id`.
`P9` reassign shard ownership	HARD	eligibility	Action `reassign_shard` is valid only if `current owner is failed or relinquished and candidate owner is eligible and sufficiently current` on `shard_id` at decision time.
`P10` query time series	HARD	freshness	Query path reflects authoritative head plus block data within configured query consistency bound.

What matters most #

1. Samples are immutable time points #

Once accepted, sample points are append-only facts.

2. Label mapping must be canonical #

One label set must resolve to one canonical series identity within a shard/tenant scope.
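
One way to obtain a canonical identity is to hash the label pairs in sorted order, so that the same label set always yields the same ID regardless of how it was built. The sketch below assumes FNV-1a as the hash and a hypothetical `SeriesID` helper; neither is prescribed by this note.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// SeriesID derives a stable identity from a label set by hashing the
// label pairs in sorted-name order. The name and the hash choice are
// assumptions made for illustration.
func SeriesID(labels map[string]string) uint64 {
	names := make([]string, 0, len(labels))
	for name := range labels {
		names = append(names, name)
	}
	sort.Strings(names)

	h := fnv.New64a()
	for _, name := range names {
		h.Write([]byte(name))
		h.Write([]byte{0xff}) // 0xff never occurs in valid UTF-8, so it is a safe separator
		h.Write([]byte(labels[name]))
		h.Write([]byte{0xff})
	}
	return h.Sum64()
}

func main() {
	a := map[string]string{"__name__": "up", "job": "api", "instance": "10.0.0.1:9100"}
	b := map[string]string{"instance": "10.0.0.1:9100", "job": "api", "__name__": "up"}
	fmt.Println(SeriesID(a) == SeriesID(b))
}
```

A 64-bit hash can collide, so a shard owner that uses this approach would still compare the full label sets before treating two series as identical.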

3. Compaction must preserve query truth #

Block lifecycle changes cannot lose or double-count stored samples.

4. Alert state is derived but persistent #

Alerts are not just pure query output; they have ongoing lifecycle state.
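
The lifecycle can be sketched as a pure transition function over the previous state and the latest evaluation result. The phase names follow the inactive/pending/firing/resolved lifecycle from the normalization notes; the `Advance` function, the millisecond timestamps, and the `holdForMs` hold period are assumptions for illustration.

```go
package main

import "fmt"

type Phase int

const (
	Inactive Phase = iota
	Pending
	Firing
	Resolved
)

func (p Phase) String() string {
	return [...]string{"inactive", "pending", "firing", "resolved"}[p]
}

// AlertState is the persistent per-alert record. Version increases on
// every phase change so a store can guard writes on (state, version).
type AlertState struct {
	Phase         Phase
	ActiveSinceMs int64
	Version       uint64
}

// Advance computes the next state from the current state and one
// evaluation result. It does not write anything itself.
func Advance(cur AlertState, active bool, evalMs, holdForMs int64) AlertState {
	next := cur
	switch {
	case active && (cur.Phase == Inactive || cur.Phase == Resolved):
		next.Phase, next.ActiveSinceMs = Pending, evalMs
	case !active && cur.Phase == Pending:
		next.Phase = Inactive
	case !active && cur.Phase == Firing:
		next.Phase = Resolved
	case !active && cur.Phase == Resolved:
		next.Phase = Inactive
	}
	// Promote to firing once the condition has held for the hold period.
	if active && next.Phase == Pending && evalMs-next.ActiveSinceMs >= holdForMs {
		next.Phase = Firing
	}
	if next.Phase != cur.Phase {
		next.Version = cur.Version + 1
	}
	return next
}

func main() {
	const holdForMs = 120000
	state := AlertState{}
	for i, active := range []bool{true, true, true, false, false} {
		state = Advance(state, active, int64(i)*60000, holdForMs)
		fmt.Printf("t=%dm active=%v phase=%v version=%d\n", i, active, state.Phase, state.Version)
	}
}
```

Because `Advance` is pure, the evaluator can persist its result only if the stored version still equals `cur.Version`, which is what keeps a stale evaluation from overwriting a newer state.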

Step 5 - Execution Context #

For the strict baseline monitoring platform:

Field	Value	Why
Topology	single service distributed	one logical monitoring system spread across scrapers, ingesters, storage, and queriers
Write coordination scope	per object scope	correctness is per target, series, alert, block, and shard ownership scope
Read consistency target	bounded stale allowed	query serving often tolerates small bounded lag between ingestion and query visibility
Holder model	node	shard ownership is held by ingesters/storage nodes
Compensation acceptable?	No	wrong sample ingestion or block compaction cannot be safely repaired by compensation

Derived implications #

holder_may_crash = true
- scrapers, ingesters, or shard owners can fail mid-ingestion
cross_service_write = false
- baseline keeps scrape state, TSDB metadata, alerts, and ownership in one logical service
bounded_staleness_allowed = true
- query paths can tolerate some ingestion-to-query lag
cross_service_atomicity_required = false
- no multi-service transaction across unrelated services in baseline
exclusive_claim_required = true
- shard ownership must be exclusive
guarded_by_current_state = true
- alert and compaction transitions depend on current state

What this implies #

This pushes us toward:

one authoritative owner per ingestion/storage shard
append-oriented head block plus immutable blocks
label index maintained by shard owner
rule and alert state derived from authoritative samples
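
Exclusive ownership under crash-prone holders is commonly enforced with a lease plus a fencing token. The sketch below is a single-process illustration under stated assumptions: `LeaseStore` stands in for a coordination service, `ShardStorage` stands in for the shard's durable storage, and all names are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	ErrLeaseHeld = errors.New("lease is held by another node")
	ErrFenced    = errors.New("write rejected: stale fencing token")
)

type lease struct {
	owner       string
	token       uint64
	expiresAtMs int64
}

// LeaseStore hands out per-shard leases with monotonically increasing tokens.
// It is not safe for concurrent use; a real store would be a quorum service.
type LeaseStore struct {
	leases    map[string]lease
	lastToken uint64
}

func NewLeaseStore() *LeaseStore {
	return &LeaseStore{leases: make(map[string]lease)}
}

// Acquire grants or renews the lease. A new token is issued whenever
// ownership is (re)established after expiry.
func (s *LeaseStore) Acquire(shardID, node string, nowMs, ttlMs int64) (uint64, error) {
	cur, held := s.leases[shardID]
	live := held && nowMs < cur.expiresAtMs
	if live && cur.owner != node {
		return 0, ErrLeaseHeld
	}
	token := cur.token
	if !live {
		s.lastToken++
		token = s.lastToken
	}
	s.leases[shardID] = lease{owner: node, token: token, expiresAtMs: nowMs + ttlMs}
	return token, nil
}

// ShardStorage remembers the highest token it has seen and rejects older ones.
type ShardStorage struct {
	highestToken uint64
	Samples      int
}

func (st *ShardStorage) Append(token uint64, n int) error {
	if token < st.highestToken {
		return ErrFenced
	}
	st.highestToken = token
	st.Samples += n
	return nil
}

func main() {
	leases := NewLeaseStore()
	storage := &ShardStorage{}
	tokA, _ := leases.Acquire("shard-7", "node-a", 0, 10000)
	tokB, _ := leases.Acquire("shard-7", "node-b", 11000, 10000) // node-a's lease has expired
	fmt.Println(storage.Append(tokB, 100))
	fmt.Println(storage.Append(tokA, 100)) // old owner is fenced
}
```

The token check sits at the storage layer because the old owner may not know its lease has expired; the lease alone cannot stop a paused process from writing late.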

Step 6 - Deterministic Mechanism Selection #

Path	Write shape	Base mechanism	Required companions
`P1` refresh scrape target state	overwrite current value	CAS on version	target-state version
`P2` ingest metric samples	append-only event	append log / TSDB head append	duplicate-sample policy, per-series timestamp discipline
`P3` update series metadata	overwrite current value	single writer per shard or CAS on version	canonical label hashing
`P4` evaluate rules	read source	direct source read	evaluation schedule
`P5` write recording-rule output	append-only event	append log / TSDB head append	recorded-series identity
`P6` update alert state	guarded state transition	CAS on `(state, version)`	evaluation timestamp/version
`P7` compact TSDB blocks	guarded state transition	single-writer compaction with atomic block swap	block manifest/version
`P8` route to shard owner	exclusive claim	lease	fencing token, heartbeat
`P9` reassign shard ownership	guarded state transition	CAS on `(state, version)`	fencing token, shard catch-up check

Why these fit #

Scrape target state #

This is current-value control state, so overwrite fits.
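
A versioned overwrite can be sketched as a compare-and-swap on the target-state version. The `TargetStore` type, its fields, and the convention that an unwritten target has version 0 are assumptions for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var ErrVersionConflict = errors.New("target state version conflict")

// ScrapeTargetState is the current-value record for one target.
type ScrapeTargetState struct {
	Endpoint    string
	IntervalSec int
	Healthy     bool
	Version     uint64
}

// TargetStore is a hypothetical in-memory store keyed by target ID.
type TargetStore struct {
	mu      sync.Mutex
	targets map[string]ScrapeTargetState
}

func NewTargetStore() *TargetStore {
	return &TargetStore{targets: make(map[string]ScrapeTargetState)}
}

// CompareAndSwap overwrites the state only if the stored version equals
// expected. A target that has never been written has version 0.
func (s *TargetStore) CompareAndSwap(targetID string, expected uint64, next ScrapeTargetState) (uint64, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if cur := s.targets[targetID].Version; cur != expected {
		return cur, ErrVersionConflict
	}
	next.Version = expected + 1
	s.targets[targetID] = next
	return next.Version, nil
}

func main() {
	store := NewTargetStore()
	v, err := store.CompareAndSwap("node-1", 0, ScrapeTargetState{Endpoint: "10.0.0.1:9100", IntervalSec: 15, Healthy: true})
	fmt.Println(v, err)
	// A refresh that still holds the old version loses the race.
	_, err = store.CompareAndSwap("node-1", 0, ScrapeTargetState{Endpoint: "10.0.0.1:9100", IntervalSec: 30})
	fmt.Println(err)
}
```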

Sample ingestion #

Samples are immutable points, so append-only fits naturally.

Series metadata #

Current label/index mapping is current-value state, typically maintained by one shard owner.

Alert state #

Alerts transition through current lifecycle states, so guarded state transition fits.

TSDB compaction #

Compaction is a stateful storage lifecycle change that must preserve current query truth, so guarded transition fits.

Canonical substrate implied #

The baseline now points to:

sharded scrape/ingest/storage service
one owner per tenant or series shard
append-oriented head block and immutable persisted blocks
label index and query engine over authoritative shard data
periodic rule evaluation plus alert lifecycle state

Step 7 - Read Model / Source of Truth #

For a Prometheus-class system, truth is mostly direct source state. Dashboards and UIs are derived.

Concept	Truth	Read path	Rebuild path
`C1` scrape target configuration/state	`ScrapeTargetState`	read source directly	authoritative target store
`C2` metric samples	`MetricSample`	read source directly	authoritative TSDB head and blocks
`C3` series labels / metadata	`SeriesMetadata`	read source directly	authoritative index metadata
`C4` alert lifecycle	`AlertState`	read source directly	authoritative alert-state store
`C5` TSDB block lifecycle	`TSDBBlockState`	read source directly	authoritative block manifest/state store
`C6` shard ownership	`PartitionOwnership`	read source directly	authoritative ownership store
`C7` shard routing map	`PartitionMap`	read source directly	authoritative routing metadata
`C8` dashboards / status	derived from samples, alerts, and block state	materialized view	recompute from authoritative state

Important point #

For the core semantics:

queries read authoritative head plus block data
series lookup reads authoritative label metadata
alert state reads authoritative evaluation and prior lifecycle state
dashboards are projections

Step 8 - Failure Handling #

Path	Retry	Competing writers	Crash after commit	Publish failure	Stale holder
`P1` refresh scrape target state	retry with target-state version	stale refresh loses CAS	committed target state survives crash if persisted	n/a	n/a
`P2` ingest metric samples	scraper retries may resend same samples; dedup policy required for duplicates	shard owner serializes authoritative series append for a shard	committed samples survive crash if WAL/head persisted	partial batch append recovered from WAL/head replay	stale owner blocked by fencing token
`P3` update series metadata	retry with metadata version	concurrent new-series mapping resolved by canonical label identity	committed metadata survives crash if persisted	n/a	stale owner blocked by fencing token
`P4` evaluate rules	evaluation can rerun over same time window	multiple evaluators should be partitioned or fenced by shard ownership	evaluation crash only delays derived outputs	recording/alert output may retry	stale evaluator blocked by ownership/version discipline
`P5` write recording-rule output	retry may produce duplicate writes unless `(series,timestamp)` dedup policy enforced	competing writers should be avoided by single evaluator ownership	committed recorded samples survive crash if persisted	partial append recovered from WAL/head replay	stale evaluator blocked by ownership/version discipline
`P6` update alert state	retry with state version/evaluation timestamp	stale evaluation loses guarded transition	committed alert state survives crash if persisted	downstream notification send may retry separately	stale evaluator blocked by version/token
`P7` compact TSDB blocks	compaction can retry from source manifests	only one compactor should own a block set at a time	if crash before atomic swap, source blocks remain authoritative	partial destination block ignored until manifest commit	stale compactor blocked by block manifest/version
`P8` route to shard owner	retry after refreshing shard map	only one valid owner should exist	if owner changed, refreshed map points to new owner	n/a	stale owner rejected by fencing token
`P9` reassign shard ownership	retry failover transition safely	only one reassignment wins current ownership state	promoted owner crash triggers later reassignment	n/a	old owner fenced and must not continue serving
`P10` query time series	query retry safe	many readers coexist	node crash drops query only	n/a	stale read bounded by configured query freshness

What matters most #

1. Sample dedup policy #

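Scrape retries or HA scrapers can resend a sample for the same (series, timestamp) key. The system needs a defined dedup policy that states what happens when the resent value is identical and what happens when it differs.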
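
As one possible policy, the head can treat an identical resend as a no-op, reject a conflicting value for an existing timestamp, and reject timestamps older than the last stored one. The sketch below assumes that policy; the `Head` type and the error names are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

var (
	ErrOutOfOrder        = errors.New("sample older than last stored timestamp")
	ErrDuplicateConflict = errors.New("different value for existing timestamp")
)

type Sample struct {
	T int64 // timestamp in milliseconds
	V float64
}

// Head keeps the most recent samples per series, in timestamp order.
type Head struct {
	series map[uint64][]Sample
}

func NewHead() *Head { return &Head{series: make(map[uint64][]Sample)} }

// Append stores the sample and reports whether a new point was written.
// An exact resend returns (false, nil) so scraper retries are idempotent.
func (h *Head) Append(seriesID uint64, s Sample) (bool, error) {
	samples := h.series[seriesID]
	if n := len(samples); n > 0 {
		last := samples[n-1]
		switch {
		case s.T < last.T:
			return false, ErrOutOfOrder
		case s.T == last.T:
			// Compare bit patterns so that NaN resends also deduplicate.
			if math.Float64bits(s.V) == math.Float64bits(last.V) {
				return false, nil
			}
			return false, ErrDuplicateConflict
		}
	}
	h.series[seriesID] = append(samples, s)
	return true, nil
}

func (h *Head) Len(seriesID uint64) int { return len(h.series[seriesID]) }

func main() {
	head := NewHead()
	fmt.Println(head.Append(42, Sample{T: 1000, V: 1.5}))
	fmt.Println(head.Append(42, Sample{T: 1000, V: 1.5})) // retry of the same sample
	fmt.Println(head.Append(42, Sample{T: 1000, V: 2.5})) // conflicting value
	fmt.Println(head.Append(42, Sample{T: 500, V: 1.0}))  // out of order
}
```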

2. WAL/head before durable visibility #

Ingested samples must be persisted through the WAL/head path so that, if the process crashes before compaction, they can be recovered by replay.
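
The recovery path can be sketched as "write the record to the log, then apply it to the head, and rebuild the head from the log after a crash". In this illustration a `bytes.Buffer` stands in for a synced WAL file, records are fixed-size, and the `Ingester` and `Replay` names are assumptions.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// Record is a fixed-size WAL entry (24 bytes when encoded).
type Record struct {
	SeriesID uint64
	TMs      int64
	V        float64
}

type Ingester struct {
	wal  bytes.Buffer // stand-in for a durable, synced log file
	head map[uint64][]Record
}

func NewIngester() *Ingester {
	return &Ingester{head: make(map[uint64][]Record)}
}

// Append writes to the WAL first and only then updates the in-memory head.
func (in *Ingester) Append(rec Record) error {
	if err := binary.Write(&in.wal, binary.LittleEndian, rec); err != nil {
		return err
	}
	in.head[rec.SeriesID] = append(in.head[rec.SeriesID], rec)
	return nil
}

// WALBytes returns a copy of the log contents, as they would be found on disk.
func (in *Ingester) WALBytes() []byte {
	return append([]byte(nil), in.wal.Bytes()...)
}

// Replay rebuilds a head from WAL bytes. A torn final record, for example
// from a crash in the middle of a write, is dropped.
func Replay(wal []byte) (map[uint64][]Record, error) {
	head := make(map[uint64][]Record)
	r := bytes.NewReader(wal)
	for {
		var rec Record
		err := binary.Read(r, binary.LittleEndian, &rec)
		if err == io.EOF || err == io.ErrUnexpectedEOF {
			return head, nil
		}
		if err != nil {
			return nil, err
		}
		head[rec.SeriesID] = append(head[rec.SeriesID], rec)
	}
}

func main() {
	in := NewIngester()
	for i := int64(1); i <= 3; i++ {
		if err := in.Append(Record{SeriesID: 7, TMs: i * 1000, V: float64(i)}); err != nil {
			panic(err)
		}
	}
	recovered, err := Replay(in.WALBytes())
	fmt.Println(len(recovered[7]), err)
}
```

A production WAL would add per-record checksums, segment files, and truncation after compaction; the sketch only shows the ordering that makes replay possible.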

3. Compaction must use atomic block replacement #

New compacted blocks should only become visible when their manifest/state is fully committed.
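
A minimal sketch of atomic block replacement: the compactor merges source blocks into a destination block off to the side, then swaps the manifest in one guarded step. The `BlockStore`, `Merge`, and `Swap` names and the in-memory manifest are assumptions; a real system would commit the manifest durably.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
	"sync"
)

var ErrStaleManifest = errors.New("manifest changed since compaction started")

type Sample struct {
	SeriesID uint64
	T        int64
	V        float64
}

type Block struct {
	ID      string
	Samples []Sample
}

// Manifest lists the blocks that queries are allowed to see.
type Manifest struct {
	Version  uint64
	BlockIDs []string
}

type BlockStore struct {
	mu       sync.Mutex
	blocks   map[string]Block
	manifest Manifest
}

func NewBlockStore(initial ...Block) *BlockStore {
	s := &BlockStore{blocks: make(map[string]Block)}
	for _, b := range initial {
		s.blocks[b.ID] = b
		s.manifest.BlockIDs = append(s.manifest.BlockIDs, b.ID)
	}
	s.manifest.Version = 1
	return s
}

func (s *BlockStore) Manifest() Manifest {
	s.mu.Lock()
	defer s.mu.Unlock()
	return Manifest{Version: s.manifest.Version, BlockIDs: append([]string(nil), s.manifest.BlockIDs...)}
}

// Merge unions the source blocks, keeping one sample per (series, timestamp).
func Merge(id string, sources ...Block) Block {
	type key struct {
		series uint64
		t      int64
	}
	seen := make(map[key]bool)
	out := Block{ID: id}
	for _, b := range sources {
		for _, smp := range b.Samples {
			k := key{smp.SeriesID, smp.T}
			if !seen[k] {
				seen[k] = true
				out.Samples = append(out.Samples, smp)
			}
		}
	}
	sort.Slice(out.Samples, func(i, j int) bool {
		a, b := out.Samples[i], out.Samples[j]
		if a.SeriesID != b.SeriesID {
			return a.SeriesID < b.SeriesID
		}
		return a.T < b.T
	})
	return out
}

// Swap makes dest visible and hides the sources in one step, guarded by
// the manifest version the compactor read before it started.
func (s *BlockStore) Swap(expected uint64, sourceIDs []string, dest Block) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.manifest.Version != expected {
		return ErrStaleManifest
	}
	replaced := make(map[string]bool, len(sourceIDs))
	for _, id := range sourceIDs {
		replaced[id] = true
	}
	ids := []string{}
	for _, id := range s.manifest.BlockIDs {
		if !replaced[id] {
			ids = append(ids, id)
		}
	}
	s.blocks[dest.ID] = dest
	s.manifest = Manifest{Version: expected + 1, BlockIDs: append(ids, dest.ID)}
	return nil
}

func main() {
	b1 := Block{ID: "b1", Samples: []Sample{{1, 10, 1}, {1, 20, 2}}}
	b2 := Block{ID: "b2", Samples: []Sample{{1, 20, 2}, {1, 30, 3}}}
	store := NewBlockStore(b1, b2)
	snap := store.Manifest()
	dest := Merge("b3", b1, b2)
	fmt.Println(store.Swap(snap.Version, []string{"b1", "b2"}, dest), store.Manifest())
}
```

If the compactor crashes before `Swap`, the manifest still names the source blocks, so the half-written destination is never visible to queries.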

4. Alert evaluation and notification are separate #

Alert-state correctness is inside this system; downstream notification delivery is a separate side effect.

Step 9 - Scale Adjustments #

Hotspot	Type	First response
hot tenants or very high sample ingestion rate	write throughput hotspot	shard by tenant/series space and add more ingesters
high-cardinality labels / index growth	memory hotspot	limit cardinality, compress index, and isolate abusive tenants
expensive range queries / PromQL joins	read hotspot	add query frontends, caching, and precomputed recording rules
compaction bandwidth / storage IO	write throughput hotspot	stagger compaction and separate hot head storage from cold blocks
rule-evaluation storms	contention hotspot	partition rule groups and stagger evaluation schedules
dashboard load	read hotspot	query caching and derived views for common panels
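
Staggering evaluation schedules can be as simple as giving each rule group a stable offset inside the evaluation interval. The sketch below assumes a hash-derived offset; `EvalOffsetMs` and `NextEvalMs` are hypothetical helpers, and the group key format is made up for the example.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// EvalOffsetMs maps a rule-group key to a stable offset in [0, intervalMs),
// so groups with the same interval do not all fire at the same instant.
func EvalOffsetMs(groupKey string, intervalMs int64) int64 {
	h := fnv.New64a()
	h.Write([]byte(groupKey))
	return int64(h.Sum64() % uint64(intervalMs))
}

// NextEvalMs returns the first evaluation time strictly after nowMs that
// falls on the group's offset within the interval grid.
func NextEvalMs(nowMs, intervalMs, offsetMs int64) int64 {
	next := nowMs - nowMs%intervalMs + offsetMs
	if next <= nowMs {
		next += intervalMs
	}
	return next
}

func main() {
	const intervalMs = 60000
	for _, group := range []string{"tenant-a/cpu", "tenant-a/memory", "tenant-b/cpu"} {
		off := EvalOffsetMs(group, intervalMs)
		fmt.Printf("%s offset=%dms next=%d\n", group, off, NextEvalMs(65000, intervalMs, off))
	}
}
```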

What scales well #

This system scales by:

sharding ingestion and storage by tenant or series space
keeping samples append-oriented
separating head ingestion from immutable block storage
using recording rules to precompute expensive reads
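
Sharding by tenant or series space can be sketched as a hash of the (tenant, series) pair followed by a lookup in the partition map. The `PartitionMap` layout (a slice of owners indexed by shard) and the modulo placement are assumptions for illustration.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
)

// ShardFor maps a (tenant, series) pair to a shard number in [0, numShards).
func ShardFor(tenant string, seriesID uint64, numShards int) int {
	h := fnv.New64a()
	h.Write([]byte(tenant))
	h.Write([]byte{0}) // separator between tenant and series bytes
	var buf [8]byte
	binary.LittleEndian.PutUint64(buf[:], seriesID)
	h.Write(buf[:])
	return int(h.Sum64() % uint64(numShards))
}

// PartitionMap is a hypothetical routing table: index = shard, value = owner node.
type PartitionMap struct {
	Owners []string
}

// OwnerFor resolves the node that currently owns the shard for this series.
func (m PartitionMap) OwnerFor(tenant string, seriesID uint64) string {
	return m.Owners[ShardFor(tenant, seriesID, len(m.Owners))]
}

func main() {
	pm := PartitionMap{Owners: []string{"ingester-0", "ingester-1", "ingester-2", "ingester-3"}}
	for _, id := range []uint64{1, 2, 3} {
		fmt.Println(id, pm.OwnerFor("tenant-a", id))
	}
}
```

Plain modulo placement moves most series when the shard count changes, so a deployment that resizes often would more likely use consistent hashing or an explicit shard-range table; the lookup shape stays the same.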

What fails first #

Usually:

high-cardinality explosion
query fanout across too many shards
compaction IO bottlenecks
rule evaluation spikes

Canonical design conclusion #

The mechanical outcome is:

primary state:
- ScrapeTargetState
- MetricSample
- SeriesMetadata
- AlertState
- TSDBBlockState
- PartitionOwnership
- PartitionMap
critical invariants:
- immutable sample append semantics
- canonical label-to-series mapping
- safe block compaction preserving query truth
- alert state valid only for current evaluation result and lifecycle state
- exclusive shard ownership for ingestion/storage
mechanisms:
- append log
- single writer per shard
- guarded alert and compaction transitions
- fenced shard ownership
reads:
- direct authoritative reads for samples, labels, and alerts
- projections only for dashboards and cluster status

Polished interview answer #

I’d design the monitoring platform as a sharded scrape-and-store system with one authoritative owner per tenant or series shard. Scrapers refresh target state and append immutable samples into a TSDB head plus WAL, while shard owners maintain canonical series metadata and later compact head data into immutable blocks. Queries read both head and persisted blocks through a label index, recording rules append derived samples back into storage, and alert instances move through a guarded lifecycle based on periodic rule evaluation. The main scaling levers are more ingestion shards, strong controls on label cardinality, precomputed recording rules, and separate head-versus-block storage paths.

Concrete Substrate #

I’ll choose a sharded scrape/ingest/storage service with per-shard TSDB heads plus immutable persisted blocks as the concrete baseline, because it matches the mechanics we derived:

scrape target state as current-value control data
append-only sample ingestion
canonical per-shard series indexing
guarded alert lifecycle and block compaction
one owner per ingestion/storage shard

Concrete tech family:

scraper/ingester/query services in Go
per-shard TSDB head with WAL
immutable compacted blocks on local disk or in object storage
metadata/control:
- etcd or internal metadata quorum for shard ownership/routing
query frontend for fanout and PromQL execution

Each shard owner stores:

current scrape target state
TSDB head and WAL
series label index / metadata
alert state for owned rule groups
block manifest / compaction state

The persisted block store holds:

immutable compressed sample blocks
postings / index metadata
block manifests

Operation Layer #

1. Refresh scrape target #

API

internal discovery refresh

Initiator

system

Entry point

discovery/scrape manager

Authoritative decider

shard owner for target scope

Precondition

target source update or refresh cycle due

Transition

overwrite ScrapeTargetState(target_id)

2. Ingest sample batch #

API

internal scrape append

Initiator

scraper

Entry point

shard owner / ingester

Authoritative decider

shard owner for tenant/series space

Precondition

target currently assigned and sample batch parsed

Transition

append MetricSamples into WAL/head
create or refresh SeriesMetadata as needed

Response

success / partial failure

3. Query time series #

API

Query(query_expr, time_range, tenant)

Initiator

user/client

Entry point

query frontend

Authoritative decider

relevant shard owners plus query engine

Precondition

none

Transition

none

Response

time-series vectors / matrices / scalars
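
To illustrate what a range function does with the selected samples, here is a simplified per-second rate over a window for a counter series. It handles counter resets but, unlike Prometheus's own `rate()`, does not extrapolate to the window boundaries; the closed window and all names are assumptions of this sketch.

```go
package main

import "fmt"

type Sample struct {
	TMs int64 // timestamp in milliseconds
	V   float64
}

// Window returns the samples with startMs <= T <= endMs.
// The input is assumed to be sorted by timestamp.
func Window(samples []Sample, startMs, endMs int64) []Sample {
	var out []Sample
	for _, s := range samples {
		if s.TMs >= startMs && s.TMs <= endMs {
			out = append(out, s)
		}
	}
	return out
}

// Increase sums the counter deltas. A drop in value is treated as a
// counter reset, so the post-reset value counts as the delta.
func Increase(samples []Sample) float64 {
	var total float64
	for i := 1; i < len(samples); i++ {
		delta := samples[i].V - samples[i-1].V
		if delta < 0 {
			delta = samples[i].V
		}
		total += delta
	}
	return total
}

// Rate returns the increase per second between the first and last sample.
// It reports false when fewer than two samples are available.
func Rate(samples []Sample) (float64, bool) {
	if len(samples) < 2 {
		return 0, false
	}
	elapsedSec := float64(samples[len(samples)-1].TMs-samples[0].TMs) / 1000.0
	if elapsedSec <= 0 {
		return 0, false
	}
	return Increase(samples) / elapsedSec, true
}

func main() {
	series := []Sample{{0, 0}, {15000, 30}, {30000, 60}, {45000, 90}}
	if r, ok := Rate(Window(series, 0, 30000)); ok {
		fmt.Printf("rate over [0s, 30s]: %.2f/s\n", r)
	}
}
```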

4. Evaluate rule and update alert state #

API

internal rule-evaluation loop

Initiator

system

Entry point

rule evaluator

Authoritative decider

shard owner for rule group

Precondition

evaluation interval due

Transition

read current query result
append RecordedSample if recording rule
CAS-update AlertState if alert rule

5. Compact TSDB blocks #

API

internal compaction loop

Initiator

system

Entry point

shard owner / compactor

Authoritative decider

compaction owner for block set

Precondition

source blocks eligible for compaction

Transition

read source blocks
create destination compacted block
atomically update TSDBBlockState / manifest to swap visibility

Entry Point vs Decider vs Responder #

Path	Entry point	Authoritative decider	Physical responder	Logical responder
scrape/ingest	scraper / ingester	shard owner	ingester node	monitoring platform
query	query frontend	shard owners + query engine	query frontend	monitoring platform
rule evaluation	rule evaluator	shard owner	evaluator node	monitoring platform
compaction	compactor / shard owner	compaction owner	storage node	monitoring platform
shard failover	follower / coordination layer	shard quorum / lease store	new leader / control plane	monitoring platform

Concrete HLD #

Main components:

discovery / scrape manager
- tracks targets and schedules scrapes
ingestion shard owners
- authoritative owners for TSDB head, series metadata, and block lifecycle
query frontend / PromQL engine
- fans queries out to relevant shards
rule evaluator
- runs recording and alert rules on schedule
TSDB block store
- stores immutable compacted blocks
metadata/control service
- tracks shard ownership and routing

Short Interview Version #

I’d build the monitoring platform as a sharded scrape-and-store system with one authoritative owner per tenant or series shard. Scrapers refresh target state and append immutable samples into a TSDB head plus WAL, while shard owners maintain canonical series metadata and later compact head data into immutable blocks. Queries read both head and compacted blocks through a label index, recording rules append derived samples back into storage, and alert instances move through a guarded lifecycle based on periodic rule evaluation. The main scaling levers are more ingestion shards, strong controls on label cardinality, precomputed recording rules, and separate head-versus-block storage paths.
