Metrics Monitoring Platform (Prometheus-class: TSDB, Scrape, PromQL) #
This note models a Prometheus-class monitoring platform where scrapers pull metric samples from targets, samples are stored in a time-series database, users query with label selectors and range functions, and rules/alerts are continuously evaluated over stored series.
Step 1 - Normalize #
Assume the baseline prompt is:
- design a metrics monitoring platform
- scrapers pull metrics from many targets
- metric samples are stored in a TSDB
- users query metrics with label filters and PromQL-like functions
- rules and alerts are evaluated continuously
- system scales across many targets and tenants
Normalize into state-affecting paths.
| Requirement | Actor | Operation | State touched | Priority |
|---|---|---|---|---|
| Scraper discovers or refreshes scrape target | System | overwrite state | S1: update target `ScrapeTargetState` | C1 |
| Scraper ingests metric sample batch | System | append event | S1: create target `MetricSample` | C1 |
| System updates series metadata / index | System | state transition | S1: update target `SeriesMetadata` | C1 |
| User queries time series / labels | Client | read source | S1: read source target `MetricSample` | R1 |
| Rule engine evaluates recording or alert rule | System | read source | S1: read source target `MetricSample` | C1 |
| System writes recording rule output | System | append event | S1: create target `RecordedSample` | C1 |
| System updates alert instance state | System | state transition | S1: update target `AlertState` | C1 |
| System compacts / merges TSDB blocks | System | async process | S1: hidden write target `TSDBBlockState` | C1 |
| System routes tenant/shard to current owner | System | read source | S1: read source target `PartitionMap` | C1 |
| System reassigns shard ownership after node failure | System | state transition | S1: update target `PartitionOwnership` | C1 |
| User reads dashboards / status | Client | read projection | S1: read projection target `MonitoringStatusView` | R2 |
Notes on normalization #
Important choices:
- scrape target refresh is `overwrite state`
  - current target set/health is current-value control state
- sample ingestion is `append event`
  - metric samples are immutable points in time
- series metadata update is `state transition`
  - label/index lifecycle evolves as new series appear or become inactive
- rule evaluation is a read path
  - it reads TSDB source state
- recording-rule output is append-only
  - derived samples are still immutable time-series points
- alert lifecycle is a state transition
  - alert instances move through inactive/pending/firing/resolved
- compaction is explicit because TSDB storage lifecycle is core
This system is a hybrid of:
- time-series append store
- label index
- continuous rule evaluation
Step 2 - Critical Path Selection #
| Requirement | Priority class | Why |
|---|---|---|
| Discover / refresh scrape target state | C1 | stale target state can break scrape coverage |
| Ingest metric samples | C1 | primary product truth is the sample stream |
| Update series metadata / index | C1 | queries depend on correct label-to-series mapping |
| Query time series | R1 | core serving path |
| Evaluate recording / alert rules | C1 | correctness of derived metrics and alerts depends on this path |
| Write recording rule output | C1 | derived metrics become stored truth for downstream queries |
| Update alert state | C1 | alert firing/resolution correctness depends on current rule evaluation |
| Compact TSDB blocks | C1 | storage lifecycle and query correctness depend on safe compaction |
| Route to shard owner | C1 | wrong routing can split ingestion/query truth |
| Reassign shard ownership | C1 | failover must preserve ingestion/query/storage correctness |
| Dashboards / status | R2 | operational only |
Baseline critical paths #
Main C1 paths:
- P1 refresh scrape target state
- P2 ingest metric samples
- P3 update series metadata
- P4 evaluate rules
- P5 write recording-rule output
- P6 update alert state
- P7 compact TSDB blocks
- P8 route to shard owner
- P9 reassign shard ownership

Main R1 path:
- P10 query time series / labels
This design is driven by:
- immutable sample ingestion
- correct label/index maintenance
- TSDB block lifecycle
- rule and alert state derived from authoritative sample history
Step 3 - Primary State Extraction #
For a Prometheus-class system, the minimal primary state is scrape target state, metric samples, series metadata/index state, alert state, TSDB block lifecycle state, and routing/ownership state.
| Candidate object label | Candidate source | Candidate needed for C1/R1? | Candidate decomposition action | Class | Primary? | Owner | Evolution | Scope kind | Scope value |
|---|---|---|---|---|---|---|---|---|---|
| ScrapeTargetState | direct noun | Yes | keep as candidate | process | Yes | service | overwrite | instance | target_id |
| MetricSample | direct noun | Yes | keep as candidate | event | Yes | service | append-only | instance | series_id + timestamp |
| SeriesMetadata | hidden write target | Yes | keep as candidate | entity | Yes | service | overwrite | instance | series_id |
| AlertState | lifecycle object | Yes | keep as candidate | process | Yes | service | state machine | instance | alert_key |
| TSDBBlockState | hidden write target | Yes | keep as candidate | process | Yes | service | state machine | instance | block_id |
| PartitionOwnership | hidden write target | Yes | keep as candidate | process | Yes | service | state machine | instance | shard_id |
| PartitionMap | hidden write target | Yes | keep as candidate | entity | Yes | service | overwrite | collection | tenant/shard map |
| MonitoringStatusView | derived read model | No | reject as UI artifact | projection | No | derived | overwrite | collection | tenant or cluster |
Important modeling choices #
ScrapeTargetState #
Primary because:
- target discovery and health determine what gets scraped
- current endpoint labels, intervals, and scrape health are control truth
MetricSample #
Primary because:
- sample ingestion is the core immutable fact stream
SeriesMetadata #
Primary because:
- PromQL-style reads depend on label matching and series lookup
- captures label set, fingerprint/series id, retention status, and head/block placement
AlertState #
Primary because:
- alert instances have lifecycle across evaluations
TSDBBlockState #
Primary because:
- head/block/compaction lifecycle determines storage truth and queryability
Minimal strict primary set #
The strongest minimal set is:
- `ScrapeTargetState`
- `MetricSample`
- `SeriesMetadata`
- `AlertState`
- `TSDBBlockState`
- `PartitionOwnership`
- `PartitionMap`
Step 4 - Hard Invariants #
For a Prometheus-class monitoring platform, the hard invariants are about durable sample append semantics, correct label-to-series mapping, safe TSDB compaction, and valid alert lifecycle transitions.
| Path | Tier | Type | Invariant statement |
|---|---|---|---|
| P1 refresh scrape target state | HARD | ordering | Scrape-target revisions are ordered by a monotonic target-state version within target scope. |
| P2 ingest metric samples | HARD | uniqueness | Key (series_id, timestamp) maps to at most one stored sample value within series scope. |
| P2 ingest metric samples | HARD | ordering | Samples for series_id are ordered by timestamp within the authoritative ingestion scope. |
| P3 update series metadata | HARD | accounting | SeriesMetadata(series_id) corresponds to exactly one canonical label set and storage placement for that series scope. |
| P4 evaluate rules | HARD | freshness | Rule evaluation reads authoritative sample history and metadata within the configured evaluation consistency bound. |
| P5 write recording-rule output | HARD | uniqueness | Key (recorded_series_id, timestamp) maps to at most one recorded sample within recorded-series scope. |
| P6 update alert state | HARD | eligibility | Action advance_alert_state is valid only if the current rule evaluation result and the current AlertState allow the transition at decision time. |
| P7 compact TSDB blocks | HARD | accounting | Compacted block contents equal the union of authoritative source blocks modulo dedup/retention rules, and the source-to-destination block transition preserves queryable truth. |
| P8 route to shard owner | HARD | uniqueness | Key shard_id maps to at most one current authoritative owner. |
| P9 reassign shard ownership | HARD | eligibility | Action reassign_shard is valid only if the current owner has failed or relinquished ownership and the candidate owner is eligible and sufficiently current on shard_id at decision time. |
| P10 query time series | HARD | freshness | Query path reflects authoritative head plus block data within the configured query consistency bound. |
What matters most #
1. Samples are immutable time points #
Once accepted, sample points are append-only facts.
2. Label mapping must be canonical #
One label set must resolve to one canonical series identity within a shard/tenant scope.
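One way to make "one label set, one series identity" concrete is to hash a canonical encoding of the labels. The sketch below is illustrative only: the function name `canonical_series_id`, the use of SHA-256, and the 16-character truncation are assumptions made for this note, not the scheme of any particular TSDB.

```python
import hashlib


def canonical_series_id(tenant: str, labels: dict) -> str:
    """Derive a stable series id from a label set, independent of label order.

    Hypothetical helper: the hash function and id length are assumptions.
    """
    # Sort by label name so the same label set always encodes identically.
    parts = [tenant]
    for name in sorted(labels):
        parts.append(name)
        parts.append(labels[name])
    # Length-prefix every part so ("ab", "c") and ("a", "bc") cannot collide
    # through plain concatenation.
    encoded = b"".join(
        len(p.encode()).to_bytes(4, "big") + p.encode() for p in parts
    )
    return hashlib.sha256(encoded).hexdigest()[:16]
```

Scoping the hash by tenant keeps identical label sets from different tenants apart, which matches the "within a shard/tenant scope" wording above.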
3. Compaction must preserve query truth #
Block lifecycle changes cannot lose or double-count stored samples.
4. Alert state is derived but persistent #
Alerts are not just pure query output; they have ongoing lifecycle state.
Step 5 - Execution Context #
For the strict baseline monitoring platform:
| Field | Value | Why |
|---|---|---|
| Topology | single service distributed | one logical monitoring system spread across scrapers, ingesters, storage, and queriers |
| Write coordination scope | per object scope | correctness is per target, series, alert, block, and shard ownership scope |
| Read consistency target | bounded stale allowed | query serving often tolerates small bounded lag between ingestion and query visibility |
| Holder model | node | shard ownership is held by ingesters/storage nodes |
| Compensation acceptable? | No | wrong sample ingestion or block compaction cannot be safely repaired by compensation |
Derived implications #
- `holder_may_crash = true`
  - scrapers, ingesters, or shard owners can fail mid-ingestion
- `cross_service_write = false`
  - baseline keeps scrape state, TSDB metadata, alerts, and ownership in one logical service
- `bounded_staleness_allowed = true`
  - query paths can tolerate some ingestion-to-query lag
- `cross_service_atomicity_required = false`
  - no multi-service transaction across unrelated services in baseline
- `exclusive_claim_required = true`
  - shard ownership must be exclusive
- `guarded_by_current_state = true`
  - alert and compaction transitions depend on current state
What this implies #
This pushes us toward:
- one authoritative owner per ingestion/storage shard
- append-oriented head block plus immutable blocks
- label index maintained by shard owner
- rule and alert state derived from authoritative samples
Step 6 - Deterministic Mechanism Selection #
| Path | Write shape | Base mechanism | Required companions |
|---|---|---|---|
| P1 refresh scrape target state | overwrite current value | CAS on version | target-state version |
| P2 ingest metric samples | append-only event | append log / TSDB head append | duplicate-sample policy, per-series timestamp discipline |
| P3 update series metadata | overwrite current value | single writer per shard or CAS on version | canonical label hashing |
| P4 evaluate rules | read source | direct source read | evaluation schedule |
| P5 write recording-rule output | append-only event | append log / TSDB head append | recorded-series identity |
| P6 update alert state | guarded state transition | CAS on (state, version) | evaluation timestamp/version |
| P7 compact TSDB blocks | guarded state transition | single-writer compaction with atomic block swap | block manifest/version |
| P8 route to shard owner | exclusive claim | lease | fencing token, heartbeat |
| P9 reassign shard ownership | guarded state transition | CAS on (state, version) | fencing token, shard catch-up check |
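
The lease-plus-fencing-token mechanism chosen for P8 can be sketched as follows. This is a minimal in-memory model under stated assumptions: `LeaseStore` stands in for the metadata quorum, `ShardStorage` stands in for the storage layer that checks tokens, and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Lease:
    owner: str
    token: int         # fencing token, strictly increasing per shard
    expires_at: float


class LeaseStore:
    """In-memory stand-in for the ownership store (assumption for this sketch)."""

    def __init__(self):
        self._leases = {}
        self._last_token = {}

    def acquire(self, shard_id, node, now, ttl):
        current = self._leases.get(shard_id)
        if current is not None and current.expires_at > now:
            if current.owner != node:
                return None  # another node holds a live lease
            current.expires_at = now + ttl  # heartbeat: renew, keep the token
            return current
        # Lease is free or expired: hand out a strictly larger token.
        token = self._last_token.get(shard_id, 0) + 1
        self._last_token[shard_id] = token
        lease = Lease(node, token, now + ttl)
        self._leases[shard_id] = lease
        return lease


class ShardStorage:
    """Rejects writes that carry a token older than the newest one seen."""

    def __init__(self):
        self.highest_token = 0
        self.samples = []

    def append(self, token, sample):
        if token < self.highest_token:
            return False  # stale owner is fenced off
        self.highest_token = token
        self.samples.append(sample)
        return True
```

The point of the token check living in `ShardStorage` is that a paused or partitioned old owner cannot corrupt the shard even if it still believes its lease is valid.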
Why these fit #
Scrape target state #
This is current-value control state, so overwrite fits.
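
A minimal sketch of the versioned overwrite, assuming an in-memory `TargetStore` and a `TargetState` record with hypothetical fields (`endpoint`, `healthy`, `version`):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TargetState:
    endpoint: str
    healthy: bool
    version: int


class TargetStore:
    """Illustrative current-value store; version 0 means 'not yet present'."""

    def __init__(self):
        self._targets = {}

    def get(self, target_id):
        return self._targets.get(target_id)

    def compare_and_set(self, target_id, expected_version, endpoint, healthy):
        current = self._targets.get(target_id)
        current_version = current.version if current else 0
        if current_version != expected_version:
            return False  # a newer refresh already won; caller must re-read
        self._targets[target_id] = TargetState(
            endpoint, healthy, current_version + 1
        )
        return True
```

A refresh that read version 1 and loses the race simply re-reads and retries, which is what keeps target revisions ordered by a monotonic version.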
Sample ingestion #
Samples are immutable points, so append-only fits naturally.
Series metadata #
Current label/index mapping is current-value state, typically maintained by one shard owner.
Alert state #
Alerts transition through current lifecycle states, so guarded state transition fits.
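
The guarded transition can be sketched as a pure function over the inactive/pending/firing/resolved lifecycle. Everything here is an assumption made for illustration: the `for_duration` hold time, the rule that a resolved alert returns to inactive on the next false evaluation, and the names `AlertState` and `advance_alert_state` as concrete Python objects.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AlertState:
    state: str = "inactive"            # inactive | pending | firing | resolved
    version: int = 0
    active_since: Optional[float] = None
    last_eval_ts: float = float("-inf")


def _next(current, condition_true, eval_ts, for_duration):
    if not condition_true:
        # Assumed policy: firing -> resolved, everything else -> inactive.
        return ("resolved" if current.state == "firing" else "inactive"), None
    if current.state in ("inactive", "resolved"):
        return ("pending" if for_duration > 0 else "firing"), eval_ts
    if current.state == "pending" and eval_ts - current.active_since >= for_duration:
        return "firing", current.active_since
    return current.state, current.active_since


def advance_alert_state(current, expected_version, condition_true, eval_ts, for_duration):
    """Return the next AlertState, or None if the guard rejects this evaluation."""
    # Guard: the caller must have seen the latest version, and the evaluation
    # must be newer than the last one applied.
    if expected_version != current.version or eval_ts <= current.last_eval_ts:
        return None
    state, since = _next(current, condition_true, eval_ts, for_duration)
    return AlertState(state, current.version + 1, since, eval_ts)
```

Because the function returns `None` for a stale version or an older evaluation timestamp, a delayed or duplicate evaluator cannot move an alert backwards.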
TSDB compaction #
Compaction is a stateful storage lifecycle change that must preserve current query truth, so guarded transition fits.
Canonical substrate implied #
The baseline now points to:
- sharded scrape/ingest/storage service
- one owner per tenant or series shard
- append-oriented head block and immutable persisted blocks
- label index and query engine over authoritative shard data
- periodic rule evaluation plus alert lifecycle state
Step 7 - Read Model / Source of Truth #
For a Prometheus-class system, truth is mostly direct source state. Dashboards and UIs are derived.
| Concept | Truth | Read path | Rebuild path |
|---|---|---|---|
| C1 scrape target configuration/state | ScrapeTargetState | read source directly | authoritative target store |
| C2 metric samples | MetricSample | read source directly | authoritative TSDB head and blocks |
| C3 series labels / metadata | SeriesMetadata | read source directly | authoritative index metadata |
| C4 alert lifecycle | AlertState | read source directly | authoritative alert-state store |
| C5 TSDB block lifecycle | TSDBBlockState | read source directly | authoritative block manifest/state store |
| C6 shard ownership | PartitionOwnership | read source directly | authoritative ownership store |
| C7 shard routing map | PartitionMap | read source directly | authoritative routing metadata |
| C8 dashboards / status | derived from samples, alerts, and block state | materialized view | recompute from authoritative state |
Important point #
For the core semantics:
- queries read authoritative head plus block data
- series lookup reads authoritative label metadata
- alert state reads authoritative evaluation and prior lifecycle state
- dashboards are projections
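
As a rough illustration of "queries read authoritative head plus block data", the sketch below merges one series across persisted blocks and the in-memory head for a time range. The data shapes (a block as a dict with `min_ts`, `max_ts`, and `series`; the head as a dict of sample lists) are assumptions chosen to keep the example small.

```python
def query_series(series_id, start, end, head, blocks):
    """Return sorted (ts, value) pairs for one series in [start, end].

    head:   dict of series_id -> list of (ts, value)
    blocks: list of dicts with "min_ts", "max_ts", and "series"
            (series_id -> list of (ts, value))
    """
    merged = {}
    for block in blocks:
        # Skip blocks whose time range cannot overlap the query window.
        if block["max_ts"] < start or block["min_ts"] > end:
            continue
        for ts, value in block["series"].get(series_id, []):
            if start <= ts <= end:
                merged[ts] = value
    for ts, value in head.get(series_id, []):
        if start <= ts <= end:
            # Under the (series_id, timestamp) uniqueness invariant an overlap
            # carries the same value, so keeping the first one seen is safe.
            merged.setdefault(ts, value)
    return sorted(merged.items())
```

Keying the merge by timestamp is what prevents double-counting when a sample is briefly visible in both the head and a freshly written block.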
Step 8 - Failure Handling #
| Path | Retry | Competing writers | Crash after commit | Publish failure | Stale holder |
|---|---|---|---|---|---|
| P1 refresh scrape target state | retry with target-state version | stale refresh loses CAS | committed target state survives crash if persisted | n/a | n/a |
| P2 ingest metric samples | scraper retries may resend same samples; dedup policy required for duplicates | shard owner serializes authoritative series append for a shard | committed samples survive crash if WAL/head persisted | partial batch append recovered from WAL/head replay | stale owner blocked by fencing token |
| P3 update series metadata | retry with metadata version | concurrent new-series mapping resolved by canonical label identity | committed metadata survives crash if persisted | n/a | stale owner blocked by fencing token |
| P4 evaluate rules | evaluation can rerun over same time window | multiple evaluators should be partitioned or fenced by shard ownership | evaluation crash only delays derived outputs | recording/alert output may retry | stale evaluator blocked by ownership/version discipline |
| P5 write recording-rule output | retry may produce duplicate writes unless (series, timestamp) dedup policy enforced | competing writers should be avoided by single evaluator ownership | committed recorded samples survive crash if persisted | partial append recovered from WAL/head replay | stale evaluator blocked by ownership/version discipline |
| P6 update alert state | retry with state version/evaluation timestamp | stale evaluation loses guarded transition | committed alert state survives crash if persisted | downstream notification send may retry separately | stale evaluator blocked by version/token |
| P7 compact TSDB blocks | compaction can retry from source manifests | only one compactor should own a block set at a time | if crash before atomic swap, source blocks remain authoritative | partial destination block ignored until manifest commit | stale compactor blocked by block manifest/version |
| P8 route to shard owner | retry after refreshing shard map | only one valid owner should exist | if owner changed, refreshed map points to new owner | n/a | stale owner rejected by fencing token |
| P9 reassign shard ownership | retry failover transition safely | only one reassignment wins current ownership state | promoted owner crash triggers later reassignment | n/a | old owner fenced and must not continue serving |
| P10 query time series | query retry safe | many readers coexist | node crash drops query only | n/a | stale read bounded by configured query freshness |
What matters most #
1. Sample dedup policy #
Scrape retries or HA scrapers can re-send the same (series, timestamp) sample. The system needs a defined dedup policy.
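
One possible dedup policy, shown as a sketch: an identical resend is an idempotent no-op, a different value at an existing timestamp is a conflict, and a new timestamp older than the latest stored one is rejected as out of order. The policy itself and the `SeriesHead` class are assumptions for illustration; other policies (for example, last write wins) are equally definable.

```python
class SeriesHead:
    """Per-series in-memory head (illustrative, not a real TSDB API)."""

    def __init__(self):
        self._timestamps = []   # strictly increasing
        self._by_ts = {}        # ts -> value

    def append(self, ts, value):
        """Return 'stored', 'duplicate', 'conflict', or 'out_of_order'."""
        if ts in self._by_ts:
            # Same (series, timestamp) key seen before: only an identical
            # value is treated as a harmless retry.
            return "duplicate" if self._by_ts[ts] == value else "conflict"
        if self._timestamps and ts < self._timestamps[-1]:
            return "out_of_order"
        self._timestamps.append(ts)
        self._by_ts[ts] = value
        return "stored"

    def samples(self):
        return [(ts, self._by_ts[ts]) for ts in self._timestamps]
```

With this policy a scraper can safely resend a whole batch after a timeout: already-stored points come back as `duplicate` and nothing is written twice.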
2. WAL/head before durable visibility #
Ingested samples should survive process crash via WAL/head recovery before compaction.
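
The write-ahead discipline can be sketched with a line-oriented log: each sample is fsynced to the WAL before it is applied to the head, and a restart replays the log, discarding a torn final record. The JSON-lines format and the `WalBackedHead` class are assumptions for this sketch, not how any specific TSDB lays out its WAL.

```python
import json
import os
import tempfile


class WalBackedHead:
    """Head that logs every sample before applying it (illustrative)."""

    def __init__(self, wal_path):
        self.wal_path = wal_path
        self.head = {}  # series_id -> list of (ts, value)
        self._replay()
        self._wal = open(wal_path, "ab")

    def _apply(self, series_id, ts, value):
        self.head.setdefault(series_id, []).append((ts, value))

    def _replay(self):
        if not os.path.exists(self.wal_path):
            return
        good_bytes = 0
        with open(self.wal_path, "rb") as f:
            for line in f:
                try:
                    if not line.endswith(b"\n"):
                        raise ValueError("torn record")
                    rec = json.loads(line)
                except ValueError:
                    break  # crash mid-write: ignore the incomplete tail
                self._apply(rec["series"], rec["ts"], rec["value"])
                good_bytes += len(line)
        # Drop the torn tail so later appends start on a clean boundary.
        with open(self.wal_path, "r+b") as f:
            f.truncate(good_bytes)

    def append(self, series_id, ts, value):
        record = json.dumps({"series": series_id, "ts": ts, "value": value})
        self._wal.write(record.encode() + b"\n")
        self._wal.flush()
        os.fsync(self._wal.fileno())       # durable before visible
        self._apply(series_id, ts, value)

    def close(self):
        self._wal.close()
```

A quick way to exercise it is to write a few samples to a path under `tempfile.mkdtemp()`, append a half-written record by hand, and reopen: the head should contain only the complete records.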
3. Compaction must use atomic block replacement #
New compacted blocks should only become visible when their manifest/state is fully committed.
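
A sketch of the manifest-commit step, assuming the block manifest is a small JSON file and that a single compactor owns the block set (so the read-check-write sequence is not raced). The function names and manifest shape are hypothetical; the atomic step relies on `os.replace`, which renames atomically when source and destination are on the same filesystem.

```python
import json
import os
import tempfile


def load_manifest(manifest_path):
    with open(manifest_path, encoding="utf-8") as f:
        return json.load(f)


def write_manifest_atomically(manifest_path, manifest):
    # Write to a temp file in the same directory, fsync, then rename over
    # the old manifest so readers see either the old or the new version.
    directory = os.path.dirname(manifest_path)
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(manifest, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, manifest_path)


def commit_compaction(manifest_path, expected_version, source_ids, dest_id):
    """Swap source blocks for the compacted block; False if the guard fails."""
    manifest = load_manifest(manifest_path)
    if manifest["version"] != expected_version:
        return False  # manifest moved on; this compaction result is stale
    if not set(source_ids) <= set(manifest["blocks"]):
        return False  # a source block is no longer live
    blocks = [b for b in manifest["blocks"] if b not in source_ids] + [dest_id]
    write_manifest_atomically(
        manifest_path, {"version": expected_version + 1, "blocks": blocks}
    )
    return True
```

If the compactor crashes before `os.replace`, the old manifest is untouched and the source blocks remain authoritative; the half-written destination block is simply never referenced.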
4. Alert evaluation and notification are separate #
Alert-state correctness is inside this system; downstream notification delivery is a separate side effect.
Step 9 - Scale Adjustments #
| Hotspot | Type | First response |
|---|---|---|
| hot tenants or very high sample ingestion rate | write throughput hotspot | shard by tenant/series space and add more ingesters |
| high-cardinality labels / index growth | memory hotspot | limit cardinality, compress index, and isolate abusive tenants |
| expensive range queries / PromQL joins | read hotspot | add query frontends, caching, and precomputed recording rules |
| compaction bandwidth / storage IO | write throughput hotspot | stagger compaction and separate hot head storage from cold blocks |
| rule-evaluation storms | contention hotspot | partition rule groups and stagger evaluation schedules |
| dashboard load | read hotspot | query caching and derived views for common panels |
What scales well #
This system scales by:
- sharding ingestion and storage by tenant or series space
- keeping samples append-oriented
- separating head ingestion from immutable block storage
- using recording rules to precompute expensive reads
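
Sharding by tenant or series space can be sketched as a hash of the routing key followed by a lookup in the partition map. Modulo hashing over a list is the simplest form and is an assumption here; it reshuffles most keys when the shard count changes, which is why a consistent-hashing ring is a common alternative. All names are hypothetical.

```python
import hashlib


def shard_for(tenant, series_id, num_shards):
    """Map (tenant, series_id) to a shard index in [0, num_shards)."""
    digest = hashlib.sha256(f"{tenant}/{series_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def route(tenant, series_id, partition_map):
    """partition_map: list of owner names indexed by shard id."""
    shard = shard_for(tenant, series_id, len(partition_map))
    return shard, partition_map[shard]
```

Routing only on `tenant` instead would keep a tenant's series together and cut query fanout, at the cost of making a hot tenant a single-shard hotspot.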
What fails first #
Usually:
- high-cardinality explosion
- query fanout across too many shards
- compaction IO bottlenecks
- rule evaluation spikes
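
Since high-cardinality explosion is usually the first failure, a per-tenant active-series limit is a natural first guard. The sketch below is a minimal version with hypothetical names; it only counts series and does not model expiry of inactive series.

```python
class SeriesLimiter:
    """Caps the number of distinct active series per tenant (illustrative)."""

    def __init__(self, max_series_per_tenant):
        self.max_series_per_tenant = max_series_per_tenant
        self._active = {}  # tenant -> set of series ids

    def admit(self, tenant, series_id):
        series = self._active.setdefault(tenant, set())
        if series_id in series:
            return True   # existing series: always allowed to keep appending
        if len(series) >= self.max_series_per_tenant:
            return False  # new series would exceed the tenant's budget
        series.add(series_id)
        return True
```

Rejecting only new series, while existing ones keep ingesting, limits the blast radius of one abusive label without dropping data the tenant already relies on.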
Canonical design conclusion #
The mechanical outcome is:
- primary state:
  - `ScrapeTargetState`
  - `MetricSample`
  - `SeriesMetadata`
  - `AlertState`
  - `TSDBBlockState`
  - `PartitionOwnership`
  - `PartitionMap`
- critical invariants:
  - immutable sample append semantics
  - canonical label-to-series mapping
  - safe block compaction preserving query truth
  - alert state valid only for current evaluation result and lifecycle state
  - exclusive shard ownership for ingestion/storage
- mechanisms:
  - append log
  - single writer per shard
  - guarded alert and compaction transitions
  - fenced shard ownership
- reads:
  - direct authoritative reads for samples, labels, and alerts
  - projections only for dashboards and cluster status
Polished interview answer #
I’d design the monitoring platform as a sharded scrape-and-store system with one authoritative owner per tenant or series shard. Scrapers refresh target state and append immutable samples into a TSDB head plus WAL, while shard owners maintain canonical series metadata and later compact head data into immutable blocks. Queries read both head and persisted blocks through a label index, recording rules append derived samples back into storage, and alert instances move through a guarded lifecycle based on periodic rule evaluation. The main scaling levers are more ingestion shards, strong controls on label cardinality, precomputed recording rules, and separate head-versus-block storage paths.
Concrete Substrate #
I’ll choose a sharded scrape/ingest/storage service with per-shard TSDB heads plus immutable persisted blocks as the concrete baseline, because it matches the mechanics we derived:
- scrape target state as current-value control data
- append-only sample ingestion
- canonical per-shard series indexing
- guarded alert lifecycle and block compaction
- one owner per ingestion/storage shard
Concrete tech family:
- scraper/ingester/query services in Go
- per-shard TSDB head with WAL
- immutable compacted blocks on local disk or in object storage
- metadata/control: etcd or internal metadata quorum for shard ownership/routing
- query frontend for fanout and PromQL execution
Each shard owner stores:
- current scrape target state
- TSDB head and WAL
- series label index / metadata
- alert state for owned rule groups
- block manifest / compaction state
Persisted block store stores:
- immutable compressed sample blocks
- postings / index metadata
- block manifests
Operation Layer #
1. Refresh scrape target #
API
- internal discovery refresh
Initiator
- system
Entry point
- discovery/scrape manager
Authoritative decider
- shard owner for target scope
Precondition
- target source update or refresh cycle due
Transition
- overwrite `ScrapeTargetState(target_id)`
2. Ingest sample batch #
API
- internal scrape append
Initiator
- scraper
Entry point
- shard owner / ingester
Authoritative decider
- shard owner for tenant/series space
Precondition
- target currently assigned and sample batch parsed
Transition
- append `MetricSample`s into WAL/head
- create or refresh `SeriesMetadata` as needed
Response
- success / partial failure
3. Query time series #
API
- `Query(query_expr, time_range, tenant)`
Initiator
- user/client
Entry point
- query frontend
Authoritative decider
- relevant shard owners plus query engine
Precondition
- none
Transition
- none
Response
- time-series vectors / matrices / scalars
4. Evaluate rule and update alert state #
API
- internal rule-evaluation loop
Initiator
- system
Entry point
- rule evaluator
Authoritative decider
- shard owner for rule group
Precondition
- evaluation interval due
Transition
- read current query result
- append `RecordedSample` if recording rule
- CAS-update `AlertState` if alert rule
5. Compact TSDB blocks #
API
- internal compaction loop
Initiator
- system
Entry point
- shard owner / compactor
Authoritative decider
- compaction owner for block set
Precondition
- source blocks eligible for compaction
Transition
- read source blocks
- create destination compacted block
- atomically update `TSDBBlockState` / manifest to swap visibility
Entry Point vs Decider vs Responder #
| Path | Entry point | Authoritative decider | Physical responder | Logical responder |
|---|---|---|---|---|
| scrape/ingest | scraper / ingester | shard owner | ingester node | monitoring platform |
| query | query frontend | shard owners + query engine | query frontend | monitoring platform |
| rule evaluation | rule evaluator | shard owner | evaluator node | monitoring platform |
| compaction | compactor / shard owner | compaction owner | storage node | monitoring platform |
| shard failover | follower / coordination layer | shard quorum / lease store | new leader / control plane | monitoring platform |
Concrete HLD #
Main components:
- discovery / scrape manager
- tracks targets and schedules scrapes
- ingestion shard owners
- authoritative owners for TSDB head, series metadata, and block lifecycle
- query frontend / PromQL engine
- fans queries out to relevant shards
- rule evaluator
- runs recording and alert rules on schedule
- TSDB block store
- stores immutable compacted blocks
- metadata/control service
- tracks shard ownership and routing