Metrics Monitoring Platform (Prometheus-class: TSDB, Scrape, PromQL) #

This note models a Prometheus-class monitoring platform where scrapers pull metric samples from targets, samples are stored in a time-series database, users query with label selectors and range functions, and rules/alerts are continuously evaluated over stored series.


Step 1 - Normalize #

Assume the baseline prompt is:

  • design a metrics monitoring platform
  • scrapers pull metrics from many targets
  • metric samples are stored in a TSDB
  • users query metrics with label filters and PromQL-like functions
  • rules and alerts are evaluated continuously
  • system scales across many targets and tenants

Normalize into state-affecting paths.

| Requirement | Actor | Operation | State touched | Priority |
| --- | --- | --- | --- | --- |
| Scraper discovers or refreshes scrape target | System | overwrite state | S1, update target, ScrapeTargetState | C1 |
| Scraper ingests metric sample batch | System | append event | S1, create target, MetricSample | C1 |
| System updates series metadata / index | System | state transition | S1, update target, SeriesMetadata | C1 |
| User queries time series / labels | Client | read source | S1, read source target, MetricSample | R1 |
| Rule engine evaluates recording or alert rule | System | read source | S1, read source target, MetricSample | C1 |
| System writes recording rule output | System | append event | S1, create target, RecordedSample | C1 |
| System updates alert instance state | System | state transition | S1, update target, AlertState | C1 |
| System compacts / merges TSDB blocks | System | async process | S1, hidden write target, TSDBBlockState | C1 |
| System routes tenant/shard to current owner | System | read source | S1, read source target, PartitionMap | C1 |
| System reassigns shard ownership after node failure | System | state transition | S1, update target, PartitionOwnership | C1 |
| User reads dashboards / status | Client | read projection | S1, read projection target, MonitoringStatusView | R2 |

Notes on normalization #

Important choices:

  • scrape target refresh is overwrite state
    • current target set/health is current-value control state
  • sample ingestion is append event
    • metric samples are immutable points in time
  • series metadata update is state transition
    • label/index lifecycle evolves as new series appear or become inactive
  • rule evaluation is a read path
    • it reads TSDB source state
  • recording-rule output is append-only
    • derived samples are still immutable time-series points
  • alert lifecycle is a state transition
    • alert instances move through inactive/pending/firing/resolved
  • compaction is explicit because TSDB storage lifecycle is core

This system is a hybrid of:

  • time-series append store
  • label index
  • continuous rule evaluation

Step 2 - Critical Path Selection #

| Requirement | Priority class | Why |
| --- | --- | --- |
| Discover / refresh scrape target state | C1 | stale target state can break scrape coverage |
| Ingest metric samples | C1 | primary product truth is the sample stream |
| Update series metadata / index | C1 | queries depend on correct label-to-series mapping |
| Query time series | R1 | core serving path |
| Evaluate recording / alert rules | C1 | correctness of derived metrics and alerts depends on this path |
| Write recording rule output | C1 | derived metrics become stored truth for downstream queries |
| Update alert state | C1 | alert firing/resolution correctness depends on current rule evaluation |
| Compact TSDB blocks | C1 | storage lifecycle and query correctness depend on safe compaction |
| Route to shard owner | C1 | wrong routing can split ingestion/query truth |
| Reassign shard ownership | C1 | failover must preserve ingestion/query/storage correctness |
| Dashboards / status | R2 | operational only |

Baseline critical paths #

Main C1 paths:

  • P1 refresh scrape target state
  • P2 ingest metric samples
  • P3 update series metadata
  • P4 evaluate rules
  • P5 write recording-rule output
  • P6 update alert state
  • P7 compact TSDB blocks
  • P8 route to shard owner
  • P9 reassign shard ownership

Main R1 path:

  • P10 query time series / labels

This design is driven by:

  • immutable sample ingestion
  • correct label/index maintenance
  • TSDB block lifecycle
  • rule and alert state derived from authoritative sample history

Step 3 - Primary State Extraction #

For a Prometheus-class system, the minimal primary state is scrape target state, metric samples, series metadata/index state, alert state, TSDB block lifecycle state, and routing/ownership state.

| Candidate object label | Candidate source | Candidate needed for C1/R1? | Candidate decomposition action | Class | Primary? | Owner | Evolution | Scope kind | Scope value |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ScrapeTargetState | direct noun | Yes | keep as candidate | process | Yes | service | overwrite | instance | target_id |
| MetricSample | direct noun | Yes | keep as candidate | event | Yes | service | append-only | instance | series_id + timestamp |
| SeriesMetadata | hidden write target | Yes | keep as candidate | entity | Yes | service | overwrite | instance | series_id |
| AlertState | lifecycle object | Yes | keep as candidate | process | Yes | service | state machine | instance | alert_key |
| TSDBBlockState | hidden write target | Yes | keep as candidate | process | Yes | service | state machine | instance | block_id |
| PartitionOwnership | hidden write target | Yes | keep as candidate | process | Yes | service | state machine | instance | shard_id |
| PartitionMap | hidden write target | Yes | keep as candidate | entity | Yes | service | overwrite | collection | tenant/shard map |
| MonitoringStatusView | derived read model | No | reject as UI artifact | projection | No | derived | overwrite | collection | tenant or cluster |

Important modeling choices #

ScrapeTargetState #

Primary because:

  • target discovery and health determine what gets scraped
  • current endpoint labels, intervals, and scrape health are control truth

MetricSample #

Primary because:

  • sample ingestion is the core immutable fact stream

SeriesMetadata #

Primary because:

  • PromQL-style reads depend on label matching and series lookup
  • captures label set, fingerprint/series id, retention status, and head/block placement
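The label matching that SeriesMetadata supports can be sketched as an inverted index from label pairs to series ids. This is a minimal illustration, not how any particular TSDB lays out its index: the `Index` type, its methods, and the `name=value` key format are hypothetical, and only equality matchers are handled.

```go
package main

import (
	"fmt"
	"sort"
)

// SeriesID is a hypothetical numeric series identifier.
type SeriesID uint64

// Index maps each "name=value" label pair to the sorted ids of the
// series that carry it (a postings list).
type Index struct {
	postings map[string][]SeriesID
}

func NewIndex() *Index { return &Index{postings: map[string][]SeriesID{}} }

// Add registers a series under every one of its label pairs.
func (ix *Index) Add(id SeriesID, labels map[string]string) {
	for name, value := range labels {
		key := name + "=" + value
		list := append(ix.postings[key], id)
		sort.Slice(list, func(i, j int) bool { return list[i] < list[j] })
		ix.postings[key] = list
	}
}

// Select returns the series matching every equality matcher by
// intersecting the postings lists of the matchers.
func (ix *Index) Select(matchers map[string]string) []SeriesID {
	var result []SeriesID
	first := true
	for name, value := range matchers {
		list := ix.postings[name+"="+value]
		if first {
			result = append([]SeriesID(nil), list...)
			first = false
			continue
		}
		result = intersect(result, list)
	}
	return result
}

// intersect walks two sorted lists and keeps ids present in both.
func intersect(a, b []SeriesID) []SeriesID {
	out := []SeriesID{}
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] == b[j]:
			out = append(out, a[i])
			i++
			j++
		case a[i] < b[j]:
			i++
		default:
			j++
		}
	}
	return out
}

func main() {
	ix := NewIndex()
	ix.Add(1, map[string]string{"__name__": "http_requests_total", "job": "api"})
	ix.Add(2, map[string]string{"__name__": "http_requests_total", "job": "web"})
	ix.Add(3, map[string]string{"__name__": "up", "job": "api"})
	fmt.Println(ix.Select(map[string]string{"__name__": "http_requests_total", "job": "api"}))
}
```

Because every extra label pair adds postings entries, index size grows with label cardinality, which is why cardinality shows up later as a scaling concern.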

AlertState #

Primary because:

  • alert instances have lifecycle across evaluations

TSDBBlockState #

Primary because:

  • head/block/compaction lifecycle determines storage truth and queryability

Minimal strict primary set #

The strongest minimal set is:

  • ScrapeTargetState
  • MetricSample
  • SeriesMetadata
  • AlertState
  • TSDBBlockState
  • PartitionOwnership
  • PartitionMap

Step 4 - Hard Invariants #

For a Prometheus-class monitoring platform, the hard invariants are about durable sample append semantics, correct label-to-series mapping, safe TSDB compaction, and valid alert lifecycle transitions.

| Path | Tier | Type | Invariant statement |
| --- | --- | --- | --- |
| P1 refresh scrape target state | HARD | ordering | Scrape-target revisions are ordered by monotonic target-state version within target scope. |
| P2 ingest metric samples | HARD | uniqueness | Key (series_id, timestamp) maps to at most one stored sample value within series scope. |
| P2 ingest metric samples | HARD | ordering | Samples for series_id are ordered by timestamp within the authoritative ingestion scope. |
| P3 update series metadata | HARD | accounting | SeriesMetadata(series_id) corresponds to exactly one canonical label set and storage placement for that series scope. |
| P4 evaluate rules | HARD | freshness | Rule evaluation reads authoritative sample history and metadata within the configured evaluation consistency bound. |
| P5 write recording-rule output | HARD | uniqueness | Key (recorded_series_id, timestamp) maps to at most one recorded sample within recorded-series scope. |
| P6 update alert state | HARD | eligibility | Action advance_alert_state is valid only if the current rule evaluation result and current AlertState allow the transition at decision time. |
| P7 compact TSDB blocks | HARD | accounting | Compacted block contents equal the union of authoritative source blocks modulo dedup/retention rules, and the source-to-destination block transition preserves queryable truth. |
| P8 route to shard owner | HARD | uniqueness | Key shard_id maps to at most one current authoritative owner within shard scope. |
| P9 reassign shard ownership | HARD | eligibility | Action reassign_shard is valid only if the current owner is failed or has relinquished ownership, and the candidate owner is eligible and sufficiently current on shard_id at decision time. |
| P10 query time series | HARD | freshness | Query path reflects authoritative head plus block data within the configured query consistency bound. |

What matters most #

1. Samples are immutable time points #

Once accepted, sample points are append-only facts.

2. Label mapping must be canonical #

One label set must resolve to one canonical series identity within a shard/tenant scope.
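One way to make the mapping canonical is to derive the series identity from the label set in a fixed order. The sketch below assumes FNV-1a hashing over labels sorted by name; the function name and the separator byte are illustrative choices, and a real system would also need to compare full label sets when two different sets hash to the same value.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// SeriesID derives a series identity from a label set by hashing the
// labels in sorted name order, so map iteration order cannot change
// the result.
func SeriesID(labels map[string]string) uint64 {
	names := make([]string, 0, len(labels))
	for name := range labels {
		names = append(names, name)
	}
	sort.Strings(names)

	h := fnv.New64a()
	for _, name := range names {
		// 0xff never appears in valid UTF-8, so it separates names
		// and values unambiguously.
		h.Write([]byte(name))
		h.Write([]byte{0xff})
		h.Write([]byte(labels[name]))
		h.Write([]byte{0xff})
	}
	return h.Sum64()
}

func main() {
	a := SeriesID(map[string]string{"job": "api", "instance": "10.0.0.1:9090"})
	b := SeriesID(map[string]string{"instance": "10.0.0.1:9090", "job": "api"})
	fmt.Println(a == b) // same label set, same identity
}
```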

3. Compaction must preserve query truth #

Block lifecycle changes cannot lose or double-count stored samples.

4. Alert state is derived but persistent #

Alerts are not just pure query output; they have ongoing lifecycle state.


Step 5 - Execution Context #

For the strict baseline monitoring platform:

| Field | Value | Why |
| --- | --- | --- |
| Topology | single service, distributed | one logical monitoring system spread across scrapers, ingesters, storage, and queriers |
| Write coordination scope | per object scope | correctness is per target, series, alert, block, and shard ownership scope |
| Read consistency target | bounded stale allowed | query serving often tolerates small bounded lag between ingestion and query visibility |
| Holder model | node | shard ownership is held by ingesters/storage nodes |
| Compensation acceptable? | No | wrong sample ingestion or block compaction cannot be safely repaired by compensation |

Derived implications #

  • holder_may_crash = true

    • scrapers, ingesters, or shard owners can fail mid-ingestion
  • cross_service_write = false

    • baseline keeps scrape state, TSDB metadata, alerts, and ownership in one logical service
  • bounded_staleness_allowed = true

    • query paths can tolerate some ingestion-to-query lag
  • cross_service_atomicity_required = false

    • no multi-service transaction across unrelated services in baseline
  • exclusive_claim_required = true

    • shard ownership must be exclusive
  • guarded_by_current_state = true

    • alert and compaction transitions depend on current state

What this implies #

This pushes us toward:

  • one authoritative owner per ingestion/storage shard
  • append-oriented head block plus immutable blocks
  • label index maintained by shard owner
  • rule and alert state derived from authoritative samples

Step 6 - Deterministic Mechanism Selection #

| Path | Write shape | Base mechanism | Required companions |
| --- | --- | --- | --- |
| P1 refresh scrape target state | overwrite current value | CAS on version | target-state version |
| P2 ingest metric samples | append-only event | append log / TSDB head append | duplicate-sample policy, per-series timestamp discipline |
| P3 update series metadata | overwrite current value | single writer per shard or CAS on version | canonical label hashing |
| P4 evaluate rules | read source | direct source read | evaluation schedule |
| P5 write recording-rule output | append-only event | append log / TSDB head append | recorded-series identity |
| P6 update alert state | guarded state transition | CAS on (state, version) | evaluation timestamp/version |
| P7 compact TSDB blocks | guarded state transition | single-writer compaction with atomic block swap | block manifest/version |
| P8 route to shard owner | exclusive claim | lease | fencing token, heartbeat |
| P9 reassign shard ownership | guarded state transition | CAS on (state, version) | fencing token, shard catch-up check |

Why these fit #

Scrape target state #

This is current-value control state, so overwrite fits.
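A versioned overwrite for this path might look like the following. It is a sketch under assumptions: `TargetStore`, `TargetState`, and `Refresh` are hypothetical names, and an in-memory map guarded by a mutex stands in for whatever store holds the target state.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var ErrStaleVersion = errors.New("stale target-state version")

// TargetState is the current-value record for one scrape target.
type TargetState struct {
	Endpoint string
	Healthy  bool
	Version  uint64
}

// TargetStore is an in-memory stand-in for the authoritative target store.
type TargetStore struct {
	mu      sync.Mutex
	targets map[string]TargetState
}

func NewTargetStore() *TargetStore {
	return &TargetStore{targets: map[string]TargetState{}}
}

// Refresh overwrites the state for targetID only if the caller read the
// version that is still current (compare-and-set on version).
func (s *TargetStore) Refresh(targetID string, expected uint64, next TargetState) (uint64, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	cur := s.targets[targetID] // unknown targets start at version 0
	if cur.Version != expected {
		return cur.Version, ErrStaleVersion
	}
	next.Version = expected + 1
	s.targets[targetID] = next
	return next.Version, nil
}

func main() {
	store := NewTargetStore()
	v, _ := store.Refresh("t1", 0, TargetState{Endpoint: "10.0.0.1:9100", Healthy: true})
	fmt.Println("version after first refresh:", v)

	// A second refresher that still believes the version is 0 loses the CAS.
	_, err := store.Refresh("t1", 0, TargetState{Endpoint: "10.0.0.9:9100"})
	fmt.Println("stale refresh:", err)
}
```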

Sample ingestion #

Samples are immutable points, so append-only fits naturally.

Series metadata #

Current label/index mapping is current-value state, typically maintained by one shard owner.

Alert state #

Alerts transition through current lifecycle states, so guarded state transition fits.

TSDB compaction #

Compaction is a stateful storage lifecycle change that must preserve current query truth, so guarded transition fits.

Canonical substrate implied #

The baseline now points to:

  • sharded scrape/ingest/storage service
  • one owner per tenant or series shard
  • append-oriented head block and immutable persisted blocks
  • label index and query engine over authoritative shard data
  • periodic rule evaluation plus alert lifecycle state

Step 7 - Read Model / Source of Truth #

For a Prometheus-class system, truth is mostly direct source state. Dashboards and UIs are derived.

| Concept | Truth | Read path | Rebuild path |
| --- | --- | --- | --- |
| C1 scrape target configuration/state | ScrapeTargetState | read source directly | authoritative target store |
| C2 metric samples | MetricSample | read source directly | authoritative TSDB head and blocks |
| C3 series labels / metadata | SeriesMetadata | read source directly | authoritative index metadata |
| C4 alert lifecycle | AlertState | read source directly | authoritative alert-state store |
| C5 TSDB block lifecycle | TSDBBlockState | read source directly | authoritative block manifest/state store |
| C6 shard ownership | PartitionOwnership | read source directly | authoritative ownership store |
| C7 shard routing map | PartitionMap | read source directly | authoritative routing metadata |
| C8 dashboards / status | derived from samples, alerts, and block state | materialized view | recompute from authoritative state |

Important point #

For the core semantics:

  • queries read authoritative head plus block data
  • series lookup reads authoritative label metadata
  • alert state reads authoritative evaluation and prior lifecycle state
  • dashboards are projections

Step 8 - Failure Handling #

| Path | Retry | Competing writers | Crash after commit | Publish failure | Stale holder |
| --- | --- | --- | --- | --- | --- |
| P1 refresh scrape target state | retry with target-state version | stale refresh loses CAS | committed target state survives crash if persisted | n/a | n/a |
| P2 ingest metric samples | scraper retries may resend same samples; dedup policy required for duplicates | shard owner serializes authoritative series append for a shard | committed samples survive crash if WAL/head persisted | partial batch append recovered from WAL/head replay | stale owner blocked by fencing token |
| P3 update series metadata | retry with metadata version | concurrent new-series mapping resolved by canonical label identity | committed metadata survives crash if persisted | n/a | stale owner blocked by fencing token |
| P4 evaluate rules | evaluation can rerun over same time window | multiple evaluators should be partitioned or fenced by shard ownership | evaluation crash only delays derived outputs | recording/alert output may retry | stale evaluator blocked by ownership/version discipline |
| P5 write recording-rule output | retry may produce duplicate writes unless (series, timestamp) dedup policy enforced | competing writers should be avoided by single evaluator ownership | committed recorded samples survive crash if persisted | partial append recovered from WAL/head replay | stale evaluator blocked by ownership/version discipline |
| P6 update alert state | retry with state version/evaluation timestamp | stale evaluation loses guarded transition | committed alert state survives crash if persisted | downstream notification send may retry separately | stale evaluator blocked by version/token |
| P7 compact TSDB blocks | compaction can retry from source manifests | only one compactor should own a block set at a time | if crash before atomic swap, source blocks remain authoritative | partial destination block ignored until manifest commit | stale compactor blocked by block manifest/version |
| P8 route to shard owner | retry after refreshing shard map | only one valid owner should exist | if owner changed, refreshed map points to new owner | n/a | stale owner rejected by fencing token |
| P9 reassign shard ownership | retry failover transition safely | only one reassignment wins current ownership state | promoted owner crash triggers later reassignment | n/a | old owner fenced and must not continue serving |
| P10 query time series | query retry safe | many readers coexist | node crash drops query only | n/a | stale read bounded by configured query freshness |
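Several stale-holder cells rely on fencing tokens. As a rough sketch of the idea, the lease store hands out a token that increases on every ownership grant, and storage rejects writes carrying an older token. `LeaseStore` and `ShardStorage` are hypothetical in-memory stand-ins; lease expiry and heartbeats are not modeled, and `Acquire` simply represents a grant after the previous lease has lapsed.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var ErrFenced = errors.New("write rejected: stale fencing token")

// LeaseStore grants shard ownership with a monotonically increasing
// token per shard. It stands in for etcd or a metadata quorum.
type LeaseStore struct {
	mu    sync.Mutex
	token map[string]uint64
}

// Acquire grants ownership of shardID and returns the new fencing token.
func (l *LeaseStore) Acquire(shardID string) uint64 {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.token[shardID]++
	return l.token[shardID]
}

// ShardStorage remembers the highest token it has accepted and rejects
// writes that carry an older one.
type ShardStorage struct {
	mu      sync.Mutex
	highest uint64
	writes  []string
}

func (s *ShardStorage) Write(token uint64, payload string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if token < s.highest {
		return ErrFenced
	}
	s.highest = token
	s.writes = append(s.writes, payload)
	return nil
}

func main() {
	leases := &LeaseStore{token: map[string]uint64{}}
	storage := &ShardStorage{}

	tokenA := leases.Acquire("shard-7") // node A becomes owner
	fmt.Println(storage.Write(tokenA, "samples from A"))

	tokenB := leases.Acquire("shard-7") // failover: node B takes over
	fmt.Println(storage.Write(tokenB, "samples from B"))

	// Node A wakes up after a pause and still believes it is the owner.
	fmt.Println(storage.Write(tokenA, "late samples from A"))
}
```

The important property is that the check happens at the storage layer, so a paused or partitioned old owner cannot corrupt the shard even if it never learns that its lease was lost.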

What matters most #

1. Sample dedup policy #

Scrape retries or HA scrapers can re-send the same (series, timestamp) sample. The system needs a defined dedup policy.
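One possible dedup policy is sketched below: an exact resend is ignored, a different value at the same timestamp is rejected, and samples older than the newest stored timestamp are rejected. This is only one choice among several; the `Head` type and the error names are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	ErrOutOfOrder        = errors.New("sample older than newest stored timestamp")
	ErrDuplicateConflict = errors.New("same timestamp, different value")
)

type Sample struct {
	Timestamp int64 // milliseconds since epoch
	Value     float64
}

// Head is an in-memory stand-in for a per-shard TSDB head.
type Head struct {
	series map[uint64][]Sample
}

// Append stores s for seriesID. It returns true if the sample was
// stored, false with a nil error if it was an exact duplicate, and an
// error if it violates the ordering or uniqueness policy.
func (h *Head) Append(seriesID uint64, s Sample) (bool, error) {
	cur := h.series[seriesID]
	if n := len(cur); n > 0 {
		last := cur[n-1]
		switch {
		case s.Timestamp < last.Timestamp:
			return false, ErrOutOfOrder
		case s.Timestamp == last.Timestamp && s.Value == last.Value:
			// Note: NaN never equals NaN here; a real system would
			// likely compare bit patterns instead.
			return false, nil
		case s.Timestamp == last.Timestamp:
			return false, ErrDuplicateConflict
		}
	}
	h.series[seriesID] = append(cur, s)
	return true, nil
}

func main() {
	h := &Head{series: map[uint64][]Sample{}}
	fmt.Println(h.Append(1, Sample{Timestamp: 1000, Value: 0.5})) // stored
	fmt.Println(h.Append(1, Sample{Timestamp: 1000, Value: 0.5})) // retry, ignored
	fmt.Println(h.Append(1, Sample{Timestamp: 1000, Value: 0.9})) // conflict
	fmt.Println(h.Append(1, Sample{Timestamp: 500, Value: 0.1}))  // out of order
}
```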

2. WAL/head before durable visibility #

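Ingested samples should survive a process crash: each accepted sample is written to the WAL so the head can be rebuilt by replay, even when the sample has not yet been compacted into a persisted block.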

3. Compaction must use atomic block replacement #

New compacted blocks should only become visible when their manifest/state is fully committed.
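The swap can be illustrated with an in-memory manifest: the compactor builds the destination block off to the side, then publishes it and retires the sources in one guarded step. `Manifest`, `Block`, and `merge` are hypothetical names, and the "first occurrence wins" dedup rule is an assumption made for the example.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
	"sync"
)

var ErrManifestChanged = errors.New("manifest changed since compaction started")

type Sample struct {
	SeriesID  uint64
	Timestamp int64
	Value     float64
}

type Block struct {
	ID      string
	Samples []Sample
}

// Manifest lists the blocks that queries are allowed to read.
type Manifest struct {
	mu      sync.Mutex
	version uint64
	blocks  map[string]Block
}

// merge builds the destination block: the union of the sources with
// duplicate (series, timestamp) pairs removed, first occurrence wins.
func merge(id string, sources []Block) Block {
	type key struct {
		series uint64
		ts     int64
	}
	seen := map[key]bool{}
	out := Block{ID: id}
	for _, src := range sources {
		for _, s := range src.Samples {
			k := key{s.SeriesID, s.Timestamp}
			if seen[k] {
				continue
			}
			seen[k] = true
			out.Samples = append(out.Samples, s)
		}
	}
	sort.Slice(out.Samples, func(i, j int) bool {
		a, b := out.Samples[i], out.Samples[j]
		if a.SeriesID != b.SeriesID {
			return a.SeriesID < b.SeriesID
		}
		return a.Timestamp < b.Timestamp
	})
	return out
}

// Swap publishes dst and retires the sources in one step, but only if
// the manifest version is the one the compactor started from.
func (m *Manifest) Swap(expected uint64, sourceIDs []string, dst Block) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.version != expected {
		return ErrManifestChanged
	}
	for _, id := range sourceIDs {
		delete(m.blocks, id)
	}
	m.blocks[dst.ID] = dst
	m.version++
	return nil
}

func main() {
	m := &Manifest{blocks: map[string]Block{
		"b1": {ID: "b1", Samples: []Sample{{1, 10, 0.5}, {1, 20, 0.7}}},
		"b2": {ID: "b2", Samples: []Sample{{1, 20, 0.7}, {1, 30, 0.9}}},
	}}
	dst := merge("b3", []Block{m.blocks["b1"], m.blocks["b2"]})
	err := m.Swap(0, []string{"b1", "b2"}, dst)
	fmt.Println(len(dst.Samples), len(m.blocks), err)
}
```

If the compactor crashes before `Swap`, the manifest still lists only the source blocks, so queries never observe a half-written destination.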

4. Alert evaluation and notification are separate #

Alert-state correctness is inside this system; downstream notification delivery is a separate side effect.


Step 9 - Scale Adjustments #

| Hotspot | Type | First response |
| --- | --- | --- |
| hot tenants or very high sample ingestion rate | write throughput hotspot | shard by tenant/series space and add more ingesters |
| high-cardinality labels / index growth | memory hotspot | limit cardinality, compress index, and isolate abusive tenants |
| expensive range queries / PromQL joins | read hotspot | add query frontends, caching, and precomputed recording rules |
| compaction bandwidth / storage IO | write throughput hotspot | stagger compaction and separate hot head storage from cold blocks |
| rule-evaluation storms | contention hotspot | partition rule groups and stagger evaluation schedules |
| dashboard load | read hotspot | query caching and derived views for common panels |

What scales well #

This system scales by:

  • sharding ingestion and storage by tenant or series space
  • keeping samples append-oriented
  • separating head ingestion from immutable block storage
  • using recording rules to precompute expensive reads
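Sharding by tenant or series space can be sketched as hashing (tenant, series id) onto a fixed number of shards and looking up the owner in the partition map. `ShardFor` and the slice-based `PartitionMap` are hypothetical; plain modulo hashing is assumed here for brevity, and it reshuffles most keys when the shard count changes, which a consistent-hashing scheme would reduce.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
)

// ShardFor maps a (tenant, series) pair onto one of numShards shards.
// numShards must be greater than zero.
func ShardFor(tenant string, seriesID uint64, numShards uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(tenant))
	var buf [8]byte
	binary.LittleEndian.PutUint64(buf[:], seriesID)
	h.Write(buf[:])
	return h.Sum32() % numShards
}

// PartitionMap lists the current owner node of each shard, indexed by
// shard number.
type PartitionMap []string

// Owner returns the node that currently owns the shard for the given
// tenant and series, or false if the map is empty.
func (m PartitionMap) Owner(tenant string, seriesID uint64) (string, bool) {
	if len(m) == 0 {
		return "", false
	}
	return m[ShardFor(tenant, seriesID, uint32(len(m)))], true
}

func main() {
	pm := PartitionMap{"ingester-0", "ingester-1", "ingester-2", "ingester-3"}
	for _, id := range []uint64{1, 2, 3} {
		owner, _ := pm.Owner("tenant-a", id)
		fmt.Println("series", id, "->", owner)
	}
}
```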

What fails first #

Usually:

  • high-cardinality explosion
  • query fanout across too many shards
  • compaction IO bottlenecks
  • rule evaluation spikes

Canonical design conclusion #

The mechanical outcome is:

  • primary state:
    • ScrapeTargetState
    • MetricSample
    • SeriesMetadata
    • AlertState
    • TSDBBlockState
    • PartitionOwnership
    • PartitionMap
  • critical invariants:
    • immutable sample append semantics
    • canonical label-to-series mapping
    • safe block compaction preserving query truth
    • alert state valid only for current evaluation result and lifecycle state
    • exclusive shard ownership for ingestion/storage
  • mechanisms:
    • append log
    • single writer per shard
    • guarded alert and compaction transitions
    • fenced shard ownership
  • reads:
    • direct authoritative reads for samples, labels, and alerts
    • projections only for dashboards and cluster status

Polished interview answer #

I’d design the monitoring platform as a sharded scrape-and-store system with one authoritative owner per tenant or series shard. Scrapers refresh target state and append immutable samples into a TSDB head plus WAL, while shard owners maintain canonical series metadata and later compact head data into immutable blocks. Queries read both head and persisted blocks through a label index, recording rules append derived samples back into storage, and alert instances move through a guarded lifecycle based on periodic rule evaluation. The main scaling levers are more ingestion shards, strong controls on label cardinality, precomputed recording rules, and separate head-versus-block storage paths.


Concrete Substrate #

I’ll choose a sharded scrape/ingest/storage service with per-shard TSDB heads plus immutable persisted blocks as the concrete baseline, because it matches the mechanics we derived:

  • scrape target state as current-value control data
  • append-only sample ingestion
  • canonical per-shard series indexing
  • guarded alert lifecycle and block compaction
  • one owner per ingestion/storage shard

Concrete tech family:

  • scraper/ingester/query services in Go
  • per-shard TSDB head with WAL
  • immutable compacted blocks in local disk or object storage
  • metadata/control:
    • etcd or internal metadata quorum for shard ownership/routing
  • query frontend for fanout and PromQL execution

Each shard owner stores:

  • current scrape target state
  • TSDB head and WAL
  • series label index / metadata
  • alert state for owned rule groups
  • block manifest / compaction state

Persisted block store stores:

  • immutable compressed sample blocks
  • postings / index metadata
  • block manifests

Operation Layer #

1. Refresh scrape target #

API

  • internal discovery refresh

Initiator

  • system

Entry point

  • discovery/scrape manager

Authoritative decider

  • shard owner for target scope

Precondition

  • target source update or refresh cycle due

Transition

  • overwrite ScrapeTargetState(target_id)

2. Ingest sample batch #

API

  • internal scrape append

Initiator

  • scraper

Entry point

  • shard owner / ingester

Authoritative decider

  • shard owner for tenant/series space

Precondition

  • target currently assigned and sample batch parsed

Transition

  • append MetricSamples into WAL/head
  • create or refresh SeriesMetadata as needed

Response

  • success / partial failure

3. Query time series #

API

  • Query(query_expr, time_range, tenant)

Initiator

  • user/client

Entry point

  • query frontend

Authoritative decider

  • relevant shard owners plus query engine

Precondition

  • none

Transition

  • none

Response

  • time-series vectors / matrices / scalars

4. Evaluate rule and update alert state #

API

  • internal rule-evaluation loop

Initiator

  • system

Entry point

  • rule evaluator

Authoritative decider

  • shard owner for rule group

Precondition

  • evaluation interval due

Transition

  • read current query result
  • append RecordedSample if recording rule
  • CAS-update AlertState if alert rule

5. Compact TSDB blocks #

API

  • internal compaction loop

Initiator

  • system

Entry point

  • shard owner / compactor

Authoritative decider

  • compaction owner for block set

Precondition

  • source blocks eligible for compaction

Transition

  • read source blocks
  • create destination compacted block
  • atomically update TSDBBlockState / manifest to swap visibility

Entry Point vs Decider vs Responder #

| Path | Entry point | Authoritative decider | Physical responder | Logical responder |
| --- | --- | --- | --- | --- |
| scrape/ingest | scraper / ingester | shard owner | ingester node | monitoring platform |
| query | query frontend | shard owners + query engine | query frontend | monitoring platform |
| rule evaluation | rule evaluator | shard owner | evaluator node | monitoring platform |
| compaction | compactor / shard owner | compaction owner | storage node | monitoring platform |
| shard failover | follower / coordination layer | shard quorum / lease store | new leader / control plane | monitoring platform |

Concrete HLD #

Main components:

  • discovery / scrape manager
    • tracks targets and schedules scrapes
  • ingestion shard owners
    • authoritative owners for TSDB head, series metadata, and block lifecycle
  • query frontend / PromQL engine
    • fans queries out to relevant shards
  • rule evaluator
    • runs recording and alert rules on schedule
  • TSDB block store
    • stores immutable compacted blocks
  • metadata/control service
    • tracks shard ownership and routing

Short Interview Version #

I’d build the monitoring platform as a sharded scrape-and-store system with one authoritative owner per tenant or series shard. Scrapers refresh target state and append immutable samples into a TSDB head plus WAL, while shard owners maintain canonical series metadata and later compact head data into immutable blocks. Queries read both head and compacted blocks through a label index, recording rules append derived samples back into storage, and alert instances move through a guarded lifecycle based on periodic rule evaluation. The main scaling levers are more ingestion shards, strong controls on label cardinality, precomputed recording rules, and separate head-versus-block storage paths.