Metrics Monitoring: System Design #
A distributed metrics monitoring platform in the style of Datadog, Grafana Cloud, or AWS CloudWatch. Covers the scrape/push ingestion duality, TSDB storage internals, PromQL query execution, alerting, and long-term retention. Grounded in the Prometheus TSDB and scrape engine, VictoriaMetrics storage internals, and the OpenTelemetry Collector pipeline.
1. Requirements #
Functional Requirements #
- Metric ingestion — pull (scrape) and push (OTLP, remote write) models; support for counters, gauges, histograms, summaries
- Storage — store time series with configurable retention (15 days local; years in long-term store)
- Querying — ad-hoc and dashboard queries over time ranges; instant and range queries
- Alerting — evaluate PromQL expressions on a schedule; fire alerts when conditions hold for a configurable duration
- Dashboards — read-path API for Grafana-compatible visualization
- Label-based routing — relabeling rules to transform, drop, or re-route series during ingestion
- Federation — aggregate metrics across multiple Prometheus instances
- Remote write / read — export metrics to long-term storage backends (Thanos, VictoriaMetrics, Cortex)
Non-Functional Requirements #
- Throughput — ingest up to 1M samples/sec per node; hundreds of millions of active series across a cluster
- Write latency — P99 append < 10ms; WAL flush < 1ms
- Query latency — P99 instant query < 500ms; range query over 24h < 5s
- Availability — 99.9% for ingest (writes are local, failures are isolated); 99.9% for query
- Cardinality — support millions of unique label combinations per metric name without OOM
- Retention — local TSDB: 15 days; long-term: years (object storage or dedicated TSDB)
- Durability — the WAL bounds crash loss to the last unflushed page; without it, the entire in-memory head block (up to 2 hours of data) would be lost on crash
Capacity Estimation #
| Metric | Estimate |
|---|---|
| Active time series per node | 2–10M |
| Ingest rate per node | ~500K–1M samples/sec |
| Sample size on disk (compressed) | ~1.3 bytes/sample (XOR + delta encoding) |
| Retention window | 15 days local |
| Storage per node | 10M series × (86,400 / 15) samples/day × 15 days × 1.3B ≈ 1.1TB |
| Label cardinality | Up to 100M unique label value combinations across all series |
| Rule evaluation | 1K–100K rules evaluated every 15–60 seconds |
The XOR encoding figure comes directly from the Gorilla paper (Pelkonen et al., 2015) and is implemented in Prometheus’s XORChunk.
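As a sanity check on the storage-per-node row, the arithmetic can be sketched in a few lines of Go. The inputs are the table's assumptions (scrape interval, retention, bytes/sample), not measured values:

```go
package main

import "fmt"

// storageBytes computes the back-of-envelope disk footprint for one node.
// bytesPerSample is the Gorilla-average compression figure (~1.3B); real
// ratios vary by workload.
func storageBytes(series int, scrapeIntervalSec, retentionDays, bytesPerSample float64) float64 {
	samplesPerSeriesPerDay := 86400.0 / scrapeIntervalSec
	return float64(series) * samplesPerSeriesPerDay * retentionDays * bytesPerSample
}

func main() {
	b := storageBytes(10_000_000, 15, 15, 1.3)
	fmt.Printf("%.2f TB\n", b/1e12) // prints "1.12 TB"
}
```

The product lands near 1.1TB; halving the scrape rate or the retention window scales it linearly.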
2. Core Entities #
TimeSeries #
A time series is uniquely identified by its full label set — metric name plus all key=value labels:
http_requests_total{method="GET", handler="/api/v1/query", job="prometheus", instance="localhost:9090"}
The metric name is stored as the special label __name__, so it participates in series identity like any other label: two series with the same labels but different metric names are distinct series.
Sample #
A (timestamp, value) pair attached to a time series. Prometheus uses millisecond timestamps:
// From storage/interface.go
Append(ref SeriesRef, l labels.Labels, t int64, v float64) (SeriesRef, error)
// t is a millisecond timestamp; v is the float64 sample value
The Appender interface handles four sample kinds: plain float64 samples (the original), native integer histograms, native float histograms, and exemplars — the latter via separate methods (AppendHistogram, AppendExemplar).
MetricName (VictoriaMetrics) #
VictoriaMetrics separates the metric name group from the label tags:
// From VictoriaMetrics/lib/storage/metric_name.go
type MetricName struct {
MetricGroup []byte // the __name__ label
Tags []Tag // sorted key=value pairs; must be sorted for canonical comparison
}
The canonical form has tags sorted by key — two MetricName values with the same labels in different order are deduplicated into one series.
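A minimal Go sketch of this canonicalization, simplified to strings (the real MetricName uses []byte fields); the point is that tag order no longer matters once sorted:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Tag mirrors the key=value pairs in MetricName (simplified sketch).
type Tag struct{ Key, Value string }

// canonical sorts tags by key and renders a stable string, so two
// MetricName values with the same labels in any order compare equal.
func canonical(metricGroup string, tags []Tag) string {
	sort.Slice(tags, func(i, j int) bool { return tags[i].Key < tags[j].Key })
	parts := make([]string, 0, len(tags))
	for _, t := range tags {
		parts = append(parts, t.Key+"="+t.Value)
	}
	return metricGroup + "{" + strings.Join(parts, ",") + "}"
}

func main() {
	a := canonical("http_requests_total", []Tag{{"method", "GET"}, {"job", "api"}})
	b := canonical("http_requests_total", []Tag{{"job", "api"}, {"method", "GET"}})
	fmt.Println(a == b) // true: both dedupe to one series
}
```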
TSID (VictoriaMetrics) #
The internal identifier for a time series in VictoriaMetrics:
// From VictoriaMetrics/lib/storage/tsid.go
type TSID struct {
MetricGroupID uint64 // hash of metric name (__name__)
JobID uint32 // hash of job label
InstanceID uint32 // hash of instance label
MetricID uint64 // unique ID for the full label set (primary key)
}
MetricID is the authoritative identifier — the other fields are optimizations for grouping related series on disk (same metric name → nearby MetricIDs → better compression). The inverted index maps (label key, label value) → []MetricID to enable label-based queries.
Chunk #
A compressed block of samples for a single time series. The Prometheus XORChunk uses Gorilla XOR encoding:
// From tsdb/chunkenc/xor.go
type XORChunk struct {
b bstream // bit stream
}
// Stores the first timestamp and value verbatim;
// subsequent timestamps as delta-of-deltas (varint),
// subsequent values as XOR with the previous value (leading/trailing zero optimization).
A chunk holds ~120 samples by default (SamplesPerChunk), covering ~30 minutes at the standard 15-second scrape interval. Compression ratio: ~1.3 bytes/sample for float metrics.
memSeries #
The in-memory representation of one time series in the Prometheus Head:
// From tsdb/head.go
type memSeries struct {
ref chunks.HeadSeriesRef // stable ID within the head block
lset labels.Labels // full label set
mmappedChunks []*mmappedChunk // completed chunks mmap'd to disk
headChunks *memChunk // active in-memory chunk (linked list, newest first)
firstChunkID chunks.HeadChunkID
app chunkenc.Appender // active appender for the head chunk
nextAt int64 // timestamp at which to cut a new chunk
lastValue float64 // last written value (for duplicate detection)
}
mmappedChunks are completed chunks memory-mapped from the chunk disk mapper — they exist on disk but are accessible as if in memory. Only headChunks is being actively written to.
Alert #
// From rules/alerting.go
type AlertingRule struct {
name string
vector parser.Expr // the PromQL expression to evaluate
holdDuration time.Duration // must be true for this long before → Firing (the "for:" clause)
keepFiringFor time.Duration
labels labels.Labels // extra labels attached to alert
annotations labels.Labels // human-readable context (summary, description, runbook_url)
active map[uint64]*Alert // fingerprint → Alert; currently Pending or Firing
}
3. API / System Interface #
OpenMetrics Exposition Format (scrape target → Prometheus) #
The text format that instrumented services expose at /metrics:
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",handler="/api"} 1234 1609459200000
http_requests_total{method="POST",handler="/api"} 56
# HELP process_resident_memory_bytes Resident memory size in bytes
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 1.234e+08
The OpenMetrics specification (a CNCF project) formalizes this format with additional types (unknown, stateset, gaugehistogram, info) and content negotiation via Accept: application/openmetrics-text.
OTLP (push: instrumented service → OTel Collector → Prometheus remote write) #
The OpenTelemetry Protocol wire format. Two transports:
- HTTP: POST /v1/metrics (HTTP/1.1 or HTTP/2, protobuf or JSON body)
- gRPC: Export(ExportMetricsServiceRequest) (HTTP/2)
The OTel Collector’s OTLP receiver handles both:
// From opentelemetry-collector/receiver/otlpreceiver/otlp.go
type otlpReceiver struct {
cfg *Config
serverGRPC *grpc.Server // gRPC transport
serverHTTP *http.Server // HTTP/protobuf transport
nextMetrics consumer.Metrics // downstream: processor or exporter
nextTraces consumer.Traces
nextLogs consumer.Logs
}
The OTLP data model is hierarchical: ResourceMetrics → ScopeMetrics → Metric → DataPoint. Each Metric has a type (Gauge, Sum/Counter, Histogram, ExponentialHistogram, Summary). This hierarchy carries resource attributes (host, cloud provider, region) and instrumentation scope (library name, version) alongside the data — richer than the Prometheus flat label model.
Prometheus HTTP API (query) #
GET /api/v1/query?query=<promql>&time=<timestamp> # instant query
GET /api/v1/query_range?query=<promql>&start=&end=&step= # range query
GET /api/v1/series?match[]=<selector> # series metadata
GET /api/v1/labels # label names
GET /api/v1/label/<name>/values # label values
GET /api/v1/alerts # active alerts
Prometheus Remote Write (long-term export) #
Prometheus sends batches of samples to a remote write endpoint via HTTP POST. The payload is a snappy-compressed protobuf WriteRequest:
message WriteRequest {
repeated TimeSeries timeseries = 1;
}
message TimeSeries {
repeated Label labels = 1;
repeated Sample samples = 2;
}
message Sample {
double value = 1;
int64 timestamp = 2; // milliseconds
}
Remote write is the integration point with VictoriaMetrics, Thanos, Cortex, and Grafana Mimir — they all expose a /api/v1/write endpoint that speaks this protocol.
Alertmanager API #
Prometheus posts firing alerts to Alertmanager:
POST /api/v2/alerts [{ "labels": {...}, "annotations": {...}, "startsAt": "...", "generatorURL": "..." }]
Alertmanager handles deduplication, grouping, inhibition, silencing, and routing to receivers (PagerDuty, Slack, email, webhook).
Standards Reference #
| Standard | Body | Scope |
|---|---|---|
| OpenMetrics exposition format | CNCF (IETF draft) | Text wire format for scrape targets |
| OTLP protobuf spec | CNCF OpenTelemetry | Push protocol for metrics, traces, logs |
| Prometheus Remote Write 2.0 | CNCF Prometheus | Export protocol to long-term stores |
| PromQL grammar | Prometheus | Query language spec (EBNF in docs) |
| Alertmanager API v2 | Prometheus | Alert posting and management |
| OpenTelemetry Semantic Conventions | CNCF | Standard label names for HTTP, DB, cloud resources |
4. Data Flow #
Pull (scrape) path #
Scrape target (Go/Java/Python service)
│ GET /metrics (HTTP)
│ response: OpenMetrics text
▼
Prometheus scrapePool / scrapeLoop
│ parse OpenMetrics text
│ apply relabeling rules
│ call Appender.Append() per sample
▼
TSDB Head
│ WAL.Log(record) ← durability
│ memSeries.headChunks.app.Append(t, v) ← in-memory
▼
├── memSeries (in-memory, hot)
└── WAL segments (on disk, recovery)
Push (OTLP) path #
Instrumented service (SDK)
│ ExportMetricsServiceRequest (gRPC/HTTP)
▼
OTel Collector: OTLP receiver
│ decode protobuf → pmetric.Metrics
▼
OTel Collector: batch processor
│ accumulate until sendBatchSize or timeout
▼
OTel Collector: Prometheus remote write exporter
│ convert pmetric → WriteRequest proto
│ POST /api/v1/write (snappy-compressed)
▼
Prometheus / VictoriaMetrics remote write handler
│ decode → samples
│ Appender.Append() per sample
▼
TSDB Head (same path as pull)
Query path #
Grafana / client
│ GET /api/v1/query_range?query=...
▼
PromQL Engine
│ parse → AST
│ plan evaluation steps
▼
Storage Queryable
│ select matching series (label matchers → postings)
│ iterate chunks in time range
▼
├── Head (recent data, in-memory chunks)
└── Blocks (persisted blocks, mmap'd from disk)
│
└── [long-term store via remote read if needed]
▼
PromQL Engine: evaluate AST over samples
│ functions: rate(), sum(), histogram_quantile(), ...
▼
Result (Matrix / Vector / Scalar)
Compaction path (background) #
Head block (2h of data, in-memory + WAL)
│ compactc channel triggers compaction
▼
TSDB Compactor
│ flush head → immutable Block on disk
│ Block: index/ + chunks/ + meta.json + tombstones
▼
Blocks on disk (2h, 6h, 24h, 72h windows — logarithmic compaction)
│ remote write exporter pushes samples upstream
▼
Long-term store (VictoriaMetrics / Thanos / S3-backed Mimir)
5. High Level Design #
┌──────────────────────────────────────────────────────────────┐
│ Instrumented Services │
│ /metrics (OpenMetrics) OTLP push (gRPC / HTTP) │
└──────────┬──────────────────────────┬───────────────────────┘
│ scrape (pull) │ push
┌──────────▼────────┐ ┌───────────▼────────────────────────┐
│ Prometheus │ │ OpenTelemetry Collector │
│ │ │ │
│ Service Discovery │ │ OTLP receiver │
│ scrapePool/Loop │ │ batch processor │
│ relabeling │ │ transform processor │
│ TSDB Head │ │ Prometheus remote write exporter │
│ WAL │◄───┤ (WriteRequest → /api/v1/write) │
│ Compactor │ └────────────────────────────────────┘
│ PromQL Engine │
│ Rules Manager │
│ Notifier │
└──────┬─────────────┘
│ remote write (WriteRequest)
┌──────▼──────────────────────────────────────────────────────┐
│ Long-Term Store │
│ VictoriaMetrics / Thanos / Grafana Mimir │
│ │
│ Inverted index (Tag → MetricID) │
│ Merge-tree storage (sorted, compressed blocks) │
│ MetricsQL / PromQL compatible query layer │
└──────────────────────────────────────────────────────────────┘
│ query (remote read or Grafana datasource)
┌──────▼──────────────────────────────────────────────────────┐
│ Grafana / Alertmanager / API consumers │
└─────────────────────────────────────────────────────────────┘
Persistence:
| Store | Data | Technology |
|---|---|---|
| WAL | Recent samples pre-compaction | Append-only segmented file (128MB segments, 32KB pages) |
| Head block | Active 2h window in memory | memSeries + MemPostings index |
| Persisted blocks | Immutable 2h–72h blocks | mmap’d chunk files + inverted index |
| Long-term store | Years of retention | VictoriaMetrics merge-tree / S3-backed object storage |
| Alert state | Pending/Firing alerts | In-memory in Alertmanager (optionally persisted to disk) |
6. Deep Dives #
Deep Dive 1: TSDB Storage — Head Block and WAL #
The DB struct is the entry point to Prometheus’s TSDB:
// From tsdb/db.go
type DB struct {
dir string
head *Head // active 2-hour write window
blocks []*Block // immutable persisted blocks
compactor Compactor // background compaction
compactc chan struct{} // triggers compaction
}
The Head is the hot write path:
// From tsdb/head.go
type Head struct {
series *stripeSeries // sharded map: HeadSeriesRef → *memSeries
postings *index.MemPostings // inverted index: label → []HeadSeriesRef
wal *wlog.WL // write-ahead log
chunkDiskMapper *chunks.ChunkDiskMapper // mmap completed chunks to disk
minTime, maxTime atomic.Int64 // time range of this head
chunkRange atomic.Int64 // chunk cut interval (~2h)
}
Write path for a single sample:
1. Appender.Append(ref, labels, t, v) called by the scrape loop
2. If ref == 0 (first time this series is seen): look up or create memSeries in stripeSeries, update MemPostings
3. WAL record written: record.RefSample{Ref: ref, T: t, V: v} appended to the WAL
4. memSeries.app.Append(t, v) writes the sample into the active XORChunk
5. Commit() flushes the WAL page
WAL structure:
// From tsdb/wlog/wlog.go
type WL struct {
dir string
segment *Segment // active segment file (~128MB)
page *page // active 32KB page (in-memory buffer)
segmentSize int // DefaultSegmentSize = 128 * 1024 * 1024
}
const (
DefaultSegmentSize = 128 * 1024 * 1024 // 128MB per segment
pageSize = 32 * 1024 // 32KB pages
)
Records are written into 32KB pages. When a page is full, it is flushed to disk. When a segment exceeds 128MB, a new segment is opened. On crash recovery, Prometheus replays all WAL records since the last completed block to reconstruct the Head.
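The page and segment bookkeeping can be sketched as pure accounting, with no real I/O. This is a simplification of wlog.WL that assumes records fit within a page; the real WAL fragments oversized records across pages with first/middle/last flags:

```go
package main

import "fmt"

const (
	pageSize    = 32 * 1024
	segmentSize = 128 * 1024 * 1024
)

// walWriter models only the bookkeeping: records fill a 32KB in-memory
// page; full pages flush to the active segment; full segments rotate.
type walWriter struct {
	pageFill    int // bytes buffered in the in-memory page
	segmentFill int // bytes flushed into the active segment
	flushes     int // page flushes so far
	segments    int // segments opened (including the first)
}

func newWALWriter() *walWriter { return &walWriter{segments: 1} }

func (w *walWriter) append(recLen int) {
	if w.pageFill+recLen > pageSize { // page full: flush before appending
		w.flushPage()
	}
	w.pageFill += recLen
}

func (w *walWriter) flushPage() {
	if w.segmentFill+pageSize > segmentSize { // segment full: rotate
		w.segments++
		w.segmentFill = 0
	}
	w.segmentFill += pageSize
	w.flushes++
	w.pageFill = 0
}

func main() {
	w := newWALWriter()
	for i := 0; i < 10_000; i++ {
		w.append(100) // 10K sample records of ~100 bytes
	}
	fmt.Println(w.flushes, w.segments) // prints "30 1"
}
```

1MB of records fits in 30 flushed pages of one segment; rotation only matters at 128MB scale.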
WAL record types (from tsdb/record/):
- Series — label set + ref, written once per new series
- Samples — []RefSample{ref, t, v} — the bulk of WAL traffic
- Tombstones — deletion markers
- Exemplars — trace-linked exemplar data
Chunk encoding — XOR:
// From tsdb/chunkenc/xor.go
type XORChunk struct {
b bstream // bit stream
}
First sample: stored verbatim (64-bit timestamp + 64-bit float).
Subsequent timestamps: δδ = (t - t_prev) - (t_prev - t_prev_prev) — the delta-of-delta is typically 0 for regular scrape intervals, encoding in 1 bit. Subsequent values: XOR with the previous value; the XOR often has leading and trailing zeros (the mantissa is shared), encoded using a control-bit scheme from the Gorilla paper. Result: ~1.3 bytes/sample for typical metrics.
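The timestamp half of the scheme is easy to demonstrate. A sketch (not the bit-level tsdb/chunkenc encoder) showing that regular scrape intervals produce near-zero delta-of-deltas:

```go
package main

import "fmt"

// deltaOfDeltas computes second-order timestamp deltas. For a regular
// scrape interval these are almost all zero, which the real encoder
// stores in a single bit each.
func deltaOfDeltas(ts []int64) []int64 {
	dods := make([]int64, 0, len(ts))
	var prevDelta int64
	for i := 1; i < len(ts); i++ {
		delta := ts[i] - ts[i-1]
		if i == 1 {
			prevDelta = delta
			continue // the first delta is stored as a varint, not a dod
		}
		dods = append(dods, delta-prevDelta)
		prevDelta = delta
	}
	return dods
}

func main() {
	// 15s scrapes (ms timestamps) with one 1ms of jitter.
	ts := []int64{0, 15000, 30000, 45001, 60001}
	fmt.Println(deltaOfDeltas(ts)) // prints "[0 1 -1]"
}
```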
Chunk lifecycle:
Append to headChunk (memChunk, in-memory)
│ when nextAt is reached (~120 samples or ~30min):
▼
Cut new headChunk
│ old headChunk → ChunkDiskMapper.WriteChunk()
▼
mmappedChunk (on disk, accessible via mmap)
│ when Head chunkRange (~2h) expires:
▼
Compaction → immutable Block (index/ + chunks/ + meta.json)
Deep Dive 2: Inverted Index and Label Lookup #
Finding all series that match {job="prometheus", method="GET"} requires an inverted index — the same data structure as a search engine’s posting list.
Prometheus MemPostings:
// tsdb/index/postings.go (inferred from Head usage)
type MemPostings struct {
// map[label_name][label_value] → []HeadSeriesRef (sorted)
m map[string]map[string][]storage.SeriesRef
}
A query like {job="prometheus", method="GET"} becomes:
1. Fetch posting list for job="prometheus" → [ref1, ref3, ref7, ...]
2. Fetch posting list for method="GET" → [ref1, ref2, ref7, ...]
3. Intersect (merge sorted lists) → [ref1, ref7, ...]
4. For each ref, load memSeries.lset to verify the full label match (collision check)
For on-disk blocks, the same index is persisted in the index/ directory using a sorted, binary-searchable format.
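The intersection step is a standard sorted-list merge. A sketch (a hypothetical intersect helper, not the tsdb/index implementation):

```go
package main

import "fmt"

// intersect merges two sorted posting lists: only refs present in both
// lists (series matching both label pairs) survive.
func intersect(a, b []uint64) []uint64 {
	out := []uint64{}
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] < b[j]:
			i++ // advance the list with the smaller ref
		case a[i] > b[j]:
			j++
		default:
			out = append(out, a[i]) // present in both
			i++
			j++
		}
	}
	return out
}

func main() {
	job := []uint64{1, 3, 7, 9}    // postings for job="prometheus"
	method := []uint64{1, 2, 7, 8} // postings for method="GET"
	fmt.Println(intersect(job, method)) // prints "[1 7]"
}
```

The merge is O(len(a)+len(b)), which is why posting lists are kept sorted.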
VictoriaMetrics inverted index:
VictoriaMetrics uses a mergeset (LSM-tree style) for its inverted index:
// From VictoriaMetrics/lib/storage/index_db.go
const (
nsPrefixTagToMetricIDs = 1 // (tag_key, tag_value) → []MetricID
nsPrefixDateTagToMetricIDs = 6 // (date, tag_key, tag_value) → []MetricID
nsPrefixMetricIDToMetricName = 3 // MetricID → MetricName (reverse lookup)
nsPrefixDateMetricNameToTSID = 7 // (date, MetricName) → TSID
)
The nsPrefixDateTagToMetricIDs (per-day index) is key for efficient time-range queries: rather than scanning all historical MetricIDs for a label, it only looks up the per-day posting lists that overlap the query range. This dramatically reduces memory for indexes with high churn (short-lived containers, ephemeral jobs).
The storage also maintains a hot tsidCache (MetricName → TSID) and metricNameCache (MetricID → MetricName) to avoid index lookups on every ingested sample for known series.
Deep Dive 3: Scrape Engine #
The scrape engine is the pull-model ingest path:
// From scrape/scrape.go
type scrapePool struct {
appendable storage.Appendable // the TSDB
client *http.Client // per-job HTTP client (TLS, auth)
loops map[uint64]loop // target hash → scrapeLoop
activeTargets map[uint64]*Target // currently scraped targets
offsetSeed uint64 // jitter seed for staggering scrapes
config *config.ScrapeConfig
buffers *pool.Pool // reusable byte slice pool for response bodies
}
Each target runs a scrapeLoop in its own goroutine. The offsetSeed staggers scrape start times across targets within a job to prevent the “thundering herd” — all targets hitting at the same second. The jitter is computed as:
offset = hash(target_labels) % scrape_interval
So targets are spread uniformly across the interval rather than synchronized.
Scrape loop lifecycle:
timer fires (every scrape_interval)
│
▼
HTTP GET /metrics (with TLS, basic auth, OAuth2 as configured)
│
▼
Parse response body (OpenMetrics or Prometheus text format)
│ model.TextToMetricFamilies() or textparse.New()
│
▼
Apply relabeling rules (target-level and metric-level)
│ drop series matching drop_labels
│ rewrite __address__ to target address
│
▼
Appender.Append() for each (labels, timestamp, value)
│ timestamp is the scrape start time (not the value in the exposition)
│ ScrapeTimestampTolerance = 2ms (allows minor clock skew alignment)
│
▼
Commit()
│
▼
Report scrape metrics:
scrape_duration_seconds, scrape_samples_scraped,
scrape_samples_post_metric_relabeling, scrape_series_added
Staleness markers: When a target disappears (no longer returns a series), Prometheus writes a special stale NaN value at the next scrape. The PromQL engine interprets staleness markers as “this series ended here” and doesn’t extrapolate across the gap. Without this, graphs would show flat lines across target restarts.
Deep Dive 4: PromQL Query Engine #
The PromQL engine implements a two-phase evaluation model:
// From promql/engine.go
type Engine struct {
timeout time.Duration // max query runtime (default: 2min)
maxSamplesPerQuery int // OOM guard (default: 50M)
lookbackDelta time.Duration // staleness window (default: 5min)
activeQueryTracker QueryTracker // concurrent query limit
}
// Two query types:
func (ng *Engine) NewInstantQuery(ctx, q, opts, qs string, ts time.Time) (Query, error)
func (ng *Engine) NewRangeQuery(ctx, q, opts, qs string, start, end time.Time, step time.Duration) (Query, error)
Evaluation flow:
PromQL string (e.g., "rate(http_requests_total[5m])")
│
▼
parser.ParseExpr() → AST
│ nodes: VectorSelector, MatrixSelector, Call, BinaryExpr, AggregateExpr
│
▼
Engine.exec()
│ for range query: evaluate at each step (start, start+step, ..., end)
│
▼
evaluateAtTimestamp(t)
│ VectorSelector: query storage for all matching series at t
│ → uses lookbackDelta (5min default): find last sample in [t-δ, t]
│ MatrixSelector: query samples in [t-range, t] for each series
│ → used by rate(), irate(), increase(), ...
│ Call: apply function (rate, sum, histogram_quantile, ...)
│ AggregateExpr: group by labels, apply aggregation
│
▼
Result: Vector (instant) or Matrix (range)
lookbackDelta (5 minutes): For an instant query at time t, if a series had its last sample at t-3min, the engine still returns that value (age is within 5 minutes). This handles scrape jitter — a 15s scrape interval means consecutive scrapes are at t, t+15s, t+30s, so the gap is bounded. The 5-minute default covers target restarts and missed scrapes.
rate() implementation:
rate(http_requests_total[5m]) at time t:
  samples = all samples in [t-5min, t] for each matching series
  increase = sum of sample-to-sample deltas; when a sample drops below
             its predecessor (counter reset), the new value counts in full
  rate = increase / (last_timestamp - first_timestamp)
  // the real engine also extrapolates the result to the window boundaries
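A runnable sketch of the same logic, omitting the boundary extrapolation the real engine performs:

```go
package main

import "fmt"

// simpleRate sums per-pair increases over the window. A decrease is
// treated as a counter reset: the counter is assumed to have restarted
// from zero, so the post-reset value counts in full.
func simpleRate(ts []int64, vs []float64) float64 {
	if len(vs) < 2 {
		return 0
	}
	var increase float64
	for i := 1; i < len(vs); i++ {
		if vs[i] >= vs[i-1] {
			increase += vs[i] - vs[i-1]
		} else {
			increase += vs[i] // reset detected
		}
	}
	seconds := float64(ts[len(ts)-1]-ts[0]) / 1000 // millisecond timestamps
	return increase / seconds
}

func main() {
	ts := []int64{0, 15000, 30000, 45000, 60000}
	vs := []float64{0, 10, 20, 5, 15} // reset between 20 and 5
	fmt.Println(simpleRate(ts, vs))   // (10+10+5+10)/60 ≈ 0.583/sec
}
```

Without reset handling the window's increase would be 15 instead of 35, badly under-reporting the rate across the restart.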
maxSamplesPerQuery: The engine counts the total number of samples loaded across all series for a query. If it exceeds the limit (default 50M), the query is aborted with ErrTooManySamples. This prevents a single high-cardinality query from OOM-killing the process.
Deep Dive 5: Cardinality — The Hard Problem #
Cardinality is the count of unique time series (unique label sets). It is the primary operational risk of a metrics system.
Why cardinality explodes:
Each unique combination of label values creates a distinct time series:
http_requests_total{method="GET", path="/api/v1/query", status="200", instance="pod-1"}
http_requests_total{method="GET", path="/api/v1/query", status="200", instance="pod-2"}
...
http_requests_total{method="GET", path="/api/v1/query", status="200", instance="pod-N"}
If instance has 10,000 pods, this single metric has 10,000 series. Adding another label with 100 values multiplies to 1,000,000 series. Common cardinality bombs:
- User IDs, session IDs, request IDs as labels (unbounded)
- Full URLs or SQL queries as labels (unbounded)
- Kubernetes pod names without aggregation (proportional to pod count × churn)
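The multiplicative blow-up is worth making concrete. A hypothetical seriesCount helper:

```go
package main

import "fmt"

// seriesCount applies the multiplicative cardinality rule: the series
// count for one metric name is the product of distinct values per label.
func seriesCount(valuesPerLabel map[string]int) int {
	total := 1
	for _, n := range valuesPerLabel {
		total *= n
	}
	return total
}

func main() {
	fmt.Println(seriesCount(map[string]int{
		"method":   4,
		"status":   10,
		"instance": 10_000, // 10K pods
	})) // prints "400000": one metric name, 400K series
}
```

Adding a single unbounded label (user ID, full URL) makes this product unbounded, which is why such labels are the classic cardinality bomb.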
Cardinality in the Head:
Every unique label set requires:
- One memSeries in stripeSeries (~400 bytes in memory)
- One entry in MemPostings per label name=value pair
- One WAL Series record (written once per new series)
At 10M active series: ~4GB RAM just for the series index, before any sample data.
Cardinality cache in VictoriaMetrics:
// From VictoriaMetrics/lib/storage/storage.go
type Storage struct {
tsidCache *workingsetcache.Cache // MetricName → TSID (hot cache)
metricNameCache *workingsetcache.Cache // MetricID → MetricName
}
The workingsetcache is a two-layer cache (active + previous generation). When a new series arrives, it is checked against the cache first — the index is only hit if the series is truly new. This reduces indexDB write amplification under steady-state (few new series per scrape).
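The two-generation idea can be sketched in a few lines. This shows the pattern only, not VictoriaMetrics's workingsetcache API:

```go
package main

import "fmt"

// wsCache keeps two generations. Lookups check the current map, then
// the previous one (promoting on hit); Rotate discards everything not
// touched since the last rotation, bounding memory to the working set.
type wsCache struct {
	curr, prev map[string]uint64
}

func newWSCache() *wsCache {
	return &wsCache{curr: map[string]uint64{}, prev: map[string]uint64{}}
}

func (c *wsCache) Put(k string, v uint64) { c.curr[k] = v }

func (c *wsCache) Get(k string) (uint64, bool) {
	if v, ok := c.curr[k]; ok {
		return v, true
	}
	if v, ok := c.prev[k]; ok {
		c.curr[k] = v // promote into the active generation
		return v, true
	}
	return 0, false
}

// Rotate ages the cache: current becomes previous, previous is dropped.
func (c *wsCache) Rotate() {
	c.prev = c.curr
	c.curr = map[string]uint64{}
}

func main() {
	c := newWSCache()
	c.Put(`http_requests_total{job="api"}`, 42) // MetricName → TSID
	c.Rotate()
	v, ok := c.Get(`http_requests_total{job="api"}`) // hit in prev, promoted
	fmt.Println(v, ok)                               // prints "42 true"
	c.Rotate()
	_, ok = c.Get(`http_requests_total{job="api"}`)
	fmt.Println(ok) // prints "true": survived via promotion
}
```

Entries that are never looked up between two rotations vanish, which is exactly the behavior wanted for churning series.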
High churn rate:
Kubernetes workloads create and destroy pods continuously. Each new pod = new instance label value = new time series. Dead series remain in the Head until compaction, then in persisted blocks until retention expires. VictoriaMetrics’s per-day index (nsPrefixDateTagToMetricIDs) bounds the memory used for these stale series — the index for old dates can be evicted once outside the query window.
Mitigation strategies:
- Relabeling: drop high-cardinality labels before ingestion (metric_relabel_configs)
- Recording rules: pre-aggregate high-cardinality metrics into lower-cardinality summaries
- Cardinality limits: per-scrape sample_limit and label_limit in Prometheus; -maxLabelsPerTimeseries in VictoriaMetrics
- Alerting on cardinality: topk(10, count by (__name__)({__name__=~".+"})) identifies the worst offenders
Deep Dive 6: OTel Collector Pipeline #
The OTel Collector implements a receiver → processor → exporter pipeline:
// From opentelemetry-collector/processor/batchprocessor/batch_processor.go
type batchProcessor[T any] struct {
timeout time.Duration // max time before sending (default: 200ms)
sendBatchSize int // target batch size (default: 8192 data points)
batcher batcher[T] // singletonBatcher or multiBatcher (for metadata routing)
}
type shard[T any] struct {
exportCtx context.Context // carries metadata (tenant ID, etc.)
timer *time.Timer // fires when timeout elapses
newItem chan T // receives data from producers
batch batch[T] // accumulated data
}
The batch processor sends when either sendBatchSize is reached or timeout elapses — whichever comes first. This is the same dual-trigger pattern as Kafka’s linger.ms + batch.size.
multiBatcher: When MetadataKeys are configured (e.g., X-Tenant-ID), the batch processor shards batches by metadata key combination. Each unique combination gets its own shard. This enables multi-tenant pipelines where different tenants must be routed to different backends.
Pipeline composition:
otlpreceiver (gRPC :4317, HTTP :4318)
│ consumer.Metrics
▼
batchprocessor
│ accumulate until 8192 data points or 200ms
▼
transformprocessor (optional: rename attributes, set resource attributes)
│
▼
prometheusremotewriteexporter
│ convert pmetric.Metrics → []prompb.TimeSeries
│ POST to /api/v1/write (snappy-compressed protobuf)
│ retry with exponential backoff on 5xx
▼
Prometheus / VictoriaMetrics
OTLP → Prometheus conversion semantics:
OTLP Sum (monotonic=true) → Prometheus counter
OTLP Gauge → Prometheus gauge
OTLP Histogram → Prometheus _bucket, _count, _sum series
OTLP ExponentialHistogram → Prometheus native histogram
Resource attributes (e.g., host.name, cloud.region) become Prometheus labels. The job and instance labels are synthesized from service.name and host.name resource attributes by convention.
Deep Dive 7: Alerting Pipeline #
The alerting pipeline runs inside the Prometheus Rules Manager:
// From rules/group.go
type Group struct {
name string
interval time.Duration // evaluation frequency (default: 1m)
rules []Rule // AlertingRule and RecordingRule
opts *ManagerOptions
evaluationTime time.Duration // time taken for last eval (monitored)
}
// From rules/alerting.go
type AlertingRule struct {
vector parser.Expr // PromQL expression
holdDuration time.Duration // the "for:" clause — must stay true this long
active map[uint64]*Alert
}
Alert state machine:
[no data or expression = false]
│
▼
Inactive (not in active map)
│ expression becomes true
▼
Pending (in active map, ActiveAt = now)
│ continuously true for holdDuration
▼
Firing (sent to Alertmanager)
│ expression becomes false
▼
[removed from active map after keepFiringFor]
The holdDuration (the for: clause in alerting rules) prevents flapping alerts from firing. A spike that lasts 30 seconds won’t fire an alert with for: 5m. The rule must evaluate to true at every evaluation within the hold window.
Evaluation:
// Rules Manager calls at each interval tick:
QueryFunc(ctx, rule.vector.String(), evalTimestamp) → promql.Vector
// Each element of the returned Vector is a potential alert:
// Labels: the label set of the series (becomes alert labels)
// Value: the sample value (available as $value in annotations)
Recording rules write the result back as new time series:
// From rules/recording.go
type RecordingRule struct {
name string // the metric name to write results as
vector parser.Expr // the PromQL expression to evaluate
labels labels.Labels // extra labels to attach
}
Pre-aggregation via recording rules is the primary mitigation for slow dashboard queries over high-cardinality data. A rule that pre-computes sum(rate(http_requests_total[5m])) by (job) runs once per minute and stores the result as a low-cardinality time series, making dashboard queries instant.
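A sketch of what the by (job) aggregation collapses (hypothetical preAggregate helper; the input rates stand in for per-instance rate() results):

```go
package main

import "fmt"

// preAggregate collapses many per-instance rates into one
// low-cardinality series per job label value, as a recording rule like
//   sum(rate(http_requests_total[5m])) by (job)
// would store.
func preAggregate(ratesByJob map[string][]float64) map[string]float64 {
	out := map[string]float64{}
	for job, rates := range ratesByJob {
		for _, r := range rates {
			out[job] += r
		}
	}
	return out
}

func main() {
	// Thousands of instance-level series reduce to one per job.
	perSeries := map[string][]float64{
		"api":    {12.5, 9.25, 14.25}, // three instances of job="api"
		"worker": {3.0, 2.0},
	}
	fmt.Println(preAggregate(perSeries)) // prints "map[api:36 worker:5]"
}
```

Dashboards then query the pre-computed series directly instead of re-running the expensive rate-and-sum over every instance on each refresh.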
Alertmanager integration:
The Notifier in Prometheus batches Alert objects and posts them to Alertmanager via the /api/v2/alerts endpoint. Alertmanager then:
- Groups alerts with the same label set into a single notification (configurable via group_by)
- Deduplicates identical alerts from multiple Prometheus instances (HA setup)
- Inhibits lower-priority alerts when a higher-priority alert is firing (e.g., suppress host alerts when the whole datacenter is down)
- Silences alerts matching a label selector for a specified duration
- Routes to receivers: PagerDuty, Slack, email, OpsGenie, webhook
Deep Dive 8: Long-Term Storage — VictoriaMetrics vs Thanos #
Local Prometheus retention is bounded by disk (~15 days typically). Two patterns for long-term storage:
Pattern 1: Thanos (sidecar + object storage)
Prometheus (with Thanos sidecar)
│ remote write → Thanos Receiver
│ OR: sidecar uploads completed blocks to S3/GCS
▼
Thanos Store Gateway (reads from S3, caches index)
│
Thanos Querier (fan-out PromQL queries across Store Gateways + live Prometheus)
Thanos uses Prometheus’s own block format on object storage. Query fan-out merges results from live Prometheus instances (recent data) and Store Gateways (historical data) transparently.
Pattern 2: VictoriaMetrics (remote write target)
Prometheus configures VictoriaMetrics as a remote write endpoint:
remote_write:
- url: http://victoria-metrics:8428/api/v1/write
VictoriaMetrics stores data in its own merge-tree format:
// From VictoriaMetrics/lib/storage/storage.go
type Storage struct {
// Two partitions: small (recent, hot) and big (older, compressed)
// Merge happens in background when small partition exceeds threshold
tsidCache *workingsetcache.Cache // hot-path series lookup
metricNameCache *workingsetcache.Cache // reverse lookup
}
VM’s merge-tree columnar storage achieves ~4–7x better compression than Prometheus’s block format for typical workloads, because it sorts and delta-encodes entire columns across many series simultaneously, rather than per-series XOR encoding. The TSID sort order (MetricGroupID → JobID → InstanceID → MetricID) places related series adjacent on disk, improving scan locality.
Trade-off summary:
| Dimension | Prometheus local | Thanos | VictoriaMetrics |
|---|---|---|---|
| Retention | ~15 days (disk bound) | Years (object storage) | Years (local or cluster) |
| Query federation | No | Yes (Thanos Querier) | Yes (vmselect) |
| Operational complexity | Low | High (many components) | Medium |
| Compression | ~1.3 bytes/sample | Same as Prometheus | ~0.4 bytes/sample |
| PromQL compatibility | Native | Native | MetricsQL (superset) |
| Remote write protocol | Sends | Receives (Receiver) | Receives |