Metrics Monitoring: System Design #
A distributed metrics monitoring platform in the style of Datadog, Grafana Cloud, or AWS CloudWatch. Covers the scrape/push ingestion duality, TSDB storage internals, PromQL query execution, alerting, and long-term retention. Grounded in the Prometheus TSDB and scrape engine, VictoriaMetrics storage internals, and the OpenTelemetry Collector pipeline.
1. Requirements #
Functional Requirements #
- Metric ingestion — pull (scrape) and push (OTLP, remote write) models; support for counters, gauges, histograms, summaries
- Storage — store time series with configurable retention (15 days local; years in long-term store)
- Querying — ad-hoc and dashboard queries over time ranges; instant and range queries
- Alerting — evaluate PromQL expressions on a schedule; fire alerts when conditions hold for a configurable duration
- Dashboards — read-path API for Grafana-compatible visualization
- Label-based routing — relabeling rules to transform, drop, or re-route series during ingestion
- Federation — aggregate metrics across multiple Prometheus instances
- Remote write / read — export metrics to long-term storage backends (Thanos, VictoriaMetrics, Cortex)
Non-Functional Requirements #
- Throughput — ingest up to 1M samples/sec per node; hundreds of millions of active series across a cluster
- Write latency — P99 append < 10ms; WAL flush < 1ms
- Query latency — P99 instant query < 500ms; range query over 24h < 5s
- Availability — 99.9% for ingest (writes are local, failures are isolated); 99.9% for query
- Cardinality — support millions of unique label combinations per metric name without OOM
- Retention — local TSDB: 15 days; long-term: years (object storage or dedicated TSDB)
- Durability — the WAL bounds crash loss to the last unflushed page; without it, the entire in-memory head block (up to 2 hours of data) would be lost on crash
Capacity Estimation #
| Metric | Estimate |
|---|---|
| Active time series per node | 2–10M |
| Ingest rate per node | ~500K–1M samples/sec |
| Sample size on disk (compressed) | ~1.3 bytes/sample (XOR + delta encoding) |
| Retention window | 15 days local |
| Storage per node | 10M series × (86,400 / 15) samples/day × 15 days × 1.3B ≈ 1.1TB |
| Label cardinality | Up to 100M unique label value combinations across all series |
| Rule evaluation | 1K–100K rules evaluated every 15–60 seconds |
The XOR encoding figure comes directly from the Gorilla paper (Pelkonen et al., 2015) and is implemented in Prometheus’s XORChunk.
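As a sanity check on the storage-per-node row, the arithmetic can be sketched in a few lines of Go. The inputs are the table's assumptions (scrape interval, retention, bytes/sample), not measured values:

```go
package main

import "fmt"

// storageBytes computes the back-of-envelope disk footprint for one node.
// bytesPerSample is the Gorilla-average compression figure (~1.3B); real
// ratios vary by workload.
func storageBytes(series int, scrapeIntervalSec, retentionDays, bytesPerSample float64) float64 {
	samplesPerSeriesPerDay := 86400.0 / scrapeIntervalSec
	return float64(series) * samplesPerSeriesPerDay * retentionDays * bytesPerSample
}

func main() {
	b := storageBytes(10_000_000, 15, 15, 1.3)
	fmt.Printf("%.2f TB\n", b/1e12) // prints "1.12 TB"
}
```

The product lands near 1.1TB; halving the scrape rate or the retention window scales it linearly.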
2. Core Entities #
TimeSeries #
A time series is uniquely identified by its full label set — metric name plus all key=value labels:
http_requests_total{method="GET", handler="/api/v1/query", job="prometheus", instance="localhost:9090"}
The metric name is stored as the special label __name__, so it participates in series identity like any other label: two series with the same labels but different metric names are distinct series.
Sample #
A (timestamp, value) pair attached to a time series. Prometheus uses millisecond timestamps:
// From storage/interface.go
Append(ref SeriesRef, l labels.Labels, t int64, v float64) (SeriesRef, error)
// t is a millisecond timestamp; v is the float64 sample value
The Appender interface handles four sample kinds: plain float64 samples (the original), native integer histograms, native float histograms, and exemplars — the latter via separate methods (AppendHistogram, AppendExemplar).
MetricName (VictoriaMetrics) #
VictoriaMetrics separates the metric name group from the label tags:
// From VictoriaMetrics/lib/storage/metric_name.go
type MetricName struct {
MetricGroup []byte // the __name__ label
Tags []Tag // sorted key=value pairs; must be sorted for canonical comparison
}
The canonical form has tags sorted by key — two MetricName values with the same labels in different order are deduplicated into one series.
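A minimal Go sketch of this canonicalization, simplified to strings (the real MetricName uses []byte fields); the point is that tag order no longer matters once sorted:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Tag mirrors the key=value pairs in MetricName (simplified sketch).
type Tag struct{ Key, Value string }

// canonical sorts tags by key and renders a stable string, so two
// MetricName values with the same labels in any order compare equal.
func canonical(metricGroup string, tags []Tag) string {
	sort.Slice(tags, func(i, j int) bool { return tags[i].Key < tags[j].Key })
	parts := make([]string, 0, len(tags))
	for _, t := range tags {
		parts = append(parts, t.Key+"="+t.Value)
	}
	return metricGroup + "{" + strings.Join(parts, ",") + "}"
}

func main() {
	a := canonical("http_requests_total", []Tag{{"method", "GET"}, {"job", "api"}})
	b := canonical("http_requests_total", []Tag{{"job", "api"}, {"method", "GET"}})
	fmt.Println(a == b) // true: both dedupe to one series
}
```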
TSID (VictoriaMetrics) #
The internal identifier for a time series in VictoriaMetrics:
// From VictoriaMetrics/lib/storage/tsid.go
type TSID struct {
MetricGroupID uint64 // hash of metric name (__name__)
JobID uint32 // hash of job label
InstanceID uint32 // hash of instance label
MetricID uint64 // unique ID for the full label set (primary key)
}
MetricID is the authoritative identifier — the other fields are optimizations for grouping related series on disk (same metric name → nearby MetricIDs → better compression). The inverted index maps (label key, label value) → []MetricID to enable label-based queries.
Chunk #
A compressed block of samples for a single time series. The Prometheus XORChunk uses Gorilla XOR encoding:
// From tsdb/chunkenc/xor.go
type XORChunk struct {
b bstream // bit stream
}
// Stores the first timestamp and value verbatim;
// subsequent timestamps as delta-of-deltas (varint),
// subsequent values as XOR with the previous value (leading/trailing zero optimization).
A chunk holds ~120 samples by default (SamplesPerChunk), covering ~30 minutes at the standard 15-second scrape interval. Compression ratio: ~1.3 bytes/sample for float metrics.
memSeries #
The in-memory representation of one time series in the Prometheus Head:
// From tsdb/head.go
type memSeries struct {
ref chunks.HeadSeriesRef // stable ID within the head block
lset labels.Labels // full label set
mmappedChunks []*mmappedChunk // completed chunks mmap'd to disk
headChunks *memChunk // active in-memory chunk (linked list, newest first)
firstChunkID chunks.HeadChunkID
app chunkenc.Appender // active appender for the head chunk
nextAt int64 // timestamp at which to cut a new chunk
lastValue float64 // last written value (for duplicate detection)
}
mmappedChunks are completed chunks memory-mapped from the chunk disk mapper — they exist on disk but are accessible as if in memory. Only headChunks is being actively written to.
Alert #
// From rules/alerting.go
type AlertingRule struct {
name string
vector parser.Expr // the PromQL expression to evaluate
holdDuration time.Duration // must be true for this long before → Firing (the "for:" clause)
keepFiringFor time.Duration
labels labels.Labels // extra labels attached to alert
annotations labels.Labels // human-readable context (summary, description, runbook_url)
active map[uint64]*Alert // fingerprint → Alert; currently Pending or Firing
}
3. API / System Interface #
OpenMetrics Exposition Format (scrape target → Prometheus) #
The text format that instrumented services expose at /metrics:
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",handler="/api"} 1234 1609459200000
http_requests_total{method="POST",handler="/api"} 56
# HELP process_resident_memory_bytes Resident memory size in bytes
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 1.234e+08
The OpenMetrics specification (a CNCF project) formalizes this format with additional types (unknown, stateset, gaugehistogram, info) and content negotiation via Accept: application/openmetrics-text.
OTLP (push: instrumented service → OTel Collector → Prometheus remote write) #
The OpenTelemetry Protocol wire format. Two transports:
- HTTP: POST /v1/metrics (HTTP/1.1 or HTTP/2, protobuf or JSON body)
- gRPC: Export(ExportMetricsServiceRequest) (HTTP/2)
The OTel Collector’s OTLP receiver handles both:
// From opentelemetry-collector/receiver/otlpreceiver/otlp.go
type otlpReceiver struct {
cfg *Config
serverGRPC *grpc.Server // gRPC transport
serverHTTP *http.Server // HTTP/protobuf transport
nextMetrics consumer.Metrics // downstream: processor or exporter
nextTraces consumer.Traces
nextLogs consumer.Logs
}
The OTLP data model is hierarchical: ResourceMetrics → ScopeMetrics → Metric → DataPoint. Each Metric has a type (Gauge, Sum/Counter, Histogram, ExponentialHistogram, Summary). This hierarchy carries resource attributes (host, cloud provider, region) and instrumentation scope (library name, version) alongside the data — richer than the Prometheus flat label model.
Prometheus HTTP API (query) #
GET /api/v1/query?query=<promql>&time=<timestamp> # instant query
GET /api/v1/query_range?query=<promql>&start=&end=&step= # range query
GET /api/v1/series?match[]=<selector> # series metadata
GET /api/v1/labels # label names
GET /api/v1/label/<name>/values # label values
GET /api/v1/alerts # active alerts
Prometheus Remote Write (long-term export) #
Prometheus sends batches of samples to a remote write endpoint via HTTP POST. The payload is a snappy-compressed protobuf WriteRequest:
message WriteRequest {
repeated TimeSeries timeseries = 1;
}
message TimeSeries {
repeated Label labels = 1;
repeated Sample samples = 2;
}
message Sample {
double value = 1;
int64 timestamp = 2; // milliseconds
}
Remote write is the integration point with VictoriaMetrics, Thanos, Cortex, and Grafana Mimir — they all expose a /api/v1/write endpoint that speaks this protocol.
Alertmanager API #
Prometheus posts firing alerts to Alertmanager:
POST /api/v2/alerts [{ "labels": {...}, "annotations": {...}, "startsAt": "...", "generatorURL": "..." }]
Alertmanager handles deduplication, grouping, inhibition, silencing, and routing to receivers (PagerDuty, Slack, email, webhook).
Standards Reference #
| Standard | Body | Scope |
|---|---|---|
| OpenMetrics exposition format | CNCF (IETF draft) | Text wire format for scrape targets |
| OTLP protobuf spec | CNCF OpenTelemetry | Push protocol for metrics, traces, logs |
| Prometheus Remote Write 2.0 | CNCF Prometheus | Export protocol to long-term stores |
| PromQL grammar | Prometheus | Query language spec (EBNF in docs) |
| Alertmanager API v2 | Prometheus | Alert posting and management |
| OpenTelemetry Semantic Conventions | CNCF | Standard label names for HTTP, DB, cloud resources |
4. Data Flow #
Pull (scrape) path #
Scrape target (Go/Java/Python service)
│ GET /metrics (HTTP)
│ response: OpenMetrics text
▼
Prometheus scrapePool / scrapeLoop
│ parse OpenMetrics text
│ apply relabeling rules
│ call Appender.Append() per sample
▼
TSDB Head
│ WAL.Log(record) ← durability
│ memSeries.headChunks.app.Append(t, v) ← in-memory
▼
├── memSeries (in-memory, hot)
└── WAL segments (on disk, recovery)
Push (OTLP) path #
Instrumented service (SDK)
│ ExportMetricsServiceRequest (gRPC/HTTP)
▼
OTel Collector: OTLP receiver
│ decode protobuf → pmetric.Metrics
▼
OTel Collector: batch processor
│ accumulate until sendBatchSize or timeout
▼
OTel Collector: Prometheus remote write exporter
│ convert pmetric → WriteRequest proto
│ POST /api/v1/write (snappy-compressed)
▼
Prometheus / VictoriaMetrics remote write handler
│ decode → samples
│ Appender.Append() per sample
▼
TSDB Head (same path as pull)
Query path #
Grafana / client
│ GET /api/v1/query_range?query=...
▼
PromQL Engine
│ parse → AST
│ plan evaluation steps
▼
Storage Queryable
│ select matching series (label matchers → postings)
│ iterate chunks in time range
▼
├── Head (recent data, in-memory chunks)
└── Blocks (persisted blocks, mmap'd from disk)
│
└── [long-term store via remote read if needed]
▼
PromQL Engine: evaluate AST over samples
│ functions: rate(), sum(), histogram_quantile(), ...
▼
Result (Matrix / Vector / Scalar)
Compaction path (background) #
Head block (2h of data, in-memory + WAL)
│ compactc channel triggers compaction
▼
TSDB Compactor
│ flush head → immutable Block on disk
│ Block: index/ + chunks/ + meta.json + tombstones
▼
Blocks on disk (2h, 6h, 24h, 72h windows — logarithmic compaction)
│ remote write exporter pushes samples upstream
▼
Long-term store (VictoriaMetrics / Thanos / S3-backed Mimir)
5. High Level Design #
┌──────────────────────────────────────────────────────────────┐
│ Instrumented Services │
│ /metrics (OpenMetrics) OTLP push (gRPC / HTTP) │
└──────────┬──────────────────────────┬───────────────────────┘
│ scrape (pull) │ push
┌──────────▼────────┐ ┌───────────▼────────────────────────┐
│ Prometheus │ │ OpenTelemetry Collector │
│ │ │ │
│ Service Discovery │ │ OTLP receiver │
│ scrapePool/Loop │ │ batch processor │
│ relabeling │ │ transform processor │
│ TSDB Head │ │ Prometheus remote write exporter │
│ WAL │◄───┤ (WriteRequest → /api/v1/write) │
│ Compactor │ └────────────────────────────────────┘
│ PromQL Engine │
│ Rules Manager │
│ Notifier │
└──────┬─────────────┘
│ remote write (WriteRequest)
┌──────▼──────────────────────────────────────────────────────┐
│ Long-Term Store │
│ VictoriaMetrics / Thanos / Grafana Mimir │
│ │
│ Inverted index (Tag → MetricID) │
│ Merge-tree storage (sorted, compressed blocks) │
│ MetricsQL / PromQL compatible query layer │
└──────────────────────────────────────────────────────────────┘
│ query (remote read or Grafana datasource)
┌──────▼──────────────────────────────────────────────────────┐
│ Grafana / Alertmanager / API consumers │
└─────────────────────────────────────────────────────────────┘
Persistence:
| Store | Data | Technology |
|---|---|---|
| WAL | Recent samples pre-compaction | Append-only segmented file (128MB segments, 32KB pages) |
| Head block | Active 2h window in memory | memSeries + MemPostings index |
| Persisted blocks | Immutable 2h–72h blocks | mmap’d chunk files + inverted index |
| Long-term store | Years of retention | VictoriaMetrics merge-tree / S3-backed object storage |
| Alert state | Pending/Firing alerts | In-memory in Alertmanager (optionally persisted to disk) |
6. Deep Dives #
Deep Dive 1: TSDB Storage — Head Block and WAL #
The DB struct is the entry point to Prometheus’s TSDB:
// From tsdb/db.go
type DB struct {
dir string
head *Head // active 2-hour write window
blocks []*Block // immutable persisted blocks
compactor Compactor // background compaction
compactc chan struct{} // triggers compaction
}
The Head is the hot write path:
// From tsdb/head.go
type Head struct {
series *stripeSeries // sharded map: HeadSeriesRef → *memSeries
postings *index.MemPostings // inverted index: label → []HeadSeriesRef
wal *wlog.WL // write-ahead log
chunkDiskMapper *chunks.ChunkDiskMapper // mmap completed chunks to disk
minTime, maxTime atomic.Int64 // time range of this head
chunkRange atomic.Int64 // chunk cut interval (~2h)
}
Write path for a single sample:
1. Appender.Append(ref, labels, t, v) called by the scrape loop
2. If ref == 0 (first time this series is seen): look up or create memSeries in stripeSeries, update MemPostings
3. WAL record written: record.RefSample{Ref: ref, T: t, V: v} appended to the WAL
4. memSeries.app.Append(t, v) writes the sample into the active XORChunk
5. Commit() flushes the WAL page
WAL structure:
// From tsdb/wlog/wlog.go
type WL struct {
dir string
segment *Segment // active segment file (~128MB)
page *page // active 32KB page (in-memory buffer)
segmentSize int // DefaultSegmentSize = 128 * 1024 * 1024
}
const (
DefaultSegmentSize = 128 * 1024 * 1024 // 128MB per segment
pageSize = 32 * 1024 // 32KB pages
)
Records are written into 32KB pages. When a page is full, it is flushed to disk. When a segment exceeds 128MB, a new segment is opened. On crash recovery, Prometheus replays all WAL records since the last completed block to reconstruct the Head.
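The page and segment bookkeeping can be sketched as pure accounting, with no real I/O. This is a simplification of wlog.WL that assumes records fit within a page; the real WAL fragments oversized records across pages with first/middle/last flags:

```go
package main

import "fmt"

const (
	pageSize    = 32 * 1024
	segmentSize = 128 * 1024 * 1024
)

// walWriter models only the bookkeeping: records fill a 32KB in-memory
// page; full pages flush to the active segment; full segments rotate.
type walWriter struct {
	pageFill    int // bytes buffered in the in-memory page
	segmentFill int // bytes flushed into the active segment
	flushes     int // page flushes so far
	segments    int // segments opened (including the first)
}

func newWALWriter() *walWriter { return &walWriter{segments: 1} }

func (w *walWriter) append(recLen int) {
	if w.pageFill+recLen > pageSize { // page full: flush before appending
		w.flushPage()
	}
	w.pageFill += recLen
}

func (w *walWriter) flushPage() {
	if w.segmentFill+pageSize > segmentSize { // segment full: rotate
		w.segments++
		w.segmentFill = 0
	}
	w.segmentFill += pageSize
	w.flushes++
	w.pageFill = 0
}

func main() {
	w := newWALWriter()
	for i := 0; i < 10_000; i++ {
		w.append(100) // 10K sample records of ~100 bytes
	}
	fmt.Println(w.flushes, w.segments) // prints "30 1"
}
```

1MB of records fits in 30 flushed pages of one segment; rotation only matters at 128MB scale.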
WAL record types (from tsdb/record/):
- Series — label set + ref, written once per new series
- Samples — []RefSample{ref, t, v} — the bulk of WAL traffic
- Tombstones — deletion markers
- Exemplars — trace-linked exemplar data
Chunk encoding — XOR:
// From tsdb/chunkenc/xor.go
type XORChunk struct {
b bstream // bit stream
}
First sample: stored verbatim (64-bit timestamp + 64-bit float).
Subsequent timestamps: δδ = (t - t_prev) - (t_prev - t_prev_prev) — the delta-of-delta is typically 0 for regular scrape intervals, encoding in 1 bit. Subsequent values: XOR with the previous value; the XOR often has leading and trailing zeros (the mantissa is shared), encoded using a control-bit scheme from the Gorilla paper. Result: ~1.3 bytes/sample for typical metrics.
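The timestamp half of the scheme is easy to demonstrate. A sketch (not the bit-level tsdb/chunkenc encoder) showing that regular scrape intervals produce near-zero delta-of-deltas:

```go
package main

import "fmt"

// deltaOfDeltas computes second-order timestamp deltas. For a regular
// scrape interval these are almost all zero, which the real encoder
// stores in a single bit each.
func deltaOfDeltas(ts []int64) []int64 {
	dods := make([]int64, 0, len(ts))
	var prevDelta int64
	for i := 1; i < len(ts); i++ {
		delta := ts[i] - ts[i-1]
		if i == 1 {
			prevDelta = delta
			continue // the first delta is stored as a varint, not a dod
		}
		dods = append(dods, delta-prevDelta)
		prevDelta = delta
	}
	return dods
}

func main() {
	// 15s scrapes (ms timestamps) with one 1ms of jitter.
	ts := []int64{0, 15000, 30000, 45001, 60001}
	fmt.Println(deltaOfDeltas(ts)) // prints "[0 1 -1]"
}
```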
Chunk lifecycle:
Append to headChunk (memChunk, in-memory)
│ when nextAt is reached (~120 samples or ~30min):
▼
Cut new headChunk
│ old headChunk → ChunkDiskMapper.WriteChunk()
▼
mmappedChunk (on disk, accessible via mmap)
│ when Head chunkRange (~2h) expires:
▼
Compaction → immutable Block (index/ + chunks/ + meta.json)
Deep Dive 2: Inverted Index and Label Lookup #
Finding all series that match {job="prometheus", method="GET"} requires an inverted index — the same data structure as a search engine’s posting list.
Prometheus MemPostings:
// tsdb/index/postings.go (inferred from Head usage)
type MemPostings struct {
// map[label_name][label_value] → []HeadSeriesRef (sorted)
m map[string]map[string][]storage.SeriesRef
}
A query like {job="prometheus", method="GET"} becomes:
1. Fetch posting list for job="prometheus" → [ref1, ref3, ref7, ...]
2. Fetch posting list for method="GET" → [ref1, ref2, ref7, ...]
3. Intersect (merge sorted lists) → [ref1, ref7, ...]
4. For each ref, load memSeries.lset to verify the full label match (collision check)
For on-disk blocks, the same index is persisted in the index/ directory using a sorted, binary-searchable format.
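The intersection step is a standard sorted-list merge. A sketch (a hypothetical intersect helper, not the tsdb/index implementation):

```go
package main

import "fmt"

// intersect merges two sorted posting lists: only refs present in both
// lists (series matching both label pairs) survive.
func intersect(a, b []uint64) []uint64 {
	out := []uint64{}
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] < b[j]:
			i++ // advance the list with the smaller ref
		case a[i] > b[j]:
			j++
		default:
			out = append(out, a[i]) // present in both
			i++
			j++
		}
	}
	return out
}

func main() {
	job := []uint64{1, 3, 7, 9}    // postings for job="prometheus"
	method := []uint64{1, 2, 7, 8} // postings for method="GET"
	fmt.Println(intersect(job, method)) // prints "[1 7]"
}
```

The merge is O(len(a)+len(b)), which is why posting lists are kept sorted.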
VictoriaMetrics inverted index:
VictoriaMetrics uses a mergeset (LSM-tree style) for its inverted index:
// From VictoriaMetrics/lib/storage/index_db.go
const (
nsPrefixTagToMetricIDs = 1 // (tag_key, tag_value) → []MetricID
nsPrefixDateTagToMetricIDs = 6 // (date, tag_key, tag_value) → []MetricID
nsPrefixMetricIDToMetricName = 3 // MetricID → MetricName (reverse lookup)
nsPrefixDateMetricNameToTSID = 7 // (date, MetricName) → TSID
)
The nsPrefixDateTagToMetricIDs (per-day index) is key for efficient time-range queries: rather than scanning all historical MetricIDs for a label, it only looks up the per-day posting lists that overlap the query range. This dramatically reduces memory for indexes with high churn (short-lived containers, ephemeral jobs).
The storage also maintains a hot tsidCache (MetricName → TSID) and metricNameCache (MetricID → MetricName) to avoid index lookups on every ingested sample for known series.
Deep Dive 3: Scrape Engine #
The scrape engine is the pull-model ingest path:
// From scrape/scrape.go
type scrapePool struct {
appendable storage.Appendable // the TSDB
client *http.Client // per-job HTTP client (TLS, auth)
loops map[uint64]loop // target hash → scrapeLoop
activeTargets map[uint64]*Target // currently scraped targets
offsetSeed uint64 // jitter seed for staggering scrapes
config *config.ScrapeConfig
buffers *pool.Pool // reusable byte slice pool for response bodies
}
Each target runs a scrapeLoop in its own goroutine. The offsetSeed staggers scrape start times across targets within a job to prevent the “thundering herd” — all targets hitting at the same second. The jitter is computed as:
offset = hash(target_labels) % scrape_interval
So targets are spread uniformly across the interval rather than synchronized.
Scrape loop lifecycle:
timer fires (every scrape_interval)
│
▼
HTTP GET /metrics (with TLS, basic auth, OAuth2 as configured)
│
▼
Parse response body (OpenMetrics or Prometheus text format)
│ model.TextToMetricFamilies() or textparse.New()
│
▼
Apply relabeling rules (target-level and metric-level)
│ drop series matching drop_labels
│ rewrite __address__ to target address
│
▼
Appender.Append() for each (labels, timestamp, value)
│ timestamp is the scrape start time (not the value in the exposition)
│ ScrapeTimestampTolerance = 2ms (allows minor clock skew alignment)
│
▼
Commit()
│
▼
Report scrape metrics:
scrape_duration_seconds, scrape_samples_scraped,
scrape_samples_post_metric_relabeling, scrape_series_added
Staleness markers: When a target disappears (no longer returns a series), Prometheus writes a special stale NaN value at the next scrape. The PromQL engine interprets staleness markers as “this series ended here” and doesn’t extrapolate across the gap. Without this, graphs would show flat lines across target restarts.
Deep Dive 4: PromQL Query Engine #
The PromQL engine implements a two-phase evaluation model:
// From promql/engine.go
type Engine struct {
timeout time.Duration // max query runtime (default: 2min)
maxSamplesPerQuery int // OOM guard (default: 50M)
lookbackDelta time.Duration // staleness window (default: 5min)
activeQueryTracker QueryTracker // concurrent query limit
}
// Two query types:
func (ng *Engine) NewInstantQuery(ctx, q, opts, qs string, ts time.Time) (Query, error)
func (ng *Engine) NewRangeQuery(ctx, q, opts, qs string, start, end time.Time, step time.Duration) (Query, error)
Evaluation flow:
PromQL string (e.g., "rate(http_requests_total[5m])")
│
▼
parser.ParseExpr() → AST
│ nodes: VectorSelector, MatrixSelector, Call, BinaryExpr, AggregateExpr
│
▼
Engine.exec()
│ for range query: evaluate at each step (start, start+step, ..., end)
│
▼
evaluateAtTimestamp(t)
│ VectorSelector: query storage for all matching series at t
│ → uses lookbackDelta (5min default): find last sample in [t-δ, t]
│ MatrixSelector: query samples in [t-range, t] for each series
│ → used by rate(), irate(), increase(), ...
│ Call: apply function (rate, sum, histogram_quantile, ...)
│ AggregateExpr: group by labels, apply aggregation
│
▼
Result: Vector (instant) or Matrix (range)
lookbackDelta (5 minutes): For an instant query at time t, if a series had its last sample at t-3min, the engine still returns that value (age is within 5 minutes). This handles scrape jitter — a 15s scrape interval means consecutive scrapes are at t, t+15s, t+30s, so the gap is bounded. The 5-minute default covers target restarts and missed scrapes.
rate() implementation:
rate(http_requests_total[5m]) at time t:
  samples = all samples in [t-5min, t] for each matching series
  increase = sum of sample-to-sample deltas; when a sample drops below
             its predecessor (counter reset), the new value counts in full
  rate = increase / (last_timestamp - first_timestamp)
  // the real engine also extrapolates the result to the window boundaries
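A runnable sketch of the same logic, omitting the boundary extrapolation the real engine performs:

```go
package main

import "fmt"

// simpleRate sums per-pair increases over the window. A decrease is
// treated as a counter reset: the counter is assumed to have restarted
// from zero, so the post-reset value counts in full.
func simpleRate(ts []int64, vs []float64) float64 {
	if len(vs) < 2 {
		return 0
	}
	var increase float64
	for i := 1; i < len(vs); i++ {
		if vs[i] >= vs[i-1] {
			increase += vs[i] - vs[i-1]
		} else {
			increase += vs[i] // reset detected
		}
	}
	seconds := float64(ts[len(ts)-1]-ts[0]) / 1000 // millisecond timestamps
	return increase / seconds
}

func main() {
	ts := []int64{0, 15000, 30000, 45000, 60000}
	vs := []float64{0, 10, 20, 5, 15} // reset between 20 and 5
	fmt.Println(simpleRate(ts, vs))   // (10+10+5+10)/60 ≈ 0.583/sec
}
```

Without reset handling the window's increase would be 15 instead of 35, badly under-reporting the rate across the restart.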
maxSamplesPerQuery: The engine counts the total number of samples loaded across all series for a query. If it exceeds the limit (default 50M), the query is aborted with ErrTooManySamples. This prevents a single high-cardinality query from OOM-killing the process.
Deep Dive 5: Cardinality — The Hard Problem #
Cardinality is the count of unique time series (unique label sets). It is the primary operational risk of a metrics system.
Why cardinality explodes:
Each unique combination of label values creates a distinct time series:
http_requests_total{method="GET", path="/api/v1/query", status="200", instance="pod-1"}
http_requests_total{method="GET", path="/api/v1/query", status="200", instance="pod-2"}
...
http_requests_total{method="GET", path="/api/v1/query", status="200", instance="pod-N"}
If instance has 10,000 pods, this single metric has 10,000 series. Adding another label with 100 values multiplies to 1,000,000 series. Common cardinality bombs:
- User IDs, session IDs, request IDs as labels (unbounded)
- Full URLs or SQL queries as labels (unbounded)
- Kubernetes pod names without aggregation (proportional to pod count × churn)
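The multiplicative blow-up is worth making concrete. A hypothetical seriesCount helper:

```go
package main

import "fmt"

// seriesCount applies the multiplicative cardinality rule: the series
// count for one metric name is the product of distinct values per label.
func seriesCount(valuesPerLabel map[string]int) int {
	total := 1
	for _, n := range valuesPerLabel {
		total *= n
	}
	return total
}

func main() {
	fmt.Println(seriesCount(map[string]int{
		"method":   4,
		"status":   10,
		"instance": 10_000, // 10K pods
	})) // prints "400000": one metric name, 400K series
}
```

Adding a single unbounded label (user ID, full URL) makes this product unbounded, which is why such labels are the classic cardinality bomb.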
Cardinality in the Head:
Every unique label set requires:
- One memSeries in stripeSeries (~400 bytes in memory)
- One entry in MemPostings per label name=value pair
- One WAL Series record (written once per new series)
At 10M active series: ~4GB RAM just for the series index, before any sample data.
Cardinality cache in VictoriaMetrics:
// From VictoriaMetrics/lib/storage/storage.go
type Storage struct {
tsidCache *workingsetcache.Cache // MetricName → TSID (hot cache)
metricNameCache *workingsetcache.Cache // MetricID → MetricName
}
The workingsetcache is a two-layer cache (active + previous generation). When a new series arrives, it is checked against the cache first — the index is only hit if the series is truly new. This reduces indexDB write amplification under steady-state (few new series per scrape).
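The two-generation idea can be sketched in a few lines. This shows the pattern only, not VictoriaMetrics's workingsetcache API:

```go
package main

import "fmt"

// wsCache keeps two generations. Lookups check the current map, then
// the previous one (promoting on hit); Rotate discards everything not
// touched since the last rotation, bounding memory to the working set.
type wsCache struct {
	curr, prev map[string]uint64
}

func newWSCache() *wsCache {
	return &wsCache{curr: map[string]uint64{}, prev: map[string]uint64{}}
}

func (c *wsCache) Put(k string, v uint64) { c.curr[k] = v }

func (c *wsCache) Get(k string) (uint64, bool) {
	if v, ok := c.curr[k]; ok {
		return v, true
	}
	if v, ok := c.prev[k]; ok {
		c.curr[k] = v // promote into the active generation
		return v, true
	}
	return 0, false
}

// Rotate ages the cache: current becomes previous, previous is dropped.
func (c *wsCache) Rotate() {
	c.prev = c.curr
	c.curr = map[string]uint64{}
}

func main() {
	c := newWSCache()
	c.Put(`http_requests_total{job="api"}`, 42) // MetricName → TSID
	c.Rotate()
	v, ok := c.Get(`http_requests_total{job="api"}`) // hit in prev, promoted
	fmt.Println(v, ok)                               // prints "42 true"
	c.Rotate()
	_, ok = c.Get(`http_requests_total{job="api"}`)
	fmt.Println(ok) // prints "true": survived via promotion
}
```

Entries that are never looked up between two rotations vanish, which is exactly the behavior wanted for churning series.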
High churn rate:
Kubernetes workloads create and destroy pods continuously. Each new pod = new instance label value = new time series. Dead series remain in the Head until compaction, then in persisted blocks until retention expires. VictoriaMetrics’s per-day index (nsPrefixDateTagToMetricIDs) bounds the memory used for these stale series — the index for old dates can be evicted once outside the query window.
Mitigation strategies:
- Relabeling: drop high-cardinality labels before ingestion (metric_relabel_configs)
- Recording rules: pre-aggregate high-cardinality metrics into lower-cardinality summaries
- Cardinality limits: per-scrape sample_limit and label_limit in Prometheus; -maxLabelsPerTimeseries in VictoriaMetrics
- Alerting on cardinality: topk(10, count by (__name__)({__name__=~".+"})) identifies the worst offenders
Deep Dive 6: OTel Collector Pipeline #
The OTel Collector implements a receiver → processor → exporter pipeline:
// From opentelemetry-collector/processor/batchprocessor/batch_processor.go
type batchProcessor[T any] struct {
timeout time.Duration // max time before sending (default: 200ms)
sendBatchSize int // target batch size (default: 8192 data points)
batcher batcher[T] // singletonBatcher or multiBatcher (for metadata routing)
}
type shard[T any] struct {
exportCtx context.Context // carries metadata (tenant ID, etc.)
timer *time.Timer // fires when timeout elapses
newItem chan T // receives data from producers
batch batch[T] // accumulated data
}
The batch processor sends when either sendBatchSize is reached or timeout elapses — whichever comes first. This is the same dual-trigger pattern as Kafka’s linger.ms + batch.size.
multiBatcher: When MetadataKeys are configured (e.g., X-Tenant-ID), the batch processor shards batches by metadata key combination. Each unique combination gets its own shard. This enables multi-tenant pipelines where different tenants must be routed to different backends.
Pipeline composition:
otlpreceiver (gRPC :4317, HTTP :4318)
│ consumer.Metrics
▼
batchprocessor
│ accumulate until 8192 data points or 200ms
▼
transformprocessor (optional: rename attributes, set resource attributes)
│
▼
prometheusremotewriteexporter
│ convert pmetric.Metrics → []prompb.TimeSeries
│ POST to /api/v1/write (snappy-compressed protobuf)
│ retry with exponential backoff on 5xx
▼
Prometheus / VictoriaMetrics
OTLP → Prometheus conversion semantics:
OTLP Sum (monotonic=true) → Prometheus counter
OTLP Gauge → Prometheus gauge
OTLP Histogram → Prometheus _bucket, _count, _sum series
OTLP ExponentialHistogram → Prometheus native histogram
Resource attributes (e.g., host.name, cloud.region) become Prometheus labels. The job and instance labels are synthesized from service.name and host.name resource attributes by convention.
Deep Dive 7: Alerting Pipeline #
The alerting pipeline runs inside the Prometheus Rules Manager:
// From rules/group.go
type Group struct {
name string
interval time.Duration // evaluation frequency (default: 1m)
rules []Rule // AlertingRule and RecordingRule
opts *ManagerOptions
evaluationTime time.Duration // time taken for last eval (monitored)
}
// From rules/alerting.go
type AlertingRule struct {
vector parser.Expr // PromQL expression
holdDuration time.Duration // the "for:" clause — must stay true this long
active map[uint64]*Alert
}
Alert state machine:
[no data or expression = false]
│
▼
Inactive (not in active map)
│ expression becomes true
▼
Pending (in active map, ActiveAt = now)
│ continuously true for holdDuration
▼
Firing (sent to Alertmanager)
│ expression becomes false
▼
[removed from active map after keepFiringFor]
The holdDuration (the for: clause in alerting rules) prevents flapping alerts from firing. A spike that lasts 30 seconds won’t fire an alert with for: 5m. The rule must evaluate to true at every evaluation within the hold window.
Evaluation:
// Rules Manager calls at each interval tick:
QueryFunc(ctx, rule.vector.String(), evalTimestamp) → promql.Vector
// Each element of the returned Vector is a potential alert:
// Labels: the label set of the series (becomes alert labels)
// Value: the sample value (available as $value in annotations)
Recording rules write the result back as new time series:
// From rules/recording.go
type RecordingRule struct {
name string // the metric name to write results as
vector parser.Expr // the PromQL expression to evaluate
labels labels.Labels // extra labels to attach
}
Pre-aggregation via recording rules is the primary mitigation for slow dashboard queries over high-cardinality data. A rule that pre-computes sum(rate(http_requests_total[5m])) by (job) runs once per minute and stores the result as a low-cardinality time series, making dashboard queries instant.
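A sketch of what the by (job) aggregation collapses (hypothetical preAggregate helper; the input rates stand in for per-instance rate() results):

```go
package main

import "fmt"

// preAggregate collapses many per-instance rates into one
// low-cardinality series per job label value, as a recording rule like
//   sum(rate(http_requests_total[5m])) by (job)
// would store.
func preAggregate(ratesByJob map[string][]float64) map[string]float64 {
	out := map[string]float64{}
	for job, rates := range ratesByJob {
		for _, r := range rates {
			out[job] += r
		}
	}
	return out
}

func main() {
	// Thousands of instance-level series reduce to one per job.
	perSeries := map[string][]float64{
		"api":    {12.5, 9.25, 14.25}, // three instances of job="api"
		"worker": {3.0, 2.0},
	}
	fmt.Println(preAggregate(perSeries)) // prints "map[api:36 worker:5]"
}
```

Dashboards then query the pre-computed series directly instead of re-running the expensive rate-and-sum over every instance on each refresh.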
Alertmanager integration:
The Notifier in Prometheus batches Alert objects and posts them to Alertmanager via the /api/v2/alerts endpoint. Alertmanager then:
- Groups alerts with the same label set into a single notification (configurable via group_by)
- Deduplicates identical alerts from multiple Prometheus instances (HA setup)
- Inhibits lower-priority alerts when a higher-priority alert is firing (e.g., suppress host alerts when the whole datacenter is down)
- Silences alerts matching a label selector for a specified duration
- Routes to receivers: PagerDuty, Slack, email, OpsGenie, webhook
Deep Dive 8: Long-Term Storage — VictoriaMetrics vs Thanos #
Local Prometheus retention is bounded by disk (~15 days typically). Two patterns for long-term storage:
Pattern 1: Thanos (sidecar + object storage)
Prometheus (with Thanos sidecar)
│ remote write → Thanos Receiver
│ OR: sidecar uploads completed blocks to S3/GCS
▼
Thanos Store Gateway (reads from S3, caches index)
│
Thanos Querier (fan-out PromQL queries across Store Gateways + live Prometheus)
Thanos uses Prometheus’s own block format on object storage. Query fan-out merges results from live Prometheus instances (recent data) and Store Gateways (historical data) transparently.
Pattern 2: VictoriaMetrics (remote write target)
Prometheus configures VictoriaMetrics as a remote write endpoint:
remote_write:
- url: http://victoria-metrics:8428/api/v1/write
VictoriaMetrics stores data in its own merge-tree format:
// From VictoriaMetrics/lib/storage/storage.go
type Storage struct {
// Two partitions: small (recent, hot) and big (older, compressed)
// Merge happens in background when small partition exceeds threshold
tsidCache *workingsetcache.Cache // hot-path series lookup
metricNameCache *workingsetcache.Cache // reverse lookup
}
VM’s merge-tree columnar storage achieves ~4–7x better compression than Prometheus’s block format for typical workloads, because it sorts and delta-encodes entire columns across many series simultaneously, rather than per-series XOR encoding. The TSID sort order (MetricGroupID → JobID → InstanceID → MetricID) places related series adjacent on disk, improving scan locality.
Trade-off summary:
| Dimension | Prometheus local | Thanos | VictoriaMetrics |
|---|---|---|---|
| Retention | ~15 days (disk bound) | Years (object storage) | Years (local or cluster) |
| Query federation | No | Yes (Thanos Querier) | Yes (vmselect) |
| Operational complexity | Low | High (many components) | Medium |
| Compression | ~1.3 bytes/sample | Same as Prometheus | ~0.4 bytes/sample |
| PromQL compatibility | Native | Native | MetricsQL (superset) |
| Remote write protocol | Sends | Receives (Receiver) | Receives |