Telemetry / Feedback #

telemetry = structured observations about system behavior
feedback  = decisions and actions driven by those observations

It answers:

what is happening, why, and what should change?

Role in the catalog: the evaluation layer’s second member, owed to three earlier files before it arrived. capacity.md forecast it (“SLOs are the availability-denominated sibling of unit economics”); log.md evicted trace logs to it; index_structures.md forecast that “the metric you didn’t emit” would be silent omission’s next member. All three debts land here:

capacity.md prices the system in MONEY;
this block prices it in EVIDENCE.

Goodhart appears in both files for the same reason: “teams optimize the metric instead of the service” IS “teams optimize attributed cost instead of real cost,” one layer over.

Central tension:

visibility and control  vs  cost, cardinality, noise, and feedback instability

Design Axes (the core module) #

Axis 1 — The Decision the Signal Supports (the structural cleave) #

Telemetry is derived data with no intrinsic value — it is worth exactly the decisions it improves. So the design runs BACKWARDS from the decision:

page a human           needs ACTIONABILITY; every alert must map to an
                       action, and false positives are paid in attention
                       (alert fatigue is the attention budget collapsing —
                       goodput*, denominated in on-call trust)
drive a controller     needs FRESHNESS and STABILITY — backpressure.md
                       axis 5's control loop, whole: metric lag,
                       oscillation, and two-controllers-fighting appear
                       in both files because they are one hazard
explore / debug        needs RICHNESS, tolerates lag (traces, logs,
                       profiles live mostly here)
prove accountability   needs INTEGRITY against the operator. This is the
                       home of log.md's adversarial-auditor readership:
                       audit telemetry is a fact-log whose threat model
                       includes the admin
attribute cost         → capacity.md axis 4, whole (metering,
                       attribution, the Goodhart loop)

The deep lesson’s slogan row belongs here:

telemetry volume ≠ observability.
a signal no decision consumes is cost without value —
capacity.md's unit-economics judgment, applied to the observations
themselves.

Interrogation:

For each signal: name the decision. No decision? Delete it or demote it.
For each alert: name the ACTION. "Look into it" is not an action.
For each dashboard panel: what would someone do differently having seen it?
Who consumes it — human, controller, auditor — and does the signal's
  freshness/integrity match that consumer's needs?
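
The interrogation above can be run mechanically if signals are kept in a registry. The following is a minimal sketch, not a real tool: the `Signal` fields, the `VAGUE_ACTIONS` set, and the example entries are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical registry entry; the field names are illustrative only.
@dataclass
class Signal:
    name: str
    kind: str                        # "metric", "alert", or "panel"
    consumer: Optional[str] = None   # "human", "controller", or "auditor"
    decision: Optional[str] = None   # the decision this signal improves
    action: Optional[str] = None     # alerts only: what the responder does

# Phrases that name no action (assumed list, extend as needed).
VAGUE_ACTIONS = {"look into it", "investigate", "check"}

def audit(signals):
    """Return (signal name, finding) pairs for entries that fail the interrogation."""
    findings = []
    for s in signals:
        if not s.decision:
            findings.append((s.name, "no decision: delete or demote"))
        if not s.consumer:
            findings.append((s.name, "no named consumer"))
        if s.kind == "alert":
            action = (s.action or "").strip().lower()
            if not action or action in VAGUE_ACTIONS:
                findings.append((s.name, "alert has no concrete action"))
    return findings

signals = [
    Signal("checkout_error_ratio", "alert", "human",
           decision="roll back the last deploy?", action="roll back release"),
    Signal("disk_io_wait", "alert", "human",
           decision="replace the volume?", action="Look into it"),
    Signal("jvm_gc_pause_by_pool", "metric"),
]

for name, finding in audit(signals):
    print(f"{name}: {finding}")
```

The check is deliberately dumb: it cannot tell a good decision from a bad one, only whether somebody wrote one down.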

Axis 2 — Signal Geometry (what question each kind CAN answer) #

The signal kinds are not rivals. They are projections of one underlying reality, taken at different aggregation and causality points:

metrics    pre-aggregated numbers — cheap, and the aggregation is
           IRREVERSIBLE: you cannot get the tail back out of an average
           ("average vs tail" lives here; export histograms, not means)
logs       discrete instances — expensive, complete per-event
traces     CAUSAL structure — the only signal that answers
           "where did the latency go"
profiles   resources attributed to code paths —
           "where did the CPU go"
events     state transitions — with the k8s caveat the source doc gets
           right: ephemeral, NOT durable audit (that's log.md's
           adversarial readership, and k8s Events don't qualify)

The join key across all of them:

correlation / trace ID — and "no correlation ID" is the failure that
orphans every signal from every other: five telescopes, no shared sky.
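
A minimal sketch of carrying the join key across an asynchronous hop, using only the standard library. The names (`trace_id`, `enqueue`, `worker`) are hypothetical, and a fresh `contextvars.Context` stands in for a consumer running in another thread or process; a real system would more likely rely on a tracing library's propagation.

```python
import contextvars
import queue
import uuid

# Hypothetical ambient context for the current request.
trace_id = contextvars.ContextVar("trace_id", default=None)

jobs = queue.Queue()
seen_by_worker = []

def enqueue(payload):
    # Ambient context does not cross the queue, so the ID rides in the
    # message envelope.
    jobs.put({"trace_id": trace_id.get(), "payload": payload})

def worker():
    msg = jobs.get()
    before = trace_id.get()            # consumer starts with no trace context
    trace_id.set(msg["trace_id"])      # restore it from the envelope
    seen_by_worker.append((before, trace_id.get(), msg["payload"]))

def handle_request():
    trace_id.set(uuid.uuid4().hex)
    enqueue("resize-image")
    return trace_id.get()

request_id = handle_request()

# An empty Context simulates the consumer side of the hop.
contextvars.Context().run(worker)

before, after, payload = seen_by_worker[0]
print("context before restore:", before)
print("joined to request:", after == request_id)
```

Without the `trace_id` field in the envelope, the worker's logs and spans would carry no ID at all, which is the orphaning failure described above.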

Interrogation:

Which geometry answers the axis-1 decision's question?
  (paging on a symptom → metric; explaining it → trace; proving it → audit)
Is the aggregation throwing away the thing you'll need? (tails, exemplars)
Does the correlation ID survive every hop — including the async ones?
  (queues and schedulers are where trace context goes to die)
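
To make the aggregation question concrete, here is a small sketch with made-up latency data: two services share the same mean, so a mean-only metric cannot tell them apart, while bucket counts still bound the tail. The bucket boundaries in `BOUNDS` are an assumption for illustration.

```python
import statistics
from bisect import bisect_left

def percentile(values, q):
    """Nearest-rank percentile for an integer q in 1..100."""
    ordered = sorted(values)
    rank = (q * len(ordered) + 99) // 100   # ceil(q * n / 100) in integers
    return ordered[rank - 1]

# Hypothetical latencies in milliseconds, constructed to share a mean.
steady = [100] * 100
spiky = [50] * 98 + [2550] * 2

print(statistics.mean(steady), statistics.mean(spiky))   # 100 100
print(percentile(steady, 99), percentile(spiky, 99))     # 100 2550

BOUNDS = [50, 100, 250, 1000, 5000]   # bucket upper bounds (assumed)

def histogram(values):
    """Count values per bucket; the last slot is the overflow bucket."""
    counts = [0] * (len(BOUNDS) + 1)
    for v in values:
        counts[bisect_left(BOUNDS, v)] += 1
    return counts

def quantile_upper_bound(counts, q):
    """Upper bound of the bucket that contains the q-th percentile."""
    total = sum(counts)
    if total == 0:
        return None                     # no observations: unknown, not zero
    rank = (q * total + 99) // 100
    seen = 0
    for bound, count in zip(BOUNDS + [float("inf")], counts):
        seen += count
        if seen >= rank:
            return bound

print(histogram(spiky))                              # [98, 0, 0, 0, 2, 0]
print(quantile_upper_bound(histogram(spiky), 99))    # 5000
```

Bucket counts from several instances can be added together before estimating a quantile; per-instance means or percentiles cannot be combined that way.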

Axis 3 — Fidelity Economics #

capacity.md’s unit economics, applied to observations:

cardinality   write amplification wearing labels — one unbounded label
              (user ID, pod name × restart) multiplies every series;
              index_structures.md's cardinality regimes, verbatim
sampling      a DELIBERATE silent-omission trade: the rare event you
              sampled away is invisible forever — head vs tail sampling
              is choosing WHICH rare things survive
retention     the treaty again: keep evidence as long as the questions
              it answers stay askable, and no longer
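
The cardinality row is arithmetic: the worst-case series count for one metric is the product of the distinct values of each label. A small sketch with invented label counts, plus one hypothetical way (`bound_label`) to make a label bounded by construction:

```python
from math import prod

def series_count(label_values):
    """Worst-case series for one metric: product of distinct values per label."""
    return prod(label_values.values())

# Hypothetical label sets; the counts are illustrative.
bounded = {"method": 5, "status_class": 5, "region": 4}
with_user = {**bounded, "user_id": 200_000}

print(series_count(bounded))     # 100
print(series_count(with_user))   # 20000000

def bound_label(value, allowed, fallback="other"):
    """Collapse anything outside an allowlist into a single fallback value."""
    return value if value in allowed else fallback

print(bound_label("GET", {"GET", "POST"}))       # GET
print(bound_label("PROPFIND", {"GET", "POST"}))  # other
```

One unbounded label turns 100 series into 20 million here; the allowlist caps that label at its list size plus one.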

Interrogation:

Which labels are bounded by construction, and which by hope?
What does the sampler drop, and is the incident you'll debug in it?
Can the P99 of a bad hour still be answered next month? (retention
  and resolution decide together)
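
A sketch of the head-versus-tail choice on synthetic traces. The trace shape (`id`, `error`, `duration_ms`), the rates, and the `slow_ms` threshold are assumptions for illustration; production samplers are considerably more involved.

```python
import random

def head_sample(trace_ids, rate, seed=0):
    """Decide at trace start, before the outcome is known."""
    rng = random.Random(seed)
    return {t for t in trace_ids if rng.random() < rate}

def tail_sample(traces, rate, slow_ms, seed=0):
    """Decide after completion: keep errors and slow traces, sample the rest."""
    rng = random.Random(seed)
    kept = set()
    for t in traces:
        if t["error"] or t["duration_ms"] >= slow_ms or rng.random() < rate:
            kept.add(t["id"])
    return kept

# 1000 synthetic traces, three of which failed.
error_ids = {137, 512, 900}
traces = [{"id": i, "error": i in error_ids, "duration_ms": 40}
          for i in range(1000)]

head = head_sample([t["id"] for t in traces], rate=0.01)
tail = tail_sample(traces, rate=0.01, slow_ms=1000)

print("head kept:", len(head), "errors kept:", len(error_ids & head))
print("tail kept:", len(tail), "errors kept:", len(error_ids & tail))
```

Head sampling keeps each failed trace only with probability `rate`; tail sampling keeps all of them, at the price of buffering every trace until it completes.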

Axis 4 — Vantage #

white-box   the system reports on itself — rich, and fate-shared (the star*)
black-box   synthetic probes measure from OUTSIDE — the user's chair:
            representative paths (through auth, through cache, through
            the real region), or the probe measures a system no user visits

This axis exists BECAUSE of the star: the external probe is the witness that doesn’t share the patient’s fate.

Interrogation:

Does a probe exercise the path users actually take?
When white-box goes dark, what black-box signal remains?

Axis 5 — The Judgment Layer (SLI → SLO → error budget → burn rate) #

capacity.md’s evaluation machinery, denominated in reliability:

SLI           the measurement that tracks USER experience —
              "healthy metric ≠ healthy user" is an SLI chosen from the
              server's chair instead of the user's
SLO           the objective: how much unreliability is budgeted
error budget  the spend account — reliability as a currency, with the
              same open-loop/closed-loop question as cost:
              a budget nobody enforces is a dashboard
burn rate     the derivative that makes alerts timely — paging on budget
              EXHAUSTED is paging after the money is gone
              (capacity.md's "alerts fire after the spend," verbatim)

Interrogation:

Does the SLI move when users suffer — tested against a real incident?
What HAPPENS at budget exhaustion — freeze, page, nothing? (open or closed?)
Are the burn-rate windows paired (fast+slow) so both cliffs and slow
  leaks page in time?
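
A minimal sketch of paired burn-rate windows. Burn rate here is the observed error ratio divided by the budgeted ratio `1 - SLO`. The window names and the thresholds 14.4 and 6 are assumptions borrowed from commonly cited SRE practice for a 30-day budget, not values this catalog prescribes; the request counts are invented.

```python
def burn_rate(bad, total, slo):
    """How many times faster than 'exactly on budget' errors are arriving."""
    if total == 0:
        return None                 # no traffic seen: unknown, not zero
    return (bad / total) / (1 - slo)

def should_page(windows, slo, pairs):
    """windows: {name: (bad, total)}; pairs: [(long, short, threshold)].

    A pair fires only when BOTH its windows exceed the threshold: the long
    window shows the burn is sustained, the short one that it is still going.
    """
    for long_w, short_w, threshold in pairs:
        long_b = burn_rate(*windows[long_w], slo)
        short_b = burn_rate(*windows[short_w], slo)
        if long_b is None or short_b is None:
            continue
        if long_b >= threshold and short_b >= threshold:
            return (long_w, short_w, threshold)
    return None

SLO = 0.999
PAIRS = [("1h", "5m", 14.4),    # fast pair: catches cliffs
         ("6h", "30m", 6)]      # slow pair: catches leaks

cliff = {"1h": (180, 10_000), "5m": (20, 800),
         "6h": (200, 60_000), "30m": (100, 5_000)}
leak = {"1h": (80, 10_000), "5m": (7, 800),
        "6h": (420, 60_000), "30m": (35, 5_000)}
quiet = {"1h": (5, 10_000), "5m": (0, 800),
         "6h": (30, 60_000), "30m": (2, 5_000)}

print(should_page(cliff, SLO, PAIRS))   # ('1h', '5m', 14.4)
print(should_page(leak, SLO, PAIRS))    # ('6h', '30m', 6)
print(should_page(quiet, SLO, PAIRS))   # None
```

Skipping a pair when a window has no traffic is itself a design choice: whether a silent window should page belongs to the absence question, not to the burn-rate rule.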

Technical Bottleneck: The Observer Is Inside the System* #

telemetry shares fate with what it observes.
the sickest node stops reporting; the overloaded path drops its own
spans first; the outage that matters most is the one that takes the
telemetry pipeline with it.
the dead don't page.

So every signal is a self-report from a witness who is also the patient — and absence of data is SYSTEMATICALLY ambiguous between “zero” and “the reporter died.” The deep lesson’s “absence ≠ zero” is this star’s one-line form. It is silent omission*’s third member, with a survivorship twist the other two lack:

the omission is CORRELATED WITH SEVERITY —
the worse things are, the less evidence arrives.

Known recipes:

deadman switch (flagship)   alert on the ABSENCE of the signal —
                            inverting the ambiguity into a page
                            ("up == 0 or absent(up)" is the whole idea)
black-box vantage           axis 4: a witness outside the blast radius
out-of-band paths           the signals needed DURING collapse travel
                            a pipeline that doesn't share the system's
                            dependencies (separate cluster, separate
                            cloud, separate auth)
per-signal disambiguation   the mandatory design question, asked per
                            signal: can missing be distinguished from
                            zero HERE? (staleness markers, counter
                            heartbeats, scrape-success meta-metrics)
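
A sketch of the deadman recipe and the missing-versus-zero question. The target names, timestamps, and the 60-second staleness limit are hypothetical. The one load-bearing assumption is that the `expected` set comes from an inventory that does not depend on the reporters themselves.

```python
def is_stale(last_seen_s, now_s, max_age_s):
    """True when a reporter never reported or has been silent too long."""
    return last_seen_s is None or now_s - last_seen_s > max_age_s

def classify(value, last_seen_s, now_s, max_age_s):
    """Separate 'reported zero' from 'did not report'."""
    if is_stale(last_seen_s, now_s, max_age_s):
        return "missing"
    return "zero" if value == 0 else "nonzero"

def deadman(expected, last_seen, now_s, max_age_s):
    """Targets that should be reporting but are silent: page on these."""
    return sorted(t for t in expected
                  if is_stale(last_seen.get(t), now_s, max_age_s))

expected = {"api-1", "api-2", "api-3"}       # from inventory, not from telemetry
last_seen = {"api-1": 995, "api-2": 900}     # api-3 has never reported
now = 1000

print(deadman(expected, last_seen, now, max_age_s=60))   # ['api-2', 'api-3']
print(classify(0, 995, now, 60))                         # zero
print(classify(0, 900, now, 60))                         # missing
```

The same error count of 0 reads as "zero" from a live reporter and "missing" from a silent one; a dashboard that renders both as 0 has thrown the distinction away.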

A strong design says explicitly:

the decision every signal supports (axis 1),
the geometry that can answer it and the ID that joins them (axis 2),
the cardinality, sampling, and retention bills (axis 3),
the outside witness for when the inside ones die (axis 4),
the SLI the user would agree with, and what budget exhaustion DOES (axis 5),
and for every signal: what its absence means — because the most
important page is the one about the silence.

Telemetry As Protocol (the crossing-point spec — keep) #

instrument → emit → collect → enrich (resource attributes)
→ transform/sample/REDACT (secrets and tenant data are boundary.md's
  concern riding the pipeline — PII in logs is a data-boundary breach
  through the observability side door)
→ store/index/aggregate → query/visualize/alert → decide or actuate
→ feed back (and stabilize: backpressure.md's loop discipline)
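
A minimal sketch of the REDACT stage as a pure function over structured log records. The key list and the email pattern are assumptions for illustration; a real pipeline would need a policy agreed with whoever owns the data boundary, and pattern matching alone will miss things.

```python
import re

# Hypothetical deny-list of field names, compared case-insensitively.
SENSITIVE_KEYS = {"password", "authorization", "api_key", "ssn"}
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def redact(record):
    """Return a redacted copy of a (possibly nested) dict; input is untouched."""
    clean = {}
    for key, value in record.items():
        if key.lower() in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"
        elif isinstance(value, dict):
            clean[key] = redact(value)
        elif isinstance(value, str):
            clean[key] = EMAIL.sub("[EMAIL]", value)
        else:
            clean[key] = value
    return clean

record = {
    "msg": "login failed for ana@example.com",
    "password": "hunter2",
    "http": {"Authorization": "Bearer abc", "status": 401},
}
print(redact(record))
```

Placing this before store/index matters: once a secret is indexed, retention and backups keep it alive long after the log line is corrected.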

Metrics instantiation:

app exposes samples → scrape → time series storage
→ rate/window queries → alert rule → human or controller acts
(scrape-success is itself a signal — the star's meta-metric)
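
The rate/window step has one classic trap, the counter reset. Below is a simplified sketch of a reset-tolerant rate over scraped samples; it treats any decrease as a restart from zero, which is similar in spirit to Prometheus-style rate functions but omits their extrapolation. The sample data is invented.

```python
def increase(samples):
    """Total increase of a monotonic counter, tolerating resets.

    samples: list of (timestamp_s, value) in time order.
    A drop in value is treated as a restart from zero.
    """
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        total += (cur - prev) if cur >= prev else cur
    return total

def rate(samples):
    """Per-second rate over the window, or None when it cannot be computed."""
    if len(samples) < 2:
        return None                 # too few samples: unknown, not zero
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    return increase(samples) / elapsed

# The process restarted between t=30 and t=45, so the counter fell to 10.
samples = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]

naive = (samples[-1][1] - samples[0][1]) / 60
print(naive)            # -1.0, a negative request rate
print(rate(samples))    # about 1.67 per second
```

Increments that happened between the last scrape before the restart and the restart itself are lost either way; the reset handling only stops the rate from going negative.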

Traces instantiation:

request starts trace → context propagates (survives the async hops!)
→ spans recorded → sampled/exported → stored/indexed
→ operator walks the critical path

Control-loop instantiation (→ backpressure.md axis 5, whole):

measure → compare to target → adjust → actuate
→ observe DELAYED effect → dampen (hysteresis)
(and name the other controller on the same plant)
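
A sketch of the dampening step, comparing a controller that reacts to any deviation from its target with one that holds inside a dead band. The thresholds, replica limits, and utilization series are assumptions, and the simulation is open-loop (utilization does not respond to the replica count), so it shows flapping only, not lag.

```python
def naive_replicas(current, utilization, *, target=0.65, min_r=1, max_r=20):
    """React to any deviation from the target."""
    if utilization > target:
        return min(max_r, current + 1)
    if utilization < target:
        return max(min_r, current - 1)
    return current

def hysteresis_replicas(current, utilization, *, high=0.80, low=0.50,
                        min_r=1, max_r=20):
    """Act only outside the dead band between low and high."""
    if utilization > high:
        return min(max_r, current + 1)
    if utilization < low:
        return max(min_r, current - 1)
    return current                  # inside the band: hold

def run(controller, utilizations, start=4):
    replicas, history = start, []
    for u in utilizations:
        replicas = controller(replicas, u)
        history.append(replicas)
    return history

def changes(history, start=4):
    """Number of steps at which the replica count moved."""
    prev, count = start, 0
    for r in history:
        count += r != prev
        prev = r
    return count

noisy = [0.66, 0.64, 0.67, 0.63, 0.66, 0.64]   # noise around the target
surge = [0.90, 0.90]                            # a real load increase

print(run(naive_replicas, noisy))        # [5, 4, 5, 4, 5, 4]
print(run(hysteresis_replicas, noisy))   # [4, 4, 4, 4, 4, 4]
print(run(hysteresis_replicas, surge))   # [5, 6]
```

The dead band trades responsiveness for stability: the wider it is, the less the controller flaps and the longer a genuine drift goes uncorrected.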

Named Configurations (lookup table) #

Vector = {decision, geometry, fidelity, vantage, judgment}. Rows marked → are owned elsewhere.

| Name | Vector | Canonical study object | Signature failure |
|---|---|---|---|
| Metrics | any, pre-aggregated, cardinality-bounded, white-box, feeds SLIs | Prometheus/OpenMetrics | cardinality explosion; irreversible aggregation hides tails; counter resets |
| Logs | debug, discrete, expensive, white-box, — | structured JSON + request IDs | no correlation ID; secrets/PII (boundary breach); clock-skew misordering |
| Traces | debug (“where did latency go”), causal, sampled, white-box, — | OpenTelemetry model | sampling hides the rare one (axis 3’s trade); context dies at async hops |
| Events | operational awareness, transitions, ephemeral, white-box, — | k8s Events | treated as durable audit (they are not: log.md’s readership test fails) |
| Profiles | debug (“where did CPU go”), stack-attributed, sampled, white-box, — | pprof; eBPF continuous | sampling bias; missing symbols; wrong workload sampled |
| Audit telemetry → log.md audit row | accountability, fact-log, full-fidelity, white-box, integrity vs operator | CloudTrail; k8s audit | disabled during incident; denials omitted; mutable by the adversary it watches |
| SLI/SLO | judgment, derived, windowed, user-anchored, the layer itself | SRE SLO/error-budget model | SLI ≠ user experience; budget with no enforcement (open loop); Goodhart |
| Alerting | page a human, rules over signals, routed, —, burn-rate-driven | Alertmanager + burn rates | fatigue (attention goodput*); flapping; symptom-without-action; the missing page |
| Dashboards | explore, visual read model, —, —, — | RED/USE in Grafana | lies via stale data (star*: absence rendered as zero); averages hide tails; no link to action |
| Control loops → backpressure.md axis 5 | actuate, fresh metrics, —, white-box, target-driven | HPA; TCP congestion control | (owned: oscillation, lag, controller fights) |
| Cost telemetry → capacity.md axis 4 | attribute, meters+tags, —, —, money-denominated | OpenCost | (owned: untagged spend, unfair splits, Goodhart) |
| Synthetic/black-box | user’s-chair truth, probes, sparse, outside the blast radius, feeds SLIs | synthetic monitoring | unrepresentative path (bypasses auth/cache); too sparse for regional issues |

Vocabulary #

signal  sample  series  label  attribute  cardinality
correlation ID  trace context  span  exemplar
histogram  percentile  tail  aggregation (irreversible)
head/tail sampling  retention  resolution
SLI  SLO  error budget  burn rate  window pair
deadman switch  absent()  scrape success  staleness marker
white-box  black-box  synthetic  probe  representative path
actionability  alert fatigue  flapping  routing  silence
redaction  audit integrity

Deep Lesson #

Telemetry bugs come from confusing pairs on different axes:

measurement          vs  truth                (a signal is a claim by a witness — the star* asks who the witness is)
healthy metric       vs  healthy user         (axis 5: the SLI was chosen from the wrong chair)
average              vs  tail                 (axis 2: aggregation is irreversible)
absence of data      vs  zero                 (the star*'s one-liner — and the deadman's reason)
logs                 vs  audit                (axis 1: debug richness ≠ integrity against the operator)
events               vs  durable history      (axis 2: k8s Events are weather, not record)
alert                vs  actionability        (axis 1: a page without an action is spent attention)
autoscaling          vs  overload protection  (→ backpressure + capacity: minutes vs seconds, again)
telemetry volume     vs  observability        (axis 1's slogan: unconsumed signals are cost)

Design procedure:

start from the decision and work backwards to the signal,
pick the geometry that can answer it and join everything with one ID,
bound the cardinality and choose what sampling may lose,
keep one witness outside the blast radius,
anchor the SLI in the user’s chair and close the budget loop,
and for every signal, write down what its silence means, because the
pipeline’s own death is the one event it cannot report.

The named types are recognition shortcuts, not the design space.