Telemetry / Feedback
telemetry = structured observations about system behavior
feedback = decisions and actions driven by those observations
It answers:
what is happening, why, and what should change?
Role in the catalog: the evaluation layer’s second member, predicted twice and handed one eviction before it arrived. capacity.md forecast it (“SLOs are the availability-denominated sibling of unit economics”); log.md evicted trace logs to it; index_structures.md forecast that “the metric you didn’t emit” would be silent omission’s next member. All three debts land here:
capacity.md prices the system in MONEY;
this block prices it in EVIDENCE.
Goodhart appears in both files for the same reason: “teams optimize the metric instead of the service” IS “teams optimize attributed cost instead of real cost,” one layer over.
Central tension:
visibility and control vs cost, cardinality, noise, and feedback instability
Design Axes (the core module)
Axis 1 — The Decision the Signal Supports (the structural cleave)
Telemetry is derived data with no intrinsic value — it is worth exactly the decisions it improves. So the design runs BACKWARDS from the decision:
page a human: needs ACTIONABILITY; every alert must map to an action, and false positives are paid in attention (alert fatigue is the attention budget collapsing: goodput*, denominated in on-call trust)
drive a controller: needs FRESHNESS and STABILITY. This is backpressure.md axis 5's control loop, whole: metric lag, oscillation, and two-controllers-fighting appear in both files because they are one hazard
explore / debug: needs RICHNESS, tolerates lag (traces, logs, profiles live mostly here)
prove accountability: needs INTEGRITY against the operator. This is the home of log.md's adversarial-auditor readership: audit telemetry is a fact-log whose threat model includes the admin
attribute cost: → capacity.md axis 4, whole (metering, attribution, the Goodhart loop)
The deep lesson’s slogan row belongs here:
telemetry volume ≠ observability.
a signal no decision consumes is cost without value —
capacity.md's unit-economics judgment, applied to the observations
themselves.
Interrogation:
For each signal: name the decision. No decision? Delete it or demote it.
For each alert: name the ACTION. "Look into it" is not an action.
For each dashboard panel: what would someone do differently having seen it?
Who consumes it — human, controller, auditor — and does the signal's
freshness/integrity match that consumer's needs?
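The interrogation above can be run mechanically over a signal inventory. The sketch below is only an illustration under stated assumptions: `Signal`, `audit`, `VAGUE_ACTIONS`, and every field and signal name are hypothetical, invented for this example, and not part of any library.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Signal:
    name: str
    consumer: str                    # "human", "controller", or "auditor"
    decision: Optional[str] = None   # the decision this signal improves
    action: Optional[str] = None     # for alerts: the concrete action on firing
    is_alert: bool = False

# Hypothetical list of phrases that do not count as an action.
VAGUE_ACTIONS = {"look into it", "investigate", "check"}

def audit(signals):
    """Return (signal name, problem) pairs for signals that fail the interrogation."""
    findings = []
    for s in signals:
        if not s.decision:
            findings.append((s.name, "no decision: delete or demote"))
        if s.is_alert and (not s.action or s.action.strip().lower() in VAGUE_ACTIONS):
            findings.append((s.name, "alert without a concrete action"))
    return findings

inventory = [
    Signal("checkout_error_ratio", "human", decision="page on-call",
           action="roll back the last deploy", is_alert=True),
    Signal("disk_latency_high", "human", decision="page on-call",
           action="Look into it", is_alert=True),
    Signal("jvm_gc_pause_by_pool", "human"),
]

for name, problem in audit(inventory):
    print(f"{name}: {problem}")
```

The first entry passes; the second is an alert whose "action" is not one; the third is a signal no decision consumes.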
Axis 2 — Signal Geometry (what question each kind CAN answer)
The signal kinds are not rivals; they are projections of one underlying reality at different aggregation and causality points:
metrics: pre-aggregated numbers; cheap, and the aggregation is IRREVERSIBLE: you cannot get the tail back out of an average ("average vs tail" lives here; export histograms, not means)
logs: discrete instances; expensive, complete per-event
traces: CAUSAL structure; the only signal that answers "where did the latency go"
profiles: resources attributed to code paths; "where did the CPU go"
events: state transitions, with the k8s caveat the source doc gets right: ephemeral, NOT durable audit (that's log.md's adversarial readership, and k8s Events don't qualify)
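A small worked example of why the aggregation is irreversible. The latency data and bucket bounds below are made up for illustration, and `bucket_counts` and `quantile_upper_bound` are hypothetical helpers, not a real metrics library's API.

```python
import statistics

# 990 fast requests and 10 very slow ones, in milliseconds (made-up data).
latencies_ms = [20.0] * 990 + [3000.0] * 10

mean_ms = statistics.fmean(latencies_ms)

# A histogram keeps per-bucket counts, so tail quantiles stay recoverable
# to bucket resolution. Bucket upper bounds here are arbitrary.
bounds_ms = [25, 50, 100, 250, 500, 1000, 5000]

def bucket_counts(values, bounds):
    """Count each value in the first bucket whose upper bound covers it.
    Values above the last bound are ignored in this sketch."""
    counts = [0] * len(bounds)
    for v in values:
        for i, upper in enumerate(bounds):
            if v <= upper:
                counts[i] += 1
                break
    return counts

def quantile_upper_bound(counts, bounds, q):
    """Smallest bucket upper bound that covers at least fraction q of samples."""
    target = q * sum(counts)
    seen = 0
    for count, upper in zip(counts, bounds):
        seen += count
        if seen >= target:
            return upper
    return bounds[-1]

counts = bucket_counts(latencies_ms, bounds_ms)
print(f"mean   = {mean_ms:.1f} ms")
print(f"p50   <= {quantile_upper_bound(counts, bounds_ms, 0.50)} ms")
print(f"p99.5 <= {quantile_upper_bound(counts, bounds_ms, 0.995)} ms")
```

The mean lands near 50 ms, a latency that no request in the data actually had, while the histogram still shows that the slowest half-percent sits in the top bucket.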
The join key across all of them:
correlation / trace ID — and "no correlation ID" is the failure that
orphans every signal from every other: five telescopes, no shared sky.
Interrogation:
Which geometry answers the axis-1 decision's question?
(paging on a symptom → metric; explaining it → trace; proving it → audit)
Is the aggregation throwing away the thing you'll need? (tails, exemplars)
Does the correlation ID survive every hop — including the async ones?
(queues and schedulers are where trace context goes to die)
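A minimal sketch of keeping the correlation ID alive across one async hop, assuming a hand-rolled in-process queue. All names here (`correlation_id`, `enqueue`, `worker`) are hypothetical; a real system would more likely use its tracing library's propagation mechanism than a hand-built message field.

```python
import contextvars
import queue
import threading
import uuid

correlation_id = contextvars.ContextVar("correlation_id", default=None)

def enqueue(q, payload):
    # The ID must ride INSIDE the message: the consumer runs in another
    # thread, where the producer's context is not guaranteed to be visible.
    q.put({"correlation_id": correlation_id.get(), "payload": payload})

def worker(q, seen):
    msg = q.get()
    # Restore the ID on the consumer side before doing any work,
    # so every log line or span emitted here can carry it.
    correlation_id.set(msg["correlation_id"])
    seen.append((correlation_id.get(), msg["payload"]))

q = queue.Queue()
seen = []
request_id = str(uuid.uuid4())
correlation_id.set(request_id)
enqueue(q, "resize-image")

t = threading.Thread(target=worker, args=(q, seen))
t.start()
t.join()
print(seen[0][0] == request_id)
```

The design point is that the queue message, not ambient state, is the carrier: anything that serializes work (queues, schedulers, retries) needs the ID written into what it serializes.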
Axis 3 — Fidelity Economics
capacity.md’s unit economics, applied to observations:
cardinality: write amplification wearing labels; one unbounded label (user ID, pod name × restart) multiplies the series count of every metric that carries it; index_structures.md's cardinality regimes, verbatim
sampling: a DELIBERATE silent-omission trade: the rare event you sampled away is invisible forever; head vs tail sampling is choosing WHICH rare things survive
retention: the treaty again: keep evidence as long as the questions it answers stay askable, and no longer
Interrogation:
Which labels are bounded by construction, and which by hope?
What does the sampler drop, and is the incident you'll debug in it?
Can the P99 of a bad hour still be answered next month? (retention
and resolution decide together)
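The cardinality and sampling bills are plain arithmetic. A sketch with made-up label counts and sample rates; `series_count` and `p_all_dropped` are hypothetical helper names, and the sampling model assumes uniform, independent head sampling.

```python
from math import prod

def series_count(label_cardinalities):
    """Upper bound on time series for one metric: product of per-label value counts."""
    return prod(label_cardinalities.values())

bounded = {"method": 5, "status_class": 5, "region": 4}
with_user = {**bounded, "user_id": 100_000}   # one label bounded only by hope

print(series_count(bounded))     # 100
print(series_count(with_user))   # 10000000

def p_all_dropped(sample_rate, occurrences):
    """Probability that uniform head sampling drops every one of
    `occurrences` independent rare traces."""
    return (1.0 - sample_rate) ** occurrences

# 1% head sampling, an incident that produced 20 bad traces.
print(round(p_all_dropped(0.01, 20), 2))
```

Under these assumptions the chance that none of the 20 bad traces was kept is roughly four in five, which is the concrete form of "is the incident you'll debug in it?".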
Axis 4 — Vantage
white-box: the system reports on itself; rich, and fate-shared (the star*)
black-box: synthetic probes measure from OUTSIDE, from the user's chair. They must follow representative paths (through auth, through cache, through the real region), or the probe measures a system no user visits
This axis exists BECAUSE of the star: the external probe is the witness that doesn’t share the patient’s fate.
Interrogation:
Does a probe exercise the path users actually take?
When white-box goes dark, what black-box signal remains?
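The first question can be checked as a simple path comparison. This is a toy sketch: the hop names and the `unexercised` helper are hypothetical, and a real check would derive both paths from traces rather than hand-written lists.

```python
user_path = ["cdn", "auth", "api", "cache", "db"]
probe_path = ["api", "db"]   # e.g. a probe that calls an internal endpoint directly

def unexercised(user_path, probe_path):
    """Hops on the user's path that the probe never touches, in user-path order."""
    touched = set(probe_path)
    return [hop for hop in user_path if hop not in touched]

print(unexercised(user_path, probe_path))  # ['cdn', 'auth', 'cache']
```

Every hop in the result is a place where users can be failing while the probe stays green.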
Axis 5 — The Judgment Layer (SLI → SLO → error budget → burn rate)
capacity.md’s evaluation machinery, denominated in reliability:
SLI: the measurement that tracks USER experience; "healthy metric ≠ healthy user" is an SLI chosen from the server's chair instead of the user's
SLO: the objective: how much unreliability is budgeted
error budget: the spend account; reliability as a currency, with the same open-loop/closed-loop question as cost: a budget nobody enforces is a dashboard
burn rate: the derivative that makes alerts timely; paging on budget EXHAUSTED is paging after the money is gone (capacity.md's "alerts fire after the spend," verbatim)
Interrogation:
Does the SLI move when users suffer — tested against a real incident?
What HAPPENS at budget exhaustion — freeze, page, nothing? (open or closed?)
Are the burn-rate windows paired (fast+slow) so both cliffs and slow
leaks page in time?
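A sketch of the burn-rate arithmetic with a paired fast and slow window. The thresholds are assumptions chosen for illustration against a 30-day (720 h) budget period: a 1 h window burning at 14.4x spends 14.4 × 1/720 = 2% of the budget, and a 6 h window at 6x spends 6 × 6/720 = 5%. `burn_rate` and `should_page` are hypothetical names.

```python
def burn_rate(error_ratio, slo):
    """How many times faster than 'exactly on budget' the budget is being spent."""
    return error_ratio / (1.0 - slo)

def should_page(err_fast, err_slow, slo, fast_threshold=14.4, slow_threshold=6.0):
    """Page on a cliff (fast window) or on a sustained leak (slow window)."""
    return (burn_rate(err_fast, slo) >= fast_threshold
            or burn_rate(err_slow, slo) >= slow_threshold)

SLO = 0.999  # 99.9% of requests succeed over the budget period

# Cliff: 2% errors in the last hour, 0.3% over six hours.
print(should_page(err_fast=0.02, err_slow=0.003, slo=SLO))      # True
# Slow leak: 0.7% errors sustained across both windows.
print(should_page(err_fast=0.007, err_slow=0.007, slo=SLO))     # True
# Within budget: 0.05% errors.
print(should_page(err_fast=0.0005, err_slow=0.0005, slo=SLO))   # False
```

The slow-leak case is the one a single fast window misses: a burn rate of 7 never crosses the cliff threshold, yet left alone it exhausts a 30-day budget in under five days.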
Technical Bottleneck: The Observer Is Inside the System*
telemetry shares fate with what it observes.
the sickest node stops reporting; the overloaded path drops its own
spans first; the outage that matters most is the one that takes the
telemetry pipeline with it.
the dead don't page.
So every signal is a self-report from a witness who is also the patient — and absence of data is SYSTEMATICALLY ambiguous between “zero” and “the reporter died.” The deep lesson’s “absence ≠ zero” is this star’s one-line form. It is silent omission*’s third member, with a survivorship twist the other two lack:
the omission is CORRELATED WITH SEVERITY —
the worse things are, the less evidence arrives.
Known recipes:
deadman switch (flagship): alert on the ABSENCE of the signal, inverting the ambiguity into a page ("up == 0 or absent(up)" is the whole idea)
black-box vantage: axis 4: a witness outside the blast radius
out-of-band paths: the signals needed DURING collapse travel a pipeline that doesn't share the system's dependencies (separate cluster, separate cloud, separate auth)
per-signal disambiguation: the mandatory design question, asked per signal: can missing be distinguished from zero HERE? (staleness markers, counter heartbeats, scrape-success meta-metrics)
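A minimal sketch of per-signal disambiguation plus a deadman check, assuming an in-memory tracker with an explicit staleness window. `Heartbeat` and its methods are hypothetical, and the check itself must run somewhere that does not share the watched system's fate.

```python
import time
from typing import Optional

class Heartbeat:
    """Tracks the last report per source so silence can be told apart from zero."""

    def __init__(self, stale_after_s: float):
        self.stale_after_s = stale_after_s
        self.last = {}   # source -> (timestamp, value)

    def report(self, source: str, value: float, now: Optional[float] = None):
        self.last[source] = (time.monotonic() if now is None else now, value)

    def read(self, source: str, now: Optional[float] = None):
        """Return ('ok', value), ('stale', last_value), or ('never', None)."""
        now = time.monotonic() if now is None else now
        if source not in self.last:
            return ("never", None)
        ts, value = self.last[source]
        if now - ts > self.stale_after_s:
            return ("stale", value)
        return ("ok", value)

    def deadman_alerts(self, expected, now: Optional[float] = None):
        """Sources that should be reporting but are silent."""
        return [s for s in expected if self.read(s, now)[0] != "ok"]

hb = Heartbeat(stale_after_s=60)
hb.report("node-a", 0.0, now=1000.0)   # a genuine zero
hb.report("node-b", 12.0, now=900.0)   # last heard 130 s before the check
print(hb.read("node-a", now=1030.0))
print(hb.read("node-b", now=1030.0))
print(hb.deadman_alerts(["node-a", "node-b", "node-c"], now=1030.0))
```

The list of expected sources is the part that cannot be inferred from the data: a source that never reported leaves no series to go stale, so the expectation has to be written down.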
A strong design says explicitly:
the decision every signal supports (axis 1),
the geometry that can answer it and the ID that joins them (axis 2),
the cardinality, sampling, and retention bills (axis 3),
the outside witness for when the inside ones die (axis 4),
the SLI the user would agree with, and what budget exhaustion DOES (axis 5),
and for every signal: what its absence means — because the most
important page is the one about the silence.
Telemetry As Protocol (the crossing-point spec — keep)
instrument → emit → collect → enrich (resource attributes)
→ transform/sample/REDACT (secrets and tenant data are boundary.md's
concern riding the pipeline — PII in logs is a data-boundary breach
through the observability side door)
→ store/index/aggregate → query/visualize/alert → decide or actuate
→ feed back (and stabilize: backpressure.md's loop discipline)
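A sketch of the REDACT stage as a pure function over a log record. The key list and the email pattern are assumptions for illustration, not a complete PII policy, and `redact` is a hypothetical helper rather than any collector's built-in processor.

```python
import re

# Assumed deny-list of field names; a real policy would be owned by the
# data-boundary rules, not by the logging code.
SENSITIVE_KEYS = {"password", "authorization", "api_key", "ssn"}
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def redact(record: dict) -> dict:
    """Return a copy of a log record with sensitive keys and email addresses masked."""
    clean = {}
    for key, value in record.items():
        if key.lower() in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"
        elif isinstance(value, dict):
            clean[key] = redact(value)          # recurse into nested fields
        elif isinstance(value, str):
            clean[key] = EMAIL.sub("[EMAIL]", value)
        else:
            clean[key] = value
    return clean

record = {
    "msg": "login failed for alice@example.com",
    "trace_id": "abc123",
    "request": {"Authorization": "Bearer s3cr3t", "path": "/login"},
}
print(redact(record))
```

The correlation ID passes through untouched: redaction that strips the join key fixes the boundary problem by creating the orphaned-signal one.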
Metrics instantiation:
app exposes samples → scrape → time series storage
→ rate/window queries → alert rule → human or controller acts
(scrape-success is itself a signal — the star's meta-metric)
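A simplified sketch of the rate/window step over scraped counter samples, including the counter-reset case. This is in the spirit of rate-style queries but is not any particular time-series database's exact algorithm (no extrapolation, no staleness handling); `increase` and `rate` here are hypothetical Python helpers.

```python
def increase(samples):
    """Total increase of a monotonic counter over (timestamp, value) samples.

    A drop in value is treated as a reset to zero (for example a process
    restart), so the post-reset value counts as new increase instead of a
    negative delta.
    """
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        total += (cur - prev) if cur >= prev else cur
    return total

def rate(samples):
    """Average per-second rate over the window covered by the samples."""
    if len(samples) < 2:
        return None   # not enough data: report missing, not zero
    elapsed = samples[-1][0] - samples[0][0]
    return increase(samples) / elapsed if elapsed > 0 else None

# Scrapes every 15 s; the process restarts between t=30 and t=45.
scrapes = [(0, 100), (15, 130), (30, 160), (45, 20), (60, 50)]
print(increase(scrapes))   # 110.0
print(rate(scrapes))
```

Returning `None` for a window with fewer than two samples keeps "missing" distinct from "zero" at the query layer too.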
Traces instantiation:
request starts trace → context propagates (survives the async hops!)
→ spans recorded → sampled/exported → stored/indexed
→ operator walks the critical path
Control-loop instantiation (→ backpressure.md axis 5, whole):
measure → compare to target → adjust → actuate
→ observe DELAYED effect → dampen (hysteresis)
(and name the other controller on the same plant)
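A sketch of the dampening step: a dead band between two thresholds plus a cooldown that waits for the delayed effect of the last change. `HysteresisScaler`, its thresholds, and the readings are all hypothetical and do not reproduce any real autoscaler's algorithm.

```python
class HysteresisScaler:
    """Scale up above `high`, down below `low`, hold in between, with a cooldown."""

    def __init__(self, low, high, cooldown_ticks,
                 replicas=2, min_replicas=1, max_replicas=10):
        if not low < high:
            raise ValueError("low must be below high to form a dead band")
        self.low, self.high = low, high
        self.cooldown_ticks = cooldown_ticks
        self.replicas = replicas
        self.min_replicas, self.max_replicas = min_replicas, max_replicas
        self._wait = 0   # ticks left before the next adjustment is allowed

    def step(self, utilization):
        if self._wait > 0:
            self._wait -= 1   # let the delayed effect of the last change show up
            return self.replicas
        if utilization > self.high and self.replicas < self.max_replicas:
            self.replicas += 1
            self._wait = self.cooldown_ticks
        elif utilization < self.low and self.replicas > self.min_replicas:
            self.replicas -= 1
            self._wait = self.cooldown_ticks
        return self.replicas

scaler = HysteresisScaler(low=0.40, high=0.80, cooldown_ticks=2)
readings = [0.85, 0.90, 0.90, 0.70, 0.60, 0.35, 0.30, 0.30, 0.30]
print([scaler.step(u) for u in readings])   # [3, 3, 3, 3, 3, 2, 2, 2, 1]
```

Readings of 0.70 and 0.60 fall inside the dead band and change nothing, which is what keeps the loop from flapping around a single threshold. A second controller acting on the same replicas would not see this one's cooldown, which is why the other controller has to be named.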
Named Configurations (lookup table)
Vector = {decision, geometry, fidelity, vantage, judgment}. Rows marked → are owned elsewhere.
| Name | Vector | Canonical study object | Signature failure |
|---|---|---|---|
| Metrics | any, pre-aggregated, cardinality-bounded, white-box, feeds SLIs | Prometheus/OpenMetrics | cardinality explosion; irreversible aggregation hides tails; counter resets |
| Logs | debug, discrete, expensive, white-box, — | structured JSON + request IDs | no correlation ID; secrets/PII (boundary breach); clock-skew misordering |
| Traces | debug (“where did latency go”), causal, sampled, white-box, — | OpenTelemetry model | sampling hides the rare one (axis 3’s trade); context dies at async hops |
| Events | operational awareness, transitions, ephemeral, white-box, — | k8s Events | treated as durable audit (they are not — log.md’s readership test fails) |
| Profiles | debug (“where did CPU go”), stack-attributed, sampled, white-box, — | pprof; eBPF continuous | sampling bias; missing symbols; wrong workload sampled |
| Audit telemetry → log.md audit row | accountability, fact-log, full-fidelity, white-box, integrity vs operator | CloudTrail; k8s audit | disabled during incident; denials omitted; mutable by the adversary it watches |
| SLI/SLO | judgment, derived, windowed, user-anchored, the layer itself | SRE SLO/error-budget model | SLI ≠ user experience; budget with no enforcement (open loop); Goodhart |
| Alerting | page a human, rules over signals, routed, —, burn-rate-driven | Alertmanager + burn rates | fatigue (attention goodput*); flapping; symptom-without-action; the missing page |
| Dashboards | explore, visual read model, —, —, — | RED/USE in Grafana | lies via stale data (star*: absence rendered as zero); averages hide tails; no link to action |
| Control loops → backpressure.md axis 5 | actuate, fresh metrics, —, white-box, target-driven | HPA; TCP congestion control | (owned: oscillation, lag, controller fights) |
| Cost telemetry → capacity.md axis 4 | attribute, meters+tags, —, —, money-denominated | OpenCost | (owned: untagged spend, unfair splits, Goodhart) |
| Synthetic/black-box | user’s-chair truth, probes, sparse, outside the blast radius, feeds SLIs | synthetic monitoring | unrepresentative path (bypasses auth/cache); too sparse for regional issues |
Vocabulary
signal, sample, series, label, attribute, cardinality
correlation ID, trace context, span, exemplar
histogram, percentile, tail, aggregation (irreversible)
head/tail sampling, retention, resolution
SLI, SLO, error budget, burn rate, window pair
deadman switch, absent(), scrape success, staleness marker
white-box, black-box, synthetic probe, representative path
actionability, alert fatigue, flapping, routing, silence
redaction, audit integrity
Deep Lesson
Telemetry bugs come from confusing pairs on different axes:
measurement vs truth (a signal is a claim by a witness — the star* asks who the witness is)
healthy metric vs healthy user (axis 5: the SLI was chosen from the wrong chair)
average vs tail (axis 2: aggregation is irreversible)
absence of data vs zero (the star*'s one-liner — and the deadman's reason)
logs vs audit (axis 1: debug richness ≠ integrity against the operator)
events vs durable history (axis 2: k8s Events are weather, not record)
alert vs actionability (axis 1: a page without an action is spent attention)
autoscaling vs overload protection (→ backpressure + capacity: minutes vs seconds, again)
telemetry volume vs observability (axis 1's slogan: unconsumed signals are cost)
Design procedure: start from the decision and work backwards to the signal; pick the geometry that can answer it and join everything with one ID; bound the cardinality and choose what sampling may lose; keep one witness outside the blast radius; anchor the SLI in the user’s chair and close the budget loop. For every signal, write down what its silence means, because the pipeline’s own death is the one event it cannot report. The named types are recognition shortcuts, not the design space.