Distributed Tracing System (Jaeger / Zipkin-class)
Distributed Tracing System (Jaeger / Zipkin-class) #
This note models a Jaeger/Zipkin-class distributed tracing system where services emit spans, collectors ingest them, storage indexes traces for query, and users search and inspect complete distributed traces across many services.
Step 1 - Normalize #
Assume the baseline prompt is:
- design a distributed tracing system
- applications emit spans with trace and parent relationships
- collectors ingest spans at high rate
- users search traces by service, operation, tags, and time range
- system reconstructs traces from stored spans
- storage scales across many services and tenants
Normalize into state-affecting paths.
| Requirement | Actor | Operation | State touched | Priority |
|---|---|---|---|---|
| Service emits span batch | Client | append event | S1create targetSpanRecord | C1 |
| Collector updates trace index / metadata | System | state transition | S1update targetTraceIndexState | C1 |
| User queries traces by filters | Client | read source | S1read source targetTraceIndexState | R1 |
| User fetches full trace by trace id | Client | read source | S1read source targetSpanRecord | R1 |
| System compacts / merges span storage segments | System | async process | S1hidden write targetTraceStorageBlockState | C1 |
| System applies retention / deletes expired traces | System | async process | S1hidden write targetTraceStorageBlockState | C1 |
| System routes tenant/shard to current owner | System | read source | S1read source targetPartitionMap | C1 |
| System reassigns shard ownership after node failure | System | state transition | S1update targetPartitionOwnership | C1 |
| User reads dashboards / status | Client | read projection | S1read projection targetTracingStatusView | R2 |
Notes on normalization #
Important choices:
- span ingestion is
append event- spans are immutable telemetry facts
- trace-index update is
state transition- searchable metadata evolves as new spans arrive for a trace
- trace fetch and trace search are read paths
- compaction and retention are explicit storage-lifecycle paths
- routing and ownership are explicit because this is distributed infra
This system is a hybrid of:
append-oriented telemetry ingestiontrace assembly by correlation idssearch/index over immutable span records
Step 2 - Critical Path Selection #
| Requirement | Priority class | Why |
|---|---|---|
| Emit / ingest spans | C1 | span telemetry is the primary product truth |
| Update trace index / metadata | C1 | trace search depends on correct trace-level metadata |
| Query traces by filters | R1 | core serving path |
| Fetch full trace by id | R1 | core serving path |
| Compact storage segments | C1 | storage lifecycle and query correctness depend on safe compaction |
| Apply retention / expiry | C1 | storage correctness and cost depend on safe deletion lifecycle |
| Route to shard owner | C1 | wrong routing can split authoritative index/storage ownership |
| Reassign shard ownership | C1 | failover must preserve ingestion/query/storage correctness |
| Dashboards / status | R2 | operational only |
Baseline critical paths #
Main C1 paths:
P1ingest spansP2update trace indexP3compact storage segmentsP4apply retention / expiryP5route to shard ownerP6reassign shard ownership
Main R1 paths:
P7query traces by filtersP8fetch full trace by id
This design is driven by:
- immutable span ingestion
- canonical trace correlation metadata
- search/index correctness
- safe storage lifecycle transitions
Step 3 - Primary State Extraction #
For a distributed tracing system, the minimal primary state is the immutable span record, current trace index metadata, storage block lifecycle state, and routing/ownership state.
| Candidate object label | Candidate source | Candidate needed for C1/R1? | Candidate decomposition action | Class | Primary? | Owner | Evolution | Scope kind | Scope value |
|---|---|---|---|---|---|---|---|---|---|
| SpanRecord | direct noun | Yes | keep as candidate | event | Yes | service | append-only | instance | trace_id + span_id |
| TraceIndexState | hidden write target | Yes | keep as candidate | entity | Yes | service | overwrite | instance | trace_id or search-key scope |
| TraceStorageBlockState | hidden write target | Yes | keep as candidate | process | Yes | service | state machine | instance | block_id |
| PartitionOwnership | hidden write target | Yes | keep as candidate | process | Yes | service | state machine | instance | shard_id |
| PartitionMap | hidden write target | Yes | keep as candidate | entity | Yes | service | overwrite | collection | tenant/shard map |
| TracingStatusView | derived read model | No | reject as UI artifact | projection | No | derived | overwrite | collection | tenant or cluster |
Important modeling choices #
SpanRecord #
Primary because:
- span ingestion is the core immutable telemetry fact
- includes trace id, span id, parent span id, timestamps, tags, and logs/events
TraceIndexState #
Primary because:
- trace search depends on searchable metadata
- captures service names, operation names, time range, duration bounds, tags/index references for a trace or query shard
TraceStorageBlockState #
Primary because:
- segment/head/block lifecycle determines retention, compaction, and queryability
PartitionOwnership / PartitionMap #
Needed because:
- collectors and queriers must consistently route ingestion and reads to authoritative storage/index owners
Minimal strict primary set #
The strongest minimal set is:
SpanRecordTraceIndexStateTraceStorageBlockStatePartitionOwnershipPartitionMap
Step 4 - Hard Invariants #
For a Jaeger/Zipkin-class tracing system, the hard invariants are about immutable span append semantics, correct trace correlation/indexing, safe storage compaction, and valid retention transitions.
| Path | Tier | Type | Invariant statement |
|---|---|---|---|
P1 ingest spans | HARD | uniqueness | Key (trace_id, span_id) maps to at most one logical outcome stored span record within trace scope. |
P1 ingest spans | HARD | accounting | Stored span fields preserve emitted trace correlation identifiers and timestamps for that span scope. |
P2 update trace index | HARD | accounting | TraceIndexState for a trace or search key reflects the union of authoritative stored spans relevant to that trace/search scope modulo retention policy. |
P3 compact storage segments | HARD | accounting | Compacted storage contents equal the union of authoritative source segments modulo dedup and retention rules, and source-to-destination transition preserves queryable truth. |
P4 apply retention / expiry | HARD | eligibility | Action expire_trace_data is valid only if affected spans/blocks are older than retention policy and current block state allows expiry at decision time. |
P5 route to shard owner | HARD | uniqueness | Key shard_id maps to at most one logical outcome current authoritative owner within shard_id. |
P6 reassign shard ownership | HARD | eligibility | Action reassign_shard is valid only if current owner is failed or relinquished and candidate owner is eligible and sufficiently current on shard_id at decision time. |
P7 query traces by filters | HARD | freshness | Query path reflects authoritative indexed trace metadata and stored spans within configured query consistency bound. |
P8 fetch full trace by id | HARD | freshness | Read path reflects authoritative stored spans for trace_id within configured query consistency bound. |
What matters most #
1. Spans are immutable #
Once accepted, a span record is an immutable telemetry fact.
2. Trace index must correspond to stored spans #
Search results must only point to traces that actually exist in storage.
3. Compaction and retention must preserve correctness #
Storage lifecycle operations cannot lose live data or surface expired data incorrectly.
4. Trace completeness is often eventually assembled #
A trace may arrive out of order or in partial batches, so query freshness is usually bounded, not instantaneous.
Step 5 - Execution Context #
For the baseline distributed tracing platform:
| Field | Value | Why |
|---|---|---|
| Topology | single service distributed | one logical tracing system spread across collectors, storage, and query nodes |
| Write coordination scope | per object scope | correctness is per span identity, trace/search index scope, block lifecycle, and shard ownership scope |
| Read consistency target | bounded stale allowed | trace assembly and search can tolerate small bounded ingest-to-query lag |
| Holder model | node | shard ownership is held by collector/storage nodes |
| Compensation acceptable? | No | wrong span storage or block retention cannot be safely repaired by compensation |
Derived implications #
holder_may_crash = true- collectors, indexers, or storage nodes can fail mid-ingestion
cross_service_write = false- baseline keeps span storage, index metadata, and ownership in one logical service
bounded_staleness_allowed = true- query paths can tolerate some bounded lag from ingestion to search visibility
cross_service_atomicity_required = false- no multi-service transaction across unrelated services in baseline
exclusive_claim_required = true- shard ownership must be exclusive
guarded_by_current_state = true- compaction and retention transitions depend on current block state
What this implies #
This pushes us toward:
- one authoritative owner per tenant or trace/search shard
- append-oriented write path for spans
- index updates maintained by the shard owner
- bounded-stale query/search visibility
Step 6 - Deterministic Mechanism Selection #
| Path | Write shape | Base mechanism | Required companions |
|---|---|---|---|
P1 ingest spans | append-only event | append log / span segment append | duplicate-span policy |
P2 update trace index | overwrite current value | single writer per shard or CAS on version | canonical trace/search key mapping |
P3 compact storage segments | guarded state transition | single-writer compaction with atomic segment swap | block manifest/version |
P4 apply retention / expiry | guarded state transition | single-writer retention transition | retention epoch/policy version |
P5 route to shard owner | exclusive claim | lease | fencing token, heartbeat |
P6 reassign shard ownership | guarded state transition | CAS on (state, version) | fencing token, shard catch-up check |
Why these fit #
Span ingestion #
Spans are immutable records, so append-only fits.
Trace index #
Current search metadata is current-value state maintained by the shard owner, so overwrite fits.
Compaction and retention #
These are storage-lifecycle changes that must preserve current query truth, so guarded transition fits.
Canonical substrate implied #
The baseline now points to:
- sharded collector / storage / query service
- one owner per tenant or trace/search shard
- append-oriented head or segment storage for spans
- trace/search index maintained over immutable span records
- safe compaction and retention on stored segments
Step 7 - Read Model / Source of Truth #
For a Jaeger/Zipkin-class system, truth is mostly direct source state. Dashboards and UIs are derived.
| Concept | Truth | Read path | Rebuild path |
|---|---|---|---|
C1 immutable span records | SpanRecord | read source directly | authoritative span store |
C2 trace/search metadata | TraceIndexState | read source directly | authoritative index metadata |
C3 storage segment lifecycle | TraceStorageBlockState | read source directly | authoritative block/segment manifest store |
C4 shard ownership | PartitionOwnership | read source directly | authoritative ownership store |
C5 shard routing map | PartitionMap | read source directly | authoritative routing metadata |
C6 dashboards / status | derived from spans, index state, and storage state | materialized view | recompute from authoritative state |
Important point #
For the core semantics:
- full-trace fetch reads authoritative stored spans
- trace search reads authoritative index metadata and then underlying spans as needed
- dashboards are projections
Step 8 - Failure Handling #
| Path | Retry | Competing writers | Crash after commit | Publish failure | Stale holder |
|---|---|---|---|---|---|
P1 ingest spans | client/collector retry may resend same spans; duplicate policy required | shard owner serializes authoritative storage/index updates for a shard | committed spans survive crash if WAL/segment append persisted | partial batch append recovered from WAL/segment replay | stale owner blocked by fencing token |
P2 update trace index | retry with metadata version | concurrent updates merge through canonical trace/search key mapping under shard owner | committed index metadata survives crash if persisted | index lag only delays search visibility | stale owner blocked by fencing token |
P3 compact storage segments | compaction can retry from source manifests | only one compactor should own a segment set at a time | if crash before atomic swap, source segments remain authoritative | partial destination segment ignored until manifest commit | stale compactor blocked by block manifest/version |
P4 apply retention / expiry | retention pass can retry safely | only current retention owner should transition current block state | committed retention survives crash if manifest persisted | partial deletion deferred until safe manifest transition | stale retention worker blocked by manifest/version |
P5 route to shard owner | retry after refreshing shard map | only one valid owner should exist | if owner changed, refreshed map points to new owner | n/a | stale owner rejected by fencing token |
P6 reassign shard ownership | retry failover transition safely | only one reassignment wins current ownership state | promoted owner crash triggers later reassignment | n/a | old owner fenced and must not continue serving |
P7 query traces by filters | query retry safe | many readers coexist | node crash drops query only | n/a | stale search bounded by configured query freshness |
P8 fetch full trace by id | read retry safe | many readers coexist | node crash drops query only | n/a | stale trace visibility bounded by configured query freshness |
What matters most #
1. Duplicate-span policy #
Collectors and agents may retry the same spans. The system needs a clear dedup policy for (trace_id, span_id).
2. Index lag is acceptable but bounded #
A span may be stored before its trace becomes fully searchable. That is usually acceptable if bounded.
3. Compaction and retention need atomic manifest transitions #
New compacted segments and expired segments should only change visibility when the manifest/state transition is fully committed.
Step 9 - Scale Adjustments #
| Hotspot | Type | First response |
|---|---|---|
| very high span ingestion rate | write throughput hotspot | shard by tenant, trace-id hash, or time bucket and add more collectors/ingesters |
| high-cardinality tag search | memory/read hotspot | constrain indexed tags and separate full-scan from indexed search |
| large trace fetch fan-in | read hotspot | colocate trace spans by trace id and cache recent traces |
| storage compaction bandwidth | write throughput hotspot | stagger compaction and separate hot ingest storage from cold segments |
| retention deletion storms | contention hotspot | batch retention by block and time bucket |
| dashboard/search load | read hotspot | add query frontends and caching for common searches |
What scales well #
This system scales by:
- sharding ingestion and storage by tenant and trace/search key
- keeping spans append-oriented
- colocating spans of the same trace when possible
- separating hot ingest segments from compacted storage
What fails first #
Usually:
- high-cardinality tag indexing
- search fanout across too many shards
- compaction IO bottlenecks
- huge traces causing skewed reads
Canonical design conclusion #
The mechanical outcome is:
- primary state:
SpanRecordTraceIndexStateTraceStorageBlockStatePartitionOwnershipPartitionMap
- critical invariants:
- immutable span append semantics
- canonical trace/search metadata reflecting stored spans
- safe segment compaction and retention preserving query truth
- exclusive shard ownership for ingestion/storage
- mechanisms:
append logsingle writer per shard- guarded compaction and retention transitions
- fenced shard ownership
- reads:
- direct authoritative reads for spans and trace/search metadata
- projections only for dashboards and cluster status
Polished interview answer #
I’d design the tracing system as a sharded collector-and-store service with one authoritative owner per tenant or trace/search shard. Applications emit immutable spans, collectors append them into durable storage, and shard owners maintain trace/search metadata so users can search by service, operation, tags, and time range. Full-trace fetches read stored spans directly by trace id, while search reads the index and then resolves matching traces. The storage layer periodically compacts segments and applies retention through guarded manifest transitions so query truth is preserved. The main scaling levers are more ingestion shards, careful limits on indexed tags, colocating spans by trace id, and separating hot ingest storage from compacted segments.
Concrete Substrate #
I’ll choose a sharded collector / storage / query system with append-oriented span segments plus secondary trace-search indexes as the concrete baseline, because it matches the mechanics we derived:
- append-only span ingestion
- current-value trace/search metadata
- safe compaction and retention lifecycle
- one owner per tenant or trace/search shard
Concrete tech family:
- collectors/query services in
GoorJava - per-shard WAL or append segments for spans
- compacted immutable segments in local disk or object storage
- secondary index store for trace search metadata
- metadata/control:
etcdor internal metadata quorum for shard ownership/routing
Each shard owner stores:
- current append segments / WAL for spans
- trace/search index metadata
- segment/block manifests
- retention and compaction state
Persisted segment store stores:
- immutable span segments
- segment manifests and indexes
Operation Layer #
1. Ingest span batch #
API
IngestSpans(span_batch)
Initiator
- application agent / collector client
Entry point
- collector / ingester
Authoritative decider
- shard owner for tenant/trace space
Precondition
- span batch parsed and routed to correct shard
Transition
- append
SpanRecords into WAL/segment - create or refresh
TraceIndexStateas needed
Response
- success / partial failure
2. Query traces by filters #
API
FindTraces(service, operation, tags, time_range, limit)
Initiator
- user/client
Entry point
- query frontend
Authoritative decider
- relevant shard owners plus query engine
Precondition
- none
Transition
- none
Response
- matching trace summaries / ids
3. Fetch full trace #
API
GetTrace(trace_id)
Initiator
- user/client
Entry point
- query frontend
Authoritative decider
- shard owner for
trace_id
Precondition
- none
Transition
- none
Response
- all spans for the trace
4. Compact storage segments #
API
- internal compaction loop
Initiator
- system
Entry point
- compactor / shard owner
Authoritative decider
- compaction owner for segment set
Precondition
- source segments eligible for compaction
Transition
- read source segments
- create destination compacted segment
- atomically update
TraceStorageBlockState/ manifest to swap visibility
5. Apply retention #
API
- internal retention loop
Initiator
- system
Entry point
- retention worker / shard owner
Authoritative decider
- retention owner for segment set
Precondition
- blocks older than retention horizon and current block state allows expiry
Transition
- mark expired segments as no longer queryable
- later delete physical storage
Entry Point vs Decider vs Responder #
| Path | Entry point | Authoritative decider | Physical responder | Logical responder |
|---|---|---|---|---|
| span ingest | collector / ingester | shard owner | ingester node | tracing system |
| trace search | query frontend | shard owners + query engine | query frontend | tracing system |
| full trace fetch | query frontend | shard owner for trace | query frontend | tracing system |
| compaction | compactor / shard owner | compaction owner | storage node | tracing system |
| retention | retention worker / shard owner | retention owner | storage node | tracing system |
| shard failover | follower / coordination layer | shard quorum / lease store | new leader / control plane | tracing system |
Concrete HLD #
Main components:
- collectors / ingestion frontends
- receive spans from applications or agents
- ingestion/storage shard owners
- authoritative owners for span segments, indexes, and block lifecycle
- query frontend
- handles search and trace-fetch APIs
- trace-search index
- supports filter-based lookup by service/operation/tags/time
- segment store
- stores immutable span segments
- metadata/control service
- tracks shard ownership and routing
Short Interview Version #
I’d build the tracing system as a sharded collector-and-store service with one authoritative owner per tenant or trace/search shard. Applications emit immutable spans, collectors append them into durable storage, and shard owners maintain trace/search metadata so users can search by service, operation, tags, and time range. Full-trace fetches read stored spans directly by trace id, while search reads the index and then resolves matching traces. The storage layer periodically compacts segments and applies retention through guarded manifest transitions so query truth is preserved. The main scaling levers are more ingestion shards, careful limits on indexed tags, colocating spans by trace id, and separating hot ingest storage from compacted segments.