Distributed Tracing System (Jaeger / Zipkin-class) #

This note models a Jaeger/Zipkin-class distributed tracing system where services emit spans, collectors ingest them, storage indexes traces for query, and users search and inspect complete distributed traces across many services.


Step 1 - Normalize #

Assume the baseline prompt is:

  • design a distributed tracing system
  • applications emit spans with trace and parent relationships
  • collectors ingest spans at high rate
  • users search traces by service, operation, tags, and time range
  • system reconstructs traces from stored spans
  • storage scales across many services and tenants

Normalize into state-affecting paths.

| Requirement | Actor | Operation | State touched | Priority |
|---|---|---|---|---|
| Service emits span batch | Client | append event | S1: create target SpanRecord | C1 |
| Collector updates trace index / metadata | System | state transition | S1: update target TraceIndexState | C1 |
| User queries traces by filters | Client | read source | S1: read source target TraceIndexState | R1 |
| User fetches full trace by trace id | Client | read source | S1: read source target SpanRecord | R1 |
| System compacts / merges span storage segments | System | async process | S1: hidden write target TraceStorageBlockState | C1 |
| System applies retention / deletes expired traces | System | async process | S1: hidden write target TraceStorageBlockState | C1 |
| System routes tenant/shard to current owner | System | read source | S1: read source target PartitionMap | C1 |
| System reassigns shard ownership after node failure | System | state transition | S1: update target PartitionOwnership | C1 |
| User reads dashboards / status | Client | read projection | S1: read projection target TracingStatusView | R2 |

Notes on normalization #

Important choices:

  • span ingestion is append event
    • spans are immutable telemetry facts
  • trace-index update is state transition
    • searchable metadata evolves as new spans arrive for a trace
  • trace fetch and trace search are read paths
  • compaction and retention are explicit storage-lifecycle paths
  • routing and ownership are explicit because this is distributed infra

This system is a hybrid of:

  • append-oriented telemetry ingestion
  • trace assembly by correlation ids
  • search/index over immutable span records

Step 2 - Critical Path Selection #

| Requirement | Priority class | Why |
|---|---|---|
| Emit / ingest spans | C1 | span telemetry is the primary product truth |
| Update trace index / metadata | C1 | trace search depends on correct trace-level metadata |
| Query traces by filters | R1 | core serving path |
| Fetch full trace by id | R1 | core serving path |
| Compact storage segments | C1 | storage lifecycle and query correctness depend on safe compaction |
| Apply retention / expiry | C1 | storage correctness and cost depend on safe deletion lifecycle |
| Route to shard owner | C1 | wrong routing can split authoritative index/storage ownership |
| Reassign shard ownership | C1 | failover must preserve ingestion/query/storage correctness |
| Dashboards / status | R2 | operational only |

Baseline critical paths #

Main C1 paths:

  • P1 ingest spans
  • P2 update trace index
  • P3 compact storage segments
  • P4 apply retention / expiry
  • P5 route to shard owner
  • P6 reassign shard ownership

Main R1 paths:

  • P7 query traces by filters
  • P8 fetch full trace by id

This design is driven by:

  • immutable span ingestion
  • canonical trace correlation metadata
  • search/index correctness
  • safe storage lifecycle transitions

Step 3 - Primary State Extraction #

For a distributed tracing system, the minimal primary state is the immutable span record, current trace index metadata, storage block lifecycle state, and routing/ownership state.

| Candidate object label | Candidate source | Candidate needed for C1/R1? | Candidate decomposition action | Class | Primary? | Owner | Evolution | Scope kind | Scope value |
|---|---|---|---|---|---|---|---|---|---|
| SpanRecord | direct noun | Yes | keep as candidate | event | Yes | service | append-only | instance | trace_id + span_id |
| TraceIndexState | hidden write target | Yes | keep as candidate | entity | Yes | service | overwrite | instance | trace_id or search-key scope |
| TraceStorageBlockState | hidden write target | Yes | keep as candidate | process | Yes | service | state machine | instance | block_id |
| PartitionOwnership | hidden write target | Yes | keep as candidate | process | Yes | service | state machine | instance | shard_id |
| PartitionMap | hidden write target | Yes | keep as candidate | entity | Yes | service | overwrite | collection | tenant/shard map |
| TracingStatusView | derived read model | No | reject as UI artifact | projection | No | derived | overwrite | collection | tenant or cluster |

Important modeling choices #

SpanRecord #

Primary because:

  • span ingestion is the core immutable telemetry fact
  • includes trace id, span id, parent span id, timestamps, tags, and logs/events
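As a rough illustration of the fields listed above, a span record could be modeled as a frozen dataclass. This is only a sketch: the field names, the microsecond units, and the tuple encoding of tags and logs are assumptions made for the example, not a real Jaeger or Zipkin wire format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SpanRecord:
    """Immutable span fact; field names are illustrative, not a real wire format."""
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]  # None marks a root span
    service: str
    operation: str
    start_us: int      # assumed unit: microseconds since epoch
    duration_us: int
    tags: tuple = ()   # (key, value) pairs, kept as a tuple so the record stays hashable
    logs: tuple = ()   # (timestamp_us, message) events

    @property
    def key(self) -> tuple:
        # Identity used for uniqueness within the trace scope.
        return (self.trace_id, self.span_id)


span = SpanRecord("t1", "s1", None, "checkout", "POST /pay", 1_000, 250,
                  tags=(("http.status", "200"),))
```

Freezing the dataclass mirrors the append-only evolution: any attempt to assign to a field after construction raises an error.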

TraceIndexState #

Primary because:

  • trace search depends on searchable metadata
  • captures service names, operation names, time range, duration bounds, tags/index references for a trace or query shard
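One way to picture this metadata is a per-trace summary that is overwritten as each new span arrives. The sketch below assumes a dict-shaped span with hypothetical keys (`service`, `operation`, `start_us`, `duration_us`, `tags`) and assumes duplicates were already filtered out before `apply` is called.

```python
from dataclasses import dataclass, field
from typing import Set, Tuple


@dataclass
class TraceIndexState:
    """Current-value search metadata for one trace; overwritten as spans arrive."""
    trace_id: str
    services: Set[str] = field(default_factory=set)
    operations: Set[str] = field(default_factory=set)
    tags: Set[Tuple[str, str]] = field(default_factory=set)
    min_start_us: int = 0
    max_end_us: int = 0
    span_count: int = 0
    version: int = 0   # bumped on every update so writers can detect staleness

    def apply(self, span: dict) -> None:
        end_us = span["start_us"] + span["duration_us"]
        if self.span_count == 0:
            self.min_start_us, self.max_end_us = span["start_us"], end_us
        else:
            self.min_start_us = min(self.min_start_us, span["start_us"])
            self.max_end_us = max(self.max_end_us, end_us)
        self.services.add(span["service"])
        self.operations.add(span["operation"])
        self.tags.update(span.get("tags", {}).items())
        self.span_count += 1
        self.version += 1


idx = TraceIndexState("t1")
idx.apply({"service": "api", "operation": "GET /", "start_us": 100, "duration_us": 50,
           "tags": {"region": "eu"}})
idx.apply({"service": "db", "operation": "query", "start_us": 110, "duration_us": 20})
```

Because every field is a set union or a min/max, applying the same spans in a different arrival order yields the same summary, which suits out-of-order ingestion.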

TraceStorageBlockState #

Primary because:

  • segment/head/block lifecycle determines retention, compaction, and queryability

PartitionOwnership / PartitionMap #

Needed because:

  • collectors and queriers must consistently route ingestion and reads to authoritative storage/index owners

Minimal strict primary set #

The strongest minimal set is:

  • SpanRecord
  • TraceIndexState
  • TraceStorageBlockState
  • PartitionOwnership
  • PartitionMap

Step 4 - Hard Invariants #

For a Jaeger/Zipkin-class tracing system, the hard invariants are about immutable span append semantics, correct trace correlation/indexing, safe storage compaction, and valid retention transitions.

| Path | Tier | Type | Invariant statement |
|---|---|---|---|
| P1 ingest spans | HARD | uniqueness | Key (trace_id, span_id) maps to at most one stored span record within the trace scope. |
| P1 ingest spans | HARD | accounting | Stored span fields preserve the emitted trace correlation identifiers and timestamps for that span. |
| P2 update trace index | HARD | accounting | TraceIndexState for a trace or search key reflects the union of authoritative stored spans relevant to that trace/search scope, modulo retention policy. |
| P3 compact storage segments | HARD | accounting | Compacted storage contents equal the union of the authoritative source segments, modulo dedup and retention rules, and the source-to-destination transition preserves queryable truth. |
| P4 apply retention / expiry | HARD | eligibility | Action expire_trace_data is valid only if the affected spans/blocks are older than the retention policy and the current block state allows expiry at decision time. |
| P5 route to shard owner | HARD | uniqueness | Each shard_id maps to at most one current authoritative owner. |
| P6 reassign shard ownership | HARD | eligibility | Action reassign_shard is valid only if the current owner has failed or relinquished ownership and the candidate owner is eligible and sufficiently current on shard_id at decision time. |
| P7 query traces by filters | HARD | freshness | Query path reflects authoritative indexed trace metadata and stored spans within the configured query consistency bound. |
| P8 fetch full trace by id | HARD | freshness | Read path reflects authoritative stored spans for trace_id within the configured query consistency bound. |

What matters most #

1. Spans are immutable #

Once accepted, a span record is an immutable telemetry fact.

2. Trace index must correspond to stored spans #

Search results must only point to traces that actually exist in storage.

3. Compaction and retention must preserve correctness #

Storage lifecycle operations cannot lose live data or surface expired data incorrectly.

4. Trace completeness is usually eventual #

A trace may arrive out of order or in partial batches, so query freshness is usually bounded, not instantaneous.
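To make the partial-trace case concrete, the sketch below rebuilds the parent/child structure from whatever spans are currently stored and reports spans whose parent has not arrived yet. The dict keys and the `orphans` notion are assumptions for illustration only.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def assemble_trace(spans: List[dict]) -> dict:
    """Group spans into a parent -> children map and report roots and orphans."""
    by_id = {s["span_id"]: s for s in spans}
    children: Dict[str, List[str]] = defaultdict(list)
    roots: List[str] = []
    orphans: List[str] = []   # parent not (yet) stored: the trace is still partial
    for s in spans:
        parent: Optional[str] = s.get("parent_span_id")
        if parent is None:
            roots.append(s["span_id"])
        elif parent in by_id:
            children[parent].append(s["span_id"])
        else:
            orphans.append(s["span_id"])
    return {"roots": roots, "children": dict(children), "orphans": orphans}


partial = assemble_trace([
    {"span_id": "c", "parent_span_id": "b"},      # arrives before its parent
    {"span_id": "a", "parent_span_id": None},
])
complete = assemble_trace([
    {"span_id": "c", "parent_span_id": "b"},
    {"span_id": "a", "parent_span_id": None},
    {"span_id": "b", "parent_span_id": "a"},
])
```

A query layer could return the partial view immediately and let a later fetch pick up the missing parent once it lands, which is what a bounded freshness target permits.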


Step 5 - Execution Context #

For the baseline distributed tracing platform:

| Field | Value | Why |
|---|---|---|
| Topology | single service distributed | one logical tracing system spread across collectors, storage, and query nodes |
| Write coordination scope | per object scope | correctness is per span identity, trace/search index scope, block lifecycle, and shard ownership scope |
| Read consistency target | bounded stale allowed | trace assembly and search can tolerate small bounded ingest-to-query lag |
| Holder model | node | shard ownership is held by collector/storage nodes |
| Compensation acceptable? | No | wrong span storage or block retention cannot be safely repaired by compensation |

Derived implications #

  • holder_may_crash = true

    • collectors, indexers, or storage nodes can fail mid-ingestion
  • cross_service_write = false

    • baseline keeps span storage, index metadata, and ownership in one logical service
  • bounded_staleness_allowed = true

    • query paths can tolerate some bounded lag from ingestion to search visibility
  • cross_service_atomicity_required = false

    • no multi-service transaction across unrelated services in baseline
  • exclusive_claim_required = true

    • shard ownership must be exclusive
  • guarded_by_current_state = true

    • compaction and retention transitions depend on current block state

What this implies #

This pushes us toward:

  • one authoritative owner per tenant or trace/search shard
  • append-oriented write path for spans
  • index updates maintained by the shard owner
  • bounded-stale query/search visibility
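A minimal sketch of the "one authoritative owner per shard" idea: hash the tenant and trace id to a shard, then look the shard up in the partition map. The fixed shard count, the SHA-256 choice, and the node names are all assumptions for the example.

```python
import hashlib
from typing import Dict

NUM_SHARDS = 8  # assumed fixed shard count for the sketch


def shard_for(tenant: str, trace_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Stable hash of (tenant, trace_id) so every span of a trace lands on one shard."""
    digest = hashlib.sha256(f"{tenant}/{trace_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def route(partition_map: Dict[int, str], tenant: str, trace_id: str) -> str:
    """Look up the current owner node; a KeyError means the map is stale or incomplete."""
    return partition_map[shard_for(tenant, trace_id)]


partition_map = {shard: f"node-{shard % 3}" for shard in range(NUM_SHARDS)}
owner = route(partition_map, "acme", "trace-42")
```

Hashing on the trace id rather than the span id keeps all spans of a trace on one owner, at the cost of skew when a single trace is very large.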

Step 6 - Deterministic Mechanism Selection #

| Path | Write shape | Base mechanism | Required companions |
|---|---|---|---|
| P1 ingest spans | append-only event | append log / span segment append | duplicate-span policy |
| P2 update trace index | overwrite current value | single writer per shard or CAS on version | canonical trace/search key mapping |
| P3 compact storage segments | guarded state transition | single-writer compaction with atomic segment swap | block manifest/version |
| P4 apply retention / expiry | guarded state transition | single-writer retention transition | retention epoch/policy version |
| P5 route to shard owner | exclusive claim | lease | fencing token, heartbeat |
| P6 reassign shard ownership | guarded state transition | CAS on (state, version) | fencing token, shard catch-up check |
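The lease-plus-fencing-token pairing for P5 and P6 can be sketched in a few lines. This is an in-memory toy under stated assumptions: `ShardLease` and `FencedStore` are hypothetical names, time is passed in explicitly, and the shard catch-up check is left out.

```python
from typing import Optional


class ShardLease:
    """Single-shard ownership record; the token grows on every reassignment."""

    def __init__(self) -> None:
        self.owner: Optional[str] = None
        self.token: int = 0          # fencing token
        self.expires_at: float = 0.0

    def acquire(self, node: str, now: float, ttl: float) -> Optional[int]:
        """Grant the lease if it is free or expired, or renew it for the holder."""
        if self.owner not in (None, node) and now < self.expires_at:
            return None              # another node still holds a live lease
        if self.owner != node:
            self.token += 1          # a new owner gets a strictly larger token
        self.owner, self.expires_at = node, now + ttl
        return self.token


class FencedStore:
    """Storage side: rejects writes carrying a token older than the newest seen."""

    def __init__(self) -> None:
        self.highest_token = 0
        self.writes = []

    def write(self, token: int, payload: str) -> bool:
        if token < self.highest_token:
            return False             # stale owner is fenced off
        self.highest_token = token
        self.writes.append(payload)
        return True


lease, store = ShardLease(), FencedStore()
t_a = lease.acquire("node-a", now=0.0, ttl=10.0)     # node-a owns the shard
denied = lease.acquire("node-b", now=5.0, ttl=10.0)  # lease still live
t_b = lease.acquire("node-b", now=11.0, ttl=10.0)    # expired, node-b takes over
ok_new = store.write(t_b, "span from node-b")
ok_old = store.write(t_a, "late span from node-a")
```

The lease alone is not enough, because an old owner may not notice that its lease expired. The token check on the storage side is what actually stops its late writes.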

Why these fit #

Span ingestion #

Spans are immutable records, so append-only fits.

Trace index #

Current search metadata is current-value state maintained by the shard owner, so overwrite fits.

Compaction and retention #

These are storage-lifecycle changes that must preserve current query truth, so guarded transition fits.

Canonical substrate implied #

The baseline now points to:

  • sharded collector / storage / query service
  • one owner per tenant or trace/search shard
  • append-oriented head or segment storage for spans
  • trace/search index maintained over immutable span records
  • safe compaction and retention on stored segments

Step 7 - Read Model / Source of Truth #

For a Jaeger/Zipkin-class system, truth is mostly direct source state. Dashboards and UIs are derived.

| Concept | Truth | Read path | Rebuild path |
|---|---|---|---|
| C1 immutable span records | SpanRecord | read source directly | authoritative span store |
| C2 trace/search metadata | TraceIndexState | read source directly | authoritative index metadata |
| C3 storage segment lifecycle | TraceStorageBlockState | read source directly | authoritative block/segment manifest store |
| C4 shard ownership | PartitionOwnership | read source directly | authoritative ownership store |
| C5 shard routing map | PartitionMap | read source directly | authoritative routing metadata |
| C6 dashboards / status | derived from spans, index state, and storage state | materialized view | recompute from authoritative state |

Important point #

For the core semantics:

  • full-trace fetch reads authoritative stored spans
  • trace search reads authoritative index metadata and then underlying spans as needed
  • dashboards are projections
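To show what "projection" means here, the sketch below recomputes a status view purely from authoritative inputs. The view's shape and the dict keys are hypothetical, chosen only to illustrate that the view can be dropped and rebuilt at any time.

```python
from collections import Counter
from typing import Dict, List


def build_status_view(spans: List[dict], blocks: List[dict]) -> Dict[str, dict]:
    """Recompute the dashboard projection from authoritative state; it holds no truth itself."""
    return {
        "spans_per_service": dict(Counter(s["service"] for s in spans)),
        "blocks_per_state": dict(Counter(b["state"] for b in blocks)),
    }


view = build_status_view(
    spans=[{"service": "api"}, {"service": "api"}, {"service": "db"}],
    blocks=[{"state": "active"}, {"state": "expired"}],
)
```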

Step 8 - Failure Handling #

| Path | Retry | Competing writers | Crash after commit | Publish failure | Stale holder |
|---|---|---|---|---|---|
| P1 ingest spans | client/collector retry may resend same spans; duplicate policy required | shard owner serializes authoritative storage/index updates for a shard | committed spans survive crash if WAL/segment append persisted | partial batch append recovered from WAL/segment replay | stale owner blocked by fencing token |
| P2 update trace index | retry with metadata version | concurrent updates merge through canonical trace/search key mapping under shard owner | committed index metadata survives crash if persisted | index lag only delays search visibility | stale owner blocked by fencing token |
| P3 compact storage segments | compaction can retry from source manifests | only one compactor should own a segment set at a time | if crash before atomic swap, source segments remain authoritative | partial destination segment ignored until manifest commit | stale compactor blocked by block manifest/version |
| P4 apply retention / expiry | retention pass can retry safely | only current retention owner should transition current block state | committed retention survives crash if manifest persisted | partial deletion deferred until safe manifest transition | stale retention worker blocked by manifest/version |
| P5 route to shard owner | retry after refreshing shard map | only one valid owner should exist | if owner changed, refreshed map points to new owner | n/a | stale owner rejected by fencing token |
| P6 reassign shard ownership | retry failover transition safely | only one reassignment wins current ownership state | promoted owner crash triggers later reassignment | n/a | old owner fenced and must not continue serving |
| P7 query traces by filters | query retry safe | many readers coexist | node crash drops query only | n/a | stale search bounded by configured query freshness |
| P8 fetch full trace by id | read retry safe | many readers coexist | node crash drops query only | n/a | stale trace visibility bounded by configured query freshness |

What matters most #

1. Duplicate-span policy #

Collectors and agents may retry the same spans. The system needs a clear dedup policy for (trace_id, span_id).
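One possible policy is first-write-wins: a resent span with an already-stored key is acknowledged but not appended again. The sketch below assumes an in-memory segment and a hypothetical `SpanSegment` class. A real store would persist both the log and the seen-key set, or bound the latter with a time window.

```python
from typing import Dict, List, Tuple

SpanKey = Tuple[str, str]  # (trace_id, span_id)


class SpanSegment:
    """In-memory stand-in for an append-only span segment owned by one shard."""

    def __init__(self) -> None:
        self._log: List[dict] = []            # append-only list of accepted spans
        self._seen: Dict[SpanKey, int] = {}   # key -> offset in the log

    def ingest(self, batch: List[dict]) -> Tuple[int, int]:
        """Append new spans and skip keys already stored. Returns (accepted, duplicates)."""
        accepted = duplicates = 0
        for span in batch:
            key = (span["trace_id"], span["span_id"])
            if key in self._seen:
                duplicates += 1   # first-write-wins: the retry is not re-stored
                continue
            self._seen[key] = len(self._log)
            self._log.append(span)
            accepted += 1
        return accepted, duplicates

    def __len__(self) -> int:
        return len(self._log)


seg = SpanSegment()
batch = [{"trace_id": "t1", "span_id": "a"}, {"trace_id": "t1", "span_id": "b"}]
first = seg.ingest(batch)
retry = seg.ingest(batch)   # simulated client retry of the same batch
```

Deduplicating at ingest keeps the index update simple, since each span is applied to the trace metadata exactly once. An alternative is to store duplicates and drop them during compaction.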

2. Index lag is acceptable but bounded #

A span may be stored before its trace becomes fully searchable. That is usually acceptable if bounded.

3. Compaction and retention need atomic manifest transitions #

New compacted segments and expired segments should only change visibility when the manifest/state transition is fully committed.
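A compact way to express this is a versioned manifest with a compare-and-swap commit: the merged segment is written first and becomes visible only when the swap succeeds. Everything here (`Manifest`, `compact`, segments holding bare span keys) is a simplified assumption for illustration.

```python
from typing import Dict, FrozenSet, List, Tuple


class Manifest:
    """Versioned set of queryable segment ids; readers only see committed state."""

    def __init__(self, segments: List[str]) -> None:
        self.version = 0
        self.visible: FrozenSet[str] = frozenset(segments)

    def swap(self, expected_version: int, remove: List[str], add: str) -> bool:
        """Compare-and-swap: commit only if nobody changed the manifest meanwhile."""
        if expected_version != self.version or not set(remove) <= self.visible:
            return False
        self.visible = (self.visible - set(remove)) | {add}
        self.version += 1
        return True


def compact(store: Dict[str, List[Tuple[str, str]]], manifest: Manifest,
            sources: List[str], dest: str) -> bool:
    seen_version = manifest.version
    merged = sorted({key for seg in sources for key in store[seg]})  # dedup by span key
    store[dest] = merged        # destination stays invisible until the swap commits
    return manifest.swap(seen_version, sources, dest)


store = {"seg1": [("t1", "a"), ("t1", "b")], "seg2": [("t1", "b"), ("t2", "c")]}
manifest = Manifest(["seg1", "seg2"])
committed = compact(store, manifest, ["seg1", "seg2"], "seg3")
stale = manifest.swap(0, ["seg1"], "seg4")   # a stale compactor still using version 0
```

If the compactor crashes before `swap`, the source segments remain the visible set and the half-written destination can be discarded on restart.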


Step 9 - Scale Adjustments #

| Hotspot | Type | First response |
|---|---|---|
| very high span ingestion rate | write throughput hotspot | shard by tenant, trace-id hash, or time bucket and add more collectors/ingesters |
| high-cardinality tag search | memory/read hotspot | constrain indexed tags and separate full-scan from indexed search |
| large trace fetch fan-in | read hotspot | colocate trace spans by trace id and cache recent traces |
| storage compaction bandwidth | write throughput hotspot | stagger compaction and separate hot ingest storage from cold segments |
| retention deletion storms | contention hotspot | batch retention by block and time bucket |
| dashboard/search load | read hotspot | add query frontends and caching for common searches |
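The "constrain indexed tags" response can be as simple as an allowlist applied when the inverted index is built. The allowlist contents and the function name below are assumptions for the sketch. Tags outside the list remain on the stored span but are only reachable by scanning.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple

INDEXED_TAGS = {"http.status_code", "error", "region"}   # assumed allowlist


def index_tags(inverted: Dict[Tuple[str, str], Set[str]], trace_id: str,
               tags: Dict[str, str]) -> int:
    """Add only allowlisted tags to the inverted index; return how many were skipped."""
    skipped = 0
    for key, value in tags.items():
        if key not in INDEXED_TAGS:
            skipped += 1        # e.g. user ids or request ids stay scan-only
            continue
        inverted[(key, value)].add(trace_id)
    return skipped


inverted = defaultdict(set)
skipped = index_tags(inverted, "t1", {"region": "eu", "user.id": "u-981", "error": "true"})
```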

What scales well #

This system scales by:

  • sharding ingestion and storage by tenant and trace/search key
  • keeping spans append-oriented
  • colocating spans of the same trace when possible
  • separating hot ingest segments from compacted storage

What fails first #

Usually:

  • high-cardinality tag indexing
  • search fanout across too many shards
  • compaction IO bottlenecks
  • huge traces causing skewed reads

Canonical design conclusion #

The mechanical outcome is:

  • primary state:
    • SpanRecord
    • TraceIndexState
    • TraceStorageBlockState
    • PartitionOwnership
    • PartitionMap
  • critical invariants:
    • immutable span append semantics
    • canonical trace/search metadata reflecting stored spans
    • safe segment compaction and retention preserving query truth
    • exclusive shard ownership for ingestion/storage
  • mechanisms:
    • append log
    • single writer per shard
    • guarded compaction and retention transitions
    • fenced shard ownership
  • reads:
    • direct authoritative reads for spans and trace/search metadata
    • projections only for dashboards and cluster status

Polished interview answer #

I’d design the tracing system as a sharded collector-and-store service with one authoritative owner per tenant or trace/search shard. Applications emit immutable spans, collectors append them into durable storage, and shard owners maintain trace/search metadata so users can search by service, operation, tags, and time range. Full-trace fetches read stored spans directly by trace id, while search reads the index and then resolves matching traces. The storage layer periodically compacts segments and applies retention through guarded manifest transitions so query truth is preserved. The main scaling levers are more ingestion shards, careful limits on indexed tags, colocating spans by trace id, and separating hot ingest storage from compacted segments.


Concrete Substrate #

I’ll choose a sharded collector / storage / query system with append-oriented span segments plus secondary trace-search indexes as the concrete baseline, because it matches the mechanics we derived:

  • append-only span ingestion
  • current-value trace/search metadata
  • safe compaction and retention lifecycle
  • one owner per tenant or trace/search shard

Concrete tech family:

  • collectors/query services in Go or Java
  • per-shard WAL or append segments for spans
  • compacted immutable segments on local disk or in object storage
  • secondary index store for trace search metadata
  • metadata/control:
    • etcd or internal metadata quorum for shard ownership/routing

Each shard owner stores:

  • current append segments / WAL for spans
  • trace/search index metadata
  • segment/block manifests
  • retention and compaction state

Persisted segment store stores:

  • immutable span segments
  • segment manifests and indexes

Operation Layer #

1. Ingest span batch #

API

  • IngestSpans(span_batch)

Initiator

  • application agent / collector client

Entry point

  • collector / ingester

Authoritative decider

  • shard owner for tenant/trace space

Precondition

  • span batch parsed and routed to correct shard

Transition

  • append SpanRecords into WAL/segment
  • create or refresh TraceIndexState as needed

Response

  • success / partial failure

2. Query traces by filters #

API

  • FindTraces(service, operation, tags, time_range, limit)

Initiator

  • user/client

Entry point

  • query frontend

Authoritative decider

  • relevant shard owners plus query engine

Precondition

  • none

Transition

  • none

Response

  • matching trace summaries / ids
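A hedged sketch of how the filter evaluation could look against per-trace index summaries. The summary shape (`services`, `operations`, `tags`, `start_us`, `end_us`) is assumed for the example, and a real query engine would use inverted indexes rather than a linear scan.

```python
from typing import Dict, List, Optional, Tuple


def find_traces(index: List[dict], service: str, operation: Optional[str] = None,
                tags: Optional[Dict[str, str]] = None,
                time_range: Optional[Tuple[int, int]] = None, limit: int = 20) -> List[str]:
    """Return trace ids whose index summary matches every supplied filter."""
    matches = []
    for summary in index:
        if service not in summary["services"]:
            continue
        if operation is not None and operation not in summary["operations"]:
            continue
        if tags and any(summary["tags"].get(k) != v for k, v in tags.items()):
            continue
        if time_range is not None:
            lo, hi = time_range
            # Keep traces whose [start, end] interval overlaps the query window.
            if summary["end_us"] < lo or summary["start_us"] > hi:
                continue
        matches.append(summary)
    matches.sort(key=lambda s: s["start_us"], reverse=True)   # newest first
    return [s["trace_id"] for s in matches[:limit]]


index = [
    {"trace_id": "t1", "services": {"api", "db"}, "operations": {"GET /"},
     "tags": {"region": "eu"}, "start_us": 100, "end_us": 200},
    {"trace_id": "t2", "services": {"api"}, "operations": {"POST /pay"},
     "tags": {"region": "us"}, "start_us": 300, "end_us": 400},
]
recent_api = find_traces(index, "api")
```

The function returns only ids and reads no spans, matching the split where search resolves candidates from the index and a separate fetch loads the full trace.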

3. Fetch full trace #

API

  • GetTrace(trace_id)

Initiator

  • user/client

Entry point

  • query frontend

Authoritative decider

  • shard owner for trace_id

Precondition

  • none

Transition

  • none

Response

  • all spans for the trace

4. Compact storage segments #

API

  • internal compaction loop

Initiator

  • system

Entry point

  • compactor / shard owner

Authoritative decider

  • compaction owner for segment set

Precondition

  • source segments eligible for compaction

Transition

  • read source segments
  • create destination compacted segment
  • atomically update TraceStorageBlockState / manifest to swap visibility

5. Apply retention #

API

  • internal retention loop

Initiator

  • system

Entry point

  • retention worker / shard owner

Authoritative decider

  • retention owner for segment set

Precondition

  • blocks older than retention horizon and current block state allows expiry

Transition

  • mark expired segments as no longer queryable
  • later delete physical storage
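The precondition and the two-step transition above can be sketched as a small guarded state machine. The state names, the `Block` fields, and the use of seconds are assumptions made for the example.

```python
from dataclasses import dataclass

ACTIVE, EXPIRED, DELETED = "active", "expired", "deleted"  # assumed block states


@dataclass
class Block:
    block_id: str
    max_end_ts: float      # newest span timestamp in the block (seconds)
    state: str = ACTIVE


def expire_block(block: Block, now: float, retention_s: float) -> bool:
    """Mark a block non-queryable only if it is active and entirely past retention."""
    if block.state != ACTIVE:
        return False                      # already expired or deleted
    if block.max_end_ts >= now - retention_s:
        return False                      # still holds data inside the retention window
    block.state = EXPIRED
    return True


def delete_block(block: Block) -> bool:
    """Physical deletion is a separate, later step and only follows EXPIRED."""
    if block.state != EXPIRED:
        return False
    block.state = DELETED
    return True


old = Block("b1", max_end_ts=100.0)
fresh = Block("b2", max_end_ts=950.0)
results = [expire_block(b, now=1000.0, retention_s=600.0) for b in (old, fresh)]
```

Checking the newest timestamp in the block, not the oldest, is what prevents a block from being expired while it still contains live data. Both functions are safe to retry because a repeated call finds the guard already false.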

Entry Point vs Decider vs Responder #

| Path | Entry point | Authoritative decider | Physical responder | Logical responder |
|---|---|---|---|---|
| span ingest | collector / ingester | shard owner | ingester node | tracing system |
| trace search | query frontend | shard owners + query engine | query frontend | tracing system |
| full trace fetch | query frontend | shard owner for trace | query frontend | tracing system |
| compaction | compactor / shard owner | compaction owner | storage node | tracing system |
| retention | retention worker / shard owner | retention owner | storage node | tracing system |
| shard failover | follower / coordination layer | shard quorum / lease store | new leader / control plane | tracing system |

Concrete HLD #

Main components:

  • collectors / ingestion frontends
    • receive spans from applications or agents
  • ingestion/storage shard owners
    • authoritative owners for span segments, indexes, and block lifecycle
  • query frontend
    • handles search and trace-fetch APIs
  • trace-search index
    • supports filter-based lookup by service/operation/tags/time
  • segment store
    • stores immutable span segments
  • metadata/control service
    • tracks shard ownership and routing

Short Interview Version #

I’d build the tracing system as a sharded collector-and-store service with one authoritative owner per tenant or trace/search shard. Collectors append immutable spans to durable storage, shard owners keep the trace/search index current, and compaction and retention change visibility only through guarded manifest transitions. It scales by adding ingestion shards, limiting indexed tags, colocating spans by trace id, and separating hot ingest storage from compacted segments.