Distributed Tracing System (Jaeger / Zipkin-class) #

This note models a Jaeger/Zipkin-class distributed tracing system where services emit spans, collectors ingest them, storage indexes traces for query, and users search and inspect complete distributed traces across many services.

Step 1 - Normalize #

Assume the baseline prompt is:

design a distributed tracing system
applications emit spans with trace and parent relationships
collectors ingest spans at high rate
users search traces by service, operation, tags, and time range
system reconstructs traces from stored spans
storage scales across many services and tenants

Normalize into state-affecting paths.

Requirement	Actor	Operation	State touched	Priority
Service emits span batch	Client	append event	`S1` `create target` `SpanRecord`	C1
Collector updates trace index / metadata	System	state transition	`S1` `update target` `TraceIndexState`	C1
User queries traces by filters	Client	read source	`S1` `read source target` `TraceIndexState`	R1
User fetches full trace by trace id	Client	read source	`S1` `read source target` `SpanRecord`	R1
System compacts / merges span storage segments	System	async process	`S1` `hidden write target` `TraceStorageBlockState`	C1
System applies retention / deletes expired traces	System	async process	`S1` `hidden write target` `TraceStorageBlockState`	C1
System routes tenant/shard to current owner	System	read source	`S1` `read source target` `PartitionMap`	C1
System reassigns shard ownership after node failure	System	state transition	`S1` `update target` `PartitionOwnership`	C1
User reads dashboards / status	Client	read projection	`S1` `read projection target` `TracingStatusView`	R2

Notes on normalization #

Important choices:

span ingestion is append event
- spans are immutable telemetry facts
trace-index update is state transition
- searchable metadata evolves as new spans arrive for a trace
trace fetch and trace search are read paths
compaction and retention are explicit storage-lifecycle paths
routing and ownership are explicit because this is distributed infra

This system is a hybrid of:

append-oriented telemetry ingestion
trace assembly by correlation ids
search/index over immutable span records

Step 2 - Critical Path Selection #

Requirement	Priority class	Why
Emit / ingest spans	C1	span telemetry is the primary product truth
Update trace index / metadata	C1	trace search depends on correct trace-level metadata
Query traces by filters	R1	core serving path
Fetch full trace by id	R1	core serving path
Compact storage segments	C1	storage lifecycle and query correctness depend on safe compaction
Apply retention / expiry	C1	storage correctness and cost depend on safe deletion lifecycle
Route to shard owner	C1	wrong routing can split authoritative index/storage ownership
Reassign shard ownership	C1	failover must preserve ingestion/query/storage correctness
Dashboards / status	R2	operational only

Baseline critical paths #

Main C1 paths:

P1 ingest spans
P2 update trace index
P3 compact storage segments
P4 apply retention / expiry
P5 route to shard owner
P6 reassign shard ownership

Main R1 paths:

P7 query traces by filters
P8 fetch full trace by id

This design is driven by:

immutable span ingestion
canonical trace correlation metadata
search/index correctness
safe storage lifecycle transitions

Step 3 - Primary State Extraction #

For a distributed tracing system, the minimal primary state is the immutable span record, current trace index metadata, storage block lifecycle state, and routing/ownership state.

Candidate object label	Candidate source	Candidate needed for C1/R1?	Candidate decomposition action	Class	Primary?	Owner	Evolution	Scope kind	Scope value
SpanRecord	direct noun	Yes	keep as candidate	event	Yes	service	append-only	instance	trace_id + span_id
TraceIndexState	hidden write target	Yes	keep as candidate	entity	Yes	service	overwrite	instance	trace_id or search-key scope
TraceStorageBlockState	hidden write target	Yes	keep as candidate	process	Yes	service	state machine	instance	block_id
PartitionOwnership	hidden write target	Yes	keep as candidate	process	Yes	service	state machine	instance	shard_id
PartitionMap	hidden write target	Yes	keep as candidate	entity	Yes	service	overwrite	collection	tenant/shard map
TracingStatusView	derived read model	No	reject as UI artifact	projection	No	derived	overwrite	collection	tenant or cluster

Important modeling choices #

`SpanRecord` #

Primary because:

span ingestion is the core immutable telemetry fact
includes trace id, span id, parent span id, timestamps, tags, and logs/events

`TraceIndexState` #

Primary because:

trace search depends on searchable metadata
captures service names, operation names, time range, duration bounds, tags/index references for a trace or query shard

`TraceStorageBlockState` #

Primary because:

segment/head/block lifecycle determines retention, compaction, and queryability

`PartitionOwnership` / `PartitionMap` #

Needed because:

collectors and queriers must consistently route ingestion and reads to authoritative storage/index owners

Minimal strict primary set #

The strongest minimal set is:

SpanRecord
TraceIndexState
TraceStorageBlockState
PartitionOwnership
PartitionMap

Step 4 - Hard Invariants #

For a Jaeger/Zipkin-class tracing system, the hard invariants are about immutable span append semantics, correct trace correlation/indexing, safe storage compaction, and valid retention transitions.

Path	Tier	Type	Invariant statement
`P1` ingest spans	HARD	uniqueness	Key `(trace_id, span_id)` maps to at most one logical outcome `stored span record` within trace scope.
`P1` ingest spans	HARD	accounting	Stored span fields preserve emitted trace correlation identifiers and timestamps for that span scope.
`P2` update trace index	HARD	accounting	`TraceIndexState` for a trace or search key reflects the union of authoritative stored spans relevant to that trace/search scope modulo retention policy.
`P3` compact storage segments	HARD	accounting	Compacted storage contents equal the union of authoritative source segments modulo dedup and retention rules, and source-to-destination transition preserves queryable truth.
`P4` apply retention / expiry	HARD	eligibility	Action `expire_trace_data` is valid only if affected spans/blocks are older than retention policy and current block state allows expiry at decision time.
`P5` route to shard owner	HARD	uniqueness	Key `shard_id` maps to at most one logical outcome `current authoritative owner` within `shard_id`.
`P6` reassign shard ownership	HARD	eligibility	Action `reassign_shard` is valid only if `current owner is failed or relinquished and candidate owner is eligible and sufficiently current` on `shard_id` at decision time.
`P7` query traces by filters	HARD	freshness	Query path reflects authoritative indexed trace metadata and stored spans within configured query consistency bound.
`P8` fetch full trace by id	HARD	freshness	Read path reflects authoritative stored spans for `trace_id` within configured query consistency bound.

What matters most #

1. Spans are immutable #

Once accepted, a span record is an immutable telemetry fact.

2. Trace index must correspond to stored spans #

Search results must only point to traces that actually exist in storage.

3. Compaction and retention must preserve correctness #

Storage lifecycle operations cannot lose live data or surface expired data incorrectly.

4. Trace completeness is often eventually assembled #

A trace may arrive out of order or in partial batches, so query freshness is usually bounded, not instantaneous.

Step 5 - Execution Context #

For the baseline distributed tracing platform:

Field	Value	Why
Topology	single service distributed	one logical tracing system spread across collectors, storage, and query nodes
Write coordination scope	per object scope	correctness is per span identity, trace/search index scope, block lifecycle, and shard ownership scope
Read consistency target	bounded stale allowed	trace assembly and search can tolerate small bounded ingest-to-query lag
Holder model	node	shard ownership is held by collector/storage nodes
Compensation acceptable?	No	wrong span storage or block retention cannot be safely repaired by compensation

Derived implications #

holder_may_crash = true
- collectors, indexers, or storage nodes can fail mid-ingestion
cross_service_write = false
- baseline keeps span storage, index metadata, and ownership in one logical service
bounded_staleness_allowed = true
- query paths can tolerate some bounded lag from ingestion to search visibility
cross_service_atomicity_required = false
- no multi-service transaction across unrelated services in baseline
exclusive_claim_required = true
- shard ownership must be exclusive
guarded_by_current_state = true
- compaction and retention transitions depend on current block state

What this implies #

This pushes us toward:

one authoritative owner per tenant or trace/search shard
append-oriented write path for spans
index updates maintained by the shard owner
bounded-stale query/search visibility

Step 6 - Deterministic Mechanism Selection #

Path	Write shape	Base mechanism	Required companions
`P1` ingest spans	append-only event	append log / span segment append	duplicate-span policy
`P2` update trace index	overwrite current value	single writer per shard or CAS on version	canonical trace/search key mapping
`P3` compact storage segments	guarded state transition	single-writer compaction with atomic segment swap	block manifest/version
`P4` apply retention / expiry	guarded state transition	single-writer retention transition	retention epoch/policy version
`P5` route to shard owner	exclusive claim	lease	fencing token, heartbeat
`P6` reassign shard ownership	guarded state transition	CAS on `(state, version)`	fencing token, shard catch-up check

Why these fit #

Span ingestion #

Spans are immutable records, so append-only fits.

Trace index #

Current search metadata is current-value state maintained by the shard owner, so overwrite fits.

Compaction and retention #

These are storage-lifecycle changes that must preserve current query truth, so guarded transition fits.

Canonical substrate implied #

The baseline now points to:

sharded collector / storage / query service
one owner per tenant or trace/search shard
append-oriented head or segment storage for spans
trace/search index maintained over immutable span records
safe compaction and retention on stored segments

Step 7 - Read Model / Source of Truth #

For a Jaeger/Zipkin-class system, truth is mostly direct source state. Dashboards and UIs are derived.

Concept	Truth	Read path	Rebuild path
`C1` immutable span records	`SpanRecord`	read source directly	authoritative span store
`C2` trace/search metadata	`TraceIndexState`	read source directly	authoritative index metadata
`C3` storage segment lifecycle	`TraceStorageBlockState`	read source directly	authoritative block/segment manifest store
`C4` shard ownership	`PartitionOwnership`	read source directly	authoritative ownership store
`C5` shard routing map	`PartitionMap`	read source directly	authoritative routing metadata
`C6` dashboards / status	derived from spans, index state, and storage state	materialized view	recompute from authoritative state

Important point #

For the core semantics:

full-trace fetch reads authoritative stored spans
trace search reads authoritative index metadata and then underlying spans as needed
dashboards are projections

Step 8 - Failure Handling #

Path	Retry	Competing writers	Crash after commit	Publish failure	Stale holder
`P1` ingest spans	client/collector retry may resend same spans; duplicate policy required	shard owner serializes authoritative storage/index updates for a shard	committed spans survive crash if WAL/segment append persisted	partial batch append recovered from WAL/segment replay	stale owner blocked by fencing token
`P2` update trace index	retry with metadata version	concurrent updates merge through canonical trace/search key mapping under shard owner	committed index metadata survives crash if persisted	index lag only delays search visibility	stale owner blocked by fencing token
`P3` compact storage segments	compaction can retry from source manifests	only one compactor should own a segment set at a time	if crash before atomic swap, source segments remain authoritative	partial destination segment ignored until manifest commit	stale compactor blocked by block manifest/version
`P4` apply retention / expiry	retention pass can retry safely	only current retention owner should transition current block state	committed retention survives crash if manifest persisted	partial deletion deferred until safe manifest transition	stale retention worker blocked by manifest/version
`P5` route to shard owner	retry after refreshing shard map	only one valid owner should exist	if owner changed, refreshed map points to new owner	n/a	stale owner rejected by fencing token
`P6` reassign shard ownership	retry failover transition safely	only one reassignment wins current ownership state	promoted owner crash triggers later reassignment	n/a	old owner fenced and must not continue serving
`P7` query traces by filters	query retry safe	many readers coexist	node crash drops query only	n/a	stale search bounded by configured query freshness
`P8` fetch full trace by id	read retry safe	many readers coexist	node crash drops query only	n/a	stale trace visibility bounded by configured query freshness

What matters most #

1. Duplicate-span policy #

Collectors and agents may retry the same spans. The system needs a clear dedup policy for (trace_id, span_id).

2. Index lag is acceptable but bounded #

A span may be stored before its trace becomes fully searchable. That is usually acceptable if bounded.

3. Compaction and retention need atomic manifest transitions #

New compacted segments and expired segments should only change visibility when the manifest/state transition is fully committed.

Step 9 - Scale Adjustments #

Hotspot	Type	First response
very high span ingestion rate	write throughput hotspot	shard by tenant, trace-id hash, or time bucket and add more collectors/ingesters
high-cardinality tag search	memory/read hotspot	constrain indexed tags and separate full-scan from indexed search
large trace fetch fan-in	read hotspot	colocate trace spans by trace id and cache recent traces
storage compaction bandwidth	write throughput hotspot	stagger compaction and separate hot ingest storage from cold segments
retention deletion storms	contention hotspot	batch retention by block and time bucket
dashboard/search load	read hotspot	add query frontends and caching for common searches

What scales well #

This system scales by:

sharding ingestion and storage by tenant and trace/search key
keeping spans append-oriented
colocating spans of the same trace when possible
separating hot ingest segments from compacted storage

What fails first #

Usually:

high-cardinality tag indexing
search fanout across too many shards
compaction IO bottlenecks
huge traces causing skewed reads

Canonical design conclusion #

The mechanical outcome is:

primary state:
- SpanRecord
- TraceIndexState
- TraceStorageBlockState
- PartitionOwnership
- PartitionMap
critical invariants:
- immutable span append semantics
- canonical trace/search metadata reflecting stored spans
- safe segment compaction and retention preserving query truth
- exclusive shard ownership for ingestion/storage
mechanisms:
- append log
- single writer per shard
- guarded compaction and retention transitions
- fenced shard ownership
reads:
- direct authoritative reads for spans and trace/search metadata
- projections only for dashboards and cluster status

Polished interview answer #

I’d design the tracing system as a sharded collector-and-store service with one authoritative owner per tenant or trace/search shard. Applications emit immutable spans, collectors append them into durable storage, and shard owners maintain trace/search metadata so users can search by service, operation, tags, and time range. Full-trace fetches read stored spans directly by trace id, while search reads the index and then resolves matching traces. The storage layer periodically compacts segments and applies retention through guarded manifest transitions so query truth is preserved. The main scaling levers are more ingestion shards, careful limits on indexed tags, colocating spans by trace id, and separating hot ingest storage from compacted segments.

Concrete Substrate #

I’ll choose a sharded collector / storage / query system with append-oriented span segments plus secondary trace-search indexes as the concrete baseline, because it matches the mechanics we derived:

append-only span ingestion
current-value trace/search metadata
safe compaction and retention lifecycle
one owner per tenant or trace/search shard

Concrete tech family:

collectors/query services in Go or Java
per-shard WAL or append segments for spans
compacted immutable segments in local disk or object storage
secondary index store for trace search metadata
metadata/control:
- etcd or internal metadata quorum for shard ownership/routing

Each shard owner stores:

current append segments / WAL for spans
trace/search index metadata
segment/block manifests
retention and compaction state

Persisted segment store stores:

immutable span segments
segment manifests and indexes

Operation Layer #

1. Ingest span batch #

API

IngestSpans(span_batch)

Initiator

application agent / collector client

Entry point

collector / ingester

Authoritative decider

shard owner for tenant/trace space

Precondition

span batch parsed and routed to correct shard

Transition

append SpanRecords into WAL/segment
create or refresh TraceIndexState as needed

Response

success / partial failure

2. Query traces by filters #

API

FindTraces(service, operation, tags, time_range, limit)

Initiator

user/client

Entry point

query frontend

Authoritative decider

relevant shard owners plus query engine

Precondition

none

Transition

none

Response

matching trace summaries / ids

3. Fetch full trace #

API

GetTrace(trace_id)

Initiator

user/client

Entry point

query frontend

Authoritative decider

shard owner for trace_id

Precondition

none

Transition

none

Response

all spans for the trace

4. Compact storage segments #

API

internal compaction loop

Initiator

system

Entry point

compactor / shard owner

Authoritative decider

compaction owner for segment set

Precondition

source segments eligible for compaction

Transition

read source segments
create destination compacted segment
atomically update TraceStorageBlockState / manifest to swap visibility

5. Apply retention #

API

internal retention loop

Initiator

system

Entry point

retention worker / shard owner

Authoritative decider

retention owner for segment set

Precondition

blocks older than retention horizon and current block state allows expiry

Transition

mark expired segments as no longer queryable
later delete physical storage

Entry Point vs Decider vs Responder #

Path	Entry point	Authoritative decider	Physical responder	Logical responder
span ingest	collector / ingester	shard owner	ingester node	tracing system
trace search	query frontend	shard owners + query engine	query frontend	tracing system
full trace fetch	query frontend	shard owner for trace	query frontend	tracing system
compaction	compactor / shard owner	compaction owner	storage node	tracing system
retention	retention worker / shard owner	retention owner	storage node	tracing system
shard failover	follower / coordination layer	shard quorum / lease store	new leader / control plane	tracing system

Concrete HLD #

Main components:

collectors / ingestion frontends
- receive spans from applications or agents
ingestion/storage shard owners
- authoritative owners for span segments, indexes, and block lifecycle
query frontend
- handles search and trace-fetch APIs
trace-search index
- supports filter-based lookup by service/operation/tags/time
segment store
- stores immutable span segments
metadata/control service
- tracks shard ownership and routing

Short Interview Version #

I’d build the tracing system as a sharded collector-and-store service with one authoritative owner per tenant or trace/search shard. Applications emit immutable spans, collectors append them into durable storage, and shard owners maintain trace/search metadata so users can search by service, operation, tags, and time range. Full-trace fetches read stored spans directly by trace id, while search reads the index and then resolves matching traces. The storage layer periodically compacts segments and applies retention through guarded manifest transitions so query truth is preserved. The main scaling levers are more ingestion shards, careful limits on indexed tags, colocating spans by trace id, and separating hot ingest storage from compacted segments.

Distributed Tracing System (Jaeger / Zipkin-class) #

Step 1 - Normalize #

Notes on normalization #

Step 2 - Critical Path Selection #

Baseline critical paths #

Step 3 - Primary State Extraction #

Important modeling choices #

SpanRecord #

TraceIndexState #

TraceStorageBlockState #

PartitionOwnership / PartitionMap #

Minimal strict primary set #

Step 4 - Hard Invariants #

What matters most #

1. Spans are immutable #

2. Trace index must correspond to stored spans #

3. Compaction and retention must preserve correctness #

4. Trace completeness is often eventually assembled #

Step 5 - Execution Context #

Derived implications #

What this implies #

Step 6 - Deterministic Mechanism Selection #

Why these fit #

Span ingestion #

Trace index #

Compaction and retention #

Canonical substrate implied #

Step 7 - Read Model / Source of Truth #

Important point #

Step 8 - Failure Handling #

What matters most #

1. Duplicate-span policy #

2. Index lag is acceptable but bounded #

3. Compaction and retention need atomic manifest transitions #

Step 9 - Scale Adjustments #

What scales well #

What fails first #

Canonical design conclusion #

Polished interview answer #

Concrete Substrate #

Operation Layer #

1. Ingest span batch #

2. Query traces by filters #

3. Fetch full trace #

4. Compact storage segments #

5. Apply retention #

Entry Point vs Decider vs Responder #

Concrete HLD #

Short Interview Version #

`SpanRecord` #

`TraceIndexState` #

`TraceStorageBlockState` #

`PartitionOwnership` / `PartitionMap` #