Monarch Process And Storage Traces #

Source: monarch.txt

Monarch is primarily a storage system:

monitoring samples are ingested into regional in-memory leaves
data is sharded by target ranges and replicated across leaves
recovery logs and long-term repository back the in-memory store
queries execute over leaves through mixers and index servers
standing queries materialize derived time series back into Monarch
configuration is globally authored and zonally mirrored/cached

Its process traces are mostly about:

collection routing
range assignment and movement
standing query execution
configuration distribution

Its storage traces are mostly about:

in-memory leaf state plus best-effort recovery logging
query-visible leaf/zone/root index state
standing-query materialized output
globally committed configuration plus zonal mirrors/caches

It is not primarily a workflow/control-plane execution fleet like Borg.

Hot Paths #

The main hot paths in Monarch are:

client write -> ingestion router -> leaf router -> leaf replicas
- this is the primary ingestion path and one of the hottest paths in the system
query -> root/zone mixer -> index servers -> leaves -> partial aggregation -> stream results up
- this is the primary read/query path
standing query evaluator -> mixers -> leaves -> write derived results back
- a major internal read+write path

Supporting but less hot:

range movement / recovery during load balancing
field hints index streaming updates
configuration commit -> mirror propagation -> zonal cache refresh

Scaling Lens #

1. Collection / Ingestion #

flow
- metric writes / deltas from monitored targets
containers
- routing containers:
  - ingestion routers
  - leaf routers
- partition containers:
  - target ranges
- replication containers:
  - three leaf replicas per range
drains
- network drains:
  - router forwarding bandwidth
- storage drains:
  - leaf in-memory append/apply throughput
  - recovery-log write bandwidth
- control drains:
  - range-map propagation from leaves to routers
pressure signals
- router throughput saturation
- hot target-range concentration
- leaf CPU/memory pressure
- delayed or rejected writes outside admission window
- recovery-log lag during movement/recovery
relief types
- split / move target ranges
- add leaves horizontally within zone
- collection aggregation to shrink write volume
- lexicographic sharding by target
- temporary bucket shrink during range movement

2. Query Execution #

flow
- ad hoc queries and standing-query reads
containers
- routing containers:
  - root mixers
  - zone mixers
- index containers:
  - root and zone field hints indexes
  - leaf-local target index
- partition containers:
  - relevant leaf sets / target ranges
drains
- control drains:
  - query planning, pushdown, replica resolution
- network drains:
  - subquery fanout and streaming partial results upward
- execution drains:
  - leaf-level join/filter/group-by work
pressure signals
- query fanout explosion
- root/zone mixer latency growth
- slow or missing zones
- slow leaf replicas
- memory pressure from large query result sets
relief types
- query pushdown to zone and leaf
- field hints index pruning
- replica resolution to best-quality leaf
- zone pruning on soft deadlines
- hedged reads against fallback leaves
- per-user memory and CPU isolation

3. Standing Query Materialization #

flow
- periodic derived-query runs that read source data and write condensed outputs
containers
- queue containers:
  - evaluator shard workload
- execution containers:
  - zone/root evaluator shards
  - mixers and leaves
- snapshot containers:
  - materialized output time series
drains
- execution drains:
  - evaluator and mixer computation
- storage drains:
  - writes of derived output back into leaves
- network drains:
  - query result streaming
pressure signals
- evaluator backlog
- missed evaluation periods / deadline overruns
- excessive write amplification from derived outputs
- cross-zone query latency
relief types
- zone-local execution for most standing queries
- pushdown to reduce transferred and materialized volume
- sharding evaluators by standing-query hash
- materializing condensed outputs for later reads

4. Configuration Distribution #

flow
- config updates from global truth to zonal mirrors and caches
containers
- relation/config containers:
  - global config store in Spanner
- replication containers:
  - zonal mirrors
- cache containers:
  - component-local in-memory config caches
drains
- control drains:
  - validation, transformation, dependency tracking
- network drains:
  - mirror propagation and periodic cache refresh
pressure signals
- mirror lag
- stale component cache age
- config-application skew across zones
relief types
- zonal mirrors for locality and availability
- transformed per-component config to reduce distribution cost
- in-memory caching at leaves/routers/mixers
- tolerate stale cache during mirror outages

Best Compression #

Monarch’s dominant scale story is:

write flow is constrained by target-range partitioning + leaf in-memory drain
read flow is constrained by query fanout + mixer/leaf execution drain
the main pressure signals are:
- range hotspots
- query fanout
- mixer latency
- leaf saturation
- evaluator backlog
the canonical relief moves are:
- target-range splitting/movement
- collection aggregation
- query pushdown
- field-hints pruning
- zone pruning / hedged reads
- sharded evaluators and materialized outputs

1. Process Trace: Ingestion Routing To Leaf Write #

This is the core write-side path.

Core lines #

ingestion_path_state
destination_zone
target_range_assignment_version
admission_window_boundary
actor_trying_to_route_or_store

Reusable frame mapping #

state -> ingestion_path_state
owner/view -> destination_zone plus current target range owner set
monotonic marker -> target_range_assignment_version
validity boundary -> admission_window_boundary
actor trying to advance -> actor_trying_to_route_or_store

Trace #

time →

ingestion_path_state            CLIENT_SEND ---- ZONE_ROUTED ---- LEAF_ROUTED ----- STORED
destination_zone                unknown -------- zone-us-c1 ----- zone-us-c1 ------ zone-us-c1
target_range_assignment_version 200 ------------ 200 ------------ 201 ------------- 201
admission_window_boundary       open ----------- open ----------- delta accepted --- finalized later
actor_trying_to_route_or_store  client --------- ingest router -- leaf router ------ leaf replicas

What it teaches #

writes are regionally routed by location field
within the zone, writes are routed by target-range ownership
leaves accept writes into in-memory state and recovery logs
older deltas can be rejected by the admission window

State loci #

authoritative:
- leaf replicas storing the target range in memory
local/execution:
- ingestion router and leaf router forwarding state
cached/derived:
- location-to-zone map and leaf-router range map
repair:
- recovery logs and re-routing through updated range ownership

Core invariant #

a write must only be accepted into the currently assigned leaf replica set for its target range and must respect the admission window for aggregating deltas

2. Process Trace: Range Movement / Intra-zone Load Balancing #

This is Monarch’s main rebalancing path.

Core lines #

range_move_state
source_and_destination_leaf
range_assignment_epoch
recovery_completion_boundary
actor_trying_to_move_range

Reusable frame mapping #

state -> range_move_state
owner/view -> source_and_destination_leaf
monotonic marker -> range_assignment_epoch
validity boundary -> recovery_completion_boundary
actor trying to advance -> actor_trying_to_move_range

Trace #

time →

range_move_state              STABLE --------- DUAL_WRITE -------- RECOVERING ------ SOURCE_UNASSIGNED
source_and_destination_leaf   L1 ------------ L1+L9 ------------ L1+L9 ----------- L9
range_assignment_epoch        41 ------------ 42 --------------- 42 --------------- 43
recovery_completion_boundary  - ------------- log wait t1 ------ replay catchup --- recovery done
actor_trying_to_move_range    assigner ------ dest leaf -------- dest leaf -------- assigner/source leaf

What it teaches #

load balancing is by moving target ranges between leaves
destination leaf starts collecting before the source stops
recovery replays older data from logs in reverse chronological order
routing continues during transient assigner failure because leaves update routers directly

State loci #

authoritative:
- leaves actually storing the range and their recovery logs
local/execution:
- destination leaf recovery worker and source leaf collector
cached/derived:
- leaf-router range maps and assigner load view
repair:
- recovery-log replay until destination has complete range state

Core invariant #

a range move must preserve continuous data availability by overlapping collection and recovery before the source leaf is unassigned

3. Process Trace: Standing Query Evaluation #

This is the main internal derivation/materialization path.

Core lines #

standing_query_state
evaluation_scope
standing_query_version
query_deadline_or_period_boundary
actor_trying_to_evaluate

Reusable frame mapping #

state -> standing_query_state
owner/view -> evaluation_scope
monotonic marker -> standing_query_version
validity boundary -> query_deadline_or_period_boundary
actor trying to advance -> actor_trying_to_evaluate

Trace #

time →

standing_query_state           SCHEDULED ------ DISPATCHED ------ RUNNING ---------- MATERIALIZED
evaluation_scope               root/zone? ----- zone evaluator -- zone/root tree --- target output zone
standing_query_version         88 ------------ 88 -------------- 88 --------------- 88
query_deadline_or_period_boundary next tick --- deadline set ---- soft prune/hedge -- write period done
actor_trying_to_evaluate       evaluator ------ mixer ---------- leaves/mixers ----- evaluator/leaves

What it teaches #

standing queries are sharded across evaluators
most are zone-local because of pushdown and resilience
results are written back into Monarch as materialized time series
query execution is deadline-aware and can prune zones or hedge leaf reads

State loci #

authoritative:
- standing-query definition and resulting written output time series
local/execution:
- evaluator shard, mixer tree, leaf snapshots
cached/derived:
- index-guided relevant-child sets, rewritten query plan
repair:
- next standing-query period or re-evaluation after transient failure

Core invariant #

a standing query evaluation must operate over a bounded execution snapshot and materialize output according to the current validated standing-query definition

4. Process Trace: Configuration Distribution #

This is Monarch’s main control-plane propagation path.

Core lines #

configuration_distribution_state
global_vs_zonal_view
config_version
mirror_or_cache_staleness_boundary
actor_trying_to_apply_config

Reusable frame mapping #

state -> configuration_distribution_state
owner/view -> global_vs_zonal_view
monotonic marker -> config_version
validity boundary -> mirror_or_cache_staleness_boundary
actor trying to advance -> actor_trying_to_apply_config

Trace #

time →

configuration_distribution_state  EDITED -------- COMMITTED ------- MIRRORED ------- CACHED/APPLIED
global_vs_zonal_view              global draft -- global truth ---- zone mirror ---- zonal component cache
config_version                    510 ---------- 511 ------------- 511 ------------ 511
mirror_or_cache_staleness_boundary n/a -------- commit complete -- mirror current -- cache refresh period
actor_trying_to_apply_config      user/server --- config server --- mirror --------- leaf/router/mixer

What it teaches #

config is globally authored and validated in Spanner
transformed config is mirrored into each zone
zonal components cache relevant config in memory
zones keep operating with stale cache if the mirror becomes unavailable

State loci #

authoritative:
- global configuration server backed by Spanner
local/execution:
- zonal mirrors and component-local config caches
cached/derived:
- transformed per-component config subsets
repair:
- periodic refresh and mirror re-synchronization after failure

Core invariant #

only globally validated configuration versions may become the source for zonal mirrors and local caches

5. Storage Trace: Leaf In-Memory Store + Recovery Log #

This is Monarch’s core storage trace.

Core lines #

in_memory_leaf_version
recovery_log_visibility
durability_boundary
query_visible_leaf_state
writer_trying_to_append_point

Reusable frame mapping #

truth locus -> in_memory_leaf_version
replica/derived state -> recovery_log_visibility
durability/freshness boundary -> durability_boundary
visibility -> query_visible_leaf_state
actor trying to write/read/apply -> writer_trying_to_append_point

Trace #

time →

in_memory_leaf_version        900 ------------ 901 ---------------- 902 ----------------
recovery_log_visibility       899 ------------ best-effort append --- compacted/replayed -
durability_boundary           memory first ---- log maybe delayed --- long-term async ----
query_visible_leaf_state      900 ------------ 901 ---------------- 902 ----------------
writer_trying_to_append_point none ----------- leaf append --------- collection finalize -

What it teaches #

the primary serving truth is the in-memory leaf store
recovery logging is best-effort on the hot path and not synchronously acknowledged
long-term repository is colder and off the alerting path
Monarch explicitly prefers timely visibility over strict durability before ack

State loci #

authoritative:
- in-memory leaf replicas for recent monitoring truth
local/replica:
- per-range recovery logs in Colossus clusters
cached/derived:
- compacted/reformatted log files and long-term repository
repair:
- replay from logs into leaves during recovery or range movement

Core invariant #

the system may trade strict immediate durability for timely in-memory availability, but recovery state must be sufficient to reconstruct recent accepted leaf data within the designed fault model

6. Storage Trace: Query Execution With Field Hints Index #

This is Monarch’s core read-side localization trace.

Core lines #

field_hints_index_version
potentially_relevant_children
query_deadline_boundary
query_visible_result
reader_trying_to_localize_query

Reusable frame mapping #

truth locus -> actual leaf data plus live leaf snapshots
replica/derived state -> field_hints_index_version
durability/freshness boundary -> query_deadline_boundary
visibility -> query_visible_result
actor trying to write/read/apply -> reader_trying_to_localize_query

Trace #

time →

field_hints_index_version     700 ------------ 701 ---------------- 701 ----------------
potentially_relevant_children unknown -------- zones Z1,Z7 ------- leaves L2,L9,L11 ---
query_deadline_boundary       start ---------- soft zone deadline - hedge leaf fallback -
query_visible_result          none ----------- partial streaming --- final/partial done --
reader_trying_to_localize_query root mixer --- zone mixer -------- leaves + fallback ----

What it teaches #

query fanout is constrained first at root, then zone, then leaf
lower levels stream partial results upward
stale or missing index updates are used as availability signals
query visibility may be partial if zones are pruned or leaves are hedged

State loci #

authoritative:
- live leaf snapshots of time series data
local/replica:
- zone and root field hints indexes, leaf-local target indexes
cached/derived:
- relevant-child sets and rewritten query fragments in mixers
repair:
- fallback leaf reads, zone pruning, next query attempt

Core invariant #

query execution must localize to a safe superset of relevant children and may return partial results, but it must surface pruning/incompleteness rather than silently pretending completeness

7. Storage Trace: Standing Query Materialized Output #

This is Monarch’s main snapshot/projection trace.

Core lines #

standing_query_definition_version
derived_output_version
evaluation_period_boundary
consumer_visible_materialized_state
writer_trying_to_publish_materialized_output

Reusable frame mapping #

truth locus -> standing query definition plus source time series
replica/derived state -> derived_output_version
durability/freshness boundary -> evaluation_period_boundary
visibility -> consumer_visible_materialized_state
actor trying to write/read/apply -> writer_trying_to_publish_materialized_output

Trace #

time →

standing_query_definition_version 88 ------------- 88 ------------- 89 --------------
derived_output_version            v120 ----------- v121 ----------- v121 ------------
evaluation_period_boundary        t0-t1 ---------- t1-t2 ---------- next period -----
consumer_visible_materialized_state v120 --------- v121 ----------- v121 ------------
writer_trying_to_publish_materialized_output eval run 121 ------- leaf write ------- waiting next run

What it teaches #

standing queries create durable-ish materialized time series inside Monarch
freshness is period-bound, not instantaneous
definition changes and output generations are separate axes

State loci #

authoritative:
- source input tables and standing query definition
local/replica:
- evaluator shard state and leaf-written output replicas
cached/derived:
- materialized standing-query outputs queried later by users
repair:
- next scheduled evaluation or re-run after transient failure

Core invariant #

consumers of a standing query must observe output that corresponds to some completed evaluation period of the current or explicitly identified prior query definition

8. Storage Trace: Global Config Truth To Zonal Mirror And Cache #

This is Monarch’s storage-side configuration consistency trace.

Core lines #

global_config_commit_version
zonal_mirror_version
local_cache_version
staleness_boundary
reader_or_writer_trying_to_use_config

Reusable frame mapping #

truth locus -> global_config_commit_version
replica/derived state -> zonal_mirror_version and local_cache_version
durability/freshness boundary -> staleness_boundary
visibility -> component-visible config
actor trying to write/read/apply -> reader_or_writer_trying_to_use_config

Trace #

time →

global_config_commit_version   510 ------------ 511 ---------------- 511 ----------------
zonal_mirror_version           510 ------------ 510 ---------------- 511 ----------------
local_cache_version            510 ------------ 510 ---------------- 510/511 mix --------
staleness_boundary             healthy -------- mirror lag allowed -- cache refresh due ---
reader_or_writer_trying_to_use_config leaves read ----- mirrors copy ------ components refresh

What it teaches #

configuration has a strongly committed global truth
zonal mirrors and local caches may lag
components continue to operate under stale config when mirrors fail

State loci #

authoritative:
- global Spanner-backed config state
local/replica:
- zonal mirrors
cached/derived:
- in-memory component caches
repair:
- periodic refresh and mirror catch-up

Core invariant #

stale cached configuration may be served temporarily for availability, but config versions must advance monotonically from globally committed state