Skip to main content
  1. System Design Components/

Monarch Process And Storage Traces #

Source: monarch.txt

Monarch is primarily a storage system:

  • monitoring samples are ingested into regional in-memory leaves
  • data is sharded by target ranges and replicated across leaves
  • recovery logs and long-term repository back the in-memory store
  • queries execute over leaves through mixers and index servers
  • standing queries materialize derived time series back into Monarch
  • configuration is globally authored and zonally mirrored/cached

Its process traces are mostly about:

  • collection routing
  • range assignment and movement
  • standing query execution
  • configuration distribution

Its storage traces are mostly about:

  • in-memory leaf state plus best-effort recovery logging
  • query-visible leaf/zone/root index state
  • standing-query materialized output
  • globally committed configuration plus zonal mirrors/caches

It is not primarily a workflow/control-plane execution fleet like Borg.

Hot Paths #

The main hot paths in Monarch are:

  • client write -> ingestion router -> leaf router -> leaf replicas
    • this is the primary ingestion path and one of the hottest paths in the system
  • query -> root/zone mixer -> index servers -> leaves -> partial aggregation -> stream results up
    • this is the primary read/query path
  • standing query evaluator -> mixers -> leaves -> write derived results back
    • a major internal read+write path

Supporting but less hot:

  • range movement / recovery during load balancing
  • field hints index streaming updates
  • configuration commit -> mirror propagation -> zonal cache refresh

Scaling Lens #

1. Collection / Ingestion #

  • flow
    • metric writes / deltas from monitored targets
  • containers
    • routing containers:
      • ingestion routers
      • leaf routers
    • partition containers:
      • target ranges
    • replication containers:
      • three leaf replicas per range
  • drains
    • network drains:
      • router forwarding bandwidth
    • storage drains:
      • leaf in-memory append/apply throughput
      • recovery-log write bandwidth
    • control drains:
      • range-map propagation from leaves to routers
  • pressure signals
    • router throughput saturation
    • hot target-range concentration
    • leaf CPU/memory pressure
    • delayed or rejected writes outside admission window
    • recovery-log lag during movement/recovery
  • relief types
    • split / move target ranges
    • add leaves horizontally within zone
    • collection aggregation to shrink write volume
    • lexicographic sharding by target
    • temporary bucket shrink during range movement

2. Query Execution #

  • flow
    • ad hoc queries and standing-query reads
  • containers
    • routing containers:
      • root mixers
      • zone mixers
    • index containers:
      • root and zone field hints indexes
      • leaf-local target index
    • partition containers:
      • relevant leaf sets / target ranges
  • drains
    • control drains:
      • query planning, pushdown, replica resolution
    • network drains:
      • subquery fanout and streaming partial results upward
    • execution drains:
      • leaf-level join/filter/group-by work
  • pressure signals
    • query fanout explosion
    • root/zone mixer latency growth
    • slow or missing zones
    • slow leaf replicas
    • memory pressure from large query result sets
  • relief types
    • query pushdown to zone and leaf
    • field hints index pruning
    • replica resolution to best-quality leaf
    • zone pruning on soft deadlines
    • hedged reads against fallback leaves
    • per-user memory and CPU isolation

3. Standing Query Materialization #

  • flow
    • periodic derived-query runs that read source data and write condensed outputs
  • containers
    • queue containers:
      • evaluator shard workload
    • execution containers:
      • zone/root evaluator shards
      • mixers and leaves
    • snapshot containers:
      • materialized output time series
  • drains
    • execution drains:
      • evaluator and mixer computation
    • storage drains:
      • writes of derived output back into leaves
    • network drains:
      • query result streaming
  • pressure signals
    • evaluator backlog
    • missed evaluation periods / deadline overruns
    • excessive write amplification from derived outputs
    • cross-zone query latency
  • relief types
    • zone-local execution for most standing queries
    • pushdown to reduce transferred and materialized volume
    • sharding evaluators by standing-query hash
    • materializing condensed outputs for later reads

4. Configuration Distribution #

  • flow
    • config updates from global truth to zonal mirrors and caches
  • containers
    • relation/config containers:
      • global config store in Spanner
    • replication containers:
      • zonal mirrors
    • cache containers:
      • component-local in-memory config caches
  • drains
    • control drains:
      • validation, transformation, dependency tracking
    • network drains:
      • mirror propagation and periodic cache refresh
  • pressure signals
    • mirror lag
    • stale component cache age
    • config-application skew across zones
  • relief types
    • zonal mirrors for locality and availability
    • transformed per-component config to reduce distribution cost
    • in-memory caching at leaves/routers/mixers
    • tolerate stale cache during mirror outages

Best Compression #

Monarch’s dominant scale story is:

  • write flow is constrained by target-range partitioning + leaf in-memory drain
  • read flow is constrained by query fanout + mixer/leaf execution drain
  • the main pressure signals are:
    • range hotspots
    • query fanout
    • mixer latency
    • leaf saturation
    • evaluator backlog
  • the canonical relief moves are:
    • target-range splitting/movement
    • collection aggregation
    • query pushdown
    • field-hints pruning
    • zone pruning / hedged reads
    • sharded evaluators and materialized outputs

1. Process Trace: Ingestion Routing To Leaf Write #

This is the core write-side path.

Core lines #

  • ingestion_path_state
  • destination_zone
  • target_range_assignment_version
  • admission_window_boundary
  • actor_trying_to_route_or_store

Reusable frame mapping #

  • state -> ingestion_path_state
  • owner/view -> destination_zone plus current target range owner set
  • monotonic marker -> target_range_assignment_version
  • validity boundary -> admission_window_boundary
  • actor trying to advance -> actor_trying_to_route_or_store

Trace #

time →

ingestion_path_state            CLIENT_SEND ---- ZONE_ROUTED ---- LEAF_ROUTED ----- STORED
destination_zone                unknown -------- zone-us-c1 ----- zone-us-c1 ------ zone-us-c1
target_range_assignment_version 200 ------------ 200 ------------ 201 ------------- 201
admission_window_boundary       open ----------- open ----------- delta accepted --- finalized later
actor_trying_to_route_or_store  client --------- ingest router -- leaf router ------ leaf replicas

What it teaches #

  • writes are regionally routed by location field
  • within the zone, writes are routed by target-range ownership
  • leaves accept writes into in-memory state and recovery logs
  • older deltas can be rejected by the admission window

State loci #

  • authoritative:
    • leaf replicas storing the target range in memory
  • local/execution:
    • ingestion router and leaf router forwarding state
  • cached/derived:
    • location-to-zone map and leaf-router range map
  • repair:
    • recovery logs and re-routing through updated range ownership

Core invariant #

  • a write must only be accepted into the currently assigned leaf replica set for its target range and must respect the admission window for aggregating deltas

2. Process Trace: Range Movement / Intra-zone Load Balancing #

This is Monarch’s main rebalancing path.

Core lines #

  • range_move_state
  • source_and_destination_leaf
  • range_assignment_epoch
  • recovery_completion_boundary
  • actor_trying_to_move_range

Reusable frame mapping #

  • state -> range_move_state
  • owner/view -> source_and_destination_leaf
  • monotonic marker -> range_assignment_epoch
  • validity boundary -> recovery_completion_boundary
  • actor trying to advance -> actor_trying_to_move_range

Trace #

time →

range_move_state              STABLE --------- DUAL_WRITE -------- RECOVERING ------ SOURCE_UNASSIGNED
source_and_destination_leaf   L1 ------------ L1+L9 ------------ L1+L9 ----------- L9
range_assignment_epoch        41 ------------ 42 --------------- 42 --------------- 43
recovery_completion_boundary  - ------------- log wait t1 ------ replay catchup --- recovery done
actor_trying_to_move_range    assigner ------ dest leaf -------- dest leaf -------- assigner/source leaf

What it teaches #

  • load balancing is by moving target ranges between leaves
  • destination leaf starts collecting before the source stops
  • recovery replays older data from logs in reverse chronological order
  • routing continues during transient assigner failure because leaves update routers directly

State loci #

  • authoritative:
    • leaves actually storing the range and their recovery logs
  • local/execution:
    • destination leaf recovery worker and source leaf collector
  • cached/derived:
    • leaf-router range maps and assigner load view
  • repair:
    • recovery-log replay until destination has complete range state

Core invariant #

  • a range move must preserve continuous data availability by overlapping collection and recovery before the source leaf is unassigned

3. Process Trace: Standing Query Evaluation #

This is the main internal derivation/materialization path.

Core lines #

  • standing_query_state
  • evaluation_scope
  • standing_query_version
  • query_deadline_or_period_boundary
  • actor_trying_to_evaluate

Reusable frame mapping #

  • state -> standing_query_state
  • owner/view -> evaluation_scope
  • monotonic marker -> standing_query_version
  • validity boundary -> query_deadline_or_period_boundary
  • actor trying to advance -> actor_trying_to_evaluate

Trace #

time →

standing_query_state           SCHEDULED ------ DISPATCHED ------ RUNNING ---------- MATERIALIZED
evaluation_scope               root/zone? ----- zone evaluator -- zone/root tree --- target output zone
standing_query_version         88 ------------ 88 -------------- 88 --------------- 88
query_deadline_or_period_boundary next tick --- deadline set ---- soft prune/hedge -- write period done
actor_trying_to_evaluate       evaluator ------ mixer ---------- leaves/mixers ----- evaluator/leaves

What it teaches #

  • standing queries are sharded across evaluators
  • most are zone-local because of pushdown and resilience
  • results are written back into Monarch as materialized time series
  • query execution is deadline-aware and can prune zones or hedge leaf reads

State loci #

  • authoritative:
    • standing-query definition and resulting written output time series
  • local/execution:
    • evaluator shard, mixer tree, leaf snapshots
  • cached/derived:
    • index-guided relevant-child sets, rewritten query plan
  • repair:
    • next standing-query period or re-evaluation after transient failure

Core invariant #

  • a standing query evaluation must operate over a bounded execution snapshot and materialize output according to the current validated standing-query definition

4. Process Trace: Configuration Distribution #

This is Monarch’s main control-plane propagation path.

Core lines #

  • configuration_distribution_state
  • global_vs_zonal_view
  • config_version
  • mirror_or_cache_staleness_boundary
  • actor_trying_to_apply_config

Reusable frame mapping #

  • state -> configuration_distribution_state
  • owner/view -> global_vs_zonal_view
  • monotonic marker -> config_version
  • validity boundary -> mirror_or_cache_staleness_boundary
  • actor trying to advance -> actor_trying_to_apply_config

Trace #

time →

configuration_distribution_state  EDITED -------- COMMITTED ------- MIRRORED ------- CACHED/APPLIED
global_vs_zonal_view              global draft -- global truth ---- zone mirror ---- zonal component cache
config_version                    510 ---------- 511 ------------- 511 ------------ 511
mirror_or_cache_staleness_boundary n/a -------- commit complete -- mirror current -- cache refresh period
actor_trying_to_apply_config      user/server --- config server --- mirror --------- leaf/router/mixer

What it teaches #

  • config is globally authored and validated in Spanner
  • transformed config is mirrored into each zone
  • zonal components cache relevant config in memory
  • zones keep operating with stale cache if the mirror becomes unavailable

State loci #

  • authoritative:
    • global configuration server backed by Spanner
  • local/execution:
    • zonal mirrors and component-local config caches
  • cached/derived:
    • transformed per-component config subsets
  • repair:
    • periodic refresh and mirror re-synchronization after failure

Core invariant #

  • only globally validated configuration versions may become the source for zonal mirrors and local caches

5. Storage Trace: Leaf In-Memory Store + Recovery Log #

This is Monarch’s core storage trace.

Core lines #

  • in_memory_leaf_version
  • recovery_log_visibility
  • durability_boundary
  • query_visible_leaf_state
  • writer_trying_to_append_point

Reusable frame mapping #

  • truth locus -> in_memory_leaf_version
  • replica/derived state -> recovery_log_visibility
  • durability/freshness boundary -> durability_boundary
  • visibility -> query_visible_leaf_state
  • actor trying to write/read/apply -> writer_trying_to_append_point

Trace #

time →

in_memory_leaf_version        900 ------------ 901 ---------------- 902 ----------------
recovery_log_visibility       899 ------------ best-effort append --- compacted/replayed -
durability_boundary           memory first ---- log maybe delayed --- long-term async ----
query_visible_leaf_state      900 ------------ 901 ---------------- 902 ----------------
writer_trying_to_append_point none ----------- leaf append --------- collection finalize -

What it teaches #

  • the primary serving truth is the in-memory leaf store
  • recovery logging is best-effort on the hot path and not synchronously acknowledged
  • long-term repository is colder and off the alerting path
  • Monarch explicitly prefers timely visibility over strict durability before ack

State loci #

  • authoritative:
    • in-memory leaf replicas for recent monitoring truth
  • local/replica:
    • per-range recovery logs in Colossus clusters
  • cached/derived:
    • compacted/reformatted log files and long-term repository
  • repair:
    • replay from logs into leaves during recovery or range movement

Core invariant #

  • the system may trade strict immediate durability for timely in-memory availability, but recovery state must be sufficient to reconstruct recent accepted leaf data within the designed fault model

6. Storage Trace: Query Execution With Field Hints Index #

This is Monarch’s core read-side localization trace.

Core lines #

  • field_hints_index_version
  • potentially_relevant_children
  • query_deadline_boundary
  • query_visible_result
  • reader_trying_to_localize_query

Reusable frame mapping #

  • truth locus -> actual leaf data plus live leaf snapshots
  • replica/derived state -> field_hints_index_version
  • durability/freshness boundary -> query_deadline_boundary
  • visibility -> query_visible_result
  • actor trying to write/read/apply -> reader_trying_to_localize_query

Trace #

time →

field_hints_index_version     700 ------------ 701 ---------------- 701 ----------------
potentially_relevant_children unknown -------- zones Z1,Z7 ------- leaves L2,L9,L11 ---
query_deadline_boundary       start ---------- soft zone deadline - hedge leaf fallback -
query_visible_result          none ----------- partial streaming --- final/partial done --
reader_trying_to_localize_query root mixer --- zone mixer -------- leaves + fallback ----

What it teaches #

  • query fanout is constrained first at root, then zone, then leaf
  • lower levels stream partial results upward
  • stale or missing index updates are used as availability signals
  • query visibility may be partial if zones are pruned or leaves are hedged

State loci #

  • authoritative:
    • live leaf snapshots of time series data
  • local/replica:
    • zone and root field hints indexes, leaf-local target indexes
  • cached/derived:
    • relevant-child sets and rewritten query fragments in mixers
  • repair:
    • fallback leaf reads, zone pruning, next query attempt

Core invariant #

  • query execution must localize to a safe superset of relevant children and may return partial results, but it must surface pruning/incompleteness rather than silently pretending completeness

7. Storage Trace: Standing Query Materialized Output #

This is Monarch’s main snapshot/projection trace.

Core lines #

  • standing_query_definition_version
  • derived_output_version
  • evaluation_period_boundary
  • consumer_visible_materialized_state
  • writer_trying_to_publish_materialized_output

Reusable frame mapping #

  • truth locus -> standing query definition plus source time series
  • replica/derived state -> derived_output_version
  • durability/freshness boundary -> evaluation_period_boundary
  • visibility -> consumer_visible_materialized_state
  • actor trying to write/read/apply -> writer_trying_to_publish_materialized_output

Trace #

time →

standing_query_definition_version 88 ------------- 88 ------------- 89 --------------
derived_output_version            v120 ----------- v121 ----------- v121 ------------
evaluation_period_boundary        t0-t1 ---------- t1-t2 ---------- next period -----
consumer_visible_materialized_state v120 --------- v121 ----------- v121 ------------
writer_trying_to_publish_materialized_output eval run 121 ------- leaf write ------- waiting next run

What it teaches #

  • standing queries create durable-ish materialized time series inside Monarch
  • freshness is period-bound, not instantaneous
  • definition changes and output generations are separate axes

State loci #

  • authoritative:
    • source input tables and standing query definition
  • local/replica:
    • evaluator shard state and leaf-written output replicas
  • cached/derived:
    • materialized standing-query outputs queried later by users
  • repair:
    • next scheduled evaluation or re-run after transient failure

Core invariant #

  • consumers of a standing query must observe output that corresponds to some completed evaluation period of the current or explicitly identified prior query definition

8. Storage Trace: Global Config Truth To Zonal Mirror And Cache #

This is Monarch’s storage-side configuration consistency trace.

Core lines #

  • global_config_commit_version
  • zonal_mirror_version
  • local_cache_version
  • staleness_boundary
  • reader_or_writer_trying_to_use_config

Reusable frame mapping #

  • truth locus -> global_config_commit_version
  • replica/derived state -> zonal_mirror_version and local_cache_version
  • durability/freshness boundary -> staleness_boundary
  • visibility -> component-visible config
  • actor trying to write/read/apply -> reader_or_writer_trying_to_use_config

Trace #

time →

global_config_commit_version   510 ------------ 511 ---------------- 511 ----------------
zonal_mirror_version           510 ------------ 510 ---------------- 511 ----------------
local_cache_version            510 ------------ 510 ---------------- 510/511 mix --------
staleness_boundary             healthy -------- mirror lag allowed -- cache refresh due ---
reader_or_writer_trying_to_use_config leaves read ----- mirrors copy ------ components refresh

What it teaches #

  • configuration has a strongly committed global truth
  • zonal mirrors and local caches may lag
  • components continue to operate under stale config when mirrors fail

State loci #

  • authoritative:
    • global Spanner-backed config state
  • local/replica:
    • zonal mirrors
  • cached/derived:
    • in-memory component caches
  • repair:
    • periodic refresh and mirror catch-up

Core invariant #

  • stale cached configuration may be served temporarily for availability, but config versions must advance monotonically from globally committed state