Monarch Process And Storage Traces #
Source: monarch.txt
Monarch is primarily a storage system:
- monitoring samples are ingested into regional in-memory leaves
- data is sharded by target ranges and replicated across leaves
- recovery logs and long-term repository back the in-memory store
- queries execute over leaves through mixers and index servers
- standing queries materialize derived time series back into Monarch
- configuration is globally authored and zonally mirrored/cached
Its process traces are mostly about:
- collection routing
- range assignment and movement
- standing query execution
- configuration distribution
Its storage traces are mostly about:
- in-memory leaf state plus best-effort recovery logging
- query-visible leaf/zone/root index state
- standing-query materialized output
- globally committed configuration plus zonal mirrors/caches
It is not primarily a workflow/control-plane execution fleet like Borg.
Hot Paths #
The main hot paths in Monarch are:
client write -> ingestion router -> leaf router -> leaf replicas- this is the primary ingestion path and one of the hottest paths in the system
query -> root/zone mixer -> index servers -> leaves -> partial aggregation -> stream results up- this is the primary read/query path
standing query evaluator -> mixers -> leaves -> write derived results back- a major internal read+write path
Supporting but less hot:
range movement / recovery during load balancingfield hints index streaming updatesconfiguration commit -> mirror propagation -> zonal cache refresh
Scaling Lens #
1. Collection / Ingestion #
flow- metric writes / deltas from monitored targets
containers- routing containers:
- ingestion routers
- leaf routers
- partition containers:
- target ranges
- replication containers:
- three leaf replicas per range
- routing containers:
drains- network drains:
- router forwarding bandwidth
- storage drains:
- leaf in-memory append/apply throughput
- recovery-log write bandwidth
- control drains:
- range-map propagation from leaves to routers
- network drains:
pressure signals- router throughput saturation
- hot target-range concentration
- leaf CPU/memory pressure
- delayed or rejected writes outside admission window
- recovery-log lag during movement/recovery
relief types- split / move target ranges
- add leaves horizontally within zone
- collection aggregation to shrink write volume
- lexicographic sharding by target
- temporary bucket shrink during range movement
2. Query Execution #
flow- ad hoc queries and standing-query reads
containers- routing containers:
- root mixers
- zone mixers
- index containers:
- root and zone field hints indexes
- leaf-local target index
- partition containers:
- relevant leaf sets / target ranges
- routing containers:
drains- control drains:
- query planning, pushdown, replica resolution
- network drains:
- subquery fanout and streaming partial results upward
- execution drains:
- leaf-level join/filter/group-by work
- control drains:
pressure signals- query fanout explosion
- root/zone mixer latency growth
- slow or missing zones
- slow leaf replicas
- memory pressure from large query result sets
relief types- query pushdown to zone and leaf
- field hints index pruning
- replica resolution to best-quality leaf
- zone pruning on soft deadlines
- hedged reads against fallback leaves
- per-user memory and CPU isolation
3. Standing Query Materialization #
flow- periodic derived-query runs that read source data and write condensed outputs
containers- queue containers:
- evaluator shard workload
- execution containers:
- zone/root evaluator shards
- mixers and leaves
- snapshot containers:
- materialized output time series
- queue containers:
drains- execution drains:
- evaluator and mixer computation
- storage drains:
- writes of derived output back into leaves
- network drains:
- query result streaming
- execution drains:
pressure signals- evaluator backlog
- missed evaluation periods / deadline overruns
- excessive write amplification from derived outputs
- cross-zone query latency
relief types- zone-local execution for most standing queries
- pushdown to reduce transferred and materialized volume
- sharding evaluators by standing-query hash
- materializing condensed outputs for later reads
4. Configuration Distribution #
flow- config updates from global truth to zonal mirrors and caches
containers- relation/config containers:
- global config store in Spanner
- replication containers:
- zonal mirrors
- cache containers:
- component-local in-memory config caches
- relation/config containers:
drains- control drains:
- validation, transformation, dependency tracking
- network drains:
- mirror propagation and periodic cache refresh
- control drains:
pressure signals- mirror lag
- stale component cache age
- config-application skew across zones
relief types- zonal mirrors for locality and availability
- transformed per-component config to reduce distribution cost
- in-memory caching at leaves/routers/mixers
- tolerate stale cache during mirror outages
Best Compression #
Monarch’s dominant scale story is:
write flowis constrained bytarget-range partitioning + leaf in-memory drainread flowis constrained byquery fanout + mixer/leaf execution drain- the main pressure signals are:
- range hotspots
- query fanout
- mixer latency
- leaf saturation
- evaluator backlog
- the canonical relief moves are:
- target-range splitting/movement
- collection aggregation
- query pushdown
- field-hints pruning
- zone pruning / hedged reads
- sharded evaluators and materialized outputs
1. Process Trace: Ingestion Routing To Leaf Write #
This is the core write-side path.
Core lines #
ingestion_path_statedestination_zonetarget_range_assignment_versionadmission_window_boundaryactor_trying_to_route_or_store
Reusable frame mapping #
state->ingestion_path_stateowner/view->destination_zoneplus current target range owner setmonotonic marker->target_range_assignment_versionvalidity boundary->admission_window_boundaryactor trying to advance->actor_trying_to_route_or_store
Trace #
time →
ingestion_path_state CLIENT_SEND ---- ZONE_ROUTED ---- LEAF_ROUTED ----- STORED
destination_zone unknown -------- zone-us-c1 ----- zone-us-c1 ------ zone-us-c1
target_range_assignment_version 200 ------------ 200 ------------ 201 ------------- 201
admission_window_boundary open ----------- open ----------- delta accepted --- finalized later
actor_trying_to_route_or_store client --------- ingest router -- leaf router ------ leaf replicas
What it teaches #
- writes are regionally routed by location field
- within the zone, writes are routed by target-range ownership
- leaves accept writes into in-memory state and recovery logs
- older deltas can be rejected by the admission window
State loci #
- authoritative:
- leaf replicas storing the target range in memory
- local/execution:
- ingestion router and leaf router forwarding state
- cached/derived:
- location-to-zone map and leaf-router range map
- repair:
- recovery logs and re-routing through updated range ownership
Core invariant #
- a write must only be accepted into the currently assigned leaf replica set for its target range and must respect the admission window for aggregating deltas
2. Process Trace: Range Movement / Intra-zone Load Balancing #
This is Monarch’s main rebalancing path.
Core lines #
range_move_statesource_and_destination_leafrange_assignment_epochrecovery_completion_boundaryactor_trying_to_move_range
Reusable frame mapping #
state->range_move_stateowner/view->source_and_destination_leafmonotonic marker->range_assignment_epochvalidity boundary->recovery_completion_boundaryactor trying to advance->actor_trying_to_move_range
Trace #
time →
range_move_state STABLE --------- DUAL_WRITE -------- RECOVERING ------ SOURCE_UNASSIGNED
source_and_destination_leaf L1 ------------ L1+L9 ------------ L1+L9 ----------- L9
range_assignment_epoch 41 ------------ 42 --------------- 42 --------------- 43
recovery_completion_boundary - ------------- log wait t1 ------ replay catchup --- recovery done
actor_trying_to_move_range assigner ------ dest leaf -------- dest leaf -------- assigner/source leaf
What it teaches #
- load balancing is by moving target ranges between leaves
- destination leaf starts collecting before the source stops
- recovery replays older data from logs in reverse chronological order
- routing continues during transient assigner failure because leaves update routers directly
State loci #
- authoritative:
- leaves actually storing the range and their recovery logs
- local/execution:
- destination leaf recovery worker and source leaf collector
- cached/derived:
- leaf-router range maps and assigner load view
- repair:
- recovery-log replay until destination has complete range state
Core invariant #
- a range move must preserve continuous data availability by overlapping collection and recovery before the source leaf is unassigned
3. Process Trace: Standing Query Evaluation #
This is the main internal derivation/materialization path.
Core lines #
standing_query_stateevaluation_scopestanding_query_versionquery_deadline_or_period_boundaryactor_trying_to_evaluate
Reusable frame mapping #
state->standing_query_stateowner/view->evaluation_scopemonotonic marker->standing_query_versionvalidity boundary->query_deadline_or_period_boundaryactor trying to advance->actor_trying_to_evaluate
Trace #
time →
standing_query_state SCHEDULED ------ DISPATCHED ------ RUNNING ---------- MATERIALIZED
evaluation_scope root/zone? ----- zone evaluator -- zone/root tree --- target output zone
standing_query_version 88 ------------ 88 -------------- 88 --------------- 88
query_deadline_or_period_boundary next tick --- deadline set ---- soft prune/hedge -- write period done
actor_trying_to_evaluate evaluator ------ mixer ---------- leaves/mixers ----- evaluator/leaves
What it teaches #
- standing queries are sharded across evaluators
- most are zone-local because of pushdown and resilience
- results are written back into Monarch as materialized time series
- query execution is deadline-aware and can prune zones or hedge leaf reads
State loci #
- authoritative:
- standing-query definition and resulting written output time series
- local/execution:
- evaluator shard, mixer tree, leaf snapshots
- cached/derived:
- index-guided relevant-child sets, rewritten query plan
- repair:
- next standing-query period or re-evaluation after transient failure
Core invariant #
- a standing query evaluation must operate over a bounded execution snapshot and materialize output according to the current validated standing-query definition
4. Process Trace: Configuration Distribution #
This is Monarch’s main control-plane propagation path.
Core lines #
configuration_distribution_stateglobal_vs_zonal_viewconfig_versionmirror_or_cache_staleness_boundaryactor_trying_to_apply_config
Reusable frame mapping #
state->configuration_distribution_stateowner/view->global_vs_zonal_viewmonotonic marker->config_versionvalidity boundary->mirror_or_cache_staleness_boundaryactor trying to advance->actor_trying_to_apply_config
Trace #
time →
configuration_distribution_state EDITED -------- COMMITTED ------- MIRRORED ------- CACHED/APPLIED
global_vs_zonal_view global draft -- global truth ---- zone mirror ---- zonal component cache
config_version 510 ---------- 511 ------------- 511 ------------ 511
mirror_or_cache_staleness_boundary n/a -------- commit complete -- mirror current -- cache refresh period
actor_trying_to_apply_config user/server --- config server --- mirror --------- leaf/router/mixer
What it teaches #
- config is globally authored and validated in Spanner
- transformed config is mirrored into each zone
- zonal components cache relevant config in memory
- zones keep operating with stale cache if the mirror becomes unavailable
State loci #
- authoritative:
- global configuration server backed by Spanner
- local/execution:
- zonal mirrors and component-local config caches
- cached/derived:
- transformed per-component config subsets
- repair:
- periodic refresh and mirror re-synchronization after failure
Core invariant #
- only globally validated configuration versions may become the source for zonal mirrors and local caches
5. Storage Trace: Leaf In-Memory Store + Recovery Log #
This is Monarch’s core storage trace.
Core lines #
in_memory_leaf_versionrecovery_log_visibilitydurability_boundaryquery_visible_leaf_statewriter_trying_to_append_point
Reusable frame mapping #
truth locus->in_memory_leaf_versionreplica/derived state->recovery_log_visibilitydurability/freshness boundary->durability_boundaryvisibility->query_visible_leaf_stateactor trying to write/read/apply->writer_trying_to_append_point
Trace #
time →
in_memory_leaf_version 900 ------------ 901 ---------------- 902 ----------------
recovery_log_visibility 899 ------------ best-effort append --- compacted/replayed -
durability_boundary memory first ---- log maybe delayed --- long-term async ----
query_visible_leaf_state 900 ------------ 901 ---------------- 902 ----------------
writer_trying_to_append_point none ----------- leaf append --------- collection finalize -
What it teaches #
- the primary serving truth is the in-memory leaf store
- recovery logging is best-effort on the hot path and not synchronously acknowledged
- long-term repository is colder and off the alerting path
- Monarch explicitly prefers timely visibility over strict durability before ack
State loci #
- authoritative:
- in-memory leaf replicas for recent monitoring truth
- local/replica:
- per-range recovery logs in Colossus clusters
- cached/derived:
- compacted/reformatted log files and long-term repository
- repair:
- replay from logs into leaves during recovery or range movement
Core invariant #
- the system may trade strict immediate durability for timely in-memory availability, but recovery state must be sufficient to reconstruct recent accepted leaf data within the designed fault model
6. Storage Trace: Query Execution With Field Hints Index #
This is Monarch’s core read-side localization trace.
Core lines #
field_hints_index_versionpotentially_relevant_childrenquery_deadline_boundaryquery_visible_resultreader_trying_to_localize_query
Reusable frame mapping #
truth locus-> actual leaf data plus live leaf snapshotsreplica/derived state->field_hints_index_versiondurability/freshness boundary->query_deadline_boundaryvisibility->query_visible_resultactor trying to write/read/apply->reader_trying_to_localize_query
Trace #
time →
field_hints_index_version 700 ------------ 701 ---------------- 701 ----------------
potentially_relevant_children unknown -------- zones Z1,Z7 ------- leaves L2,L9,L11 ---
query_deadline_boundary start ---------- soft zone deadline - hedge leaf fallback -
query_visible_result none ----------- partial streaming --- final/partial done --
reader_trying_to_localize_query root mixer --- zone mixer -------- leaves + fallback ----
What it teaches #
- query fanout is constrained first at root, then zone, then leaf
- lower levels stream partial results upward
- stale or missing index updates are used as availability signals
- query visibility may be partial if zones are pruned or leaves are hedged
State loci #
- authoritative:
- live leaf snapshots of time series data
- local/replica:
- zone and root field hints indexes, leaf-local target indexes
- cached/derived:
- relevant-child sets and rewritten query fragments in mixers
- repair:
- fallback leaf reads, zone pruning, next query attempt
Core invariant #
- query execution must localize to a safe superset of relevant children and may return partial results, but it must surface pruning/incompleteness rather than silently pretending completeness
7. Storage Trace: Standing Query Materialized Output #
This is Monarch’s main snapshot/projection trace.
Core lines #
standing_query_definition_versionderived_output_versionevaluation_period_boundaryconsumer_visible_materialized_statewriter_trying_to_publish_materialized_output
Reusable frame mapping #
truth locus-> standing query definition plus source time seriesreplica/derived state->derived_output_versiondurability/freshness boundary->evaluation_period_boundaryvisibility->consumer_visible_materialized_stateactor trying to write/read/apply->writer_trying_to_publish_materialized_output
Trace #
time →
standing_query_definition_version 88 ------------- 88 ------------- 89 --------------
derived_output_version v120 ----------- v121 ----------- v121 ------------
evaluation_period_boundary t0-t1 ---------- t1-t2 ---------- next period -----
consumer_visible_materialized_state v120 --------- v121 ----------- v121 ------------
writer_trying_to_publish_materialized_output eval run 121 ------- leaf write ------- waiting next run
What it teaches #
- standing queries create durable-ish materialized time series inside Monarch
- freshness is period-bound, not instantaneous
- definition changes and output generations are separate axes
State loci #
- authoritative:
- source input tables and standing query definition
- local/replica:
- evaluator shard state and leaf-written output replicas
- cached/derived:
- materialized standing-query outputs queried later by users
- repair:
- next scheduled evaluation or re-run after transient failure
Core invariant #
- consumers of a standing query must observe output that corresponds to some completed evaluation period of the current or explicitly identified prior query definition
8. Storage Trace: Global Config Truth To Zonal Mirror And Cache #
This is Monarch’s storage-side configuration consistency trace.
Core lines #
global_config_commit_versionzonal_mirror_versionlocal_cache_versionstaleness_boundaryreader_or_writer_trying_to_use_config
Reusable frame mapping #
truth locus->global_config_commit_versionreplica/derived state->zonal_mirror_versionandlocal_cache_versiondurability/freshness boundary->staleness_boundaryvisibility-> component-visible configactor trying to write/read/apply->reader_or_writer_trying_to_use_config
Trace #
time →
global_config_commit_version 510 ------------ 511 ---------------- 511 ----------------
zonal_mirror_version 510 ------------ 510 ---------------- 511 ----------------
local_cache_version 510 ------------ 510 ---------------- 510/511 mix --------
staleness_boundary healthy -------- mirror lag allowed -- cache refresh due ---
reader_or_writer_trying_to_use_config leaves read ----- mirrors copy ------ components refresh
What it teaches #
- configuration has a strongly committed global truth
- zonal mirrors and local caches may lag
- components continue to operate under stale config when mirrors fail
State loci #
- authoritative:
- global Spanner-backed config state
- local/replica:
- zonal mirrors
- cached/derived:
- in-memory component caches
- repair:
- periodic refresh and mirror catch-up
Core invariant #
- stale cached configuration may be served temporarily for availability, but config versions must advance monotonically from globally committed state