Alerting and On-call Routing System #

This note models an alerting and on-call routing system: alert events are ingested and deduplicated into incidents, incidents are routed through escalation policies and on-call schedules, and acknowledgments or resolutions change the incident lifecycle and notification behavior.

Step 1 - Normalize #

Assume the baseline prompt is:

design an alerting and on-call routing system
upstream monitors send alert events
system deduplicates related alerts into incidents
incidents route to current on-call responders
acknowledgments, escalations, and resolutions change routing
system scales across many services and teams

Normalize into state-affecting paths.

Requirement	Actor	Operation	State touched	Priority
Upstream system sends alert event	Client	append event	`S1` `create target` `AlertEvent`	C1
System deduplicates or correlates alert into incident	System	state transition	`S1` `update target` `IncidentState`	C1
Admin updates escalation / routing policy	Admin	overwrite state	`S1` `update target` `RoutingPolicy`	C1
Admin updates on-call schedule	Admin	overwrite state	`S1` `update target` `OnCallSchedule`	C1
System computes active responder for incident	System	read source	`S1` `read source target` `OnCallSchedule`	C1
System sends notification attempt	System	append event	`S1` `create target` `NotificationAttempt`	C1
User acknowledges incident	Client	state transition	`S1` `update target` `IncidentState`	C1
User resolves incident	Client	state transition	`S1` `update target` `IncidentState`	C1
System escalates incident after timeout	System	async process	`S1` `hidden write target` `IncidentState`	C1
User applies suppression / silence	Client	state transition	`S1` `update target` `SuppressionRule`	C1
User reads incident list / timeline	Client	read projection	`S1` `read projection target` `IncidentView`	R1
System routes team/tenant shard to current owner	System	read source	`S1` `read source target` `PartitionMap`	C1
System reassigns shard ownership after node failure	System	state transition	`S1` `update target` `PartitionOwnership`	C1

Notes on normalization #

Important choices:

alert ingestion is append event
- raw alert signals are immutable facts
dedup/correlation is state transition
- current incident lifecycle changes as new events arrive
routing policy and schedule are current-value control state
notification send is append event
- each attempt is an immutable delivery fact
ack/resolve/escalate are incident lifecycle transitions
suppression is explicit because it changes routing behavior

This system is a hybrid of:

event ingestion
incident lifecycle state machine
policy-driven routing and escalation

Step 2 - Critical Path Selection #

Requirement	Priority class	Why
Ingest alert event	C1	alert input is the primary signal source
Deduplicate/correlate into incident	C1	correctness depends on grouping and current incident lifecycle
Update routing / escalation policy	C1	future notification decisions depend on current policy
Update on-call schedule	C1	responder selection depends on current schedule
Compute active responder	C1	wrong responder means wrong page delivery
Send notification attempt	C1	delivery attempts are core product behavior
Acknowledge incident	C1	ack changes escalation behavior and who gets notified
Resolve incident	C1	resolution changes future routing and visibility
Escalate incident after timeout	C1	escalation correctness is central to on-call routing
Apply suppression / silence	C1	silences alter routing and dedup behavior
Read incident list / timeline	R1	core serving path
Route to shard owner	C1	wrong routing can split incident truth
Reassign shard ownership	C1	failover must preserve incident lifecycle correctness

Baseline critical paths #

Main C1 paths:

P1 ingest alert event
P2 correlate/dedup into incident
P3 update routing policy
P4 update on-call schedule
P5 compute active responder
P6 send notification attempt
P7 acknowledge incident
P8 resolve incident
P9 escalate incident
P10 apply suppression
P11 route to shard owner
P12 reassign shard ownership

Main R1 path:

P13 read incident list / timeline

This design is driven by:

immutable alert-event ingestion
one authoritative current incident lifecycle
current on-call schedule and policy
timed escalation and ack/resolve suppression of further notifications

Step 3 - Primary State Extraction #

For an alerting and on-call routing system, the minimal primary state is the raw alert event, the current incident lifecycle, the routing policy, the on-call schedule, suppression state, notification attempts, and routing/ownership state.

Object	Source	Needed for C1/R1?	Decomposition action	Class	Primary?	Owner	Evolution	Scope kind	Scope value
AlertEvent	direct noun	Yes	keep as candidate	event	Yes	service	append-only	instance	event_id
IncidentState	lifecycle object	Yes	keep as candidate	process	Yes	service	state machine	instance	incident_key
RoutingPolicy	direct noun	Yes	keep as candidate	entity	Yes	service	overwrite	instance	team_id or service_id
OnCallSchedule	direct noun	Yes	keep as candidate	entity	Yes	service	overwrite	instance	schedule_id
SuppressionRule	direct noun	Yes	keep as candidate	process	Yes	service	state machine	instance	suppression_id
NotificationAttempt	direct noun	Yes	keep as candidate	event	Yes	service	append-only	instance	notification_id
PartitionOwnership	hidden write target	Yes	keep as candidate	process	Yes	service	state machine	instance	shard_id
PartitionMap	hidden write target	Yes	keep as candidate	entity	Yes	service	overwrite	collection	tenant/team shards
IncidentView	derived read model	No	reject as UI artifact	projection	No	derived	overwrite	collection	tenant/team

Important modeling choices #

`AlertEvent` #

Primary because:

upstream alerts are immutable inputs
include source, labels, severity, time, and dedup keys

`IncidentState` #

Primary because:

this is the central lifecycle object
captures states like OPEN, ACKED, ESCALATED, RESOLVED, SUPPRESSED

`RoutingPolicy` #

Primary because:

determines escalation chain, notification channels, and responder routing

`OnCallSchedule` #

Primary because:

current responder selection depends on the schedule timeline

`SuppressionRule` #

Primary because:

silences and maintenance windows materially change whether incidents route at all

`NotificationAttempt` #

Primary because:

each send attempt is an immutable product fact
delivery history and retry logic depend on it

Minimal strict primary set #

The strongest minimal set is:

AlertEvent
IncidentState
RoutingPolicy
OnCallSchedule
SuppressionRule
NotificationAttempt
PartitionOwnership
PartitionMap
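
To make the shape of this state concrete, here is a minimal sketch of three of the objects as Python dataclasses. The class names and lifecycle states come from the note; the field types, the `occurred_at` unit, and the sample values are assumptions for illustration only.

```python
from dataclasses import dataclass
from enum import Enum


class Lifecycle(Enum):
    # States named in the IncidentState notes.
    OPEN = "OPEN"
    ACKED = "ACKED"
    ESCALATED = "ESCALATED"
    RESOLVED = "RESOLVED"
    SUPPRESSED = "SUPPRESSED"


@dataclass(frozen=True)
class AlertEvent:
    # Immutable input fact: frozen so it cannot be changed after ingest.
    event_id: str
    source: str
    labels: tuple        # sorted (key, value) pairs (assumed representation)
    severity: str
    occurred_at: float   # epoch seconds (assumed unit)
    dedup_key: str


@dataclass
class IncidentState:
    # Mutable current-value object; version supports guarded transitions.
    incident_key: str
    state: Lifecycle = Lifecycle.OPEN
    version: int = 0
    escalation_step: int = 0


@dataclass(frozen=True)
class NotificationAttempt:
    # Immutable delivery fact, one per send attempt.
    notification_id: str
    incident_key: str
    recipient: str
    channel: str
    escalation_step: int


event = AlertEvent("e-1", "monitor-a", (("service", "api"),), "critical", 0.0, "api:latency")
incident = IncidentState(incident_key=event.dedup_key)
```

The split between frozen and mutable classes mirrors the append-only versus state-machine evolution columns in the table above.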

Step 4 - Hard Invariants #

For an alerting and on-call routing system, the hard invariants are about immutable alert ingestion, one authoritative current incident lifecycle, valid policy/schedule-based responder selection, and guarded ack/escalation/resolution transitions.

Path	Tier	Type	Invariant statement
`P1` ingest alert event	HARD	uniqueness	Key `event_id` maps to at most one logical outcome `stored alert event` within event scope.
`P2` correlate/dedup into incident	HARD	uniqueness	Key `incident_key` maps to at most one logical outcome `current authoritative incident state` within incident scope.
`P2` correlate/dedup into incident	HARD	accounting	`IncidentState(incident_key)` reflects the authoritative aggregation of relevant alert events modulo suppression and resolution policy.
`P3` update routing policy	HARD	ordering	Routing-policy revisions are ordered by monotonic version within policy scope.
`P4` update on-call schedule	HARD	ordering	Schedule revisions are ordered by monotonic version within schedule scope.
`P5` compute active responder	HARD	eligibility	Action `select_responder` is valid only if current `RoutingPolicy`, `OnCallSchedule`, and `SuppressionRule` allow routing for the incident at decision time.
`P6` send notification attempt	HARD	accounting	`NotificationAttempt` corresponds to a valid incident, target recipient/channel, and escalation step derived from current policy evaluation.
`P7` acknowledge incident	HARD	eligibility	Action `ack_incident` is valid only if current `IncidentState` allows acknowledgment and caller is authorized at decision time.
`P8` resolve incident	HARD	eligibility	Action `resolve_incident` is valid only if current `IncidentState` allows resolution at decision time.
`P9` escalate incident	HARD	eligibility	Action `escalate_incident` is valid only if incident remains unresolved/unacked, escalation timeout has elapsed, and current policy still requires escalation at decision time.
`P10` apply suppression	HARD	ordering	Suppression-rule revisions are ordered by monotonic version within suppression scope.
`P11` route to shard owner	HARD	uniqueness	Key `shard_id` maps to at most one logical outcome `current authoritative owner` within `shard_id`.
`P12` reassign shard ownership	HARD	eligibility	Action `reassign_shard` is valid only if `current owner has failed or relinquished ownership, and candidate owner is eligible and sufficiently current` on `shard_id` at decision time.
`P13` read incident list / timeline	HARD	freshness	Incident read path reflects authoritative incident and notification state within configured consistency bound.

What matters most #

1. One authoritative incident lifecycle per dedup key #

This prevents split acknowledgment, split escalation, and duplicate routing decisions.

2. Routing must respect current policy, schedule, and suppression #

A stale schedule pages the wrong person; stale silence data either pages someone during a silence or suppresses a page that should go out.

3. Escalation is guarded by current state #

If an incident is already acked or resolved, escalation must not continue.
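
One way to express this guard is an explicit table of allowed lifecycle edges. The sketch below uses the state names from this note; the specific set of edges (for example, that `SUPPRESSED` can return to `OPEN`) is an assumption, not something the note specifies.

```python
# Hypothetical transition table: state names come from the note,
# the allowed edges are assumed for illustration.
ALLOWED = {
    "OPEN": {"ACKED", "ESCALATED", "RESOLVED", "SUPPRESSED"},
    "ESCALATED": {"ACKED", "ESCALATED", "RESOLVED", "SUPPRESSED"},
    "ACKED": {"RESOLVED"},
    "SUPPRESSED": {"OPEN", "RESOLVED"},
    "RESOLVED": set(),
}


def can_transition(current: str, target: str) -> bool:
    """Return True only if the lifecycle allows current -> target."""
    return target in ALLOWED.get(current, set())


# Escalation is rejected once the incident is acked or resolved.
ack_then_escalate = can_transition("ACKED", "ESCALATED")
resolved_then_escalate = can_transition("RESOLVED", "ESCALATED")
open_then_ack = can_transition("OPEN", "ACKED")
```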

4. Notification attempts are facts, not current state #

The attempt log is append-only; current incident routing state is separate.

Step 5 - Execution Context #

For the baseline alerting platform:

Field	Value	Why
Topology	single service distributed	one logical alerting system spread across ingest, routing, and state nodes
Write coordination scope	per object scope	correctness is per incident, policy, schedule, suppression, and shard ownership scope
Read consistency target	bounded stale allowed	UI views can tolerate small lag, but routing decisions should use authoritative current state
Holder model	none	no long-lived client-owned lease is central to routing correctness
Compensation acceptable?	No	wrong routing or stale escalation cannot be safely repaired afterward

Derived implications #

holder_may_crash = false
- shard workers may fail, but no client holds an incident under a temporary lease the way a queue consumer holds a message
cross_service_write = false
- baseline keeps incident, policy, schedule, and ownership state within one logical service
bounded_staleness_allowed = true
- list/timeline reads can tolerate bounded lag, but routing/evaluation should be authoritative
cross_service_atomicity_required = false
- no multi-service transaction across unrelated services in baseline
exclusive_claim_required = true
- shard ownership must be exclusive
guarded_by_current_state = true
- dedup, ack, resolve, and escalate all depend on current incident state

What this implies #

This pushes us toward:

one authoritative owner per incident/routing shard
append-oriented alert-event and notification-attempt logs
current-value policy/schedule state
guarded incident lifecycle transitions for ack/resolve/escalate

Step 6 - Deterministic Mechanism Selection #

Path	Write shape	Base mechanism	Required companions
`P1` ingest alert event	append-only event	append log	dedup key / request id
`P2` correlate/dedup into incident	guarded state transition	single writer per shard or CAS on `(state, version)`	incident key, lifecycle version
`P3` update routing policy	overwrite current value	CAS on version	policy version
`P4` update on-call schedule	overwrite current value	CAS on version	schedule version
`P5` compute active responder	read source	direct source read	current schedule expansion / timezone logic
`P6` send notification attempt	append-only event	append log	notification id, channel dedup policy
`P7` acknowledge incident	guarded state transition	CAS on `(state, version)`	authz, incident version
`P8` resolve incident	guarded state transition	CAS on `(state, version)`	incident version
`P9` escalate incident	guarded state transition	single writer timer-driven transition or CAS on `(state, version)`	escalation step, timeout epoch
`P10` apply suppression	overwrite current value	CAS on version	suppression version
`P11` route to shard owner	exclusive claim	lease	fencing token, heartbeat
`P12` reassign shard ownership	guarded state transition	CAS on `(state, version)`	fencing token, shard catch-up check

Why these fit #

Alert events and notification attempts #

These are immutable product facts, so append-only fits.
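
A minimal sketch of an append-only log that tolerates upstream resends, assuming the record id (`event_id` or `notification_id`) is the idempotency key. `AppendLog` is a hypothetical in-memory stand-in; a real deployment would use durable storage.

```python
class AppendLog:
    """In-memory stand-in for a durable append-only log (illustrative only)."""

    def __init__(self):
        self._entries = []   # ordered; existing entries are never rewritten
        self._offsets = {}   # record id -> offset of the stored entry

    def append(self, record_id: str, payload: dict) -> tuple:
        # A resend of the same id returns the original offset instead of
        # storing a second copy, so upstream retries are safe.
        if record_id in self._offsets:
            return self._offsets[record_id], False
        offset = len(self._entries)
        self._entries.append((record_id, dict(payload)))
        self._offsets[record_id] = offset
        return offset, True

    def read_from(self, offset: int) -> list:
        # Returns a copy so callers cannot mutate the log.
        return list(self._entries[offset:])


log = AppendLog()
first = log.append("evt-1", {"severity": "critical"})
retry = log.append("evt-1", {"severity": "critical"})
```

Here `first` reports a new entry and `retry` reports the same offset with no new entry, which matches the "at most one stored alert event per `event_id`" invariant.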

Incident correlation and lifecycle changes #

Ack, resolve, dedup aggregation, and escalation all depend on current state, so guarded transitions fit.
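
The CAS on `(state, version)` named in the table can be sketched as follows. `IncidentStore` and its method names are hypothetical, and the in-process lock stands in for whatever atomic conditional write the real store provides.

```python
import threading


class IncidentStore:
    """In-memory stand-in for the authoritative incident-state store."""

    def __init__(self):
        self._lock = threading.Lock()
        self._rows = {}  # incident_key -> (state, version)

    def create(self, key: str) -> None:
        with self._lock:
            self._rows.setdefault(key, ("OPEN", 1))

    def get(self, key: str) -> tuple:
        with self._lock:
            return self._rows[key]

    def compare_and_set(self, key, expected_state, expected_version, new_state) -> bool:
        # The write succeeds only if nobody changed the row since it was read.
        with self._lock:
            if self._rows.get(key) != (expected_state, expected_version):
                return False
            self._rows[key] = (new_state, expected_version + 1)
            return True


store = IncidentStore()
store.create("api:latency")
state, version = store.get("api:latency")
ack_ok = store.compare_and_set("api:latency", state, version, "ACKED")
# A timer that read the row before the ack now holds a stale (state, version).
escalate_ok = store.compare_and_set("api:latency", state, version, "ESCALATED")
```

The ack wins and the stale escalation loses, which is the behavior the failure-handling table later describes as "stale update loses guarded transition".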

Policies and schedules #

These are current-value control state, so overwrite fits.

Shard routing #

Routing correctness depends on one current owner per shard, so exclusive claim fits.
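
A rough sketch of a lease with a fencing token, assuming tokens are issued monotonically and that storage rejects any write carrying an older token. `LeaseTable` and `ShardStorage` are hypothetical names; time is passed in explicitly to keep the example deterministic.

```python
class LeaseTable:
    """Illustrative lease store: one owner per shard, monotonic fencing token."""

    def __init__(self):
        self._leases = {}      # shard_id -> (owner, token, expires_at)
        self._next_token = 1

    def acquire(self, shard_id, owner, now, ttl):
        current = self._leases.get(shard_id)
        if current and current[0] != owner and current[2] > now:
            return None        # another owner still holds a live lease
        token = self._next_token
        self._next_token += 1
        self._leases[shard_id] = (owner, token, now + ttl)
        return token


class ShardStorage:
    """Rejects writes that carry a token older than the newest one seen."""

    def __init__(self):
        self.highest_token = 0
        self.writes = []

    def write(self, token, value) -> bool:
        if token < self.highest_token:
            return False       # stale owner is fenced off
        self.highest_token = token
        self.writes.append(value)
        return True


leases = LeaseTable()
storage = ShardStorage()
old = leases.acquire("shard-7", "node-a", now=0, ttl=10)
new = leases.acquire("shard-7", "node-b", now=11, ttl=10)  # node-a's lease expired
new_write = storage.write(new, "incident update from node-b")
late_write = storage.write(old, "late update from node-a")
```

The lease alone is not enough because an old owner may not notice its lease expired; the token check at the storage layer is what blocks its late write.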

Canonical substrate implied #

The baseline now points to:

sharded incident-routing service
one owner per team or incident shard
immutable alert-event and notification-attempt logs
current incident lifecycle state
current policy/schedule/suppression state
timer-driven escalation transitions

Step 7 - Read Model / Source of Truth #

For an alerting and on-call routing system, truth is mostly direct source state. Incident lists and dashboards are derived views.

Concept	Truth	Read path	Rebuild path
`C1` raw alert inputs	`AlertEvent`	read source directly	authoritative alert-event store
`C2` current incident lifecycle	`IncidentState`	read source directly	authoritative incident-state store
`C3` routing / escalation policy	`RoutingPolicy`	read source directly	authoritative policy store
`C4` on-call schedule	`OnCallSchedule`	read source directly	authoritative schedule store
`C5` suppression / silence state	`SuppressionRule`	read source directly	authoritative suppression store
`C6` notification history	`NotificationAttempt`	read source directly	authoritative notification-attempt store
`C7` shard ownership	`PartitionOwnership`	read source directly	authoritative ownership store
`C8` shard routing map	`PartitionMap`	read source directly	authoritative routing metadata
`C9` incident lists / dashboards	derived from incidents and notification history	materialized view	recompute from authoritative state

Important point #

For the core semantics:

routing decisions read authoritative incident, policy, schedule, and suppression state
incident UI can be a projection
notification history is append-only source truth, not just derived telemetry
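
As an illustration of the rebuild path for incident lists, the sketch below recomputes a view from authoritative incident state and the notification-attempt log. The dictionary shapes and the sort order are assumptions made for the example.

```python
def rebuild_incident_view(incidents: dict, attempts: list) -> list:
    """Recompute list rows from authoritative state; safe to rerun at any time."""
    counts = {}
    for attempt in attempts:
        key = attempt["incident_key"]
        counts[key] = counts.get(key, 0) + 1
    rows = [
        {"incident_key": key, "state": inc["state"], "notifications": counts.get(key, 0)}
        for key, inc in incidents.items()
    ]
    # Unresolved incidents first, then by key for a stable listing.
    rows.sort(key=lambda r: (r["state"] == "RESOLVED", r["incident_key"]))
    return rows


incidents = {"db:disk": {"state": "RESOLVED"}, "api:latency": {"state": "OPEN"}}
attempts = [{"incident_key": "api:latency"}, {"incident_key": "api:latency"}]
view = rebuild_incident_view(incidents, attempts)
```

Because the function only reads source state, a lagging or corrupted projection can be discarded and recomputed without affecting routing decisions.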

Step 8 - Failure Handling #

Path	Retry	Competing writers	Crash after commit	Publish failure	Stale holder
`P1` ingest alert event	upstream retry may resend same event; dedup key required	concurrent ingests coexist because each event is an independent append	committed alert events survive crash if durable	n/a	stale shard owner blocked by fencing token
`P2` correlate/dedup into incident	retry safe with incident version	concurrent dedup/lifecycle updates resolved by single shard owner or CAS	committed incident transition survives crash if persisted	incident projection may lag	stale shard owner blocked by fencing token
`P3` update routing policy	retry with policy version	stale update loses CAS	committed policy survives crash if persisted	config propagation may lag	n/a
`P4` update on-call schedule	retry with schedule version	stale update loses CAS	committed schedule survives crash if persisted	precomputed schedule expansion may lag	n/a
`P6` send notification attempt	retry may resend channel message unless notification-id or provider dedup is used	competing senders should be fenced by incident/shard ownership	committed attempt survives crash if persisted	external provider failure can trigger retry	stale routing worker blocked by ownership/version discipline
`P7` acknowledge incident	retry with incident version	stale ack loses guarded transition	committed ack survives crash if persisted	notification cancellation may lag but future escalation reads current state	n/a
`P8` resolve incident	retry with incident version	stale resolve loses guarded transition	committed resolve survives crash if persisted	downstream cleanup may lag	n/a
`P9` escalate incident	timer-driven retry safe with incident version/escalation step	only one escalation transition should win current state	committed escalation survives crash if persisted	notification send may retry independently	stale timer worker blocked by ownership/version discipline
`P10` apply suppression	retry with suppression version	stale update loses CAS	committed suppression survives crash if persisted	routing cache may lag	n/a
`P11` route to shard owner	retry after refreshing shard map	only one valid owner should exist	if owner changed, refreshed map points to new owner	n/a	stale owner rejected by fencing token
`P12` reassign shard ownership	retry failover transition safely	only one reassignment wins current ownership state	promoted owner crash triggers later reassignment	n/a	old owner fenced and must not continue serving
`P13` read incident list / timeline	read retry safe	many readers coexist	node crash drops query only	n/a	stale list bounded by configured projection freshness

What matters most #

1. Dedup key correctness #

If dedup grouping is wrong, incident lifecycle and routing are wrong even if everything else works.
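
A minimal sketch of deriving a grouping key, assuming the policy names a set of `group_by` labels and that labels outside that set must not influence grouping. The function name, the separator, and the 16-character truncation are illustrative choices.

```python
import hashlib


def incident_key(tenant: str, labels: dict, group_by: tuple) -> str:
    """Derive a stable grouping key from the labels a policy groups by."""
    # Only the group_by labels participate; volatile labels (for example a
    # pod name) are ignored so they do not split one incident into many.
    parts = [tenant] + [f"{name}={labels.get(name, '')}" for name in sorted(group_by)]
    digest = hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()
    return digest[:16]


group_by = ("service", "alertname")
a = incident_key("t1", {"service": "api", "alertname": "HighLatency", "pod": "api-1"}, group_by)
b = incident_key("t1", {"service": "api", "alertname": "HighLatency", "pod": "api-2"}, group_by)
c = incident_key("t1", {"service": "db", "alertname": "HighLatency", "pod": "db-1"}, group_by)
```

Alerts `a` and `b` differ only in a non-grouping label and share a key, while `c` belongs to a different service and gets its own incident.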

2. Escalation must re-check current incident state #

A pending timer should not escalate an incident that is already acked, silenced, or resolved.
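
The re-check can be sketched as a handler that runs when a timer fires. The dictionary fields (`step`, `due_at`, `escalation_step`) and the return values are hypothetical; the point is that every condition is evaluated against current state at fire time, not at scheduling time.

```python
def handle_escalation_timer(incident: dict, timer: dict, now: float, suppressed: bool) -> str:
    """Decide what a fired escalation timer should do (illustrative)."""
    if incident["state"] in ("ACKED", "RESOLVED"):
        return "drop"            # someone already responded
    if suppressed:
        return "drop"            # an active silence covers this incident
    if timer["step"] != incident["escalation_step"]:
        return "drop"            # timer belongs to an older escalation step
    if now < timer["due_at"]:
        return "reschedule"      # fired early; deadline has not elapsed
    incident["escalation_step"] += 1
    incident["state"] = "ESCALATED"
    return "escalate"


incident = {"state": "OPEN", "escalation_step": 0}
timer = {"step": 0, "due_at": 100.0}
first = handle_escalation_timer(incident, timer, now=100.0, suppressed=False)
# The same timer delivered twice must not escalate a second time.
replay = handle_escalation_timer(incident, timer, now=101.0, suppressed=False)
```

Comparing the timer's step with the incident's current step is what makes a duplicate or delayed timer delivery harmless.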

3. Notification delivery and incident truth are separate #

Notification attempts are append-only facts; current incident state determines whether more attempts should occur.

4. Schedule correctness is time-dependent #

Responder selection depends on timezone, handoff windows, overrides, and schedule version.
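
A small sketch of responder selection over expanded schedule windows, assuming windows are stored as timezone-aware half-open intervals and that overrides take precedence over base shifts. The window shape and the names `alice`, `bob`, and `carol` are made up for the example.

```python
from datetime import datetime, timedelta, timezone


def active_responder(shifts, overrides, at):
    """Return the responder on call at `at`; overrides win over base shifts."""
    if at.tzinfo is None:
        raise ValueError("use timezone-aware datetimes")
    for window in overrides + shifts:
        # Half-open [start, end) so a handoff instant belongs to the incoming shift.
        if window["start"] <= at < window["end"]:
            return window["responder"]
    return None


utc = timezone.utc
shifts = [
    {"responder": "alice", "start": datetime(2024, 1, 1, 0, tzinfo=utc), "end": datetime(2024, 1, 1, 12, tzinfo=utc)},
    {"responder": "bob", "start": datetime(2024, 1, 1, 12, tzinfo=utc), "end": datetime(2024, 1, 2, 0, tzinfo=utc)},
]
overrides = [
    {"responder": "carol", "start": datetime(2024, 1, 1, 10, tzinfo=utc), "end": datetime(2024, 1, 1, 11, tzinfo=utc)},
]
# 07:00 at UTC-5 is the same instant as 12:00 UTC, the handoff to bob.
local_handoff = datetime(2024, 1, 1, 7, 0, tzinfo=timezone(timedelta(hours=-5)))
at_handoff = active_responder(shifts, overrides, local_handoff)
during_override = active_responder(shifts, overrides, datetime(2024, 1, 1, 10, 30, tzinfo=utc))
```

Rejecting naive datetimes up front avoids the class of bug where a local wall-clock time is silently compared against UTC windows.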

Step 9 - Scale Adjustments #

Hotspot	Type	First response
very high alert-event rate	write throughput hotspot	shard by tenant/team/incident key and add more ingest/routing owners
noisy incident dedup groups	contention hotspot	isolate hot incident keys and tune grouping windows
schedule lookups and expansion	read hotspot	precompute near-future schedule windows and cache per team
notification storms during major outage	contention hotspot	batch similar incidents, apply suppression, and rate-limit channel sends
timer-driven escalation scans	write throughput hotspot	bucket incident escalation deadlines and scan incrementally
incident dashboard load	read hotspot	use projections and caching for list/timeline views

What scales well #

This system scales by:

sharding incident and routing state by tenant/team/service
keeping alert-event and notification-attempt logs append-oriented
precomputing active on-call windows
using due-time indexes for escalations and reminders
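
The due-time index can be sketched with a min-heap keyed by deadline. `DueTimeIndex` is a hypothetical in-memory structure; a production version would need durable storage and per-shard partitioning.

```python
import heapq


class DueTimeIndex:
    """Min-heap of (due_at, incident_key, step); illustrative, in-memory."""

    def __init__(self):
        self._heap = []

    def schedule(self, due_at: float, incident_key: str, step: int) -> None:
        heapq.heappush(self._heap, (due_at, incident_key, step))

    def pop_due(self, now: float) -> list:
        # Only entries whose deadline has passed are touched, so the cost
        # tracks the number of due timers rather than all open incidents.
        due = []
        while self._heap and self._heap[0][0] <= now:
            due.append(heapq.heappop(self._heap))
        return due


index = DueTimeIndex()
index.schedule(300.0, "db:disk", 0)
index.schedule(120.0, "api:latency", 0)
index.schedule(600.0, "api:latency", 1)
due_now = index.pop_due(now=300.0)
```

Popped entries are candidates only: each one still has to pass the current-state check before an escalation actually happens.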

What fails first #

Usually:

noisy alert storms
incorrect dedup grouping
expensive schedule expansion at runtime
notification-provider rate limits
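
For the provider rate-limit case, one common control is a token bucket per channel. The sketch below is an assumption about how such a limiter might look, not a description of any provider's API; the clock is passed in so the behavior is deterministic.

```python
class TokenBucket:
    """Per-channel send limiter (illustrative); time is supplied by the caller."""

    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Refill in proportion to elapsed time, capped at burst capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


bucket = TokenBucket(rate_per_sec=1.0, burst=2)
results = [bucket.allow(0.0), bucket.allow(0.0), bucket.allow(0.0), bucket.allow(1.0)]
```

A send that is refused here would be deferred or batched rather than dropped, since the incident state still says a notification is owed.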

Canonical design conclusion #

The mechanical outcome is:

primary state:
- AlertEvent
- IncidentState
- RoutingPolicy
- OnCallSchedule
- SuppressionRule
- NotificationAttempt
- PartitionOwnership
- PartitionMap
critical invariants:
- immutable alert-event ingestion
- one authoritative current incident lifecycle per dedup key
- routing decisions valid only under current policy, schedule, and suppression state
- ack/resolve/escalate are guarded by current incident state
- exclusive shard ownership for incident/routing state
mechanisms:
- append log
- single writer per shard
- guarded incident lifecycle transitions
- current-value policy/schedule state
- fenced shard ownership
reads:
- direct authoritative reads for routing decisions
- projections for incident lists, timelines, and dashboards

Polished interview answer #

I’d design the alerting platform as a sharded incident-routing service with one authoritative owner per team or incident shard. Upstream monitors append immutable alert events, and shard owners correlate them into a current incident lifecycle keyed by dedup rules. Routing decisions read current escalation policy, current on-call schedule, and any active suppressions to choose the right responder, while notification attempts are logged as immutable delivery facts. Acknowledge, resolve, and timed escalation are guarded incident-state transitions, so escalation only proceeds if the incident is still active and unacked. The main scaling levers are more shards, precomputed schedule windows, due-time indexes for escalation deadlines, and controls for noisy alert storms.

Concrete Substrate #

I’ll choose a sharded incident-routing service with append-oriented alert-event and notification-attempt logs plus current incident/policy/schedule state as the concrete baseline, because it matches the mechanics we derived:

append-only alert ingestion
guarded incident lifecycle
current-value policy/schedule/suppression state
timer-driven escalation
one owner per shard

Concrete tech family:

routing service in Go or Java
durable metadata/state storage:
- replicated DB or RocksDB-backed service state
shard replication:
- Raft or leader-follower replication with commit index
metadata/control:
- etcd or internal metadata quorum for shard ownership/routing
external notification adapters:
- email, SMS, push, Slack, PagerDuty-style connectors

Each shard owner stores:

alert-event log for owned scope
current IncidentState
current RoutingPolicy
current OnCallSchedule / expanded active windows
current SuppressionRule
notification-attempt history
escalation due-time index

Operation Layer #

1. Ingest alert event #

API

IngestAlert(event)

Initiator

upstream monitor / alert source

Entry point

ingestion API / routing frontend

Authoritative decider

shard owner for tenant/team/incident key

Precondition

event parsed and routed to correct shard

Transition

append AlertEvent
create or update IncidentState via dedup/correlation rules

Response

success / dedup-applied

2. Acknowledge incident #

API

AckIncident(incident_id, actor, expected_version?)

Initiator

user/client

Entry point

incident API

Authoritative decider

shard owner for incident

Precondition

incident currently ackable
actor authorized

Transition

guarded update IncidentState: OPEN/ESCALATED -> ACKED

3. Resolve incident #

API

ResolveIncident(incident_id, actor, expected_version?)

Initiator

user/client

Entry point

incident API

Authoritative decider

shard owner for incident

Precondition

incident currently resolvable

Transition

guarded update IncidentState -> RESOLVED

4. Escalate incident #

API

internal escalation timer flow

Initiator

system

Entry point

shard owner

Authoritative decider

shard owner

Precondition

escalation deadline elapsed
incident still unresolved/unacked and unsuppressed

Transition

guarded update IncidentState to next escalation step
append NotificationAttempt for next responder set

5. Update on-call schedule #

API

PutSchedule(schedule_id, schedule, expected_version?)

Initiator

admin

Entry point

config API

Authoritative decider

schedule store owner

Precondition

version matches if optimistic concurrency used

Transition

overwrite OnCallSchedule
optionally recompute near-future active windows
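
A sketch of `PutSchedule` with the optional `expected_version`, assuming that omitting it means an unconditional overwrite and that versions start at 1. `ScheduleStore` and `VersionConflict` are hypothetical names.

```python
class VersionConflict(Exception):
    """Raised when the caller's expected_version is stale."""


class ScheduleStore:
    """In-memory stand-in for the schedule store owner (illustrative)."""

    def __init__(self):
        self._rows = {}  # schedule_id -> (version, schedule)

    def put_schedule(self, schedule_id, schedule, expected_version=None):
        current_version = self._rows.get(schedule_id, (0, None))[0]
        # With expected_version set, a concurrent edit makes this write fail
        # instead of silently overwriting the other admin's change.
        if expected_version is not None and expected_version != current_version:
            raise VersionConflict(f"expected {expected_version}, found {current_version}")
        new_version = current_version + 1
        self._rows[schedule_id] = (new_version, schedule)
        return new_version

    def get(self, schedule_id):
        return self._rows[schedule_id]


schedules = ScheduleStore()
v1 = schedules.put_schedule("team-a", {"primary": "alice"})
v2 = schedules.put_schedule("team-a", {"primary": "bob"}, expected_version=v1)
try:
    schedules.put_schedule("team-a", {"primary": "carol"}, expected_version=v1)
    stale_write_rejected = False
except VersionConflict:
    stale_write_rejected = True
```

Versions increase by one on every successful write, which gives the monotonic ordering that the schedule invariant requires.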

Entry Point vs Decider vs Responder #

Path	Entry point	Authoritative decider	Physical responder	Logical responder
alert ingest	ingestion API	incident shard owner	routing node	alerting system
ack / resolve	incident API	incident shard owner	API node	alerting system
schedule/policy update	config API	config store owner	config node	alerting system
escalation	shard owner timer loop	incident shard owner	routing node / adapter	alerting system
notification send	notification adapter	incident shard owner for decision, provider for delivery	adapter node	alerting system
shard failover	follower / coordination layer	shard quorum / lease store	new leader / control plane	alerting system

Concrete HLD #

Main components:

ingestion/routing frontend
- receives upstream alert events
incident shard owners
- authoritative owners for incident lifecycle, dedup, and escalation
policy/schedule service
- stores escalation policies and on-call schedules
notification adapters
- send pages/messages over external channels
metadata/control service
- tracks shard ownership and routing
incident query UI/API
- serves incident lists and timelines from projections or source state

Short Interview Version #

I’d build the alerting platform as a sharded incident-routing service with one authoritative owner per team or incident shard. Upstream monitors append immutable alert events, and shard owners correlate them into a current incident lifecycle keyed by dedup rules. Routing decisions read current escalation policy, current on-call schedule, and active suppressions to choose the right responder, while notification attempts are logged as immutable delivery facts. Acknowledge, resolve, and timed escalation are guarded incident-state transitions, so escalation only proceeds if the incident is still active and unacked. The main scaling levers are more shards, precomputed schedule windows, due-time indexes for escalation deadlines, and controls for noisy alert storms.
