Alerting and On-call Routing System
This note models an alerting and on-call routing system: alert events are ingested, deduplicated into incidents, and routed through escalation policies and on-call schedules, while acknowledgments and resolutions change the incident lifecycle and notification behavior.
Step 1 - Normalize #
Assume the baseline prompt is:
- design an alerting and on-call routing system
- upstream monitors send alert events
- system deduplicates related alerts into incidents
- incidents route to current on-call responders
- acknowledgments, escalations, and resolutions change routing
- system scales across many services and teams
Normalize into state-affecting paths.
| Requirement | Actor | Operation | State touched | Priority |
|---|---|---|---|---|
| Upstream system sends alert event | Client | append event | S1 create target: AlertEvent | C1 |
| System deduplicates or correlates alert into incident | System | state transition | S1 update target: IncidentState | C1 |
| Admin updates escalation / routing policy | Admin | overwrite state | S1 update target: RoutingPolicy | C1 |
| Admin updates on-call schedule | Admin | overwrite state | S1 update target: OnCallSchedule | C1 |
| System computes active responder for incident | System | read source | S1 read source target: OnCallSchedule | C1 |
| System sends notification attempt | System | append event | S1 create target: NotificationAttempt | C1 |
| User acknowledges incident | Client | state transition | S1 update target: IncidentState | C1 |
| User resolves incident | Client | state transition | S1 update target: IncidentState | C1 |
| System escalates incident after timeout | System | async process | S1 hidden write target: IncidentState | C1 |
| System applies suppression / silence | Client | state transition | S1 update target: SuppressionRule | C1 |
| User reads incident list / timeline | Client | read projection | S1 read projection target: IncidentView | R1 |
| System routes team/tenant shard to current owner | System | read source | S1 read source target: PartitionMap | C1 |
| System reassigns shard ownership after node failure | System | state transition | S1 update target: PartitionOwnership | C1 |
Notes on normalization #
Important choices:
- alert ingestion is `append event`: raw alert signals are immutable facts
- dedup/correlation is `state transition`: current incident lifecycle changes as new events arrive
- routing policy and schedule are current-value control state
- notification send is `append event`: each attempt is an immutable delivery fact
- ack/resolve/escalate are incident lifecycle transitions
- suppression is explicit because it changes routing behavior
This system is a hybrid of:
- event ingestion
- incident lifecycle state machine
- policy-driven routing and escalation
Step 2 - Critical Path Selection #
| Requirement | Priority class | Why |
|---|---|---|
| Ingest alert event | C1 | alert input is the primary signal source |
| Deduplicate/correlate into incident | C1 | correctness depends on grouping and current incident lifecycle |
| Update routing / escalation policy | C1 | future notification decisions depend on current policy |
| Update on-call schedule | C1 | responder selection depends on current schedule |
| Compute active responder | C1 | wrong responder means wrong page delivery |
| Send notification attempt | C1 | delivery attempts are core product behavior |
| Acknowledge incident | C1 | ack changes escalation behavior and who gets notified |
| Resolve incident | C1 | resolution changes future routing and visibility |
| Escalate incident after timeout | C1 | escalation correctness is central to on-call routing |
| Apply suppression / silence | C1 | silences alter routing and dedup behavior |
| Read incident list / timeline | R1 | core serving path |
| Route to shard owner | C1 | wrong routing can split incident truth |
| Reassign shard ownership | C1 | failover must preserve incident lifecycle correctness |
Baseline critical paths #
Main C1 paths:
- P1 ingest alert event
- P2 correlate/dedup into incident
- P3 update routing policy
- P4 update on-call schedule
- P5 compute active responder
- P6 send notification attempt
- P7 acknowledge incident
- P8 resolve incident
- P9 escalate incident
- P10 apply suppression
- P11 route to shard owner
- P12 reassign shard ownership
Main R1 path:
- P13 read incident list / timeline
This design is driven by:
- immutable alert-event ingestion
- one authoritative current incident lifecycle
- current on-call schedule and policy
- timed escalation and ack/resolve suppression of further notifications
Step 3 - Primary State Extraction #
For an alerting and on-call routing system, the minimal primary state is the raw alert event, the current incident lifecycle, the routing policy, the on-call schedule, suppression state, notification attempts, and routing/ownership state.
| Candidate object label | Candidate source | Candidate needed for C1/R1? | Candidate decomposition action | Class | Primary? | Owner | Evolution | Scope kind | Scope value |
|---|---|---|---|---|---|---|---|---|---|
| AlertEvent | direct noun | Yes | keep as candidate | event | Yes | service | append-only | instance | event_id |
| IncidentState | lifecycle object | Yes | keep as candidate | process | Yes | service | state machine | instance | incident_key |
| RoutingPolicy | direct noun | Yes | keep as candidate | entity | Yes | service | overwrite | instance | team_id or service_id |
| OnCallSchedule | direct noun | Yes | keep as candidate | entity | Yes | service | overwrite | instance | schedule_id |
| SuppressionRule | direct noun | Yes | keep as candidate | process | Yes | service | state machine | instance | suppression_id |
| NotificationAttempt | direct noun | Yes | keep as candidate | event | Yes | service | append-only | instance | notification_id |
| PartitionOwnership | hidden write target | Yes | keep as candidate | process | Yes | service | state machine | instance | shard_id |
| PartitionMap | hidden write target | Yes | keep as candidate | entity | Yes | service | overwrite | collection | tenant/team shards |
| IncidentView | derived read model | No | reject as UI artifact | projection | No | derived | overwrite | collection | tenant/team |
Important modeling choices #
AlertEvent #
Primary because:
- upstream alerts are immutable inputs
- include source, labels, severity, time, and dedup keys
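As a rough illustration of an immutable alert record that carries its own dedup key, the sketch below uses a frozen dataclass. The field names, the `DEDUP_LABELS` tuple, and the hashing scheme are assumptions made for this example; the note itself only says that events include source, labels, severity, time, and dedup keys.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone

# Assumption: these labels define incident identity; real grouping rules vary per team.
DEDUP_LABELS = ("service", "alertname", "environment")


@dataclass(frozen=True)
class AlertEvent:
    event_id: str
    source: str
    labels: tuple          # sorted (key, value) pairs, kept as a tuple so the record stays immutable
    severity: str
    occurred_at: datetime

    def incident_key(self) -> str:
        """Derive the dedup key from the identity labels only."""
        label_map = dict(self.labels)
        parts = [f"{name}={label_map.get(name, '')}" for name in DEDUP_LABELS]
        return hashlib.sha256("|".join(parts).encode()).hexdigest()[:16]


def make_event(event_id, source, labels, severity, occurred_at):
    """Build an AlertEvent from a plain label dict."""
    return AlertEvent(event_id, source, tuple(sorted(labels.items())), severity, occurred_at)


first = make_event(
    "evt-1", "prometheus",
    {"service": "api", "alertname": "HighLatency", "environment": "prod", "host": "h1"},
    "critical", datetime(2024, 1, 1, tzinfo=timezone.utc),
)
second = make_event(
    "evt-2", "prometheus",
    {"service": "api", "alertname": "HighLatency", "environment": "prod", "host": "h2"},
    "critical", datetime(2024, 1, 1, 0, 1, tzinfo=timezone.utc),
)
# Different hosts, same identity labels: both events land in the same incident.
print(first.incident_key() == second.incident_key())
```

Labels outside `DEDUP_LABELS` (here `host`) stay on the event for context but do not split the incident.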
IncidentState #
Primary because:
- this is the central lifecycle object
- captures states like OPEN, ACKED, ESCALATED, RESOLVED, SUPPRESSED
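One way to make the lifecycle explicit is a transition table. In the sketch below, only the `OPEN/ESCALATED -> ACKED` edge and the move to `RESOLVED` come from this note; the remaining edges (for example, leaving `SUPPRESSED`) are assumptions chosen for illustration.

```python
from enum import Enum


class IncidentStatus(Enum):
    OPEN = "OPEN"
    ACKED = "ACKED"
    ESCALATED = "ESCALATED"
    RESOLVED = "RESOLVED"
    SUPPRESSED = "SUPPRESSED"


S = IncidentStatus

# Assumed edge set; RESOLVED is treated as terminal here.
ALLOWED_TRANSITIONS = {
    S.OPEN: {S.ACKED, S.ESCALATED, S.RESOLVED, S.SUPPRESSED},
    S.ESCALATED: {S.ACKED, S.ESCALATED, S.RESOLVED, S.SUPPRESSED},
    S.ACKED: {S.RESOLVED},
    S.SUPPRESSED: {S.OPEN, S.RESOLVED},
    S.RESOLVED: set(),
}


def can_transition(current: IncidentStatus, target: IncidentStatus) -> bool:
    """Return True if the lifecycle allows moving from current to target."""
    return target in ALLOWED_TRANSITIONS[current]
```

Keeping the table as data makes it easy to reject a late escalation of an acked incident with a single lookup.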
RoutingPolicy #
Primary because:
- determines escalation chain, notification channels, and responder routing
OnCallSchedule #
Primary because:
- current responder selection depends on the schedule timeline
SuppressionRule #
Primary because:
- silences and maintenance windows materially change whether incidents route at all
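A minimal sketch of how a silence might gate routing, assuming a rule is a set of exact-match label matchers plus a half-open time window. The `matchers`, `starts_at`, and `ends_at` fields are hypothetical; the note does not define the rule shape.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class SuppressionRule:
    suppression_id: str
    matchers: dict        # label name -> required value (exact match, an assumption)
    starts_at: datetime
    ends_at: datetime

    def matches(self, labels: dict, now: datetime) -> bool:
        """A rule applies only inside its window and when every matcher agrees."""
        in_window = self.starts_at <= now < self.ends_at
        return in_window and all(labels.get(k) == v for k, v in self.matchers.items())


def is_suppressed(rules, labels, now) -> bool:
    """Routing should be skipped if any active rule matches the incident labels."""
    return any(rule.matches(labels, now) for rule in rules)


maintenance = SuppressionRule(
    "sup-1",
    {"service": "api", "environment": "staging"},
    datetime(2024, 1, 1, 22, 0, tzinfo=timezone.utc),
    datetime(2024, 1, 2, 2, 0, tzinfo=timezone.utc),
)
```

The check takes `now` as an argument so the routing decision and any test can pin the evaluation time.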
NotificationAttempt #
Primary because:
- each send attempt is an immutable product fact
- delivery history and retry logic depend on it
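The append-only nature of delivery history can be sketched as a small in-memory log. The deterministic `notification_id` format and the `(notification_id, attempt_no)` idempotency key are assumptions for this example, not something the note prescribes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NotificationAttempt:
    notification_id: str
    incident_key: str
    escalation_step: int
    recipient: str
    channel: str
    attempt_no: int
    outcome: str          # e.g. "sent" or "provider_error" (illustrative values)


def notification_id_for(incident_key, escalation_step, recipient, channel):
    """Deterministic id so a retried send maps to the same logical notification."""
    return f"{incident_key}:{escalation_step}:{recipient}:{channel}"


class AttemptLog:
    """Append-only log; entries are never updated or removed."""

    def __init__(self):
        self._entries = []
        self._seen = set()

    def append(self, attempt: NotificationAttempt) -> bool:
        key = (attempt.notification_id, attempt.attempt_no)
        if key in self._seen:
            return False      # same attempt recorded twice, e.g. after a worker retry
        self._seen.add(key)
        self._entries.append(attempt)
        return True

    def history(self, incident_key):
        return [a for a in self._entries if a.incident_key == incident_key]
```

A retry against the provider is recorded as a new `attempt_no` under the same `notification_id`, so history shows every send without mutating earlier facts.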
Minimal strict primary set #
The strongest minimal set is:
- AlertEvent
- IncidentState
- RoutingPolicy
- OnCallSchedule
- SuppressionRule
- NotificationAttempt
- PartitionOwnership
- PartitionMap
Step 4 - Hard Invariants #
For an alerting and on-call routing system, the hard invariants are about immutable alert ingestion, one authoritative current incident lifecycle, valid policy/schedule-based responder selection, and guarded ack/escalation/resolution transitions.
| Path | Tier | Type | Invariant statement |
|---|---|---|---|
| P1 ingest alert event | HARD | uniqueness | Key event_id maps to at most one stored alert event within event scope. |
| P2 correlate/dedup into incident | HARD | uniqueness | Key incident_key maps to at most one current authoritative incident state within incident scope. |
| P2 correlate/dedup into incident | HARD | accounting | IncidentState(incident_key) reflects the authoritative aggregation of relevant alert events modulo suppression and resolution policy. |
| P3 update routing policy | HARD | ordering | Routing-policy revisions are ordered by monotonic version within policy scope. |
| P4 update on-call schedule | HARD | ordering | Schedule revisions are ordered by monotonic version within schedule scope. |
| P5 compute active responder | HARD | eligibility | Action select_responder is valid only if current RoutingPolicy, OnCallSchedule, and SuppressionRule allow routing for the incident at decision time. |
| P6 send notification attempt | HARD | accounting | NotificationAttempt corresponds to a valid incident, target recipient/channel, and escalation step derived from current policy evaluation. |
| P7 acknowledge incident | HARD | eligibility | Action ack_incident is valid only if current IncidentState allows acknowledgment and caller is authorized at decision time. |
| P8 resolve incident | HARD | eligibility | Action resolve_incident is valid only if current IncidentState allows resolution at decision time. |
| P9 escalate incident | HARD | eligibility | Action escalate_incident is valid only if incident remains unresolved/unacked, escalation timeout has elapsed, and current policy still requires escalation at decision time. |
| P10 apply suppression | HARD | ordering | Suppression-rule revisions are ordered by monotonic version within suppression scope. |
| P11 route to shard owner | HARD | uniqueness | Key shard_id maps to at most one current authoritative owner within shard scope. |
| P12 reassign shard ownership | HARD | eligibility | Action reassign_shard is valid only if current owner is failed or relinquished and candidate owner is eligible and sufficiently current on shard_id at decision time. |
| P13 read incident list / timeline | HARD | freshness | Incident read path reflects authoritative incident and notification state within configured consistency bound. |
What matters most #
1. One authoritative incident lifecycle per dedup key #
This prevents split acknowledgment, split escalation, and duplicate routing decisions.
2. Routing must respect current policy, schedule, and suppression #
A wrong schedule pages the wrong person, and stale silence data pages someone who should not be paged at all.
3. Escalation is guarded by current state #
If an incident is already acked or resolved, escalation must not continue.
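The guard can be sketched as a function that a timer worker calls when a deadline fires. It re-checks the incident at decision time instead of trusting whatever was true when the timer was armed. The `Incident` fields and the `timer_step`, `max_step`, and `step_timeout` parameters are hypothetical names for this example.

```python
from dataclasses import dataclass

ESCALATABLE = {"OPEN", "ESCALATED"}   # acked, resolved, or suppressed incidents never escalate


@dataclass
class Incident:
    incident_key: str
    status: str
    escalation_step: int
    escalate_at: float    # epoch seconds of the next escalation deadline


def try_escalate(incident, timer_step, now, max_step, step_timeout):
    """Apply one escalation step only if the incident still qualifies right now."""
    if incident.status not in ESCALATABLE:
        return False          # acked/resolved/suppressed since the timer was armed
    if incident.escalation_step != timer_step:
        return False          # timer belongs to an earlier step; another worker already won
    if now < incident.escalate_at:
        return False          # deadline not reached (early or rescheduled timer)
    if incident.escalation_step >= max_step:
        return False          # policy has no further step
    incident.status = "ESCALATED"
    incident.escalation_step += 1
    incident.escalate_at = now + step_timeout
    return True
```

Because the timer carries the step it was armed for, a duplicate or delayed firing becomes a no-op rather than a second page.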
4. Notification attempts are facts, not current state #
The attempt log is append-only; current incident routing state is separate.
Step 5 - Execution Context #
For the baseline alerting platform:
| Field | Value | Why |
|---|---|---|
| Topology | single service distributed | one logical alerting system spread across ingest, routing, and state nodes |
| Write coordination scope | per object scope | correctness is per incident, policy, schedule, suppression, and shard ownership scope |
| Read consistency target | bounded stale allowed | UI views can tolerate small lag, but routing decisions should use authoritative current state |
| Holder model | none | no long-lived client-owned lease is central to routing correctness |
| Compensation acceptable? | No | wrong routing or stale escalation cannot be safely repaired afterward |
Derived implications #
- `holder_may_crash = false`: workers may fail, but incidents are not temporarily held by clients the way queues are
- `cross_service_write = false`: baseline keeps incident, policy, schedule, and ownership state within one logical service
- `bounded_staleness_allowed = true`: list/timeline reads can tolerate bounded lag, but routing/evaluation should be authoritative
- `cross_service_atomicity_required = false`: no multi-service transaction across unrelated services in baseline
- `exclusive_claim_required = true`: shard ownership must be exclusive
- `guarded_by_current_state = true`: dedup, ack, resolve, and escalate all depend on current incident state
What this implies #
This pushes us toward:
- one authoritative owner per incident/routing shard
- append-oriented alert-event and notification-attempt logs
- current-value policy/schedule state
- guarded incident lifecycle transitions for ack/resolve/escalate
Step 6 - Deterministic Mechanism Selection #
| Path | Write shape | Base mechanism | Required companions |
|---|---|---|---|
| P1 ingest alert event | append-only event | append log | dedup key / request id |
| P2 correlate/dedup into incident | guarded state transition | single writer per shard or CAS on (state, version) | incident key, lifecycle version |
| P3 update routing policy | overwrite current value | CAS on version | policy version |
| P4 update on-call schedule | overwrite current value | CAS on version | schedule version |
| P5 compute active responder | read source | direct source read | current schedule expansion / timezone logic |
| P6 send notification attempt | append-only event | append log | notification id, channel dedup policy |
| P7 acknowledge incident | guarded state transition | CAS on (state, version) | authz, incident version |
| P8 resolve incident | guarded state transition | CAS on (state, version) | incident version |
| P9 escalate incident | guarded state transition | single writer timer-driven transition or CAS on (state, version) | escalation step, timeout epoch |
| P10 apply suppression | overwrite current value | CAS on version | suppression version |
| P11 route to shard owner | exclusive claim | lease | fencing token, heartbeat |
| P12 reassign shard ownership | guarded state transition | CAS on (state, version) | fencing token, shard catch-up check |
Why these fit #
Alert events and notification attempts #
These are immutable product facts, so append-only fits.
Incident correlation and lifecycle changes #
Ack, resolve, dedup aggregation, and escalation all depend on current state, so guarded transitions fit.
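A guarded transition using compare-and-set on (state, version) might look like the sketch below. The in-memory `IncidentStore` with a lock stands in for whatever durable store the shard owner uses; its method names are assumptions for illustration.

```python
import threading


class VersionConflict(Exception):
    """The caller acted on an outdated incident version."""


class InvalidTransition(Exception):
    """The current state does not allow the requested transition."""


class IncidentStore:
    def __init__(self):
        self._lock = threading.Lock()
        self._rows = {}   # incident_key -> (status, version)

    def create(self, incident_key):
        with self._lock:
            self._rows.setdefault(incident_key, ("OPEN", 1))

    def get(self, incident_key):
        with self._lock:
            return self._rows[incident_key]

    def compare_and_set(self, incident_key, allowed_from, expected_version, new_status):
        """Write new_status only if both the version and the current state still match."""
        with self._lock:
            status, version = self._rows[incident_key]
            if version != expected_version:
                raise VersionConflict(f"expected v{expected_version}, found v{version}")
            if status not in allowed_from:
                raise InvalidTransition(f"{status} -> {new_status}")
            self._rows[incident_key] = (new_status, version + 1)
            return version + 1


def ack_incident(store, incident_key, expected_version):
    """OPEN/ESCALATED -> ACKED, guarded by the version the caller last read."""
    return store.compare_and_set(incident_key, {"OPEN", "ESCALATED"}, expected_version, "ACKED")
```

A stale ack loses with `VersionConflict` and the client re-reads before deciding whether to retry; an ack against an already acked incident fails with `InvalidTransition` instead of silently succeeding.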
Policies and schedules #
These are current-value control state, so overwrite fits.
Shard routing #
Routing correctness depends on one current owner per shard, so exclusive claim fits.
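The lease-plus-fencing-token idea can be sketched with two small classes: a lease table that hands out a strictly increasing token on each ownership change, and shard storage that rejects writes carrying an older token. Both classes are illustrative stand-ins; a real deployment would keep the lease in a coordination service.

```python
class StaleOwner(Exception):
    """A write arrived with a fencing token older than one already seen."""


class LeaseTable:
    def __init__(self):
        self._leases = {}   # shard_id -> (owner, token, expires_at)

    def acquire(self, shard_id, node, now, ttl):
        """Return a fencing token if node holds the lease after this call, else None."""
        owner, token, expires_at = self._leases.get(shard_id, (None, 0, 0.0))
        if owner == node and now < expires_at:
            # Heartbeat renewal by the current owner keeps the same token.
            self._leases[shard_id] = (node, token, now + ttl)
            return token
        if owner is not None and now < expires_at:
            return None     # another node still holds a live lease
        # Lease is free or expired: ownership changes and the token increases.
        self._leases[shard_id] = (node, token + 1, now + ttl)
        return token + 1


class ShardStorage:
    def __init__(self):
        self.highest_token = 0
        self.writes = []

    def apply(self, token, write):
        """Accept a write only from the newest owner seen so far."""
        if token < self.highest_token:
            raise StaleOwner(f"token {token} < {self.highest_token}")
        self.highest_token = token
        self.writes.append(write)
```

The storage-side check is what blocks a paused old owner: its lease may have expired without it noticing, but its token is now too small to write.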
Canonical substrate implied #
The baseline now points to:
- sharded incident-routing service
- one owner per team or incident shard
- immutable alert-event and notification-attempt logs
- current incident lifecycle state
- current policy/schedule/suppression state
- timer-driven escalation transitions
Step 7 - Read Model / Source of Truth #
For an alerting and on-call routing system, truth is mostly direct source state. Incident lists and dashboards are derived views.
| Concept | Truth | Read path | Rebuild path |
|---|---|---|---|
| C1 raw alert inputs | AlertEvent | read source directly | authoritative alert-event store |
| C2 current incident lifecycle | IncidentState | read source directly | authoritative incident-state store |
| C3 routing / escalation policy | RoutingPolicy | read source directly | authoritative policy store |
| C4 on-call schedule | OnCallSchedule | read source directly | authoritative schedule store |
| C5 suppression / silence state | SuppressionRule | read source directly | authoritative suppression store |
| C6 notification history | NotificationAttempt | read source directly | authoritative notification-attempt store |
| C7 shard ownership | PartitionOwnership | read source directly | authoritative ownership store |
| C8 shard routing map | PartitionMap | read source directly | authoritative routing metadata |
| C9 incident lists / dashboards | derived from incidents and notification history | materialized view | recompute from authoritative state |
Important point #
For the core semantics:
- routing decisions read authoritative incident, policy, schedule, and suppression state
- incident UI can be a projection
- notification history is append-only source truth, not just derived telemetry
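To make the split between source truth and projection concrete, the sketch below rebuilds a list view purely from authoritative incident state and the notification-attempt log. The dict shapes and the sort order (unresolved incidents first) are assumptions for this example.

```python
def build_incident_view(incidents, attempts):
    """Recompute the list projection from source state.

    incidents: incident_key -> {"team": str, "status": str}
    attempts:  iterable of {"incident_key": str, ...} notification-attempt records
    """
    counts = {}
    for attempt in attempts:
        key = attempt["incident_key"]
        counts[key] = counts.get(key, 0) + 1

    rows = [
        {
            "incident_key": key,
            "team": incident["team"],
            "status": incident["status"],
            "notifications": counts.get(key, 0),
        }
        for key, incident in incidents.items()
    ]
    # Unresolved incidents first, then stable ordering by key.
    rows.sort(key=lambda row: (row["status"] == "RESOLVED", row["incident_key"]))
    return rows
```

Because the view holds nothing that cannot be recomputed, it can lag or be dropped and rebuilt without affecting routing decisions.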
Step 8 - Failure Handling #
| Path | Retry | Competing writers | Crash after commit | Publish failure | Stale holder |
|---|---|---|---|---|---|
| P1 ingest alert event | upstream retry may resend same event; dedup key required | competing event ingests coexist | committed alert events survive crash if durable | n/a | stale shard owner blocked by fencing token |
| P2 correlate/dedup into incident | retry safe with incident version | concurrent dedup/lifecycle updates resolved by single shard owner or CAS | committed incident transition survives crash if persisted | incident projection may lag | stale shard owner blocked by fencing token |
| P3 update routing policy | retry with policy version | stale update loses CAS | committed policy survives crash if persisted | config propagation may lag | n/a |
| P4 update on-call schedule | retry with schedule version | stale update loses CAS | committed schedule survives crash if persisted | precomputed schedule expansion may lag | n/a |
| P6 send notification attempt | retry may resend channel message unless notification-id or provider dedup is used | competing senders should be fenced by incident/shard ownership | committed attempt survives crash if persisted | external provider failure can trigger retry | stale routing worker blocked by ownership/version discipline |
| P7 acknowledge incident | retry with incident version | stale ack loses guarded transition | committed ack survives crash if persisted | notification cancellation may lag but future escalation reads current state | n/a |
| P8 resolve incident | retry with incident version | stale resolve loses guarded transition | committed resolve survives crash if persisted | downstream cleanup may lag | n/a |
| P9 escalate incident | timer-driven retry safe with incident version/escalation step | only one escalation transition should win current state | committed escalation survives crash if persisted | notification send may retry independently | stale timer worker blocked by ownership/version discipline |
| P10 apply suppression | retry with suppression version | stale update loses CAS | committed suppression survives crash if persisted | routing cache may lag | n/a |
| P11 route to shard owner | retry after refreshing shard map | only one valid owner should exist | if owner changed, refreshed map points to new owner | n/a | stale owner rejected by fencing token |
| P12 reassign shard ownership | retry failover transition safely | only one reassignment wins current ownership state | promoted owner crash triggers later reassignment | n/a | old owner fenced and must not continue serving |
| P13 read incident list / timeline | read retry safe | many readers coexist | node crash drops query only | n/a | stale list bounded by configured projection freshness |
What matters most #
1. Dedup key correctness #
If dedup grouping is wrong, incident lifecycle and routing are wrong even if everything else works.
2. Escalation must re-check current incident state #
A pending timer should not escalate an incident that is already acked, silenced, or resolved.
3. Notification delivery and incident truth are separate #
Notification attempts are append-only facts; current incident state determines whether more attempts should occur.
4. Schedule correctness is time-dependent #
Responder selection depends on timezone, handoff windows, overrides, and schedule version.
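A simplified sketch of time-dependent responder selection follows. It assumes a fixed-length rotation anchored at a timezone-aware handoff time, with overrides taking precedence; real schedules usually add layers, restrictions, and named timezones that are left out here. All field names are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class Override:
    start: datetime
    end: datetime
    responder: str


@dataclass
class OnCallSchedule:
    version: int
    rotation: list              # responders in handoff order
    rotation_start: datetime    # timezone-aware anchor of the first shift
    shift_length: timedelta
    overrides: list = field(default_factory=list)

    def responder_at(self, at: datetime) -> str:
        """Overrides win; otherwise pick the rotation slot covering the instant."""
        for override in self.overrides:
            if override.start <= at < override.end:
                return override.responder
        shifts_elapsed = (at - self.rotation_start) // self.shift_length
        return self.rotation[shifts_elapsed % len(self.rotation)]


schedule = OnCallSchedule(
    version=3,
    rotation=["alice", "bob"],
    rotation_start=datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc),
    shift_length=timedelta(days=7),
)
```

Comparing timezone-aware datetimes means a page evaluated in another UTC offset still resolves to the same instant, which is what keeps handoff boundaries consistent across regions.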
Step 9 - Scale Adjustments #
| Hotspot | Type | First response |
|---|---|---|
| very high alert-event rate | write throughput hotspot | shard by tenant/team/incident key and add more ingest/routing owners |
| noisy incident dedup groups | contention hotspot | isolate hot incident keys and tune grouping windows |
| schedule lookups and expansion | read hotspot | precompute near-future schedule windows and cache per team |
| notification storms during major outage | contention hotspot | batch similar incidents, apply suppression, and rate-limit channel sends |
| timer-driven escalation scans | write throughput hotspot | bucket incident escalation deadlines and scan incrementally |
| incident dashboard load | read hotspot | use projections and caching for list/timeline views |
What scales well #
This system scales by:
- sharding incident and routing state by tenant/team/service
- keeping alert-event and notification-attempt logs append-oriented
- precomputing active on-call windows
- using due-time indexes for escalations and reminders
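A due-time index can be sketched with a min-heap keyed by deadline, so the escalation loop only touches incidents that are actually due instead of scanning all open incidents. The class and method names are hypothetical, and the lazy-deletion approach to cancellation is one option among several.

```python
import heapq


class DueTimeIndex:
    """Min-heap of (due_at, incident_key) with lazy cancellation."""

    def __init__(self):
        self._heap = []
        self._latest = {}   # incident_key -> the only deadline still considered valid

    def schedule(self, incident_key, due_at):
        """Set or replace the deadline for an incident."""
        self._latest[incident_key] = due_at
        heapq.heappush(self._heap, (due_at, incident_key))

    def cancel(self, incident_key):
        """Called on ack/resolve; the stale heap entry is skipped when popped."""
        self._latest.pop(incident_key, None)

    def pop_due(self, now):
        """Return incident keys whose current deadline has passed."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            due_at, incident_key = heapq.heappop(self._heap)
            if self._latest.get(incident_key) == due_at:
                del self._latest[incident_key]
                due.append(incident_key)
        return due
```

Each key returned by `pop_due` is only a candidate: the escalation transition itself should still re-check current incident state before paging anyone.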
What fails first #
Usually:
- noisy alert storms
- incorrect dedup grouping
- expensive schedule expansion at runtime
- notification-provider rate limits
Canonical design conclusion #
The mechanical outcome is:
- primary state:
  - AlertEvent
  - IncidentState
  - RoutingPolicy
  - OnCallSchedule
  - SuppressionRule
  - NotificationAttempt
  - PartitionOwnership
  - PartitionMap
- critical invariants:
  - immutable alert-event ingestion
  - one authoritative current incident lifecycle per dedup key
  - routing decisions valid only under current policy, schedule, and suppression state
  - ack/resolve/escalate are guarded by current incident state
  - exclusive shard ownership for incident/routing state
- mechanisms:
  - append log
  - single writer per shard
  - guarded incident lifecycle transitions
  - current-value policy/schedule state
  - fenced shard ownership
- reads:
  - direct authoritative reads for routing decisions
  - projections for incident lists, timelines, and dashboards
Polished interview answer #
I’d design the alerting platform as a sharded incident-routing service with one authoritative owner per team or incident shard. Upstream monitors append immutable alert events, and shard owners correlate them into a current incident lifecycle keyed by dedup rules. Routing decisions read current escalation policy, current on-call schedule, and any active suppressions to choose the right responder, while notification attempts are logged as immutable delivery facts. Acknowledge, resolve, and timed escalation are guarded incident-state transitions, so escalation only proceeds if the incident is still active and unacked. The main scaling levers are more shards, precomputed schedule windows, due-time indexes for escalation deadlines, and controls for noisy alert storms.
Concrete Substrate #
I’ll choose a sharded incident-routing service with append-oriented alert-event and notification-attempt logs plus current incident/policy/schedule state as the concrete baseline, because it matches the mechanics we derived:
- append-only alert ingestion
- guarded incident lifecycle
- current-value policy/schedule/suppression state
- timer-driven escalation
- one owner per shard
Concrete tech family:
- routing service in Go or Java
- durable metadata/state storage:
  - replicated DB or RocksDB-backed service state
- shard replication:
  - Raft or leader-follower replication with commit index
- metadata/control:
  - etcd or internal metadata quorum for shard ownership/routing
- external notification adapters:
  - email, SMS, push, Slack, PagerDuty-style connectors
Each shard owner stores:
- alert-event log for owned scope
- current IncidentState
- current RoutingPolicy
- current OnCallSchedule / expanded active windows
- current SuppressionRule
- notification-attempt history
- escalation due-time index
Operation Layer #
1. Ingest alert event #
API
IngestAlert(event)
Initiator
- upstream monitor / alert source
Entry point
- ingestion API / routing frontend
Authoritative decider
- shard owner for tenant/team/incident key
Precondition
- event parsed and routed to correct shard
Transition
- append AlertEvent
- create or update IncidentState via dedup/correlation rules
Response
- success / dedup-applied
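The ingest flow can be sketched as one method on the shard owner: append the event if its id is new, then create or update the incident for its dedup key. The in-memory dicts, the `"duplicate"` response for a resent event, and reopening a fresh lifecycle after `RESOLVED` are assumptions for this example rather than behavior stated above.

```python
class ShardOwner:
    """In-memory stand-in for the authoritative owner of one shard."""

    def __init__(self):
        self.events = {}      # event_id -> payload, append-only
        self.incidents = {}   # incident_key -> {"status", "version", "event_count"}

    def ingest(self, event_id, incident_key, payload):
        if event_id in self.events:
            return "duplicate"            # upstream retry of an already stored event
        self.events[event_id] = payload   # append the immutable fact first

        incident = self.incidents.get(incident_key)
        if incident is None or incident["status"] == "RESOLVED":
            # Assumption: a new alert after resolution starts a fresh lifecycle.
            next_version = incident["version"] + 1 if incident else 1
            self.incidents[incident_key] = {
                "status": "OPEN",
                "version": next_version,
                "event_count": 1,
            }
            return "success"

        incident["event_count"] += 1
        incident["version"] += 1
        return "dedup-applied"
```

Checking `event_id` before touching incident state is what makes an upstream resend harmless: the event count and version move only once per distinct event.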
2. Acknowledge incident #
API
AckIncident(incident_id, actor, expected_version?)
Initiator
- user/client
Entry point
- incident API
Authoritative decider
- shard owner for incident
Precondition
- incident currently ackable
- actor authorized
Transition
- guarded update IncidentState: OPEN/ESCALATED -> ACKED
3. Resolve incident #
API
ResolveIncident(incident_id, actor, expected_version?)
Initiator
- user/client
Entry point
- incident API
Authoritative decider
- shard owner for incident
Precondition
- incident currently resolvable
Transition
- guarded update IncidentState -> RESOLVED
4. Escalate incident #
API
- internal escalation timer flow
Initiator
- system
Entry point
- shard owner
Authoritative decider
- shard owner
Precondition
- escalation deadline elapsed
- incident still unresolved/unacked and unsuppressed
Transition
- guarded update IncidentState to next escalation step
- append NotificationAttempt for next responder set
5. Update on-call schedule #
API
PutSchedule(schedule_id, schedule, expected_version?)
Initiator
- admin
Entry point
- config API
Authoritative decider
- schedule store owner
Precondition
- version matches if optimistic concurrency used
Transition
- overwrite OnCallSchedule
- optionally recompute near-future active windows
Entry Point vs Decider vs Responder #
| Path | Entry point | Authoritative decider | Physical responder | Logical responder |
|---|---|---|---|---|
| alert ingest | ingestion API | incident shard owner | routing node | alerting system |
| ack / resolve | incident API | incident shard owner | API node | alerting system |
| schedule/policy update | config API | config store owner | config node | alerting system |
| escalation | shard owner timer loop | incident shard owner | routing node / adapter | alerting system |
| notification send | notification adapter | incident shard owner for decision, provider for delivery | adapter node | alerting system |
| shard failover | follower / coordination layer | shard quorum / lease store | new leader / control plane | alerting system |
Concrete HLD #
Main components:
- ingestion/routing frontend
- receives upstream alert events
- incident shard owners
- authoritative owners for incident lifecycle, dedup, and escalation
- policy/schedule service
- stores escalation policies and on-call schedules
- notification adapters
- send pages/messages over external channels
- metadata/control service
- tracks shard ownership and routing
- incident query UI/API
- serves incident lists and timelines from projections or source state
Short Interview Version #
I’d build the alerting platform as a sharded incident-routing service with one authoritative owner per team or incident shard. Upstream monitors append immutable alert events, and shard owners correlate them into a current incident lifecycle keyed by dedup rules. Routing decisions read current escalation policy, current on-call schedule, and active suppressions to choose the right responder, while notification attempts are logged as immutable delivery facts. Acknowledge, resolve, and timed escalation are guarded incident-state transitions, so escalation only proceeds if the incident is still active and unacked. The main scaling levers are more shards, precomputed schedule windows, due-time indexes for escalation deadlines, and controls for noisy alert storms.