Alerting and On-call Routing System #

This note models an alerting and on-call routing system: alert events are ingested and deduplicated into incidents, incidents are routed through escalation policies and on-call schedules, and acknowledgments or resolutions change the incident lifecycle and subsequent notification behavior.


Step 1 - Normalize #

Assume the baseline prompt is:

  • design an alerting and on-call routing system
  • upstream monitors send alert events
  • system deduplicates related alerts into incidents
  • incidents route to current on-call responders
  • acknowledgments, escalations, and resolutions change routing
  • system scales across many services and teams

Normalize into state-affecting paths.

| Requirement | Actor | Operation | State touched | Priority |
| --- | --- | --- | --- | --- |
| Upstream system sends alert event | Client | append event | S1; create target; AlertEvent | C1 |
| System deduplicates or correlates alert into incident | System | state transition | S1; update target; IncidentState | C1 |
| Admin updates escalation / routing policy | Admin | overwrite state | S1; update target; RoutingPolicy | C1 |
| Admin updates on-call schedule | Admin | overwrite state | S1; update target; OnCallSchedule | C1 |
| System computes active responder for incident | System | read source | S1; read source target; OnCallSchedule | C1 |
| System sends notification attempt | System | append event | S1; create target; NotificationAttempt | C1 |
| User acknowledges incident | Client | state transition | S1; update target; IncidentState | C1 |
| User resolves incident | Client | state transition | S1; update target; IncidentState | C1 |
| System escalates incident after timeout | System | async process | S1; hidden write target; IncidentState | C1 |
| System applies suppression / silence | Client | state transition | S1; update target; SuppressionRule | C1 |
| User reads incident list / timeline | Client | read projection | S1; read projection target; IncidentView | R1 |
| System routes team/tenant shard to current owner | System | read source | S1; read source target; PartitionMap | C1 |
| System reassigns shard ownership after node failure | System | state transition | S1; update target; PartitionOwnership | C1 |

Notes on normalization #

Important choices:

  • alert ingestion is append event
    • raw alert signals are immutable facts
  • dedup/correlation is state transition
    • current incident lifecycle changes as new events arrive
  • routing policy and schedule are current-value control state
  • notification send is append event
    • each attempt is an immutable delivery fact
  • ack/resolve/escalate are incident lifecycle transitions
  • suppression is explicit because it changes routing behavior

This system is a hybrid of:

  • event ingestion
  • incident lifecycle state machine
  • policy-driven routing and escalation

Step 2 - Critical Path Selection #

| Requirement | Priority class | Why |
| --- | --- | --- |
| Ingest alert event | C1 | alert input is the primary signal source |
| Deduplicate/correlate into incident | C1 | correctness depends on grouping and current incident lifecycle |
| Update routing / escalation policy | C1 | future notification decisions depend on current policy |
| Update on-call schedule | C1 | responder selection depends on current schedule |
| Compute active responder | C1 | wrong responder means wrong page delivery |
| Send notification attempt | C1 | delivery attempts are core product behavior |
| Acknowledge incident | C1 | ack changes escalation behavior and who gets notified |
| Resolve incident | C1 | resolution changes future routing and visibility |
| Escalate incident after timeout | C1 | escalation correctness is central to on-call routing |
| Apply suppression / silence | C1 | silences alter routing and dedup behavior |
| Read incident list / timeline | R1 | core serving path |
| Route to shard owner | C1 | wrong routing can split incident truth |
| Reassign shard ownership | C1 | failover must preserve incident lifecycle correctness |

Baseline critical paths #

Main C1 paths:

  • P1 ingest alert event
  • P2 correlate/dedup into incident
  • P3 update routing policy
  • P4 update on-call schedule
  • P5 compute active responder
  • P6 send notification attempt
  • P7 acknowledge incident
  • P8 resolve incident
  • P9 escalate incident
  • P10 apply suppression
  • P11 route to shard owner
  • P12 reassign shard ownership

Main R1 path:

  • P13 read incident list / timeline

This design is driven by:

  • immutable alert-event ingestion
  • one authoritative current incident lifecycle
  • current on-call schedule and policy
  • timed escalation and ack/resolve suppression of further notifications

Step 3 - Primary State Extraction #

For an alerting and on-call routing system, the minimal primary state is the raw alert event, the current incident lifecycle, the routing policy, the on-call schedule, suppression state, notification attempts, and routing/ownership state.

| Candidate object label | Candidate source | Needed for C1/R1? | Decomposition action | Class | Primary? | Owner | Evolution | Scope kind | Scope value |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AlertEvent | direct noun | Yes | keep as candidate | event | Yes | service | append-only | instance | event_id |
| IncidentState | lifecycle object | Yes | keep as candidate | process | Yes | service | state machine | instance | incident_key |
| RoutingPolicy | direct noun | Yes | keep as candidate | entity | Yes | service | overwrite | instance | team_id or service_id |
| OnCallSchedule | direct noun | Yes | keep as candidate | entity | Yes | service | overwrite | instance | schedule_id |
| SuppressionRule | direct noun | Yes | keep as candidate | process | Yes | service | state machine | instance | suppression_id |
| NotificationAttempt | direct noun | Yes | keep as candidate | event | Yes | service | append-only | instance | notification_id |
| PartitionOwnership | hidden write target | Yes | keep as candidate | process | Yes | service | state machine | instance | shard_id |
| PartitionMap | hidden write target | Yes | keep as candidate | entity | Yes | service | overwrite | collection | tenant/team shards |
| IncidentView | derived read model | No | reject as UI artifact | projection | No | derived | overwrite | collection | tenant/team |

Important modeling choices #

AlertEvent #

Primary because:

  • upstream alerts are immutable inputs
  • include source, labels, severity, time, and dedup keys
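As a minimal sketch of the two points above, an alert event can be modeled as a frozen record with a dedup key derived from a small set of grouping labels. The field names beyond those listed above, the `DEDUP_LABELS` choice, and the 16-character fingerprint length are assumptions for illustration, not part of the original design.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime

# Assumption: these labels define "the same problem" for grouping purposes.
DEDUP_LABELS = ("service", "alertname", "environment")


@dataclass(frozen=True)
class AlertEvent:
    """Immutable alert fact; frozen=True rejects attribute assignment."""
    event_id: str
    source: str
    labels: tuple[tuple[str, str], ...]  # pairs kept in a tuple so the event stays immutable
    severity: str
    observed_at: datetime

    def dedup_key(self) -> str:
        """Fingerprint over the grouping labels only, ignoring noisy labels."""
        label_map = dict(self.labels)
        parts = [f"{name}={label_map.get(name, '')}" for name in DEDUP_LABELS]
        return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()[:16]
```

Two events that differ only in a label outside `DEDUP_LABELS` (for example a pod name) produce the same key and would fold into the same incident, which is why the choice of grouping labels matters so much.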

IncidentState #

Primary because:

  • this is the central lifecycle object
  • captures states like OPEN, ACKED, ESCALATED, RESOLVED, SUPPRESSED
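The lifecycle can be sketched as an explicit transition table. Only the OPEN/ESCALATED to ACKED edge and the move to RESOLVED are stated later in this note; the remaining edges below (for example SUPPRESSED back to OPEN) are assumptions chosen to make the example complete.

```python
from enum import Enum


class Status(Enum):
    OPEN = "OPEN"
    ACKED = "ACKED"
    ESCALATED = "ESCALATED"
    RESOLVED = "RESOLVED"
    SUPPRESSED = "SUPPRESSED"


# Assumed transition table; RESOLVED is treated as terminal here.
ALLOWED = {
    Status.OPEN: {Status.ACKED, Status.ESCALATED, Status.RESOLVED, Status.SUPPRESSED},
    # ESCALATED -> ESCALATED models moving to the next escalation step.
    Status.ESCALATED: {Status.ACKED, Status.ESCALATED, Status.RESOLVED, Status.SUPPRESSED},
    Status.ACKED: {Status.RESOLVED},
    Status.SUPPRESSED: {Status.OPEN, Status.RESOLVED},
    Status.RESOLVED: set(),
}


def can_transition(current: Status, target: Status) -> bool:
    """Return True if the lifecycle allows moving from current to target."""
    return target in ALLOWED[current]
```

Keeping the table as data makes it easy to reject an escalation of an already acknowledged incident with a single lookup.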

RoutingPolicy #

Primary because:

  • determines escalation chain, notification channels, and responder routing

OnCallSchedule #

Primary because:

  • current responder selection depends on the schedule timeline

SuppressionRule #

Primary because:

  • silences and maintenance windows materially change whether incidents route at all
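One way to picture a silence or maintenance window is a rule with label matchers and a time window. This is a sketch under assumptions: exact-match label matchers, a half-open `[starts_at, ends_at)` window, and the helper name `is_suppressed` are all illustrative.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class SuppressionRule:
    suppression_id: str
    matchers: dict[str, str]  # every matcher must equal the alert's label value
    starts_at: datetime
    ends_at: datetime

    def silences(self, labels: dict[str, str], now: datetime) -> bool:
        """True if the rule is active at `now` and all matchers match."""
        in_window = self.starts_at <= now < self.ends_at
        return in_window and all(labels.get(k) == v for k, v in self.matchers.items())


def is_suppressed(rules: list[SuppressionRule], labels: dict[str, str], now: datetime) -> bool:
    """Routing should be skipped if any active rule matches the labels."""
    return any(rule.silences(labels, now) for rule in rules)
```

Passing `now` explicitly keeps the check deterministic and lets the routing path evaluate suppression at the same decision time it uses for policy and schedule.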

NotificationAttempt #

Primary because:

  • each send attempt is an immutable product fact
  • delivery history and retry logic depend on it

Minimal strict primary set #

The strongest minimal set is:

  • AlertEvent
  • IncidentState
  • RoutingPolicy
  • OnCallSchedule
  • SuppressionRule
  • NotificationAttempt
  • PartitionOwnership
  • PartitionMap

Step 4 - Hard Invariants #

For an alerting and on-call routing system, the hard invariants are about immutable alert ingestion, one authoritative current incident lifecycle, valid policy/schedule-based responder selection, and guarded ack/escalation/resolution transitions.

| Path | Tier | Type | Invariant statement |
| --- | --- | --- | --- |
| P1 ingest alert event | HARD | uniqueness | Key event_id maps to at most one logical outcome (a stored alert event) within event scope. |
| P2 correlate/dedup into incident | HARD | uniqueness | Key incident_key maps to at most one logical outcome (the current authoritative incident state) within incident scope. |
| P2 correlate/dedup into incident | HARD | accounting | IncidentState(incident_key) reflects the authoritative aggregation of relevant alert events modulo suppression and resolution policy. |
| P3 update routing policy | HARD | ordering | Routing-policy revisions are ordered by monotonic version within policy scope. |
| P4 update on-call schedule | HARD | ordering | Schedule revisions are ordered by monotonic version within schedule scope. |
| P5 compute active responder | HARD | eligibility | Action select_responder is valid only if current RoutingPolicy, OnCallSchedule, and SuppressionRule allow routing for the incident at decision time. |
| P6 send notification attempt | HARD | accounting | NotificationAttempt corresponds to a valid incident, target recipient/channel, and escalation step derived from current policy evaluation. |
| P7 acknowledge incident | HARD | eligibility | Action ack_incident is valid only if current IncidentState allows acknowledgment and caller is authorized at decision time. |
| P8 resolve incident | HARD | eligibility | Action resolve_incident is valid only if current IncidentState allows resolution at decision time. |
| P9 escalate incident | HARD | eligibility | Action escalate_incident is valid only if incident remains unresolved/unacked, escalation timeout has elapsed, and current policy still requires escalation at decision time. |
| P10 apply suppression | HARD | ordering | Suppression-rule revisions are ordered by monotonic version within suppression scope. |
| P11 route to shard owner | HARD | uniqueness | Key shard_id maps to at most one logical outcome (the current authoritative owner) within shard_id scope. |
| P12 reassign shard ownership | HARD | eligibility | Action reassign_shard is valid only if current owner is failed or relinquished and candidate owner is eligible and sufficiently current on shard_id at decision time. |
| P13 read incident list / timeline | HARD | freshness | Incident read path reflects authoritative incident and notification state within configured consistency bound. |

What matters most #

1. One authoritative incident lifecycle per dedup key #

This prevents split acknowledgment, split escalation, and duplicate routing decisions.

2. Routing must respect current policy, schedule, and suppression #

A stale schedule pages the wrong person, and a stale silence either pages someone who should not be paged or hides an incident that should route.

3. Escalation is guarded by current state #

If an incident is already acked or resolved, escalation must not continue.

4. Notification attempts are facts, not current state #

The attempt log is append-only; current incident routing state is separate.


Step 5 - Execution Context #

For the baseline alerting platform:

| Field | Value | Why |
| --- | --- | --- |
| Topology | single service distributed | one logical alerting system spread across ingest, routing, and state nodes |
| Write coordination scope | per object scope | correctness is per incident, policy, schedule, suppression, and shard ownership scope |
| Read consistency target | bounded stale allowed | UI views can tolerate small lag, but routing decisions should use authoritative current state |
| Holder model | none | no long-lived client-owned lease is central to routing correctness |
| Compensation acceptable? | No | wrong routing or stale escalation cannot be safely repaired afterward |

Derived implications #

  • holder_may_crash = false

    • workers may fail, but incidents are not temporarily held by clients the way queue messages are
  • cross_service_write = false

    • baseline keeps incident, policy, schedule, and ownership state within one logical service
  • bounded_staleness_allowed = true

    • list/timeline reads can tolerate bounded lag, but routing/evaluation should be authoritative
  • cross_service_atomicity_required = false

    • no multi-service transaction across unrelated services in baseline
  • exclusive_claim_required = true

    • shard ownership must be exclusive
  • guarded_by_current_state = true

    • dedup, ack, resolve, and escalate all depend on current incident state

What this implies #

This pushes us toward:

  • one authoritative owner per incident/routing shard
  • append-oriented alert-event and notification-attempt logs
  • current-value policy/schedule state
  • guarded incident lifecycle transitions for ack/resolve/escalate

Step 6 - Deterministic Mechanism Selection #

| Path | Write shape | Base mechanism | Required companions |
| --- | --- | --- | --- |
| P1 ingest alert event | append-only event | append log | dedup key / request id |
| P2 correlate/dedup into incident | guarded state transition | single writer per shard or CAS on (state, version) | incident key, lifecycle version |
| P3 update routing policy | overwrite current value | CAS on version | policy version |
| P4 update on-call schedule | overwrite current value | CAS on version | schedule version |
| P5 compute active responder | read source | direct source read | current schedule expansion / timezone logic |
| P6 send notification attempt | append-only event | append log | notification id, channel dedup policy |
| P7 acknowledge incident | guarded state transition | CAS on (state, version) | authz, incident version |
| P8 resolve incident | guarded state transition | CAS on (state, version) | incident version |
| P9 escalate incident | guarded state transition | single writer timer-driven transition or CAS on (state, version) | escalation step, timeout epoch |
| P10 apply suppression | overwrite current value | CAS on version | suppression version |
| P11 route to shard owner | exclusive claim | lease | fencing token, heartbeat |
| P12 reassign shard ownership | guarded state transition | CAS on (state, version) | fencing token, shard catch-up check |

Why these fit #

Alert events and notification attempts #

These are immutable product facts, so append-only fits.
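A minimal in-memory sketch of an append-only log with idempotent ingest follows. The class name `AlertEventLog` and its methods are hypothetical; a real shard owner would back this with durable, replicated storage.

```python
class AlertEventLog:
    """Append-only log; a repeated event_id returns the original offset."""

    def __init__(self) -> None:
        self._events: list[tuple[str, dict]] = []
        self._offset_by_id: dict[str, int] = {}

    def append(self, event_id: str, payload: dict) -> tuple[int, bool]:
        """Return (offset, appended). Upstream retries do not create a second record."""
        existing = self._offset_by_id.get(event_id)
        if existing is not None:  # offset 0 is valid, so compare against None
            return existing, False
        offset = len(self._events)
        self._events.append((event_id, payload))
        self._offset_by_id[event_id] = offset
        return offset, True

    def read_from(self, offset: int) -> list[tuple[str, dict]]:
        """Return a copy of the records at or after offset; stored records are never rewritten."""
        return list(self._events[offset:])
```

The log exposes no update or delete operation, so the only way state changes is by appending a new fact.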

Incident correlation and lifecycle changes #

Ack, resolve, dedup aggregation, and escalation all depend on current state, so guarded transitions fit.
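The guarded transition can be sketched as a compare-and-set on (state, version). The store below is an in-memory stand-in with a lock; the names `IncidentStore`, `VersionConflict`, and `InvalidTransition` are illustrative assumptions.

```python
import threading
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Incident:
    incident_key: str
    status: str
    version: int


class VersionConflict(Exception):
    """Caller acted on a stale version."""


class InvalidTransition(Exception):
    """Current status does not allow the requested transition."""


class IncidentStore:
    def __init__(self) -> None:
        self._rows: dict[str, Incident] = {}
        self._lock = threading.Lock()

    def put_new(self, incident: Incident) -> None:
        with self._lock:
            self._rows[incident.incident_key] = incident

    def compare_and_set(self, key: str, expected_version: int,
                        allowed_from: set[str], new_status: str) -> Incident:
        """Apply the transition only if both version and status still match."""
        with self._lock:
            current = self._rows[key]
            if current.version != expected_version:
                raise VersionConflict(key)
            if current.status not in allowed_from:
                raise InvalidTransition(f"{current.status} -> {new_status}")
            updated = replace(current, status=new_status, version=current.version + 1)
            self._rows[key] = updated
            return updated


def ack_incident(store: IncidentStore, key: str, expected_version: int) -> Incident:
    return store.compare_and_set(key, expected_version, {"OPEN", "ESCALATED"}, "ACKED")
```

A stale ack loses with `VersionConflict`, and an ack against an already acknowledged incident loses with `InvalidTransition`, so the caller can tell a race from a lifecycle violation.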

Policies and schedules #

These are current-value control state, so overwrite fits.

Shard routing #

Routing correctness depends on one current owner per shard, so exclusive claim fits.
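A lease plus fencing token can be sketched as follows. This is an in-memory illustration with hypothetical names (`LeaseTable`, `ShardStorage`); in practice the lease table would live in a coordination store and the token check would sit in the storage layer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lease:
    owner: str
    token: int
    expires_at: float


class LeaseTable:
    """Grants at most one live lease per shard; each new grant bumps the fencing token."""

    def __init__(self) -> None:
        self._leases: dict[str, Lease] = {}

    def acquire(self, shard_id: str, owner: str, now: float, ttl: float):
        current = self._leases.get(shard_id)
        if current is not None and current.expires_at > now:
            if current.owner != owner:
                return None          # a live lease is held by someone else
            token = current.token    # heartbeat renewal keeps the same token
        else:
            token = (current.token if current else 0) + 1
        self._leases[shard_id] = Lease(owner, token, now + ttl)
        return token


class ShardStorage:
    """Rejects writes carrying a token older than the newest one seen."""

    def __init__(self) -> None:
        self.highest_token = 0
        self.writes: list[str] = []

    def write(self, token: int, record: str) -> bool:
        if token < self.highest_token:
            return False             # stale owner is fenced off
        self.highest_token = token
        self.writes.append(record)
        return True
```

The lease alone is not enough, because an old owner may not yet know its lease expired; the token check on the write path is what actually blocks it.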

Canonical substrate implied #

The baseline now points to:

  • sharded incident-routing service
  • one owner per team or incident shard
  • immutable alert-event and notification-attempt logs
  • current incident lifecycle state
  • current policy/schedule/suppression state
  • timer-driven escalation transitions

Step 7 - Read Model / Source of Truth #

For an alerting and on-call routing system, truth is mostly direct source state. Incident lists and dashboards are derived views.

| Concept | Truth | Read path | Rebuild path |
| --- | --- | --- | --- |
| C1 raw alert inputs | AlertEvent | read source directly | authoritative alert-event store |
| C2 current incident lifecycle | IncidentState | read source directly | authoritative incident-state store |
| C3 routing / escalation policy | RoutingPolicy | read source directly | authoritative policy store |
| C4 on-call schedule | OnCallSchedule | read source directly | authoritative schedule store |
| C5 suppression / silence state | SuppressionRule | read source directly | authoritative suppression store |
| C6 notification history | NotificationAttempt | read source directly | authoritative notification-attempt store |
| C7 shard ownership | PartitionOwnership | read source directly | authoritative ownership store |
| C8 shard routing map | PartitionMap | read source directly | authoritative routing metadata |
| C9 incident lists / dashboards | derived from incidents and notification history | materialized view | recompute from authoritative state |

Important point #

For the core semantics:

  • routing decisions read authoritative incident, policy, schedule, and suppression state
  • incident UI can be a projection
  • notification history is append-only source truth, not just derived telemetry
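Because the incident UI is a projection, it can be rebuilt at any time from the authoritative stores. The function below is a small sketch; the dictionary shapes and the `build_incident_view` name are assumptions for illustration.

```python
from collections import Counter, defaultdict


def build_incident_view(incidents: list[dict], attempts: list[dict]) -> dict[str, list[dict]]:
    """Recompute a per-team incident list from incident state and notification history."""
    attempts_per_incident = Counter(a["incident_key"] for a in attempts)
    view: dict[str, list[dict]] = defaultdict(list)
    for incident in sorted(incidents, key=lambda i: i["incident_key"]):
        view[incident["team"]].append({
            "incident_key": incident["incident_key"],
            "status": incident["status"],
            "notifications_sent": attempts_per_incident[incident["incident_key"]],
        })
    return dict(view)
```

Nothing in the view is written back to the source stores, so dropping and recomputing it cannot change routing behavior.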

Step 8 - Failure Handling #

| Path | Retry | Competing writers | Crash after commit | Publish failure | Stale holder |
| --- | --- | --- | --- | --- | --- |
| P1 ingest alert event | upstream retry may resend same event; dedup key required | competing event ingests coexist | committed alert events survive crash if durable | n/a | stale shard owner blocked by fencing token |
| P2 correlate/dedup into incident | retry safe with incident version | concurrent dedup/lifecycle updates resolved by single shard owner or CAS | committed incident transition survives crash if persisted | incident projection may lag | stale shard owner blocked by fencing token |
| P3 update routing policy | retry with policy version | stale update loses CAS | committed policy survives crash if persisted | config propagation may lag | n/a |
| P4 update on-call schedule | retry with schedule version | stale update loses CAS | committed schedule survives crash if persisted | precomputed schedule expansion may lag | n/a |
| P6 send notification attempt | retry may resend channel message unless notification-id or provider dedup is used | competing senders should be fenced by incident/shard ownership | committed attempt survives crash if persisted | external provider failure can trigger retry | stale routing worker blocked by ownership/version discipline |
| P7 acknowledge incident | retry with incident version | stale ack loses guarded transition | committed ack survives crash if persisted | notification cancellation may lag but future escalation reads current state | n/a |
| P8 resolve incident | retry with incident version | stale resolve loses guarded transition | committed resolve survives crash if persisted | downstream cleanup may lag | n/a |
| P9 escalate incident | timer-driven retry safe with incident version/escalation step | only one escalation transition should win current state | committed escalation survives crash if persisted | notification send may retry independently | stale timer worker blocked by ownership/version discipline |
| P10 apply suppression | retry with suppression version | stale update loses CAS | committed suppression survives crash if persisted | routing cache may lag | n/a |
| P11 route to shard owner | retry after refreshing shard map | only one valid owner should exist | if owner changed, refreshed map points to new owner | n/a | stale owner rejected by fencing token |
| P12 reassign shard ownership | retry failover transition safely | only one reassignment wins current ownership state | promoted owner crash triggers later reassignment | n/a | old owner fenced and must not continue serving |
| P13 read incident list / timeline | read retry safe | many readers coexist | node crash drops query only | n/a | stale list bounded by configured projection freshness |

What matters most #

1. Dedup key correctness #

If dedup grouping is wrong, incident lifecycle and routing are wrong even if everything else works.

2. Escalation must re-check current incident state #

A pending timer should not escalate an incident that is already acked, silenced, or resolved.
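The re-check can be sketched as a timer handler that carries the escalation step it was armed for and compares it with current state before acting. The field names and the `on_escalation_timer` function are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Incident:
    incident_key: str
    status: str
    escalation_step: int
    suppressed: bool = False


def on_escalation_timer(incident: Incident, armed_step: int, max_step: int) -> Optional[int]:
    """Return the new escalation step to notify, or None if the timer is stale."""
    if incident.escalation_step != armed_step:
        return None  # another transition already moved the incident on
    if incident.status not in {"OPEN", "ESCALATED"}:
        return None  # acked or resolved since the timer was armed
    if incident.suppressed:
        return None  # a silence applies now
    if incident.escalation_step >= max_step:
        return None  # policy has no further step
    incident.status = "ESCALATED"
    incident.escalation_step += 1
    return incident.escalation_step
```

A duplicate or late timer fire for the same armed step returns `None`, so retrying the timer is safe.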

3. Notification delivery and incident truth are separate #

Notification attempts are append-only facts; current incident state determines whether more attempts should occur.
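One possible way to keep retries from double-paging is a deterministic notification id derived from the incident, escalation step, recipient, and channel. This composition of the id, the `NotificationSender` class, and the `deliver` callback are assumptions for illustration, not a description of any provider's API.

```python
import hashlib
from typing import Callable


def notification_id(incident_key: str, step: int, recipient: str, channel: str) -> str:
    """Same logical notification always maps to the same id."""
    raw = f"{incident_key}|{step}|{recipient}|{channel}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:20]


class NotificationSender:
    def __init__(self, deliver: Callable[[str, str], bool]) -> None:
        self._deliver = deliver
        self.attempts: list[dict] = []   # append-only attempt history
        self._delivered: set[str] = set()

    def send(self, incident_key: str, step: int, recipient: str, channel: str) -> str:
        """Return 'skipped', 'sent', or 'failed'; every real attempt is logged."""
        nid = notification_id(incident_key, step, recipient, channel)
        if nid in self._delivered:
            return "skipped"             # already delivered, retry is a no-op
        ok = self._deliver(recipient, channel)
        self.attempts.append({"notification_id": nid, "ok": ok})
        if ok:
            self._delivered.add(nid)
        return "sent" if ok else "failed"
```

Failed attempts stay in the history and remain retryable, while a retry after success is skipped without contacting the provider.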

4. Schedule correctness is time-dependent #

Responder selection depends on timezone, handoff windows, overrides, and schedule version.
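A minimal sketch of responder resolution, assuming shifts and overrides are stored as timezone-aware half-open windows and that overrides take precedence over regular shifts. The `Shift` type and `active_responder` function are illustrative names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class Shift:
    responder: str
    start: datetime  # timezone-aware, inclusive
    end: datetime    # timezone-aware, exclusive


def active_responder(shifts: list[Shift], overrides: list[Shift], at: datetime) -> Optional[str]:
    """Pick the responder on call at `at`; overrides are checked before regular shifts."""
    if at.tzinfo is None:
        raise ValueError("decision time must be timezone-aware")
    for window in list(overrides) + list(shifts):
        if window.start <= at < window.end:
            return window.responder
    return None  # coverage gap; the caller decides how to fall back
```

Half-open windows mean the handoff instant belongs to exactly one responder, and aware datetimes compare correctly even when the schedule and the decision time use different offsets.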


Step 9 - Scale Adjustments #

| Hotspot | Type | First response |
| --- | --- | --- |
| very high alert-event rate | write throughput hotspot | shard by tenant/team/incident key and add more ingest/routing owners |
| noisy incident dedup groups | contention hotspot | isolate hot incident keys and tune grouping windows |
| schedule lookups and expansion | read hotspot | precompute near-future schedule windows and cache per team |
| notification storms during major outage | contention hotspot | batch similar incidents, apply suppression, and rate-limit channel sends |
| timer-driven escalation scans | write throughput hotspot | bucket incident escalation deadlines and scan incrementally |
| incident dashboard load | read hotspot | use projections and caching for list/timeline views |

What scales well #

This system scales by:

  • sharding incident and routing state by tenant/team/service
  • keeping alert-event and notification-attempt logs append-oriented
  • precomputing active on-call windows
  • using due-time indexes for escalations and reminders
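A due-time index can be sketched with a min-heap and lazy invalidation, so the timer loop only touches incidents whose deadline has passed instead of scanning all open incidents. The `EscalationIndex` class is a hypothetical in-memory illustration.

```python
import heapq


class EscalationIndex:
    """Min-heap of (due_at, incident_key) with lazy removal of stale entries."""

    def __init__(self) -> None:
        self._heap: list[tuple[float, str]] = []
        self._current: dict[str, float] = {}  # latest deadline per incident

    def schedule(self, incident_key: str, due_at: float) -> None:
        """Set or replace the deadline; older heap entries become stale."""
        self._current[incident_key] = due_at
        heapq.heappush(self._heap, (due_at, incident_key))

    def cancel(self, incident_key: str) -> None:
        """Called on ack/resolve; the heap entry is dropped when it surfaces."""
        self._current.pop(incident_key, None)

    def pop_due(self, now: float) -> list[str]:
        """Return incidents whose current deadline is at or before now."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            due_at, key = heapq.heappop(self._heap)
            if self._current.get(key) == due_at:  # skip cancelled or rescheduled entries
                del self._current[key]
                due.append(key)
        return due
```

The index only nominates candidates; the escalation transition itself should still re-check current incident state before paging anyone.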

What fails first #

Usually:

  • noisy alert storms
  • incorrect dedup grouping
  • expensive schedule expansion at runtime
  • notification-provider rate limits

Canonical design conclusion #

The mechanical outcome is:

  • primary state:
    • AlertEvent
    • IncidentState
    • RoutingPolicy
    • OnCallSchedule
    • SuppressionRule
    • NotificationAttempt
    • PartitionOwnership
    • PartitionMap
  • critical invariants:
    • immutable alert-event ingestion
    • one authoritative current incident lifecycle per dedup key
    • routing decisions valid only under current policy, schedule, and suppression state
    • ack/resolve/escalate are guarded by current incident state
    • exclusive shard ownership for incident/routing state
  • mechanisms:
    • append log
    • single writer per shard
    • guarded incident lifecycle transitions
    • current-value policy/schedule state
    • fenced shard ownership
  • reads:
    • direct authoritative reads for routing decisions
    • projections for incident lists, timelines, and dashboards

Polished interview answer #

I’d design the alerting platform as a sharded incident-routing service with one authoritative owner per team or incident shard. Upstream monitors append immutable alert events, and shard owners correlate them into a current incident lifecycle keyed by dedup rules. Routing decisions read current escalation policy, current on-call schedule, and any active suppressions to choose the right responder, while notification attempts are logged as immutable delivery facts. Acknowledge, resolve, and timed escalation are guarded incident-state transitions, so escalation only proceeds if the incident is still active and unacked. The main scaling levers are more shards, precomputed schedule windows, due-time indexes for escalation deadlines, and controls for noisy alert storms.


Concrete Substrate #

I’ll choose a sharded incident-routing service with append-oriented alert-event and notification-attempt logs plus current incident/policy/schedule state as the concrete baseline, because it matches the mechanics we derived:

  • append-only alert ingestion
  • guarded incident lifecycle
  • current-value policy/schedule/suppression state
  • timer-driven escalation
  • one owner per shard

Concrete tech family:

  • routing service in Go or Java
  • durable metadata/state storage:
    • replicated DB or RocksDB-backed service state
  • shard replication:
    • Raft or leader-follower replication with commit index
  • metadata/control:
    • etcd or internal metadata quorum for shard ownership/routing
  • external notification adapters:
    • email, SMS, push, Slack, PagerDuty-style connectors

Each shard owner stores:

  • alert-event log for owned scope
  • current IncidentState
  • current RoutingPolicy
  • current OnCallSchedule / expanded active windows
  • current SuppressionRule
  • notification-attempt history
  • escalation due-time index

Operation Layer #

1. Ingest alert event #

API

  • IngestAlert(event)

Initiator

  • upstream monitor / alert source

Entry point

  • ingestion API / routing frontend

Authoritative decider

  • shard owner for tenant/team/incident key

Precondition

  • event parsed and routed to correct shard

Transition

  • append AlertEvent
  • create or update IncidentState via dedup/correlation rules

Response

  • success / dedup-applied

2. Acknowledge incident #

API

  • AckIncident(incident_id, actor, expected_version?)

Initiator

  • user/client

Entry point

  • incident API

Authoritative decider

  • shard owner for incident

Precondition

  • incident currently ackable
  • actor authorized

Transition

  • guarded update IncidentState: OPEN/ESCALATED -> ACKED

3. Resolve incident #

API

  • ResolveIncident(incident_id, actor, expected_version?)

Initiator

  • user/client

Entry point

  • incident API

Authoritative decider

  • shard owner for incident

Precondition

  • incident currently resolvable

Transition

  • guarded update IncidentState -> RESOLVED

4. Escalate incident #

API

  • internal escalation timer flow

Initiator

  • system

Entry point

  • shard owner

Authoritative decider

  • shard owner

Precondition

  • escalation deadline elapsed
  • incident still unresolved/unacked and unsuppressed

Transition

  • guarded update IncidentState to next escalation step
  • append NotificationAttempt for next responder set

5. Update on-call schedule #

API

  • PutSchedule(schedule_id, schedule, expected_version?)

Initiator

  • admin

Entry point

  • config API

Authoritative decider

  • schedule store owner

Precondition

  • version matches if optimistic concurrency used

Transition

  • overwrite OnCallSchedule
  • optionally recompute near-future active windows

Entry Point vs Decider vs Responder #

| Path | Entry point | Authoritative decider | Physical responder | Logical responder |
| --- | --- | --- | --- | --- |
| alert ingest | ingestion API | incident shard owner | routing node | alerting system |
| ack / resolve | incident API | incident shard owner | API node | alerting system |
| schedule/policy update | config API | config store owner | config node | alerting system |
| escalation | shard owner timer loop | incident shard owner | routing node / adapter | alerting system |
| notification send | notification adapter | incident shard owner for decision, provider for delivery | adapter node | alerting system |
| shard failover | follower / coordination layer | shard quorum / lease store | new leader / control plane | alerting system |

Concrete HLD #

Main components:

  • ingestion/routing frontend
    • receives upstream alert events
  • incident shard owners
    • authoritative owners for incident lifecycle, dedup, and escalation
  • policy/schedule service
    • stores escalation policies and on-call schedules
  • notification adapters
    • send pages/messages over external channels
  • metadata/control service
    • tracks shard ownership and routing
  • incident query UI/API
    • serves incident lists and timelines from projections or source state

Short Interview Version #

I’d build the alerting platform as a sharded incident-routing service with one authoritative owner per team or incident shard. Upstream monitors append immutable alert events, and shard owners correlate them into a current incident lifecycle keyed by dedup rules. Routing decisions read current escalation policy, current on-call schedule, and active suppressions to choose the right responder, while notification attempts are logged as immutable delivery facts. Acknowledge, resolve, and timed escalation are guarded incident-state transitions, so escalation only proceeds if the incident is still active and unacked. The main scaling levers are more shards, precomputed schedule windows, due-time indexes for escalation deadlines, and controls for noisy alert storms.