Bit.ly: System Design #
Derived using the 20-step derivation framework. Every step produces an explicit output artifact. No steps are skipped or abbreviated.
Ordering Principle #
Product requirements
→ normalize into operations over state (Step 1)
→ extract primary objects (Step 2)
→ assign ownership, ordering, evolution (Step 3)
→ extract invariants (Step 4)
→ derive minimal DPs from invariants (Step 5)
→ select concrete mechanisms (Step 6)
→ validate independence and source-of-truth (Step 7)
→ specify exact algorithms (Step 8)
→ define logical data model (Step 9)
→ map to technology landscape (Step 10)
→ define deployment topology (Step 11)
→ classify consistency per path (Step 12)
→ identify scaling dimensions and hotspots (Step 13)
→ enumerate failure modes (Step 14)
→ define SLOs (Step 15)
→ define operational parameters (Step 16)
→ write runbooks (Step 17)
→ define observability (Step 18)
→ estimate costs (Step 19)
→ plan evolution (Step 20)
Context and Scale #
Bit.ly is a URL shortening service. The core proposition: long URLs become short codes (bit.ly/AbCd3F), which redirect to the original. Analytics are a premium feature.
Traffic asymmetry is the defining characteristic. Redirects outnumber creations by well over an order of magnitude — roughly 30:1 at the reference scale below, and far more skewed for popular links. Creating a short URL is the rare event. Following one is the core function. Any design that does not start from this asymmetry will fail at scale.
Reference scale:
- 300 million short URLs created per month → ~115 writes/sec
- 10 billion redirects per month → ~3,800 redirects/sec average, peaks 10–100x that
- P99 redirect latency budget: 10ms end-to-end (cache hit), 100ms (cold)
- Analytics: click events must not block the redirect path
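The per-second figures follow directly from the monthly totals; a quick arithmetic check (assuming a 30-day month):

```python
# Quick check of the reference-scale arithmetic (30-day month assumed).
SECONDS_PER_MONTH = 30 * 24 * 3600               # 2,592,000

creates_per_sec = 300e6 / SECONDS_PER_MONTH      # ~116 writes/sec
redirects_per_sec = 10e9 / SECONDS_PER_MONTH     # ~3,858 redirects/sec average
read_write_ratio = redirects_per_sec / creates_per_sec

print(f"{creates_per_sec:.0f} creates/sec, "
      f"{redirects_per_sec:.0f} redirects/sec, "
      f"{read_write_ratio:.0f}:1 read:write")
```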
Step 1 — Problem Normalization #
Goal: Rewrite each functional requirement as (actor, operation, state).
| Original Requirement | Actor | Operation | State Touched |
|---|---|---|---|
| User creates a short URL from a long URL | User / API client | Create record (append immutable mapping) | ShortLink record |
| User follows a short URL and is redirected | HTTP client (browser/bot) | Read redirect target; append click event | ShortLink.target_url (read); ClickEvent (append) |
| User sees click analytics (count, geo, referrer) | Authenticated user | Read derived aggregate | ClickAggregate (derived from ClickEvents) |
| User claims a custom slug | Authenticated user | Conditional insert (claim unique name) | ShortLink.slug (uniqueness domain) |
| Short URL expires after configured TTL | Scheduler / system | Transition state of ShortLink to EXPIRED | ShortLink.status |
| Admin disables / deletes a short URL | Admin actor | Overwrite ShortLink.status to DISABLED | ShortLink.status |
| User views their dashboard of short URLs | Authenticated user | Read projection over ShortLink records owned by user | UserLinkIndex (derived) |
Key observations from normalization:
- The redirect is a read of ShortLink.target_url followed by an append of a ClickEvent. It is not a write to any mutable state. This is critical for scaling: redirects are pure reads.
- ClickAggregate is not state — it is a derived view computed from ClickEvents. Treating it as primary state would create a contention-heavy write path on every redirect.
- Custom slug creation is a conditional insert — the operation must fail if the slug is already taken. This is an atomic uniqueness check, not a simple insert.
- URL expiry is a state machine transition from ACTIVE to EXPIRED. It can be driven by a TTL check on read (lazy expiry) or a scheduler (eager expiry). These have different tradeoffs.
Step 2 — Object Extraction #
Goal: Identify primary objects, classify each, and apply the four purity tests.
Primary Objects #
1. ShortLink #
Class: Stable entity (long-lived record with identity)
Description: The canonical mapping from a short code (slug) to a long URL. Created once, read billions of times.
Fields: slug (PK), target_url, owner_user_id, created_at, expires_at (nullable), status (ACTIVE | EXPIRED | DISABLED), custom_domain (nullable), alias (human-readable name for dashboard)
Purity tests:
- Ownership: Single writer after creation (owner and admin only). Slug assignment happens once at creation.
- Evolution: State machine (ACTIVE → EXPIRED, ACTIVE → DISABLED). Target URL is immutable after creation (append-only by convention; overwriting would break caches).
- Ordering: Total order within slug domain (each slug has exactly one owner; slug assignment is atomic and final).
- Derivability: Not derivable from any other object. ShortLink IS the source of truth for the redirect target.
Verdict: Primary object. Not derivable. Cannot be merged.
2. ClickEvent #
Class: Immutable event (append-only fact)
Description: A single redirect event — one click on one short URL. Contains the slug, timestamp, geo (country/city inferred from IP), referrer header, user-agent.
Fields: event_id (UUID), slug, occurred_at, country, city, referrer, user_agent, ip_hash (privacy-preserving)
Purity tests:
- Ownership: The redirect service is the sole writer. The user initiating the redirect does not own this record.
- Evolution: Append-only. A click that happened cannot un-happen. Immutable.
- Ordering: Total order by occurred_at within slug (for analytics queries). Partial order across slugs (no cross-slug ordering needed).
- Derivability: Not derivable. ClickEvents are the source facts. All analytics are derived from them.
Verdict: Primary object. Event class. Source of truth for all click analytics.
3. ClickAggregate #
Class: Derived view (computed from ClickEvents)
Description: Pre-computed counts and breakdowns (by day, country, referrer) for a slug. Serves dashboard queries without scanning billions of raw events.
Purity tests:
- Ownership: Written by the analytics pipeline, not by any user action.
- Evolution: Merge-friendly (counts are commutative: adding a new click increments a counter).
- Ordering: No meaningful order on the aggregate itself.
- Derivability: Fully derivable from ClickEvents by
SELECT slug, date_trunc('day', occurred_at), country, count(*) FROM ClickEvent GROUP BY .... Can be rebuilt from scratch.
Verdict: NOT a primary object. Derived view. Must not be treated as source of truth.
4. User #
Class: Stable entity
Description: Account that owns ShortLinks. Relevant for authentication, ownership enforcement, and dashboard queries.
Fields: user_id, email, plan (FREE | PRO | ENTERPRISE), created_at
Verdict: Primary object. Standard account entity. Not special to the core redirect path.
5. SlugNamespace #
Class: Relationship / uniqueness domain
Description: The global set of slug strings that are already claimed. Not a table per se, but the uniqueness invariant enforced across all ShortLink records.
Purity tests:
- Derivability: Derivable from ShortLink records (the set of all slugs in use). However, the uniqueness invariant must be enforced atomically at write time, not derived. The namespace IS the constraint surface.
Verdict: Not a standalone object. The uniqueness constraint belongs on ShortLink.slug as a unique index. The enforcement mechanism is a DP, not an object.
6. UserLinkIndex #
Class: Derived view (projection)
Description: The list of ShortLinks belonging to a user, ordered by created_at descending. Powers the user dashboard.
Derivability: Fully derivable from ShortLink records filtered by owner_user_id. Rebuild: SELECT * FROM ShortLink WHERE owner_user_id = ? ORDER BY created_at DESC.
Verdict: Derived view. Not a primary object. A secondary index on ShortLink.owner_user_id suffices.
Object Summary Table #
| Object | Class | Primary? | Source of Truth For |
|---|---|---|---|
| ShortLink | Stable entity | Yes | Redirect target, ownership, status |
| ClickEvent | Immutable event | Yes | All click analytics |
| ClickAggregate | Derived view | No | Nothing — projection only |
| User | Stable entity | Yes | Authentication, plan |
| SlugNamespace | Uniqueness constraint | No (constraint on ShortLink) | Enforced by unique index |
| UserLinkIndex | Derived view | No | Nothing — secondary index |
Step 3 — Axis Assignment #
Goal: For each primary object, assign ownership (who writes?), evolution (append/overwrite/state-machine/merge), and ordering (total/partial/none, bound to scope).
ShortLink #
Ownership: Multi-writer at creation (any authenticated user may create);
single writer after creation (owner or admin modifies status).
Slug assignment: system assigns auto-increment ID → Base62, or
user claims custom slug (one winner per slug — CAS semantic).
Evolution: State machine.
Valid transitions:
(none) → ACTIVE [at creation]
ACTIVE → EXPIRED [by scheduler when expires_at < now()]
ACTIVE → DISABLED [by admin or owner]
DISABLED → ACTIVE [by owner, premium only]
target_url is immutable after creation (no valid transition modifies it).
Ordering: Total order within slug (each slug is assigned once, atomically).
Causal lifecycle order for status transitions (ACTIVE must precede EXPIRED).
No cross-slug ordering needed.
ClickEvent #
Ownership: Single writer (redirect service instance that handled the request).
Multiple redirect service instances may write concurrently —
but each individual event has a single, unambiguous writer.
Evolution: Append-only. Events are immutable facts.
Ordering: Total order by occurred_at within slug (for per-slug analytics).
Approximate total order (clock skew across redirect nodes < 1ms
acceptable for analytics — not a strict invariant).
No ordering required across slugs.
User #
Ownership: Single writer (the user themselves for profile; system for plan changes).
Evolution: Overwrite for mutable fields (email, plan). Append-only for audit log.
Ordering: No meaningful order among users.
Step 4 — Invariant Extraction #
Goal: Derive precise, testable, concurrency-aware invariants from the normalized requirements.
Invariant List #
I1. [Uniqueness] Slug uniqueness across ShortLinks. For any two distinct ShortLink records L1 and L2, L1.slug ≠ L2.slug. This holds at all times, including under concurrent creation requests.
I2. [Uniqueness / Idempotency] Duplicate creation requests produce the same ShortLink. If the same client submits the same creation request N times (same idempotency key), exactly one ShortLink is created. Subsequent submissions return the previously created record, not an error and not a second record.
I3. [Eligibility] Redirect returns the target URL only for ACTIVE ShortLinks. A redirect request for slug S returns HTTP 302 only if ShortLink(S).status == ACTIVE AND (expires_at IS NULL OR expires_at > now()). For EXPIRED or DISABLED slugs, the correct response is HTTP 410.
I4. [Eligibility] Custom slug creation succeeds only if the slug is unclaimed. A custom slug claim for slug S succeeds only if no ShortLink with slug == S currently exists. If another actor claims the same slug concurrently, exactly one claim succeeds.
I5. [Ordering] Status transitions follow the valid state machine. ShortLink.status may only transition via valid edges. Specifically: DISABLED → EXPIRED is forbidden. EXPIRED → DISABLED is forbidden without explicit re-activation first. Any transition that is not an edge in the defined state machine must be rejected.
I6. [Accounting] ClickAggregate(slug, window) = count(ClickEvents where slug = S and occurred_at in window). The aggregate click count for any slug and time window exactly equals the number of ClickEvent records in that window. The aggregate may be stale by at most ε = 60 seconds under normal operation.
I7. [Propagation] A newly created or status-changed ShortLink is reflected in the redirect path within ε. After a ShortLink is created or its status changes, the redirect service must return the updated result within ε_redirect = 5 seconds. (Cache TTL bound.)
I8. [Uniqueness] Each ClickEvent is processed exactly once in analytics aggregation. A ClickEvent that is written to the event stream is counted exactly once in ClickAggregate. Duplicate deliveries (from retry or at-least-once delivery) must be deduplicated before incrementing the aggregate.
I9. [Access-control] Only the ShortLink owner or admin may modify ShortLink.status or ShortLink.alias. Reads of target_url during redirect are unauthenticated. Writes to any ShortLink field require proof of ownership (owner_user_id match) or admin role.
I10. [Accounting] Auto-generated slugs are globally unique without retry. The slug generation process for auto-generated (non-custom) slugs must produce a collision-free slug deterministically, without requiring optimistic retry. (Base62 of auto-incremented ID satisfies this; random generation does not without collision checking.)
Step 5 — DP Derivation #
Goal: Identify the minimal enforcing mechanism (design parameter) per invariant cluster.
A DP is not a technology name. It is the minimal runtime capability required to make the invariant cluster enforceable.
| Invariant Cluster | DP | Reasoning |
|---|---|---|
| I1, I4 — Slug uniqueness (system-assigned and custom) | Atomic conditional insert on slug as unique key | Only one mechanism can guarantee that exactly one writer wins: an atomic insert that fails if the key already exists. No application-level check-then-insert is sufficient (race condition). |
| I2 — Idempotent creation | Idempotency key store | A dedup table (idempotency_key → shortlink_id) with conditional insert. First write creates; subsequent writes with same key return cached result. |
| I3 — Redirect eligibility check | Low-latency key-value lookup with TTL-aware cache | The redirect path reads ShortLink by slug. Must be sub-millisecond for cache hit. Requires a cache that can serve millions of reads/sec. |
| I5 — State machine enforcement | CAS (compare-and-swap) on (status, version) | A status transition is valid only from a specific source state. Concurrent transitions to conflicting states must fail. CAS on (current_status, version) enforces this atomically. |
| I6, I8 — Click counting with exactly-once semantics | Append-only event log + idempotent consumer | Click events are appended to a durable log. Consumers read the log and aggregate. Dedup on event_id ensures exactly-once processing despite at-least-once delivery. |
| I7 — Redirect cache freshness | TTL-bounded cache with invalidation | Cache entries for ShortLink records expire after ε_redirect = 5 seconds. On status change, explicit cache invalidation reduces lag to near-zero for known-changed keys. |
| I9 — Access control | Token-gated write path | Every write request carries an authenticated session token. Service layer checks owner_user_id == authenticated_user_id before executing any mutation. |
| I10 — Collision-free auto-slug | Monotonic global counter → Base62 encoding | A single globally ordered counter (database sequence or distributed counter) guarantees uniqueness without collision probability. Base62 encoding produces short codes. |
Step 6 — Mechanism Selection #
Goal: The mechanical bridge from DPs to concrete mechanisms. Apply the full discrimination procedure for the three key paths.
6.1 DP Classification by Invariant Type #
| DP | Invariant Type | Mechanism Family |
|---|---|---|
| Atomic conditional insert (slug uniqueness) | Uniqueness | Locking / CAS family |
| Idempotency key store | Uniqueness / Idempotency | Conditional insert / dedup table |
| Low-latency KV lookup | Propagation | Cache family |
| CAS on status transitions | Eligibility + Ordering | CAS / optimistic locking family |
| Append-only event log | Accounting | Log / stream family |
| Idempotent consumer | Accounting | Dedup + aggregation family |
| TTL-bounded cache | Propagation | Cache family |
| Token-gated write path | Access-control | Auth middleware family |
| Monotonic counter + Base62 | Uniqueness | Sequential assignment family |
6.2 Ownership × Evolution Table #
| Object | Ownership | Evolution | Table Result |
|---|---|---|---|
| ShortLink (creation) | Multi-writer, one winner per slug | State machine | → CAS on (slug, version) |
| ShortLink (status update) | Single writer (owner/admin) | State machine | → CAS on (status, version) to prevent concurrent conflicting transitions |
| ClickEvent | Multi-writer (many redirect nodes), all succeed | Append-only | → No CAS needed; append to partitioned log |
| ClickAggregate | Single writer (analytics consumer) | Merge (commutative count) | → Idempotent increment with dedup (CRDT G-Counter semantics) |
6.3 Detailed Mechanical Derivation #
Derivation A: Slug Uniqueness Enforcement (Invariant I1, I4, I10) #
Step 6.3.1 — Q1 (Scope): The uniqueness constraint is within one service (the ShortLink creation service). It is not cross-region partitioned (a slug must be globally unique, not per-region unique). Therefore scope = within service → distributed CAS.
But “distributed CAS” requires a serialization point. Two options:
- Option A: Database unique index (the DB serializes inserts on the unique key). The database becomes the arbiter.
- Option B: Distributed lock (acquire lock on slug string before inserting). Adds network roundtrip and failure surface.
Option A (DB unique index) is strictly superior for slug uniqueness because:
- The slug IS the database key. No separate lock namespace.
- The insert IS the CAS. If it fails (duplicate key error), the caller knows exactly one other writer won.
- No lock timeout or holder crash to handle.
Step 6.3.2 — Q2 (Failure): Crash of the writer during creation: the row either committed or did not. No partial state. Network partition: the creation request fails; the client retries with an idempotency key.
Q2 → Idempotency Key required. Crash of the creation service after the DB insert but before returning to the client means the client retries. Without an idempotency key, retry would hit the unique index (duplicate key error) and the client would incorrectly report failure. With an idempotency key:
- First attempt: INSERT INTO idempotency_keys (key, shortlink_id) first, then INSERT ShortLink.
- Retry: SELECT from idempotency_keys returns existing shortlink_id → return success.
Step 6.3.3 — Q3 (Data):
The slug for auto-generated links is computed as Base62(auto_increment_id). The auto-increment is the source of truth for ordering and uniqueness. Base62 encoding is deterministic and bijective over the integer domain used. No collision probability.
For custom slugs: the slug string is user-supplied. The unique index is the collision mechanism.
Step 6.3.4 — Required combination: CAS (unique index INSERT) always requires an Idempotency Key. Applied.
Final mechanism for slug uniqueness:
- Database sequence generates a monotonically increasing link_id; slug = Base62(link_id) — deterministic, no collision, no retry needed.
- For custom slugs: INSERT INTO shortlinks (slug, ...) ON CONFLICT (slug) DO NOTHING — returns affected rows; if 0, the slug was taken.
- Idempotency key table: idempotency_keys(key TEXT PK, shortlink_id BIGINT, created_at TIMESTAMP).
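The Base62 step can be sketched in a few lines, using the alphabet defined in Step 8 (0–9, a–z, A–Z); the decode helper and the capacity comment are illustrative additions, not part of the design:

```python
# Alphabet matching Step 8: 0-9 -> '0'-'9', 10-35 -> 'a'-'z', 36-61 -> 'A'-'Z'.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def base62_encode(n: int) -> str:
    """Bijective over non-negative integers, so distinct IDs cannot collide.
    Six characters cover 62**6 = 56,800,235,584 IDs."""
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n > 0:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))

def base62_decode(s: str) -> int:
    """Inverse mapping (illustrative helper; not required by the design)."""
    n = 0
    for ch in s:
        n = n * 62 + ALPHABET.index(ch)
    return n
```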
Derivation B: Click Event Pipeline (Invariants I6, I8) #
Step 6.3.1 — Q1 (Scope): Click events are written by many redirect service instances across many servers. This is multi-writer, append-only. The destination is a durable log. Scope = cross-service (redirect service → analytics service). This eliminates in-transaction handling. The mechanism family is async guaranteed delivery.
Step 6.3.2 — Q5 (Coupling): The redirect path must not block on analytics writes (10ms latency budget). Therefore coupling is async, guaranteed delivery → Outbox pattern or direct append to a durable log.
Outbox pattern (write to DB, relay to stream) is appropriate when the event must be durably captured in the same transaction as the primary state change. But for click events:
- The primary state change IS the click (redirect happened). There is no transactional coupling to a DB row.
- The redirect service already responded 302 to the client.
- Therefore: direct append to a durable partitioned log (e.g., Kafka) is correct. No outbox needed.
Write-ahead logging (CDC) would require a DB write on every redirect — that DB write IS the bottleneck we are trying to avoid.
Step 6.3.3 — Q2 (Failure): Crash of redirect node after responding 302 but before appending to log: the click is lost. Acceptable — analytics is eventually consistent (I6 allows ε = 60 seconds of staleness; losing a small fraction of clicks is an acceptable analytics approximation at this scale). If tighter capture is required, the redirect node can issue the Kafka append before sending the 302 rather than after, shrinking — but not eliminating — the window of loss, at the cost of added redirect latency.
Duplicate delivery from Kafka at-least-once: I8 requires exactly-once counting. Mechanism: each ClickEvent has an event_id = UUID. The analytics consumer maintains a dedup window: processed_event_ids (Bloom filter or Redis SET with TTL). Before incrementing ClickAggregate, check that event_id is not in the dedup set.
For ClickHouse (columnar OLAP): use ReplacingMergeTree or AggregatingMergeTree with event_id dedup key to handle at-least-once delivery natively.
Step 6.3.4 — Q3 (Data): Click counts are commutative and associative (count += 1). This is G-Counter CRDT semantics. Each analytics consumer shard can independently accumulate partial counts and merge. No serialization point needed for counting.
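The G-Counter semantics can be made concrete; a minimal in-memory sketch (illustrative only, not a production CRDT library):

```python
class GCounter:
    """Grow-only counter CRDT: each writer shard increments only its own slot;
    merge is element-wise max, so merging is commutative, associative, and
    idempotent. No serialization point is needed to combine partial counts."""

    def __init__(self):
        self.slots = {}  # shard_id -> count

    def increment(self, shard_id, n=1):
        self.slots[shard_id] = self.slots.get(shard_id, 0) + n

    def merge(self, other):
        merged = GCounter()
        for k in set(self.slots) | set(other.slots):
            merged.slots[k] = max(self.slots.get(k, 0), other.slots.get(k, 0))
        return merged

    @property
    def value(self):
        return sum(self.slots.values())
```

Two analytics consumers each accumulate their own slot; merging their states (in any order, any number of times) yields the same total.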
Step 6.3.5 — Q4 (Access): Analytics is read » write for the aggregation layer (10B events/month, queries from dashboards). The aggregate is a pre-computed materialized view. Pattern: CQRS — ClickEvent log is the write model; ClickAggregate tables are the read model.
Final mechanism for click event pipeline:
Redirect service
→ fire-and-forget async write to Kafka topic: click_events
→ partition key: slug (ensures per-slug ordering within partition)
Kafka consumer (analytics worker)
→ reads click_events topic at-least-once
→ deduplicates on event_id (Bloom filter in memory, 5-minute window)
→ batches 10,000 events
→ bulk-inserts ClickEvent rows to ClickHouse
→ ClickHouse ReplacingMergeTree merges in background
ClickHouse aggregate table
→ materialized view: SELECT slug, toDate(occurred_at), country, count()
FROM click_events GROUP BY slug, date, country
→ refreshed every 60 seconds (satisfies ε = 60s from I6)
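The consumer's dedup-then-aggregate step can be sketched as follows, substituting a plain in-memory set for the Bloom filter and a dict for the ClickHouse table (names are illustrative):

```python
def aggregate_batch(events, seen_event_ids, aggregates):
    """One consumer step: drop already-seen event_ids (so at-least-once delivery
    becomes effectively exactly-once), then fold survivors into
    per-(slug, day, country) counts. A plain set stands in for the Bloom
    filter; a dict stands in for the ClickHouse aggregate table."""
    for ev in events:
        if ev["event_id"] in seen_event_ids:
            continue                      # duplicate delivery: skip
        seen_event_ids.add(ev["event_id"])
        key = (ev["slug"], ev["occurred_at"][:10], ev["country"])
        aggregates[key] = aggregates.get(key, 0) + 1
    return aggregates
```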
Derivation C: Redirect Lookup Optimization (Invariant I3, I7) #
Step 6.3.1 — Q1 (Scope): The redirect lookup is a read of ShortLink by slug. With 10B redirects/month (~3,800 req/sec average, 100K+ req/sec peak for viral URLs), this is read-heavy. Pattern: Cache-Aside.
Step 6.3.2 — Q4 (Access): Read » Write (roughly 30:1 overall at the reference scale, far higher for hot slugs). The correct pattern is a multi-layer cache:
- Layer 1: CDN edge cache (geo-distributed) — serves redirects at the PoP nearest the user, keeping 80%+ of traffic from reaching origin. Cache-Control: max-age of 30–300 seconds depending on expected change frequency (new links are rarely changed); a max-age matching I7's ε_redirect = 5 seconds directly would be too aggressive for a CDN, so status changes are covered by explicit purge instead.
- Layer 2: Redis in-memory cache — serves slugs not in the CDN cache. Sub-millisecond. Cache TTL = 60 seconds for hot slugs. Capacity: 100M active slugs × 200 bytes = 20GB.
- Layer 3: Postgres — authoritative source. Hit only on cache miss.
Step 6.3.3 — Hotspot problem:
A viral URL (slug abc123) may receive millions of redirects/sec. All hit the same Redis key. This is the thundering herd / hotspot problem.
Mechanisms:
- CDN absorbs 95%+ of viral traffic — viral URLs have extremely high cache hit rates because many users in many regions access the same URL. CDN caches at edge.
- Local in-process cache in redirect service — each redirect service instance maintains a local LRU cache (10K entries, 1-second TTL). Viral slugs are served from process memory without Redis roundtrip.
- Redis read replicas — Redis Cluster with multiple read replicas per shard. Viral slug key can be read from any replica.
Step 6.3.4 — Q2 (Failure): Cache miss on viral URL going cold (TTL expires at CDN and Redis simultaneously):
- Thundering herd: thousands of requests flood Redis/DB simultaneously.
- Mechanism: probabilistic early re-caching (extend TTL before expiry for hot keys), OR request coalescing at Redis layer (SETNX-based mutex: first miss acquires “I am fetching” lock, others wait briefly then get result).
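Probabilistic early re-caching can be sketched as follows: each reader draws an exponentially distributed jitter scaled by the recompute cost, so a hot key is refreshed by some reader shortly before TTL expiry rather than by a stampede after it (a sketch of the published "XFetch" approach; parameter names are ours):

```python
import math
import random
import time

def should_refresh_early(expiry_ts, recompute_cost_s, beta=1.0, now=None):
    """Probabilistic early expiration: as expiry approaches, each reader
    independently decides to refresh with rising probability, so one request
    repopulates the cache instead of a synchronized stampede at TTL expiry.
    beta > 1 refreshes earlier / more aggressively."""
    now = time.time() if now is None else now
    u = random.random() or 1e-12                       # guard against log(0)
    jitter = recompute_cost_s * beta * -math.log(u)    # Exp-distributed headroom
    return now + jitter >= expiry_ts
```

In the redirect service, `recompute_cost_s` would be the measured time to refetch from Postgres; a reader that gets True refreshes the Redis entry and extends its TTL before it lapses.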
Step 6.3.5 — Cache invalidation on status change: When ShortLink.status changes (ACTIVE → DISABLED), cached entries must be invalidated:
- Send invalidation to Redis: DEL shortlink:{slug}.
- CDN purge API call: PURGE https://bit.ly/{slug}.
- Residual staleness is bounded by CDN purge propagation time, typically ~1–3 seconds.
- Acceptable per I7 (ε_redirect = 5 seconds).
Step 6.3.6 — Required combination: Cache-Aside always needs a TTL (prevents stale-forever entries if invalidation fails). Applied: Redis TTL = 60 seconds, CDN max-age = 300 seconds (or purged on change).
Final mechanism for redirect lookup:
Client request: GET https://bit.ly/{slug}
→ CDN edge (layer 1):
HIT: return 302, target_url, Cache-Control: max-age=300
MISS: forward to redirect service
→ Redirect service (layer 2 — local process cache, 1s TTL LRU):
HIT: return 302
MISS: forward to Redis
→ Redis (layer 3 — 60s TTL):
HIT: populate local cache, return 302
MISS: coalesce concurrent fetches (SETNX lock)
→ Postgres (authoritative):
SELECT target_url, status, expires_at FROM shortlinks WHERE slug = ?
If status != ACTIVE or expires_at < now(): return 410
Return 302, populate Redis, return to caller
At scale, a cache miss is an exceptional event, not a normal path. The redirect path must never be allowed to become a DB read path under load. Redis must be provisioned to handle 100% of redirect traffic if the CDN is bypassed.
Step 7 — Axiomatic Validation #
Source-of-Truth Table #
| State Domain | Source of Truth | Location |
|---|---|---|
| Slug → target_url mapping | ShortLink table | Postgres (primary) |
| ShortLink.status | ShortLink table | Postgres (primary) |
| Click facts | ClickEvent table | ClickHouse (append-only) |
| Click aggregates | ClickAggregate materialized view | ClickHouse (derived) |
| User accounts | User table | Postgres |
| Hot slug cache | Redis key shortlink:{slug} | Redis (cache — not source of truth) |
| Idempotency keys | idempotency_keys table | Postgres |
Dependency table:
| Derived Object | Depends On | Rebuild Path |
|---|---|---|
| ClickAggregate | ClickEvent | INSERT INTO click_agg SELECT slug, date, country, count(*) FROM click_events GROUP BY ... |
| Redis cache entry | ShortLink (Postgres) | On cache miss: fetch from Postgres, repopulate Redis |
| UserLinkIndex | ShortLink | SELECT * FROM shortlinks WHERE owner_user_id = ? (secondary index) |
| CDN cached redirect | ShortLink (Postgres, via redirect service) | Purge CDN + TTL expiry; rebuild on next request |
Projections with rebuild paths:
ClickAggregate rebuild: If ClickHouse is corrupted, replay ClickEvent log from Kafka (retain 30 days) → rebuild all aggregates. Duration: hours for full history, minutes for recent windows.
Redis cache rebuild: If Redis is wiped, cache warms organically on first miss per slug. No explicit rebuild needed. Hot slugs repopulate within seconds under live traffic.
UserLinkIndex: It is a secondary index on Postgres. If the index is dropped, it can be rebuilt in minutes with CREATE INDEX CONCURRENTLY.
Independence check:
- ClickEvent is not derived from ShortLink. Click events reference slug as a foreign key (denormalized — slug may be deleted but click events are retained for historical analytics). This is correct: events are immutable facts; deleting the ShortLink does not delete the analytics history.
- ClickAggregate is fully derivable from ClickEvent. No circular dependency.
- Redis cache is fully derivable from Postgres. No circular dependency.
Step 8 — Algorithm Design #
Write Path 1: Short URL Creation (Auto-generated Slug) #
function createShortLink(request: CreateRequest, idempotency_key: string) -> ShortLink:
// Step 1: Idempotency check
existing = db.query(
"SELECT shortlink_id FROM idempotency_keys WHERE key = $1",
[idempotency_key]
)
if existing:
return db.query("SELECT * FROM shortlinks WHERE id = $1", [existing.shortlink_id])
// Step 2: Validate input
if not isValidURL(request.target_url):
raise InvalidURLError
if request.expires_at != null and request.expires_at < now() + 60s:
raise InvalidExpiryError // must expire at least 60 seconds in the future
// Step 3: Generate slug from auto-increment ID
// The DB sequence guarantees monotonicity and uniqueness.
// Base62 encoding: 0-9 = '0'-'9', 10-35 = 'a'-'z', 36-61 = 'A'-'Z'
link_id = db.nextval('shortlinks_id_seq')
slug = base62_encode(link_id) // deterministic, no collision possible
// Step 4: Insert ShortLink (cannot fail on slug uniqueness for auto-generated)
db.execute("""
INSERT INTO shortlinks (id, slug, target_url, owner_user_id, created_at, expires_at, status)
VALUES ($1, $2, $3, $4, now(), $5, 'ACTIVE')
""", [link_id, slug, request.target_url, request.user_id, request.expires_at])
// Step 5: Record idempotency key atomically
db.execute("""
INSERT INTO idempotency_keys (key, shortlink_id, created_at)
VALUES ($1, $2, now())
""", [idempotency_key, link_id])
// Step 6: Return created record
return ShortLink{id: link_id, slug: slug, ...}
Idempotency: Step 1 checks the idempotency key before any side effect occurs. But if the service crashes after step 4 and before step 5, the ShortLink row exists while the idempotency key does not. On retry, the check at step 1 misses, a fresh sequence value is drawn, and a second, orphaned ShortLink is created — violating I2. Mitigation: wrap steps 4 and 5 in a single DB transaction.
Revised (correct) version:
BEGIN TRANSACTION
link_id = nextval('shortlinks_id_seq')
slug = base62_encode(link_id)
INSERT INTO shortlinks (id, slug, ...)
INSERT INTO idempotency_keys (key, shortlink_id, ...)
COMMIT
If the transaction rolls back, neither row is written. The client retries with the same idempotency key. The idempotency check at step 1 returns nothing, and the transaction is retried cleanly.
Write Path 2: Custom Slug Claim #
function claimCustomSlug(request: CustomSlugRequest, idempotency_key: string) -> ShortLink:
// Step 1: Idempotency check (same as above)
existing = checkIdempotency(idempotency_key)
if existing: return existing
// Step 2: Validate slug format
if not isValidSlugFormat(request.slug): // alphanum + hyphen, 3-50 chars
raise InvalidSlugError
// Step 3: Attempt atomic insert (CAS on slug uniqueness)
BEGIN TRANSACTION
result = db.execute("""
INSERT INTO shortlinks (id, slug, target_url, owner_user_id, status, created_at)
VALUES (nextval('shortlinks_id_seq'), $1, $2, $3, 'ACTIVE', now())
ON CONFLICT (slug) DO NOTHING
RETURNING id, slug
""", [request.slug, request.target_url, request.user_id])
if result.rowcount == 0:
ROLLBACK
raise SlugAlreadyTakenError(slug=request.slug) // I4 enforced
link_id = result.rows[0].id
INSERT INTO idempotency_keys (key, shortlink_id, ...) VALUES (idempotency_key, link_id, now())
COMMIT
return ShortLink{slug: request.slug, ...}
CAS semantics: ON CONFLICT (slug) DO NOTHING is the CAS. If two requests race for the same slug, Postgres serializes them at the unique index. Exactly one succeeds; the other sees rowcount == 0.
Write Path 3: Status Transition (Disable / Enable) #
function updateStatus(slug: string, new_status: Status, actor: User) -> ShortLink:
// Step 1: Fetch current record
link = db.query("SELECT id, status, version, owner_user_id FROM shortlinks WHERE slug = $1", [slug])
if not link:
raise NotFoundError
// Step 2: Access control (I9)
if link.owner_user_id != actor.user_id and not actor.is_admin:
raise UnauthorizedError
// Step 3: Validate state machine transition (I5)
valid_transitions = {
'ACTIVE': ['EXPIRED', 'DISABLED'],
'DISABLED': ['ACTIVE'],
'EXPIRED': [] // EXPIRED is terminal unless manually overridden by admin
}
if new_status not in valid_transitions[link.status]:
raise InvalidTransitionError(from=link.status, to=new_status)
// Step 4: CAS update on (status, version) to prevent concurrent conflicting writes
result = db.execute("""
UPDATE shortlinks
SET status = $1, version = version + 1, updated_at = now()
WHERE slug = $2 AND status = $3 AND version = $4
""", [new_status, slug, link.status, link.version])
if result.rowcount == 0:
// Concurrent update; retry from Step 1 (optimistic locking)
raise ConflictError // caller should retry
// Step 5: Invalidate cache
redis.del(f"shortlink:{slug}")
cdn.purge(f"https://bit.ly/{slug}") // async, best-effort
return fetchUpdated(slug)
Retry on conflict: The caller (service layer) retries up to 3 times with exponential backoff. Slug status changes are rare; conflict probability is negligible.
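The caller-side retry loop might look like the following sketch, with SQLite standing in for Postgres and the transition-validity check omitted for brevity; `cas_update_status` is a hypothetical helper name.

```python
import sqlite3, time

# Optimistic-locking sketch: re-read, CAS on (status, version), back off on conflict.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE shortlinks (slug TEXT PRIMARY KEY, status TEXT, version INTEGER)")
db.execute("INSERT INTO shortlinks VALUES ('AbCd3F', 'ACTIVE', 1)")

def cas_update_status(slug: str, new_status: str, max_retries: int = 3) -> bool:
    for attempt in range(max_retries):
        row = db.execute(
            "SELECT status, version FROM shortlinks WHERE slug = ?", (slug,)
        ).fetchone()
        if row is None:
            return False
        status, version = row
        cur = db.execute(
            "UPDATE shortlinks SET status = ?, version = version + 1 "
            "WHERE slug = ? AND status = ? AND version = ?",
            (new_status, slug, status, version),
        )
        db.commit()
        if cur.rowcount == 1:
            return True                    # CAS succeeded
        time.sleep(0.01 * (2 ** attempt))  # exponential backoff, then re-read
    return False

print(cas_update_status("AbCd3F", "DISABLED"))  # True; version bumped 1 -> 2
```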
Read Path: Redirect #
function redirect(slug: string) -> HTTPResponse:
// Layer 1: Process-local cache (LRU, 10K entries, 1-second TTL)
cached = localCache.get(slug)
if cached:
if cached.status != 'ACTIVE' or cached.is_expired():
return HTTP_410
        emit_click_event_async(slug, request) // count local-cache hits too, else viral-slug clicks go uncounted
        return HTTP_302(Location=cached.target_url)
// Layer 2: Redis (60-second TTL)
cached = redis.get(f"shortlink:{slug}")
if cached:
localCache.set(slug, cached, ttl=1s)
if cached.status != 'ACTIVE' or cached.is_expired():
return HTTP_410
emit_click_event_async(slug, request) // fire-and-forget to Kafka
return HTTP_302(Location=cached.target_url)
// Layer 3: Postgres (cache miss — this is a failure condition under load)
// Use coalescing to prevent thundering herd
lock_key = f"fetching:{slug}"
acquired = redis.set(lock_key, '1', NX=true, EX=1) // 1-second lock
if not acquired:
// Another request is fetching; wait briefly and check Redis again
sleep(10ms)
cached = redis.get(f"shortlink:{slug}")
if cached:
// Successfully coalesced
... (same as Layer 2 hit above)
else:
// Timeout; fall through to DB anyway
pass
link = db.query("""
SELECT target_url, status, expires_at FROM shortlinks WHERE slug = $1
""", [slug])
redis.del(lock_key)
if not link:
redis.set(f"shortlink:{slug}", NEGATIVE_CACHE_SENTINEL, EX=60) // negative cache
return HTTP_404
// Populate caches
redis.set(f"shortlink:{slug}", serialize(link), EX=60)
localCache.set(slug, link, ttl=1s)
if link.status != 'ACTIVE' or (link.expires_at and link.expires_at < now()):
return HTTP_410
emit_click_event_async(slug, request)
return HTTP_302(Location=link.target_url)
Negative caching: Non-existent slugs also get cached (with a sentinel value) to prevent DB hammering for 404s from bots.
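A minimal in-process sketch of the negative-caching pattern, assuming a dict-backed TTL cache and a sentinel object; `TTLCache` and `lookup` are illustrative names, not the service's actual API.

```python
import time

NEGATIVE = object()  # sentinel meaning "slug confirmed non-existent"

class TTLCache:
    """Minimal TTL cache for the negative-caching sketch."""
    def __init__(self):
        self._store = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None or entry[1] < time.monotonic():
            return None   # miss or expired
        return entry[0]

    def set(self, key, value, ttl_s):
        self._store[key] = (value, time.monotonic() + ttl_s)

cache = TTLCache()
DB = {"AbCd3F": "https://example.com/long"}  # stand-in for Postgres

def lookup(slug):
    hit = cache.get(slug)
    if hit is NEGATIVE:
        return 404                      # cached non-existence, DB untouched
    if hit is not None:
        return 302
    target = DB.get(slug)
    if target is None:
        cache.set(slug, NEGATIVE, 60)   # cache the 404 for 60s
        return 404
    cache.set(slug, target, 60)
    return 302

print(lookup("nope"))    # 404 — hits the DB once
print(lookup("nope"))    # 404 — served from negative cache, no DB read
print(lookup("AbCd3F"))  # 302
```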
Async Path: Click Event Emission #
// Fire-and-forget from redirect handler (runs in separate goroutine/thread)
function emit_click_event_async(slug: string, request: HTTPRequest):
event = ClickEvent{
event_id: uuid4(), // globally unique, used for dedup in consumer
slug: slug,
occurred_at: now_utc(),
country: geoip_country(request.client_ip),
city: geoip_city(request.client_ip),
referrer: request.headers.get('Referer', '')[:500], // truncated
user_agent: request.headers.get('User-Agent', '')[:500],
ip_hash: sha256(request.client_ip)[:16] // privacy-preserving
}
// Kafka append — partition by slug to preserve per-slug ordering
kafka.produce(
topic='click_events',
key=slug, // partition key
value=protobuf_encode(event),
acks='1' // leader ack only — async throughput over durability
)
// If Kafka produce fails, log and drop — analytics loss is acceptable
// Do NOT block the redirect response on this
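The event construction can be sketched in Python; `build_click_event` is an illustrative helper, GeoIP enrichment is omitted, and a plain dict stands in for the protobuf-encoded ClickEvent.

```python
import hashlib, uuid
from datetime import datetime, timezone

def build_click_event(slug: str, client_ip: str, referrer: str, user_agent: str) -> dict:
    """Sketch of ClickEvent construction: header truncation plus a
    privacy-preserving IP hash. The raw IP never leaves the process."""
    return {
        "event_id": str(uuid.uuid4()),        # globally unique, dedup key for the consumer
        "slug": slug,
        "occurred_at": datetime.now(timezone.utc).isoformat(),
        "referrer": referrer[:500],           # bound unbounded client headers
        "user_agent": user_agent[:500],
        "ip_hash": hashlib.sha256(client_ip.encode()).hexdigest()[:16],
    }

ev = build_click_event("AbCd3F", "203.0.113.9", "https://twitter.com/x", "Mozilla/5.0")
print(len(ev["ip_hash"]))  # 16
```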
State Machine Diagram #
CREATE
│
▼
┌─────────┐
│ ACTIVE │◄────────────────────┐
└─────────┘ │ (admin re-activate)
│ │ │
expires_at│ │owner/admin │
reached │ │ disables │
▼ ▼ │
┌─────────┐ ┌──────────────────┐│
│ EXPIRED │ │ DISABLED ├┘
└─────────┘ └──────────────────┘
(terminal (owner can re-activate
for users) if plan allows)
Step 9 — Logical Data Model #
Table: shortlinks #
CREATE TABLE shortlinks (
id BIGINT PRIMARY KEY DEFAULT nextval('shortlinks_id_seq'),
slug VARCHAR(64) NOT NULL,
target_url TEXT NOT NULL, -- max 8192 chars
owner_user_id BIGINT NOT NULL REFERENCES users(id),
status VARCHAR(16) NOT NULL DEFAULT 'ACTIVE'
CHECK (status IN ('ACTIVE', 'EXPIRED', 'DISABLED')),
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
expires_at TIMESTAMPTZ, -- NULL = never expires
updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
version BIGINT NOT NULL DEFAULT 1, -- for CAS on status updates
is_custom_slug BOOLEAN NOT NULL DEFAULT false,
alias VARCHAR(256), -- human-readable name
custom_domain VARCHAR(256), -- premium feature
CONSTRAINT uq_slug UNIQUE (slug), -- I1, I4 enforcement
CONSTRAINT uq_custom_domain_slug UNIQUE (custom_domain, slug)
);
-- Secondary index for user dashboard (I9, UserLinkIndex projection)
CREATE INDEX idx_shortlinks_owner ON shortlinks(owner_user_id, created_at DESC);
-- Index for expiry scheduler (TTL enforcement)
CREATE INDEX idx_shortlinks_expires ON shortlinks(expires_at) WHERE expires_at IS NOT NULL AND status = 'ACTIVE';
Partition key: slug — all redirect lookups are by slug. The unique index on slug IS the partition key for the redirect path.
Dedup key: id (sequence) for auto-generated slugs; slug (unique index) for custom slugs.
Table: idempotency_keys #
CREATE TABLE idempotency_keys (
key VARCHAR(128) PRIMARY KEY, -- client-supplied idempotency key
shortlink_id BIGINT NOT NULL REFERENCES shortlinks(id),
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
-- TTL: rows older than 7 days can be deleted (cron job)
CREATE INDEX idx_idem_created ON idempotency_keys(created_at);
Table: users #
CREATE TABLE users (
id BIGINT PRIMARY KEY DEFAULT nextval('users_id_seq'),
email VARCHAR(320) NOT NULL UNIQUE,
plan VARCHAR(16) NOT NULL DEFAULT 'FREE'
CHECK (plan IN ('FREE', 'PRO', 'ENTERPRISE')),
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
password_hash TEXT NOT NULL
);
Table: click_events (ClickHouse) #
-- ClickHouse DDL
CREATE TABLE click_events (
event_id UUID,
slug String,
occurred_at DateTime64(3, 'UTC'),
country LowCardinality(String),
city String,
referrer String,
user_agent String,
ip_hash FixedString(16)
)
ENGINE = ReplacingMergeTree(occurred_at)
PARTITION BY toYYYYMM(occurred_at)
ORDER BY (slug, occurred_at, event_id);
-- ReplacingMergeTree deduplicates on (slug, occurred_at, event_id) — I8 enforcement
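The dedup guarantee can be illustrated with a consumer-side sketch keyed on the same tuple as the ORDER BY; this is a simplified model of what the merge achieves, not ClickHouse itself.

```python
# Redelivered events (Kafka at-least-once) collapse to one row when keyed
# the same way as the ReplacingMergeTree ORDER BY: (slug, occurred_at, event_id).
def dedup(events: list[dict]) -> list[dict]:
    seen = {}
    for ev in events:
        key = (ev["slug"], ev["occurred_at"], ev["event_id"])
        seen[key] = ev          # duplicates overwrite, so each key survives once
    return list(seen.values())

batch = [
    {"slug": "AbCd3F", "occurred_at": "2026-04-01T12:00:01Z", "event_id": "e1"},
    {"slug": "AbCd3F", "occurred_at": "2026-04-01T12:00:01Z", "event_id": "e1"},  # redelivery
    {"slug": "AbCd3F", "occurred_at": "2026-04-01T12:00:02Z", "event_id": "e2"},
]
print(len(dedup(batch)))  # 2 — the redelivered e1 is counted once (I8)
```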
Table: click_aggregates (ClickHouse Materialized View) #
CREATE MATERIALIZED VIEW click_agg_daily
ENGINE = AggregatingMergeTree()
PARTITION BY toYYYYMM(date)
ORDER BY (slug, date, country)
AS SELECT
slug,
toDate(occurred_at) AS date,
country,
countState() AS click_count
FROM click_events
GROUP BY slug, date, country;
-- Query view:
CREATE VIEW click_agg_daily_view AS
SELECT slug, date, country, countMerge(click_count) AS clicks
FROM click_agg_daily
GROUP BY slug, date, country;
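The pre-aggregation the view computes is equivalent to this group-by sketch (Python stand-in, not ClickHouse):

```python
from collections import Counter

# Python equivalent of the daily materialized view:
# click counts grouped by (slug, toDate(occurred_at), country).
events = [
    {"slug": "AbCd3F", "occurred_at": "2026-04-01T12:00:01Z", "country": "US"},
    {"slug": "AbCd3F", "occurred_at": "2026-04-01T18:30:00Z", "country": "US"},
    {"slug": "AbCd3F", "occurred_at": "2026-04-02T09:00:00Z", "country": "DE"},
]

def aggregate_daily(events):
    agg = Counter()
    for ev in events:
        date = ev["occurred_at"][:10]  # toDate(occurred_at)
        agg[(ev["slug"], date, ev["country"])] += 1
    return dict(agg)

print(aggregate_daily(events))
# {('AbCd3F', '2026-04-01', 'US'): 2, ('AbCd3F', '2026-04-02', 'DE'): 1}
```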
Step 10 — Technology Landscape #
Mapping procedure: capability → shape → specific product.
| DP | Capability Required | Shape | Selected Product | Justification |
|---|---|---|---|---|
| Atomic conditional insert, CAS updates, idempotency store | Serializable ACID transactions, unique indexes | Relational OLTP | PostgreSQL 16 | Proven unique index semantics; ON CONFLICT CAS; nextval() for sequence generation; wide ecosystem |
| Low-latency KV cache for redirect lookup | Sub-ms read, TTL, ~20GB dataset, high concurrency | In-memory KV store | Redis 7 (Redis Cluster) | <1ms p99 reads; TTL per key; cluster for horizontal read scaling; wide client support |
| Click event pipeline — durable ordered log | High-throughput append, ordered per partition, at-least-once delivery, replay | Partitioned durable log | Apache Kafka | Sustains millions of events/sec per cluster; log retention for replay; partition-by-slug for ordering; mature ecosystem |
| Analytics storage — click events and aggregates | Append-heavy writes, column-scan aggregation queries, time-series partitioning | Columnar OLAP | ClickHouse | 1B+ rows/day insertions; ReplacingMergeTree for dedup; materialized views for pre-aggregation; 10–100x faster than Postgres for analytics |
| CDN edge redirect serving | Geo-distributed caching, HTTP redirect serving, cache purge API | CDN | Cloudflare (or Fastly) | 300+ PoPs; <5ms to 95% of world population; cache purge API; Cloudflare Workers for edge logic |
| Geo-IP resolution for click events | IP → country/city mapping, <1ms latency, ~1GB dataset | In-process library with mmdb | MaxMind GeoLite2 (mmdb) | In-process lookup, no network roundtrip; updated weekly; covers 99%+ of IPs |
| Process-local cache in redirect service | LRU eviction, 1s TTL, in-memory | In-process LRU | go-cache / Caffeine (language-specific) | Zero network roundtrip; fits in L2/L3 CPU cache for hot slugs |
| Expiry scheduler | Periodic TTL check and status transition | Cron + DB query | pg_cron (Postgres extension) or dedicated Go worker | UPDATE shortlinks SET status='EXPIRED' WHERE expires_at < now() AND status='ACTIVE' — runs every 60 seconds |
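The dedicated-worker variant of the expiry sweep might look like the following sketch, with SQLite standing in for Postgres and a simplified schema; a production sweep would also return the affected slugs so their cache entries can be invalidated.

```python
import sqlite3, time

# One pass of the expiry scheduler: flip overdue ACTIVE links to EXPIRED.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE shortlinks (slug TEXT PRIMARY KEY, status TEXT, expires_at REAL)")
now = time.time()
db.execute("INSERT INTO shortlinks VALUES ('old1', 'ACTIVE', ?)", (now - 10,))
db.execute("INSERT INTO shortlinks VALUES ('live', 'ACTIVE', ?)", (now + 3600,))

def expire_sweep() -> int:
    cur = db.execute(
        "UPDATE shortlinks SET status = 'EXPIRED' "
        "WHERE expires_at IS NOT NULL AND expires_at < ? AND status = 'ACTIVE'",
        (time.time(),),
    )
    db.commit()
    return cur.rowcount  # number of links flipped this pass

print(expire_sweep())  # 1 — only 'old1' flips; 'live' is untouched
```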
Step 11 — Deployment Topology #
Service Boundaries #
┌─────────────────────────────────────────────────────────────────────┐
│ CDN Edge (Cloudflare) │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Redirect cache (max-age=300, surrogate-key={slug}) │ │
│ │ Cloudflare Workers: edge slug validation + cache miss proxy │ │
│ └─────────────────────────────────────────────────────────────┘ │
└──────────────────────────┬──────────────────────────────────────────┘
│ Cache miss only (~1-5% of traffic)
▼
┌─────────────────────────────────────────────────────────────────────┐
│ Region: us-east-1 (Primary) Region: eu-west-1 (Secondary) │
│ │
│ ┌─────────────────┐ ┌─────────────────────────────────────┐ │
│ │ Redirect Service│ │ Write Service │ │
│ │ (stateless) │ │ (ShortLink creation, status update) │ │
│ │ 50 instances │ │ 10 instances │ │
│ │ Auto-scales │ │ │ │
│ └────────┬────────┘ └──────────────────────┬──────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────────┐ ┌─────────────────────────────────────┐ │
│ │ Redis Cluster │ │ PostgreSQL (Primary + 2 replicas) │ │
│ │ 6 shards │ │ Primary: writes │ │
│ │ 3 replicas each│ │ Replicas: dashboard reads │ │
│ └────────┬────────┘ └──────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ ┌─────────────────────────────────────┐ │
│ │ Kafka Cluster │──►│ Analytics Workers │ │
│ │ 12 brokers │ │ (Kafka → ClickHouse pipeline) │ │
│ │ click_events │ │ 5 consumer instances │ │
│ │ (slug-partitioned)│ └──────────────────────┬──────────────┘ │
│ └─────────────────┘ ▼ │
│ ┌─────────────────────────────────────┐ │
│ │ ClickHouse Cluster │ │
│ │ 3 shards, 2 replicas each │ │
│ └─────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
Partition Topology #
| Component | Partition Strategy | Rationale |
|---|---|---|
| Postgres shortlinks | Single primary + read replicas (no sharding at initial scale) | 300M rows × 500 bytes = 150GB — fits one Postgres instance |
| Redis Cluster | 6 shards by slug (consistent hashing) | Each shard holds ~3M hot slugs; horizontal scale |
| Kafka click_events | 48 partitions, partition key = slug | Per-slug ordering; 48 consumers can process in parallel |
| ClickHouse | 3 shards, partition by month | Time-series queries partition-pruned by month |
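Slug-to-shard routing under Redis Cluster can be sketched with the hash-slot computation. Python's `binascii.crc_hqx` implements the XModem CRC-16 that Redis Cluster's key hashing uses; hash-tag handling (`{...}` in keys) and the real cluster's slot-to-node map are omitted here.

```python
import binascii

NUM_SLOTS = 16384  # Redis Cluster hash-slot count
SHARDS = 6

def slot_for(slug: str) -> int:
    """CRC16(key) mod 16384, as Redis Cluster computes the key's hash slot."""
    return binascii.crc_hqx(f"shortlink:{slug}".encode(), 0) % NUM_SLOTS

def shard_for(slug: str) -> int:
    # Illustrative layout: slots split into 6 contiguous ranges, one per shard.
    return slot_for(slug) * SHARDS // NUM_SLOTS

print(shard_for("AbCd3F"))  # deterministic: the same slug always routes to the same shard
```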
Failure Domains #
| Failure | Scope | Impact |
|---|---|---|
| Single redirect service instance fails | Instance | None: load balancer removes from pool within 5s |
| Redis shard failure | 1/6 of hot slugs | Cache miss surge for affected slugs; DB absorbs for ~30s until Redis replica promotes |
| Postgres primary failure | All writes | Failover to replica in ~30s; writes blocked during failover; reads continue from replica |
| Kafka broker failure | 1/12 of partitions | Other brokers take over; click events in-flight may be lost (analytics tolerance) |
| CDN PoP outage | Geographic region | Traffic fails over to other PoPs; latency increase for affected region |
| ClickHouse node failure | Analytics only | Redirects unaffected; analytics queries degrade until replica promotes |
Step 12 — Consistency Model #
| Path | Consistency Model | Reason |
|---|---|---|
| Short URL creation (write) | Linearizable (strong) | Postgres serializable transaction on unique slug index. Two concurrent requests for the same slug: exactly one commits. |
| Redirect lookup (cache hit) | Bounded stale (ε = 60s Redis TTL + 300s CDN) | Acceptable: a newly created URL may not be visible at CDN for up to 5 minutes. A disabled URL may still redirect for up to 60s. |
| Redirect lookup (cache miss, DB read) | Linearizable read | Postgres synchronous read from primary (or replica with synchronous_commit = remote_apply). Returns current state. |
| Status update visibility | Bounded stale (ε = 5s post-invalidation) | Redis DEL + CDN purge on status change. CDN purge propagates in ~1–3s. Redis DEL is synchronous. |
| Click event recording | Eventual | Kafka at-least-once; ClickHouse async ingestion. Event appears in aggregates within ε = 60s. |
| Click aggregate staleness | Eventual (ε = 60s) | Materialized view refreshes every 60 seconds. Acceptable for analytics dashboards. |
| Dashboard (user link list) | Eventually consistent (replica lag ε < 1s) | Dashboard reads from Postgres replica. Replica lag < 100ms under normal conditions. |
Reasoning for key choices:
The redirect lookup being bounded-stale is correct. The invariant I3 requires ACTIVE-only redirects, but the ε is set by operational requirements (not a hard safety guarantee). A 60-second window where a disabled URL still redirects is acceptable; a 60-second window where a newly created URL fails to redirect is acceptable. This is a product decision, not an architectural flaw.
The click event pipeline being eventual is explicitly sanctioned by invariant I6 (ε = 60s). Analytics is not in the critical path of any user action.
Step 13 — Scaling Model #
Scale Type Classification #
| Component | Scale Type | Primary Bottleneck |
|---|---|---|
| Redirect path | Read-heavy + hotspot-heavy | Viral slug concentrates millions of req/sec on single cache key |
| URL creation | Write-heavy (but low absolute volume) | Sequence generation is single-threaded in Postgres |
| Click event pipeline | Write-heavy, fanout-heavy | 10B events/month → Kafka producer throughput |
| Analytics aggregation | Aggregation-heavy | ClickHouse query scan over billions of rows |
Hotspot Keys #
| Hotspot | Location | Mechanism |
|---|---|---|
| Viral slug (e.g., Super Bowl ad link) | Redis key + CDN URL | Multi-layer cache; CDN absorbs 99%; local process cache absorbs most of remainder |
| Postgres `shortlinks_id_seq` | Postgres | Sequence generation is fast (<1μs per nextval); batching nextval in groups of 1000 reduces contention |
| Kafka slug partition | Partition for viral slug | If one slug gets 100K/sec, its Kafka partition is a bottleneck for the consumer (analytics, not redirect). Redirect does not use Kafka on the hot path. |
Scaling Strategies #
Redirect service (read-heavy):
- Horizontal scale: stateless Go/Rust service; 50 instances behind L7 load balancer.
- Process-local LRU absorbs viral slugs at CPU speed.
- CDN absorbs geo-distributed traffic before it reaches the origin.
- Redis Cluster scales read capacity with additional replicas per shard.
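A process-local LRU with TTL along these lines (a sketch, not the go-cache/Caffeine implementations named in Step 10):

```python
import time
from collections import OrderedDict

class LocalLRU:
    """Bounded LRU with a short TTL, matching the 10K-entry / 1-second
    configuration described above."""
    def __init__(self, max_entries=10_000, ttl_s=1.0):
        self.max_entries, self.ttl_s = max_entries, ttl_s
        self._d = OrderedDict()  # key -> (value, expires_at)

    def get(self, key):
        entry = self._d.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at < time.monotonic():
            del self._d[key]         # expired
            return None
        self._d.move_to_end(key)     # mark as most recently used
        return value

    def set(self, key, value):
        self._d[key] = (value, time.monotonic() + self.ttl_s)
        self._d.move_to_end(key)
        if len(self._d) > self.max_entries:
            self._d.popitem(last=False)  # evict least recently used

cache = LocalLRU(max_entries=2, ttl_s=60)
cache.set("a", 1); cache.set("b", 2); cache.get("a"); cache.set("c", 3)
print(cache.get("b"))  # None — 'b' was least recently used and got evicted
print(cache.get("a"))  # 1
```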
URL creation (low-volume write):
- Single Postgres primary is sufficient for 115 writes/sec.
- If write rate grows 100x: switch to per-region Postgres with UUID-based IDs (no cross-region sequence required). Custom slugs still require a global uniqueness check (cross-region coordination via a global shard or pessimistic reservation).
Click event ingestion (high-volume append):
- Kafka scales horizontally; 48 partitions across 12 brokers comfortably sustain hundreds of thousands of events/sec, with room to add partitions.
- At 10B/month (~3,800 events/sec average), capacity headroom exceeds two orders of magnitude.
- ClickHouse ingestion scales with more shards.
Analytics query serving (aggregation-heavy):
- Pre-aggregate with ClickHouse materialized views. Dashboard queries hit the aggregate, not raw events.
- Cache dashboard results in Redis with 60-second TTL.
- For enterprise customers with custom date ranges: ClickHouse ad-hoc query over raw event table.
Step 14 — Failure Model #
Failure Taxonomy #
| Failure Mode | Can It Happen? | Correct Behavior | Mechanism | Recovery |
|---|---|---|---|---|
| Duplicate slug creation (two concurrent requests for same custom slug) | Yes — two users simultaneously claim bit.ly/launch | One succeeds (HTTP 201), one fails (HTTP 409 SlugTaken) | Postgres unique index, ON CONFLICT DO NOTHING returns 0 rows | Client-side: show error to losing requester |
| Duplicate click event counted twice in analytics | Yes — Kafka at-least-once delivery; consumer crashes mid-batch | Exactly one count per physical click | ReplacingMergeTree + consumer dedup on event_id | Self-healing: ClickHouse dedup occurs at merge time |
| Cache miss storm on viral URL (CDN + Redis TTL expire simultaneously) | Yes — TTL expiry is a sharp boundary | DB absorbs spike; does not crash | Request coalescing via Redis SETNX lock; CDN staggered TTL (CDN cache TTL > Redis TTL) | Redis TTL = 60s; CDN TTL = 300s; they expire at different times reducing simultaneity |
| Redirect service returns expired URL as ACTIVE | Yes — cache entry is stale during ε window | Client receives 302 to a URL that should have returned 410 | Cache TTL bounds the window; expiry scheduler flips status; Redis invalidation on status change | Self-healing within ε = 60s; explicit invalidation on EXPIRED transition |
| Postgres primary failure (crash) | Yes | Writes blocked; reads continue from replica | Postgres streaming replication + automatic failover via Patroni; PgBouncer re-points application connections | Replica promotes in ~30s; write path unavailable during promotion |
| Kafka broker failure | Yes | Click events produced to that broker’s partition are buffered or dropped | Kafka replication factor 3; producer retries for ~5s | Surviving brokers take over within seconds; events in-flight during failure may be lost (analytics tolerance) |
| Analytics consumer crashes mid-batch | Yes — consumer commits Kafka offset after writing to ClickHouse | Events in uncommitted batch are re-read and re-processed | Kafka consumer group offset management; ClickHouse dedup on event_id | Auto-restart consumer; dedup prevents double-counting |
| GeoIP lookup failure | Yes — mmdb file becomes unavailable | Click event recorded with empty country field | Fail-open: emit event with country = ''; do not block redirect | GeoIP is optional metadata; blank is acceptable |
| Redis cluster split-brain during network partition | Yes | Reads may return stale data; writes to both sides of partition | Redis Cluster majority-quorum for writes; minority nodes reject writes | Post-partition: minority side cache invalidated; repopulates from Postgres |
| Idempotency key collision (different clients generate same key) | Astronomically unlikely with UUID4 | Second client’s request is treated as a retry of the first | UUID4 has 122 random bits; birthday-bound collision probability across 1B keys ≈ 10^-19 | Not a practical concern |
| Short URL target is a malicious or phishing URL | Always possible | Redirect proceeds (Bit.ly is not a content filter by default) | Rate-limit creation per user; optional URL scanning integration (VirusTotal API async) | Flag URL; admin can DISABLE the ShortLink |
Step 15 — SLOs #
Redirect Path (Hot Path) #
| Metric | Target | Measurement Method |
|---|---|---|
| P95 redirect latency (CDN hit) | < 5ms | CDN edge timing headers |
| P99 redirect latency (CDN hit) | < 15ms | CDN edge timing headers |
| P95 redirect latency (Redis hit, CDN miss) | < 10ms | Server-side histogram in redirect service |
| P99 redirect latency (Redis hit, CDN miss) | < 25ms | Server-side histogram |
| P95 redirect latency (DB hit, cache miss) | < 100ms | Server-side histogram |
| P99 redirect latency (DB hit, cache miss) | < 200ms | Server-side histogram |
| Redirect availability | > 99.99% (52 min/year downtime) | Synthetic probes every 10s from 5 regions |
| Redirect correctness (correct 302 target) | > 99.999% | Automated canary: create URL, follow redirect, verify target |
A cache miss is a latency failure, not a correctness failure. P99 of 25ms for Redis-hit cases is the operational target, not 200ms.
Write Path (Creation) #
| Metric | Target |
|---|---|
| P95 URL creation latency | < 200ms |
| P99 URL creation latency | < 500ms |
| Slug uniqueness correctness | 100% — no two shortlinks may share a slug |
| Idempotency correctness | 100% — same idempotency key returns same result |
| Creation availability | > 99.9% (8.7 hours/year downtime) |
Analytics Path #
| Metric | Target |
|---|---|
| Click count staleness | < 60 seconds for 99th percentile |
| Dashboard load latency (P95) | < 2 seconds |
| Click count accuracy | > 99.9% of actual clicks counted (0.1% loss acceptable for analytics) |
| Analytics availability | > 99.5% (43 hours/year downtime — analytics is non-critical) |
Throughput #
| Metric | Sustained | Peak (10x) |
|---|---|---|
| Redirects | 4,000 req/sec | 40,000 req/sec |
| URL creations | 120 req/sec | 1,200 req/sec |
| Click event ingestion | 4,000 events/sec | 40,000 events/sec |
Step 16 — Operational Parameters #
Every tunable lever with its range and effect.
| Parameter | Location | Default | Range | Effect if Increased | Effect if Decreased |
|---|---|---|---|---|---|
| `redis_ttl_seconds` | Redirect service config | 60 | 10–3600 | Fewer DB reads; more stale data served | More DB reads; fresher data |
| `cdn_max_age_seconds` | CDN cache rule | 300 | 30–3600 | Fewer origin hits; more stale data at edge | More origin hits; fresher edge data |
| `local_cache_size_entries` | Redirect service | 10,000 | 1K–100K | More memory per pod; fewer Redis hits | Less memory; more Redis hits |
| `local_cache_ttl_seconds` | Redirect service | 1 | 0.1–10 | Longer stale window for viral slugs; fewer Redis hits | More Redis hits; fresher data |
| `kafka_producer_acks` | Click event producer | 1 (leader ack) | 0, 1, all | `all`: durability, more latency; `0`: fire-and-forget, possible loss | Lower acks = lower latency, more data loss risk |
| `kafka_consumer_batch_size` | Analytics consumer | 10,000 | 1K–100K | Larger ClickHouse inserts (more efficient); higher lag | More frequent small inserts; lower lag |
| `clickhouse_merge_interval` | ClickHouse config | 600s | 60–3600s | Less frequent dedup merges; more storage used; faster ingest | More frequent merges; more CPU; faster dedup |
| `postgres_max_connections` | Postgres | 200 | 50–1000 | More concurrent queries; more memory per connection | Connection starvation under load |
| `idempotency_key_retention_days` | Cron job | 7 | 1–30 | More storage; longer idempotency window | Less storage; shorter idempotency window |
| `expiry_scheduler_interval_seconds` | Expiry worker | 60 | 10–300 | Less frequent expiry; expired links may redirect briefly past deadline | More frequent sweeps; lower lag but more DB load |
| `redis_coalesce_lock_ttl_ms` | Redirect service | 1000 | 100–5000 | More waiting during coalesce; prevents more thundering herd | Faster fallback to DB; less coalescing benefit |
| `redis_cluster_read_replicas_per_shard` | Redis Cluster | 2 | 1–5 | More read capacity per shard; more memory | Less read capacity |
Step 17 — Runbooks #
Runbook R1: Viral URL Cache Miss Storm #
Trigger: redirect_db_hit_rate > 5% for more than 2 minutes. (Normally < 0.1%.)
Diagnosis:
- Check the `redis_keyspace_miss_rate` metric — high indicates cache pressure.
- Check the `top_slugs_by_request_rate` dashboard — identify the viral slug.
- Check Redis memory usage — if near 100%, eviction is happening.

Mitigations (in order):
- Immediate: Force Redis re-population for the viral slug: `redis-cli SET shortlink:{slug} {value} EX 3600`. This sets a 1-hour TTL, giving time to address the root cause.
- If Redis memory full: Raise maxmemory on the affected shard (replicas add read capacity, not cache space).
- If DB is overwhelmed: Enable read connection pooling on PgBouncer; scale out DB read replicas.
- Long-term: Increase `redis_ttl_seconds` for slugs with `request_rate > 10,000/min`.
Recovery signal: redirect_db_hit_rate returns below 0.5%.
Runbook R2: Postgres Primary Failover #
Trigger: PagerDuty alert postgres_primary_unreachable.
Immediate action:
- Do NOT manually intervene for 60 seconds — Patroni automatic failover is running.
- Verify Patroni status: `patronictl -c /etc/patroni/config.yml list`
- If automatic failover completes: verify the replica is now primary and application connections have reconnected via PgBouncer.
- If failover has not completed after 90 seconds: manually promote the replica: `patronictl failover bitly-postgres --master {old_primary} --candidate {replica}`
During failover (30–90 seconds):
- Redirect traffic: unaffected (Redis serving most requests).
- URL creation: fails with 503. Client should retry with idempotency key.
- Dashboard reads: may fail or return stale data.
Post-failover:
- Verify the new primary accepts writes: `INSERT INTO shortlinks ... ON CONFLICT DO NOTHING`
- Verify replication lag on the new replica: `SELECT now() - pg_last_xact_replay_timestamp()`
- Monitor for Redis cache miss increase (new primary may be slower initially).
- File incident report and investigate old primary.
Runbook R3: Analytics Lag Spike #
Trigger: click_aggregate_lag_seconds > 120 for more than 5 minutes.
Diagnosis:
- Check Kafka consumer lag: `kafka-consumer-groups.sh --describe --group analytics-consumer`
- Check ClickHouse insert queue depth.
- Check analytics worker CPU and memory.
Mitigations:
- If Kafka lag growing: Scale out analytics consumer instances (add 2–3 more).
- If ClickHouse insert slow: Reduce `kafka_consumer_batch_size` (smaller batches insert faster under write pressure).
- If analytics worker OOM: Increase pod memory limit.
- If ClickHouse node down: Check ClickHouse replica status; failover to replica.
Recovery signal: click_aggregate_lag_seconds < 60 for 10 minutes.
User impact: Analytics dashboards show counts up to lag seconds stale. No impact on redirects.
Runbook R4: Slug Uniqueness Violation (Invariant Breach) #
Trigger: duplicate_slug_count_in_postgres > 0 (this should never fire; it indicates a bug).
Immediate actions:
- Freeze all write traffic to the creation service.
- Run: `SELECT slug, count(*) FROM shortlinks GROUP BY slug HAVING count(*) > 1`
- For each duplicate slug: determine which is the legitimate record (lowest id = created first).
- Rename the duplicate to a new auto-generated slug.
- Notify affected user of the slug change.
Root cause investigation: This invariant cannot be violated if the unique index exists. Check: \d shortlinks to verify CONSTRAINT uq_slug UNIQUE (slug) is present. If missing, re-add immediately:
CREATE UNIQUE INDEX CONCURRENTLY uq_slug ON shortlinks(slug);
Step 18 — Observability #
Metrics #
| Metric | Component | Type | Alert Threshold |
|---|---|---|---|
| `redirect_latency_p95_ms` | Redirect service | Histogram | > 25ms for > 2 min |
| `redirect_latency_p99_ms` | Redirect service | Histogram | > 100ms for > 2 min |
| `redirect_cache_hit_rate` | Redirect service | Gauge | < 95% for > 5 min |
| `redirect_db_hit_rate` | Redirect service | Gauge | > 5% for > 2 min |
| `redis_memory_used_bytes` | Redis Cluster | Gauge | > 85% of maxmemory |
| `redis_keyspace_misses_per_sec` | Redis Cluster | Counter | > 1000/sec |
| `postgres_replication_lag_seconds` | Postgres replicas | Gauge | > 10s |
| `postgres_connections_active` | Postgres primary | Gauge | > 180 (of 200) |
| `kafka_consumer_lag_records` | Analytics consumer | Gauge | > 100,000 records |
| `click_events_produced_per_sec` | Redirect service | Counter | Alert on sharp drop: < 50% of 5-min avg |
| `clickhouse_insert_errors_per_min` | Analytics worker | Counter | > 10 |
| `short_url_creation_rate_per_sec` | Write service | Counter | Alert on > 10x normal (abuse detection) |
| `slug_collision_rate` | Write service | Counter | > 0 for custom slugs (expected 0 on success path) |
| `http_5xx_rate` | All services | Counter | > 1% for > 1 min |
| `cdn_origin_hit_rate` | CDN | Gauge | > 10% of total CDN traffic |
Distributed Traces #
Every redirect request carries a trace_id header (OpenTelemetry W3C TraceContext). Spans emitted:
| Span | Component | Key Attributes |
|---|---|---|
| `redirect.handle` | Redirect service | slug, cache_layer (local/redis/db), status_code |
| `redis.get` | Redis client | key, hit (bool), latency_ms |
| `postgres.select` | DB client | table=shortlinks, rows_returned, latency_ms |
| `kafka.produce` | Kafka client | topic=click_events, partition, latency_ms |
| `geoip.lookup` | GeoIP module | country, latency_us |
Structured Logs (Sampled) #
{
"ts": "2026-04-01T12:00:01.234Z",
"level": "info",
"event": "redirect",
"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
"slug": "AbCd3F",
"cache_layer": "redis",
"status_code": 302,
"latency_ms": 3.2,
"country": "US",
"referrer_domain": "twitter.com"
}
Log sampling: 1% of redirects for hot slugs; 100% for cache misses and errors.
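One way to implement the sampling rule is hash-based, so a sampled trace keeps all of its logs together; this deterministic scheme is an assumption — the document does not specify the mechanism.

```python
import zlib

def should_log(trace_id: str, cache_layer: str, status_code: int) -> bool:
    """Sampling rule sketch: always log errors and DB-layer (cache-miss)
    requests; otherwise keep a deterministic ~1% based on the trace id."""
    if status_code >= 500 or cache_layer == "db":
        return True
    return zlib.crc32(trace_id.encode()) % 100 == 0  # ~1% of hot-path redirects

print(should_log("4bf92f35", "db", 302))     # True — cache miss, always logged
print(should_log("4bf92f35", "redis", 500))  # True — error, always logged
```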
Dashboards #
- Redirect health dashboard: P50/P95/P99 latency, cache hit rates by layer, error rate, top 10 slugs by request rate.
- Analytics pipeline dashboard: Kafka consumer lag, ClickHouse insert rate, aggregate staleness.
- Write path dashboard: URL creation rate, custom slug collision rate, Postgres write latency.
- Infrastructure dashboard: Redis memory, Postgres replication lag, Kafka broker health.
Step 19 — Cost Model #
Growth Drivers per Component #
| Component | Primary Cost Driver | Unit Cost | Projected Monthly Cost at Scale |
|---|---|---|---|
| Postgres (primary + 2 replicas) | Storage (150GB + growth) + compute for IOPS | ~$500/node | ~$1,500/month (3 nodes) |
| Redis Cluster (6 shards × 3 nodes) | Memory (20GB/shard for hot slug cache) | ~$300/node (r7g.large) | ~$5,400/month (18 nodes) |
| Redirect service (50 instances) | CPU (high req/sec, mostly in-memory work) | ~$100/instance (c6g.medium) | ~$5,000/month |
| Kafka (12 brokers + ZooKeeper) | Network throughput + storage (30-day retention) | ~$400/broker | ~$4,800/month |
| ClickHouse (3 shards × 2 nodes) | Storage (time-series event data) + compute for merges | ~$600/node | ~$3,600/month |
| Analytics workers (5 instances) | CPU (Kafka consume + ClickHouse insert) | ~$100/instance | ~$500/month |
| CDN (Cloudflare) | Bandwidth + requests | $0.01/GB + $1/M requests | ~$2,000/month at 10B redirects |
| Total | — | — | ~$22,800/month |
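The monthly total can be sanity-checked by summing the component lines from the table (figures in USD/month):

```python
# Component costs from the table above, USD/month.
costs = {
    "postgres": 1_500,
    "redis": 5_400,
    "redirect_service": 5_000,
    "kafka": 4_800,
    "clickhouse": 3_600,
    "analytics_workers": 500,
    "cdn": 2_000,
}
total = sum(costs.values())
print(total)                                # 22800 — matches the Total row
print(round(costs["redis"] / total * 100))  # 24 — Redis share of spend
```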
Cost Growth Model #
| Growth Event | Primary Cost Driver | Component That Scales |
|---|---|---|
| 10x redirect traffic | CDN bandwidth scales linearly | CDN cost grows 10x (~$20K) |
| 10x URL creation rate | Postgres write IOPS | May require sharding or larger instance (~$3K) |
| 10x click event volume | Kafka storage + ClickHouse storage | Storage-linear growth (~$10K) |
| 10x active slug count | Redis memory (200 bytes × 10x slugs) | More Redis nodes (~$15K) |
Dominant cost at current scale: Redis (24%) and redirect service compute (22%). CDN becomes dominant at 10x traffic.
Cost optimization levers:
- Increase CDN cache hit rate → reduces redirect service instance count.
- Use slug access frequency to evict cold slugs from Redis → reduce Redis tier cost.
- ClickHouse data tiering: move partitions older than 90 days to S3-backed cold storage.
Step 20 — Evolution #
Stage 1: MVP (0 → 1M short URLs, < 100 req/sec) #
Architecture: Single Postgres, single Redis, no Kafka.
Click analytics: Write directly to Postgres click_events table (synchronous write during redirect). Not viable at scale but acceptable at < 100 req/sec.
Upgrade signal: P99 redirect latency > 100ms (DB bottleneck), or Postgres write IOPS > 80% capacity.
Changes needed to advance:
- Introduce Kafka for async click event pipeline (decouple analytics from redirect path).
- Introduce Redis as a caching layer.
Stage 2: Growth (1M → 100M short URLs, 100 → 5,000 req/sec) #
Architecture: Postgres primary + replicas, Redis Cluster (2 shards), Kafka + ClickHouse analytics pipeline, CDN integration.
Click analytics: Async via Kafka → ClickHouse. Redirect path has zero analytics latency.
Upgrade signal: Redis memory > 80% capacity, Postgres replication lag > 1s under load, CDN origin hit rate > 20%.
Changes needed to advance:
- Expand Redis cluster to 6 shards.
- Introduce process-local cache in redirect service (reduces Redis load by 10x for viral slugs).
- Consider geo-distributed deployment for sub-50ms latency in non-US regions.
Stage 3: Scale (100M → 1B short URLs, 5,000 → 50,000 req/sec) #
Architecture: Multi-region active-active, Postgres with read replicas per region (writes to primary region only), Redis per-region, Kafka cross-region replication.
Custom slug challenge: Custom slug uniqueness across regions requires a global uniqueness layer. Options:
- Option A: Route all custom slug creation requests to a single “slug authority” region (single writer, globally consistent). Acceptable because slug creation is rare and not latency-critical.
- Option B: Use a global distributed key-value store (e.g., Google Spanner) as the slug uniqueness arbiter. Higher operational complexity.
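Option A can be sketched as a single-writer reservation path. A minimal sketch under stated assumptions: `reserve_in_authority` stands in for an RPC to the authority region's database (e.g. an INSERT guarded by a unique constraint), and the region names and in-memory set are illustrative:

```python
AUTHORITY_REGION = "us-east-1"  # illustrative choice of authority

_reserved: set[str] = set()  # stands in for the authority's DB state

def reserve_in_authority(slug: str) -> bool:
    """Single-writer reservation: exactly one region decides uniqueness,
    so two concurrent claims on the same slug cannot both succeed."""
    if slug in _reserved:
        return False
    _reserved.add(slug)
    return True

def create_custom_slug(slug: str, local_region: str) -> bool:
    # Non-authority regions pay one cross-region round-trip on the
    # (rare) creation path; redirects stay entirely region-local.
    if local_region != AUTHORITY_REGION:
        pass  # in production: forward this call as an RPC to the authority
    return reserve_in_authority(slug)
```

The design choice is the traffic asymmetry again: paying a cross-region round-trip on ~115 writes/sec is cheap; paying it on ~3,800 redirects/sec would not be.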
Upgrade signal: Single-region creation latency > 500ms (DB round-trip from non-US users), or Postgres max connections reached.
Changes needed to advance:
- Introduce global slug authority service.
- Consider Postgres sharding for shortlinks table (horizontal partition by slug hash range).
- Introduce auto-generated slug pre-allocation (batch fetch sequence ranges from the DB, generate slugs locally in the creation service).
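Slug pre-allocation can be sketched as a batch range fetch plus local base62 encoding. A minimal sketch; `fetch_range` stands in for a single DB round-trip that atomically advances a sequence by `batch` (e.g. `UPDATE ... RETURNING`):

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def to_base62(n: int) -> str:
    """Encode a non-negative integer as a base62 slug."""
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, rem = divmod(n, 62)
        out.append(ALPHABET[rem])
    return "".join(reversed(out))

class SlugAllocator:
    """Hands out slugs from a pre-fetched ID range, so the DB is hit
    once per `batch` creations instead of once per creation."""

    def __init__(self, fetch_range, batch: int = 1000):
        self.fetch_range = fetch_range  # returns the start of a reserved range
        self.batch = batch
        self.next_id = self.limit = 0

    def allocate(self) -> str:
        if self.next_id >= self.limit:
            self.next_id = self.fetch_range(self.batch)
            self.limit = self.next_id + self.batch
        slug = to_base62(self.next_id)
        self.next_id += 1
        return slug
```

Because each process owns a disjoint ID range, allocations need no coordination until the range is exhausted; a crashed process simply wastes the remainder of its range, which is harmless.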
Stage 4: Hyperscale (> 1B short URLs, > 50,000 req/sec) #
Architecture: Full multi-region active-active with CRDT-based slug reservation, ClickHouse sharded to 20+ nodes, Redis Federation, CDN custom logic (Cloudflare Workers) for edge-side slug validation.
Key architectural evolution:
- Redirect path fully served at CDN edge with Cloudflare Workers reading from Cloudflare KV (edge KV store, not Postgres). Postgres is the source of truth but is not in the hot path at all.
- Click events are captured at the CDN edge and forwarded to the analytics pipeline asynchronously, keeping the origin out of the redirect path.
Upgrade signal: Cross-region DB roundtrip > 200ms for slug uniqueness check, or CDN origin cost exceeds $50K/month.
Summary: Upgrade Decision Matrix #
| Signal | Upgrade Action |
|---|---|
| redirect_db_hit_rate > 5% | Expand Redis capacity; increase CDN TTL |
| postgres_write_iops > 80% | Add read replicas; introduce Kafka analytics pipeline |
| redis_memory_used > 80% | Add Redis shards; evict cold slugs (access-frequency TTL) |
| cdn_origin_hit_rate > 20% | Increase CDN TTL; audit cache invalidation frequency |
| creation_latency_p99 > 500ms | Shard Postgres; pre-allocate slug ranges |
| analytics_lag > 5min | Scale out Kafka consumers; increase ClickHouse shards |
| Single-region creation latency > 1s for global users | Multi-region active-active + global slug authority |
End of Bit.ly system design derivation. All 20 steps have produced explicit output artifacts. The design is derivable from the invariants; the invariants are derivable from the normalized requirements; the normalized requirements are derivable from the product requirements. No design choice is unmotivated.