Bit.ly: System Design #
Derived using the 20-step derivation framework. Every step produces an explicit output artifact. No steps are skipped or abbreviated.
Ordering Principle #
Product requirements
→ normalize into operations over state (Step 1)
→ extract primary objects (Step 2)
→ assign ownership, ordering, evolution (Step 3)
→ extract invariants (Step 4)
→ derive minimal DPs from invariants (Step 5)
→ select concrete mechanisms (Step 6)
→ validate independence and source-of-truth (Step 7)
→ specify exact algorithms (Step 8)
→ define logical data model (Step 9)
→ map to technology landscape (Step 10)
→ define deployment topology (Step 11)
→ classify consistency per path (Step 12)
→ identify scaling dimensions and hotspots (Step 13)
→ enumerate failure modes (Step 14)
→ define SLOs (Step 15)
→ define operational parameters (Step 16)
→ write runbooks (Step 17)
→ define observability (Step 18)
→ estimate costs (Step 19)
→ plan evolution (Step 20)
Context and Scale #
Bit.ly is a URL shortening service. The core proposition: long URLs become short codes (bit.ly/AbCd3F), which redirect to the original. Analytics are a premium feature.
Traffic asymmetry is the defining characteristic. Redirects outnumber creations by well over an order of magnitude — roughly 30:1 at the reference scale below, and far more skewed for popular links. Creating a short URL is the rare event. Following one is the core function. Any design that does not start from this asymmetry will fail at scale.
Reference scale:
- 300 million short URLs created per month → ~115 writes/sec
- 10 billion redirects per month → ~3,800 redirects/sec average, peaks 10–100x that
- P99 redirect latency budget: 10ms end-to-end (cache hit), 100ms (cold)
- Analytics: click events must not block the redirect path
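The per-second figures follow directly from the monthly totals; a quick arithmetic check (assuming a 30-day month):

```python
# Quick check of the reference-scale arithmetic (30-day month assumed).
SECONDS_PER_MONTH = 30 * 24 * 3600               # 2,592,000

creates_per_sec = 300e6 / SECONDS_PER_MONTH      # ~116 writes/sec
redirects_per_sec = 10e9 / SECONDS_PER_MONTH     # ~3,858 redirects/sec average
read_write_ratio = redirects_per_sec / creates_per_sec

print(f"{creates_per_sec:.0f} creates/sec, "
      f"{redirects_per_sec:.0f} redirects/sec, "
      f"{read_write_ratio:.0f}:1 read:write")
```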
Step 1 — Problem Normalization #
Goal: Rewrite each functional requirement as (actor, operation, state).
| Original Requirement | Actor | Operation | State Touched |
|---|---|---|---|
| User creates a short URL from a long URL | User / API client | Create record (append immutable mapping) | ShortLink record |
| User follows a short URL and is redirected | HTTP client (browser/bot) | Read redirect target; append click event | ShortLink.target_url (read); ClickEvent (append) |
| User sees click analytics (count, geo, referrer) | Authenticated user | Read derived aggregate | ClickAggregate (derived from ClickEvents) |
| User claims a custom slug | Authenticated user | Conditional insert (claim unique name) | ShortLink.slug (uniqueness domain) |
| Short URL expires after configured TTL | Scheduler / system | Transition state of ShortLink to EXPIRED | ShortLink.status |
| Admin disables / deletes a short URL | Admin actor | Overwrite ShortLink.status to DISABLED | ShortLink.status |
| User views their dashboard of short URLs | Authenticated user | Read projection over ShortLink records owned by user | UserLinkIndex (derived) |
Key observations from normalization:
- The redirect is a read of ShortLink.target_url followed by an append of a ClickEvent. It is not a write to any mutable state. This is critical for scaling: redirects are pure reads.
- ClickAggregate is not state — it is a derived view computed from ClickEvents. Treating it as primary state would create a contention-heavy write path on every redirect.
- Custom slug creation is a conditional insert — the operation must fail if the slug is already taken. This is an atomic uniqueness check, not a simple insert.
- URL expiry is a state machine transition from ACTIVE to EXPIRED. It can be driven by a TTL check on read (lazy expiry) or a scheduler (eager expiry). These have different tradeoffs.
Step 2 — Object Extraction #
Goal: Identify primary objects, classify each, and apply the four purity tests.
Primary Objects #
1. ShortLink #
Class: Stable entity (long-lived record with identity)
Description: The canonical mapping from a short code (slug) to a long URL. Created once, read billions of times.
Fields: slug (PK), target_url, owner_user_id, created_at, expires_at (nullable), status (ACTIVE | EXPIRED | DISABLED), custom_domain (nullable), alias (human-readable name for dashboard)
Purity tests:
- Ownership: Single writer after creation (owner and admin only). Slug assignment happens once at creation.
- Evolution: State machine (ACTIVE → EXPIRED, ACTIVE → DISABLED). Target URL is immutable after creation (append-only by convention; overwriting would break caches).
- Ordering: Total order within slug domain (each slug has exactly one owner; slug assignment is atomic and final).
- Derivability: Not derivable from any other object. ShortLink IS the source of truth for the redirect target.
Verdict: Primary object. Not derivable. Cannot be merged.
2. ClickEvent #
Class: Immutable event (append-only fact)
Description: A single redirect event — one click on one short URL. Contains the slug, timestamp, geo (country/city inferred from IP), referrer header, user-agent.
Fields: event_id (UUID), slug, occurred_at, country, city, referrer, user_agent, ip_hash (privacy-preserving)
Purity tests:
- Ownership: The redirect service is the sole writer. The user initiating the redirect does not own this record.
- Evolution: Append-only. A click that happened cannot un-happen. Immutable.
- Ordering: Total order by occurred_at within slug (for analytics queries). Partial order across slugs (no cross-slug ordering needed).
- Derivability: Not derivable. ClickEvents are the source facts. All analytics are derived from them.
Verdict: Primary object. Event class. Source of truth for all click analytics.
3. ClickAggregate #
Class: Derived view (computed from ClickEvents)
Description: Pre-computed counts and breakdowns (by day, country, referrer) for a slug. Serves dashboard queries without scanning billions of raw events.
Purity tests:
- Ownership: Written by the analytics pipeline, not by any user action.
- Evolution: Merge-friendly (counts are commutative: adding a new click increments a counter).
- Ordering: No meaningful order on the aggregate itself.
- Derivability: Fully derivable from ClickEvents by
SELECT slug, date_trunc('day', occurred_at), country, count(*) FROM ClickEvent GROUP BY .... Can be rebuilt from scratch.
Verdict: NOT a primary object. Derived view. Must not be treated as source of truth.
4. User #
Class: Stable entity
Description: Account that owns ShortLinks. Relevant for authentication, ownership enforcement, and dashboard queries.
Fields: user_id, email, plan (FREE | PRO | ENTERPRISE), created_at
Verdict: Primary object. Standard account entity. Not special to the core redirect path.
5. SlugNamespace #
Class: Relationship / uniqueness domain
Description: The global set of slug strings that are already claimed. Not a table per se, but the uniqueness invariant enforced across all ShortLink records.
Purity tests:
- Derivability: Derivable from ShortLink records (the set of all slugs in use). However, the uniqueness invariant must be enforced atomically at write time, not derived. The namespace IS the constraint surface.
Verdict: Not a standalone object. The uniqueness constraint belongs on ShortLink.slug as a unique index. The enforcement mechanism is a DP, not an object.
6. UserLinkIndex #
Class: Derived view (projection)
Description: The list of ShortLinks belonging to a user, ordered by created_at descending. Powers the user dashboard.
Derivability: Fully derivable from ShortLink records filtered by owner_user_id. Rebuild: SELECT * FROM ShortLink WHERE owner_user_id = ? ORDER BY created_at DESC.
Verdict: Derived view. Not a primary object. A secondary index on ShortLink.owner_user_id suffices.
Object Summary Table #
| Object | Class | Primary? | Source of Truth For |
|---|---|---|---|
| ShortLink | Stable entity | Yes | Redirect target, ownership, status |
| ClickEvent | Immutable event | Yes | All click analytics |
| ClickAggregate | Derived view | No | Nothing — projection only |
| User | Stable entity | Yes | Authentication, plan |
| SlugNamespace | Uniqueness constraint | No (constraint on ShortLink) | Enforced by unique index |
| UserLinkIndex | Derived view | No | Nothing — secondary index |
Step 3 — Axis Assignment #
Goal: For each primary object, assign ownership (who writes?), evolution (append/overwrite/state-machine/merge), and ordering (total/partial/none, bound to scope).
ShortLink #
Ownership: Multi-writer at creation (any authenticated user may create);
single writer after creation (owner or admin modifies status).
Slug assignment: system assigns auto-increment ID → Base62, or
user claims custom slug (one winner per slug — CAS semantic).
Evolution: State machine.
Valid transitions:
(none) → ACTIVE [at creation]
ACTIVE → EXPIRED [by scheduler when expires_at < now()]
ACTIVE → DISABLED [by admin or owner]
DISABLED → ACTIVE [by owner, premium only]
target_url is immutable after creation (no valid transition modifies it).
Ordering: Total order within slug (each slug is assigned once, atomically).
Causal lifecycle order for status transitions (ACTIVE must precede EXPIRED).
No cross-slug ordering needed.
ClickEvent #
Ownership: Single writer (redirect service instance that handled the request).
Multiple redirect service instances may write concurrently —
but each individual event has a single, unambiguous writer.
Evolution: Append-only. Events are immutable facts.
Ordering: Total order by occurred_at within slug (for per-slug analytics).
Approximate total order (clock skew across redirect nodes < 1ms
acceptable for analytics — not a strict invariant).
No ordering required across slugs.
User #
Ownership: Single writer (the user themselves for profile; system for plan changes).
Evolution: Overwrite for mutable fields (email, plan). Append-only for audit log.
Ordering: No meaningful order among users.
Step 4 — Invariant Extraction #
Goal: Derive precise, testable, concurrency-aware invariants from the normalized requirements.
Invariant List #
I1. [Uniqueness] Slug uniqueness across ShortLinks. For any two distinct ShortLink records L1 and L2, L1.slug ≠ L2.slug. This holds at all times, including under concurrent creation requests.
I2. [Uniqueness / Idempotency] Duplicate creation requests produce the same ShortLink. If the same client submits the same creation request N times (same idempotency key), exactly one ShortLink is created. Subsequent submissions return the previously created record, not an error and not a second record.
I3. [Eligibility] Redirect returns the target URL only for ACTIVE ShortLinks. A redirect request for slug S returns HTTP 302 only if ShortLink(S).status == ACTIVE AND (expires_at IS NULL OR expires_at > now()). For EXPIRED or DISABLED slugs, the correct response is HTTP 410.
I4. [Eligibility] Custom slug creation succeeds only if the slug is unclaimed. A custom slug claim for slug S succeeds only if no ShortLink with slug == S currently exists. If another actor claims the same slug concurrently, exactly one claim succeeds.
I5. [Ordering] Status transitions follow the valid state machine. ShortLink.status may only transition via valid edges. Specifically: DISABLED → EXPIRED is forbidden. EXPIRED → DISABLED is forbidden without explicit re-activation first. Any transition that is not an edge in the defined state machine must be rejected.
I6. [Accounting] ClickAggregate(slug, window) = count(ClickEvents where slug = S and occurred_at in window). The aggregate click count for any slug and time window exactly equals the number of ClickEvent records in that window. The aggregate may be stale by at most ε = 60 seconds under normal operation.
I7. [Propagation] A newly created or status-changed ShortLink is reflected in the redirect path within ε. After a ShortLink is created or its status changes, the redirect service must return the updated result within ε_redirect = 5 seconds. (Cache TTL bound.)
I8. [Uniqueness] Each ClickEvent is processed exactly once in analytics aggregation. A ClickEvent that is written to the event stream is counted exactly once in ClickAggregate. Duplicate deliveries (from retry or at-least-once delivery) must be deduplicated before incrementing the aggregate.
I9. [Access-control] Only the ShortLink owner or admin may modify ShortLink.status or ShortLink.alias. Reads of target_url during redirect are unauthenticated. Writes to any ShortLink field require proof of ownership (owner_user_id match) or admin role.
I10. [Accounting] Auto-generated slugs are globally unique without retry. The slug generation process for auto-generated (non-custom) slugs must produce a collision-free slug deterministically, without requiring optimistic retry. (Base62 of auto-incremented ID satisfies this; random generation does not without collision checking.)
Step 5 — DP Derivation #
Goal: Identify the minimal enforcing mechanism (design parameter) per invariant cluster.
A DP is not a technology name. It is the minimal runtime capability required to make the invariant cluster enforceable.
| Invariant Cluster | DP | Reasoning |
|---|---|---|
| I1, I4 — Slug uniqueness (system-assigned and custom) | Atomic conditional insert on slug as unique key | Only one mechanism can guarantee that exactly one writer wins: an atomic insert that fails if the key already exists. No application-level check-then-insert is sufficient (race condition). |
| I2 — Idempotent creation | Idempotency key store | A dedup table (idempotency_key → shortlink_id) with conditional insert. First write creates; subsequent writes with same key return cached result. |
| I3 — Redirect eligibility check | Low-latency key-value lookup with TTL-aware cache | The redirect path reads ShortLink by slug. Must be sub-millisecond for cache hit. Requires a cache that can serve millions of reads/sec. |
| I5 — State machine enforcement | CAS (compare-and-swap) on (status, version) | A status transition is valid only from a specific source state. Concurrent transitions to conflicting states must fail. CAS on (current_status, version) enforces this atomically. |
| I6, I8 — Click counting with exactly-once semantics | Append-only event log + idempotent consumer | Click events are appended to a durable log. Consumers read the log and aggregate. Dedup on event_id ensures exactly-once processing despite at-least-once delivery. |
| I7 — Redirect cache freshness | TTL-bounded cache with invalidation | Cache entries for ShortLink records expire after ε_redirect = 5 seconds. On status change, explicit cache invalidation reduces lag to near-zero for known-changed keys. |
| I9 — Access control | Token-gated write path | Every write request carries an authenticated session token. Service layer checks owner_user_id == authenticated_user_id before executing any mutation. |
| I10 — Collision-free auto-slug | Monotonic global counter → Base62 encoding | A single globally ordered counter (database sequence or distributed counter) guarantees uniqueness without collision probability. Base62 encoding produces short codes. |
Step 6 — Mechanism Selection #
Goal: The mechanical bridge from DPs to concrete mechanisms. Apply the full discrimination procedure for the three key paths.
6.1 DP Classification by Invariant Type #
| DP | Invariant Type | Mechanism Family |
|---|---|---|
| Atomic conditional insert (slug uniqueness) | Uniqueness | Locking / CAS family |
| Idempotency key store | Uniqueness / Idempotency | Conditional insert / dedup table |
| Low-latency KV lookup | Propagation | Cache family |
| CAS on status transitions | Eligibility + Ordering | CAS / optimistic locking family |
| Append-only event log | Accounting | Log / stream family |
| Idempotent consumer | Accounting | Dedup + aggregation family |
| TTL-bounded cache | Propagation | Cache family |
| Token-gated write path | Access-control | Auth middleware family |
| Monotonic counter + Base62 | Uniqueness | Sequential assignment family |
6.2 Ownership × Evolution Table #
| Object | Ownership | Evolution | Table Result |
|---|---|---|---|
| ShortLink (creation) | Multi-writer, one winner per slug | State machine | → CAS on (slug, version) |
| ShortLink (status update) | Single writer (owner/admin) | State machine | → CAS on (status, version) to prevent concurrent conflicting transitions |
| ClickEvent | Multi-writer (many redirect nodes), all succeed | Append-only | → No CAS needed; append to partitioned log |
| ClickAggregate | Single writer (analytics consumer) | Merge (commutative count) | → Idempotent increment with dedup (CRDT G-Counter semantics) |
6.3 Detailed Mechanical Derivation #
Derivation A: Slug Uniqueness Enforcement (Invariant I1, I4, I10) #
Step 6.3.1 — Q1 (Scope): The uniqueness constraint is within one service (the ShortLink creation service). It is not cross-region partitioned (a slug must be globally unique, not per-region unique). Therefore scope = within service → distributed CAS.
But “distributed CAS” requires a serialization point. Two options:
- Option A: Database unique index (the DB serializes inserts on the unique key). The database becomes the arbiter.
- Option B: Distributed lock (acquire lock on slug string before inserting). Adds network roundtrip and failure surface.
Option A (DB unique index) is strictly superior for slug uniqueness because:
- The slug IS the database key. No separate lock namespace.
- The insert IS the CAS. If it fails (duplicate key error), the caller knows exactly one other writer won.
- No lock timeout or holder crash to handle.
Step 6.3.2 — Q2 (Failure): Crash of the writer during creation: the row either committed or did not. No partial state. Network partition: the creation request fails; the client retries with an idempotency key.
Q2 → Idempotency Key required. Crash of the creation service after the DB insert but before returning to the client means the client retries. Without an idempotency key, retry would hit the unique index (duplicate key error) and the client would incorrectly report failure. With an idempotency key:
- First attempt: INSERT INTO idempotency_keys (key, shortlink_id) first, then INSERT ShortLink.
- Retry: SELECT from idempotency_keys returns existing shortlink_id → return success.
Step 6.3.3 — Q3 (Data):
The slug for auto-generated links is computed as Base62(auto_increment_id). The auto-increment is the source of truth for ordering and uniqueness. Base62 encoding is deterministic and bijective over the integer domain used. No collision probability.
For custom slugs: the slug string is user-supplied. The unique index is the collision mechanism.
Step 6.3.4 — Required combination: CAS (unique index INSERT) always requires an Idempotency Key. Applied.
Final mechanism for slug uniqueness:
- Database sequence generates a monotonically increasing link_id; slug = Base62(link_id) — deterministic, no collision, no retry needed.
- For custom slugs: INSERT INTO shortlinks (slug, ...) ON CONFLICT (slug) DO NOTHING — returns affected rows; if 0, the slug was taken.
- Idempotency key table: idempotency_keys(key TEXT PK, shortlink_id BIGINT, created_at TIMESTAMP).
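The Base62 step can be sketched in a few lines, using the alphabet defined in Step 8 (0–9, a–z, A–Z); the decode helper and the capacity comment are illustrative additions, not part of the design:

```python
# Alphabet matching Step 8: 0-9 -> '0'-'9', 10-35 -> 'a'-'z', 36-61 -> 'A'-'Z'.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def base62_encode(n: int) -> str:
    """Bijective over non-negative integers, so distinct IDs cannot collide.
    Six characters cover 62**6 = 56,800,235,584 IDs."""
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n > 0:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))

def base62_decode(s: str) -> int:
    """Inverse mapping (illustrative helper; not required by the design)."""
    n = 0
    for ch in s:
        n = n * 62 + ALPHABET.index(ch)
    return n
```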
Derivation B: Click Event Pipeline (Invariants I6, I8) #
Step 6.3.1 — Q1 (Scope): Click events are written by many redirect service instances across many servers. This is multi-writer, append-only. The destination is a durable log. Scope = cross-service (redirect service → analytics service). This eliminates in-transaction handling. The mechanism family is async guaranteed delivery.
Step 6.3.2 — Q5 (Coupling): The redirect path must not block on analytics writes (10ms latency budget). Therefore coupling is async, guaranteed delivery → Outbox pattern or direct append to a durable log.
Outbox pattern (write to DB, relay to stream) is appropriate when the event must be durably captured in the same transaction as the primary state change. But for click events:
- The primary state change IS the click (redirect happened). There is no transactional coupling to a DB row.
- The redirect service already responded 302 to the client.
- Therefore: direct append to a durable partitioned log (e.g., Kafka) is correct. No outbox needed.
Write-ahead logging (CDC) would require a DB write on every redirect — that DB write IS the bottleneck we are trying to avoid.
Step 6.3.3 — Q2 (Failure): Crash of redirect node after responding 302 but before appending to log: the click is lost. Acceptable — analytics is eventually consistent (I6 allows ε = 60 seconds of staleness; losing a small fraction of clicks is an acceptable analytics approximation at this scale). If tighter capture is required, the redirect node can issue the Kafka append before sending the 302 rather than after, shrinking — but not eliminating — the window of loss, at the cost of added redirect latency.
Duplicate delivery from Kafka at-least-once: I8 requires exactly-once counting. Mechanism: each ClickEvent has an event_id = UUID. The analytics consumer maintains a dedup window: processed_event_ids (Bloom filter or Redis SET with TTL). Before incrementing ClickAggregate, check that event_id is not in the dedup set.
For ClickHouse (columnar OLAP): use ReplacingMergeTree or AggregatingMergeTree with event_id dedup key to handle at-least-once delivery natively.
Step 6.3.4 — Q3 (Data): Click counts are commutative and associative (count += 1). This is G-Counter CRDT semantics. Each analytics consumer shard can independently accumulate partial counts and merge. No serialization point needed for counting.
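The G-Counter semantics can be made concrete; a minimal in-memory sketch (illustrative only, not a production CRDT library):

```python
class GCounter:
    """Grow-only counter CRDT: each writer shard increments only its own slot;
    merge is element-wise max, so merging is commutative, associative, and
    idempotent. No serialization point is needed to combine partial counts."""

    def __init__(self):
        self.slots = {}  # shard_id -> count

    def increment(self, shard_id, n=1):
        self.slots[shard_id] = self.slots.get(shard_id, 0) + n

    def merge(self, other):
        merged = GCounter()
        for k in set(self.slots) | set(other.slots):
            merged.slots[k] = max(self.slots.get(k, 0), other.slots.get(k, 0))
        return merged

    @property
    def value(self):
        return sum(self.slots.values())
```

Two analytics consumers each accumulate their own slot; merging their states (in any order, any number of times) yields the same total.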
Step 6.3.5 — Q4 (Access): Analytics is read » write for the aggregation layer (10B events/month, queries from dashboards). The aggregate is a pre-computed materialized view. Pattern: CQRS — ClickEvent log is the write model; ClickAggregate tables are the read model.
Final mechanism for click event pipeline:
Redirect service
→ fire-and-forget async write to Kafka topic: click_events
→ partition key: slug (ensures per-slug ordering within partition)
Kafka consumer (analytics worker)
→ reads click_events topic at-least-once
→ deduplicates on event_id (Bloom filter in memory, 5-minute window)
→ batches 10,000 events
→ bulk-inserts ClickEvent rows to ClickHouse
→ ClickHouse ReplacingMergeTree merges in background
ClickHouse aggregate table
→ materialized view: SELECT slug, toDate(occurred_at), country, count()
FROM click_events GROUP BY slug, date, country
→ refreshed every 60 seconds (satisfies ε = 60s from I6)
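The consumer's dedup-then-aggregate step can be sketched as follows, substituting a plain in-memory set for the Bloom filter and a dict for the ClickHouse table (names are illustrative):

```python
def aggregate_batch(events, seen_event_ids, aggregates):
    """One consumer step: drop already-seen event_ids (so at-least-once delivery
    becomes effectively exactly-once), then fold survivors into
    per-(slug, day, country) counts. A plain set stands in for the Bloom
    filter; a dict stands in for the ClickHouse aggregate table."""
    for ev in events:
        if ev["event_id"] in seen_event_ids:
            continue                      # duplicate delivery: skip
        seen_event_ids.add(ev["event_id"])
        key = (ev["slug"], ev["occurred_at"][:10], ev["country"])
        aggregates[key] = aggregates.get(key, 0) + 1
    return aggregates
```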
Derivation C: Redirect Lookup Optimization (Invariant I3, I7) #
Step 6.3.1 — Q1 (Scope): The redirect lookup is a read of ShortLink by slug. With 10B redirects/month (~3,800 req/sec average, 100K+ req/sec peak for viral URLs), this is read-heavy. Pattern: Cache-Aside.
Step 6.3.2 — Q4 (Access): Read » Write (roughly 30:1 overall at the reference scale, far higher for hot slugs). The correct pattern is a multi-layer cache:
- Layer 1: CDN edge cache (geo-distributed) — serves redirects at the PoP nearest the user, keeping 80%+ of traffic from reaching origin. Cache-Control: max-age of 30–300 seconds depending on expected change frequency (new links are rarely changed); a max-age matching I7's ε_redirect = 5 seconds directly would be too aggressive for a CDN, so status changes are covered by explicit purge instead.
- Layer 2: Redis in-memory cache — serves slugs not in the CDN cache. Sub-millisecond. Cache TTL = 60 seconds for hot slugs. Capacity: 100M active slugs × 200 bytes = 20GB.
- Layer 3: Postgres — authoritative source. Hit only on cache miss.
Step 6.3.3 — Hotspot problem:
A viral URL (slug abc123) may receive millions of redirects/sec. All hit the same Redis key. This is the thundering herd / hotspot problem.
Mechanisms:
- CDN absorbs 95%+ of viral traffic — viral URLs have extremely high cache hit rates because many users in many regions access the same URL. CDN caches at edge.
- Local in-process cache in redirect service — each redirect service instance maintains a local LRU cache (10K entries, 1-second TTL). Viral slugs are served from process memory without Redis roundtrip.
- Redis read replicas — Redis Cluster with multiple read replicas per shard. Viral slug key can be read from any replica.
Step 6.3.4 — Q2 (Failure): Cache miss on viral URL going cold (TTL expires at CDN and Redis simultaneously):
- Thundering herd: thousands of requests flood Redis/DB simultaneously.
- Mechanism: probabilistic early re-caching (extend TTL before expiry for hot keys), OR request coalescing at Redis layer (SETNX-based mutex: first miss acquires “I am fetching” lock, others wait briefly then get result).
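Probabilistic early re-caching can be sketched as follows: each reader draws an exponentially distributed jitter scaled by the recompute cost, so a hot key is refreshed by some reader shortly before TTL expiry rather than by a stampede after it (a sketch of the published "XFetch" approach; parameter names are ours):

```python
import math
import random
import time

def should_refresh_early(expiry_ts, recompute_cost_s, beta=1.0, now=None):
    """Probabilistic early expiration: as expiry approaches, each reader
    independently decides to refresh with rising probability, so one request
    repopulates the cache instead of a synchronized stampede at TTL expiry.
    beta > 1 refreshes earlier / more aggressively."""
    now = time.time() if now is None else now
    u = random.random() or 1e-12                       # guard against log(0)
    jitter = recompute_cost_s * beta * -math.log(u)    # Exp-distributed headroom
    return now + jitter >= expiry_ts
```

In the redirect service, `recompute_cost_s` would be the measured time to refetch from Postgres; a reader that gets True refreshes the Redis entry and extends its TTL before it lapses.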
Step 6.3.5 — Cache invalidation on status change: When ShortLink.status changes (ACTIVE → DISABLED), cached entries must be invalidated:
- Send invalidation to Redis: DEL shortlink:{slug}.
- CDN purge API call: PURGE https://bit.ly/{slug}.
- Residual staleness is bounded by CDN purge propagation time, typically ~1–3 seconds.
- Acceptable per I7 (ε_redirect = 5 seconds).
Step 6.3.6 — Required combination: Cache-Aside always needs a TTL (prevents stale-forever entries if invalidation fails). Applied: Redis TTL = 60 seconds, CDN max-age = 300 seconds (or purged on change).
Final mechanism for redirect lookup:
Client request: GET https://bit.ly/{slug}
→ CDN edge (layer 1):
HIT: return 302, target_url, Cache-Control: max-age=300
MISS: forward to redirect service
→ Redirect service (layer 2 — local process cache, 1s TTL LRU):
HIT: return 302
MISS: forward to Redis
→ Redis (layer 3 — 60s TTL):
HIT: populate local cache, return 302
MISS: coalesce concurrent fetches (SETNX lock)
→ Postgres (authoritative):
SELECT target_url, status, expires_at FROM shortlinks WHERE slug = ?
If status != ACTIVE or expires_at < now(): return 410
Return 302, populate Redis, return to caller
At scale, a cache miss is an exceptional event, not a normal path. The redirect path must never be allowed to become a DB read path under load. Redis must be provisioned to handle 100% of redirect traffic if the CDN is bypassed.
Step 7 — Axiomatic Validation #
Source-of-Truth Table #
| State Domain | Source of Truth | Location |
|---|---|---|
| Slug → target_url mapping | ShortLink table | Postgres (primary) |
| ShortLink.status | ShortLink table | Postgres (primary) |
| Click facts | ClickEvent table | ClickHouse (append-only) |
| Click aggregates | ClickAggregate materialized view | ClickHouse (derived) |
| User accounts | User table | Postgres |
| Hot slug cache | Redis key shortlink:{slug} | Redis (cache — not source of truth) |
| Idempotency keys | idempotency_keys table | Postgres |
Dependency table:
| Derived Object | Depends On | Rebuild Path |
|---|---|---|
| ClickAggregate | ClickEvent | INSERT INTO click_agg SELECT slug, date, country, count(*) FROM click_events GROUP BY ... |
| Redis cache entry | ShortLink (Postgres) | On cache miss: fetch from Postgres, repopulate Redis |
| UserLinkIndex | ShortLink | SELECT * FROM shortlinks WHERE owner_user_id = ? (secondary index) |
| CDN cached redirect | ShortLink (Postgres, via redirect service) | Purge CDN + TTL expiry; rebuild on next request |
Projections with rebuild paths:
ClickAggregate rebuild: If ClickHouse is corrupted, replay ClickEvent log from Kafka (retain 30 days) → rebuild all aggregates. Duration: hours for full history, minutes for recent windows.
Redis cache rebuild: If Redis is wiped, cache warms organically on first miss per slug. No explicit rebuild needed. Hot slugs repopulate within seconds under live traffic.
UserLinkIndex: It is a secondary index on Postgres. If the index is dropped, it can be rebuilt in minutes with CREATE INDEX CONCURRENTLY.
Independence check:
- ClickEvent is not derived from ShortLink. Click events reference slug as a foreign key (denormalized — slug may be deleted but click events are retained for historical analytics). This is correct: events are immutable facts; deleting the ShortLink does not delete the analytics history.
- ClickAggregate is fully derivable from ClickEvent. No circular dependency.
- Redis cache is fully derivable from Postgres. No circular dependency.
Step 8 — Algorithm Design #
Write Path 1: Short URL Creation (Auto-generated Slug) #
function createShortLink(request: CreateRequest, idempotency_key: string) -> ShortLink:
// Step 1: Idempotency check
existing = db.query(
"SELECT shortlink_id FROM idempotency_keys WHERE key = $1",
[idempotency_key]
)
if existing:
return db.query("SELECT * FROM shortlinks WHERE id = $1", [existing.shortlink_id])
// Step 2: Validate input
if not isValidURL(request.target_url):
raise InvalidURLError
if request.expires_at != null and request.expires_at < now() + 60s:
raise InvalidExpiryError // must expire at least 60 seconds in the future
// Step 3: Generate slug from auto-increment ID
// The DB sequence guarantees monotonicity and uniqueness.
// Base62 encoding: 0-9 = '0'-'9', 10-35 = 'a'-'z', 36-61 = 'A'-'Z'
link_id = db.nextval('shortlinks_id_seq')
slug = base62_encode(link_id) // deterministic, no collision possible
// Step 4: Insert ShortLink (cannot fail on slug uniqueness for auto-generated)
db.execute("""
INSERT INTO shortlinks (id, slug, target_url, owner_user_id, created_at, expires_at, status)
VALUES ($1, $2, $3, $4, now(), $5, 'ACTIVE')
""", [link_id, slug, request.target_url, request.user_id, request.expires_at])
// Step 5: Record idempotency key atomically
db.execute("""
INSERT INTO idempotency_keys (key, shortlink_id, created_at)
VALUES ($1, $2, now())
""", [idempotency_key, link_id])
// Step 6: Return created record
return ShortLink{id: link_id, slug: slug, ...}
Idempotency: Step 1 checks the idempotency key before any side effect occurs. But if the service crashes after step 4 and before step 5, the ShortLink row exists while the idempotency key does not. On retry, the check at step 1 misses, a fresh sequence value is drawn, and a second, orphaned ShortLink is created — violating I2. Mitigation: wrap steps 4 and 5 in a single DB transaction.
Revised (correct) version:
BEGIN TRANSACTION
link_id = nextval('shortlinks_id_seq')
slug = base62_encode(link_id)
INSERT INTO shortlinks (id, slug, ...)
INSERT INTO idempotency_keys (key, shortlink_id, ...)
COMMIT
If the transaction rolls back, neither row is written. The client retries with the same idempotency key. The idempotency check at step 1 returns nothing, and the transaction is retried cleanly.
Write Path 2: Custom Slug Claim #
function claimCustomSlug(request: CustomSlugRequest, idempotency_key: string) -> ShortLink:
// Step 1: Idempotency check (same as above)
existing = checkIdempotency(idempotency_key)
if existing: return existing
// Step 2: Validate slug format
if not isValidSlugFormat(request.slug): // alphanum + hyphen, 3-50 chars
raise InvalidSlugError
// Step 3: Attempt atomic insert (CAS on slug uniqueness)
BEGIN TRANSACTION
result = db.execute("""
INSERT INTO shortlinks (id, slug, target_url, owner_user_id, status, created_at)
VALUES (nextval('shortlinks_id_seq'), $1, $2, $3, 'ACTIVE', now())
ON CONFLICT (slug) DO NOTHING
RETURNING id, slug
""", [request.slug, request.target_url, request.user_id])
if result.rowcount == 0:
ROLLBACK
raise SlugAlreadyTakenError(slug=request.slug) // I4 enforced
link_id = result.rows[0].id
INSERT INTO idempotency_keys (key, shortlink_id, ...) VALUES (idempotency_key, link_id, now())
COMMIT
return ShortLink{slug: request.slug, ...}
CAS semantics: ON CONFLICT (slug) DO NOTHING is the CAS. If two requests race for the same slug, Postgres serializes them at the unique index. Exactly one succeeds; the other sees rowcount == 0.
Write Path 3: Status Transition (Disable / Enable) #
function updateStatus(slug: string, new_status: Status, actor: User) -> ShortLink:
// Step 1: Fetch current record
link = db.query("SELECT id, status, version, owner_user_id FROM shortlinks WHERE slug = $1", [slug])
if not link:
raise NotFoundError
// Step 2: Access control (I9)
if link.owner_user_id != actor.user_id and not actor.is_admin:
raise UnauthorizedError
// Step 3: Validate state machine transition (I5)
valid_transitions = {
'ACTIVE': ['EXPIRED', 'DISABLED'],
'DISABLED': ['ACTIVE'],
'EXPIRED': [] // EXPIRED is terminal unless manually overridden by admin
}
if new_status not in valid_transitions[link.status]:
raise InvalidTransitionError(from=link.status, to=new_status)
// Step 4: CAS update on (status, version) to prevent concurrent conflicting writes
result = db.execute("""
UPDATE shortlinks
SET status = $1, version = version + 1, updated_at = now()
WHERE slug = $2 AND status = $3 AND version = $4
""", [new_status, slug, link.status, link.version])
if result.rowcount == 0:
// Concurrent update; retry from Step 1 (optimistic locking)
raise ConflictError // caller should retry
// Step 5: Invalidate cache
redis.del(f"shortlink:{slug}")
cdn.purge(f"https://bit.ly/{slug}") // async, best-effort
return fetchUpdated(slug)
Retry on conflict: The caller (service layer) retries up to 3 times with exponential backoff. Slug status changes are rare; conflict probability is negligible.
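The caller-side retry loop might look like the following sketch, with SQLite standing in for Postgres and the transition-validity check omitted for brevity; `cas_update_status` is a hypothetical helper name.

```python
import sqlite3, time

# Optimistic-locking sketch: re-read, CAS on (status, version), back off on conflict.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE shortlinks (slug TEXT PRIMARY KEY, status TEXT, version INTEGER)")
db.execute("INSERT INTO shortlinks VALUES ('AbCd3F', 'ACTIVE', 1)")

def cas_update_status(slug: str, new_status: str, max_retries: int = 3) -> bool:
    for attempt in range(max_retries):
        row = db.execute(
            "SELECT status, version FROM shortlinks WHERE slug = ?", (slug,)
        ).fetchone()
        if row is None:
            return False
        status, version = row
        cur = db.execute(
            "UPDATE shortlinks SET status = ?, version = version + 1 "
            "WHERE slug = ? AND status = ? AND version = ?",
            (new_status, slug, status, version),
        )
        db.commit()
        if cur.rowcount == 1:
            return True                    # CAS succeeded
        time.sleep(0.01 * (2 ** attempt))  # exponential backoff, then re-read
    return False

print(cas_update_status("AbCd3F", "DISABLED"))  # True; version bumped 1 -> 2
```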
Read Path: Redirect #
function redirect(slug: string) -> HTTPResponse:
// Layer 1: Process-local cache (LRU, 10K entries, 1-second TTL)
cached = localCache.get(slug)
if cached:
if cached.status != 'ACTIVE' or cached.is_expired():
return HTTP_410
        emit_click_event_async(slug, request) // count local-cache hits too, else viral-slug clicks go uncounted
        return HTTP_302(Location=cached.target_url)
// Layer 2: Redis (60-second TTL)
cached = redis.get(f"shortlink:{slug}")
if cached:
localCache.set(slug, cached, ttl=1s)
if cached.status != 'ACTIVE' or cached.is_expired():
return HTTP_410
emit_click_event_async(slug, request) // fire-and-forget to Kafka
return HTTP_302(Location=cached.target_url)
// Layer 3: Postgres (cache miss — this is a failure condition under load)
// Use coalescing to prevent thundering herd
lock_key = f"fetching:{slug}"
acquired = redis.set(lock_key, '1', NX=true, EX=1) // 1-second lock
if not acquired:
// Another request is fetching; wait briefly and check Redis again
sleep(10ms)
cached = redis.get(f"shortlink:{slug}")
if cached:
// Successfully coalesced
... (same as Layer 2 hit above)
else:
// Timeout; fall through to DB anyway
pass
link = db.query("""
SELECT target_url, status, expires_at FROM shortlinks WHERE slug = $1
""", [slug])
redis.del(lock_key)
if not link:
redis.set(f"shortlink:{slug}", NEGATIVE_CACHE_SENTINEL, EX=60) // negative cache
return HTTP_404
// Populate caches
redis.set(f"shortlink:{slug}", serialize(link), EX=60)
localCache.set(slug, link, ttl=1s)
if link.status != 'ACTIVE' or (link.expires_at and link.expires_at < now()):
return HTTP_410
emit_click_event_async(slug, request)
return HTTP_302(Location=link.target_url)
Negative caching: Non-existent slugs also get cached (with a sentinel value) to prevent DB hammering for 404s from bots.
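A minimal in-process sketch of the negative-caching pattern, assuming a dict-backed TTL cache and a sentinel object; `TTLCache` and `lookup` are illustrative names, not the service's actual API.

```python
import time

NEGATIVE = object()  # sentinel meaning "slug confirmed non-existent"

class TTLCache:
    """Minimal TTL cache for the negative-caching sketch."""
    def __init__(self):
        self._store = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None or entry[1] < time.monotonic():
            return None   # miss or expired
        return entry[0]

    def set(self, key, value, ttl_s):
        self._store[key] = (value, time.monotonic() + ttl_s)

cache = TTLCache()
DB = {"AbCd3F": "https://example.com/long"}  # stand-in for Postgres

def lookup(slug):
    hit = cache.get(slug)
    if hit is NEGATIVE:
        return 404                      # cached non-existence, DB untouched
    if hit is not None:
        return 302
    target = DB.get(slug)
    if target is None:
        cache.set(slug, NEGATIVE, 60)   # cache the 404 for 60s
        return 404
    cache.set(slug, target, 60)
    return 302

print(lookup("nope"))    # 404 — hits the DB once
print(lookup("nope"))    # 404 — served from negative cache, no DB read
print(lookup("AbCd3F"))  # 302
```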
Async Path: Click Event Emission #
// Fire-and-forget from redirect handler (runs in separate goroutine/thread)
function emit_click_event_async(slug: string, request: HTTPRequest):
event = ClickEvent{
event_id: uuid4(), // globally unique, used for dedup in consumer
slug: slug,
occurred_at: now_utc(),
country: geoip_country(request.client_ip),
city: geoip_city(request.client_ip),
referrer: request.headers.get('Referer', '')[:500], // truncated
user_agent: request.headers.get('User-Agent', '')[:500],
ip_hash: sha256(request.client_ip)[:16] // privacy-preserving
}
// Kafka append — partition by slug to preserve per-slug ordering
kafka.produce(
topic='click_events',
key=slug, // partition key
value=protobuf_encode(event),
acks='1' // leader ack only — async throughput over durability
)
// If Kafka produce fails, log and drop — analytics loss is acceptable
// Do NOT block the redirect response on this
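The event construction can be sketched in Python; `build_click_event` is an illustrative helper, GeoIP enrichment is omitted, and a plain dict stands in for the protobuf-encoded ClickEvent.

```python
import hashlib, uuid
from datetime import datetime, timezone

def build_click_event(slug: str, client_ip: str, referrer: str, user_agent: str) -> dict:
    """Sketch of ClickEvent construction: header truncation plus a
    privacy-preserving IP hash. The raw IP never leaves the process."""
    return {
        "event_id": str(uuid.uuid4()),        # globally unique, dedup key for the consumer
        "slug": slug,
        "occurred_at": datetime.now(timezone.utc).isoformat(),
        "referrer": referrer[:500],           # bound unbounded client headers
        "user_agent": user_agent[:500],
        "ip_hash": hashlib.sha256(client_ip.encode()).hexdigest()[:16],
    }

ev = build_click_event("AbCd3F", "203.0.113.9", "https://twitter.com/x", "Mozilla/5.0")
print(len(ev["ip_hash"]))  # 16
```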
State Machine Diagram #
CREATE
│
▼
┌─────────┐
│ ACTIVE │◄────────────────────┐
└─────────┘ │ (admin re-activate)
│ │ │
expires_at│ │owner/admin │
reached │ │ disables │
▼ ▼ │
┌─────────┐ ┌──────────────────┐│
│ EXPIRED │ │ DISABLED ├┘
└─────────┘ └──────────────────┘
(terminal (owner can re-activate
for users) if plan allows)
Step 9 — Logical Data Model #
Table: shortlinks #
CREATE TABLE shortlinks (
id BIGINT PRIMARY KEY DEFAULT nextval('shortlinks_id_seq'),
slug VARCHAR(64) NOT NULL,
target_url TEXT NOT NULL, -- max 8192 chars
owner_user_id BIGINT NOT NULL REFERENCES users(id),
status VARCHAR(16) NOT NULL DEFAULT 'ACTIVE'
CHECK (status IN ('ACTIVE', 'EXPIRED', 'DISABLED')),
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
expires_at TIMESTAMPTZ, -- NULL = never expires
updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
version BIGINT NOT NULL DEFAULT 1, -- for CAS on status updates
is_custom_slug BOOLEAN NOT NULL DEFAULT false,
alias VARCHAR(256), -- human-readable name
custom_domain VARCHAR(256), -- premium feature
CONSTRAINT uq_slug UNIQUE (slug), -- I1, I4 enforcement
CONSTRAINT uq_custom_domain_slug UNIQUE (custom_domain, slug)
);
-- Secondary index for user dashboard (I9, UserLinkIndex projection)
CREATE INDEX idx_shortlinks_owner ON shortlinks(owner_user_id, created_at DESC);
-- Index for expiry scheduler (TTL enforcement)
CREATE INDEX idx_shortlinks_expires ON shortlinks(expires_at) WHERE expires_at IS NOT NULL AND status = 'ACTIVE';
Partition key: slug — all redirect lookups are by slug. The unique index on slug IS the partition key for the redirect path.
Dedup key: id (sequence) for auto-generated slugs; slug (unique index) for custom slugs.
Table: idempotency_keys #
CREATE TABLE idempotency_keys (
key VARCHAR(128) PRIMARY KEY, -- client-supplied idempotency key
shortlink_id BIGINT NOT NULL REFERENCES shortlinks(id),
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
-- TTL: rows older than 7 days can be deleted (cron job)
CREATE INDEX idx_idem_created ON idempotency_keys(created_at);
Table: users #
CREATE TABLE users (
id BIGINT PRIMARY KEY DEFAULT nextval('users_id_seq'),
email VARCHAR(320) NOT NULL UNIQUE,
plan VARCHAR(16) NOT NULL DEFAULT 'FREE'
CHECK (plan IN ('FREE', 'PRO', 'ENTERPRISE')),
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
password_hash TEXT NOT NULL
);
Table: click_events (ClickHouse) #
-- ClickHouse DDL
CREATE TABLE click_events (
event_id UUID,
slug String,
occurred_at DateTime64(3, 'UTC'),
country LowCardinality(String),
city String,
referrer String,
user_agent String,
ip_hash FixedString(16)
)
ENGINE = ReplacingMergeTree(occurred_at)
PARTITION BY toYYYYMM(occurred_at)
ORDER BY (slug, occurred_at, event_id);
-- ReplacingMergeTree deduplicates on (slug, occurred_at, event_id) — I8 enforcement
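The dedup guarantee can be illustrated with a consumer-side sketch keyed on the same tuple as the ORDER BY; this is a simplified model of what the merge achieves, not ClickHouse itself.

```python
# Redelivered events (Kafka at-least-once) collapse to one row when keyed
# the same way as the ReplacingMergeTree ORDER BY: (slug, occurred_at, event_id).
def dedup(events: list[dict]) -> list[dict]:
    seen = {}
    for ev in events:
        key = (ev["slug"], ev["occurred_at"], ev["event_id"])
        seen[key] = ev          # duplicates overwrite, so each key survives once
    return list(seen.values())

batch = [
    {"slug": "AbCd3F", "occurred_at": "2026-04-01T12:00:01Z", "event_id": "e1"},
    {"slug": "AbCd3F", "occurred_at": "2026-04-01T12:00:01Z", "event_id": "e1"},  # redelivery
    {"slug": "AbCd3F", "occurred_at": "2026-04-01T12:00:02Z", "event_id": "e2"},
]
print(len(dedup(batch)))  # 2 — the redelivered e1 is counted once (I8)
```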
Table: click_aggregates (ClickHouse Materialized View) #
CREATE MATERIALIZED VIEW click_agg_daily
ENGINE = AggregatingMergeTree()
PARTITION BY toYYYYMM(date)
ORDER BY (slug, date, country)
AS SELECT
slug,
toDate(occurred_at) AS date,
country,
countState() AS click_count
FROM click_events
GROUP BY slug, date, country;
-- Query view:
CREATE VIEW click_agg_daily_view AS
SELECT slug, date, country, countMerge(click_count) AS clicks
FROM click_agg_daily
GROUP BY slug, date, country;
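The pre-aggregation the view computes is equivalent to this group-by sketch (Python stand-in, not ClickHouse):

```python
from collections import Counter

# Python equivalent of the daily materialized view:
# click counts grouped by (slug, toDate(occurred_at), country).
events = [
    {"slug": "AbCd3F", "occurred_at": "2026-04-01T12:00:01Z", "country": "US"},
    {"slug": "AbCd3F", "occurred_at": "2026-04-01T18:30:00Z", "country": "US"},
    {"slug": "AbCd3F", "occurred_at": "2026-04-02T09:00:00Z", "country": "DE"},
]

def aggregate_daily(events):
    agg = Counter()
    for ev in events:
        date = ev["occurred_at"][:10]  # toDate(occurred_at)
        agg[(ev["slug"], date, ev["country"])] += 1
    return dict(agg)

print(aggregate_daily(events))
# {('AbCd3F', '2026-04-01', 'US'): 2, ('AbCd3F', '2026-04-02', 'DE'): 1}
```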
Step 10 — Technology Landscape #
Mapping procedure: capability → shape → specific product.
| DP | Capability Required | Shape | Selected Product | Justification |
|---|---|---|---|---|
| Atomic conditional insert, CAS updates, idempotency store | Serializable ACID transactions, unique indexes | Relational OLTP | PostgreSQL 16 | Proven unique index semantics; ON CONFLICT CAS; nextval() for sequence generation; wide ecosystem |
| Low-latency KV cache for redirect lookup | Sub-ms read, TTL, ~20GB dataset, high concurrency | In-memory KV store | Redis 7 (Redis Cluster) | <1ms p99 reads; TTL per key; cluster for horizontal read scaling; wide client support |
| Click event pipeline — durable ordered log | High-throughput append, ordered per partition, at-least-once delivery, replay | Partitioned durable log | Apache Kafka | Sustains millions of events/sec per cluster; log retention for replay; partition-by-slug for ordering; mature ecosystem |
| Analytics storage — click events and aggregates | Append-heavy writes, column-scan aggregation queries, time-series partitioning | Columnar OLAP | ClickHouse | 1B+ rows/day insertions; ReplacingMergeTree for dedup; materialized views for pre-aggregation; 10–100x faster than Postgres for analytics |
| CDN edge redirect serving | Geo-distributed caching, HTTP redirect serving, cache purge API | CDN | Cloudflare (or Fastly) | 300+ PoPs; <5ms to 95% of world population; cache purge API; Cloudflare Workers for edge logic |
| Geo-IP resolution for click events | IP → country/city mapping, <1ms latency, ~1GB dataset | In-process library with mmdb | MaxMind GeoLite2 (mmdb) | In-process lookup, no network roundtrip; updated weekly; covers 99%+ of IPs |
| Process-local cache in redirect service | LRU eviction, 1s TTL, in-memory | In-process LRU | go-cache / Caffeine (language-specific) | Zero network roundtrip; fits in L2/L3 CPU cache for hot slugs |
| Expiry scheduler | Periodic TTL check and status transition | Cron + DB query | pg_cron (Postgres extension) or dedicated Go worker | UPDATE shortlinks SET status='EXPIRED' WHERE expires_at < now() AND status='ACTIVE' — runs every 60 seconds |
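The dedicated-worker variant of the expiry sweep might look like the following sketch, with SQLite standing in for Postgres and a simplified schema; a production sweep would also return the affected slugs so their cache entries can be invalidated.

```python
import sqlite3, time

# One pass of the expiry scheduler: flip overdue ACTIVE links to EXPIRED.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE shortlinks (slug TEXT PRIMARY KEY, status TEXT, expires_at REAL)")
now = time.time()
db.execute("INSERT INTO shortlinks VALUES ('old1', 'ACTIVE', ?)", (now - 10,))
db.execute("INSERT INTO shortlinks VALUES ('live', 'ACTIVE', ?)", (now + 3600,))

def expire_sweep() -> int:
    cur = db.execute(
        "UPDATE shortlinks SET status = 'EXPIRED' "
        "WHERE expires_at IS NOT NULL AND expires_at < ? AND status = 'ACTIVE'",
        (time.time(),),
    )
    db.commit()
    return cur.rowcount  # number of links flipped this pass

print(expire_sweep())  # 1 — only 'old1' flips; 'live' is untouched
```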
Step 11 — Deployment Topology #
Service Boundaries #
┌─────────────────────────────────────────────────────────────────────┐
│ CDN Edge (Cloudflare) │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Redirect cache (max-age=300, surrogate-key={slug}) │ │
│ │ Cloudflare Workers: edge slug validation + cache miss proxy │ │
│ └─────────────────────────────────────────────────────────────┘ │
└──────────────────────────┬──────────────────────────────────────────┘
│ Cache miss only (~1-5% of traffic)
▼
┌─────────────────────────────────────────────────────────────────────┐
│ Region: us-east-1 (Primary) Region: eu-west-1 (Secondary) │
│ │
│ ┌─────────────────┐ ┌─────────────────────────────────────┐ │
│ │ Redirect Service│ │ Write Service │ │
│ │ (stateless) │ │ (ShortLink creation, status update) │ │
│ │ 50 instances │ │ 10 instances │ │
│ │ Auto-scales │ │ │ │
│ └────────┬────────┘ └──────────────────────┬──────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────────┐ ┌─────────────────────────────────────┐ │
│ │ Redis Cluster │ │ PostgreSQL (Primary + 2 replicas) │ │
│ │ 6 shards │ │ Primary: writes │ │
│ │ 3 replicas each│ │ Replicas: dashboard reads │ │
│ └────────┬────────┘ └──────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ ┌─────────────────────────────────────┐ │
│ │ Kafka Cluster │──►│ Analytics Workers │ │
│ │ 12 brokers │ │ (Kafka → ClickHouse pipeline) │ │
│ │ click_events │ │ 5 consumer instances │ │
│ │ (slug-partitioned)│ └──────────────────────┬──────────────┘ │
│ └─────────────────┘ ▼ │
│ ┌─────────────────────────────────────┐ │
│ │ ClickHouse Cluster │ │
│ │ 3 shards, 2 replicas each │ │
│ └─────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
Partition Topology #
| Component | Partition Strategy | Rationale |
|---|---|---|
| Postgres shortlinks | Single primary + read replicas (no sharding at initial scale) | 300M rows × 500 bytes = 150GB — fits one Postgres instance |
| Redis Cluster | 6 shards by slug (consistent hashing) | Each shard holds ~3M hot slugs; horizontal scale |
| Kafka click_events | 48 partitions, partition key = slug | Per-slug ordering; 48 consumers can process in parallel |
| ClickHouse | 3 shards, partition by month | Time-series queries partition-pruned by month |
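Slug-to-shard routing under Redis Cluster can be sketched with the hash-slot computation. Python's `binascii.crc_hqx` implements the XModem CRC-16 that Redis Cluster's key hashing uses; hash-tag handling (`{...}` in keys) and the real cluster's slot-to-node map are omitted here.

```python
import binascii

NUM_SLOTS = 16384  # Redis Cluster hash-slot count
SHARDS = 6

def slot_for(slug: str) -> int:
    """CRC16(key) mod 16384, as Redis Cluster computes the key's hash slot."""
    return binascii.crc_hqx(f"shortlink:{slug}".encode(), 0) % NUM_SLOTS

def shard_for(slug: str) -> int:
    # Illustrative layout: slots split into 6 contiguous ranges, one per shard.
    return slot_for(slug) * SHARDS // NUM_SLOTS

print(shard_for("AbCd3F"))  # deterministic: the same slug always routes to the same shard
```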
Failure Domains #
| Failure | Scope | Impact |
|---|---|---|
| Single redirect service instance fails | Instance | None: load balancer removes from pool within 5s |
| Redis shard failure | 1/6 of hot slugs | Cache miss surge for affected slugs; DB absorbs for ~30s until Redis replica promotes |
| Postgres primary failure | All writes | Failover to replica in ~30s; writes blocked during failover; reads continue from replica |
| Kafka broker failure | 1/12 of partitions | Other brokers take over; click events in-flight may be lost (analytics tolerance) |
| CDN PoP outage | Geographic region | Traffic fails over to other PoPs; latency increase for affected region |
| ClickHouse node failure | Analytics only | Redirects unaffected; analytics queries degrade until replica promotes |
Step 12 — Consistency Model #
| Path | Consistency Model | Reason |
|---|---|---|
| Short URL creation (write) | Linearizable (strong) | Postgres serializable transaction on unique slug index. Two concurrent requests for the same slug: exactly one commits. |
| Redirect lookup (cache hit) | Bounded stale (ε = 60s Redis TTL + 300s CDN) | Acceptable: a newly created URL may not be visible at CDN for up to 5 minutes. A disabled URL may still redirect for up to 60s. |
| Redirect lookup (cache miss, DB read) | Linearizable read | Postgres synchronous read from primary (or replica with synchronous_commit = remote_apply). Returns current state. |
| Status update visibility | Bounded stale (ε = 5s post-invalidation) | Redis DEL + CDN purge on status change. CDN purge propagates in ~1–3s. Redis DEL is synchronous. |
| Click event recording | Eventual | Kafka at-least-once; ClickHouse async ingestion. Event appears in aggregates within ε = 60s. |
| Click aggregate staleness | Eventual (ε = 60s) | Materialized view refreshes every 60 seconds. Acceptable for analytics dashboards. |
| Dashboard (user link list) | Eventually consistent (replica lag ε < 1s) | Dashboard reads from Postgres replica. Replica lag < 100ms under normal conditions. |
Reasoning for key choices:
The redirect lookup being bounded-stale is correct. The invariant I3 requires ACTIVE-only redirects, but the ε is set by operational requirements (not a hard safety guarantee). A 60-second window where a disabled URL still redirects is acceptable; a 60-second window where a newly created URL fails to redirect is acceptable. This is a product decision, not an architectural flaw.
The click event pipeline being eventual is explicitly sanctioned by invariant I6 (ε = 60s). Analytics is not in the critical path of any user action.
Step 13 — Scaling Model #
Scale Type Classification #
| Component | Scale Type | Primary Bottleneck |
|---|---|---|
| Redirect path | Read-heavy + hotspot-heavy | Viral slug concentrates millions of req/sec on single cache key |
| URL creation | Write-heavy (but low absolute volume) | Sequence generation is single-threaded in Postgres |
| Click event pipeline | Write-heavy, fanout-heavy | 10B events/month → Kafka producer throughput |
| Analytics aggregation | Aggregation-heavy | ClickHouse query scan over billions of rows |
Hotspot Keys #
| Hotspot | Location | Mechanism |
|---|---|---|
| Viral slug (e.g., Super Bowl ad link) | Redis key + CDN URL | Multi-layer cache; CDN absorbs 99%; local process cache absorbs most of remainder |
| Postgres `shortlinks_id_seq` | Postgres | Sequence generation is fast (<1μs per nextval); batching nextval in groups of 1000 reduces contention |
| Kafka slug partition | Partition for viral slug | If one slug gets 100K/sec, its Kafka partition is a bottleneck for the consumer (analytics, not redirect). Redirect does not use Kafka on the hot path. |
Scaling Strategies #
Redirect service (read-heavy):
- Horizontal scale: stateless Go/Rust service; 50 instances behind L7 load balancer.
- Process-local LRU absorbs viral slugs at CPU speed.
- CDN absorbs geo-distributed traffic before it reaches the origin.
- Redis Cluster scales read capacity with additional replicas per shard.
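A process-local LRU with TTL along these lines (a sketch, not the go-cache/Caffeine implementations named in Step 10):

```python
import time
from collections import OrderedDict

class LocalLRU:
    """Bounded LRU with a short TTL, matching the 10K-entry / 1-second
    configuration described above."""
    def __init__(self, max_entries=10_000, ttl_s=1.0):
        self.max_entries, self.ttl_s = max_entries, ttl_s
        self._d = OrderedDict()  # key -> (value, expires_at)

    def get(self, key):
        entry = self._d.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at < time.monotonic():
            del self._d[key]         # expired
            return None
        self._d.move_to_end(key)     # mark as most recently used
        return value

    def set(self, key, value):
        self._d[key] = (value, time.monotonic() + self.ttl_s)
        self._d.move_to_end(key)
        if len(self._d) > self.max_entries:
            self._d.popitem(last=False)  # evict least recently used

cache = LocalLRU(max_entries=2, ttl_s=60)
cache.set("a", 1); cache.set("b", 2); cache.get("a"); cache.set("c", 3)
print(cache.get("b"))  # None — 'b' was least recently used and got evicted
print(cache.get("a"))  # 1
```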
URL creation (low-volume write):
- Single Postgres primary is sufficient for 115 writes/sec.
- If write rate grows 100x: switch to per-region Postgres with UUID-based IDs (no cross-region sequence required). Custom slugs still require a global uniqueness check (cross-region coordination via a global shard or pessimistic reservation).
Click event ingestion (high-volume append):
- Kafka scales horizontally; 48 partitions across 12 brokers comfortably sustain hundreds of thousands of events/sec, with room to add partitions.
- At 10B/month (~3,800 events/sec average), capacity headroom exceeds two orders of magnitude.
- ClickHouse ingestion scales with more shards.
Analytics query serving (aggregation-heavy):
- Pre-aggregate with ClickHouse materialized views. Dashboard queries hit the aggregate, not raw events.
- Cache dashboard results in Redis with 60-second TTL.
- For enterprise customers with custom date ranges: ClickHouse ad-hoc query over raw event table.
Step 14 — Failure Model #
Failure Taxonomy #
| Failure Mode | Can It Happen? | Correct Behavior | Mechanism | Recovery |
|---|---|---|---|---|
| Duplicate slug creation (two concurrent requests for same custom slug) | Yes — two users simultaneously claim bit.ly/launch | One succeeds (HTTP 201), one fails (HTTP 409 SlugTaken) | Postgres unique index, ON CONFLICT DO NOTHING returns 0 rows | Client-side: show error to losing requester |
| Duplicate click event counted twice in analytics | Yes — Kafka at-least-once delivery; consumer crashes mid-batch | Exactly one count per physical click | ReplacingMergeTree + consumer dedup on event_id | Self-healing: ClickHouse dedup occurs at merge time |
| Cache miss storm on viral URL (CDN + Redis TTL expire simultaneously) | Yes — TTL expiry is a sharp boundary | DB absorbs spike; does not crash | Request coalescing via Redis SETNX lock; CDN staggered TTL (CDN cache TTL > Redis TTL) | Redis TTL = 60s; CDN TTL = 300s; they expire at different times reducing simultaneity |
| Redirect service returns expired URL as ACTIVE | Yes — cache entry is stale during ε window | Client receives 302 to a URL that should have returned 410 | Cache TTL bounds the window; expiry scheduler flips status; Redis invalidation on status change | Self-healing within ε = 60s; explicit invalidation on EXPIRED transition |
| Postgres primary failure (crash) | Yes | Writes blocked; reads continue from replica | Postgres streaming replication + automatic failover via Patroni; PgBouncer re-points application connections | Replica promotes in ~30s; write path unavailable during promotion |
| Kafka broker failure | Yes | Click events produced to that broker’s partition are buffered or dropped | Kafka replication factor 3; producer retries for ~5s | Surviving brokers take over within seconds; events in-flight during failure may be lost (analytics tolerance) |
| Analytics consumer crashes mid-batch | Yes — consumer commits Kafka offset after writing to ClickHouse | Events in uncommitted batch are re-read and re-processed | Kafka consumer group offset management; ClickHouse dedup on event_id | Auto-restart consumer; dedup prevents double-counting |
| GeoIP lookup failure | Yes — mmdb file becomes unavailable | Click event recorded with empty country field | Fail-open: emit event with country = ''; do not block redirect | GeoIP is optional metadata; blank is acceptable |
| Redis cluster split-brain during network partition | Yes | Reads may return stale data; writes to both sides of partition | Redis Cluster majority-quorum for writes; minority nodes reject writes | Post-partition: minority side cache invalidated; repopulates from Postgres |
| Idempotency key collision (different clients generate same key) | Astronomically unlikely with UUID4 | Second client’s request is treated as a retry of the first | UUID4 has 122 random bits; birthday-bound collision probability across 1B keys ≈ 10^-19 | Not a practical concern |
| Short URL target is a malicious or phishing URL | Always possible | Redirect proceeds (Bit.ly is not a content filter by default) | Rate-limit creation per user; optional URL scanning integration (VirusTotal API async) | Flag URL; admin can DISABLE the ShortLink |
Step 15 — SLOs #
Redirect Path (Hot Path) #
| Metric | Target | Measurement Method |
|---|---|---|
| P95 redirect latency (CDN hit) | < 5ms | CDN edge timing headers |
| P99 redirect latency (CDN hit) | < 15ms | CDN edge timing headers |
| P95 redirect latency (Redis hit, CDN miss) | < 10ms | Server-side histogram in redirect service |
| P99 redirect latency (Redis hit, CDN miss) | < 25ms | Server-side histogram |
| P95 redirect latency (DB hit, cache miss) | < 100ms | Server-side histogram |
| P99 redirect latency (DB hit, cache miss) | < 200ms | Server-side histogram |
| Redirect availability | > 99.99% (52 min/year downtime) | Synthetic probes every 10s from 5 regions |
| Redirect correctness (correct 302 target) | > 99.999% | Automated canary: create URL, follow redirect, verify target |
A cache miss is a latency failure, not a correctness failure. P99 of 25ms for Redis-hit cases is the operational target, not 200ms.
Write Path (Creation) #
| Metric | Target |
|---|---|
| P95 URL creation latency | < 200ms |
| P99 URL creation latency | < 500ms |
| Slug uniqueness correctness | 100% — no two shortlinks may share a slug |
| Idempotency correctness | 100% — same idempotency key returns same result |
| Creation availability | > 99.9% (8.7 hours/year downtime) |
Analytics Path #
| Metric | Target |
|---|---|
| Click count staleness | < 60 seconds for 99th percentile |
| Dashboard load latency (P95) | < 2 seconds |
| Click count accuracy | > 99.9% of actual clicks counted (0.1% loss acceptable for analytics) |
| Analytics availability | > 99.5% (43 hours/year downtime — analytics is non-critical) |
Throughput #
| Metric | Sustained | Peak (10x) |
|---|---|---|
| Redirects | 4,000 req/sec | 40,000 req/sec |
| URL creations | 120 req/sec | 1,200 req/sec |
| Click event ingestion | 4,000 events/sec | 40,000 events/sec |
Step 16 — Operational Parameters #
Every tunable lever with its range and effect.
| Parameter | Location | Default | Range | Effect if Increased | Effect if Decreased |
|---|---|---|---|---|---|
| `redis_ttl_seconds` | Redirect service config | 60 | 10–3600 | Fewer DB reads; more stale data served | More DB reads; fresher data |
| `cdn_max_age_seconds` | CDN cache rule | 300 | 30–3600 | Fewer origin hits; more stale data at edge | More origin hits; fresher edge data |
| `local_cache_size_entries` | Redirect service | 10,000 | 1K–100K | More memory per pod; fewer Redis hits | Less memory; more Redis hits |
| `local_cache_ttl_seconds` | Redirect service | 1 | 0.1–10 | Longer stale window for viral slugs; fewer Redis hits | More Redis hits; fresher data |
| `kafka_producer_acks` | Click event producer | 1 (leader ack) | 0, 1, all | `all`: durability, more latency; `0`: fire-and-forget, possible loss | Lower acks = lower latency, more data loss risk |
| `kafka_consumer_batch_size` | Analytics consumer | 10,000 | 1K–100K | Larger ClickHouse inserts (more efficient); higher lag | More frequent small inserts; lower lag |
| `clickhouse_merge_interval` | ClickHouse config | 600s | 60–3600s | Less frequent dedup merges; more storage used; faster ingest | More frequent merges; more CPU; faster dedup |
| `postgres_max_connections` | Postgres | 200 | 50–1000 | More concurrent queries; more memory per connection | Connection starvation under load |
| `idempotency_key_retention_days` | Cron job | 7 | 1–30 | More storage; longer idempotency window | Less storage; shorter idempotency window |
| `expiry_scheduler_interval_seconds` | Expiry worker | 60 | 10–300 | Less frequent expiry; expired links may redirect briefly past deadline | More frequent sweeps; lower lag but more DB load |
| `redis_coalesce_lock_ttl_ms` | Redirect service | 1000 | 100–5000 | More waiting during coalesce; prevents more thundering herd | Faster fallback to DB; less coalescing benefit |
| `redis_cluster_read_replicas_per_shard` | Redis Cluster | 2 | 1–5 | More read capacity per shard; more memory | Less read capacity |
Step 17 — Runbooks #
Runbook R1: Viral URL Cache Miss Storm #
Trigger: redirect_db_hit_rate > 5% for more than 2 minutes. (Normally < 0.1%.)
Diagnosis:
- Check the `redis_keyspace_miss_rate` metric — high indicates cache pressure.
- Check the `top_slugs_by_request_rate` dashboard — identify the viral slug.
- Check Redis memory usage — if near 100%, eviction is happening.

Mitigations (in order):
- Immediate: Force Redis re-population for the viral slug: `redis-cli SET shortlink:{slug} {value} EX 3600`. This sets a 1-hour TTL, giving time to address the root cause.
- If Redis memory full: Raise maxmemory on the affected shard (replicas add read capacity, not cache space).
- If DB is overwhelmed: Enable read connection pooling on PgBouncer; scale out DB read replicas.
- Long-term: Increase `redis_ttl_seconds` for slugs with `request_rate > 10,000/min`.
Recovery signal: redirect_db_hit_rate returns below 0.5%.
Runbook R2: Postgres Primary Failover #
Trigger: PagerDuty alert postgres_primary_unreachable.
Immediate action:
- Do NOT manually intervene for 60 seconds — Patroni automatic failover is running.
- Verify Patroni status: `patronictl -c /etc/patroni/config.yml list`
- If automatic failover completes: verify the replica is now primary and application connections have reconnected via PgBouncer.
- If failover has not completed after 90 seconds: manually promote the replica: `patronictl failover bitly-postgres --master {old_primary} --candidate {replica}`
During failover (30–90 seconds):
- Redirect traffic: unaffected (Redis serving most requests).
- URL creation: fails with 503. Client should retry with idempotency key.
- Dashboard reads: may fail or return stale data.
Post-failover:
- Verify the new primary accepts writes: `INSERT INTO shortlinks ... ON CONFLICT DO NOTHING`
- Verify replication lag on the new replica: `SELECT now() - pg_last_xact_replay_timestamp()`
- Monitor for Redis cache miss increase (new primary may be slower initially).
- File incident report and investigate old primary.
Runbook R3: Analytics Lag Spike #
Trigger: click_aggregate_lag_seconds > 120 for more than 5 minutes.
Diagnosis:
- Check Kafka consumer lag: `kafka-consumer-groups.sh --describe --group analytics-consumer`
- Check ClickHouse insert queue depth.
- Check analytics worker CPU and memory.
Mitigations:
- If Kafka lag growing: Scale out analytics consumer instances (add 2–3 more).
- If ClickHouse insert slow: Reduce `kafka_consumer_batch_size` (smaller batches insert faster under write pressure).
- If analytics worker OOM: Increase pod memory limit.
- If ClickHouse node down: Check ClickHouse replica status; failover to replica.
Recovery signal: click_aggregate_lag_seconds < 60 for 10 minutes.
User impact: Analytics dashboards show counts up to lag seconds stale. No impact on redirects.
Runbook R4: Slug Uniqueness Violation (Invariant Breach) #
Trigger: duplicate_slug_count_in_postgres > 0 (this should never fire; it indicates a bug).
Immediate actions:
- Freeze all write traffic to the creation service.
- Run: `SELECT slug, count(*) FROM shortlinks GROUP BY slug HAVING count(*) > 1`
- For each duplicate slug: determine which is the legitimate record (lowest id = created first).
- Rename the duplicate to a new auto-generated slug.
- Notify affected user of the slug change.
Root cause investigation: This invariant cannot be violated if the unique index exists. Check: \d shortlinks to verify CONSTRAINT uq_slug UNIQUE (slug) is present. If missing, re-add immediately:
CREATE UNIQUE INDEX CONCURRENTLY uq_slug ON shortlinks(slug);
Step 18 — Observability #
Metrics #
| Metric | Component | Type | Alert Threshold |
|---|---|---|---|
| `redirect_latency_p95_ms` | Redirect service | Histogram | > 25ms for > 2 min |
| `redirect_latency_p99_ms` | Redirect service | Histogram | > 100ms for > 2 min |
| `redirect_cache_hit_rate` | Redirect service | Gauge | < 95% for > 5 min |
| `redirect_db_hit_rate` | Redirect service | Gauge | > 5% for > 2 min |
| `redis_memory_used_bytes` | Redis Cluster | Gauge | > 85% of maxmemory |
| `redis_keyspace_misses_per_sec` | Redis Cluster | Counter | > 1000/sec |
| `postgres_replication_lag_seconds` | Postgres replicas | Gauge | > 10s |
| `postgres_connections_active` | Postgres primary | Gauge | > 180 (of 200) |
| `kafka_consumer_lag_records` | Analytics consumer | Gauge | > 100,000 records |
| `click_events_produced_per_sec` | Redirect service | Counter | Alert on sharp drop: < 50% of 5-min avg |
| `clickhouse_insert_errors_per_min` | Analytics worker | Counter | > 10 |
| `short_url_creation_rate_per_sec` | Write service | Counter | Alert on > 10x normal (abuse detection) |
| `slug_collision_rate` | Write service | Counter | > 0 for custom slugs (expected 0 on success path) |
| `http_5xx_rate` | All services | Counter | > 1% for > 1 min |
| `cdn_origin_hit_rate` | CDN | Gauge | > 10% of total CDN traffic |
Distributed Traces #
Every redirect request carries a trace_id header (OpenTelemetry W3C TraceContext). Spans emitted:
| Span | Component | Key Attributes |
|---|---|---|
| `redirect.handle` | Redirect service | slug, cache_layer (local/redis/db), status_code |
| `redis.get` | Redis client | key, hit (bool), latency_ms |
| `postgres.select` | DB client | table=shortlinks, rows_returned, latency_ms |
| `kafka.produce` | Kafka client | topic=click_events, partition, latency_ms |
| `geoip.lookup` | GeoIP module | country, latency_us |
Structured Logs (Sampled) #
{
"ts": "2026-04-01T12:00:01.234Z",
"level": "info",
"event": "redirect",
"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
"slug": "AbCd3F",
"cache_layer": "redis",
"status_code": 302,
"latency_ms": 3.2,
"country": "US",
"referrer_domain": "twitter.com"
}
Log sampling: 1% of redirects for hot slugs; 100% for cache misses and errors.
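One way to implement the sampling rule is hash-based, so a sampled trace keeps all of its logs together; this deterministic scheme is an assumption — the document does not specify the mechanism.

```python
import zlib

def should_log(trace_id: str, cache_layer: str, status_code: int) -> bool:
    """Sampling rule sketch: always log errors and DB-layer (cache-miss)
    requests; otherwise keep a deterministic ~1% based on the trace id."""
    if status_code >= 500 or cache_layer == "db":
        return True
    return zlib.crc32(trace_id.encode()) % 100 == 0  # ~1% of hot-path redirects

print(should_log("4bf92f35", "db", 302))     # True — cache miss, always logged
print(should_log("4bf92f35", "redis", 500))  # True — error, always logged
```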
Dashboards #
- Redirect health dashboard: P50/P95/P99 latency, cache hit rates by layer, error rate, top 10 slugs by request rate.
- Analytics pipeline dashboard: Kafka consumer lag, ClickHouse insert rate, aggregate staleness.
- Write path dashboard: URL creation rate, custom slug collision rate, Postgres write latency.
- Infrastructure dashboard: Redis memory, Postgres replication lag, Kafka broker health.
Step 19 — Cost Model #
Growth Drivers per Component #
| Component | Primary Cost Driver | Unit Cost | Projected Monthly Cost at Scale |
|---|---|---|---|
| Postgres (primary + 2 replicas) | Storage (150GB + growth) + compute for IOPS | ~$500/node | ~$1,500/month (3 nodes) |
| Redis Cluster (6 shards × 3 nodes) | Memory (20GB/shard for hot slug cache) | ~$300/node (r7g.large) | ~$5,400/month (18 nodes) |
| Redirect service (50 instances) | CPU (high req/sec, mostly in-memory work) | ~$100/instance (c6g.medium) | ~$5,000/month |
| Kafka (12 brokers + ZooKeeper) | Network throughput + storage (30-day retention) | ~$400/broker | ~$4,800/month |
| ClickHouse (3 shards × 2 nodes) | Storage (time-series event data) + compute for merges | ~$600/node | ~$3,600/month |
| Analytics workers (5 instances) | CPU (Kafka consume + ClickHouse insert) | ~$100/instance | ~$500/month |
| CDN (Cloudflare) | Bandwidth + requests | $0.01/GB + $1/M requests | ~$2,000/month at 10B redirects |
| Total | — | — | ~$22,800/month |
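The monthly total can be sanity-checked by summing the component lines from the table (figures in USD/month):

```python
# Component costs from the table above, USD/month.
costs = {
    "postgres": 1_500,
    "redis": 5_400,
    "redirect_service": 5_000,
    "kafka": 4_800,
    "clickhouse": 3_600,
    "analytics_workers": 500,
    "cdn": 2_000,
}
total = sum(costs.values())
print(total)                                # 22800 — matches the Total row
print(round(costs["redis"] / total * 100))  # 24 — Redis share of spend
```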
Cost Growth Model #
| Growth Event | Primary Cost Driver | Component That Scales |
|---|---|---|
| 10x redirect traffic | CDN bandwidth scales linearly | CDN cost grows 10x (~$20K) |
| 10x URL creation rate | Postgres write IOPS | May require sharding or larger instance (~$3K) |
| 10x click event volume | Kafka storage + ClickHouse storage | Storage-linear growth (~$10K) |
| 10x active slug count | Redis memory (200 bytes × 10x slugs) | More Redis nodes (~$15K) |
Dominant cost at current scale: Redis (24%) and redirect service compute (22%). CDN becomes dominant at 10x traffic.
Cost optimization levers:
- Increase CDN cache hit rate → reduces redirect service instance count.
- Use slug access frequency to evict cold slugs from Redis → reduce Redis tier cost.
- ClickHouse data tiering: move partitions older than 90 days to S3-backed cold storage.
Step 20 — Evolution #
Stage 1: MVP (0 → 1M short URLs, < 100 req/sec) #
Architecture: Single Postgres, single Redis, no Kafka.
Click analytics: Write directly to Postgres click_events table (synchronous write during redirect). Not viable at scale but acceptable at < 100 req/sec.
Upgrade signal: P99 redirect latency > 100ms (DB bottleneck), or Postgres write IOPS > 80% capacity.
Changes needed to advance:
- Introduce Kafka for async click event pipeline (decouple analytics from redirect path).
- Introduce Redis as a caching layer.
Stage 2: Growth (1M → 100M short URLs, 100 → 5,000 req/sec) #
Architecture: Postgres primary + replicas, Redis Cluster (2 shards), Kafka + ClickHouse analytics pipeline, CDN integration.
Click analytics: Async via Kafka → ClickHouse. Redirect path has zero analytics latency.
Upgrade signal: Redis memory > 80% capacity, Postgres replication lag > 1s under load, CDN origin hit rate > 20%.
Changes needed to advance:
- Expand Redis cluster to 6 shards.
- Introduce process-local cache in redirect service (reduces Redis load by 10x for viral slugs).
- Consider geo-distributed deployment for sub-50ms latency in non-US regions.
Stage 3: Scale (100M → 1B short URLs, 5,000 → 50,000 req/sec) #
Architecture: Multi-region active-active, Postgres with read replicas per region (writes to primary region only), Redis per-region, Kafka cross-region replication.
Custom slug challenge: Custom slug uniqueness across regions requires a global uniqueness layer. Options:
- Option A: Route all custom slug creation requests to a single “slug authority” region (single writer, globally consistent). Acceptable because slug creation is rare and not latency-critical.
- Option B: Use a global distributed key-value store (e.g., Google Spanner) as the slug uniqueness arbiter. Higher operational complexity.
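Option A can be sketched as a single-writer reservation path. A minimal sketch under stated assumptions: `reserve_in_authority` stands in for an RPC to the authority region's database (e.g. an INSERT guarded by a unique constraint), and the region names and in-memory set are illustrative:

```python
AUTHORITY_REGION = "us-east-1"  # illustrative choice of authority

_reserved: set[str] = set()  # stands in for the authority's DB state

def reserve_in_authority(slug: str) -> bool:
    """Single-writer reservation: exactly one region decides uniqueness,
    so two concurrent claims on the same slug cannot both succeed."""
    if slug in _reserved:
        return False
    _reserved.add(slug)
    return True

def create_custom_slug(slug: str, local_region: str) -> bool:
    # Non-authority regions pay one cross-region round-trip on the
    # (rare) creation path; redirects stay entirely region-local.
    if local_region != AUTHORITY_REGION:
        pass  # in production: forward this call as an RPC to the authority
    return reserve_in_authority(slug)
```

The design choice is the traffic asymmetry again: paying a cross-region round-trip on ~115 writes/sec is cheap; paying it on ~3,800 redirects/sec would not be.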
Upgrade signal: Single-region creation latency > 500ms (DB round-trip from non-US users), or Postgres max connections reached.
Changes needed to advance:
- Introduce global slug authority service.
- Consider Postgres sharding for shortlinks table (horizontal partition by slug hash range).
- Introduce auto-generated slug pre-allocation (batch fetch sequence ranges from the DB, generate slugs locally in the creation service).
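Slug pre-allocation can be sketched as a batch range fetch plus local base62 encoding. A minimal sketch; `fetch_range` stands in for a single DB round-trip that atomically advances a sequence by `batch` (e.g. `UPDATE ... RETURNING`):

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def to_base62(n: int) -> str:
    """Encode a non-negative integer as a base62 slug."""
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, rem = divmod(n, 62)
        out.append(ALPHABET[rem])
    return "".join(reversed(out))

class SlugAllocator:
    """Hands out slugs from a pre-fetched ID range, so the DB is hit
    once per `batch` creations instead of once per creation."""

    def __init__(self, fetch_range, batch: int = 1000):
        self.fetch_range = fetch_range  # returns the start of a reserved range
        self.batch = batch
        self.next_id = self.limit = 0

    def allocate(self) -> str:
        if self.next_id >= self.limit:
            self.next_id = self.fetch_range(self.batch)
            self.limit = self.next_id + self.batch
        slug = to_base62(self.next_id)
        self.next_id += 1
        return slug
```

Because each process owns a disjoint ID range, allocations need no coordination until the range is exhausted; a crashed process simply wastes the remainder of its range, which is harmless.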
Stage 4: Hyperscale (> 1B short URLs, > 50,000 req/sec) #
Architecture: Full multi-region active-active with CRDT-based slug reservation, ClickHouse sharded to 20+ nodes, Redis Federation, CDN custom logic (Cloudflare Workers) for edge-side slug validation.
Key architectural evolution:
- Redirect path fully served at CDN edge with Cloudflare Workers reading from Cloudflare KV (edge KV store, not Postgres). Postgres is the source of truth but is not in the hot path at all.
- Click events are captured at the CDN edge and forwarded to the analytics pipeline asynchronously, keeping the origin out of the redirect path.
Upgrade signal: Cross-region DB roundtrip > 200ms for slug uniqueness check, or CDN origin cost exceeds $50K/month.
Summary: Upgrade Decision Matrix #
| Signal | Upgrade Action |
|---|---|
| redirect_db_hit_rate > 5% | Expand Redis capacity; increase CDN TTL |
| postgres_write_iops > 80% | Add read replicas; introduce Kafka analytics pipeline |
| redis_memory_used > 80% | Add Redis shards; evict cold slugs (access-frequency TTL) |
| cdn_origin_hit_rate > 20% | Increase CDN TTL; audit cache invalidation frequency |
| creation_latency_p99 > 500ms | Shard Postgres; pre-allocate slug ranges |
| analytics_lag > 5min | Scale out Kafka consumers; increase ClickHouse shards |
| Single-region creation latency > 1s for global users | Multi-region active-active + global slug authority |
End of Bit.ly system design derivation. All 20 steps have produced explicit output artifacts. The design is derivable from the invariants; the invariants are derivable from the normalized requirements; the normalized requirements are derivable from the product requirements. No design choice is unmotivated.