News Aggregator: System Design #
A complete mechanical derivation of a Google News / Flipboard-style news aggregator — from functional requirements to operable architecture, following the 20-step derivation framework without hand-wavy steps.
Ordering Principle #
Product requirements
→ normalize into operations over state (Step 1)
→ extract primary objects (Step 2)
→ assign ownership, evolution, ordering (Step 3)
→ extract invariants (Step 4)
→ derive minimal DPs from invariants (Step 5)
→ select concrete mechanisms (Step 6)
→ validate independence and source-of-truth (Step 7)
→ specify exact algorithms (Step 8)
→ define logical data model (Step 9)
→ map to technology landscape (Step 10)
→ define deployment topology (Step 11)
→ classify consistency per path (Step 12)
→ identify scaling dimensions and hotspots (Step 13)
→ enumerate failure modes (Step 14)
→ define SLOs (Step 15)
→ define operational parameters (Step 16)
→ write runbooks (Step 17)
→ define observability (Step 18)
→ estimate costs (Step 19)
→ plan evolution (Step 20)
Step 1 — Problem Normalization #
Convert product-language requirements into precise operations over state.
Original Functional Requirements:
- The system ingests articles from RSS feeds, APIs, and scraped sources
- Articles are deduplicated (same story from multiple sources = one article cluster)
- Articles are categorized by topic (ML classification)
- Users browse articles by category, topic, or keyword search
- Users see a personalized feed ranked by relevance to their interests
- Trending articles are surfaced (high engagement in short time window)
- Users can search for articles by keyword
Normalized Table:
| # | Original Requirement | Actor | Operation | State Touched |
|---|---|---|---|---|
| 1a | System ingests articles from RSS/API/scrape | Scheduler + CrawlWorker | append event (raw fetch) | ArticleFetch event; Source.last_crawled_at overwrite |
| 1b | Schedule controls when each source is re-crawled | System scheduler | read intent constraint; trigger crawl | CrawlSchedule (intent/future constraint) |
| 2 | Deduplicate articles — same story = one cluster | System pipeline | conditional create on URL hash; approximate-match on content hash | Article (by URL key); StoryCluster (merge) |
| 3 | Categorize articles by topic | System ML pipeline | overwrite metadata on Article | Article.category_labels (mutable metadata) |
| 4a | Users browse by category | User | read projection filtered by category | CategoryFeed (derived view over Article) |
| 4b | Users browse by topic | User | read projection filtered by topic | TopicFeed (derived view over Article) |
| 4c | Users browse by keyword search | User | query indexed projection | SearchIndex (Elasticsearch projection) |
| 5 | Users see personalized feed | User | read pre-computed ranked projection | PersonalizedFeed (derived view over Article × UserInterest) |
| 5a | Feed ranking updates as user reads articles | System | overwrite interest vector | UserInterest (overwrite on read event) |
| 5b | User reads an article | User | append immutable event | ReadEvent |
| 6 | Trending articles surfaced | System | read sliding-window aggregate | TrendingScore (derived view: count of ReadEvents per article per window) |
| 7 | Users search for articles by keyword | User | query indexed projection | SearchIndex (Elasticsearch) |
Key normalizations:
- FR 1 hides two distinct operations: the crawl scheduling loop (an intent/future constraint that persists across many executions) and the article fetch event (a point-in-time result).
- FR 2 “deduplication” is actually two distinct mechanisms: exact dedup (URL as natural key — one URL = one Article) and near-duplicate clustering (same story from different sources — requires approximate matching, producing a StoryCluster).
- FR 5 “see a personalized feed” hides a hidden write: every read event updates UserInterest. The feed itself is a derived view, not primary state.
- FR 6 “trending” is NOT primary state. TrendingScore is a derived aggregate over ReadEvents within a sliding window.
- FR 4c and FR 7 both resolve to the same mechanism (SearchIndex) — not two separate systems.
Step 2 — Object Extraction #
Identify the minimal set of primary state objects; apply four purity tests to each candidate.
Primary Objects (Candidates) #
Source #
- Class: Stable entity
- Ownership purity: Split writer — admin writes url, crawl_interval, status; pipeline writes last_crawled_at. The two writers touch disjoint fields under different conditions. Accept as one object.
- Evolution purity: Overwrite — current registration state; last_crawled_at updated after each crawl.
- Ordering purity: No meaningful order among sources.
- Non-derivability: Cannot be reconstructed from ArticleFetch events — source configuration (URL, interval) is primary. Pass.
- Verdict: Valid primary object.
CrawlSchedule #
- Class: Intent / future constraint
- Description: Encodes “source S must next be crawled at time T.” Persists until the crawl happens, then re-arms for the next interval. Analogous to AlertSubscription in a price-drop system.
- Ownership purity: System only — scheduler writes next_crawl_at after each completed crawl.
- Evolution purity: Overwrite — one current next-crawl-at per source.
- Ordering purity: No meaningful ordering across different sources; total order within one source (each crawl produces one new schedule entry overwriting the last).
- Non-derivability: The intent (next scheduled time) cannot be reconstructed solely from past ArticleFetch events without the crawl_interval. Pass.
- Verdict: Valid primary object (intent/future constraint class).
Article #
- Class: Stable entity
- Description: The parsed, normalized representation of one published piece of content at one URL.
- Ownership purity: Pipeline-only writes; no user writes. Two field groups with different evolution: immutable content (url, title, body, published_at, source_id, content_hash) vs. mutable metadata (category_labels, cluster_id, engagement_count). The mutable metadata is written by different sub-pipelines. This is a tension — but both field groups live naturally on the same entity and share the same natural key (URL). Accept: the immutable content is set once on insert; metadata fields are updated independently by labeled sub-pipelines.
- Evolution purity: Content = append-only (set once, immutable). Metadata = overwrite. Mixed, but the entity is the natural unit of reference. Accept.
- Ordering purity: Total order by published_at within source_id; by ingested_at within the system.
- Non-derivability: The parsed text, title, and URL are primary — they come from the external world and cannot be derived internally. Pass.
- Verdict: Valid primary object.
ArticleFetch #
- Class: Event
- Description: The raw result of one crawl attempt — raw HTML/RSS XML, HTTP status, fetch timestamp.
- Ownership purity: CrawlWorker only writes.
- Evolution purity: Append-only and immutable.
- Ordering purity: Total order by fetched_at within source_id.
- Non-derivability: The raw bytes from the external source are the provenance record; they cannot be reconstructed. Pass.
- Verdict: Valid primary object (event class). Used as input to the parse pipeline and for archival.
StoryCluster #
- Class: Stable entity (with merge-based evolution)
- Description: A group of Articles that cover the same real-world story, from different sources (e.g., CNN + BBC + Reuters all cover the same plane crash). Cluster membership is determined by near-duplicate similarity.
- Ownership purity: System pipeline only — the clustering pipeline writes cluster assignments.
- Evolution purity: Merge — new articles can join an existing cluster; clusters can merge if two previously separate clusters are found to be about the same story. Commutative: adding article A then B is the same as adding B then A.
- Ordering purity: No meaningful ordering among cluster members; the cluster has a canonical_article_id pointing to the most authoritative source.
- Non-derivability: The cluster identity and membership cannot be derived without running the near-duplicate algorithm. The result of MinHash/LSH is not mechanically derivable from Article content without the algorithm state (band signatures, hash functions). Accept.
- Verdict: Valid primary object.
ReadEvent #
- Class: Event
- Description: User reads an article. Immutable point-in-time record.
- Ownership purity: Single writer (the user, via the client/API layer).
- Evolution purity: Append-only, immutable.
- Ordering purity: Total order by timestamp within user_id.
- Non-derivability: The fact that user U read article A at time T is primary. Cannot be reconstructed. Pass.
- Verdict: Valid primary object.
UserInterest #
- Class: Stable entity
- Description: The current interest vector for a user — a weighted bag of topics/categories derived from their reading history. Used as input to feed ranking.
- Ownership purity: System pipeline only (the interest-update pipeline writes after each ReadEvent). Users never write directly.
- Evolution purity: Overwrite — always reflects the latest computed interest vector, not a history.
- Ordering purity: No meaningful ordering.
- Non-derivability: Could theoretically be reconstructed by replaying all ReadEvents through the interest model. However, as an online-updated vector (exponential moving average with decay), the current vector is the operational state; full reconstruction would be extremely expensive and model-dependent. Accept as primary state (the current vector is the operational source of truth).
- Verdict: Valid primary object.
Derived / Rejected Objects #
| Candidate | Problem | Resolution |
|---|---|---|
| TrendingScore | Derivable from ReadEvents in sliding window | Derived view. Computed by Flink windowing job; stored in Redis sorted set with TTL. Not primary state. |
| PersonalizedFeed | Derivable from Article × UserInterest × ranking model | Derived view (materialized). CQRS read model; rebuilt on UserInterest update. Not primary state. |
| SearchIndex | Projection of Article content into inverted index | Derived view. Elasticsearch index maintained via CDC from Postgres Article table. Not primary state. |
| CategoryFeed | Derivable from Article filtered by category | Derived view. Can be served by querying Postgres/Elasticsearch with category filter. Not materialized separately. |
| TopicFeed | Derivable from Article filtered by topic | Derived view. Same as CategoryFeed. |
Step 3 — Axis Assignment #
For every primary object, define ownership, evolution, and ordering.
Object: Source
Ownership: Multi-writer: admin sets registration config; pipeline updates last_crawled_at.
In practice: admin is the authority on URL/interval; pipeline is authority on crawl timestamps.
No contention between admin and pipeline (different fields, admin changes rare).
Evolution: Overwrite (current registration state is the truth; last_crawled_at replaced after each crawl)
Ordering: No meaningful order among sources
Object: CrawlSchedule
Ownership: System-only (scheduler writes next_crawl_at; crawl worker updates it after completion)
Evolution: Overwrite (one next_crawl_at per source_id; replaced after each crawl completes)
Ordering: No meaningful order among schedule entries; within a source, each entry supersedes the last
Object: Article
Ownership: Multi-writer pipeline (crawler writes base record; classifier pipeline writes category_labels;
clustering pipeline writes cluster_id; engagement pipeline writes engagement_count)
Each field has single-pipeline authority. No two pipelines write the same field.
Evolution: Content fields: append-only (set on first insert, never modified)
Metadata fields: overwrite (each pipeline overwrites its own field)
Ordering: Total order by published_at within source_id (for browsing)
Total order by ingested_at across all articles (for recency feed)
Object: ArticleFetch
Ownership: Single writer per partition (crawl worker writes; one worker per source)
Evolution: Append-only (immutable after creation)
Ordering: Total order by fetched_at within source_id
Object: StoryCluster
Ownership: System-only pipeline (clustering pipeline — MinHash/LSH job)
Evolution: Merge (articles join clusters; clusters may merge — commutative, associative operation)
Ordering: No meaningful order among cluster members; cluster itself ordered by article count desc for display
Object: ReadEvent
Ownership: Single writer (the requesting user, via API)
Evolution: Append-only (immutable)
Ordering: Total order by timestamp within user_id
Object: UserInterest
Ownership: System-only (interest-update pipeline writes after each ReadEvent batch)
Evolution: Overwrite (current interest vector replaces previous; EMA-based decay)
Ordering: No meaningful order (set of weighted topics)
Step 4 — Invariant Extraction #
Precise, testable properties that must hold in every valid execution.
I-1: Article URL Uniqueness (Uniqueness/Idempotency) #
For every valid system state, there is at most one Article record with any given url. If the same URL is fetched N times, exactly one Article record exists.
Scope: System-wide (url is globally unique)
Implication: Insert of Article must be INSERT ... ON CONFLICT (url) DO NOTHING or equivalent upsert. The URL is the natural idempotency key.
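A minimal sketch of this idempotent insert, using SQLite in place of Postgres (SQLite 3.24+ accepts the same ON CONFLICT upsert syntax); the two-column schema is illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE article (url TEXT PRIMARY KEY, title TEXT)")

def ingest(url, title):
    # The unique key on url absorbs retries: a second insert is a no-op.
    conn.execute(
        "INSERT INTO article (url, title) VALUES (?, ?) "
        "ON CONFLICT (url) DO NOTHING",
        (url, title),
    )

for _ in range(3):  # same URL fetched three times
    ingest("https://example.com/story", "Story headline")

count = conn.execute("SELECT COUNT(*) FROM article").fetchone()[0]
print(count)  # exactly one Article record exists
```

The URL acts as the natural idempotency key, so crawl retries need no coordination.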
I-2: Article Content Immutability (Ordering) #
Once an Article’s content_hash, title, body, and source_id fields are set, they are never modified. Only metadata fields (category_labels, cluster_id, engagement_count) may be updated after initial insert.
Scope: Per article (url)
Implication: Write path for content must be insert-once. Metadata update paths must not touch content columns.
I-3: CrawlSchedule Freshness (Eligibility) #
A source S must not be fetched again before next_crawl_at(S). Equivalently, any ArticleFetch for source S must have fetched_at >= CrawlSchedule.next_crawl_at for that source.
Scope: Per source_id
Implication: Scheduler must enforce the crawl interval. If a crawl is triggered early (e.g., by an operator), the CrawlSchedule must be updated atomically before the crawl fires.
I-4: Deduplication — Near-Duplicate Articles Map to One Cluster (Uniqueness) #
For any two articles A1 and A2 whose MinHash similarity exceeds threshold θ (= 0.85), they must share the same
cluster_id. No two articles covering the same real-world story should have different cluster_ids after the clustering pipeline completes.
Scope: Approximate — this is a best-effort invariant, not an atomic one; the clustering pipeline may run with some lag.
Implication: The invariant is probabilistic (MinHash is approximate). Accept: the system provides near-dedup with a bounded false-negative rate, not perfect dedup.
I-5: ReadEvent Idempotency (Uniqueness/Idempotency) #
If user U submits a read event for article A multiple times (due to retry or double-tap), at most one ReadEvent is recorded per (user_id, article_id, session_id).
Scope: Per (user_id, article_id, session_id)
Implication: ReadEvent insert must dedup by (user_id, article_id, session_id). Use session_id as the idempotency key.
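A sketch of consumer-side dedup on that composite key, using a plain in-memory set as a stand-in for the unique index / stream keyed state (all names illustrative):

```python
seen = set()
recorded = []

def record_read(user_id, article_id, session_id):
    # (user_id, article_id, session_id) is the idempotency key:
    # retries and double-taps within one session collapse to one event.
    key = (user_id, article_id, session_id)
    if key in seen:
        return False  # duplicate: dropped
    seen.add(key)
    recorded.append(key)
    return True

record_read("u1", "a9", "s1")  # first tap: recorded
record_read("u1", "a9", "s1")  # double-tap: dropped
record_read("u1", "a9", "s2")  # new session: recorded

print(len(recorded))  # 2
```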
I-6: TrendingScore Reflects Recent ReadEvents (Propagation) #
The TrendingScore for article A at time T reflects all ReadEvents for A in the window [T − W, T], where W = 1 hour. Staleness bound: ε ≤ 30 seconds.
Scope: Per article_id, per window
Implication: Stream aggregation must process ReadEvents within 30 seconds. The score is approximate (Count-Min Sketch acceptable).
I-7: PersonalizedFeed Reflects Current UserInterest (Propagation) #
The PersonalizedFeed for user U reflects the UserInterest vector as of at most 5 minutes ago. If UserInterest has not changed in 24 hours, the feed may be stale by up to 24 hours (no reads = no updates).
Scope: Per user_id
Implication: Feed materialization must be triggered within 5 minutes of a UserInterest update. TTL-based cache invalidation is acceptable.
I-8: SearchIndex Reflects Article Table (Propagation) #
The SearchIndex (Elasticsearch) contains every Article that has been in the Postgres Article table for more than 60 seconds. No Article is permanently absent from the SearchIndex once ingested.
Scope: System-wide
Implication: CDC from Postgres to Elasticsearch must have guaranteed delivery and must handle reindex failures with retry.
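The retry-plus-dead-letter behavior can be sketched with a stubbed indexing call (function and variable names are hypothetical, not Debezium or Elasticsearch APIs):

```python
dead_letter = []

def index_with_retry(doc, index_fn, max_attempts=3):
    # Retry transient failures; park persistent ones on the DLQ
    # so the CDC consumer can keep making progress.
    for _ in range(max_attempts):
        try:
            index_fn(doc)
            return True
        except ConnectionError:
            continue  # transient: retry (production adds backoff)
    dead_letter.append(doc)
    return False

calls = {"n": 0}
def flaky_index(doc):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("ES unavailable")  # recovers on third attempt

ok = index_with_retry({"url": "https://example.com/a"}, flaky_index)

def dead_index(doc):
    raise ConnectionError("ES unavailable")  # never recovers

ok2 = index_with_retry({"url": "https://example.com/b"}, dead_index)
print(ok, ok2, len(dead_letter))  # True False 1
```

The dead-letter queue keeps I-8's "no Article permanently absent" property checkable: parked documents can be replayed after the outage.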
I-9: UserInterest Updated on Every ReadEvent Batch (Accounting) #
The UserInterest vector for user U reflects all ReadEvents for U with timestamp ≤ T − 5 minutes. The interest model must process every ReadEvent exactly once (no duplicates, no drops).
Scope: Per user_id
Implication: Exactly-once processing of ReadEvents into UserInterest. Dedup by (user_id, article_id, session_id) before applying to the interest model.
I-10: Access Control — Users Read Own Data (Access Control) #
User U may read only their own ReadEvents, UserInterest, and PersonalizedFeed. Category and topic feeds are public. Article content is public. Trending scores are public.
Scope: Per user_id
Implication: All user-scoped read endpoints must authenticate and authorize by user_id. Shared/public endpoints require no user scoping.
Step 5 — DP Derivation #
Derive the minimal enforcing mechanism per invariant cluster.
| Invariant | Cluster | Minimal DP |
|---|---|---|
| I-1 (URL uniqueness) | Uniqueness | Unique index on Article.url in Postgres; INSERT ON CONFLICT DO NOTHING in crawl pipeline |
| I-2 (content immutability) | Ordering | Write path for content uses insert-only; metadata update queries restrict to metadata columns only (UPDATE article SET category_labels = ? WHERE url = ?) |
| I-3 (crawl freshness) | Eligibility | Scheduler reads CrawlSchedule.next_crawl_at; fires crawl only when now() >= next_crawl_at; updates next_crawl_at = now() + crawl_interval atomically after dispatch |
| I-4 (near-dedup clustering) | Uniqueness (approximate) | MinHash signatures computed on Article ingest; LSH lookup for candidate near-duplicates; cluster merge via pipeline-side union-find; cluster_id written back to Article |
| I-5 (ReadEvent idempotency) | Uniqueness/Idempotency | Unique index on ReadEvent(user_id, article_id, session_id); Kafka consumer uses event_id as dedup key before DB insert |
| I-6 (trending freshness) | Propagation + Accounting | Flink sliding window (1h / 30s slide); Count-Min Sketch per article; result written to Redis sorted set with TTL |
| I-7 (feed freshness) | Propagation | UserInterest update triggers feed recompute; Redis cache key feed:{user_id} with 5-minute TTL; feed service falls back to recompute on cache miss |
| I-8 (search freshness) | Propagation | Debezium CDC from Postgres Article table → Kafka topic → Elasticsearch sink; Kafka consumer retries on ES failure; dead-letter queue for persistent failures |
| I-9 (interest accounting) | Accounting + Idempotency | Flink job consumes ReadEvent Kafka topic; dedup state keyed by (user_id, article_id, session_id); exactly-once semantics with Flink checkpointing + Kafka transactional producer to UserInterest update topic |
| I-10 (access control) | Access Control | API gateway enforces JWT authentication; user-scoped endpoints validate user_id from token matches path parameter |
Step 6 — Mechanism Selection #
The mechanical bridge from invariants to concrete implementation decisions.
6.1 — Invariant Type → Mechanism Family #
| Invariant Type | Mechanism Family |
|---|---|
| Eligibility (I-3) | Optimistic scheduler with compare-and-update on next_crawl_at |
| Ordering (I-2) | Insert-once + column-restricted update paths |
| Accounting (I-6, I-9) | Stream aggregation (Flink) + windowing |
| Uniqueness/Idempotency (I-1, I-5) | Natural key unique index + ON CONFLICT; dedup by event_id in stream |
| Propagation (I-6, I-7, I-8) | CDC (Debezium) + cache TTL + stream aggregation |
| Access Control (I-10) | API gateway JWT validation |
6.2 — Ownership × Evolution → Concurrency Mechanism #
| Object | Ownership | Evolution | Concurrency Mechanism |
|---|---|---|---|
| Source | Multi-writer (admin + pipeline, different fields) | Overwrite | No coordination needed — fields are non-overlapping; each writer owns its columns |
| CrawlSchedule | System-only | Overwrite | Single writer per source_id partition → no coordination needed |
| Article (content) | Pipeline only | Append-only (insert once) | Unique index enforces single-insert; ON CONFLICT DO NOTHING for retries |
| Article (metadata) | Multi-pipeline, each field owned by one pipeline | Overwrite | No coordination needed — each pipeline owns distinct columns |
| ArticleFetch | Single writer per source | Append-only | No contention mechanism needed; dedup required (event_id as idempotency key) |
| StoryCluster | System-only pipeline | Merge (commutative) | CRDT-compatible: union-find cluster merge is commutative and associative; pipeline applies merges without coordination |
| ReadEvent | Single writer (user) | Append-only | No contention; dedup by (user_id, article_id, session_id) via unique index |
| UserInterest | System-only pipeline | Overwrite | Single writer per user_id partition → no coordination needed |
6.3 — Q1-Q5 Full Derivation for Key Flows #
Flow 1: Article Deduplication and Story Clustering #
Q1 — Scope: The dedup check (URL uniqueness) is within-service (single Postgres instance per shard). Near-duplicate clustering (StoryCluster) is within-service but batched.
Q2 — Failure: CrawlWorker may crash after fetching but before writing to Postgres. The ArticleFetch event is published to Kafka first; Postgres write is downstream. At-least-once delivery from Kafka → dedup by URL (Postgres unique index absorbs retries). Slow degradation (Postgres down): Circuit Breaker on Postgres write path; crawler buffers in Kafka.
Q3 — Data: URL is content-addressed (URL is the natural hash = idempotency key for free). Near-duplicate content: MinHash signature is a commutative, order-independent summary of the article’s tokens → stream aggregation (signature computed at ingest; LSH lookup in Redis or in-memory index).
Q4 — Access: Dedup is a write-path concern, not a read concern.
Q5 — Coupling: CrawlWorker publishes ArticleFetch event to Kafka (Outbox-equivalent pattern: Kafka acts as the durable log). Parse-and-dedup pipeline consumes from Kafka → writes to Postgres. Kafka decouples the crawler from the database.
Mechanism selected: Pipeline pattern (CrawlWorker → Kafka → ParseWorker → Postgres). URL dedup via INSERT ON CONFLICT (url) DO NOTHING. Near-duplicate clustering via MinHash + LSH: compute 128-band MinHash signatures at parse time, store in Redis sorted set indexed by LSH band, look up candidates, compute Jaccard similarity, assign or create cluster via pipeline-side union-find, write cluster_id back to Article in a separate update.
Dedup + stream aggregation always paired: ArticleFetch events on Kafka are consumed with at-least-once delivery (Kafka consumer group with committed offsets). Before the Postgres insert, the URL unique index absorbs duplicates; event_id dedup in the consumer prevents double-processing on consumer restart, giving effectively-once processing.
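A toy sketch of the near-duplicate path — MinHash signatures, LSH banding, union-find merge — with deliberately small parameters (16 hash functions in 8 bands of 2 rows, versus the 128-band production signature above); the hash family and thresholding are simplified, and the article texts are illustrative:

```python
import hashlib
from collections import defaultdict

NUM_HASHES = 16  # toy signature size
BANDS = 8        # 8 bands x 2 rows each

def tokens(text):
    return set(text.lower().split())

def minhash(toks):
    # Slot i of the signature: min over token hashes under hash function i.
    return [
        min(int(hashlib.sha1(f"{i}:{t}".encode()).hexdigest(), 16) for t in toks)
        for i in range(NUM_HASHES)
    ]

# Union-find: cluster merges are commutative and associative.
parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

buckets = defaultdict(list)  # LSH band key -> article ids seen in that band

def ingest(article_id, text):
    sig = minhash(tokens(text))
    rows = NUM_HASHES // BANDS
    for b in range(BANDS):
        band_key = (b, tuple(sig[b * rows:(b + 1) * rows]))
        for candidate in buckets[band_key]:
            union(article_id, candidate)  # shared band => likely near-duplicate
        buckets[band_key].append(article_id)

ingest("cnn-1", "plane crash in mountains rescue underway tonight")
ingest("bbc-7", "plane crash in mountains rescue underway tonight update")
ingest("cooking-3", "ten easy pasta recipes for weeknight dinners")

print(find("cnn-1") == find("bbc-7"))      # near-duplicates share a cluster
print(find("cnn-1") == find("cooking-3"))  # unrelated article does not
```

Because union is order-independent, the clustering pipeline can apply merges without coordination, matching the merge-evolution classification of StoryCluster.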
Flow 2: Trending Score Computation #
Q1 — Scope: TrendingScore is a derived view. The source of truth is the ReadEvent stream. Trending is global (not per-user). Within-service scope (Flink job consuming from Kafka).
Q2 — Failure: Flink job crashes → restores from checkpoint; ReadEvents are replayed from Kafka offset. At-least-once delivery → dedup by event_id in Flink state before counting. Slow degradation: Count-Min Sketch is naturally tolerant of small error; trending score degrades gracefully.
Q3 — Data: ReadEvents are time-windowed (1-hour sliding window with 30-second slide). Pattern: Windowing. High cardinality (millions of articles): Count-Min Sketch per window for approximate counting. Temporal Decay: older reads within the window count equally (sliding window handles this); could add exponential decay weight for recency bias.
Q4 — Access: Read » write for trending (millions of reads, one write path per 30s slide). Pattern: Cache-Aside. Trending scores are written to Redis sorted set (trending:{window_start_epoch}) with TTL = 2 * window_slide = 60s. API reads from Redis. On Redis miss, fall back to Flink-managed state (or return stale score).
Q5 — Coupling: ReadEvent → Kafka → Flink (sliding window job) → Redis. Flink writes to Redis asynchronously every slide interval. No synchronous coupling between read event and trending update.
Mechanism selected: Flink sliding window (1 hour / 30 second slide). Count-Min Sketch per article per window (width=2048, depth=5 — error ε=0.001, probability δ=0.03). Result materialized to Redis sorted set ZADD trending:{epoch} {score} {article_id} with EXPIRE 60. Stream aggregation + dedup per event (Flink keyed state with event_id dedup; TTL 2 hours to bound state size).
Required combination: stream aggregation is always paired with per-event dedup — per framework rule 6.4. Flink maintains a Set&lt;event_id&gt; per article_id in keyed state, with a 2-hour TTL to bound state size, for dedup before counting.
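A minimal Count-Min Sketch illustrating the graceful-degradation claim: hash collisions can only inflate a count, never deflate it, so the estimate never under-counts (toy width/depth here, not the production 2048×5):

```python
import hashlib

class CountMinSketch:
    def __init__(self, width=64, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _slots(self, key):
        for d in range(self.depth):
            h = hashlib.sha1(f"{d}:{key}".encode()).hexdigest()
            yield d, int(h, 16) % self.width

    def add(self, key, count=1):
        for d, i in self._slots(key):
            self.table[d][i] += count

    def estimate(self, key):
        # Min across rows: collisions only inflate, so this upper-bounds
        # the true count while using O(width * depth) memory total.
        return min(self.table[d][i] for d, i in self._slots(key))

cms = CountMinSketch()
for _ in range(500):
    cms.add("article-42")    # hot article: 500 reads in the window
for a in range(100):
    cms.add(f"article-{a}")  # long tail: 1 read each

print(cms.estimate("article-42") >= 500)  # True: never under-counts
```

Memory is independent of the number of distinct articles, which is why the sketch fits millions of article_ids per window.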
Flow 3: Personalized Feed Generation #
Q1 — Scope: PersonalizedFeed is per-user. UserInterest is per-user. Article metadata is global. Cross-service: UserInterest update (stream pipeline) triggers feed recompute (feed service). Scope is cross-service → Saga pattern for coordination? No — feed recompute is not a transaction; it’s an async triggered computation. Pattern: Outbox + Relay (UserInterest update → event on Kafka → feed recompute service).
Q2 — Failure: UserInterest update fails → feed uses stale interest vector (acceptable per I-7: 5-minute staleness bound). Feed recompute service crashes → cache TTL expires → next request triggers recompute. No data loss.
Q3 — Data: Read shape ≠ write shape: write model is ReadEvent stream; read model is ranked list of articles. Pattern: CQRS. Write side: Flink consumes ReadEvents, updates UserInterest vector (EMA update). Read side: Feed service reads UserInterest + Article metadata, computes ranked list, caches in Redis. Materialized View: pre-computed feed stored in Redis. Time-bounded: TTL 5 minutes on cached feed.
Q4 — Access: Read » write. Millions of feed reads per second; one interest update per user per session. Pattern: CQRS + Materialized View. Feed reads hit Redis cache. Cache miss triggers synchronous recompute (top-K articles by dot product with interest vector, from candidate set fetched from Elasticsearch/Postgres by recency). Rebuild path: on UserInterest update, invalidate cache key feed:{user_id} and async-enqueue recompute job.
Q5 — Coupling: UserInterest update is written to Postgres AND publishes event to Kafka (CDC via Debezium or explicit outbox pattern). Feed recompute service consumes from Kafka, recomputes feed, writes to Redis. Async decoupling between interest update and feed availability.
Circuit topology: The personalized feed is a closed-loop feedback system:
- Plant: Article ranking function (dot product of UserInterest vector with Article topic vector)
- Feedback signal: ReadEvent stream (user reads article A → increases weight of A’s topics in UserInterest)
- Error signal: Difference between predicted interest (current interest vector) and observed behavior (what user actually reads)
- Controller: EMA interest update (learning rate α = 0.1, decay γ = 0.95 per day)
- Stability: The system is stable because α < 1 and γ < 1. Without decay, the interest vector could be dominated by old readings (integrator windup). Decay is the RC filter preventing windup.
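The stability claim is easy to check numerically — a sketch of the EMA update with the α and γ above, driving the loop with one identical read per day and confirming the topic weight converges to a bound instead of winding up (topic names and vectors are illustrative):

```python
ALPHA = 0.1   # learning rate per read
GAMMA = 0.95  # decay factor applied once per day

def apply_read(interest, article_topics):
    # EMA update: blend the read article's topic vector into the interest vector.
    for topic, weight in article_topics.items():
        interest[topic] = (1 - ALPHA) * interest.get(topic, 0.0) + ALPHA * weight

def apply_daily_decay(interest):
    # Decay is the "RC filter": without it, weights would integrate toward 1
    # and old readings would dominate forever (integrator windup).
    for topic in interest:
        interest[topic] *= GAMMA

interest = {}
for day in range(365):
    apply_read(interest, {"politics": 1.0})  # one political read, every day
    apply_daily_decay(interest)

# Fixed point: x = ((1 - a)x + a) * g  =>  x = a*g / (1 - (1 - a)*g) ~= 0.655
print(round(interest["politics"], 3))  # 0.655
```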
Mechanism selected: CQRS + Materialized View + Cache-Aside. Write path: Flink interest-update job (ReadEvent → EMA update → Postgres UserInterest). Read path: Feed service reads Redis cache; miss → compute top-K via ANN (Approximate Nearest Neighbor) search of Article topic vectors against UserInterest vector using FAISS or Elasticsearch KNN; write result to Redis with 5-minute TTL. Rebuild path on cache miss or UserInterest update.
Required combination (6.4): CQRS + Materialized View requires explicit rebuild path — implemented as: (a) cache TTL expiry triggers lazy rebuild on next request, (b) UserInterest update event triggers proactive cache invalidation and async rebuild via Kafka consumer.
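A sketch of the read path's ranking step on a cache miss — exact top-K by dot product, as a simplified stand-in for the FAISS/Elasticsearch KNN search (article IDs and topic vectors are illustrative):

```python
import heapq

def dot(u, v):
    # Sparse dot product over topic weights.
    return sum(u.get(topic, 0.0) * w for topic, w in v.items())

def rank_feed(interest, candidates, k=2):
    # candidates: recent articles with their topic vectors
    # (e.g., a recency-bounded candidate set from Elasticsearch/Postgres).
    scored = ((dot(interest, topics), aid) for aid, topics in candidates.items())
    return [aid for _, aid in heapq.nlargest(k, scored)]

interest = {"tech": 0.8, "politics": 0.3}
candidates = {
    "a1": {"tech": 0.9},                   # score 0.72
    "a2": {"politics": 1.0},               # score 0.30
    "a3": {"sports": 1.0},                 # score 0.00
    "a4": {"tech": 0.5, "politics": 0.5},  # score 0.55
}
print(rank_feed(interest, candidates))  # ['a1', 'a4']
```

The result is what gets written to the Redis feed cache with the 5-minute TTL.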
Flow 4: Crawl Scheduling (The Scheduler Pattern) #
Q1 — Scope: CrawlSchedule is within-service. No cross-service coordination needed. One scheduler process per region.
Q2 — Failure: Scheduler crashes → on restart, reads all CrawlSchedule rows with next_crawl_at <= now() and re-fires pending crawls. At-least-once crawl delivery → ArticleFetch dedup by (source_id, fetched_at window) prevents duplicate article processing. Slow degradation: if crawl workers are overwhelmed, scheduler backs off via exponential backoff on next_crawl_at.
Q3 — Data: CrawlSchedule is an intent/future constraint. The pattern: scheduled jobs are a special class of state where the intent (crawl this source every N minutes) must persist across restarts. The scheduler is a polling loop: SELECT * FROM crawl_schedule WHERE next_crawl_at <= NOW() AND status = 'PENDING' LIMIT 100 FOR UPDATE SKIP LOCKED.
FOR UPDATE SKIP LOCKED is the critical mechanism: allows multiple scheduler workers to claim rows without contention. Each worker claims a batch, fires crawl jobs, updates next_crawl_at = NOW() + crawl_interval.
Q4 — Access: Scheduler is internal only. No external read concern.
Q5 — Coupling: Scheduler writes CrawlTrigger event to Kafka. CrawlWorker pool consumes from Kafka, fetches source, publishes ArticleFetch event. This decouples scheduler throughput from crawl worker throughput.
Mechanism selected: Database-backed scheduler using FOR UPDATE SKIP LOCKED on crawl_schedule table. Scheduler daemon runs in loop: poll every 10 seconds, claim batch of due sources, publish to Kafka crawl-triggers topic, update next_crawl_at. Multiple scheduler instances are safe (SKIP LOCKED prevents double-claiming). CrawlWorker pool (auto-scaled) consumes crawl-triggers, fetches source, publishes to article-fetches Kafka topic.
No contention mechanism needed at the crawl worker level (pipeline ownership — system-only writes, single writer per source_id because crawl_schedule row is locked before dispatch). Dedup required: ArticleFetch events carry fetch_id = sha256(source_id + fetched_at_epoch_bucket) as idempotency key.
Step 7 — Axiomatic Validation #
Source-of-truth table: no dual truth.
| State | Primary Store | Derived Stores | Allowed Staleness | Rebuild Mechanism |
|---|---|---|---|---|
| Source configuration | Postgres source table | None | — | — |
| CrawlSchedule | Postgres crawl_schedule table | None | — | — |
| Article content (immutable) | Postgres article table | S3 (raw HTML archive) | S3 is archive, not truth | — |
| Article metadata (category_labels, cluster_id) | Postgres article table | SearchIndex (ES), CategoryFeed cache | ES: 60s; Feed cache: 5min | CDC replay; cache TTL |
| ArticleFetch (raw HTML) | S3 (primary archive) + Kafka (transient) | None | Kafka is transit, not truth | S3 is the archive |
| StoryCluster membership | Postgres story_cluster + article.cluster_id | None | — | Re-run clustering pipeline on Article table |
| ReadEvent | Postgres read_event table (partitioned) | Kafka (transit) | Kafka is transit | — |
| UserInterest | Postgres user_interest table | Redis feed cache | 5 min | Recompute from ReadEvent log |
| TrendingScore | Redis sorted set (TTL 60s) | None (ephemeral) | 30s | Recompute by replaying ReadEvent stream from Kafka |
| PersonalizedFeed | Redis cache (TTL 5min) | None | 5 min | Recompute from UserInterest + Article table |
| SearchIndex | Elasticsearch | None (ES is the index) | 60s lag from Postgres | Full reindex from Postgres |
Dual-truth risks identified and mitigated:
Risk: Article metadata in Postgres vs. Elasticsearch could diverge. Mitigation: ES is explicitly classified as a derived projection. CDC (Debezium) guarantees eventual delivery. ES is never written directly — only via CDC pipeline.
Risk: UserInterest could be dual-written by two concurrent interest-update pipeline workers for the same user. Mitigation: Flink partitions by user_id — exactly one Flink task per user_id; no concurrent writes. Postgres update uses WHERE user_id = ? AND version = ? (optimistic lock) for safety.
Risk: TrendingScore in Redis vs. raw ReadEvent count in Postgres. Mitigation: TrendingScore is explicitly a derived view; it is not reconciled against Postgres ReadEvent counts in real time. The Count-Min Sketch is the operational truth for trending; accuracy is bounded by sketch error parameters.
Step 8 — Algorithm Design #
Pseudocode for every write path and state machine.
8.1 — Crawl Pipeline (Scheduler → Fetch → Parse → Ingest) #
// Scheduler daemon (runs every 10 seconds)
function scheduler_tick():
BEGIN TRANSACTION
sources = SELECT * FROM crawl_schedule
WHERE next_crawl_at <= NOW()
AND status = 'PENDING'
ORDER BY next_crawl_at ASC
LIMIT 100
FOR UPDATE SKIP LOCKED
for each source in sources:
trigger = {
source_id: source.source_id,
url: source.url,
fetch_type: source.feed_type, // RSS | API | SCRAPE
fetch_id: sha256(source.source_id + floor(now() / 60)) // idempotency key
}
kafka_publish('crawl-triggers', key=source.source_id, value=trigger)
UPDATE crawl_schedule
SET status = 'IN_FLIGHT', last_triggered_at = NOW()
WHERE source_id = source.source_id
COMMIT
// CrawlWorker (Kafka consumer, auto-scaled pool)
function crawl_worker(trigger):
// Idempotency check
if redis_setnx('crawl:in-flight:' + trigger.fetch_id, 1, TTL=600):
raw_content = http_fetch(trigger.url) // with timeout 30s, Circuit Breaker
s3_key = 'raw/' + trigger.source_id + '/' + trigger.fetch_id + '.html'
s3_put(s3_key, raw_content)
fetch_event = {
fetch_id: trigger.fetch_id,
source_id: trigger.source_id,
fetched_at: now(),
s3_key: s3_key,
http_status: raw_content.status,
content_length: len(raw_content.body)
}
kafka_publish('article-fetches', key=trigger.source_id, value=fetch_event)
    // Update crawl schedule (crawl_interval lives on the source table)
    UPDATE crawl_schedule
    SET status = 'PENDING',
        next_crawl_at = NOW() + (SELECT crawl_interval FROM source
                                 WHERE source_id = trigger.source_id),
        last_crawled_at = NOW()
    WHERE source_id = trigger.source_id
// ParseWorker (Kafka consumer)
function parse_worker(fetch_event):
if fetch_event.http_status != 200:
emit_metric('crawl.fetch_error', {source_id: fetch_event.source_id})
return
raw_html = s3_get(fetch_event.s3_key)
articles_raw = parse_feed(raw_html, fetch_event.source_id) // RSS/Atom/HTML
for each raw_article in articles_raw:
url = normalize_url(raw_article.url)
content_hash = sha256(raw_article.title + raw_article.body)
// URL uniqueness enforced by DB
result = INSERT INTO article (url, title, body, content_hash, published_at, source_id, ingested_at)
VALUES (url, ...)
ON CONFLICT (url) DO NOTHING
RETURNING article_id
if result.rows_affected > 0:
// New article — enqueue for classification and clustering
kafka_publish('new-articles', key=url, value={article_id, content_hash, url})
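parse_worker relies on normalize_url to turn each article URL into a stable dedup key. A minimal sketch, under the assumption that normalization means lowercasing the host, dropping the fragment, stripping common tracking parameters, and trimming trailing slashes (the exact rule set is illustrative; the document does not specify it):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Illustrative blocklist; a production list would be larger and curated.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "utm_term",
                   "utm_content", "fbclid", "gclid"}

def normalize_url(url: str) -> str:
    parts = urlsplit(url.strip())
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k not in TRACKING_PARAMS]
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path.rstrip("/") or "/",
        urlencode(query),
        "",  # drop the fragment — it never changes the document
    ))
```

The point of the exercise: two syndicated links to the same article (one with `utm_*` noise, one without) must collapse to one key, or the ON CONFLICT (url) dedup silently stops working.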
8.2 — Article Classification Pipeline #
// ClassifierWorker (Kafka consumer, reads 'new-articles')
function classify_worker(new_article_event):
article = fetch_article_content(new_article_event.article_id)
category_labels = ml_classify(article.title + ' ' + article.body[:2000])
// Returns list of (category, confidence) pairs
UPDATE article
SET category_labels = category_labels,
classified_at = NOW()
WHERE article_id = new_article_event.article_id
kafka_publish('classified-articles', key=new_article_event.article_id, value={
article_id, category_labels
})
8.3 — Story Clustering (Near-Duplicate Detection) #
// ClusterWorker (Kafka consumer, reads 'classified-articles')
function cluster_worker(classified_event):
article = fetch_article(classified_event.article_id)
// Step 1: Compute MinHash signature
tokens = shingle(article.title + ' ' + article.body, k=3) // 3-gram shingles
signature = minhash(tokens, num_hashes=128)
// Step 2: LSH candidate lookup
// Divide 128 hashes into 16 bands of 8 hashes each
// Each band → one LSH bucket key
candidates = []
for band_idx in range(16):
band = signature[band_idx*8 : (band_idx+1)*8]
bucket_key = 'lsh:band:' + band_idx + ':' + hash(band)
bucket_members = redis_smembers(bucket_key)
candidates.extend(bucket_members)
redis_sadd(bucket_key, article.article_id)
redis_expire(bucket_key, 7 * 86400) // 7-day TTL
// Step 3: Compute exact Jaccard similarity for candidates
// (to filter LSH false positives)
best_match = None
best_similarity = 0
for candidate_id in set(candidates):
candidate_sig = get_minhash_signature(candidate_id) // from Redis or DB
similarity = jaccard_estimate(signature, candidate_sig)
if similarity > 0.85 and similarity > best_similarity:
best_match = candidate_id
best_similarity = similarity
// Step 4: Assign cluster
if best_match is None:
// Create new cluster — this article is the canonical
new_cluster_id = uuid()
INSERT INTO story_cluster (cluster_id, canonical_article_id, created_at)
VALUES (new_cluster_id, article.article_id, NOW())
cluster_id = new_cluster_id
else:
// Join existing cluster
cluster_id = get_cluster_id(best_match)
// If best_match has no cluster (race condition), create one
if cluster_id is None:
cluster_id = uuid()
INSERT INTO story_cluster ... (same as above for best_match)
UPDATE article SET cluster_id = cluster_id WHERE article_id = article.article_id
store_minhash_signature(article.article_id, signature) // Redis, TTL 7 days
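The MinHash/LSH steps above condense into a short runnable sketch. This version uses word-level shingles and a seeded SHA-1 hash family for brevity; the production pipeline's character shingling and hash family may differ:

```python
import hashlib
from typing import List, Set

NUM_HASHES = 128
BANDS, ROWS = 16, 8   # 16 bands x 8 rows = 128 hashes, as in the pseudocode

def shingle(text: str, k: int = 3) -> Set[str]:
    # Word-level k-shingles (an assumption; character shingles also work).
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash(tokens: Set[str], num_hashes: int = NUM_HASHES) -> List[int]:
    # Seeded SHA-1 as the hash family; min over tokens for each seed.
    return [min(int.from_bytes(hashlib.sha1(f"{seed}:{t}".encode()).digest()[:8],
                               "big") for t in tokens)
            for seed in range(num_hashes)]

def band_keys(sig: List[int]) -> List[str]:
    # One LSH bucket key per band; equal band -> same bucket -> candidate pair.
    return [f"lsh:band:{b}:{hash(tuple(sig[b * ROWS:(b + 1) * ROWS]))}"
            for b in range(BANDS)]

def jaccard_estimate(sig_a: List[int], sig_b: List[int]) -> float:
    # Fraction of matching minhash positions estimates Jaccard similarity.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Why these band parameters: with b = 16 bands of r = 8 rows, a pair with true Jaccard similarity s becomes a candidate with probability 1 − (1 − s⁸)¹⁶, which is roughly 99% at the 0.85 threshold — so the exact-similarity filter in Step 3 mostly prunes false positives rather than rescuing misses.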
8.4 — ReadEvent Recording and Interest Update #
// API handler: POST /articles/{article_id}/read
function record_read(user_id, article_id, session_id):
event_id = sha256(user_id + article_id + session_id)
// Idempotency: unique index on (user_id, article_id, session_id)
result = INSERT INTO read_event (event_id, user_id, article_id, session_id, timestamp)
VALUES (event_id, user_id, article_id, session_id, NOW())
ON CONFLICT (user_id, article_id, session_id) DO NOTHING
if result.rows_affected > 0:
// Publish to Kafka for downstream consumers
kafka_publish('read-events', key=user_id, value={
event_id, user_id, article_id, session_id, timestamp: NOW()
})
return OK
// Flink job: interest-update-pipeline
// Input: 'read-events' Kafka topic, keyed by user_id
// State: per user_id — Set<event_id> (dedup, TTL 48h), current interest vector
function interest_update_flink(read_event):
// Dedup
if dedup_state.contains(read_event.event_id):
return // duplicate, skip
dedup_state.add(read_event.event_id)
dedup_state.expire(read_event.event_id, TTL=48h)
// Fetch article topic vector
article_topics = fetch_article_topics(read_event.article_id)
// Returns: {sports: 0.8, politics: 0.1, tech: 0.05, ...}
// EMA update of interest vector
current_interest = get_user_interest(read_event.user_id)
// Apply decay for time since last update
time_delta_days = (NOW() - current_interest.last_updated) / 86400
decay = pow(0.95, time_delta_days)
new_interest = {}
for topic in all_topics:
new_interest[topic] = decay * current_interest.vector.get(topic, 0.0)
+ 0.1 * article_topics.get(topic, 0.0)
// Normalize to unit vector
norm = sqrt(sum(v^2 for v in new_interest.values()))
new_interest = {k: v/norm for k, v in new_interest.items()}
// Write to Postgres (optimistic lock)
rows = UPDATE user_interest
SET interest_vector = new_interest,
last_updated = NOW(),
version = version + 1
WHERE user_id = read_event.user_id
AND version = current_interest.version
if rows == 0:
// Conflict — another update raced; re-read and retry
retry()
// Invalidate feed cache
redis_del('feed:' + read_event.user_id)
kafka_publish('interest-updated', key=read_event.user_id, value={user_id, timestamp})
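The decay-plus-EMA update is easy to get subtly wrong (decay must apply before blending, and renormalization comes last). A minimal sketch of the pure math, using the Step 16 defaults (α = 0.10, γ = 0.95/day):

```python
import math
from typing import Dict

ALPHA = 0.10   # EMA learning rate (Step 16 default)
GAMMA = 0.95   # per-day interest decay (Step 16 default)

def update_interest(current: Dict[str, float], days_since_update: float,
                    article_topics: Dict[str, float]) -> Dict[str, float]:
    # 1. Decay the old vector for elapsed time, 2. blend in the new article's
    # topic weights, 3. renormalize to a unit vector.
    decay = GAMMA ** days_since_update
    topics = set(current) | set(article_topics)
    raw = {t: decay * current.get(t, 0.0) + ALPHA * article_topics.get(t, 0.0)
           for t in topics}
    norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0  # guard zero vector
    return {t: v / norm for t, v in raw.items()}
```

Because the output is always renormalized, α controls only the *relative* pull toward the new article; absolute interest magnitude carries no meaning.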
8.5 — Trending Score Computation (Flink Sliding Window) #
// Flink job: trending-pipeline
// Input: 'read-events' Kafka topic
// Window: SlidingEventTimeWindows(1 hour, 30 seconds)
// Output: Redis sorted set per window
function trending_flink_window(article_id, window_events):
// Count-Min Sketch (width=2048, depth=5)
sketch = CountMinSketch(width=2048, depth=5)
for event in window_events:
if not dedup_state.contains(event.event_id):
sketch.increment(event.article_id)
dedup_state.add(event.event_id)
count = sketch.estimate(article_id)
window_key = 'trending:' + floor(window.end / 30) * 30
redis_zadd(window_key, count, article_id)
  redis_expire(window_key, 60) // TTL = 2x the 30s slide interval (matches Step 7 table)
// API: GET /trending?limit=50
function get_trending(limit):
window_key = 'trending:' + current_window_epoch()
results = redis_zrevrange(window_key, 0, limit-1, withscores=True)
return results
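A self-contained Count-Min Sketch matching the 2048×5 parameters above, using seeded SHA-256 as the hash family (an illustrative choice; any pairwise-independent family works):

```python
import hashlib

class CountMinSketch:
    """Width/depth match the trending pipeline defaults (2048 x 5)."""

    def __init__(self, width: int = 2048, depth: int = 5):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row: int, key: str) -> int:
        digest = hashlib.sha256(f"{row}:{key}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def increment(self, key: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][self._index(row, key)] += count

    def estimate(self, key: str) -> int:
        # Min over rows never undercounts; overcounting requires the key to
        # collide with other keys in every one of the depth rows.
        return min(self.table[row][self._index(row, key)]
                   for row in range(self.depth))
```

This one-sided error is exactly why the sketch suits trending: a viral article's count is never underestimated, and the rare overestimate only promotes a slightly-less-hot article.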
8.6 — Personalized Feed Generation #
// Feed service: GET /feed/{user_id}
function get_feed(user_id, limit=50):
cache_key = 'feed:' + user_id
cached = redis_get(cache_key)
if cached:
return cached
// Cache miss — recompute
interest = fetch_user_interest(user_id)
// interest.vector = {topic: weight, ...}
// Candidate retrieval: fetch recent articles from top-interest categories
top_topics = top_k(interest.vector, k=5)
candidate_articles = []
for topic in top_topics:
// Query Elasticsearch for recent articles in this topic
results = es_search({
filter: {category: topic, published_at: {gte: NOW() - 48h}},
sort: [{published_at: desc}],
size: 100
})
candidate_articles.extend(results)
// Deduplicate candidates by article_id
candidates = deduplicate(candidate_articles, key='article_id')
// Score each candidate: dot product of interest vector with article topic vector
scored = []
for article in candidates:
relevance = dot_product(interest.vector, article.topic_vector)
recency_decay = exp(-0.5 * hours_since(article.published_at))
trending_boost = get_trending_score(article.article_id) / 1000.0
score = relevance * recency_decay + 0.1 * trending_boost
scored.append((score, article))
// Sort by score, take top limit
feed = sort_desc(scored, key=score)[:limit]
// Cache result
redis_setex(cache_key, 300, serialize(feed)) // 5-minute TTL
return feed
// Background job: triggered by 'interest-updated' Kafka events
function proactive_feed_rebuild(user_id):
redis_del('feed:' + user_id) // Invalidate cache
get_feed(user_id) // Rebuild and warm cache
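The scoring formula inside get_feed is worth isolating so it can be unit-tested independently of Redis and Elasticsearch. A sketch using the weights from the pseudocode (relevance × recency decay, plus a 0.1-weighted trending boost; the candidate dict shape is an assumption for illustration):

```python
import math
from typing import Dict, List

def score_article(interest: Dict[str, float], topic_vector: Dict[str, float],
                  hours_since_published: float, trending_score: float) -> float:
    # relevance * recency_decay + 0.1 * trending_boost, as in get_feed
    relevance = sum(interest.get(t, 0.0) * w for t, w in topic_vector.items())
    recency_decay = math.exp(-0.5 * hours_since_published)
    trending_boost = trending_score / 1000.0
    return relevance * recency_decay + 0.1 * trending_boost

def rank_feed(interest: Dict[str, float], candidates: List[dict],
              limit: int = 50) -> List[dict]:
    return sorted(candidates,
                  key=lambda a: score_article(interest, a["topic_vector"],
                                              a["hours"], a.get("trending", 0.0)),
                  reverse=True)[:limit]
```

Note the half-life implied by exp(−0.5 · hours): an article loses roughly half its relevance weight every ~1.4 hours, which keeps even a strong interest match from pinning stale stories to the top.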
8.7 — Search Index (CDC Pipeline) #
// Debezium CDC configuration:
// Source: Postgres article table
// Sink: Kafka topic 'article-changes'
// Event types: INSERT, UPDATE (only for metadata fields)
// Elasticsearch sink consumer
function es_indexer(article_change_event):
if article_change_event.op == 'INSERT':
es_index({
index: 'articles',
id: article_change_event.url,
body: {
article_id: article_change_event.article_id,
title: article_change_event.title,
body_excerpt: article_change_event.body[:500],
category_labels: article_change_event.category_labels,
published_at: article_change_event.published_at,
source_id: article_change_event.source_id,
cluster_id: article_change_event.cluster_id
}
})
elif article_change_event.op == 'UPDATE':
// Only index metadata fields — content is immutable
es_update({
index: 'articles',
id: article_change_event.url,
doc: {
category_labels: article_change_event.category_labels,
cluster_id: article_change_event.cluster_id,
engagement_count: article_change_event.engagement_count
}
})
// On failure: publish to dead-letter topic, retry with exponential backoff
// On persistent failure: alert on-call, manual reindex via:
// reindex_from_postgres(article_id_range)
Step 9 — Logical Data Model #
Schema with partition keys derived from invariant scope.
Core Tables (Postgres) #
-- Source: registration of crawlable feeds
CREATE TABLE source (
source_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
url TEXT NOT NULL UNIQUE,
feed_type TEXT NOT NULL CHECK (feed_type IN ('RSS', 'ATOM', 'API', 'SCRAPE')),
crawl_interval INTERVAL NOT NULL DEFAULT '15 minutes',
last_crawled_at TIMESTAMPTZ,
status TEXT NOT NULL DEFAULT 'ACTIVE' CHECK (status IN ('ACTIVE', 'PAUSED', 'DEAD')),
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
-- CrawlSchedule: intent/future constraint for when to next crawl
CREATE TABLE crawl_schedule (
source_id UUID PRIMARY KEY REFERENCES source(source_id),
next_crawl_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
status TEXT NOT NULL DEFAULT 'PENDING' CHECK (status IN ('PENDING', 'IN_FLIGHT')),
last_triggered_at TIMESTAMPTZ
);
CREATE INDEX idx_crawl_schedule_next ON crawl_schedule (next_crawl_at) WHERE status = 'PENDING';
-- StoryCluster: near-duplicate story groups
CREATE TABLE story_cluster (
cluster_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
canonical_article_id UUID, -- FK set after Article insert
article_count INT NOT NULL DEFAULT 1,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
-- Article: primary entity, normalized from raw fetches
CREATE TABLE article (
article_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
url TEXT NOT NULL UNIQUE, -- natural dedup key, unique index
title TEXT NOT NULL,
body TEXT NOT NULL,
content_hash BYTEA NOT NULL, -- sha256(title + body)
published_at TIMESTAMPTZ NOT NULL,
source_id UUID NOT NULL REFERENCES source(source_id),
ingested_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
-- Mutable metadata (set by downstream pipelines)
category_labels JSONB, -- [{category, confidence}]
topic_vector JSONB, -- {topic: weight, ...} normalized
cluster_id UUID REFERENCES story_cluster(cluster_id),
engagement_count BIGINT NOT NULL DEFAULT 0,
classified_at TIMESTAMPTZ,
clustered_at TIMESTAMPTZ
) PARTITION BY RANGE (ingested_at);
-- Partition by month for retention management
CREATE TABLE article_2025_01 PARTITION OF article FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');
-- ... create partitions monthly via automation
CREATE INDEX idx_article_source_published ON article (source_id, published_at DESC);
CREATE INDEX idx_article_category_published ON article USING GIN (category_labels) WHERE classified_at IS NOT NULL;
CREATE INDEX idx_article_cluster ON article (cluster_id);
CREATE INDEX idx_article_ingested ON article (ingested_at DESC);
-- ReadEvent: immutable event log
CREATE TABLE read_event (
event_id BYTEA PRIMARY KEY, -- sha256(user_id + article_id + session_id)
user_id UUID NOT NULL,
article_id UUID NOT NULL REFERENCES article(article_id),
session_id UUID NOT NULL,
timestamp TIMESTAMPTZ NOT NULL DEFAULT NOW()
) PARTITION BY RANGE (timestamp);
CREATE UNIQUE INDEX idx_read_event_dedup ON read_event (user_id, article_id, session_id);
CREATE INDEX idx_read_event_user ON read_event (user_id, timestamp DESC);
CREATE INDEX idx_read_event_article ON read_event (article_id, timestamp DESC);
-- UserInterest: current interest vector per user
CREATE TABLE user_interest (
user_id UUID PRIMARY KEY,
interest_vector JSONB NOT NULL DEFAULT '{}', -- {topic: weight, ...} normalized
last_updated TIMESTAMPTZ NOT NULL DEFAULT NOW(),
version BIGINT NOT NULL DEFAULT 0 -- for optimistic locking
);
Redis Key Schema (Derived Views) #
trending:{epoch_30s} → SORTED SET {article_id: score} TTL=120s
feed:{user_id} → STRING serialized ranked article list TTL=300s
crawl:in-flight:{fetch_id} → STRING '1' TTL=600s
lsh:band:{band_idx}:{hash} → SET {article_id, ...} TTL=604800s (7d)
minhash:sig:{article_id} → STRING 128 hash values (binary) TTL=604800s
Elasticsearch Index Schema #
{
"mappings": {
"properties": {
"article_id": { "type": "keyword" },
"url": { "type": "keyword" },
"title": { "type": "text", "analyzer": "english" },
"body_excerpt": { "type": "text", "analyzer": "english" },
"category_labels": { "type": "nested",
"properties": {
"category": { "type": "keyword" },
"confidence": { "type": "float" }
}},
"topic_vector": { "type": "dense_vector", "dims": 128, "index": true, "similarity": "dot_product" },
"published_at": { "type": "date" },
"source_id": { "type": "keyword" },
"cluster_id": { "type": "keyword" },
"engagement_count": { "type": "long" }
}
},
"settings": {
"number_of_shards": 12,
"number_of_replicas": 1,
"index.lifecycle.name": "articles-ilm-policy"
}
}
Step 10 — Technology Landscape #
| Layer | Technology | Rationale | Alternative Considered |
|---|---|---|---|
| Primary data store | PostgreSQL 16 | ACID transactions; FOR UPDATE SKIP LOCKED for scheduler; ON CONFLICT for dedup; range partitioning for ReadEvent; JSONB for vectors and labels | CockroachDB (more complex; not needed at this scale) |
| Article metadata read | PostgreSQL (same instance, read replicas) | Category/topic browsing is simple range queries; no need for separate read store at initial scale | Cassandra (not needed; adds ops complexity) |
| Event stream / message bus | Apache Kafka (3 brokers, replication factor 3) | Durable log for ArticleFetch, ReadEvent, CrawlTrigger; at-least-once delivery with offset-based dedup; Kafka Streams or Flink as consumer | Pulsar (Kafka ecosystem is larger; Debezium support is mature) |
| Stream processing | Apache Flink (3 job managers, 6 task managers) | Sliding window for TrendingScore; stateful exactly-once for UserInterest update; rich windowing API; checkpointing to S3 | Kafka Streams (simpler but less powerful windowing; no cross-key state joins) |
| Cache / derived views | Redis 7 (cluster mode, 6 nodes) | TrendingScore sorted sets; feed cache; crawl in-flight dedup; LSH buckets; low-latency reads | Memcached (no sorted sets or other rich data structures) |
| Search index | Elasticsearch 8 (3 data nodes, 1 coordinating) | Full-text search over article title/body; KNN vector search for feed candidates; faceted filtering by category | OpenSearch (equivalent, but ES has better k-NN support in v8) |
| CDC pipeline | Debezium (Kafka Connect plugin) | Postgres WAL → Kafka → Elasticsearch; guaranteed delivery; handles schema changes | Custom triggers (fragile; Debezium is production-tested) |
| Raw content storage | Amazon S3 | Archival of raw HTML from crawls; S3 Lifecycle for cost management; extremely cheap at scale | GCS (equivalent; S3 chosen for ecosystem) |
| ML classification | Internal ML service (Python, FastAPI) | Topic/category classification via fine-tuned BERT or FastText; called synchronously from ClassifierWorker | AWS Comprehend (vendor lock-in; model not customizable) |
| ANN / vector search | FAISS (via Elasticsearch KNN) | Approximate nearest neighbor for feed candidate retrieval using topic vectors | Milvus (standalone vector DB adds ops overhead; ES KNN sufficient) |
| Crawl scheduling | Postgres-backed scheduler (crawl_schedule table + daemon) | Simple, correct, no new ops dependency; FOR UPDATE SKIP LOCKED handles concurrency | Quartz Scheduler / Celery Beat (adds dependency; Postgres-native is simpler) |
| Service mesh / gateway | Nginx + API Gateway | Authentication (JWT); rate limiting; routing to microservices | Kong (viable alternative) |
| Container orchestration | Kubernetes (EKS) | Auto-scaling for CrawlWorkers, ClassifierWorkers, Feed service | Nomad (smaller ecosystem) |
Step 11 — Deployment Topology #
Logical Service Topology #
┌──────────────────────────────────────────────┐
│ Clients (Browser / Mobile) │
└─────────────────────┬────────────────────────┘
│ HTTPS
┌─────────────────────▼────────────────────────┐
│ API Gateway (Nginx + Auth) │
│ Rate limiting, JWT validation, routing │
└──┬────────────┬───────────────┬──────────────┘
│ │ │
┌────────────▼──┐ ┌──────▼──────┐ ┌───▼───────────┐
│ Feed Service │ │ Search Svc │ │ Trending Svc │
│ (reads Redis, │ │(queries ES) │ │ (reads Redis) │
│ ES, Postgres) │ └──────┬──────┘ └───────────────┘
└────────────┬──┘ │
│ │ ES query
┌─────────────▼──────────────────────────────────┐
│ Redis Cluster │
│ feed:{user_id}, trending:{epoch}, LSH buckets │
└─────────────────────────────────────────────────┘
┌────────────────────────────── Write Path ──────────────────────────────────┐
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌──────────────────────────────┐ │
│ │ Scheduler │───▶│ Kafka │───▶│ CrawlWorker Pool (k8s) │ │
│ │ Daemon │ │ crawl- │ │ Fetches HTML, stores S3, │ │
│ │ (Postgres │ │ triggers │ │ publishes to article-fetches │ │
│ │ FOR UPDATE │ └─────────────┘ └──────────────────────────────┘ │
│ │ SKIP LOCKED)│ │ │
│ └─────────────┘ ┌─────────────▼──────────┐ │
│ │ ParseWorker Pool │ │
│ │ (RSS/HTML parsing, │ │
│ │ INSERT ON CONFLICT) │ │
│ └─────────────┬──────────┘ │
│ │ Postgres │
│ ┌────────────────────────▼────────────┐ │
│ │ Postgres Primary (RDS) │ │
│ │ article, source, crawl_schedule, │ │
│ │ user_interest, read_event │ │
│ └──┬─────────────────────┬────────────┘ │
│ │ Debezium CDC │ Read replicas (2) │
│ ┌───────────▼──┐ ┌────────▼──────────┐ │
│ │ Kafka topic │ │ Read Replica Pool │ │
│ │ article- │ │ (feed queries, │ │
│ │ changes │ │ category browse) │ │
│ └───────────┬──┘ └───────────────────┘ │
│ │ │
│ ┌───────────▼──┐ ┌────────────────────┐ │
│ │ Elasticsearch │ │ Flink Cluster │ │
│ │ Indexer │ │ - trending-job │ │
│ │ (Kafka Connect│ │ - interest-update │ │
│ │ Sink) │ │ - cluster-job │ │
│ └───────────┬──┘ └────────────────────┘ │
│ │ │
│ ┌───────────▼──────────────────────────────────┐ │
│ │ Elasticsearch Cluster (3 nodes) │ │
│ └──────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
Physical Deployment (AWS, Single Region Initial) #
| Component | Instance Type | Count | Notes |
|---|---|---|---|
| Postgres (RDS) | db.r6g.2xlarge | 1 primary + 2 replicas | Multi-AZ; 2TB gp3 |
| Kafka brokers | i3.xlarge | 3 | 3 AZs; 1.9TB NVMe |
| Flink (JobManager) | c6i.xlarge | 3 | HA; ZooKeeper for leader election |
| Flink (TaskManager) | r6i.2xlarge | 6 | 64GB RAM; RocksDB state backend |
| Redis (ElastiCache) | r6g.xlarge | 6 | Cluster mode; 3 shards × 2 replicas |
| Elasticsearch | r6g.2xlarge | 3 data + 1 coord | 500GB gp3 each |
| CrawlWorkers | c6i.large | 10-50 (auto-scaled) | HPA on queue depth |
| ParseWorkers | c6i.large | 5-20 (auto-scaled) | |
| ClassifierWorkers | g4dn.xlarge (GPU) | 3-10 | GPU for BERT inference |
| ClusterWorkers | c6i.large | 3-8 | CPU-bound MinHash |
| Feed Service | c6i.large | 3-10 | Stateless; HPA on CPU |
| Search Service | c6i.large | 3-5 | Stateless |
| Debezium (Kafka Connect) | c6i.large | 3 | Distributed mode |
| S3 | — | — | Raw HTML archive |
Step 12 — Consistency Model #
Classification of consistency per read/write path.
| Path | Consistency Model | Justification |
|---|---|---|
| Article ingest (crawl → Postgres) | Linearizable (single primary) | URL uniqueness invariant requires atomic insert; Postgres primary provides linearizability |
| Article metadata update (classify/cluster → Postgres) | Read-your-writes (within pipeline partition) | Flink partitions by article_id; each article’s metadata updates are sequential |
| CrawlSchedule update (scheduler → Postgres) | Linearizable | FOR UPDATE SKIP LOCKED is pessimistic locking; serializable within transaction |
| ReadEvent recording (API → Postgres) | Linearizable (unique index enforces idempotency) | Postgres unique index on (user_id, article_id, session_id) |
| UserInterest update (Flink → Postgres) | Read-your-writes | Optimistic locking (version field); single Flink partition per user_id ensures no concurrent writers |
| SearchIndex (CDC via Debezium → Elasticsearch) | Eventual consistency (I-8: 60s staleness bound) | CDC is async; Elasticsearch is a derived projection |
| TrendingScore (Flink → Redis) | Eventual consistency (I-6: 30s staleness bound) | Stream aggregation is async; score updates every 30s |
| PersonalizedFeed (Redis cache) | Eventual consistency (I-7: 5-minute TTL) | Cache is a materialized derived view; stale on cache hit |
| Category/topic browse (Postgres read replica) | Eventual consistency (replication lag typically < 1s) | Read replica lag; acceptable for browsing |
| Trending read (Redis sorted set) | Eventual consistency (30s window slide) | Same as above |
| Search query (Elasticsearch) | Eventual consistency (60s CDC lag) | Derived projection |
Consistency anomalies explicitly accepted:
- A user reads an article and it does not immediately appear as read in their interest vector — up to 5 minutes staleness for feed personalization.
- A newly published article may not appear in search for up to 60 seconds.
- An article’s trending score may lag actual read volume by up to 30 seconds.
- Two concurrent requests to read the same article from different sessions may both record ReadEvents (idempotency key prevents double-counting only within the same session).
Step 13 — Scaling Model #
Scaling Dimensions #
| Dimension | Current Scale Target | Scaling Mechanism | Hotspot Risk |
|---|---|---|---|
| Sources (crawled feeds) | 100,000 sources | Horizontal: more CrawlWorkers; Kafka partition count scales with source count | Single high-frequency source (e.g., Reuters publishes 500 articles/hour) → shard by source_id |
| Articles ingested/day | 5 million articles/day | Postgres partitioned by month; ParseWorker pool auto-scaled | Viral event: all sources publish same story simultaneously → URL dedup absorbs burst |
| ReadEvents/second | 50,000 reads/sec peak | ReadEvent table partitioned by day; Kafka handles bursts; Flink auto-scales | Viral article: millions of reads/hour → Count-Min Sketch handles cardinality; Redis sorted set atomic |
| Users | 100 million users | UserInterest rows: 100M × 1KB = 100GB — fits in Postgres; feed cache in Redis cluster | Celebrity user (influencer): their read events are processed the same as any other user's |
| Trending computation | 128 article_ids × 1-hour windows | Count-Min Sketch is O(1) memory per article; Redis sorted set scales horizontally | Breaking news: single article gets 10M reads/hour → CMS handles gracefully, Redis ZADD is atomic |
| Search | 100M article index | Elasticsearch 12 shards; expand to 24 shards with reindex | Hot shard: recent articles searched more → ILM hot-warm-cold tier; recent index on hot nodes |
| Personalized feed | 100M users × 5 recomputes/day | Redis cluster holds 100M keys × 10KB = 1TB → scale cluster; proactive rebuild for active users only | Popular user (many active sessions): each session triggers recompute → debounce: only rebuild if cache is cold |
Sharding Strategy #
| Object | Shard Key | Why |
|---|---|---|
| Article | ingested_at month (range partition) | Even distribution; enables time-based retention; no hotspot (new articles spread across sources) |
| ReadEvent | timestamp day (range partition) | Even distribution; enables fast deletion of old partitions for retention |
| Kafka read-events | user_id (hash) | Ensures all events for one user are in the same partition → one Flink task receives all events for that user |
| Kafka new-articles | url hash | Ensures classifier and cluster workers process each article exactly once |
| Redis trending | window_epoch | Different windows are different keys; no hotspot |
| Redis feed cache | user_id hash (Redis Cluster) | Even distribution across Redis shards |
| Elasticsearch | Document _id hash (default routing, 12 shards) | Articles are uniformly distributed; no natural clustering needed |
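The Kafka sharding guarantees above rest on a deterministic key→partition mapping. A sketch of that invariant, using MD5 as an illustrative stand-in (Kafka's Java client actually uses murmur2 of the key bytes, modulo the partition count):

```python
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    # Deterministic hash -> the same key always maps to the same partition,
    # so one consumer/Flink task owns all events for a given user_id.
    # (Illustrative stand-in; real Kafka clients use murmur2, not MD5.)
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions
```

One operational consequence worth remembering: changing the partition count changes every key's mapping, so resizing the read-events topic requires draining or re-keying stateful Flink jobs.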
Auto-Scaling Triggers #
| Service | Scale-Out Trigger | Scale-In Trigger |
|---|---|---|
| CrawlWorkers | Kafka crawl-triggers consumer lag > 1000 messages | Lag < 100 for 5 minutes |
| ParseWorkers | Kafka article-fetches consumer lag > 500 messages | Lag < 50 for 5 minutes |
| ClassifierWorkers | Kafka new-articles consumer lag > 200 messages | Lag < 20 for 5 minutes |
| Feed Service | CPU > 70% | CPU < 30% for 10 minutes |
| Flink TaskManagers | Flink backpressure > 80% | Backpressure < 20% for 10 minutes |
Step 14 — Failure Model #
Failure Modes and Mitigations #
| Failure | Impact | Detection | Mitigation | Recovery |
|---|---|---|---|---|
| Postgres primary failure | Article ingest halts; ReadEvents queue in Kafka; feed reads degrade to cache | RDS health check; Postgres replication lag alert | RDS Multi-AZ automatic failover (30-90s); Kafka buffers events during failover | Resume consumption from Kafka after failover; no data loss (Kafka retained) |
| Kafka broker failure | Read events may queue; crawl triggers queue | Kafka broker health; consumer lag alert | Replication factor 3; one broker failure is transparent | Kafka auto-rebalances partitions to surviving brokers; no intervention |
| Flink job crash (trending) | TrendingScore stale for up to window slide period | Flink job health; trending Redis TTL expiry alert | Flink checkpointing every 30s to S3; auto-restart from checkpoint | Job restarts from checkpoint; replays from last Kafka offset; trending recovers within 30s |
| Flink job crash (interest update) | UserInterest stale; feed personalization degrades | Flink job health; consumer lag alert | Same as above; Kafka ReadEvents retained 7 days | Job restarts; replays from checkpoint offset; no data loss |
| Redis cluster failure | Feed cache miss → higher latency; trending unavailable | Redis health; cache hit rate alert | Redis ElastiCache cluster mode with 2 replicas per shard; automatic failover | Cache rebuilds on demand on cache miss; trending recovers when Flink re-populates |
| Elasticsearch failure | Search unavailable; feed candidate retrieval from ES fails | ES health; search error rate | 3-node ES with 1 replica per shard; coordinating node is stateless | ES auto-recovers; search unavailable during outage; feed service falls back to Postgres query |
| ML classifier service down | Articles not categorized; category browsing shows fewer results | Classifier health; Kafka new-articles consumer lag | ClassifierWorkers retry with exponential backoff; articles remain uncategorized until service recovers | Service recovers; Kafka backlog is processed; articles categorized with delay |
| CrawlWorker crash mid-fetch | ArticleFetch not published; source skipped for one interval | Crawl trigger timeout; next_crawl_at not updated for source | CrawlWorker uses Redis SETNX for in-flight dedup; if worker crashes, lock expires (TTL 600s); scheduler re-triggers | Scheduler re-triggers source after TTL expiry; at-most-one-interval gap in coverage |
| Debezium CDC lag | Elasticsearch stale beyond 60s SLO | CDC lag metric; ES vs Postgres article count difference | Debezium Kafka Connect with retries; dead-letter queue | Reprocess from dead-letter queue; or trigger full reindex for specific article_id range |
| S3 unavailable | Raw HTML archival fails; pipeline still works (S3 write is async) | S3 put error rate | CrawlWorker publishes to Kafka regardless of S3 success; S3 write is best-effort archive | Re-fetch from source if archival required; or accept loss of raw HTML for that fetch |
Cascading Failure Analysis #
Scenario: Breaking news event — all 100K sources publish articles about the same story simultaneously
- CrawlWorkers fetch 100K articles in 1 hour window
- Kafka article-fetches topic receives 100K events — ParseWorkers scale out to handle the burst
- Postgres receives 100K INSERT ON CONFLICT DO NOTHING statements — most are the same story from different sources; conflicting URLs are deduplicated at insert, so 100K inserts yield perhaps 5K unique articles
- MinHash clustering pipeline is briefly overwhelmed → cluster assignment delayed; articles appear unclustered briefly
- ReadEvents spike (everyone reads breaking news) → TrendingScore updates in Redis within 30s → trending feed shows breaking news correctly
- Feed recompute for all users triggered → Redis cache invalidated at scale → Feed Service CPU spikes → HPA scales out
- Circuit breaker: if Postgres read replicas lag, Feed Service falls back to stale cache or returns partial results
Mitigation: Breaking news is the design-time scenario this system is built for. Key protections: URL dedup at Postgres level prevents duplicate storage; Count-Min Sketch handles high-cardinality trending without memory explosion; Redis cache absorbs read spike; Kafka buffers upstream bursts.
Step 15 — SLOs #
| Service | SLO | Measurement | Error Budget (30d) |
|---|---|---|---|
| Article ingest latency | P99 < 30s from fetch to ingestion; end-to-end, an article from a source crawled every 15 min is available within 15 min + 30s of publication | ingested_at - published_at for new articles | 0.1% of articles may exceed 30s |
| Feed load time | P99 < 500ms | API response time from request to first byte (with cache) | 0.1% of feed requests > 500ms |
| Feed load time (cache miss) | P99 < 3s | API response time on cache miss (full recompute) | 0.5% of recomputes > 3s |
| Search response time | P99 < 1s | Elasticsearch query + API response | 0.1% of searches > 1s |
| Trending freshness | 95% of articles in trending list within 30s of crossing engagement threshold | TrendingScore update lag | 5% of trending updates may lag |
| Personalized feed freshness | 99% of feeds reflect UserInterest as of ≤ 5 minutes ago | Cache TTL + interest update lag | 1% of feeds may be > 5min stale |
| Search index freshness | 99% of articles appear in search within 60s of ingestion | es_indexed_at - ingested_at | 1% of articles may lag > 60s |
| Availability (feed endpoint) | 99.9% uptime | HTTP 5xx rate < 0.1% | 43.2 minutes/month |
| Availability (search endpoint) | 99.5% uptime | HTTP 5xx rate < 0.5% | 3.6 hours/month |
| Crawl coverage | 99% of active sources crawled within 2× their crawl_interval | Sources where last_crawled_at < now() - 2 * crawl_interval | 1% may miss crawl window |
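The error-budget column for the availability rows follows mechanically from the SLO percentage. A small helper makes the arithmetic explicit (a sketch; the 30-day window matches the table header):

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Allowed downtime (in minutes) within the window for a given
    availability SLO, e.g. 99.9% over 30 days -> 43.2 minutes."""
    total_minutes = window_days * 24 * 60  # 43,200 for a 30-day window
    return total_minutes * (1 - slo_percent / 100)
```

This reproduces the table: 99.9% yields 43.2 minutes/month for the feed endpoint and 99.5% yields 216 minutes (3.6 hours) for search.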
Step 16 — Operational Parameters #
Tuneable Parameters #
| Parameter | Default | Range | Effect |
|---|---|---|---|
| `crawl_interval` (per source) | 15 minutes | 5 min – 24 hours | Frequency of source crawling; affects Postgres write load |
| `min_crawl_interval` (system floor) | 5 minutes | 1 min – 30 min | Prevents overloading any single source |
| MinHash bands | 16 | 8 – 32 | More bands = higher recall (fewer missed duplicates); more Redis memory |
| MinHash hashes per band | 8 | 4 – 16 | More hashes = higher precision (fewer false positives); slower computation |
| Jaccard similarity threshold | 0.85 | 0.7 – 0.95 | Lower = more aggressive clustering; higher = stricter near-dup detection |
| Trending window size | 1 hour | 15 min – 6 hours | Wider = smoother trending; narrower = more reactive |
| Trending slide interval | 30 seconds | 10 sec – 5 min | Smaller = fresher trending; more Redis writes |
| Count-Min Sketch width | 2048 | 512 – 8192 | Wider = lower error; more memory |
| Count-Min Sketch depth | 5 | 3 – 10 | Deeper = lower probability of error; more memory |
| Feed cache TTL | 300 seconds | 60 – 1800 sec | Shorter = fresher feed; more recompute load |
| UserInterest EMA learning rate (α) | 0.10 | 0.01 – 0.3 | Higher = faster adaptation to new interests; lower = more stable |
| UserInterest decay rate (γ) | 0.95 per day | 0.8 – 1.0 per day | Lower = faster forgetting of old interests; higher = longer memory |
| Feed candidate pool (articles per topic) | 100 | 20 – 500 | Larger pool = better ranking quality; more ES query load |
| Top-K topics for feed candidates | 5 | 2 – 10 | More topics = broader feed; fewer = more focused |
| Kafka ReadEvent retention | 7 days | 1 – 30 days | Longer = more replay capacity for Flink recovery; more storage cost |
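The interaction between MinHash bands and hashes-per-band is captured by the standard LSH S-curve: a pair with Jaccard similarity `s` becomes a dedup candidate with probability `1 - (1 - s^r)^b`. A one-line sketch, using the table's defaults (16 bands × 8 rows):

```python
def lsh_candidate_probability(similarity: float,
                              bands: int = 16,
                              rows: int = 8) -> float:
    """Probability that two documents with the given Jaccard similarity
    share at least one LSH band (16 bands x 8 rows = a 128-hash
    MinHash signature). Each band matches with probability s^rows;
    the pair is a candidate if any of the `bands` bands match."""
    return 1.0 - (1.0 - similarity ** rows) ** bands
```

With these defaults, pairs at the 0.85 threshold are caught over 99% of the time, while clearly dissimilar pairs (s ≈ 0.3) almost never become candidates — which is why widening `bands` raises recall and lengthening `rows` raises precision, as the table states.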
Circuit Breaker Configuration #
| Endpoint | Failure Threshold | Timeout | Half-Open Probe Interval |
|---|---|---|---|
| External source HTTP fetch | 5 failures in 60s | 30s per request | 60s |
| Postgres write (article insert) | 3 failures in 10s | 5s | 30s |
| Elasticsearch query (search/feed) | 5 failures in 30s | 3s | 15s |
| Redis read (feed cache) | 10 failures in 10s | 100ms | 10s |
| ML classifier service | 3 failures in 30s | 10s | 60s |
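The breaker policy in the table can be sketched as a small state machine. This is an illustrative implementation, not the production one; the injectable clock exists only to make the sliding failure window testable.

```python
import time

class CircuitBreaker:
    """Sketch of the policy above: trip OPEN after `threshold` failures
    within `window` seconds; after `probe_interval` allow a single
    HALF_OPEN probe; a probe success closes, a probe failure re-opens."""

    def __init__(self, threshold=5, window=60.0, probe_interval=60.0,
                 clock=time.monotonic):
        self.threshold = threshold
        self.window = window
        self.probe_interval = probe_interval
        self.clock = clock
        self.failures = []          # timestamps of recent failures
        self.state = "CLOSED"
        self.opened_at = None

    def allow_request(self) -> bool:
        if self.state == "OPEN":
            if self.clock() - self.opened_at >= self.probe_interval:
                self.state = "HALF_OPEN"   # let exactly one probe through
                return True
            return False
        return True

    def record_success(self) -> None:
        self.state = "CLOSED"
        self.failures.clear()

    def record_failure(self) -> None:
        now = self.clock()
        if self.state == "HALF_OPEN":
            self.state, self.opened_at = "OPEN", now
            return
        # Keep only failures inside the sliding window, then count.
        self.failures = [t for t in self.failures if now - t < self.window]
        self.failures.append(now)
        if len(self.failures) >= self.threshold:
            self.state, self.opened_at = "OPEN", now
```

For the external-fetch row this would be instantiated as `CircuitBreaker(threshold=5, window=60.0, probe_interval=60.0)`, with the 30s request timeout enforced separately by the HTTP client.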
Step 17 — Runbooks #
Runbook 1: Feed Latency Degradation (P99 > 500ms) #
Trigger: Alert on feed_api_latency_p99 > 500ms for 5 consecutive minutes.
Diagnosis steps:
- Check Redis feed cache hit rate: `redis-cli info stats | grep keyspace_hits`. If hit rate < 80%, the cache is being evicted or the TTL is too short.
- Check Feed Service CPU: `kubectl top pods -l app=feed-service`. If > 80%, scale out: `kubectl scale deployment feed-service --replicas=10`.
- Check Elasticsearch query latency: Kibana → Monitoring → Elasticsearch → Search latency. If P99 > 500ms, check for hot shards.
- Check Postgres read replica lag: CloudWatch → RDS → `ReplicaLag`. If > 5s, the feed service is reading stale data; acceptable per SLO but investigate the replication stall.
- Check UserInterest update lag: Kafka consumer group lag for `interest-update-pipeline`. If > 10,000 messages, the Flink job may be struggling.
Remediation:
- Redis eviction: Increase Redis cluster memory or reduce non-feed TTLs. Short-term: `CONFIG SET maxmemory-policy allkeys-lru`.
- Feed Service overloaded: Scale out pods. If scaling doesn't help, check for slow ES queries (see step 3).
- Hot ES shards: Force-merge the recent index or move the hot shard to a dedicated node.
- Flink lag: Check the Flink UI for backpressure. Scale TaskManagers if needed.
Runbook 2: Crawl Coverage Drop (sources missing crawl window) #
Trigger: Alert on sources_overdue_ratio > 0.01 (more than 1% of sources haven’t been crawled within 2× their interval).
Diagnosis steps:
- Check scheduler daemon health: `kubectl get pods -l app=scheduler-daemon`. If the pod is not running, restart: `kubectl rollout restart deployment scheduler-daemon`.
- Check Kafka `crawl-triggers` consumer lag: if lag > 10,000, CrawlWorkers are overwhelmed.
- Check CrawlWorker pod count: `kubectl get pods -l app=crawl-worker`. Auto-scaling should have triggered.
- Check for circuit-breaker trips on external sources: look for a high `crawl.fetch_error` metric rate. May indicate a network issue or a source outage.
- Check the Postgres `crawl_schedule` table for `IN_FLIGHT` rows with old `last_triggered_at` (> 10 minutes): these are stuck crawls whose workers crashed without completing.
Remediation:
- Stuck IN_FLIGHT rows: `UPDATE crawl_schedule SET status = 'PENDING' WHERE status = 'IN_FLIGHT' AND last_triggered_at < NOW() - INTERVAL '10 minutes';` This re-queues them for the next scheduler tick.
- CrawlWorker overload: scale out: `kubectl scale deployment crawl-worker --replicas=50`.
- Network issue: check AWS VPC flow logs; check if specific source domains are down.
Runbook 3: Elasticsearch Index Divergence (search misses recent articles) #
Trigger: Alert on es_index_lag_seconds > 120 (CDC pipeline is more than 2× SLO behind).
Diagnosis steps:
- Check Debezium Kafka Connect status: `curl http://kafka-connect:8083/connectors/postgres-source/status`.
- Check Kafka `article-changes` topic consumer lag.
- Check the Elasticsearch indexer Kafka Connect sink status.
- Check Elasticsearch cluster health: `curl http://es:9200/_cluster/health`.
Remediation:
- Debezium connector failure: `curl -X POST http://kafka-connect:8083/connectors/postgres-source/restart`.
- Elasticsearch cluster RED/YELLOW: check unassigned shards: `curl http://es:9200/_cat/shards?v | grep UNASSIGNED`. Force-allocate if needed.
- If lag is persistent (> 1 hour): trigger a partial reindex for recently ingested articles: `reindex_articles_since(timestamp=NOW() - INTERVAL '2 hours')`.
Runbook 4: Trending Score Stale (Redis TTL expired, Flink not writing) #
Trigger: Alert on trending_redis_key_missing — the expected Redis key trending:{current_epoch} does not exist.
Diagnosis steps:
- Check Flink trending-job status in Flink UI.
- Check Kafka `read-events` consumer lag for the trending-job consumer group.
- Check Redis cluster health.
Remediation:
- Flink job down: `flink run -d trending-job.jar` (or trigger via a Kubernetes job restart).
- Redis cluster issue: check ElastiCache events in the AWS console.
- Cold start (no ReadEvents in past hour): this is expected during low-traffic periods; trending returns empty list — acceptable.
Step 18 — Observability #
Metrics #
| Metric | Type | Labels | Alert Threshold |
|---|---|---|---|
| `articles_ingested_total` | Counter | source_id, status (new/duplicate) | — |
| `article_parse_latency_seconds` | Histogram | feed_type | P99 > 5s |
| `crawl_fetch_error_rate` | Gauge | source_id | > 10% over 5 min |
| `sources_overdue_ratio` | Gauge | — | > 0.01 |
| `kafka_consumer_lag` | Gauge | consumer_group, topic | > 10,000 |
| `feed_api_latency_seconds` | Histogram | cache_status (hit/miss) | P99 > 500ms (hit), P99 > 3s (miss) |
| `feed_cache_hit_rate` | Gauge | — | < 80% |
| `es_index_lag_seconds` | Gauge | — | > 120s |
| `trending_score_freshness_seconds` | Gauge | — | > 60s |
| `user_interest_update_lag_seconds` | Gauge | — | > 300s |
| `minhash_cluster_assignments_total` | Counter | action (new_cluster/joined_cluster) | — |
| `read_event_dedup_rate` | Gauge | — | > 5% (indicates client retry loop) |
| `flink_job_restarts_total` | Counter | job_name | > 3 in 1 hour |
| `circuit_breaker_open_total` | Counter | service | > 0 per minute |
| `classifier_latency_seconds` | Histogram | — | > 10s (P99) |
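The `sources_overdue_ratio` gauge (the trigger for Runbook 2) is a straightforward computation over the crawl schedule. A sketch, assuming dict-shaped rows with the `last_crawled_at` / `crawl_interval` fields used throughout this document:

```python
from datetime import datetime, timedelta

def sources_overdue_ratio(sources, now):
    """Fraction of active sources whose last crawl is older than 2x their
    per-source crawl_interval — the gauge behind the > 0.01 alert.
    `sources` is an iterable of dicts; field names follow the
    crawl_schedule shape assumed in this document."""
    active = [s for s in sources if s.get("active", True)]
    if not active:
        return 0.0
    overdue = sum(
        1 for s in active
        if now - s["last_crawled_at"] > 2 * s["crawl_interval"]
    )
    return overdue / len(active)
```

In production this would be a single SQL aggregate exported on a scrape interval, not a Python loop; the logic is the same.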
Traces #
Distributed tracing (OpenTelemetry) spans:
- `feed.request` → spans: `redis.get(feed_cache)`, `es.search(candidates)`, `postgres.read(user_interest)`, `rank.compute`, `redis.set(feed_cache)`
- `article.ingest` → spans: `kafka.consume(article-fetches)`, `parse.html`, `postgres.insert(article)`, `kafka.publish(new-articles)`
- `crawl.trigger` → spans: `postgres.query(crawl_schedule)`, `kafka.publish(crawl-trigger)`, `postgres.update(next_crawl_at)`
Logs #
Structured JSON logs with fields: service, trace_id, span_id, level, msg, article_id, user_id, source_id, duration_ms.
Key log events:
- `article.ingested` — every new article (not duplicate): `{article_id, url, source_id, published_at, ingested_at}`
- `article.duplicate` — URL conflict on insert: `{url, existing_article_id}`
- `cluster.assigned` — article assigned to cluster: `{article_id, cluster_id, action, jaccard_similarity}`
- `interest.updated` — user interest vector updated: `{user_id, top_topics, version}`
- `feed.cache_miss` — feed recomputed: `{user_id, candidate_count, rank_duration_ms}`
- `crawl.error` — source fetch failed: `{source_id, url, http_status, error}`
Dashboards #
- Ingestion Health: articles/hour by source, parse error rate, duplicate rate, cluster assignment rate
- Pipeline Lag: Kafka consumer lag per consumer group, Flink checkpoint duration, CDC lag to Elasticsearch
- User Experience: feed P50/P95/P99 latency (cache hit vs miss), search latency, trending score freshness
- Crawl Coverage: sources overdue, crawl error rate by domain, CrawlWorker queue depth
- Content Quality: articles per cluster (clustering effectiveness), classification confidence distribution
Step 19 — Cost Model #
Monthly cost estimate at 100K sources, 5M articles/day, 50K reads/second peak.
| Component | Sizing | Monthly Cost (USD) | Notes |
|---|---|---|---|
| RDS Postgres (primary + 2 replicas) | db.r6g.2xlarge × 3 | ~$1,800 | 2TB gp3 storage; Multi-AZ |
| Kafka (MSK) | 3 × i3.xlarge | ~$900 | Managed Kafka; 3 AZ |
| Flink (EC2) | 3 JobManager (c6i.xl) + 6 TaskManager (r6i.2xl) | ~$2,400 | Spot for TaskManagers saves ~60% |
| Redis (ElastiCache) | 6 × r6g.xlarge (cluster) | ~$1,800 | |
| Elasticsearch | 3 × r6g.2xlarge + 1 coord | ~$1,600 | 500GB gp3 each |
| CrawlWorkers (EC2 Spot) | avg 20 × c6i.large | ~$400 | Spot pricing |
| ParseWorkers (EC2 Spot) | avg 10 × c6i.large | ~$200 | |
| ClassifierWorkers (GPU Spot) | avg 5 × g4dn.xlarge | ~$700 | GPU Spot |
| ClusterWorkers (EC2 Spot) | avg 5 × c6i.large | ~$100 | |
| Feed/Search Services (EC2) | avg 6 × c6i.large | ~$360 | |
| S3 (raw HTML archive) | 5M articles × 100KB avg × 30 days = 15TB/month | ~$340 | S3 Standard → Glacier after 7 days (~$150 effective) |
| Data transfer (outbound) | 100M users × 50 requests/month × 10KB = 50TB | ~$4,500 | CloudFront CDN reduces this by ~80% → ~$900 effective |
| CloudFront CDN | 50TB outbound × 80% cached | ~$600 | Reduces origin load and egress |
| Kubernetes (EKS) | Cluster overhead | ~$150 | Control plane |
| Debezium/Kafka Connect | 3 × c6i.large | ~$180 | |
| Total (approximate) | — | ~$12,000 – $15,000/month | Spot discounts and CDN caching bring effective cost down |
Cost optimization levers:
- Use Spot instances for all stateless workers (CrawlWorkers, ParseWorkers, ClassifierWorkers) — 60-70% savings
- S3 Lifecycle: raw HTML to Glacier after 7 days — reduces S3 cost by ~75%
- Flink TaskManagers on Spot with checkpointing to S3 — job restarts on Spot interruption (acceptable; checkpoint recovery < 30s)
- CloudFront for feed API responses: trending and category feeds are cacheable for 30s at edge — reduces origin load by 80%
- Postgres: use RDS Reserved Instances (1-year) — 40% discount → ~$700/month for DB
Step 20 — Evolution Stages #
Stage 1: Foundation (0 → 1M users, 10K sources) #
What is built:
- Postgres for all primary state (article, source, crawl_schedule, read_event, user_interest)
- Kafka with 3 topics: crawl-triggers, article-fetches, read-events
- CrawlWorker pool + ParseWorker pool (manual scaling, 5 workers each)
- Simple rule-based classifier (keyword matching, no ML) for category labels
- No near-duplicate clustering (each URL is its own “cluster”)
- Feed = most recent articles in user’s followed categories (no ML ranking)
- Trending = COUNT(read_events in past 1h) per article, computed by cron job every 5 minutes
- Elasticsearch not yet deployed — search is a Postgres full-text query (`tsvector`)
- No Redis — the feed is computed on every request from Postgres
Compromises accepted: No personalization; no near-dup clustering; search is slow; trending is coarse.
Migration path to Stage 2: Add Redis for feed cache when P99 feed latency exceeds 1 second; add Flink when cron job trending lag becomes user-visible.
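Stage 1's trending cron is simple enough to state completely. A sketch of the counting logic with in-memory events (in Stage 1 itself this would be a plain Postgres `GROUP BY` run every 5 minutes):

```python
from collections import Counter
from datetime import datetime, timedelta

def stage1_trending(read_events, now, window=timedelta(hours=1), top_n=20):
    """Stage 1 trending: count reads per article over the trailing window
    and return the top N. `read_events` is an iterable of
    (article_id, read_at) tuples — a stand-in for the read_event table."""
    counts = Counter(
        article_id
        for article_id, read_at in read_events
        if now - window <= read_at <= now
    )
    return counts.most_common(top_n)
```

The coarseness is visible here: a 5-minute cron over a 1-hour window means trending can lag reality by up to 5 minutes, which is the user-visible gap that justifies introducing Flink in Stage 2.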
Stage 2: Personalization and Real-Time Trending (1M → 10M users, 50K sources) #
What changes:
- Add Flink (2 TaskManagers) for sliding-window trending score
- Add Redis (3-node cluster) for trending sorted set and feed cache
- Add ML classifier (FastText, CPU-only, 3 workers)
- Add UserInterest (Flink EMA update pipeline)
- PersonalizedFeed: top-K from Postgres by topic score dot product (no ES yet)
- ReadEvent deduplication implemented (unique index + session_id)
- CrawlSchedule eviction: `FOR UPDATE SKIP LOCKED` scheduler pattern implemented
New invariants enforced: I-5 (ReadEvent idempotency), I-6 (TrendingScore freshness), I-7 (feed personalization)
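The UserInterest EMA pipeline added in this stage reduces to two small update rules, using the Step 16 parameters (α = 0.10, γ = 0.95/day). A sketch with sparse topic→weight dicts (Stage 3 later swaps these for 128-dim dense vectors):

```python
def update_interest(interest, article_topics, alpha=0.10):
    """Per-ReadEvent EMA update (the Flink pipeline in miniature):
    new = (1 - alpha) * old + alpha * article. Both vectors are
    topic -> weight dicts; absent topics count as 0."""
    topics = set(interest) | set(article_topics)
    return {
        t: (1 - alpha) * interest.get(t, 0.0)
           + alpha * article_topics.get(t, 0.0)
        for t in topics
    }

def decay_interest(interest, gamma=0.95, days=1):
    """Nightly decay: old interests fade by gamma per elapsed day, so a
    topic untouched for weeks trends toward zero weight."""
    factor = gamma ** days
    return {t: w * factor for t, w in interest.items()}
```

Higher α adapts faster but is noisier; lower γ forgets faster — exactly the trade-offs listed in the Step 16 parameter table.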
Stage 3: Search and Story Clustering (10M → 50M users, 100K sources) #
What changes:
- Deploy Elasticsearch (3 nodes); Debezium CDC from Postgres
- Replace Postgres full-text search with Elasticsearch full-text + dense vector KNN
- Implement MinHash + LSH story clustering (ClusterWorker pool)
- StoryCluster table added; Article.cluster_id populated
- Feed candidate retrieval switches from Postgres to Elasticsearch KNN
- PersonalizedFeed now uses topic_vector (128-dim) instead of category labels
- Add Kafka `new-articles` and `classified-articles` topics
- The Flink interest-update pipeline achieves exactly-once semantics
New invariants enforced: I-4 (near-dedup clustering), I-8 (search freshness)
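The MinHash + LSH clustering introduced in this stage can be sketched end-to-end: compute a 128-hash signature per article, then derive one bucket key per band; articles sharing any bucket key become candidates for Jaccard verification. The salted-MD5 hash family is an illustrative assumption, not the production choice.

```python
import hashlib

def minhash_signature(shingles, num_hashes=128):
    """128-hash MinHash signature (16 bands x 8 rows under the Step 16
    defaults). `shingles` is the set of token n-grams from the article
    text; each of the 128 salted hashes keeps its minimum over the set."""
    return [
        min(int(hashlib.md5(f"{i}:{s}".encode()).hexdigest(), 16)
            for s in shingles)
        for i in range(num_hashes)
    ]

def band_keys(signature, bands=16):
    """LSH bucket keys, one per band: articles that share any key become
    cluster candidates and proceed to exact Jaccard verification
    against the 0.85 threshold."""
    rows = len(signature) // bands
    return [
        hashlib.md5(str(signature[b * rows:(b + 1) * rows]).encode()).hexdigest()
        for b in range(bands)
    ]
```

In the ClusterWorker these bucket keys would live in Redis sets keyed by band hash, so candidate lookup is O(bands) regardless of corpus size.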
Stage 4: Multi-Region (50M → 200M users) #
What changes:
- Deploy read-region in EU and APAC (Postgres read replicas + Redis + ES)
- Feed Service reads from local Redis/ES/Postgres replica
- Kafka MirrorMaker 2 for cross-region ReadEvent replication
- UserInterest updates go to primary region; replicated asynchronously
- Trending is computed per-region (regional trends) + globally (global trending)
- CrawlWorkers deployed per region; each region crawls a partition of sources (avoid duplicate crawls via source → region assignment in crawl_schedule)
New cross-region invariant: UserInterest may be stale by the replication lag (typically < 1s) — accepted as eventual consistency.
Stage 5: Advanced ML and Publisher Ecosystem (200M+ users) #
What changes:
- Replace FastText classifier with fine-tuned BERT (GPU inference, batched)
- Add publisher API: verified publishers can submit articles directly (no crawl needed) via REST API; articles are inserted into Postgres directly, bypassing CrawlWorker/ParseWorker
- Add user feedback loop: explicit signals (save, share, dislike) are incorporated into UserInterest with higher weight than implicit reads
- Add A/B testing framework: PersonalizedFeed ranking function is parameterized; different ranking models tested in shadow mode
- Story clustering graduates to transformer-based embeddings (SBERT) instead of MinHash for higher accuracy; LSH index replaced by FAISS approximate nearest neighbor
- Breaking news detection: articles with engagement velocity > 3× baseline trending threshold trigger push notifications via a separate notification service
New objects: FeedbackEvent (explicit user signal — append-only, higher weight than ReadEvent), PublisherSubmission (stable entity — verified publishers bypass crawl pipeline), NotificationSubscription (intent/future constraint — user opts in to breaking news alerts).
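The breaking-news trigger above is a velocity comparison. A sketch of both halves — measuring engagement velocity over a short trailing window and applying the 3× baseline rule; the 5-minute window is an illustrative assumption:

```python
from datetime import datetime, timedelta

def engagement_velocity(read_timestamps, now, window=timedelta(minutes=5)):
    """Reads per minute over the trailing window for one article."""
    recent = [t for t in read_timestamps if now - window <= t <= now]
    return len(recent) / (window.total_seconds() / 60)

def is_breaking(velocity, baseline, multiplier=3.0):
    """Stage 5 trigger: engagement velocity above `multiplier` x the
    trending baseline fires a push via the separate notification
    service (per-user opt-in via NotificationSubscription)."""
    return velocity > multiplier * baseline
```

Keeping the multiplier relative to the trending baseline (rather than an absolute count) makes the trigger self-calibrating across quiet nights and busy news days.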
End of document. All 20 steps have produced explicit output artifacts. The design is complete.