News Aggregator: System Design #
A complete mechanical derivation of a Google News / Flipboard-style news aggregator — from functional requirements to operable architecture, following the 20-step derivation framework without hand-wavy steps.
Ordering Principle #
Product requirements
→ normalize into operations over state (Step 1)
→ extract primary objects (Step 2)
→ assign ownership, evolution, ordering (Step 3)
→ extract invariants (Step 4)
→ derive minimal DPs from invariants (Step 5)
→ select concrete mechanisms (Step 6)
→ validate independence and source-of-truth (Step 7)
→ specify exact algorithms (Step 8)
→ define logical data model (Step 9)
→ map to technology landscape (Step 10)
→ define deployment topology (Step 11)
→ classify consistency per path (Step 12)
→ identify scaling dimensions and hotspots (Step 13)
→ enumerate failure modes (Step 14)
→ define SLOs (Step 15)
→ define operational parameters (Step 16)
→ write runbooks (Step 17)
→ define observability (Step 18)
→ estimate costs (Step 19)
→ plan evolution (Step 20)
Step 1 — Problem Normalization #
Convert product-language requirements into precise operations over state.
Original Functional Requirements:
- The system ingests articles from RSS feeds, APIs, and scraped sources
- Articles are deduplicated (same story from multiple sources = one article cluster)
- Articles are categorized by topic (ML classification)
- Users browse articles by category, topic, or keyword search
- Users see a personalized feed ranked by relevance to their interests
- Trending articles are surfaced (high engagement in short time window)
- Users can search for articles by keyword
Normalized Table:
| # | Original Requirement | Actor | Operation | State Touched |
|---|---|---|---|---|
| 1a | System ingests articles from RSS/API/scrape | Scheduler + CrawlWorker | append event (raw fetch) | ArticleFetch event; Source.last_crawled_at overwrite |
| 1b | Schedule controls when each source is re-crawled | System scheduler | read intent constraint; trigger crawl | CrawlSchedule (intent/future constraint) |
| 2 | Deduplicate articles — same story = one cluster | System pipeline | conditional create on URL hash; approximate-match on content hash | Article (by URL key); StoryCluster (merge) |
| 3 | Categorize articles by topic | System ML pipeline | overwrite metadata on Article | Article.category_labels (mutable metadata) |
| 4a | Users browse by category | User | read projection filtered by category | CategoryFeed (derived view over Article) |
| 4b | Users browse by topic | User | read projection filtered by topic | TopicFeed (derived view over Article) |
| 4c | Users browse by keyword search | User | query indexed projection | SearchIndex (Elasticsearch projection) |
| 5 | Users see personalized feed | User | read pre-computed ranked projection | PersonalizedFeed (derived view over Article × UserInterest) |
| 5a | Feed ranking updates as user reads articles | System | overwrite interest vector | UserInterest (overwrite on read event) |
| 5b | User reads an article | User | append immutable event | ReadEvent |
| 6 | Trending articles surfaced | System | read sliding-window aggregate | TrendingScore (derived view: count of ReadEvents per article per window) |
| 7 | Users search for articles by keyword | User | query indexed projection | SearchIndex (Elasticsearch) |
Key normalizations:
- FR 1 hides two distinct operations: the crawl scheduling loop (an intent/future constraint that persists across many executions) and the article fetch event (a point-in-time result).
- FR 2 “deduplication” is actually two distinct mechanisms: exact dedup (URL as natural key — one URL = one Article) and near-duplicate clustering (same story from different sources — requires approximate matching, producing a StoryCluster).
- FR 5 “see a personalized feed” hides a hidden write: every read event updates UserInterest. The feed itself is a derived view, not primary state.
- FR 6 “trending” is NOT primary state. TrendingScore is a derived aggregate over ReadEvents within a sliding window.
- FR 4c and FR 7 both resolve to the same mechanism (SearchIndex) — not two separate systems.
Step 2 — Object Extraction #
Identify the minimal set of primary state objects; apply four purity tests to each candidate.
Primary Objects (Candidates) #
Source #
- Class: Stable entity
- Ownership purity: Split writer — admin writes url, crawl_interval, status; pipeline writes last_crawled_at. The two writers touch disjoint fields under different conditions. Accept as one object.
- Evolution purity: Overwrite — current registration state; last_crawled_at updated after each crawl.
- Ordering purity: No meaningful order among sources.
- Non-derivability: Cannot be reconstructed from ArticleFetch events — source configuration (URL, interval) is primary. Pass.
- Verdict: Valid primary object.
CrawlSchedule #
- Class: Intent / future constraint
- Description: Encodes “source S must next be crawled at time T.” Persists until the crawl happens, then re-arms for the next interval. Analogous to AlertSubscription in a price-drop system.
- Ownership purity: System only — scheduler writes next_crawl_at after each completed crawl.
- Evolution purity: Overwrite — one current next-crawl-at per source.
- Ordering purity: No meaningful ordering across different sources; total order within one source (each crawl produces one new schedule entry overwriting the last).
- Non-derivability: The intent (next scheduled time) cannot be reconstructed solely from past ArticleFetch events without the crawl_interval. Pass.
- Verdict: Valid primary object (intent/future constraint class).
Article #
- Class: Stable entity
- Description: The parsed, normalized representation of one published piece of content at one URL.
- Ownership purity: Pipeline-only writes; no user writes. Two field groups with different evolution: immutable content (url, title, body, published_at, source_id, content_hash) vs. mutable metadata (category_labels, cluster_id, engagement_count). The mutable metadata is written by different sub-pipelines. This is a tension — but both field groups live naturally on the same entity and share the same natural key (URL). Accept: the immutable content is set once on insert; metadata fields are updated independently by labeled sub-pipelines.
- Evolution purity: Content = append-only (set once, immutable). Metadata = overwrite. Mixed, but the entity is the natural unit of reference. Accept.
- Ordering purity: Total order by published_at within source_id; by ingested_at within the system.
- Non-derivability: The parsed text, title, and URL are primary — they come from the external world and cannot be derived internally. Pass.
- Verdict: Valid primary object.
ArticleFetch #
- Class: Event
- Description: The raw result of one crawl attempt — raw HTML/RSS XML, HTTP status, fetch timestamp.
- Ownership purity: CrawlWorker only writes.
- Evolution purity: Append-only and immutable.
- Ordering purity: Total order by fetched_at within source_id.
- Non-derivability: The raw bytes from the external source are the provenance record; they cannot be reconstructed. Pass.
- Verdict: Valid primary object (event class). Used as input to the parse pipeline and for archival.
StoryCluster #
- Class: Stable entity (with merge-based evolution)
- Description: A group of Articles that cover the same real-world story, from different sources (e.g., CNN + BBC + Reuters all cover the same plane crash). Cluster membership is determined by near-duplicate similarity.
- Ownership purity: System pipeline only — the clustering pipeline writes cluster assignments.
- Evolution purity: Merge — new articles can join an existing cluster; clusters can merge if two previously separate clusters are found to be about the same story. Commutative: adding article A then B is the same as adding B then A.
- Ordering purity: No meaningful ordering among cluster members; the cluster has a canonical_article_id pointing to the most authoritative source.
- Non-derivability: The cluster identity and membership cannot be derived without running the near-duplicate algorithm. The result of MinHash/LSH is not mechanically derivable from Article content without the algorithm state (band signatures, hash functions). Accept.
- Verdict: Valid primary object.
ReadEvent #
- Class: Event
- Description: User reads an article. Immutable point-in-time record.
- Ownership purity: Single writer (the user, via the client/API layer).
- Evolution purity: Append-only, immutable.
- Ordering purity: Total order by timestamp within user_id.
- Non-derivability: The fact that user U read article A at time T is primary. Cannot be reconstructed. Pass.
- Verdict: Valid primary object.
UserInterest #
- Class: Stable entity
- Description: The current interest vector for a user — a weighted bag of topics/categories derived from their reading history. Used as input to feed ranking.
- Ownership purity: System pipeline only (the interest-update pipeline writes after each ReadEvent). Users never write directly.
- Evolution purity: Overwrite — always reflects the latest computed interest vector, not a history.
- Ordering purity: No meaningful ordering.
- Non-derivability: Could theoretically be reconstructed by replaying all ReadEvents through the interest model. However, as an online-updated vector (exponential moving average with decay), the current vector is the operational state; full reconstruction would be extremely expensive and model-dependent. Accept as primary state (the current vector is the operational source of truth).
- Verdict: Valid primary object.
Derived / Rejected Objects #
| Candidate | Problem | Resolution |
|---|---|---|
| TrendingScore | Derivable from ReadEvents in sliding window | Derived view. Computed by Flink windowing job; stored in Redis sorted set with TTL. Not primary state. |
| PersonalizedFeed | Derivable from Article × UserInterest × ranking model | Derived view (materialized). CQRS read model; rebuilt on UserInterest update. Not primary state. |
| SearchIndex | Projection of Article content into inverted index | Derived view. Elasticsearch index maintained via CDC from Postgres Article table. Not primary state. |
| CategoryFeed | Derivable from Article filtered by category | Derived view. Can be served by querying Postgres/Elasticsearch with category filter. Not materialized separately. |
| TopicFeed | Derivable from Article filtered by topic | Derived view. Same as CategoryFeed. |
Step 3 — Axis Assignment #
For every primary object, define ownership, evolution, and ordering.
Object: Source
Ownership: Multi-writer: admin sets registration config; pipeline updates last_crawled_at.
In practice: admin is the authority on URL/interval; pipeline is authority on crawl timestamps.
No contention between admin and pipeline (different fields, admin changes rare).
Evolution: Overwrite (current registration state is the truth; last_crawled_at replaced after each crawl)
Ordering: No meaningful order among sources
Object: CrawlSchedule
Ownership: System-only (scheduler writes next_crawl_at; crawl worker updates it after completion)
Evolution: Overwrite (one next_crawl_at per source_id; replaced after each crawl completes)
Ordering: No meaningful order among schedule entries; within a source, each entry supersedes the last
Object: Article
Ownership: Multi-writer pipeline (crawler writes base record; classifier pipeline writes category_labels;
clustering pipeline writes cluster_id; engagement pipeline writes engagement_count)
Each field has single-pipeline authority. No two pipelines write the same field.
Evolution: Content fields: append-only (set on first insert, never modified)
Metadata fields: overwrite (each pipeline overwrites its own field)
Ordering: Total order by published_at within source_id (for browsing)
Total order by ingested_at across all articles (for recency feed)
Object: ArticleFetch
Ownership: Single writer per partition (crawl worker writes; one worker per source)
Evolution: Append-only (immutable after creation)
Ordering: Total order by fetched_at within source_id
Object: StoryCluster
Ownership: System-only pipeline (clustering pipeline — MinHash/LSH job)
Evolution: Merge (articles join clusters; clusters may merge — commutative, associative operation)
Ordering: No meaningful order among cluster members; cluster itself ordered by article count desc for display
Object: ReadEvent
Ownership: Single writer (the requesting user, via API)
Evolution: Append-only (immutable)
Ordering: Total order by timestamp within user_id
Object: UserInterest
Ownership: System-only (interest-update pipeline writes after each ReadEvent batch)
Evolution: Overwrite (current interest vector replaces previous; EMA-based decay)
Ordering: No meaningful order (set of weighted topics)
Step 4 — Invariant Extraction #
Precise, testable properties that must hold in every valid execution.
I-1: Article URL Uniqueness (Uniqueness/Idempotency) #
For every valid system state, there is at most one Article record with any given url. If the same URL is fetched N times, exactly one Article record exists.
Scope: System-wide (url is globally unique)
Implication: Insert of Article must be INSERT ... ON CONFLICT (url) DO NOTHING or equivalent upsert. The URL is the natural idempotency key.
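A minimal sketch of this idempotent insert, using SQLite in place of Postgres (SQLite 3.24+ accepts the same ON CONFLICT upsert syntax); the two-column schema is illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE article (url TEXT PRIMARY KEY, title TEXT)")

def ingest(url, title):
    # The unique key on url absorbs retries: a second insert is a no-op.
    conn.execute(
        "INSERT INTO article (url, title) VALUES (?, ?) "
        "ON CONFLICT (url) DO NOTHING",
        (url, title),
    )

for _ in range(3):  # same URL fetched three times
    ingest("https://example.com/story", "Story headline")

count = conn.execute("SELECT COUNT(*) FROM article").fetchone()[0]
print(count)  # exactly one Article record exists
```

The URL acts as the natural idempotency key, so crawl retries need no coordination.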
I-2: Article Content Immutability (Ordering) #
Once an Article’s content_hash, title, body, and source_id fields are set, they are never modified. Only metadata fields (category_labels, cluster_id, engagement_count) may be updated after initial insert.
Scope: Per article (url)
Implication: Write path for content must be insert-once. Metadata update paths must not touch content columns.
I-3: CrawlSchedule Freshness (Eligibility) #
A source S must not be fetched again before next_crawl_at(S). Equivalently, any ArticleFetch for source S must have fetched_at >= CrawlSchedule.next_crawl_at for that source.
Scope: Per source_id
Implication: Scheduler must enforce the crawl interval. If a crawl is triggered early (e.g., by an operator), the CrawlSchedule must be updated atomically before the crawl fires.
I-4: Deduplication — Near-Duplicate Articles Map to One Cluster (Uniqueness) #
For any two articles A1 and A2 whose MinHash similarity exceeds threshold θ (= 0.85), they must share the same
cluster_id. No two articles covering the same real-world story should have different cluster_ids after the clustering pipeline completes.
Scope: Approximate — this is a best-effort invariant, not an atomic one; the clustering pipeline may run with some lag.
Implication: The invariant is probabilistic (MinHash is approximate). Accept: the system provides near-dedup with a bounded false-negative rate, not perfect dedup.
I-5: ReadEvent Idempotency (Uniqueness/Idempotency) #
If user U submits a read event for article A multiple times (due to retry or double-tap), at most one ReadEvent is recorded per (user_id, article_id, session_id).
Scope: Per (user_id, article_id, session_id)
Implication: ReadEvent insert must dedup by (user_id, article_id, session_id). Use session_id as the idempotency key.
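A sketch of consumer-side dedup on that composite key, using a plain in-memory set as a stand-in for the unique index / stream keyed state (all names illustrative):

```python
seen = set()
recorded = []

def record_read(user_id, article_id, session_id):
    # (user_id, article_id, session_id) is the idempotency key:
    # retries and double-taps within one session collapse to one event.
    key = (user_id, article_id, session_id)
    if key in seen:
        return False  # duplicate: dropped
    seen.add(key)
    recorded.append(key)
    return True

record_read("u1", "a9", "s1")  # first tap: recorded
record_read("u1", "a9", "s1")  # double-tap: dropped
record_read("u1", "a9", "s2")  # new session: recorded

print(len(recorded))  # 2
```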
I-6: TrendingScore Reflects Recent ReadEvents (Propagation) #
The TrendingScore for article A at time T reflects all ReadEvents for A in the window [T − W, T], where W = 1 hour. Staleness bound: ε ≤ 30 seconds.
Scope: Per article_id, per window
Implication: Stream aggregation must process ReadEvents within 30 seconds. The score is approximate (Count-Min Sketch acceptable).
I-7: PersonalizedFeed Reflects Current UserInterest (Propagation) #
The PersonalizedFeed for user U reflects the UserInterest vector as of at most 5 minutes ago. If UserInterest has not changed in 24 hours, the feed may be stale by up to 24 hours (no reads = no updates).
Scope: Per user_id
Implication: Feed materialization must be triggered within 5 minutes of a UserInterest update. TTL-based cache invalidation is acceptable.
I-8: SearchIndex Reflects Article Table (Propagation) #
The SearchIndex (Elasticsearch) contains every Article that has been in the Postgres Article table for more than 60 seconds. No Article is permanently absent from the SearchIndex once ingested.
Scope: System-wide
Implication: CDC from Postgres to Elasticsearch must have guaranteed delivery and must handle reindex failures with retry.
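The retry-plus-dead-letter behavior can be sketched with a stubbed indexing call (function and variable names are hypothetical, not Debezium or Elasticsearch APIs):

```python
dead_letter = []

def index_with_retry(doc, index_fn, max_attempts=3):
    # Retry transient failures; park persistent ones on the DLQ
    # so the CDC consumer can keep making progress.
    for _ in range(max_attempts):
        try:
            index_fn(doc)
            return True
        except ConnectionError:
            continue  # transient: retry (production adds backoff)
    dead_letter.append(doc)
    return False

calls = {"n": 0}
def flaky_index(doc):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("ES unavailable")  # recovers on third attempt

ok = index_with_retry({"url": "https://example.com/a"}, flaky_index)

def dead_index(doc):
    raise ConnectionError("ES unavailable")  # never recovers

ok2 = index_with_retry({"url": "https://example.com/b"}, dead_index)
print(ok, ok2, len(dead_letter))  # True False 1
```

The dead-letter queue keeps I-8's "no Article permanently absent" property checkable: parked documents can be replayed after the outage.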
I-9: UserInterest Updated on Every ReadEvent Batch (Accounting) #
The UserInterest vector for user U reflects all ReadEvents for U with timestamp ≤ T − 5 minutes. The interest model must process every ReadEvent exactly once (no duplicates, no drops).
Scope: Per user_id
Implication: Exactly-once processing of ReadEvents into UserInterest. Dedup by (user_id, article_id, session_id) before applying to the interest model.
I-10: Access Control — Users Read Own Data (Access Control) #
User U may read only their own ReadEvents, UserInterest, and PersonalizedFeed. Category and topic feeds are public. Article content is public. Trending scores are public.
Scope: Per user_id
Implication: All user-scoped read endpoints must authenticate and authorize by user_id. Shared/public endpoints require no user scoping.
Step 5 — DP Derivation #
Derive the minimal enforcing mechanism per invariant cluster.
| Invariant | Cluster | Minimal DP |
|---|---|---|
| I-1 (URL uniqueness) | Uniqueness | Unique index on Article.url in Postgres; INSERT ON CONFLICT DO NOTHING in crawl pipeline |
| I-2 (content immutability) | Ordering | Write path for content uses insert-only; metadata update queries restrict to metadata columns only (UPDATE article SET category_labels = ? WHERE url = ?) |
| I-3 (crawl freshness) | Eligibility | Scheduler reads CrawlSchedule.next_crawl_at; fires crawl only when now() >= next_crawl_at; updates next_crawl_at = now() + crawl_interval atomically after dispatch |
| I-4 (near-dedup clustering) | Uniqueness (approximate) | MinHash signatures computed on Article ingest; LSH lookup for candidate near-duplicates; cluster merge via pipeline-side union-find; cluster_id written back to Article |
| I-5 (ReadEvent idempotency) | Uniqueness/Idempotency | Unique index on ReadEvent(user_id, article_id, session_id); Kafka consumer uses event_id as dedup key before DB insert |
| I-6 (trending freshness) | Propagation + Accounting | Flink sliding window (1h / 30s slide); Count-Min Sketch per article; result written to Redis sorted set with TTL |
| I-7 (feed freshness) | Propagation | UserInterest update triggers feed recompute; Redis cache key feed:{user_id} with 5-minute TTL; feed service falls back to recompute on cache miss |
| I-8 (search freshness) | Propagation | Debezium CDC from Postgres Article table → Kafka topic → Elasticsearch sink; Kafka consumer retries on ES failure; dead-letter queue for persistent failures |
| I-9 (interest accounting) | Accounting + Idempotency | Flink job consumes ReadEvent Kafka topic; dedup state keyed by (user_id, article_id, session_id); exactly-once semantics with Flink checkpointing + Kafka transactional producer to UserInterest update topic |
| I-10 (access control) | Access Control | API gateway enforces JWT authentication; user-scoped endpoints validate user_id from token matches path parameter |
Step 6 — Mechanism Selection #
The mechanical bridge from invariants to concrete implementation decisions.
6.1 — Invariant Type → Mechanism Family #
| Invariant Type | Mechanism Family |
|---|---|
| Eligibility (I-3) | Optimistic scheduler with compare-and-update on next_crawl_at |
| Ordering (I-2) | Insert-once + column-restricted update paths |
| Accounting (I-6, I-9) | Stream aggregation (Flink) + windowing |
| Uniqueness/Idempotency (I-1, I-5) | Natural key unique index + ON CONFLICT; dedup by event_id in stream |
| Propagation (I-6, I-7, I-8) | CDC (Debezium) + cache TTL + stream aggregation |
| Access Control (I-10) | API gateway JWT validation |
6.2 — Ownership × Evolution → Concurrency Mechanism #
| Object | Ownership | Evolution | Concurrency Mechanism |
|---|---|---|---|
| Source | Multi-writer (admin + pipeline, different fields) | Overwrite | No coordination needed — fields are non-overlapping; each writer owns its columns |
| CrawlSchedule | System-only | Overwrite | Single writer per source_id partition → no coordination needed |
| Article (content) | Pipeline only | Append-only (insert once) | Unique index enforces single-insert; ON CONFLICT DO NOTHING for retries |
| Article (metadata) | Multi-pipeline, each field owned by one pipeline | Overwrite | No coordination needed — each pipeline owns distinct columns |
| ArticleFetch | Single writer per source | Append-only | No contention mechanism needed; dedup required (event_id as idempotency key) |
| StoryCluster | System-only pipeline | Merge (commutative) | CRDT-compatible: union-find cluster merge is commutative and associative; pipeline applies merges without coordination |
| ReadEvent | Single writer (user) | Append-only | No contention; dedup by (user_id, article_id, session_id) via unique index |
| UserInterest | System-only pipeline | Overwrite | Single writer per user_id partition → no coordination needed |
6.3 — Q1-Q5 Full Derivation for Key Flows #
Flow 1: Article Deduplication and Story Clustering #
Q1 — Scope: The dedup check (URL uniqueness) is within-service (single Postgres instance per shard). Near-duplicate clustering (StoryCluster) is within-service but batched.
Q2 — Failure: CrawlWorker may crash after fetching but before writing to Postgres. The ArticleFetch event is published to Kafka first; Postgres write is downstream. At-least-once delivery from Kafka → dedup by URL (Postgres unique index absorbs retries). Slow degradation (Postgres down): Circuit Breaker on Postgres write path; crawler buffers in Kafka.
Q3 — Data: URL is content-addressed (URL is the natural hash = idempotency key for free). Near-duplicate content: MinHash signature is a commutative, order-independent summary of the article’s tokens → stream aggregation (signature computed at ingest; LSH lookup in Redis or in-memory index).
Q4 — Access: Dedup is a write-path concern, not a read concern.
Q5 — Coupling: CrawlWorker publishes ArticleFetch event to Kafka (Outbox-equivalent pattern: Kafka acts as the durable log). Parse-and-dedup pipeline consumes from Kafka → writes to Postgres. Kafka decouples the crawler from the database.
Mechanism selected: Pipeline pattern (CrawlWorker → Kafka → ParseWorker → Postgres). URL dedup via INSERT ON CONFLICT (url) DO NOTHING. Near-duplicate clustering via MinHash + LSH: compute 128-band MinHash signatures at parse time, store in Redis sorted set indexed by LSH band, look up candidates, compute Jaccard similarity, assign or create cluster via pipeline-side union-find, write cluster_id back to Article in a separate update.
Dedup + stream aggregation always paired: ArticleFetch events on Kafka are consumed with at-least-once delivery (Kafka consumer group with committed offsets). Before the Postgres insert, the URL unique index absorbs duplicates; event_id dedup in the consumer prevents double-processing on consumer restart, giving effectively-once processing.
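A toy sketch of the near-duplicate path — MinHash signatures, LSH banding, union-find merge — with deliberately small parameters (16 hash functions in 8 bands of 2 rows, versus the 128-band production signature above); the hash family and thresholding are simplified, and the article texts are illustrative:

```python
import hashlib
from collections import defaultdict

NUM_HASHES = 16  # toy signature size
BANDS = 8        # 8 bands x 2 rows each

def tokens(text):
    return set(text.lower().split())

def minhash(toks):
    # Slot i of the signature: min over token hashes under hash function i.
    return [
        min(int(hashlib.sha1(f"{i}:{t}".encode()).hexdigest(), 16) for t in toks)
        for i in range(NUM_HASHES)
    ]

# Union-find: cluster merges are commutative and associative.
parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

buckets = defaultdict(list)  # LSH band key -> article ids seen in that band

def ingest(article_id, text):
    sig = minhash(tokens(text))
    rows = NUM_HASHES // BANDS
    for b in range(BANDS):
        band_key = (b, tuple(sig[b * rows:(b + 1) * rows]))
        for candidate in buckets[band_key]:
            union(article_id, candidate)  # shared band => likely near-duplicate
        buckets[band_key].append(article_id)

ingest("cnn-1", "plane crash in mountains rescue underway tonight")
ingest("bbc-7", "plane crash in mountains rescue underway tonight update")
ingest("cooking-3", "ten easy pasta recipes for weeknight dinners")

print(find("cnn-1") == find("bbc-7"))      # near-duplicates share a cluster
print(find("cnn-1") == find("cooking-3"))  # unrelated article does not
```

Because union is order-independent, the clustering pipeline can apply merges without coordination, matching the merge-evolution classification of StoryCluster.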
Flow 2: Trending Score Computation #
Q1 — Scope: TrendingScore is a derived view. The source of truth is the ReadEvent stream. Trending is global (not per-user). Within-service scope (Flink job consuming from Kafka).
Q2 — Failure: Flink job crashes → restores from checkpoint; ReadEvents are replayed from Kafka offset. At-least-once delivery → dedup by event_id in Flink state before counting. Slow degradation: Count-Min Sketch is naturally tolerant of small error; trending score degrades gracefully.
Q3 — Data: ReadEvents are time-windowed (1-hour sliding window with 30-second slide). Pattern: Windowing. High cardinality (millions of articles): Count-Min Sketch per window for approximate counting. Temporal Decay: older reads within the window count equally (sliding window handles this); could add exponential decay weight for recency bias.
Q4 — Access: Read » write for trending (millions of reads, one write path per 30s slide). Pattern: Cache-Aside. Trending scores are written to Redis sorted set (trending:{window_start_epoch}) with TTL = 2 * window_slide = 60s. API reads from Redis. On Redis miss, fall back to Flink-managed state (or return stale score).
Q5 — Coupling: ReadEvent → Kafka → Flink (sliding window job) → Redis. Flink writes to Redis asynchronously every slide interval. No synchronous coupling between read event and trending update.
Mechanism selected: Flink sliding window (1 hour / 30 second slide). Count-Min Sketch per article per window (width=2048, depth=5 — error ε=0.001, probability δ=0.03). Result materialized to Redis sorted set ZADD trending:{epoch} {score} {article_id} with EXPIRE 60. Stream aggregation + dedup per event (Flink keyed state with event_id dedup; TTL 2 hours to bound state size).
Required combination: stream aggregation is always paired with per-event dedup — per framework rule 6.4. Flink maintains a Set&lt;event_id&gt; per article_id in keyed state, with a 2-hour TTL to bound state size, for dedup before counting.
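A minimal Count-Min Sketch illustrating the graceful-degradation claim: hash collisions can only inflate a count, never deflate it, so the estimate never under-counts (toy width/depth here, not the production 2048×5):

```python
import hashlib

class CountMinSketch:
    def __init__(self, width=64, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _slots(self, key):
        for d in range(self.depth):
            h = hashlib.sha1(f"{d}:{key}".encode()).hexdigest()
            yield d, int(h, 16) % self.width

    def add(self, key, count=1):
        for d, i in self._slots(key):
            self.table[d][i] += count

    def estimate(self, key):
        # Min across rows: collisions only inflate, so this upper-bounds
        # the true count while using O(width * depth) memory total.
        return min(self.table[d][i] for d, i in self._slots(key))

cms = CountMinSketch()
for _ in range(500):
    cms.add("article-42")    # hot article: 500 reads in the window
for a in range(100):
    cms.add(f"article-{a}")  # long tail: 1 read each

print(cms.estimate("article-42") >= 500)  # True: never under-counts
```

Memory is independent of the number of distinct articles, which is why the sketch fits millions of article_ids per window.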
Flow 3: Personalized Feed Generation #
Q1 — Scope: PersonalizedFeed is per-user. UserInterest is per-user. Article metadata is global. Cross-service: UserInterest update (stream pipeline) triggers feed recompute (feed service). Scope is cross-service → Saga pattern for coordination? No — feed recompute is not a transaction; it’s an async triggered computation. Pattern: Outbox + Relay (UserInterest update → event on Kafka → feed recompute service).
Q2 — Failure: UserInterest update fails → feed uses stale interest vector (acceptable per I-7: 5-minute staleness bound). Feed recompute service crashes → cache TTL expires → next request triggers recompute. No data loss.
Q3 — Data: Read shape ≠ write shape: write model is ReadEvent stream; read model is ranked list of articles. Pattern: CQRS. Write side: Flink consumes ReadEvents, updates UserInterest vector (EMA update). Read side: Feed service reads UserInterest + Article metadata, computes ranked list, caches in Redis. Materialized View: pre-computed feed stored in Redis. Time-bounded: TTL 5 minutes on cached feed.
Q4 — Access: Read » write. Millions of feed reads per second; one interest update per user per session. Pattern: CQRS + Materialized View. Feed reads hit Redis cache. Cache miss triggers synchronous recompute (top-K articles by dot product with interest vector, from candidate set fetched from Elasticsearch/Postgres by recency). Rebuild path: on UserInterest update, invalidate cache key feed:{user_id} and async-enqueue recompute job.
Q5 — Coupling: UserInterest update is written to Postgres AND publishes event to Kafka (CDC via Debezium or explicit outbox pattern). Feed recompute service consumes from Kafka, recomputes feed, writes to Redis. Async decoupling between interest update and feed availability.
Circuit topology: The personalized feed is a closed-loop feedback system:
- Plant: Article ranking function (dot product of UserInterest vector with Article topic vector)
- Feedback signal: ReadEvent stream (user reads article A → increases weight of A’s topics in UserInterest)
- Error signal: Difference between predicted interest (current interest vector) and observed behavior (what user actually reads)
- Controller: EMA interest update (learning rate α = 0.1, decay γ = 0.95 per day)
- Stability: The system is stable because α < 1 and γ < 1. Without decay, the interest vector could be dominated by old readings (integrator windup). Decay is the RC filter preventing windup.
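The stability claim is easy to check numerically — a sketch of the EMA update with the α and γ above, driving the loop with one identical read per day and confirming the topic weight converges to a bound instead of winding up (topic names and vectors are illustrative):

```python
ALPHA = 0.1   # learning rate per read
GAMMA = 0.95  # decay factor applied once per day

def apply_read(interest, article_topics):
    # EMA update: blend the read article's topic vector into the interest vector.
    for topic, weight in article_topics.items():
        interest[topic] = (1 - ALPHA) * interest.get(topic, 0.0) + ALPHA * weight

def apply_daily_decay(interest):
    # Decay is the "RC filter": without it, weights would integrate toward 1
    # and old readings would dominate forever (integrator windup).
    for topic in interest:
        interest[topic] *= GAMMA

interest = {}
for day in range(365):
    apply_read(interest, {"politics": 1.0})  # one political read, every day
    apply_daily_decay(interest)

# Fixed point: x = ((1 - a)x + a) * g  =>  x = a*g / (1 - (1 - a)*g) ~= 0.655
print(round(interest["politics"], 3))  # 0.655
```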
Mechanism selected: CQRS + Materialized View + Cache-Aside. Write path: Flink interest-update job (ReadEvent → EMA update → Postgres UserInterest). Read path: Feed service reads Redis cache; miss → compute top-K via ANN (Approximate Nearest Neighbor) search of Article topic vectors against UserInterest vector using FAISS or Elasticsearch KNN; write result to Redis with 5-minute TTL. Rebuild path on cache miss or UserInterest update.
Required combination (6.4): CQRS + Materialized View requires explicit rebuild path — implemented as: (a) cache TTL expiry triggers lazy rebuild on next request, (b) UserInterest update event triggers proactive cache invalidation and async rebuild via Kafka consumer.
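A sketch of the read path's ranking step on a cache miss — exact top-K by dot product, as a simplified stand-in for the FAISS/Elasticsearch KNN search (article IDs and topic vectors are illustrative):

```python
import heapq

def dot(u, v):
    # Sparse dot product over topic weights.
    return sum(u.get(topic, 0.0) * w for topic, w in v.items())

def rank_feed(interest, candidates, k=2):
    # candidates: recent articles with their topic vectors
    # (e.g., a recency-bounded candidate set from Elasticsearch/Postgres).
    scored = ((dot(interest, topics), aid) for aid, topics in candidates.items())
    return [aid for _, aid in heapq.nlargest(k, scored)]

interest = {"tech": 0.8, "politics": 0.3}
candidates = {
    "a1": {"tech": 0.9},                   # score 0.72
    "a2": {"politics": 1.0},               # score 0.30
    "a3": {"sports": 1.0},                 # score 0.00
    "a4": {"tech": 0.5, "politics": 0.5},  # score 0.55
}
print(rank_feed(interest, candidates))  # ['a1', 'a4']
```

The result is what gets written to the Redis feed cache with the 5-minute TTL.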
Flow 4: Crawl Scheduling (The Scheduler Pattern) #
Q1 — Scope: CrawlSchedule is within-service. No cross-service coordination needed. One scheduler process per region.
Q2 — Failure: Scheduler crashes → on restart, reads all CrawlSchedule rows with next_crawl_at <= now() and re-fires pending crawls. At-least-once crawl delivery → ArticleFetch dedup by (source_id, fetched_at window) prevents duplicate article processing. Slow degradation: if crawl workers are overwhelmed, scheduler backs off via exponential backoff on next_crawl_at.
Q3 — Data: CrawlSchedule is an intent/future constraint. The pattern: scheduled jobs are a special class of state where the intent (crawl this source every N minutes) must persist across restarts. The scheduler is a polling loop: SELECT * FROM crawl_schedule WHERE next_crawl_at <= NOW() AND status = 'PENDING' LIMIT 100 FOR UPDATE SKIP LOCKED.
FOR UPDATE SKIP LOCKED is the critical mechanism: allows multiple scheduler workers to claim rows without contention. Each worker claims a batch, fires crawl jobs, updates next_crawl_at = NOW() + crawl_interval.
Q4 — Access: Scheduler is internal only. No external read concern.
Q5 — Coupling: Scheduler writes CrawlTrigger event to Kafka. CrawlWorker pool consumes from Kafka, fetches source, publishes ArticleFetch event. This decouples scheduler throughput from crawl worker throughput.
Mechanism selected: Database-backed scheduler using FOR UPDATE SKIP LOCKED on crawl_schedule table. Scheduler daemon runs in loop: poll every 10 seconds, claim batch of due sources, publish to Kafka crawl-triggers topic, update next_crawl_at. Multiple scheduler instances are safe (SKIP LOCKED prevents double-claiming). CrawlWorker pool (auto-scaled) consumes crawl-triggers, fetches source, publishes to article-fetches Kafka topic.
No contention mechanism needed at the crawl worker level (pipeline ownership — system-only writes, single writer per source_id because crawl_schedule row is locked before dispatch). Dedup required: ArticleFetch events carry fetch_id = sha256(source_id + fetched_at_epoch_bucket) as idempotency key.
Step 7 — Axiomatic Validation #
Source-of-truth table: no dual truth.
| State | Primary Store | Derived Stores | Allowed Staleness | Rebuild Mechanism |
|---|---|---|---|---|
| Source configuration | Postgres source table | None | — | — |
| CrawlSchedule | Postgres crawl_schedule table | None | — | — |
| Article content (immutable) | Postgres article table | S3 (raw HTML archive) | S3 is archive, not truth | — |
| Article metadata (category_labels, cluster_id) | Postgres article table | SearchIndex (ES), CategoryFeed cache | ES: 60s; Feed cache: 5min | CDC replay; cache TTL |
| ArticleFetch (raw HTML) | S3 (primary archive) + Kafka (transient) | None | Kafka is transit, not truth | S3 is the archive |
| StoryCluster membership | Postgres story_cluster + article.cluster_id | None | — | Re-run clustering pipeline on Article table |
| ReadEvent | Postgres read_event table (partitioned) | Kafka (transit) | Kafka is transit | — |
| UserInterest | Postgres user_interest table | Redis feed cache | 5 min | Recompute from ReadEvent log |
| TrendingScore | Redis sorted set (TTL 60s) | None (ephemeral) | 30s | Recompute by replaying ReadEvent stream from Kafka |
| PersonalizedFeed | Redis cache (TTL 5min) | None | 5 min | Recompute from UserInterest + Article table |
| SearchIndex | Elasticsearch | None (ES is the index) | 60s lag from Postgres | Full reindex from Postgres |
Dual-truth risks identified and mitigated:
Risk: Article metadata in Postgres vs. Elasticsearch could diverge. Mitigation: ES is explicitly classified as a derived projection. CDC (Debezium) guarantees eventual delivery. ES is never written directly — only via CDC pipeline.
Risk: UserInterest could be dual-written by two concurrent interest-update pipeline workers for the same user. Mitigation: Flink partitions by user_id — exactly one Flink task per user_id; no concurrent writes. Postgres update uses WHERE user_id = ? AND version = ? (optimistic lock) for safety.
Risk: TrendingScore in Redis vs. raw ReadEvent count in Postgres. Mitigation: TrendingScore is explicitly a derived view; it is not reconciled against Postgres ReadEvent counts in real time. The Count-Min Sketch is the operational truth for trending; accuracy is bounded by sketch error parameters.
Step 8 — Algorithm Design #
Pseudocode for every write path and state machine.
8.1 — Crawl Pipeline (Scheduler → Fetch → Parse → Ingest) #
// Scheduler daemon (runs every 10 seconds)
function scheduler_tick():
BEGIN TRANSACTION
sources = SELECT * FROM crawl_schedule
WHERE next_crawl_at <= NOW()
AND status = 'PENDING'
ORDER BY next_crawl_at ASC
LIMIT 100
FOR UPDATE SKIP LOCKED
for each source in sources:
trigger = {
source_id: source.source_id,
url: source.url,
fetch_type: source.feed_type, // RSS | API | SCRAPE
fetch_id: sha256(source.source_id + floor(now() / 60)) // idempotency key
}
kafka_publish('crawl-triggers', key=source.source_id, value=trigger)
UPDATE crawl_schedule
SET status = 'IN_FLIGHT', last_triggered_at = NOW()
WHERE source_id = source.source_id
COMMIT
// CrawlWorker (Kafka consumer, auto-scaled pool)
function crawl_worker(trigger):
// Idempotency check
if redis_setnx('crawl:in-flight:' + trigger.fetch_id, 1, TTL=600):
raw_content = http_fetch(trigger.url) // with timeout 30s, Circuit Breaker
s3_key = 'raw/' + trigger.source_id + '/' + trigger.fetch_id + '.html'
s3_put(s3_key, raw_content)
fetch_event = {
fetch_id: trigger.fetch_id,
source_id: trigger.source_id,
fetched_at: now(),
s3_key: s3_key,
http_status: raw_content.status,
content_length: len(raw_content.body)
}
kafka_publish('article-fetches', key=trigger.source_id, value=fetch_event)
    // Update crawl schedule (crawl_interval lives on the source table)
    UPDATE crawl_schedule
    SET status = 'PENDING',
        next_crawl_at = NOW() + (SELECT crawl_interval FROM source
                                 WHERE source_id = trigger.source_id),
        last_crawled_at = NOW()
    WHERE source_id = trigger.source_id
// ParseWorker (Kafka consumer)
function parse_worker(fetch_event):
if fetch_event.http_status != 200:
emit_metric('crawl.fetch_error', {source_id: fetch_event.source_id})
return
raw_html = s3_get(fetch_event.s3_key)
articles_raw = parse_feed(raw_html, fetch_event.source_id) // RSS/Atom/HTML
for each raw_article in articles_raw:
url = normalize_url(raw_article.url)
content_hash = sha256(raw_article.title + raw_article.body)
// URL uniqueness enforced by DB
result = INSERT INTO article (url, title, body, content_hash, published_at, source_id, ingested_at)
VALUES (url, ...)
ON CONFLICT (url) DO NOTHING
RETURNING article_id
if result.rows_affected > 0:
// New article — enqueue for classification and clustering
kafka_publish('new-articles', key=url, value={article_id, content_hash, url})
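parse_worker relies on normalize_url to turn each article URL into a stable dedup key. A minimal sketch, under the assumption that normalization means lowercasing the host, dropping the fragment, stripping common tracking parameters, and trimming trailing slashes (the exact rule set is illustrative; the document does not specify it):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Illustrative blocklist; a production list would be larger and curated.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "utm_term",
                   "utm_content", "fbclid", "gclid"}

def normalize_url(url: str) -> str:
    parts = urlsplit(url.strip())
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k not in TRACKING_PARAMS]
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path.rstrip("/") or "/",
        urlencode(query),
        "",  # drop the fragment — it never changes the document
    ))
```

The point of the exercise: two syndicated links to the same article (one with `utm_*` noise, one without) must collapse to one key, or the ON CONFLICT (url) dedup silently stops working.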
8.2 — Article Classification Pipeline #
// ClassifierWorker (Kafka consumer, reads 'new-articles')
function classify_worker(new_article_event):
article = fetch_article_content(new_article_event.article_id)
category_labels = ml_classify(article.title + ' ' + article.body[:2000])
// Returns list of (category, confidence) pairs
UPDATE article
SET category_labels = category_labels,
classified_at = NOW()
WHERE article_id = new_article_event.article_id
kafka_publish('classified-articles', key=new_article_event.article_id, value={
article_id, category_labels
})
8.3 — Story Clustering (Near-Duplicate Detection) #
// ClusterWorker (Kafka consumer, reads 'classified-articles')
function cluster_worker(classified_event):
article = fetch_article(classified_event.article_id)
// Step 1: Compute MinHash signature
tokens = shingle(article.title + ' ' + article.body, k=3) // 3-gram shingles
signature = minhash(tokens, num_hashes=128)
// Step 2: LSH candidate lookup
// Divide 128 hashes into 16 bands of 8 hashes each
// Each band → one LSH bucket key
candidates = []
for band_idx in range(16):
band = signature[band_idx*8 : (band_idx+1)*8]
bucket_key = 'lsh:band:' + band_idx + ':' + hash(band)
bucket_members = redis_smembers(bucket_key)
candidates.extend(bucket_members)
redis_sadd(bucket_key, article.article_id)
redis_expire(bucket_key, 7 * 86400) // 7-day TTL
// Step 3: Compute exact Jaccard similarity for candidates
// (to filter LSH false positives)
best_match = None
best_similarity = 0
for candidate_id in set(candidates):
candidate_sig = get_minhash_signature(candidate_id) // from Redis or DB
similarity = jaccard_estimate(signature, candidate_sig)
if similarity > 0.85 and similarity > best_similarity:
best_match = candidate_id
best_similarity = similarity
// Step 4: Assign cluster
if best_match is None:
// Create new cluster — this article is the canonical
new_cluster_id = uuid()
INSERT INTO story_cluster (cluster_id, canonical_article_id, created_at)
VALUES (new_cluster_id, article.article_id, NOW())
cluster_id = new_cluster_id
else:
// Join existing cluster
cluster_id = get_cluster_id(best_match)
// If best_match has no cluster (race condition), create one
if cluster_id is None:
cluster_id = uuid()
INSERT INTO story_cluster ... (same as above for best_match)
UPDATE article SET cluster_id = cluster_id WHERE article_id = article.article_id
store_minhash_signature(article.article_id, signature) // Redis, TTL 7 days
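The MinHash/LSH steps above condense into a short runnable sketch. This version uses word-level shingles and a seeded SHA-1 hash family for brevity; the production pipeline's character shingling and hash family may differ:

```python
import hashlib
from typing import List, Set

NUM_HASHES = 128
BANDS, ROWS = 16, 8   # 16 bands x 8 rows = 128 hashes, as in the pseudocode

def shingle(text: str, k: int = 3) -> Set[str]:
    # Word-level k-shingles (an assumption; character shingles also work).
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash(tokens: Set[str], num_hashes: int = NUM_HASHES) -> List[int]:
    # Seeded SHA-1 as the hash family; min over tokens for each seed.
    return [min(int.from_bytes(hashlib.sha1(f"{seed}:{t}".encode()).digest()[:8],
                               "big") for t in tokens)
            for seed in range(num_hashes)]

def band_keys(sig: List[int]) -> List[str]:
    # One LSH bucket key per band; equal band -> same bucket -> candidate pair.
    return [f"lsh:band:{b}:{hash(tuple(sig[b * ROWS:(b + 1) * ROWS]))}"
            for b in range(BANDS)]

def jaccard_estimate(sig_a: List[int], sig_b: List[int]) -> float:
    # Fraction of matching minhash positions estimates Jaccard similarity.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Why these band parameters: with b = 16 bands of r = 8 rows, a pair with true Jaccard similarity s becomes a candidate with probability 1 − (1 − s⁸)¹⁶, which is roughly 99% at the 0.85 threshold — so the exact-similarity filter in Step 3 mostly prunes false positives rather than rescuing misses.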
8.4 — ReadEvent Recording and Interest Update #
// API handler: POST /articles/{article_id}/read
function record_read(user_id, article_id, session_id):
event_id = sha256(user_id + article_id + session_id)
// Idempotency: unique index on (user_id, article_id, session_id)
result = INSERT INTO read_event (event_id, user_id, article_id, session_id, timestamp)
VALUES (event_id, user_id, article_id, session_id, NOW())
ON CONFLICT (user_id, article_id, session_id) DO NOTHING
if result.rows_affected > 0:
// Publish to Kafka for downstream consumers
kafka_publish('read-events', key=user_id, value={
event_id, user_id, article_id, session_id, timestamp: NOW()
})
return OK
// Flink job: interest-update-pipeline
// Input: 'read-events' Kafka topic, keyed by user_id
// State: per user_id — Set<event_id> (dedup, TTL 48h), current interest vector
function interest_update_flink(read_event):
// Dedup
if dedup_state.contains(read_event.event_id):
return // duplicate, skip
dedup_state.add(read_event.event_id)
dedup_state.expire(read_event.event_id, TTL=48h)
// Fetch article topic vector
article_topics = fetch_article_topics(read_event.article_id)
// Returns: {sports: 0.8, politics: 0.1, tech: 0.05, ...}
// EMA update of interest vector
current_interest = get_user_interest(read_event.user_id)
// Apply decay for time since last update
time_delta_days = (NOW() - current_interest.last_updated) / 86400
decay = pow(0.95, time_delta_days)
new_interest = {}
for topic in all_topics:
new_interest[topic] = decay * current_interest.vector.get(topic, 0.0)
+ 0.1 * article_topics.get(topic, 0.0)
// Normalize to unit vector
norm = sqrt(sum(v^2 for v in new_interest.values()))
new_interest = {k: v/norm for k, v in new_interest.items()}
// Write to Postgres (optimistic lock)
rows = UPDATE user_interest
SET interest_vector = new_interest,
last_updated = NOW(),
version = version + 1
WHERE user_id = read_event.user_id
AND version = current_interest.version
if rows == 0:
// Conflict — another update raced; re-read and retry
retry()
// Invalidate feed cache
redis_del('feed:' + read_event.user_id)
kafka_publish('interest-updated', key=read_event.user_id, value={user_id, timestamp})
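The decay-plus-EMA update is easy to get subtly wrong (decay must apply before blending, and renormalization comes last). A minimal sketch of the pure math, using the Step 16 defaults (α = 0.10, γ = 0.95/day):

```python
import math
from typing import Dict

ALPHA = 0.10   # EMA learning rate (Step 16 default)
GAMMA = 0.95   # per-day interest decay (Step 16 default)

def update_interest(current: Dict[str, float], days_since_update: float,
                    article_topics: Dict[str, float]) -> Dict[str, float]:
    # 1. Decay the old vector for elapsed time, 2. blend in the new article's
    # topic weights, 3. renormalize to a unit vector.
    decay = GAMMA ** days_since_update
    topics = set(current) | set(article_topics)
    raw = {t: decay * current.get(t, 0.0) + ALPHA * article_topics.get(t, 0.0)
           for t in topics}
    norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0  # guard zero vector
    return {t: v / norm for t, v in raw.items()}
```

Because the output is always renormalized, α controls only the *relative* pull toward the new article; absolute interest magnitude carries no meaning.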
8.5 — Trending Score Computation (Flink Sliding Window) #
// Flink job: trending-pipeline
// Input: 'read-events' Kafka topic
// Window: SlidingEventTimeWindows(1 hour, 30 seconds)
// Output: Redis sorted set per window
function trending_flink_window(article_id, window_events):
// Count-Min Sketch (width=2048, depth=5)
sketch = CountMinSketch(width=2048, depth=5)
for event in window_events:
if not dedup_state.contains(event.event_id):
sketch.increment(event.article_id)
dedup_state.add(event.event_id)
count = sketch.estimate(article_id)
window_key = 'trending:' + floor(window.end / 30) * 30
redis_zadd(window_key, count, article_id)
  redis_expire(window_key, 60) // TTL = 2x the 30s slide interval (matches Step 7 table)
// API: GET /trending?limit=50
function get_trending(limit):
window_key = 'trending:' + current_window_epoch()
results = redis_zrevrange(window_key, 0, limit-1, withscores=True)
return results
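A self-contained Count-Min Sketch matching the 2048×5 parameters above, using seeded SHA-256 as the hash family (an illustrative choice; any pairwise-independent family works):

```python
import hashlib

class CountMinSketch:
    """Width/depth match the trending pipeline defaults (2048 x 5)."""

    def __init__(self, width: int = 2048, depth: int = 5):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row: int, key: str) -> int:
        digest = hashlib.sha256(f"{row}:{key}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def increment(self, key: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][self._index(row, key)] += count

    def estimate(self, key: str) -> int:
        # Min over rows never undercounts; overcounting requires the key to
        # collide with other keys in every one of the depth rows.
        return min(self.table[row][self._index(row, key)]
                   for row in range(self.depth))
```

This one-sided error is exactly why the sketch suits trending: a viral article's count is never underestimated, and the rare overestimate only promotes a slightly-less-hot article.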
8.6 — Personalized Feed Generation #
// Feed service: GET /feed/{user_id}
function get_feed(user_id, limit=50):
cache_key = 'feed:' + user_id
cached = redis_get(cache_key)
if cached:
return cached
// Cache miss — recompute
interest = fetch_user_interest(user_id)
// interest.vector = {topic: weight, ...}
// Candidate retrieval: fetch recent articles from top-interest categories
top_topics = top_k(interest.vector, k=5)
candidate_articles = []
for topic in top_topics:
// Query Elasticsearch for recent articles in this topic
results = es_search({
filter: {category: topic, published_at: {gte: NOW() - 48h}},
sort: [{published_at: desc}],
size: 100
})
candidate_articles.extend(results)
// Deduplicate candidates by article_id
candidates = deduplicate(candidate_articles, key='article_id')
// Score each candidate: dot product of interest vector with article topic vector
scored = []
for article in candidates:
relevance = dot_product(interest.vector, article.topic_vector)
recency_decay = exp(-0.5 * hours_since(article.published_at))
trending_boost = get_trending_score(article.article_id) / 1000.0
score = relevance * recency_decay + 0.1 * trending_boost
scored.append((score, article))
// Sort by score, take top limit
feed = sort_desc(scored, key=score)[:limit]
// Cache result
redis_setex(cache_key, 300, serialize(feed)) // 5-minute TTL
return feed
// Background job: triggered by 'interest-updated' Kafka events
function proactive_feed_rebuild(user_id):
redis_del('feed:' + user_id) // Invalidate cache
get_feed(user_id) // Rebuild and warm cache
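The scoring formula inside get_feed is worth isolating so it can be unit-tested independently of Redis and Elasticsearch. A sketch using the weights from the pseudocode (relevance × recency decay, plus a 0.1-weighted trending boost; the candidate dict shape is an assumption for illustration):

```python
import math
from typing import Dict, List

def score_article(interest: Dict[str, float], topic_vector: Dict[str, float],
                  hours_since_published: float, trending_score: float) -> float:
    # relevance * recency_decay + 0.1 * trending_boost, as in get_feed
    relevance = sum(interest.get(t, 0.0) * w for t, w in topic_vector.items())
    recency_decay = math.exp(-0.5 * hours_since_published)
    trending_boost = trending_score / 1000.0
    return relevance * recency_decay + 0.1 * trending_boost

def rank_feed(interest: Dict[str, float], candidates: List[dict],
              limit: int = 50) -> List[dict]:
    return sorted(candidates,
                  key=lambda a: score_article(interest, a["topic_vector"],
                                              a["hours"], a.get("trending", 0.0)),
                  reverse=True)[:limit]
```

Note the half-life implied by exp(−0.5 · hours): an article loses roughly half its relevance weight every ~1.4 hours, which keeps even a strong interest match from pinning stale stories to the top.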
8.7 — Search Index (CDC Pipeline) #
// Debezium CDC configuration:
// Source: Postgres article table
// Sink: Kafka topic 'article-changes'
// Event types: INSERT, UPDATE (only for metadata fields)
// Elasticsearch sink consumer
function es_indexer(article_change_event):
if article_change_event.op == 'INSERT':
es_index({
index: 'articles',
id: article_change_event.url,
body: {
article_id: article_change_event.article_id,
title: article_change_event.title,
body_excerpt: article_change_event.body[:500],
category_labels: article_change_event.category_labels,
published_at: article_change_event.published_at,
source_id: article_change_event.source_id,
cluster_id: article_change_event.cluster_id
}
})
elif article_change_event.op == 'UPDATE':
// Only index metadata fields — content is immutable
es_update({
index: 'articles',
id: article_change_event.url,
doc: {
category_labels: article_change_event.category_labels,
cluster_id: article_change_event.cluster_id,
engagement_count: article_change_event.engagement_count
}
})
// On failure: publish to dead-letter topic, retry with exponential backoff
// On persistent failure: alert on-call, manual reindex via:
// reindex_from_postgres(article_id_range)
Step 9 — Logical Data Model #
Schema with partition keys derived from invariant scope.
Core Tables (Postgres) #
-- Source: registration of crawlable feeds
CREATE TABLE source (
source_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
url TEXT NOT NULL UNIQUE,
feed_type TEXT NOT NULL CHECK (feed_type IN ('RSS', 'ATOM', 'API', 'SCRAPE')),
crawl_interval INTERVAL NOT NULL DEFAULT '15 minutes',
last_crawled_at TIMESTAMPTZ,
status TEXT NOT NULL DEFAULT 'ACTIVE' CHECK (status IN ('ACTIVE', 'PAUSED', 'DEAD')),
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
-- CrawlSchedule: intent/future constraint for when to next crawl
CREATE TABLE crawl_schedule (
source_id UUID PRIMARY KEY REFERENCES source(source_id),
next_crawl_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
status TEXT NOT NULL DEFAULT 'PENDING' CHECK (status IN ('PENDING', 'IN_FLIGHT')),
last_triggered_at TIMESTAMPTZ
);
CREATE INDEX idx_crawl_schedule_next ON crawl_schedule (next_crawl_at) WHERE status = 'PENDING';
-- StoryCluster: near-duplicate story groups
CREATE TABLE story_cluster (
cluster_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
canonical_article_id UUID, -- FK set after Article insert
article_count INT NOT NULL DEFAULT 1,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
-- Article: primary entity, normalized from raw fetches
CREATE TABLE article (
article_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
url TEXT NOT NULL UNIQUE, -- natural dedup key, unique index
title TEXT NOT NULL,
body TEXT NOT NULL,
content_hash BYTEA NOT NULL, -- sha256(title + body)
published_at TIMESTAMPTZ NOT NULL,
source_id UUID NOT NULL REFERENCES source(source_id),
ingested_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
-- Mutable metadata (set by downstream pipelines)
category_labels JSONB, -- [{category, confidence}]
topic_vector JSONB, -- {topic: weight, ...} normalized
cluster_id UUID REFERENCES story_cluster(cluster_id),
engagement_count BIGINT NOT NULL DEFAULT 0,
classified_at TIMESTAMPTZ,
clustered_at TIMESTAMPTZ
) PARTITION BY RANGE (ingested_at);
-- Partition by month for retention management
CREATE TABLE article_2025_01 PARTITION OF article FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');
-- ... create partitions monthly via automation
CREATE INDEX idx_article_source_published ON article (source_id, published_at DESC);
CREATE INDEX idx_article_category_published ON article USING GIN (category_labels) WHERE classified_at IS NOT NULL;
CREATE INDEX idx_article_cluster ON article (cluster_id);
CREATE INDEX idx_article_ingested ON article (ingested_at DESC);
-- ReadEvent: immutable event log
CREATE TABLE read_event (
event_id BYTEA PRIMARY KEY, -- sha256(user_id + article_id + session_id)
user_id UUID NOT NULL,
article_id UUID NOT NULL REFERENCES article(article_id),
session_id UUID NOT NULL,
timestamp TIMESTAMPTZ NOT NULL DEFAULT NOW()
) PARTITION BY RANGE (timestamp);
CREATE UNIQUE INDEX idx_read_event_dedup ON read_event (user_id, article_id, session_id);
CREATE INDEX idx_read_event_user ON read_event (user_id, timestamp DESC);
CREATE INDEX idx_read_event_article ON read_event (article_id, timestamp DESC);
-- UserInterest: current interest vector per user
CREATE TABLE user_interest (
user_id UUID PRIMARY KEY,
interest_vector JSONB NOT NULL DEFAULT '{}', -- {topic: weight, ...} normalized
last_updated TIMESTAMPTZ NOT NULL DEFAULT NOW(),
version BIGINT NOT NULL DEFAULT 0 -- for optimistic locking
);
Redis Key Schema (Derived Views) #
trending:{epoch_30s} → SORTED SET {article_id: score} TTL=120s
feed:{user_id} → STRING serialized ranked article list TTL=300s
crawl:in-flight:{fetch_id} → STRING '1' TTL=600s
lsh:band:{band_idx}:{hash} → SET {article_id, ...} TTL=604800s (7d)
minhash:sig:{article_id} → STRING 128 hash values (binary) TTL=604800s
Elasticsearch Index Schema #
{
"mappings": {
"properties": {
"article_id": { "type": "keyword" },
"url": { "type": "keyword" },
"title": { "type": "text", "analyzer": "english" },
"body_excerpt": { "type": "text", "analyzer": "english" },
"category_labels": { "type": "nested",
"properties": {
"category": { "type": "keyword" },
"confidence": { "type": "float" }
}},
"topic_vector": { "type": "dense_vector", "dims": 128, "index": true, "similarity": "dot_product" },
"published_at": { "type": "date" },
"source_id": { "type": "keyword" },
"cluster_id": { "type": "keyword" },
"engagement_count": { "type": "long" }
}
},
"settings": {
"number_of_shards": 12,
"number_of_replicas": 1,
"index.lifecycle.name": "articles-ilm-policy"
}
}
Step 10 — Technology Landscape #
| Layer | Technology | Rationale | Alternative Considered |
|---|---|---|---|
| Primary data store | PostgreSQL 16 | ACID transactions; FOR UPDATE SKIP LOCKED for scheduler; ON CONFLICT for dedup; range partitioning for ReadEvent; JSONB for vectors and labels | CockroachDB (more complex; not needed at this scale) |
| Article metadata read | PostgreSQL (same instance, read replicas) | Category/topic browsing is simple range queries; no need for separate read store at initial scale | Cassandra (not needed; adds ops complexity) |
| Event stream / message bus | Apache Kafka (3 brokers, replication factor 3) | Durable log for ArticleFetch, ReadEvent, CrawlTrigger; at-least-once delivery with offset-based dedup; Kafka Streams or Flink as consumer | Pulsar (Kafka ecosystem is larger; Debezium support is mature) |
| Stream processing | Apache Flink (3 job managers, 6 task managers) | Sliding window for TrendingScore; stateful exactly-once for UserInterest update; rich windowing API; checkpointing to S3 | Kafka Streams (simpler but less powerful windowing; no cross-key state joins) |
| Cache / derived views | Redis 7 (cluster mode, 6 nodes) | TrendingScore sorted sets; feed cache; crawl in-flight dedup; LSH buckets; low-latency reads | Memcached (no sorted sets or other rich data structures) |
| Search index | Elasticsearch 8 (3 data nodes, 1 coordinating) | Full-text search over article title/body; KNN vector search for feed candidates; faceted filtering by category | OpenSearch (equivalent, but ES has better k-NN support in v8) |
| CDC pipeline | Debezium (Kafka Connect plugin) | Postgres WAL → Kafka → Elasticsearch; guaranteed delivery; handles schema changes | Custom triggers (fragile; Debezium is production-tested) |
| Raw content storage | Amazon S3 | Archival of raw HTML from crawls; S3 Lifecycle for cost management; extremely cheap at scale | GCS (equivalent; S3 chosen for ecosystem) |
| ML classification | Internal ML service (Python, FastAPI) | Topic/category classification via fine-tuned BERT or FastText; called synchronously from ClassifierWorker | AWS Comprehend (vendor lock-in; model not customizable) |
| ANN / vector search | FAISS (via Elasticsearch KNN) | Approximate nearest neighbor for feed candidate retrieval using topic vectors | Milvus (standalone vector DB adds ops overhead; ES KNN sufficient) |
| Crawl scheduling | Postgres-backed scheduler (crawl_schedule table + daemon) | Simple, correct, no new ops dependency; FOR UPDATE SKIP LOCKED handles concurrency | Quartz Scheduler / Celery Beat (adds dependency; Postgres-native is simpler) |
| Service mesh / gateway | Nginx + API Gateway | Authentication (JWT); rate limiting; routing to microservices | Kong (viable alternative) |
| Container orchestration | Kubernetes (EKS) | Auto-scaling for CrawlWorkers, ClassifierWorkers, Feed service | Nomad (smaller ecosystem) |
Step 11 — Deployment Topology #
Logical Service Topology #
┌──────────────────────────────────────────────┐
│ Clients (Browser / Mobile) │
└─────────────────────┬────────────────────────┘
│ HTTPS
┌─────────────────────▼────────────────────────┐
│ API Gateway (Nginx + Auth) │
│ Rate limiting, JWT validation, routing │
└──┬────────────┬───────────────┬──────────────┘
│ │ │
┌────────────▼──┐ ┌──────▼──────┐ ┌───▼───────────┐
│ Feed Service │ │ Search Svc │ │ Trending Svc │
│ (reads Redis, │ │(queries ES) │ │ (reads Redis) │
│ ES, Postgres) │ └──────┬──────┘ └───────────────┘
└────────────┬──┘ │
│ │ ES query
┌─────────────▼──────────────────────────────────┐
│ Redis Cluster │
│ feed:{user_id}, trending:{epoch}, LSH buckets │
└─────────────────────────────────────────────────┘
┌────────────────────────────── Write Path ──────────────────────────────────┐
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌──────────────────────────────┐ │
│ │ Scheduler │───▶│ Kafka │───▶│ CrawlWorker Pool (k8s) │ │
│ │ Daemon │ │ crawl- │ │ Fetches HTML, stores S3, │ │
│ │ (Postgres │ │ triggers │ │ publishes to article-fetches │ │
│ │ FOR UPDATE │ └─────────────┘ └──────────────────────────────┘ │
│ │ SKIP LOCKED)│ │ │
│ └─────────────┘ ┌─────────────▼──────────┐ │
│ │ ParseWorker Pool │ │
│ │ (RSS/HTML parsing, │ │
│ │ INSERT ON CONFLICT) │ │
│ └─────────────┬──────────┘ │
│ │ Postgres │
│ ┌────────────────────────▼────────────┐ │
│ │ Postgres Primary (RDS) │ │
│ │ article, source, crawl_schedule, │ │
│ │ user_interest, read_event │ │
│ └──┬─────────────────────┬────────────┘ │
│ │ Debezium CDC │ Read replicas (2) │
│ ┌───────────▼──┐ ┌────────▼──────────┐ │
│ │ Kafka topic │ │ Read Replica Pool │ │
│ │ article- │ │ (feed queries, │ │
│ │ changes │ │ category browse) │ │
│ └───────────┬──┘ └───────────────────┘ │
│ │ │
│ ┌───────────▼──┐ ┌────────────────────┐ │
│ │ Elasticsearch │ │ Flink Cluster │ │
│ │ Indexer │ │ - trending-job │ │
│ │ (Kafka Connect│ │ - interest-update │ │
│ │ Sink) │ │ - cluster-job │ │
│ └───────────┬──┘ └────────────────────┘ │
│ │ │
│ ┌───────────▼──────────────────────────────────┐ │
│ │ Elasticsearch Cluster (3 nodes) │ │
│ └──────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
Physical Deployment (AWS, Single Region Initial) #
| Component | Instance Type | Count | Notes |
|---|---|---|---|
| Postgres (RDS) | db.r6g.2xlarge | 1 primary + 2 replicas | Multi-AZ; 2TB gp3 |
| Kafka brokers | i3.xlarge | 3 | 3 AZs; 1.9TB NVMe |
| Flink (JobManager) | c6i.xlarge | 3 | HA; ZooKeeper for leader election |
| Flink (TaskManager) | r6i.2xlarge | 6 | 64GB RAM; RocksDB state backend |
| Redis (ElastiCache) | r6g.xlarge | 6 | Cluster mode; 3 shards × 2 replicas |
| Elasticsearch | r6g.2xlarge | 3 data + 1 coord | 500GB gp3 each |
| CrawlWorkers | c6i.large | 10-50 (auto-scaled) | HPA on queue depth |
| ParseWorkers | c6i.large | 5-20 (auto-scaled) | |
| ClassifierWorkers | g4dn.xlarge (GPU) | 3-10 | GPU for BERT inference |
| ClusterWorkers | c6i.large | 3-8 | CPU-bound MinHash |
| Feed Service | c6i.large | 3-10 | Stateless; HPA on CPU |
| Search Service | c6i.large | 3-5 | Stateless |
| Debezium (Kafka Connect) | c6i.large | 3 | Distributed mode |
| S3 | — | — | Raw HTML archive |
Step 12 — Consistency Model #
Classification of consistency per read/write path.
| Path | Consistency Model | Justification |
|---|---|---|
| Article ingest (crawl → Postgres) | Linearizable (single primary) | URL uniqueness invariant requires atomic insert; Postgres primary provides linearizability |
| Article metadata update (classify/cluster → Postgres) | Read-your-writes (within pipeline partition) | Flink partitions by article_id; each article’s metadata updates are sequential |
| CrawlSchedule update (scheduler → Postgres) | Linearizable | FOR UPDATE SKIP LOCKED is pessimistic locking; serializable within transaction |
| ReadEvent recording (API → Postgres) | Linearizable (unique index enforces idempotency) | Postgres unique index on (user_id, article_id, session_id) |
| UserInterest update (Flink → Postgres) | Read-your-writes | Optimistic locking (version field); single Flink partition per user_id ensures no concurrent writers |
| SearchIndex (CDC via Debezium → Elasticsearch) | Eventual consistency (I-8: 60s staleness bound) | CDC is async; Elasticsearch is a derived projection |
| TrendingScore (Flink → Redis) | Eventual consistency (I-6: 30s staleness bound) | Stream aggregation is async; score updates every 30s |
| PersonalizedFeed (Redis cache) | Eventual consistency (I-7: 5-minute TTL) | Cache is a materialized derived view; stale on cache hit |
| Category/topic browse (Postgres read replica) | Eventual consistency (replication lag typically < 1s) | Read replica lag; acceptable for browsing |
| Trending read (Redis sorted set) | Eventual consistency (30s window slide) | Same as above |
| Search query (Elasticsearch) | Eventual consistency (60s CDC lag) | Derived projection |
Consistency anomalies explicitly accepted:
- A user reads an article and it does not immediately appear as read in their interest vector — up to 5 minutes staleness for feed personalization.
- A newly published article may not appear in search for up to 60 seconds.
- An article’s trending score may lag actual read volume by up to 30 seconds.
- Two concurrent requests to read the same article from different sessions may both record ReadEvents (idempotency key prevents double-counting only within the same session).
Step 13 — Scaling Model #
Scaling Dimensions #
| Dimension | Current Scale Target | Scaling Mechanism | Hotspot Risk |
|---|---|---|---|
| Sources (crawled feeds) | 100,000 sources | Horizontal: more CrawlWorkers; Kafka partition count scales with source count | Single high-frequency source (e.g., Reuters publishes 500 articles/hour) → shard by source_id |
| Articles ingested/day | 5 million articles/day | Postgres partitioned by month; ParseWorker pool auto-scaled | Viral event: all sources publish same story simultaneously → URL dedup absorbs burst |
| ReadEvents/second | 50,000 reads/sec peak | ReadEvent table partitioned by day; Kafka handles bursts; Flink auto-scales | Viral article: millions of reads/hour → Count-Min Sketch handles cardinality; Redis sorted set atomic |
| Users | 100 million users | UserInterest rows: 100M × 1KB = 100GB — fits in Postgres; feed cache in Redis cluster | Celebrity user (influencer): their read events are processed the same as any other user's |
| Trending computation | 128 article_ids × 1-hour windows | Count-Min Sketch is O(1) memory per article; Redis sorted set scales horizontally | Breaking news: single article gets 10M reads/hour → CMS handles gracefully, Redis ZADD is atomic |
| Search | 100M article index | Elasticsearch 12 shards; expand to 24 shards with reindex | Hot shard: recent articles searched more → ILM hot-warm-cold tier; recent index on hot nodes |
| Personalized feed | 100M users × 5 recomputes/day | Redis cluster holds 100M keys × 10KB = 1TB → scale cluster; proactive rebuild for active users only | Popular user (many active sessions): each session triggers recompute → debounce: only rebuild if cache is cold |
Sharding Strategy #
| Object | Shard Key | Why |
|---|---|---|
| Article | ingested_at month (range partition) | Even distribution; enables time-based retention; no hotspot (new articles spread across sources) |
| ReadEvent | timestamp day (range partition) | Even distribution; enables fast deletion of old partitions for retention |
| Kafka read-events | user_id (hash) | Ensures all events for one user are in the same partition → one Flink task receives all events for that user |
| Kafka new-articles | url hash | Ensures classifier and cluster workers process each article exactly once |
| Redis trending | window_epoch | Different windows are different keys; no hotspot |
| Redis feed cache | user_id hash (Redis Cluster) | Even distribution across Redis shards |
| Elasticsearch | Document _id hash (default routing, 12 shards) | Articles are uniformly distributed; no natural clustering needed |
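The Kafka sharding guarantees above rest on a deterministic key→partition mapping. A sketch of that invariant, using MD5 as an illustrative stand-in (Kafka's Java client actually uses murmur2 of the key bytes, modulo the partition count):

```python
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    # Deterministic hash -> the same key always maps to the same partition,
    # so one consumer/Flink task owns all events for a given user_id.
    # (Illustrative stand-in; real Kafka clients use murmur2, not MD5.)
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions
```

One operational consequence worth remembering: changing the partition count changes every key's mapping, so resizing the read-events topic requires draining or re-keying stateful Flink jobs.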
Auto-Scaling Triggers #
| Service | Scale-Out Trigger | Scale-In Trigger |
|---|---|---|
| CrawlWorkers | Kafka crawl-triggers consumer lag > 1000 messages | Lag < 100 for 5 minutes |
| ParseWorkers | Kafka article-fetches consumer lag > 500 messages | Lag < 50 for 5 minutes |
| ClassifierWorkers | Kafka new-articles consumer lag > 200 messages | Lag < 20 for 5 minutes |
| Feed Service | CPU > 70% | CPU < 30% for 10 minutes |
| Flink TaskManagers | Flink backpressure > 80% | Backpressure < 20% for 10 minutes |
Step 14 — Failure Model #
Failure Modes and Mitigations #
| Failure | Impact | Detection | Mitigation | Recovery |
|---|---|---|---|---|
| Postgres primary failure | Article ingest halts; ReadEvents queue in Kafka; feed reads degrade to cache | RDS health check; Postgres replication lag alert | RDS Multi-AZ automatic failover (30-90s); Kafka buffers events during failover | Resume consumption from Kafka after failover; no data loss (Kafka retained) |
| Kafka broker failure | Read events may queue; crawl triggers queue | Kafka broker health; consumer lag alert | Replication factor 3; one broker failure is transparent | Kafka auto-rebalances partitions to surviving brokers; no intervention |
| Flink job crash (trending) | TrendingScore stale for up to window slide period | Flink job health; trending Redis TTL expiry alert | Flink checkpointing every 30s to S3; auto-restart from checkpoint | Job restarts from checkpoint; replays from last Kafka offset; trending recovers within 30s |
| Flink job crash (interest update) | UserInterest stale; feed personalization degrades | Flink job health; consumer lag alert | Same as above; Kafka ReadEvents retained 7 days | Job restarts; replays from checkpoint offset; no data loss |
| Redis cluster failure | Feed cache miss → higher latency; trending unavailable | Redis health; cache hit rate alert | Redis ElastiCache cluster mode with 2 replicas per shard; automatic failover | Cache rebuilds on demand on cache miss; trending recovers when Flink re-populates |
| Elasticsearch failure | Search unavailable; feed candidate retrieval from ES fails | ES health; search error rate | 3-node ES with 1 replica per shard; coordinating node is stateless | ES auto-recovers; search unavailable during outage; feed service falls back to Postgres query |
| ML classifier service down | Articles not categorized; category browsing shows fewer results | Classifier health; Kafka new-articles consumer lag | ClassifierWorkers retry with exponential backoff; articles remain uncategorized until service recovers | Service recovers; Kafka backlog is processed; articles categorized with delay |
| CrawlWorker crash mid-fetch | ArticleFetch not published; source skipped for one interval | Crawl trigger timeout; next_crawl_at not updated for source | CrawlWorker uses Redis SETNX for in-flight dedup; if worker crashes, lock expires (TTL 600s); scheduler re-triggers | Scheduler re-triggers source after TTL expiry; at-most-one-interval gap in coverage |
| Debezium CDC lag | Elasticsearch stale beyond 60s SLO | CDC lag metric; ES vs Postgres article count difference | Debezium Kafka Connect with retries; dead-letter queue | Reprocess from dead-letter queue; or trigger full reindex for specific article_id range |
| S3 unavailable | Raw HTML archival fails; pipeline still works (S3 write is async) | S3 put error rate | CrawlWorker publishes to Kafka regardless of S3 success; S3 write is best-effort archive | Re-fetch from source if archival required; or accept loss of raw HTML for that fetch |
Cascading Failure Analysis #
Scenario: Breaking news event — all 100K sources publish articles about the same story simultaneously
- CrawlWorkers fetch 100K articles in 1 hour window
- Kafka article-fetches topic receives 100K events — ParseWorkers scale out to handle the burst
- Postgres receives 100K INSERT ON CONFLICT DO NOTHING statements — most are the same story from different sources; conflicting URLs are deduplicated at insert, so 100K inserts yield perhaps 5K unique articles
- MinHash clustering pipeline is briefly overwhelmed → cluster assignment delayed; articles appear unclustered briefly
- ReadEvents spike (everyone reads breaking news) → TrendingScore updates in Redis within 30s → trending feed shows breaking news correctly
- Feed recompute for all users triggered → Redis cache invalidated at scale → Feed Service CPU spikes → HPA scales out
- Circuit breaker: if Postgres read replicas lag, Feed Service falls back to stale cache or returns partial results
Mitigation: Breaking news is the design-time scenario this system is built for. Key protections: URL dedup at Postgres level prevents duplicate storage; Count-Min Sketch handles high-cardinality trending without memory explosion; Redis cache absorbs read spike; Kafka buffers upstream bursts.
Step 15 — SLOs #
| Service | SLO | Measurement | Error Budget (30d) |
|---|---|---|---|
| Article ingest latency | P99 < 30s from fetch to ingestion; end-to-end, an article from a source crawled every 15 min is available within 15 min + 30s of publication | ingested_at - published_at for new articles | 0.1% of articles may exceed 30s |
| Feed load time | P99 < 500ms | API response time from request to first byte (with cache) | 0.1% of feed requests > 500ms |
| Feed load time (cache miss) | P99 < 3s | API response time on cache miss (full recompute) | 0.5% of recomputes > 3s |
| Search response time | P99 < 1s | Elasticsearch query + API response | 0.1% of searches > 1s |
| Trending freshness | 95% of articles in trending list within 30s of crossing engagement threshold | TrendingScore update lag | 5% of trending updates may lag |
| Personalized feed freshness | 99% of feeds reflect UserInterest as of ≤ 5 minutes ago | Cache TTL + interest update lag | 1% of feeds may be > 5min stale |
| Search index freshness | 99% of articles appear in search within 60s of ingestion | es_indexed_at - ingested_at | 1% of articles may lag > 60s |
| Availability (feed endpoint) | 99.9% uptime | HTTP 5xx rate < 0.1% | 43.2 minutes/month |
| Availability (search endpoint) | 99.5% uptime | HTTP 5xx rate < 0.5% | 3.6 hours/month |
| Crawl coverage | 99% of active sources crawled within 2× their crawl_interval | Sources where last_crawled_at < now() - 2 * crawl_interval | 1% may miss crawl window |
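The error-budget column for the availability rows follows mechanically from the SLO percentage. A small helper makes the arithmetic explicit (a sketch; the 30-day window matches the table header):

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Allowed downtime (in minutes) within the window for a given
    availability SLO, e.g. 99.9% over 30 days -> 43.2 minutes."""
    total_minutes = window_days * 24 * 60  # 43,200 for a 30-day window
    return total_minutes * (1 - slo_percent / 100)
```

This reproduces the table: 99.9% yields 43.2 minutes/month for the feed endpoint and 99.5% yields 216 minutes (3.6 hours) for search.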
Step 16 — Operational Parameters #
Tuneable Parameters #
| Parameter | Default | Range | Effect |
|---|---|---|---|
| `crawl_interval` (per source) | 15 minutes | 5 min – 24 hours | Frequency of source crawling; affects Postgres write load |
| `min_crawl_interval` (system floor) | 5 minutes | 1 min – 30 min | Prevents overloading any single source |
| MinHash bands | 16 | 8 – 32 | More bands = higher recall (fewer missed duplicates); more Redis memory |
| MinHash hashes per band | 8 | 4 – 16 | More hashes = higher precision (fewer false positives); slower computation |
| Jaccard similarity threshold | 0.85 | 0.7 – 0.95 | Lower = more aggressive clustering; higher = stricter near-dup detection |
| Trending window size | 1 hour | 15 min – 6 hours | Wider = smoother trending; narrower = more reactive |
| Trending slide interval | 30 seconds | 10 sec – 5 min | Smaller = fresher trending; more Redis writes |
| Count-Min Sketch width | 2048 | 512 – 8192 | Wider = lower error; more memory |
| Count-Min Sketch depth | 5 | 3 – 10 | Deeper = lower probability of error; more memory |
| Feed cache TTL | 300 seconds | 60 – 1800 sec | Shorter = fresher feed; more recompute load |
| UserInterest EMA learning rate (α) | 0.10 | 0.01 – 0.3 | Higher = faster adaptation to new interests; lower = more stable |
| UserInterest decay rate (γ) | 0.95 per day | 0.8 – 1.0 per day | Lower = faster forgetting of old interests; higher = longer memory |
| Feed candidate pool (articles per topic) | 100 | 20 – 500 | Larger pool = better ranking quality; more ES query load |
| Top-K topics for feed candidates | 5 | 2 – 10 | More topics = broader feed; fewer = more focused |
| Kafka ReadEvent retention | 7 days | 1 – 30 days | Longer = more replay capacity for Flink recovery; more storage cost |
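The interaction between MinHash bands and hashes-per-band is captured by the standard LSH S-curve: a pair with Jaccard similarity `s` becomes a dedup candidate with probability `1 - (1 - s^r)^b`. A one-line sketch, using the table's defaults (16 bands × 8 rows):

```python
def lsh_candidate_probability(similarity: float,
                              bands: int = 16,
                              rows: int = 8) -> float:
    """Probability that two documents with the given Jaccard similarity
    share at least one LSH band (16 bands x 8 rows = a 128-hash
    MinHash signature). Each band matches with probability s^rows;
    the pair is a candidate if any of the `bands` bands match."""
    return 1.0 - (1.0 - similarity ** rows) ** bands
```

With these defaults, pairs at the 0.85 threshold are caught over 99% of the time, while clearly dissimilar pairs (s ≈ 0.3) almost never become candidates — which is why widening `bands` raises recall and lengthening `rows` raises precision, as the table states.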
Circuit Breaker Configuration #
| Endpoint | Failure Threshold | Timeout | Half-Open Probe Interval |
|---|---|---|---|
| External source HTTP fetch | 5 failures in 60s | 30s per request | 60s |
| Postgres write (article insert) | 3 failures in 10s | 5s | 30s |
| Elasticsearch query (search/feed) | 5 failures in 30s | 3s | 15s |
| Redis read (feed cache) | 10 failures in 10s | 100ms | 10s |
| ML classifier service | 3 failures in 30s | 10s | 60s |
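The breaker policy in the table can be sketched as a small state machine. This is an illustrative implementation, not the production one; the injectable clock exists only to make the sliding failure window testable.

```python
import time

class CircuitBreaker:
    """Sketch of the policy above: trip OPEN after `threshold` failures
    within `window` seconds; after `probe_interval` allow a single
    HALF_OPEN probe; a probe success closes, a probe failure re-opens."""

    def __init__(self, threshold=5, window=60.0, probe_interval=60.0,
                 clock=time.monotonic):
        self.threshold = threshold
        self.window = window
        self.probe_interval = probe_interval
        self.clock = clock
        self.failures = []          # timestamps of recent failures
        self.state = "CLOSED"
        self.opened_at = None

    def allow_request(self) -> bool:
        if self.state == "OPEN":
            if self.clock() - self.opened_at >= self.probe_interval:
                self.state = "HALF_OPEN"   # let exactly one probe through
                return True
            return False
        return True

    def record_success(self) -> None:
        self.state = "CLOSED"
        self.failures.clear()

    def record_failure(self) -> None:
        now = self.clock()
        if self.state == "HALF_OPEN":
            self.state, self.opened_at = "OPEN", now
            return
        # Keep only failures inside the sliding window, then count.
        self.failures = [t for t in self.failures if now - t < self.window]
        self.failures.append(now)
        if len(self.failures) >= self.threshold:
            self.state, self.opened_at = "OPEN", now
```

For the external-fetch row this would be instantiated as `CircuitBreaker(threshold=5, window=60.0, probe_interval=60.0)`, with the 30s request timeout enforced separately by the HTTP client.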
Step 17 — Runbooks #
Runbook 1: Feed Latency Degradation (P99 > 500ms) #
Trigger: Alert on feed_api_latency_p99 > 500ms for 5 consecutive minutes.
Diagnosis steps:
- Check Redis feed cache hit rate: `redis-cli info stats | grep keyspace_hits`. If hit rate < 80%, the cache is being evicted or the TTL is too short.
- Check Feed Service CPU: `kubectl top pods -l app=feed-service`. If > 80%, scale out: `kubectl scale deployment feed-service --replicas=10`.
- Check Elasticsearch query latency: Kibana → Monitoring → Elasticsearch → Search latency. If P99 > 500ms, check for hot shards.
- Check Postgres read replica lag: CloudWatch → RDS → `ReplicaLag`. If > 5s, the feed service is reading stale data; acceptable per SLO but investigate the replication stall.
- Check UserInterest update lag: Kafka consumer group lag for `interest-update-pipeline`. If > 10,000 messages, the Flink job may be struggling.
Remediation:
- Redis eviction: Increase Redis cluster memory or reduce non-feed TTLs. Short-term: `CONFIG SET maxmemory-policy allkeys-lru`.
- Feed Service overloaded: Scale out pods. If scaling doesn't help, check for slow ES queries (see step 3).
- Hot ES shards: Force-merge the recent index or move the hot shard to a dedicated node.
- Flink lag: Check the Flink UI for backpressure. Scale TaskManagers if needed.
Runbook 2: Crawl Coverage Drop (sources missing crawl window) #
Trigger: Alert on sources_overdue_ratio > 0.01 (more than 1% of sources haven’t been crawled within 2× their interval).
Diagnosis steps:
- Check scheduler daemon health: `kubectl get pods -l app=scheduler-daemon`. If the pod is not running, restart: `kubectl rollout restart deployment scheduler-daemon`.
- Check Kafka `crawl-triggers` consumer lag: if lag > 10,000, CrawlWorkers are overwhelmed.
- Check CrawlWorker pod count: `kubectl get pods -l app=crawl-worker`. Auto-scaling should have triggered.
- Check for circuit-breaker trips on external sources: look for a high `crawl.fetch_error` metric rate. May indicate a network issue or a source outage.
- Check the Postgres `crawl_schedule` table for `IN_FLIGHT` rows with old `last_triggered_at` (> 10 minutes): these are stuck crawls whose workers crashed without completing.
Remediation:
- Stuck IN_FLIGHT rows: `UPDATE crawl_schedule SET status = 'PENDING' WHERE status = 'IN_FLIGHT' AND last_triggered_at < NOW() - INTERVAL '10 minutes';` This re-queues them for the next scheduler tick.
- CrawlWorker overload: scale out: `kubectl scale deployment crawl-worker --replicas=50`.
- Network issue: check AWS VPC flow logs; check if specific source domains are down.
Runbook 3: Elasticsearch Index Divergence (search misses recent articles) #
Trigger: Alert on es_index_lag_seconds > 120 (CDC pipeline is more than 2× SLO behind).
Diagnosis steps:
- Check Debezium Kafka Connect status: `curl http://kafka-connect:8083/connectors/postgres-source/status`.
- Check Kafka `article-changes` topic consumer lag.
- Check the Elasticsearch indexer Kafka Connect sink status.
- Check Elasticsearch cluster health: `curl http://es:9200/_cluster/health`.
Remediation:
- Debezium connector failure: `curl -X POST http://kafka-connect:8083/connectors/postgres-source/restart`.
- Elasticsearch cluster RED/YELLOW: check unassigned shards: `curl http://es:9200/_cat/shards?v | grep UNASSIGNED`. Force-allocate if needed.
- If lag is persistent (> 1 hour): trigger a partial reindex for recently ingested articles: `reindex_articles_since(timestamp=NOW() - INTERVAL '2 hours')`.
Runbook 4: Trending Score Stale (Redis TTL expired, Flink not writing) #
Trigger: Alert on trending_redis_key_missing — the expected Redis key trending:{current_epoch} does not exist.
Diagnosis steps:
- Check Flink trending-job status in Flink UI.
- Check Kafka `read-events` consumer lag for the trending-job consumer group.
- Check Redis cluster health.
Remediation:
- Flink job down: `flink run -d trending-job.jar` (or trigger via a Kubernetes job restart).
- Redis cluster issue: check ElastiCache events in the AWS console.
- Cold start (no ReadEvents in past hour): this is expected during low-traffic periods; trending returns empty list — acceptable.
Step 18 — Observability #
Metrics #
| Metric | Type | Labels | Alert Threshold |
|---|---|---|---|
| `articles_ingested_total` | Counter | source_id, status (new/duplicate) | — |
| `article_parse_latency_seconds` | Histogram | feed_type | P99 > 5s |
| `crawl_fetch_error_rate` | Gauge | source_id | > 10% over 5 min |
| `sources_overdue_ratio` | Gauge | — | > 0.01 |
| `kafka_consumer_lag` | Gauge | consumer_group, topic | > 10,000 |
| `feed_api_latency_seconds` | Histogram | cache_status (hit/miss) | P99 > 500ms (hit), P99 > 3s (miss) |
| `feed_cache_hit_rate` | Gauge | — | < 80% |
| `es_index_lag_seconds` | Gauge | — | > 120s |
| `trending_score_freshness_seconds` | Gauge | — | > 60s |
| `user_interest_update_lag_seconds` | Gauge | — | > 300s |
| `minhash_cluster_assignments_total` | Counter | action (new_cluster/joined_cluster) | — |
| `read_event_dedup_rate` | Gauge | — | > 5% (indicates client retry loop) |
| `flink_job_restarts_total` | Counter | job_name | > 3 in 1 hour |
| `circuit_breaker_open_total` | Counter | service | > 0 per minute |
| `classifier_latency_seconds` | Histogram | — | > 10s (P99) |
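The `sources_overdue_ratio` gauge (the trigger for Runbook 2) is a straightforward computation over the crawl schedule. A sketch, assuming dict-shaped rows with the `last_crawled_at` / `crawl_interval` fields used throughout this document:

```python
from datetime import datetime, timedelta

def sources_overdue_ratio(sources, now):
    """Fraction of active sources whose last crawl is older than 2x their
    per-source crawl_interval — the gauge behind the > 0.01 alert.
    `sources` is an iterable of dicts; field names follow the
    crawl_schedule shape assumed in this document."""
    active = [s for s in sources if s.get("active", True)]
    if not active:
        return 0.0
    overdue = sum(
        1 for s in active
        if now - s["last_crawled_at"] > 2 * s["crawl_interval"]
    )
    return overdue / len(active)
```

In production this would be a single SQL aggregate exported on a scrape interval, not a Python loop; the logic is the same.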
Traces #
Distributed tracing (OpenTelemetry) spans:
- `feed.request` → spans: `redis.get(feed_cache)`, `es.search(candidates)`, `postgres.read(user_interest)`, `rank.compute`, `redis.set(feed_cache)`
- `article.ingest` → spans: `kafka.consume(article-fetches)`, `parse.html`, `postgres.insert(article)`, `kafka.publish(new-articles)`
- `crawl.trigger` → spans: `postgres.query(crawl_schedule)`, `kafka.publish(crawl-trigger)`, `postgres.update(next_crawl_at)`
Logs #
Structured JSON logs with fields: service, trace_id, span_id, level, msg, article_id, user_id, source_id, duration_ms.
Key log events:
- `article.ingested` — every new article (not duplicate): `{article_id, url, source_id, published_at, ingested_at}`
- `article.duplicate` — URL conflict on insert: `{url, existing_article_id}`
- `cluster.assigned` — article assigned to cluster: `{article_id, cluster_id, action, jaccard_similarity}`
- `interest.updated` — user interest vector updated: `{user_id, top_topics, version}`
- `feed.cache_miss` — feed recomputed: `{user_id, candidate_count, rank_duration_ms}`
- `crawl.error` — source fetch failed: `{source_id, url, http_status, error}`
Dashboards #
- Ingestion Health: articles/hour by source, parse error rate, duplicate rate, cluster assignment rate
- Pipeline Lag: Kafka consumer lag per consumer group, Flink checkpoint duration, CDC lag to Elasticsearch
- User Experience: feed P50/P95/P99 latency (cache hit vs miss), search latency, trending score freshness
- Crawl Coverage: sources overdue, crawl error rate by domain, CrawlWorker queue depth
- Content Quality: articles per cluster (clustering effectiveness), classification confidence distribution
Step 19 — Cost Model #
Monthly cost estimate at 100K sources, 5M articles/day, 50K reads/second peak.
| Component | Sizing | Monthly Cost (USD) | Notes |
|---|---|---|---|
| RDS Postgres (primary + 2 replicas) | db.r6g.2xlarge × 3 | ~$1,800 | 2TB gp3 storage; Multi-AZ |
| Kafka (MSK) | 3 × i3.xlarge | ~$900 | Managed Kafka; 3 AZ |
| Flink (EC2) | 3 JobManager (c6i.xl) + 6 TaskManager (r6i.2xl) | ~$2,400 | Spot for TaskManagers saves ~60% |
| Redis (ElastiCache) | 6 × r6g.xlarge (cluster) | ~$1,800 | |
| Elasticsearch | 3 × r6g.2xlarge + 1 coord | ~$1,600 | 500GB gp3 each |
| CrawlWorkers (EC2 Spot) | avg 20 × c6i.large | ~$400 | Spot pricing |
| ParseWorkers (EC2 Spot) | avg 10 × c6i.large | ~$200 | |
| ClassifierWorkers (GPU Spot) | avg 5 × g4dn.xlarge | ~$700 | GPU Spot |
| ClusterWorkers (EC2 Spot) | avg 5 × c6i.large | ~$100 | |
| Feed/Search Services (EC2) | avg 6 × c6i.large | ~$360 | |
| S3 (raw HTML archive) | 5M articles × 100KB avg × 30 days = 15TB/month | ~$340 | S3 Standard → Glacier after 7 days (~$150 effective) |
| Data transfer (outbound) | 100M users × 50 requests/month × 10KB = 50TB | ~$4,500 | CloudFront CDN reduces this by ~80% → ~$900 effective |
| CloudFront CDN | 50TB outbound × 80% cached | ~$600 | Reduces origin load and egress |
| Kubernetes (EKS) | Cluster overhead | ~$150 | Control plane |
| Debezium/Kafka Connect | 3 × c6i.large | ~$180 | |
| Total (approximate) | — | ~$12,000 – $15,000/month | Spot discounts and CDN caching bring effective cost down |
Cost optimization levers:
- Use Spot instances for all stateless workers (CrawlWorkers, ParseWorkers, ClassifierWorkers) — 60-70% savings
- S3 Lifecycle: raw HTML to Glacier after 7 days — reduces S3 cost by ~75%
- Flink TaskManagers on Spot with checkpointing to S3 — job restarts on Spot interruption (acceptable; checkpoint recovery < 30s)
- CloudFront for feed API responses: trending and category feeds are cacheable for 30s at edge — reduces origin load by 80%
- Postgres: use RDS Reserved Instances (1-year) — 40% discount → ~$700/month for DB
Step 20 — Evolution Stages #
Stage 1: Foundation (0 → 1M users, 10K sources) #
What is built:
- Postgres for all primary state (article, source, crawl_schedule, read_event, user_interest)
- Kafka with 3 topics: crawl-triggers, article-fetches, read-events
- CrawlWorker pool + ParseWorker pool (manual scaling, 5 workers each)
- Simple rule-based classifier (keyword matching, no ML) for category labels
- No near-duplicate clustering (each URL is its own “cluster”)
- Feed = most recent articles in user’s followed categories (no ML ranking)
- Trending = COUNT(read_events in past 1h) per article, computed by cron job every 5 minutes
- Elasticsearch not yet deployed — search is a Postgres full-text query (`tsvector`)
- No Redis — the feed is computed on every request from Postgres
Compromises accepted: No personalization; no near-dup clustering; search is slow; trending is coarse.
Migration path to Stage 2: Add Redis for feed cache when P99 feed latency exceeds 1 second; add Flink when cron job trending lag becomes user-visible.
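Stage 1's trending cron is simple enough to state completely. A sketch of the counting logic with in-memory events (in Stage 1 itself this would be a plain Postgres `GROUP BY` run every 5 minutes):

```python
from collections import Counter
from datetime import datetime, timedelta

def stage1_trending(read_events, now, window=timedelta(hours=1), top_n=20):
    """Stage 1 trending: count reads per article over the trailing window
    and return the top N. `read_events` is an iterable of
    (article_id, read_at) tuples — a stand-in for the read_event table."""
    counts = Counter(
        article_id
        for article_id, read_at in read_events
        if now - window <= read_at <= now
    )
    return counts.most_common(top_n)
```

The coarseness is visible here: a 5-minute cron over a 1-hour window means trending can lag reality by up to 5 minutes, which is the user-visible gap that justifies introducing Flink in Stage 2.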
Stage 2: Personalization and Real-Time Trending (1M → 10M users, 50K sources) #
What changes:
- Add Flink (2 TaskManagers) for sliding-window trending score
- Add Redis (3-node cluster) for trending sorted set and feed cache
- Add ML classifier (FastText, CPU-only, 3 workers)
- Add UserInterest (Flink EMA update pipeline)
- PersonalizedFeed: top-K from Postgres by topic score dot product (no ES yet)
- ReadEvent deduplication implemented (unique index + session_id)
- CrawlSchedule eviction: `FOR UPDATE SKIP LOCKED` scheduler pattern implemented
New invariants enforced: I-5 (ReadEvent idempotency), I-6 (TrendingScore freshness), I-7 (feed personalization)
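The UserInterest EMA pipeline added in this stage reduces to two small update rules, using the Step 16 parameters (α = 0.10, γ = 0.95/day). A sketch with sparse topic→weight dicts (Stage 3 later swaps these for 128-dim dense vectors):

```python
def update_interest(interest, article_topics, alpha=0.10):
    """Per-ReadEvent EMA update (the Flink pipeline in miniature):
    new = (1 - alpha) * old + alpha * article. Both vectors are
    topic -> weight dicts; absent topics count as 0."""
    topics = set(interest) | set(article_topics)
    return {
        t: (1 - alpha) * interest.get(t, 0.0)
           + alpha * article_topics.get(t, 0.0)
        for t in topics
    }

def decay_interest(interest, gamma=0.95, days=1):
    """Nightly decay: old interests fade by gamma per elapsed day, so a
    topic untouched for weeks trends toward zero weight."""
    factor = gamma ** days
    return {t: w * factor for t, w in interest.items()}
```

Higher α adapts faster but is noisier; lower γ forgets faster — exactly the trade-offs listed in the Step 16 parameter table.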
Stage 3: Search and Story Clustering (10M → 50M users, 100K sources) #
What changes:
- Deploy Elasticsearch (3 nodes); Debezium CDC from Postgres
- Replace Postgres full-text search with Elasticsearch full-text + dense vector KNN
- Implement MinHash + LSH story clustering (ClusterWorker pool)
- StoryCluster table added; Article.cluster_id populated
- Feed candidate retrieval switches from Postgres to Elasticsearch KNN
- PersonalizedFeed now uses topic_vector (128-dim) instead of category labels
- Add Kafka `new-articles` and `classified-articles` topics
- The Flink interest-update pipeline achieves exactly-once semantics
New invariants enforced: I-4 (near-dedup clustering), I-8 (search freshness)
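The MinHash + LSH clustering introduced in this stage can be sketched end-to-end: compute a 128-hash signature per article, then derive one bucket key per band; articles sharing any bucket key become candidates for Jaccard verification. The salted-MD5 hash family is an illustrative assumption, not the production choice.

```python
import hashlib

def minhash_signature(shingles, num_hashes=128):
    """128-hash MinHash signature (16 bands x 8 rows under the Step 16
    defaults). `shingles` is the set of token n-grams from the article
    text; each of the 128 salted hashes keeps its minimum over the set."""
    return [
        min(int(hashlib.md5(f"{i}:{s}".encode()).hexdigest(), 16)
            for s in shingles)
        for i in range(num_hashes)
    ]

def band_keys(signature, bands=16):
    """LSH bucket keys, one per band: articles that share any key become
    cluster candidates and proceed to exact Jaccard verification
    against the 0.85 threshold."""
    rows = len(signature) // bands
    return [
        hashlib.md5(str(signature[b * rows:(b + 1) * rows]).encode()).hexdigest()
        for b in range(bands)
    ]
```

In the ClusterWorker these bucket keys would live in Redis sets keyed by band hash, so candidate lookup is O(bands) regardless of corpus size.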
Stage 4: Multi-Region (50M → 200M users) #
What changes:
- Deploy read-region in EU and APAC (Postgres read replicas + Redis + ES)
- Feed Service reads from local Redis/ES/Postgres replica
- Kafka MirrorMaker 2 for cross-region ReadEvent replication
- UserInterest updates go to primary region; replicated asynchronously
- Trending is computed per-region (regional trends) + globally (global trending)
- CrawlWorkers deployed per region; each region crawls a partition of sources (avoid duplicate crawls via source → region assignment in crawl_schedule)
New cross-region invariant: UserInterest may be stale by the replication lag (typically < 1s) — accepted as eventual consistency.
Stage 5: Advanced ML and Publisher Ecosystem (200M+ users) #
What changes:
- Replace FastText classifier with fine-tuned BERT (GPU inference, batched)
- Add publisher API: verified publishers can submit articles directly (no crawl needed) via REST API; articles are inserted into Postgres directly, bypassing CrawlWorker/ParseWorker
- Add user feedback loop: explicit signals (save, share, dislike) are incorporated into UserInterest with higher weight than implicit reads
- Add A/B testing framework: PersonalizedFeed ranking function is parameterized; different ranking models tested in shadow mode
- Story clustering graduates to transformer-based embeddings (SBERT) instead of MinHash for higher accuracy; LSH index replaced by FAISS approximate nearest neighbor
- Breaking news detection: articles with engagement velocity > 3× baseline trending threshold trigger push notifications via a separate notification service
New objects: FeedbackEvent (explicit user signal — append-only, higher weight than ReadEvent), PublisherSubmission (stable entity — verified publishers bypass crawl pipeline), NotificationSubscription (intent/future constraint — user opts in to breaking news alerts).
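The breaking-news trigger above is a velocity comparison. A sketch of both halves — measuring engagement velocity over a short trailing window and applying the 3× baseline rule; the 5-minute window is an illustrative assumption:

```python
from datetime import datetime, timedelta

def engagement_velocity(read_timestamps, now, window=timedelta(minutes=5)):
    """Reads per minute over the trailing window for one article."""
    recent = [t for t in read_timestamps if now - window <= t <= now]
    return len(recent) / (window.total_seconds() / 60)

def is_breaking(velocity, baseline, multiplier=3.0):
    """Stage 5 trigger: engagement velocity above `multiplier` x the
    trending baseline fires a push via the separate notification
    service (per-user opt-in via NotificationSubscription)."""
    return velocity > multiplier * baseline
```

Keeping the multiplier relative to the trending baseline (rather than an absolute count) makes the trigger self-calibrating across quiet nights and busy news days.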
End of document. All 20 steps have produced explicit output artifacts. The design is complete.