
News Aggregator: System Design #

A complete mechanical derivation of a Google News / Flipboard-style news aggregator — from functional requirements to operable architecture, following the 20-step derivation framework without hand-wavy steps.


Ordering Principle #

Product requirements
  → normalize into operations over state          (Step 1)
  → extract primary objects                       (Step 2)
  → assign ownership, evolution, ordering         (Step 3)
  → extract invariants                            (Step 4)
  → derive minimal DPs from invariants            (Step 5)
  → select concrete mechanisms                    (Step 6)
  → validate independence and source-of-truth     (Step 7)
  → specify exact algorithms                      (Step 8)
  → define logical data model                     (Step 9)
  → map to technology landscape                   (Step 10)
  → define deployment topology                    (Step 11)
  → classify consistency per path                 (Step 12)
  → identify scaling dimensions and hotspots      (Step 13)
  → enumerate failure modes                       (Step 14)
  → define SLOs                                   (Step 15)
  → define operational parameters                 (Step 16)
  → write runbooks                                (Step 17)
  → define observability                          (Step 18)
  → estimate costs                                (Step 19)
  → plan evolution                                (Step 20)

Step 1 — Problem Normalization #

Convert product-language requirements into precise operations over state.

Original Functional Requirements:

  1. The system ingests articles from RSS feeds, APIs, and scraped sources
  2. Articles are deduplicated (same story from multiple sources = one article cluster)
  3. Articles are categorized by topic (ML classification)
  4. Users browse articles by category, topic, or keyword search
  5. Users see a personalized feed ranked by relevance to their interests
  6. Trending articles are surfaced (high engagement in short time window)
  7. Users can search for articles by keyword

Normalized Table:

| # | Original Requirement | Actor | Operation | State Touched |
|---|---|---|---|---|
| 1a | System ingests articles from RSS/API/scrape | Scheduler + CrawlWorker | append event (raw fetch) | ArticleFetch event; Source.last_crawled_at overwrite |
| 1b | Schedule controls when each source is re-crawled | System scheduler | read intent constraint; trigger crawl | CrawlSchedule (intent/future constraint) |
| 2 | Deduplicate articles — same story = one cluster | System pipeline | conditional create on URL hash; approximate-match on content hash | Article (by URL key); StoryCluster (merge) |
| 3 | Categorize articles by topic | System ML pipeline | overwrite metadata on Article | Article.category_labels (mutable metadata) |
| 4a | Users browse by category | User | read projection filtered by category | CategoryFeed (derived view over Article) |
| 4b | Users browse by topic | User | read projection filtered by topic | TopicFeed (derived view over Article) |
| 4c | Users browse by keyword search | User | query indexed projection | SearchIndex (Elasticsearch projection) |
| 5 | Users see personalized feed | User | read pre-computed ranked projection | PersonalizedFeed (derived view over Article × UserInterest) |
| 5a | Feed ranking updates as user reads articles | System | overwrite interest vector | UserInterest (overwrite on read event) |
| 5b | User reads an article | User | append immutable event | ReadEvent |
| 6 | Trending articles surfaced | System | read sliding-window aggregate | TrendingScore (derived view: count of ReadEvents per article per window) |
| 7 | Users search for articles by keyword | User | query indexed projection | SearchIndex (Elasticsearch) |

Key normalizations:

  • FR 1 hides two distinct operations: the crawl scheduling loop (an intent/future constraint that persists across many executions) and the article fetch event (a point-in-time result).
  • FR 2 “deduplication” is actually two distinct mechanisms: exact dedup (URL as natural key — one URL = one Article) and near-duplicate clustering (same story from different sources — requires approximate matching, producing a StoryCluster).
  • FR 5 “see a personalized feed” conceals a write: every read event updates UserInterest. The feed itself is a derived view, not primary state.
  • FR 6 “trending” is NOT primary state. TrendingScore is a derived aggregate over ReadEvents within a sliding window.
  • FR 4c and FR 7 both resolve to the same mechanism (SearchIndex) — not two separate systems.

Step 2 — Object Extraction #

Identify the minimal set of primary state objects; apply four purity tests to each candidate.

Primary Objects (Candidates) #

Source #

  • Class: Stable entity
  • Ownership purity: Two writers on disjoint fields: admin registers/updates sources (url, crawl_interval, status); the pipeline writes last_crawled_at. The writes touch different fields under different conditions, so there is no contention. Accept as one object.
  • Evolution purity: Overwrite — current registration state; last_crawled_at updated after each crawl.
  • Ordering purity: No meaningful order among sources.
  • Non-derivability: Cannot be reconstructed from ArticleFetch events — source configuration (URL, interval) is primary. Pass.
  • Verdict: Valid primary object.

CrawlSchedule #

  • Class: Intent / future constraint
  • Description: Encodes “source S must next be crawled at time T.” Persists until the crawl happens, then re-arms for the next interval. Analogous to AlertSubscription in a price-drop system.
  • Ownership purity: System only — scheduler writes next_crawl_at after each completed crawl.
  • Evolution purity: Overwrite — one current next-crawl-at per source.
  • Ordering purity: No meaningful ordering across different sources; total order within one source (each crawl produces one new schedule entry overwriting the last).
  • Non-derivability: The intent (next scheduled time) cannot be reconstructed solely from past ArticleFetch events without the crawl_interval. Pass.
  • Verdict: Valid primary object (intent/future constraint class).

Article #

  • Class: Stable entity
  • Description: The parsed, normalized representation of one published piece of content at one URL.
  • Ownership purity: Pipeline-only writes; no user writes. Two groups of fields with different evolution: immutable content (url, title, body, published_at, source_id, content_hash) vs. mutable metadata (category_labels, cluster_id, engagement_count). The mutable metadata is written by different sub-pipelines. This is a tension, but both groups live naturally on the same entity and share the same natural key (URL). Accept: the immutable content is set once on insert; metadata fields are updated independently by clearly labeled sub-pipelines.
  • Evolution purity: Content = append-only (set once, immutable). Metadata = overwrite. Mixed, but the entity is the natural unit of reference. Accept.
  • Ordering purity: Total order by published_at within source_id; by ingested_at within the system.
  • Non-derivability: The parsed text, title, and URL are primary — they come from the external world and cannot be derived internally. Pass.
  • Verdict: Valid primary object.

ArticleFetch #

  • Class: Event
  • Description: The raw result of one crawl attempt — raw HTML/RSS XML, HTTP status, fetch timestamp.
  • Ownership purity: CrawlWorker only writes.
  • Evolution purity: Append-only and immutable.
  • Ordering purity: Total order by fetched_at within source_id.
  • Non-derivability: The raw bytes from the external source are the provenance record; they cannot be reconstructed. Pass.
  • Verdict: Valid primary object (event class). Used as input to the parse pipeline and for archival.

StoryCluster #

  • Class: Stable entity (with merge-based evolution)
  • Description: A group of Articles that cover the same real-world story, from different sources (e.g., CNN + BBC + Reuters all cover the same plane crash). Cluster membership is determined by near-duplicate similarity.
  • Ownership purity: System pipeline only — the clustering pipeline writes cluster assignments.
  • Evolution purity: Merge — new articles can join an existing cluster; clusters can merge if two previously separate clusters are found to be about the same story. Commutative: adding article A then B is the same as adding B then A.
  • Ordering purity: No meaningful ordering among cluster members; cluster has a canonical_article_id pointing to the most authoritative source.
  • Non-derivability: The cluster identity and membership cannot be derived without running the near-duplicate algorithm. The result of MinHash/LSH is not mechanically derivable from Article content without the algorithm state (band signatures, hash functions). Accept.
  • Verdict: Valid primary object.

ReadEvent #

  • Class: Event
  • Description: User reads an article. Immutable point-in-time record.
  • Ownership purity: Single writer (the user, via the client/API layer).
  • Evolution purity: Append-only, immutable.
  • Ordering purity: Total order by timestamp within user_id.
  • Non-derivability: The fact that user U read article A at time T is primary. Cannot be reconstructed. Pass.
  • Verdict: Valid primary object.

UserInterest #

  • Class: Stable entity
  • Description: The current interest vector for a user — a weighted bag of topics/categories derived from their reading history. Used as input to feed ranking.
  • Ownership purity: System pipeline only (the interest-update pipeline writes after each ReadEvent). Users never write directly.
  • Evolution purity: Overwrite — always reflects the latest computed interest vector, not a history.
  • Ordering purity: No meaningful ordering.
  • Non-derivability: Could theoretically be reconstructed by replaying all ReadEvents through the interest model. However, as an online-updated vector (exponential moving average with decay), the current vector is the operational state; full reconstruction would be extremely expensive and model-dependent. Accept as primary state (the current vector is the operational source of truth).
  • Verdict: Valid primary object.

Derived / Rejected Objects #

| Candidate | Problem | Resolution |
|---|---|---|
| TrendingScore | Derivable from ReadEvents in sliding window | Derived view. Computed by Flink windowing job; stored in Redis sorted set with TTL. Not primary state. |
| PersonalizedFeed | Derivable from Article × UserInterest × ranking model | Derived view (materialized). CQRS read model; rebuilt on UserInterest update. Not primary state. |
| SearchIndex | Projection of Article content into inverted index | Derived view. Elasticsearch index maintained via CDC from Postgres Article table. Not primary state. |
| CategoryFeed | Derivable from Article filtered by category | Derived view. Served by querying Postgres/Elasticsearch with a category filter. Not materialized separately. |
| TopicFeed | Derivable from Article filtered by topic | Derived view. Same as CategoryFeed. |

Step 3 — Axis Assignment #

For every primary object, define ownership, evolution, and ordering.

Object: Source
  Ownership:   Multi-writer: admin sets registration config; pipeline updates last_crawled_at.
               In practice: admin is the authority on URL/interval; pipeline is authority on crawl timestamps.
               No contention between admin and pipeline (different fields, admin changes rare).
  Evolution:   Overwrite (current registration state is the truth; last_crawled_at replaced after each crawl)
  Ordering:    No meaningful order among sources

Object: CrawlSchedule
  Ownership:   System-only (scheduler writes next_crawl_at; crawl worker updates it after completion)
  Evolution:   Overwrite (one next_crawl_at per source_id; replaced after each crawl completes)
  Ordering:    No meaningful order among schedule entries; within a source, each entry supersedes the last

Object: Article
  Ownership:   Multi-writer pipeline (crawler writes base record; classifier pipeline writes category_labels;
               clustering pipeline writes cluster_id; engagement pipeline writes engagement_count)
               Each field has single-pipeline authority. No two pipelines write the same field.
  Evolution:   Content fields: append-only (set on first insert, never modified)
               Metadata fields: overwrite (each pipeline overwrites its own field)
  Ordering:    Total order by published_at within source_id (for browsing)
               Total order by ingested_at across all articles (for recency feed)

Object: ArticleFetch
  Ownership:   Single writer per partition (crawl worker writes; one worker per source)
  Evolution:   Append-only (immutable after creation)
  Ordering:    Total order by fetched_at within source_id

Object: StoryCluster
  Ownership:   System-only pipeline (clustering pipeline — MinHash/LSH job)
  Evolution:   Merge (articles join clusters; clusters may merge — commutative, associative operation)
  Ordering:    No meaningful order among cluster members; cluster itself ordered by article count desc for display

Object: ReadEvent
  Ownership:   Single writer (the requesting user, via API)
  Evolution:   Append-only (immutable)
  Ordering:    Total order by timestamp within user_id

Object: UserInterest
  Ownership:   System-only (interest-update pipeline writes after each ReadEvent batch)
  Evolution:   Overwrite (current interest vector replaces previous; EMA-based decay)
  Ordering:    No meaningful order (set of weighted topics)

Step 4 — Invariant Extraction #

Precise, testable properties that must hold in every valid execution.

I-1: Article URL Uniqueness (Uniqueness/Idempotency) #

For every valid system state, there is at most one Article record with any given url. If the same URL is fetched N times, exactly one Article record exists.

Scope: System-wide (url is globally unique) Implication: Insert of Article must be INSERT ... ON CONFLICT (url) DO NOTHING or equivalent upsert. The URL is the natural idempotency key.

I-2: Article Content Immutability (Ordering) #

Once an Article’s content_hash, title, body, and source_id fields are set, they are never modified. Only metadata fields (category_labels, cluster_id, engagement_count) may be updated after initial insert.

Scope: Per article (url) Implication: Write path for content must be insert-once. Metadata update paths must not touch content columns.

I-3: CrawlSchedule Freshness (Eligibility) #

A source S must not be fetched again before next_crawl_at(S). Equivalently, any ArticleFetch for source S must have fetched_at >= CrawlSchedule.next_crawl_at for that source.

Scope: Per source_id Implication: Scheduler must enforce the crawl interval. If a crawl is triggered early (e.g., by operator), the CrawlSchedule must be updated atomically before the crawl fires.

I-4: Deduplication — Near-Duplicate Articles Map to One Cluster (Uniqueness) #

For any two articles A1 and A2 whose MinHash similarity exceeds threshold θ (= 0.85), they must share the same cluster_id. No two articles covering the same real-world story should have different cluster_ids after the clustering pipeline completes.

Scope: Approximate — this is a best-effort invariant, not an atomic one. The clustering pipeline may run with some lag. Implication: The invariant is probabilistic (MinHash is approximate). Accept: the system provides near-dedup with bounded false-negative rate, not perfect dedup.
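
Why the false-negative rate is bounded can be read off the LSH parameters used in Step 8.3 (128 hashes split into 16 bands of 8): a pair with Jaccard similarity s shares at least one band with probability 1 - (1 - s^8)^16. A minimal sketch of that calculation (the helper function is illustrative, not part of the pipeline):

def lsh_detection_probability(s: float, bands: int = 16, rows: int = 8) -> float:
    """Probability that two items with Jaccard similarity s collide in at least one LSH band."""
    return 1.0 - (1.0 - s ** rows) ** bands

# At the clustering threshold θ = 0.85 a matching pair is almost always surfaced
# as a candidate; well below the threshold, candidates are rarely generated.
for s in (0.95, 0.85, 0.70, 0.50):
    print(f"s = {s:.2f}  P(candidate) = {lsh_detection_probability(s):.3f}")
# ≈ 1.000, 0.994, 0.613, 0.061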

I-5: ReadEvent Idempotency (Uniqueness/Idempotency) #

If user U submits a read event for article A multiple times (due to retry or double-tap), at most one ReadEvent is recorded per (user_id, article_id, session_id).

Scope: Per (user_id, article_id, session_id) Implication: ReadEvent insert must dedup by (user_id, article_id, session_id). Use session_id as idempotency key.

I-6: TrendingScore Reflects Recent ReadEvents (Propagation) #

The TrendingScore for article A at time T reflects all ReadEvents for A in the window [T − W, T], where W = 1 hour. Staleness bound: ε ≤ 30 seconds.

Scope: Per article_id, per window Implication: Stream aggregation must process ReadEvents within 30 seconds. The score is approximate (Count-Min Sketch acceptable).

I-7: PersonalizedFeed Reflects Current UserInterest (Propagation) #

The PersonalizedFeed for user U reflects the UserInterest vector as of at most 5 minutes ago. If UserInterest has not changed in 24 hours, the feed may be stale by up to 24 hours (no reads = no updates).

Scope: Per user_id Implication: Feed materialization must be triggered within 5 minutes of a UserInterest update. TTL-based cache invalidation acceptable.

I-8: SearchIndex Reflects Article Table (Propagation) #

The SearchIndex (Elasticsearch) contains every Article that has been in the Postgres Article table for more than 60 seconds. No Article is permanently absent from the SearchIndex once ingested.

Scope: System-wide Implication: CDC from Postgres to Elasticsearch must have guaranteed delivery and must handle reindex failures with retry.

I-9: UserInterest Updated on Every ReadEvent Batch (Accounting) #

The UserInterest vector for user U reflects all ReadEvents for U with timestamp ≤ T − 5 minutes. The interest model must process every ReadEvent exactly once (no duplicates, no drops).

Scope: Per user_id Implication: Exactly-once processing of ReadEvents into UserInterest. Dedup by (user_id, article_id, session_id) before applying to interest model.

I-10: Access Control — Users Read Own Data (Access Control) #

User U may read only their own ReadEvents, UserInterest, and PersonalizedFeed. Category and topic feeds are public. Article content is public. Trending scores are public.

Scope: Per user_id Implication: All user-scoped read endpoints must authenticate and authorize by user_id. Shared/public endpoints require no user scoping.
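
As an illustration of the scoping rule (a sketch, not the gateway's actual code), a request for a user-scoped path such as /feed/{user_id} could be authorized roughly as follows, assuming HS256-signed JWTs that carry a user_id claim and the PyJWT library:

import jwt  # PyJWT

def authorize_user_scope(auth_header: str, path_user_id: str, secret: str) -> bool:
    """Allow user-scoped reads only when the token's user_id matches the path parameter."""
    if not auth_header.startswith("Bearer "):
        return False
    token = auth_header[len("Bearer "):]
    try:
        claims = jwt.decode(token, secret, algorithms=["HS256"])
    except jwt.InvalidTokenError:
        return False
    return claims.get("user_id") == path_user_id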


Step 5 — DP Derivation #

Derive the minimal enforcing mechanism per invariant cluster.

| Invariant | Cluster | Minimal DP |
|---|---|---|
| I-1 (URL uniqueness) | Uniqueness | Unique index on Article.url in Postgres; INSERT ON CONFLICT DO NOTHING in crawl pipeline |
| I-2 (content immutability) | Ordering | Write path for content uses insert-only; metadata update queries restrict to metadata columns only (UPDATE article SET category_labels = ? WHERE url = ?) |
| I-3 (crawl freshness) | Eligibility | Scheduler reads CrawlSchedule.next_crawl_at; fires crawl only when now() >= next_crawl_at; updates next_crawl_at = now() + crawl_interval atomically after dispatch |
| I-4 (near-dedup clustering) | Uniqueness (approximate) | MinHash signatures computed on Article ingest; LSH lookup for candidate near-duplicates; cluster merge via pipeline-side union-find; cluster_id written back to Article |
| I-5 (ReadEvent idempotency) | Uniqueness/Idempotency | Unique index on ReadEvent(user_id, article_id, session_id); Kafka consumer uses event_id as dedup key before DB insert |
| I-6 (trending freshness) | Propagation + Accounting | Flink sliding window (1h / 30s slide); Count-Min Sketch per article; result written to Redis sorted set with TTL |
| I-7 (feed freshness) | Propagation | UserInterest update triggers feed recompute; Redis cache key feed:{user_id} with 5-minute TTL; feed service falls back to recompute on cache miss |
| I-8 (search freshness) | Propagation | Debezium CDC from Postgres Article table → Kafka topic → Elasticsearch sink; Kafka consumer retries on ES failure; dead-letter queue for persistent failures |
| I-9 (interest accounting) | Accounting + Idempotency | Flink job consumes ReadEvent Kafka topic; dedup state keyed by (user_id, article_id, session_id); exactly-once semantics with Flink checkpointing + Kafka transactional producer to UserInterest update topic |
| I-10 (access control) | Access Control | API gateway enforces JWT authentication; user-scoped endpoints validate user_id from token matches path parameter |

Step 6 — Mechanism Selection #

The mechanical bridge from invariants to concrete implementation decisions.

6.1 — Invariant Type → Mechanism Family #

| Invariant Type | Mechanism Family |
|---|---|
| Eligibility (I-3) | Optimistic scheduler with compare-and-update on next_crawl_at |
| Ordering (I-2) | Insert-once + column-restricted update paths |
| Accounting (I-6, I-9) | Stream aggregation (Flink) + windowing |
| Uniqueness/Idempotency (I-1, I-5) | Natural key unique index + ON CONFLICT; dedup by event_id in stream |
| Propagation (I-6, I-7, I-8) | CDC (Debezium) + cache TTL + stream aggregation |
| Access Control (I-10) | API gateway JWT validation |

6.2 — Ownership × Evolution → Concurrency Mechanism #

| Object | Ownership | Evolution | Concurrency Mechanism |
|---|---|---|---|
| Source | Multi-writer (admin + pipeline, different fields) | Overwrite | No coordination needed — fields are non-overlapping; each writer owns its columns |
| CrawlSchedule | System-only | Overwrite | Single writer per source_id partition → no coordination needed |
| Article (content) | Pipeline only | Append-only (insert once) | Unique index enforces single insert; ON CONFLICT DO NOTHING for retries |
| Article (metadata) | Multi-pipeline, each field owned by one pipeline | Overwrite | No coordination needed — each pipeline owns distinct columns |
| ArticleFetch | Single writer per source | Append-only | No contention mechanism needed; dedup required (event_id as idempotency key) |
| StoryCluster | System-only pipeline | Merge (commutative) | CRDT-compatible: union-find cluster merge is commutative and associative; pipeline applies merges without coordination |
| ReadEvent | Single writer (user) | Append-only | No contention; dedup by (user_id, article_id, session_id) via unique index |
| UserInterest | System-only pipeline | Overwrite | Single writer per user_id partition → no coordination needed |
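
The "pipeline-side union-find" referenced for StoryCluster above can be pictured with a minimal in-memory sketch (illustrative only; the production job keeps equivalent state inside the clustering pipeline). Because union is commutative and associative, the resulting partition does not depend on the order in which near-duplicate pairs arrive:

class UnionFind:
    """Minimal union-find with path compression; a cluster is identified by its representative."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path compression
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

# Order independence: the same near-duplicate pairs, applied in any order, yield the same clusters.
pairs = [("cnn-1", "bbc-7"), ("bbc-7", "reuters-3")]
forward, backward = UnionFind(), UnionFind()
for a, b in pairs:
    forward.union(a, b)
for a, b in reversed(pairs):
    backward.union(a, b)
assert forward.find("cnn-1") == forward.find("reuters-3")
assert backward.find("cnn-1") == backward.find("reuters-3")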

6.3 — Q1-Q5 Full Derivation for Key Flows #

Flow 1: Article Deduplication and Story Clustering #

Q1 — Scope: The dedup check (URL uniqueness) is within-service (single Postgres instance per shard). Near-duplicate clustering (StoryCluster) is within-service but batched.

Q2 — Failure: CrawlWorker may crash after fetching but before writing to Postgres. The ArticleFetch event is published to Kafka first; Postgres write is downstream. At-least-once delivery from Kafka → dedup by URL (Postgres unique index absorbs retries). Slow degradation (Postgres down): Circuit Breaker on Postgres write path; crawler buffers in Kafka.

Q3 — Data: URL is content-addressed (URL is the natural hash = idempotency key for free). Near-duplicate content: MinHash signature is a commutative, order-independent summary of the article’s tokens → stream aggregation (signature computed at ingest; LSH lookup in Redis or in-memory index).

Q4 — Access: Dedup is a write-path concern, not a read concern.

Q5 — Coupling: CrawlWorker publishes ArticleFetch event to Kafka (Outbox-equivalent pattern: Kafka acts as the durable log). Parse-and-dedup pipeline consumes from Kafka → writes to Postgres. Kafka decouples the crawler from the database.

Mechanism selected: Pipeline pattern (CrawlWorker → Kafka → ParseWorker → Postgres). URL dedup via INSERT ON CONFLICT (url) DO NOTHING. Near-duplicate clustering via MinHash + LSH: compute 128-hash MinHash signatures at parse time (16 LSH bands of 8 hashes each), store them in Redis sets keyed by LSH band, look up candidates, compute Jaccard similarity, assign or create a cluster via pipeline-side union-find, and write cluster_id back to Article in a separate update.

Dedup + stream aggregation always paired: ArticleFetch events on Kafka are consumed with exactly-once semantics (Kafka consumer group with committed offsets). Before Postgres insert, check URL uniqueness. Event_id dedup in the consumer prevents double-processing on consumer restart.

Flow 2: Trending Score Computation #

Q1 — Scope: TrendingScore is a derived view. The source of truth is the ReadEvent stream. Trending is global (not per-user). Within-service scope (Flink job consuming from Kafka).

Q2 — Failure: Flink job crashes → restores from checkpoint; ReadEvents are replayed from Kafka offset. At-least-once delivery → dedup by event_id in Flink state before counting. Slow degradation: Count-Min Sketch is naturally tolerant of small error; trending score degrades gracefully.

Q3 — Data: ReadEvents are time-windowed (1-hour sliding window with 30-second slide). Pattern: Windowing. High cardinality (millions of articles): Count-Min Sketch per window for approximate counting. Temporal Decay: older reads within the window count equally (sliding window handles this); could add exponential decay weight for recency bias.

Q4 — Access: Read » write for trending (millions of reads, one write path per 30s slide). Pattern: Cache-Aside. Trending scores are written to Redis sorted set (trending:{window_start_epoch}) with TTL = 2 * window_slide = 60s. API reads from Redis. On Redis miss, fall back to Flink-managed state (or return stale score).

Q5 — Coupling: ReadEvent → Kafka → Flink (sliding window job) → Redis. Flink writes to Redis asynchronously every slide interval. No synchronous coupling between read event and trending update.

Mechanism selected: Flink sliding window (1 hour / 30 second slide). Count-Min Sketch per article per window (width=2048, depth=5 — error ε=0.001, probability δ=0.03). Result materialized to Redis sorted set ZADD trending:{epoch} {score} {article_id} with EXPIRE 60. Stream aggregation + dedup per event (Flink keyed state with event_id dedup; TTL 2 hours to bound state size).

Required combination: Stream aggregation + dedup per event always — per framework rule 6.4. Flink maintains Set<event_id> per article_id keyed state with 2-hour TTL for dedup before counting.
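
For concreteness, the counting structure named above is small enough to sketch directly. A minimal Count-Min Sketch with the width and depth used here (the hash seeding is illustrative; Flink would hold this in keyed state):

import hashlib

class CountMinSketch:
    """Approximate counter: never undercounts; overestimates by roughly ε·N with high probability (ε ≈ 2/width)."""
    def __init__(self, width: int = 2048, depth: int = 5):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, key: str):
        for row in range(self.depth):
            digest = hashlib.blake2b(key.encode(), salt=row.to_bytes(8, "little")).digest()
            yield row, int.from_bytes(digest[:8], "little") % self.width

    def increment(self, key: str, by: int = 1):
        for row, col in self._buckets(key):
            self.table[row][col] += by

    def estimate(self, key: str) -> int:
        return min(self.table[row][col] for row, col in self._buckets(key))

sketch = CountMinSketch()
for _ in range(42):
    sketch.increment("article-123")
print(sketch.estimate("article-123"))  # >= 42; exactly 42 at this load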

Flow 3: Personalized Feed Generation #

Q1 — Scope: PersonalizedFeed is per-user. UserInterest is per-user. Article metadata is global. Cross-service: UserInterest update (stream pipeline) triggers feed recompute (feed service). Scope is cross-service → Saga pattern for coordination? No — feed recompute is not a transaction; it’s an async triggered computation. Pattern: Outbox + Relay (UserInterest update → event on Kafka → feed recompute service).

Q2 — Failure: UserInterest update fails → feed uses stale interest vector (acceptable per I-7: 5-minute staleness bound). Feed recompute service crashes → cache TTL expires → next request triggers recompute. No data loss.

Q3 — Data: Read shape ≠ write shape: write model is ReadEvent stream; read model is ranked list of articles. Pattern: CQRS. Write side: Flink consumes ReadEvents, updates UserInterest vector (EMA update). Read side: Feed service reads UserInterest + Article metadata, computes ranked list, caches in Redis. Materialized View: pre-computed feed stored in Redis. Time-bounded: TTL 5 minutes on cached feed.

Q4 — Access: Read » write. Millions of feed reads per second; one interest update per user per session. Pattern: CQRS + Materialized View. Feed reads hit Redis cache. Cache miss triggers synchronous recompute (top-K articles by dot product with interest vector, from candidate set fetched from Elasticsearch/Postgres by recency). Rebuild path: on UserInterest update, invalidate cache key feed:{user_id} and async-enqueue recompute job.

Q5 — Coupling: UserInterest update is written to Postgres AND publishes event to Kafka (CDC via Debezium or explicit outbox pattern). Feed recompute service consumes from Kafka, recomputes feed, writes to Redis. Async decoupling between interest update and feed availability.

Circuit topology: The personalized feed is a closed-loop feedback system:

  • Plant: Article ranking function (dot product of UserInterest vector with Article topic vector)
  • Feedback signal: ReadEvent stream (user reads article A → increases weight of A’s topics in UserInterest)
  • Error signal: Difference between predicted interest (current interest vector) and observed behavior (what user actually reads)
  • Controller: EMA interest update (learning rate α = 0.1, decay γ = 0.95 per day)
  • Stability: The system is stable because α < 1 and γ < 1. Without decay, the interest vector could be dominated by old reads (integrator windup). Decay is the RC filter preventing windup (see the numeric check after this list).
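
A short numeric check of the stability claim, using the α and γ values above and treating γ as a per-update decay for simplicity (the real pipeline applies decay per elapsed day and also re-normalizes the vector, which bounds it further):

# With decay γ < 1 and learning rate α, a topic the user reads on every update converges
# to a bounded weight of roughly α / (1 - γ) instead of growing without bound.
alpha, gamma = 0.1, 0.95
weight = 0.0
for _ in range(200):
    weight = gamma * weight + alpha * 1.0   # user keeps reading the same topic
print(round(weight, 3))                      # ≈ 2.0 = alpha / (1 - gamma); no windup
# Without decay (gamma = 1.0) the same loop grows linearly: 0.1 * 200 = 20.0 and rising.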

Mechanism selected: CQRS + Materialized View + Cache-Aside. Write path: Flink interest-update job (ReadEvent → EMA update → Postgres UserInterest). Read path: Feed service reads Redis cache; miss → compute top-K via ANN (Approximate Nearest Neighbor) search of Article topic vectors against UserInterest vector using FAISS or Elasticsearch KNN; write result to Redis with 5-minute TTL. Rebuild path on cache miss or UserInterest update.

Required combination (6.4): CQRS + Materialized View requires explicit rebuild path — implemented as: (a) cache TTL expiry triggers lazy rebuild on next request, (b) UserInterest update event triggers proactive cache invalidation and async rebuild via Kafka consumer.

Flow 4: Crawl Scheduling (The Scheduler Pattern) #

Q1 — Scope: CrawlSchedule is within-service. No cross-service coordination needed. One scheduler process per region.

Q2 — Failure: Scheduler crashes → on restart, reads all CrawlSchedule rows with next_crawl_at <= now() and re-fires pending crawls. At-least-once crawl delivery → ArticleFetch dedup by (source_id, fetched_at window) prevents duplicate article processing. Slow degradation: if crawl workers are overwhelmed, scheduler backs off via exponential backoff on next_crawl_at.

Q3 — Data: CrawlSchedule is an intent/future constraint. The pattern: scheduled jobs are a special class of state where the intent (crawl this source every N minutes) must persist across restarts. The scheduler is a polling loop: SELECT * FROM crawl_schedule WHERE next_crawl_at <= NOW() AND status = 'PENDING' LIMIT 100 FOR UPDATE SKIP LOCKED.

FOR UPDATE SKIP LOCKED is the critical mechanism: allows multiple scheduler workers to claim rows without contention. Each worker claims a batch, fires crawl jobs, updates next_crawl_at = NOW() + crawl_interval.
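
A minimal claim loop in application code might look like the following (psycopg2 shown for concreteness; table and column names as defined in Step 9):

import psycopg2

def claim_due_sources(conn, batch_size: int = 100):
    """Claim up to batch_size due sources; concurrent scheduler instances skip rows locked by others."""
    with conn:  # one transaction: claim + mark IN_FLIGHT atomically
        with conn.cursor() as cur:
            cur.execute(
                """
                SELECT source_id, next_crawl_at
                FROM crawl_schedule
                WHERE next_crawl_at <= NOW() AND status = 'PENDING'
                ORDER BY next_crawl_at
                LIMIT %s
                FOR UPDATE SKIP LOCKED
                """,
                (batch_size,),
            )
            rows = cur.fetchall()
            for source_id, _ in rows:
                cur.execute(
                    "UPDATE crawl_schedule SET status = 'IN_FLIGHT', last_triggered_at = NOW() "
                    "WHERE source_id = %s",
                    (source_id,),
                )
    return [r[0] for r in rows]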

Q4 — Access: Scheduler is internal only. No external read concern.

Q5 — Coupling: Scheduler writes CrawlTrigger event to Kafka. CrawlWorker pool consumes from Kafka, fetches source, publishes ArticleFetch event. This decouples scheduler throughput from crawl worker throughput.

Mechanism selected: Database-backed scheduler using FOR UPDATE SKIP LOCKED on crawl_schedule table. Scheduler daemon runs in loop: poll every 10 seconds, claim batch of due sources, publish to Kafka crawl-triggers topic, update next_crawl_at. Multiple scheduler instances are safe (SKIP LOCKED prevents double-claiming). CrawlWorker pool (auto-scaled) consumes crawl-triggers, fetches source, publishes to article-fetches Kafka topic.

No contention mechanism needed at the crawl worker level (pipeline ownership — system-only writes, single writer per source_id because crawl_schedule row is locked before dispatch). Dedup required: ArticleFetch events carry fetch_id = sha256(source_id + fetched_at_epoch_bucket) as idempotency key.


Step 7 — Axiomatic Validation #

Source-of-truth table: no dual truth.

| State | Primary Store | Derived Stores | Allowed Staleness | Rebuild Mechanism |
|---|---|---|---|---|
| Source configuration | Postgres source table | None | | |
| CrawlSchedule | Postgres crawl_schedule table | None | | |
| Article content (immutable) | Postgres article table | S3 (raw HTML archive) | n/a (S3 is archive, not truth) | |
| Article metadata (category_labels, cluster_id) | Postgres article table | SearchIndex (ES), CategoryFeed cache | ES: 60s; Feed cache: 5 min | CDC replay; cache TTL |
| ArticleFetch (raw HTML) | S3 (primary archive) + Kafka (transient) | None | Kafka is transit, not truth | S3 is the archive |
| StoryCluster membership | Postgres story_cluster + article.cluster_id | None | | Re-run clustering pipeline on Article table |
| ReadEvent | Postgres read_event table (partitioned) | Kafka (transit) | Kafka is transit | |
| UserInterest | Postgres user_interest table | Redis feed cache | 5 min | Recompute from ReadEvent log |
| TrendingScore | Redis sorted set (TTL 60s) | None (ephemeral) | 30s | Recompute by replaying ReadEvent stream from Kafka |
| PersonalizedFeed | Redis cache (TTL 5 min) | None | 5 min | Recompute from UserInterest + Article table |
| SearchIndex | Elasticsearch | None (ES is the index) | 60s lag from Postgres | Full reindex from Postgres |

Dual-truth risks identified and mitigated:

  1. Risk: Article metadata in Postgres vs. Elasticsearch could diverge. Mitigation: ES is explicitly classified as a derived projection. CDC (Debezium) guarantees eventual delivery. ES is never written directly — only via CDC pipeline.

  2. Risk: UserInterest could be dual-written by two concurrent interest-update pipeline workers for the same user. Mitigation: Flink partitions by user_id — exactly one Flink task per user_id; no concurrent writes. Postgres update uses WHERE user_id = ? AND version = ? (optimistic lock) for safety.

  3. Risk: TrendingScore in Redis vs. raw ReadEvent count in Postgres. Mitigation: TrendingScore is explicitly a derived view; it is not reconciled against Postgres ReadEvent counts in real-time. The Count-Min Sketch is the operational truth for trending; accuracy is bounded by sketch error parameters.


Step 8 — Algorithm Design #

Pseudocode for every write path and state machine.

8.1 — Crawl Pipeline (Scheduler → Fetch → Parse → Ingest) #

// Scheduler daemon (runs every 10 seconds)
function scheduler_tick():
    BEGIN TRANSACTION
    sources = SELECT * FROM crawl_schedule
              WHERE next_crawl_at <= NOW()
              AND status = 'PENDING'
              ORDER BY next_crawl_at ASC
              LIMIT 100
              FOR UPDATE SKIP LOCKED

    for each source in sources:
        trigger = {
            source_id: source.source_id,
            url: source.url,
            fetch_type: source.feed_type,  // RSS | API | SCRAPE
            fetch_id: sha256(source.source_id + floor(now() / 60))  // idempotency key
        }
        kafka_publish('crawl-triggers', key=source.source_id, value=trigger)
        UPDATE crawl_schedule
        SET status = 'IN_FLIGHT', last_triggered_at = NOW()
        WHERE source_id = source.source_id
    COMMIT

// CrawlWorker (Kafka consumer, auto-scaled pool)
function crawl_worker(trigger):
    // Idempotency check
    if redis_setnx('crawl:in-flight:' + trigger.fetch_id, 1, TTL=600):
        raw_content = http_fetch(trigger.url)  // with timeout 30s, Circuit Breaker
        s3_key = 'raw/' + trigger.source_id + '/' + trigger.fetch_id + '.html'
        s3_put(s3_key, raw_content)

        fetch_event = {
            fetch_id: trigger.fetch_id,
            source_id: trigger.source_id,
            fetched_at: now(),
            s3_key: s3_key,
            http_status: raw_content.status,
            content_length: len(raw_content.body)
        }
        kafka_publish('article-fetches', key=trigger.source_id, value=fetch_event)

        // Re-arm the schedule (crawl_interval comes from the source row)
        UPDATE crawl_schedule
        SET status = 'PENDING',
            next_crawl_at = NOW() + (SELECT crawl_interval FROM source WHERE source_id = trigger.source_id)
        WHERE source_id = trigger.source_id

        // Record the crawl time on the source row (source.last_crawled_at)
        UPDATE source SET last_crawled_at = NOW() WHERE source_id = trigger.source_id

// ParseWorker (Kafka consumer)
function parse_worker(fetch_event):
    if fetch_event.http_status != 200:
        emit_metric('crawl.fetch_error', {source_id: fetch_event.source_id})
        return

    raw_html = s3_get(fetch_event.s3_key)
    articles_raw = parse_feed(raw_html, fetch_event.source_id)  // RSS/Atom/HTML

    for each raw_article in articles_raw:
        url = normalize_url(raw_article.url)
        content_hash = sha256(raw_article.title + raw_article.body)

        // URL uniqueness enforced by DB
        result = INSERT INTO article (url, title, body, content_hash, published_at, source_id, ingested_at)
                 VALUES (url, ...)
                 ON CONFLICT (url) DO NOTHING
                 RETURNING article_id

        if result.rows_affected > 0:
            // New article — enqueue for classification and clustering
            kafka_publish('new-articles', key=url, value={article_id, content_hash, url})
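
The Circuit Breaker referenced on the fetch path (and on the Postgres write path in Flow 1) reduces to a small state machine. A minimal sketch, not the behavior of any particular library:

import time

class CircuitBreaker:
    """CLOSED → OPEN after `threshold` consecutive failures; allows one trial call after `reset_after` seconds."""
    def __init__(self, threshold: int = 5, reset_after: float = 30.0):
        self.threshold, self.reset_after = threshold, reset_after
        self.failures, self.opened_at = 0, None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: skipping call")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result

# Usage (hypothetical fetch function):
#   breaker = CircuitBreaker()
#   raw = breaker.call(http_fetch, trigger.url)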

8.2 — Article Classification Pipeline #

// ClassifierWorker (Kafka consumer, reads 'new-articles')
function classify_worker(new_article_event):
    article = fetch_article_content(new_article_event.article_id)
    category_labels = ml_classify(article.title + ' ' + article.body[:2000])
    // Returns list of (category, confidence) pairs

    UPDATE article
    SET category_labels = category_labels,
        classified_at = NOW()
    WHERE article_id = new_article_event.article_id

    kafka_publish('classified-articles', key=new_article_event.article_id, value={
        article_id, category_labels
    })

8.3 — Story Clustering (Near-Duplicate Detection) #

// ClusterWorker (Kafka consumer, reads 'classified-articles')
function cluster_worker(classified_event):
    article = fetch_article(classified_event.article_id)

    // Step 1: Compute MinHash signature
    tokens = shingle(article.title + ' ' + article.body, k=3)  // 3-gram shingles
    signature = minhash(tokens, num_hashes=128)

    // Step 2: LSH candidate lookup
    // Divide 128 hashes into 16 bands of 8 hashes each
    // Each band → one LSH bucket key
    candidates = []
    for band_idx in range(16):
        band = signature[band_idx*8 : (band_idx+1)*8]
        bucket_key = 'lsh:band:' + band_idx + ':' + hash(band)
        bucket_members = redis_smembers(bucket_key)
        candidates.extend(bucket_members)
        redis_sadd(bucket_key, article.article_id)
        redis_expire(bucket_key, 7 * 86400)  // 7-day TTL

    // Step 3: Compute exact Jaccard similarity for candidates
    // (to filter LSH false positives)
    best_match = None
    best_similarity = 0
    for candidate_id in set(candidates):
        candidate_sig = get_minhash_signature(candidate_id)  // from Redis or DB
        similarity = jaccard_estimate(signature, candidate_sig)
        if similarity > 0.85 and similarity > best_similarity:
            best_match = candidate_id
            best_similarity = similarity

    // Step 4: Assign cluster
    if best_match is None:
        // Create new cluster — this article is the canonical
        new_cluster_id = uuid()
        INSERT INTO story_cluster (cluster_id, canonical_article_id, created_at)
        VALUES (new_cluster_id, article.article_id, NOW())
        cluster_id = new_cluster_id
    else:
        // Join existing cluster
        cluster_id = get_cluster_id(best_match)
        // If best_match has no cluster (race condition), create one
        if cluster_id is None:
            cluster_id = uuid()
            INSERT INTO story_cluster ... (same as above for best_match)

    UPDATE article SET cluster_id = cluster_id WHERE article_id = article.article_id
    store_minhash_signature(article.article_id, signature)  // Redis, TTL 7 days
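
The MinHash/Jaccard machinery used above, stripped to its essentials (word-level shingling and the hashing scheme are illustrative choices; the pipeline stores the 128-value signature in Redis as described):

import hashlib

NUM_HASHES = 128

def shingles(text: str, k: int = 3):
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash(tokens, num_hashes: int = NUM_HASHES):
    """Signature[i] = min over tokens of the i-th salted hash."""
    sig = []
    for i in range(num_hashes):
        salt = i.to_bytes(4, "little")
        sig.append(min(
            int.from_bytes(hashlib.blake2b(t.encode(), salt=salt, digest_size=8).digest(), "little")
            for t in tokens
        ))
    return sig

def jaccard_estimate(sig_a, sig_b) -> float:
    """Fraction of matching signature positions estimates Jaccard similarity of the shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = minhash(shingles("breaking plane crash near the coast injures dozens today"))
b = minhash(shingles("plane crash near the coast injures dozens today officials say"))
print(jaccard_estimate(a, b))  # high for near-duplicate stories, near 0 for unrelated text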

8.4 — ReadEvent Recording and Interest Update #

// API handler: POST /articles/{article_id}/read
function record_read(user_id, article_id, session_id):
    event_id = sha256(user_id + article_id + session_id)

    // Idempotency: unique index on (user_id, article_id, session_id)
    result = INSERT INTO read_event (event_id, user_id, article_id, session_id, timestamp)
             VALUES (event_id, user_id, article_id, session_id, NOW())
             ON CONFLICT (user_id, article_id, session_id) DO NOTHING

    if result.rows_affected > 0:
        // Publish to Kafka for downstream consumers
        kafka_publish('read-events', key=user_id, value={
            event_id, user_id, article_id, session_id, timestamp: NOW()
        })

    return OK

// Flink job: interest-update-pipeline
// Input: 'read-events' Kafka topic, keyed by user_id
// State: per user_id — Set<event_id> (dedup, TTL 48h), current interest vector
function interest_update_flink(read_event):
    // Dedup
    if dedup_state.contains(read_event.event_id):
        return  // duplicate, skip
    dedup_state.add(read_event.event_id)
    dedup_state.expire(read_event.event_id, TTL=48h)

    // Fetch article topic vector
    article_topics = fetch_article_topics(read_event.article_id)
    // Returns: {sports: 0.8, politics: 0.1, tech: 0.05, ...}

    // EMA update of interest vector
    current_interest = get_user_interest(read_event.user_id)
    // Apply decay for time since last update
    time_delta_days = (NOW() - current_interest.last_updated) / 86400
    decay = pow(0.95, time_delta_days)

    new_interest = {}
    for topic in all_topics:
        new_interest[topic] = decay * current_interest.vector.get(topic, 0.0)
                              + 0.1 * article_topics.get(topic, 0.0)

    // Normalize to unit vector
    norm = sqrt(sum(v^2 for v in new_interest.values()))
    new_interest = {k: v/norm for k, v in new_interest.items()}

    // Write to Postgres (optimistic lock)
    rows = UPDATE user_interest
           SET interest_vector = new_interest,
               last_updated = NOW(),
               version = version + 1
           WHERE user_id = read_event.user_id
           AND version = current_interest.version
    if rows == 0:
        // Conflict — another update raced; re-read and retry
        retry()

    // Invalidate feed cache
    redis_del('feed:' + read_event.user_id)
    kafka_publish('interest-updated', key=read_event.user_id, value={user_id, timestamp})

8.5 — Trending Score Pipeline #

// Flink job: trending-pipeline
// Input: 'read-events' Kafka topic
// Window: SlidingEventTimeWindows(1 hour, 30 seconds)
// Output: Redis sorted set per window

function trending_flink_window(article_id, window_events):
    // Count-Min Sketch (width=2048, depth=5)
    sketch = CountMinSketch(width=2048, depth=5)
    for event in window_events:
        if not dedup_state.contains(event.event_id):
            sketch.increment(event.article_id)
            dedup_state.add(event.event_id)

    count = sketch.estimate(article_id)

    window_key = 'trending:' + floor(window.end / 30) * 30
    redis_zadd(window_key, count, article_id)
    redis_expire(window_key, 120)  // 2x slide interval TTL

// API: GET /trending?limit=50
function get_trending(limit):
    window_key = 'trending:' + current_window_epoch()
    results = redis_zrevrange(window_key, 0, limit-1, withscores=True)
    return results

8.6 — Personalized Feed Generation #

// Feed service: GET /feed/{user_id}
function get_feed(user_id, limit=50):
    cache_key = 'feed:' + user_id
    cached = redis_get(cache_key)
    if cached:
        return cached

    // Cache miss — recompute
    interest = fetch_user_interest(user_id)
    // interest.vector = {topic: weight, ...}

    // Candidate retrieval: fetch recent articles from top-interest categories
    top_topics = top_k(interest.vector, k=5)
    candidate_articles = []
    for topic in top_topics:
        // Query Elasticsearch for recent articles in this topic
        results = es_search({
            filter: {category: topic, published_at: {gte: NOW() - 48h}},
            sort: [{published_at: desc}],
            size: 100
        })
        candidate_articles.extend(results)

    // Deduplicate candidates by article_id
    candidates = deduplicate(candidate_articles, key='article_id')

    // Score each candidate: dot product of interest vector with article topic vector
    scored = []
    for article in candidates:
        relevance = dot_product(interest.vector, article.topic_vector)
        recency_decay = exp(-0.5 * hours_since(article.published_at))
        trending_boost = get_trending_score(article.article_id) / 1000.0
        score = relevance * recency_decay + 0.1 * trending_boost
        scored.append((score, article))

    // Sort by score, take top limit
    feed = sort_desc(scored, key=score)[:limit]

    // Cache result
    redis_setex(cache_key, 300, serialize(feed))  // 5-minute TTL

    return feed

// Background job: triggered by 'interest-updated' Kafka events
function proactive_feed_rebuild(user_id):
    redis_del('feed:' + user_id)  // Invalidate cache
    get_feed(user_id)              // Rebuild and warm cache

8.7 — Search Index (CDC Pipeline) #

// Debezium CDC configuration:
// Source: Postgres article table
// Sink: Kafka topic 'article-changes'
// Event types: INSERT, UPDATE (only for metadata fields)

// Elasticsearch sink consumer
function es_indexer(article_change_event):
    if article_change_event.op == 'INSERT':
        es_index({
            index: 'articles',
            id: article_change_event.url,
            body: {
                article_id: article_change_event.article_id,
                title: article_change_event.title,
                body_excerpt: article_change_event.body[:500],
                category_labels: article_change_event.category_labels,
                published_at: article_change_event.published_at,
                source_id: article_change_event.source_id,
                cluster_id: article_change_event.cluster_id
            }
        })
    elif article_change_event.op == 'UPDATE':
        // Only index metadata fields — content is immutable
        es_update({
            index: 'articles',
            id: article_change_event.url,
            doc: {
                category_labels: article_change_event.category_labels,
                cluster_id: article_change_event.cluster_id,
                engagement_count: article_change_event.engagement_count
            }
        })

// On failure: publish to dead-letter topic, retry with exponential backoff
// On persistent failure: alert on-call, manual reindex via:
//   reindex_from_postgres(article_id_range)
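
A connector-registration sketch posted to the Kafka Connect REST API (property names assume the Debezium 2.x Postgres connector; hostnames, credentials, and the endpoint are placeholders):

import requests

connector = {
    "name": "article-cdc",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",
        "database.hostname": "postgres-primary.internal",   # placeholder
        "database.port": "5432",
        "database.dbname": "news",
        "database.user": "debezium",
        "database.password": "change-me",                   # placeholder
        "table.include.list": "public.article",
        # Default change-event topic is {topic.prefix}.public.article; a RegexRouter
        # transform would rename it to 'article-changes' as used in this document.
        "topic.prefix": "newsagg",
        "tombstones.on.delete": "false",
    },
}

resp = requests.post("http://kafka-connect.internal:8083/connectors",  # placeholder endpoint
                     json=connector, timeout=10)
resp.raise_for_status()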

Step 9 — Logical Data Model #

Schema with partition keys derived from invariant scope.

Core Tables (Postgres) #

-- Source: registration of crawlable feeds
CREATE TABLE source (
    source_id       UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    url             TEXT NOT NULL UNIQUE,
    feed_type       TEXT NOT NULL CHECK (feed_type IN ('RSS', 'ATOM', 'API', 'SCRAPE')),
    crawl_interval  INTERVAL NOT NULL DEFAULT '15 minutes',
    last_crawled_at TIMESTAMPTZ,
    status          TEXT NOT NULL DEFAULT 'ACTIVE' CHECK (status IN ('ACTIVE', 'PAUSED', 'DEAD')),
    created_at      TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- CrawlSchedule: intent/future constraint for when to next crawl
CREATE TABLE crawl_schedule (
    source_id       UUID PRIMARY KEY REFERENCES source(source_id),
    next_crawl_at   TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    status          TEXT NOT NULL DEFAULT 'PENDING' CHECK (status IN ('PENDING', 'IN_FLIGHT')),
    last_triggered_at TIMESTAMPTZ
);
CREATE INDEX idx_crawl_schedule_next ON crawl_schedule (next_crawl_at) WHERE status = 'PENDING';

-- StoryCluster: near-duplicate story groups
CREATE TABLE story_cluster (
    cluster_id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    canonical_article_id UUID,  -- FK set after Article insert
    article_count       INT NOT NULL DEFAULT 1,
    created_at          TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at          TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- Article: primary entity, normalized from raw fetches
CREATE TABLE article (
    article_id      UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    url             TEXT NOT NULL UNIQUE,          -- natural dedup key, unique index
    title           TEXT NOT NULL,
    body            TEXT NOT NULL,
    content_hash    BYTEA NOT NULL,                -- sha256(title + body)
    published_at    TIMESTAMPTZ NOT NULL,
    source_id       UUID NOT NULL REFERENCES source(source_id),
    ingested_at     TIMESTAMPTZ NOT NULL DEFAULT NOW(),

    -- Mutable metadata (set by downstream pipelines)
    category_labels JSONB,                         -- [{category, confidence}]
    topic_vector    JSONB,                         -- {topic: weight, ...} normalized
    cluster_id      UUID REFERENCES story_cluster(cluster_id),
    engagement_count BIGINT NOT NULL DEFAULT 0,
    classified_at   TIMESTAMPTZ,
    clustered_at    TIMESTAMPTZ
) PARTITION BY RANGE (ingested_at);
-- Caveat: Postgres requires PRIMARY KEY / UNIQUE constraints on a partitioned table to
-- include the partition key, so the article_id PK and the url unique constraint above
-- must be made composite with ingested_at (or article left unpartitioned) for this DDL
-- to be accepted as written.

-- Partition by month for retention management
CREATE TABLE article_2025_01 PARTITION OF article FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');
-- ... create partitions monthly via automation

CREATE INDEX idx_article_source_published ON article (source_id, published_at DESC);
CREATE INDEX idx_article_category_published ON article USING GIN (category_labels) WHERE classified_at IS NOT NULL;
CREATE INDEX idx_article_cluster ON article (cluster_id);
CREATE INDEX idx_article_ingested ON article (ingested_at DESC);

-- ReadEvent: immutable event log
CREATE TABLE read_event (
    event_id        BYTEA PRIMARY KEY,             -- sha256(user_id + article_id + session_id)
    user_id         UUID NOT NULL,
    article_id      UUID NOT NULL REFERENCES article(article_id),
    session_id      UUID NOT NULL,
    timestamp       TIMESTAMPTZ NOT NULL DEFAULT NOW()
) PARTITION BY RANGE (timestamp);
-- Same caveat: with partitioning on timestamp, the event_id PK and the dedup unique
-- index below must include timestamp to be declarable on the partitioned parent;
-- otherwise the (user_id, article_id, session_id) idempotency check has to be enforced
-- per partition or upstream of the insert.

CREATE UNIQUE INDEX idx_read_event_dedup ON read_event (user_id, article_id, session_id);
CREATE INDEX idx_read_event_user ON read_event (user_id, timestamp DESC);
CREATE INDEX idx_read_event_article ON read_event (article_id, timestamp DESC);

-- UserInterest: current interest vector per user
CREATE TABLE user_interest (
    user_id         UUID PRIMARY KEY,
    interest_vector JSONB NOT NULL DEFAULT '{}',   -- {topic: weight, ...} normalized
    last_updated    TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    version         BIGINT NOT NULL DEFAULT 0      -- for optimistic locking
);

Redis Key Schema (Derived Views) #

trending:{epoch_30s}        → SORTED SET  {article_id: score}  TTL=120s
feed:{user_id}              → STRING      serialized ranked article list  TTL=300s
crawl:in-flight:{fetch_id}  → STRING      '1'  TTL=600s
lsh:band:{band_idx}:{hash}  → SET         {article_id, ...}  TTL=604800s (7d)
minhash:sig:{article_id}    → STRING      128 hash values (binary)  TTL=604800s

Elasticsearch Index Schema #

{
  "mappings": {
    "properties": {
      "article_id":       { "type": "keyword" },
      "url":              { "type": "keyword" },
      "title":            { "type": "text", "analyzer": "english" },
      "body_excerpt":     { "type": "text", "analyzer": "english" },
      "category_labels":  { "type": "nested",
                            "properties": {
                              "category":   { "type": "keyword" },
                              "confidence": { "type": "float" }
                            }},
      "topic_vector":     { "type": "dense_vector", "dims": 128, "index": true, "similarity": "dot_product" },
      "published_at":     { "type": "date" },
      "source_id":        { "type": "keyword" },
      "cluster_id":       { "type": "keyword" },
      "engagement_count": { "type": "long" }
    }
  },
  "settings": {
    "number_of_shards": 12,
    "number_of_replicas": 1,
    "index.lifecycle.name": "articles-ilm-policy"
  }
}
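
Candidate retrieval for the feed (Step 8.6) can run against this mapping's dense_vector field using the top-level knn option of the Elasticsearch 8 _search API. A sketch over plain HTTP (host and vector values are placeholders; the query vector must match the 128-dimension, unit-normalized mapping above):

import requests

def knn_candidates(interest_vector, k: int = 100):
    """Fetch feed candidates whose topic_vector is closest (dot product) to the user's interest vector."""
    body = {
        "knn": {
            "field": "topic_vector",
            "query_vector": interest_vector,      # 128 floats, unit-normalized
            "k": k,
            "num_candidates": 10 * k,
            "filter": {"range": {"published_at": {"gte": "now-48h"}}},
        },
        "_source": ["article_id", "title", "category_labels", "published_at"],
        "size": k,
    }
    resp = requests.post("http://elasticsearch.internal:9200/articles/_search",  # placeholder host
                         json=body, timeout=5)
    resp.raise_for_status()
    return [hit["_source"] for hit in resp.json()["hits"]["hits"]]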

Step 10 — Technology Landscape #

| Layer | Technology | Rationale | Alternative Considered |
|---|---|---|---|
| Primary data store | PostgreSQL 16 | ACID transactions; FOR UPDATE SKIP LOCKED for scheduler; ON CONFLICT for dedup; range partitioning for ReadEvent; JSONB for vectors and labels | CockroachDB (more complex; not needed at this scale) |
| Article metadata read | PostgreSQL (same instance, read replicas) | Category/topic browsing is simple range queries; no need for a separate read store at initial scale | Cassandra (not needed; adds ops complexity) |
| Event stream / message bus | Apache Kafka (3 brokers, replication factor 3) | Durable log for ArticleFetch, ReadEvent, CrawlTrigger; at-least-once delivery with offset-based dedup; Kafka Streams or Flink as consumer | Pulsar (Kafka ecosystem is larger; Debezium support is mature) |
| Stream processing | Apache Flink (3 job managers, 6 task managers) | Sliding window for TrendingScore; stateful exactly-once for UserInterest update; rich windowing API; checkpointing to S3 | Kafka Streams (simpler but less powerful windowing; no cross-key state joins) |
| Cache / derived views | Redis 7 (cluster mode, 6 nodes) | TrendingScore sorted sets; feed cache; crawl in-flight dedup; LSH buckets; low-latency reads | Memcached (no sorted sets or other rich data structures) |
| Search index | Elasticsearch 8 (3 data nodes, 1 coordinating) | Full-text search over article title/body; KNN vector search for feed candidates; faceted filtering by category | OpenSearch (equivalent, but ES has better k-NN support in v8) |
| CDC pipeline | Debezium (Kafka Connect plugin) | Postgres WAL → Kafka → Elasticsearch; guaranteed delivery; handles schema changes | Custom triggers (fragile; Debezium is production-tested) |
| Raw content storage | Amazon S3 | Archival of raw HTML from crawls; S3 Lifecycle for cost management; extremely cheap at scale | GCS (equivalent; S3 chosen for ecosystem) |
| ML classification | Internal ML service (Python, FastAPI) | Topic/category classification via fine-tuned BERT or FastText; called synchronously from ClassifierWorker | AWS Comprehend (vendor lock-in; model not customizable) |
| ANN / vector search | Elasticsearch kNN (standalone FAISS as an option) | Approximate nearest neighbor for feed candidate retrieval using topic vectors | Milvus (standalone vector DB adds ops overhead; ES kNN sufficient) |
| Crawl scheduling | Postgres-backed scheduler (crawl_schedule table + daemon) | Simple, correct, no new ops dependency; FOR UPDATE SKIP LOCKED handles concurrency | Quartz Scheduler / Celery Beat (adds dependency; Postgres-native is simpler) |
| Service mesh / gateway | Nginx + API Gateway | Authentication (JWT); rate limiting; routing to microservices | Kong (viable alternative) |
| Container orchestration | Kubernetes (EKS) | Auto-scaling for CrawlWorkers, ClassifierWorkers, Feed service | Nomad (smaller ecosystem) |

Step 11 — Deployment Topology #

Logical Service Topology #

                         ┌──────────────────────────────────────────────┐
                         │              Clients (Browser / Mobile)       │
                         └─────────────────────┬────────────────────────┘
                                               │ HTTPS
                         ┌─────────────────────▼────────────────────────┐
                         │            API Gateway (Nginx + Auth)         │
                         │   Rate limiting, JWT validation, routing      │
                         └──┬────────────┬───────────────┬──────────────┘
                            │            │               │
               ┌────────────▼──┐  ┌──────▼──────┐  ┌───▼───────────┐
               │  Feed Service  │  │ Search Svc  │  │ Trending Svc  │
               │ (reads Redis,  │  │(queries ES) │  │ (reads Redis) │
               │  ES, Postgres) │  └──────┬──────┘  └───────────────┘
               └────────────┬──┘         │
                            │            │ ES query
              ┌─────────────▼──────────────────────────────────┐
              │                  Redis Cluster                   │
              │  feed:{user_id}, trending:{epoch}, LSH buckets  │
              └─────────────────────────────────────────────────┘

┌────────────────────────────── Write Path ──────────────────────────────────┐
│                                                                             │
│  ┌─────────────┐    ┌─────────────┐    ┌──────────────────────────────┐   │
│  │  Scheduler   │───▶│    Kafka    │───▶│  CrawlWorker Pool (k8s)      │   │
│  │  Daemon      │    │  crawl-     │    │  Fetches HTML, stores S3,    │   │
│  │  (Postgres   │    │  triggers   │    │  publishes to article-fetches │   │
│  │  FOR UPDATE  │    └─────────────┘    └──────────────────────────────┘   │
│  │  SKIP LOCKED)│                                    │                     │
│  └─────────────┘                       ┌─────────────▼──────────┐         │
│                                         │  ParseWorker Pool       │         │
│                                         │  (RSS/HTML parsing,     │         │
│                                         │  INSERT ON CONFLICT)    │         │
│                                         └─────────────┬──────────┘         │
│                                                       │ Postgres            │
│                              ┌────────────────────────▼────────────┐       │
│                              │      Postgres Primary (RDS)          │       │
│                              │  article, source, crawl_schedule,   │       │
│                              │  user_interest, read_event           │       │
│                              └──┬─────────────────────┬────────────┘       │
│                                 │ Debezium CDC         │ Read replicas (2)  │
│                     ┌───────────▼──┐          ┌────────▼──────────┐        │
│                     │  Kafka topic  │          │  Read Replica Pool │        │
│                     │  article-     │          │  (feed queries,    │        │
│                     │  changes      │          │   category browse) │        │
│                     └───────────┬──┘          └───────────────────┘        │
│                                 │                                           │
│                     ┌───────────▼──┐          ┌────────────────────┐       │
│                     │ Elasticsearch │          │  Flink Cluster     │       │
│                     │ Indexer       │          │  - trending-job    │       │
│                     │ (Kafka Connect│          │  - interest-update │       │
│                     │  Sink)        │          │  - cluster-job     │       │
│                     └───────────┬──┘          └────────────────────┘       │
│                                 │                                           │
│                     ┌───────────▼──────────────────────────────────┐       │
│                     │         Elasticsearch Cluster (3 nodes)       │       │
│                     └──────────────────────────────────────────────┘       │
└─────────────────────────────────────────────────────────────────────────────┘

Physical Deployment (AWS, Single Region Initial) #

| Component | Instance Type | Count | Notes |
|---|---|---|---|
| Postgres (RDS) | db.r6g.2xlarge | 1 primary + 2 replicas | Multi-AZ; 2TB gp3 |
| Kafka brokers | i3.xlarge | 3 | 3 AZs; 1.9TB NVMe |
| Flink (JobManager) | c6i.xlarge | 3 | HA; ZooKeeper for leader election |
| Flink (TaskManager) | r6i.2xlarge | 6 | 64GB RAM; RocksDB state backend |
| Redis (ElastiCache) | r6g.xlarge | 6 | Cluster mode; 3 shards × 2 replicas |
| Elasticsearch | r6g.2xlarge | 3 data + 1 coord | 500GB gp3 each |
| CrawlWorkers | c6i.large | 10-50 (auto-scaled) | HPA on queue depth |
| ParseWorkers | c6i.large | 5-20 (auto-scaled) | — |
| ClassifierWorkers | g4dn.xlarge (GPU) | 3-10 | GPU for BERT inference |
| ClusterWorkers | c6i.large | 3-8 | CPU-bound MinHash |
| Feed Service | c6i.large | 3-10 | Stateless; HPA on CPU |
| Search Service | c6i.large | 3-5 | Stateless |
| Debezium (Kafka Connect) | c6i.large | 3 | Distributed mode |
| S3 | — | — | Raw HTML archive |

Step 12 — Consistency Model #

Classification of consistency per read/write path.

| Path | Consistency Model | Justification |
|---|---|---|
| Article ingest (crawl → Postgres) | Linearizable (single primary) | URL uniqueness invariant requires atomic insert; Postgres primary provides linearizability |
| Article metadata update (classify/cluster → Postgres) | Read-your-writes (within pipeline partition) | Flink partitions by article_id; each article’s metadata updates are sequential |
| CrawlSchedule update (scheduler → Postgres) | Linearizable | FOR UPDATE SKIP LOCKED is pessimistic locking; serializable within transaction |
| ReadEvent recording (API → Postgres) | Linearizable (unique index enforces idempotency) | Postgres unique index on (user_id, article_id, session_id) |
| UserInterest update (Flink → Postgres) | Read-your-writes | Optimistic locking (version field); single Flink partition per user_id ensures no concurrent writers |
| SearchIndex (CDC via Debezium → Elasticsearch) | Eventual consistency (I-8: 60s staleness bound) | CDC is async; Elasticsearch is a derived projection |
| TrendingScore (Flink → Redis) | Eventual consistency (I-6: 30s staleness bound) | Stream aggregation is async; score updates every 30s |
| PersonalizedFeed (Redis cache) | Eventual consistency (I-7: 5-minute TTL) | Cache is a materialized derived view; stale on cache hit |
| Category/topic browse (Postgres read replica) | Eventual consistency (replication lag typically < 1s) | Read replica lag; acceptable for browsing |
| Trending read (Redis sorted set) | Eventual consistency (30s window slide) | Same as above |
| Search query (Elasticsearch) | Eventual consistency (60s CDC lag) | Derived projection |

Consistency anomalies explicitly accepted:

  1. A user reads an article and it does not immediately appear as read in their interest vector — up to 5 minutes staleness for feed personalization.
  2. A newly published article may not appear in search for up to 60 seconds.
  3. An article’s trending score may lag actual read volume by up to 30 seconds.
  4. Two concurrent requests to read the same article from different sessions may both record ReadEvents (idempotency key prevents double-counting only within the same session).
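Anomaly 4 hinges on the (user_id, article_id, session_id) unique index from the table above. Below is a minimal sketch of how that index makes ReadEvent recording idempotent; the table and column names (read_event, read_at) are assumptions for illustration, and conn is any DB-API connection.

```python
# Sketch only: idempotent ReadEvent insert backed by the unique index
# described in the consistency table. Column names are assumed.
CREATE_DEDUP_INDEX = """
CREATE UNIQUE INDEX IF NOT EXISTS read_event_dedup
    ON read_event (user_id, article_id, session_id);
"""

INSERT_READ_EVENT = """
INSERT INTO read_event (user_id, article_id, session_id, read_at)
VALUES (%s, %s, %s, NOW())
ON CONFLICT (user_id, article_id, session_id) DO NOTHING;
"""

def record_read(conn, user_id: str, article_id: str, session_id: str) -> bool:
    """Returns True for a new ReadEvent, False when the index absorbed a retry."""
    with conn.cursor() as cur:
        cur.execute(INSERT_READ_EVENT, (user_id, article_id, session_id))
        return cur.rowcount == 1   # 0 means ON CONFLICT swallowed a duplicate
```

A retried request within the same session becomes a no-op, while the same article read in a different session is counted again, which is exactly the behaviour described in anomaly 4.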

Step 13 — Scaling Model #

Scaling Dimensions #

| Dimension | Current Scale Target | Scaling Mechanism | Hotspot Risk |
|---|---|---|---|
| Sources (crawled feeds) | 100,000 sources | Horizontal: more CrawlWorkers; Kafka partition count scales with source count | Single high-frequency source (e.g., Reuters publishes 500 articles/hour) → shard by source_id |
| Articles ingested/day | 5 million articles/day | Postgres partitioned by month; ParseWorker pool auto-scaled | Viral event: all sources publish same story simultaneously → URL dedup absorbs burst |
| ReadEvents/second | 50,000 reads/sec peak | ReadEvent table partitioned by day; Kafka handles bursts; Flink auto-scales | Viral article: millions of reads/hour → Count-Min Sketch handles cardinality; Redis sorted set atomic |
| Users | 100 million users | UserInterest rows: 100M × 1KB = 100GB — fits in Postgres; feed cache in Redis cluster | Celebrity user (influencer): their read events are processed the same as any other user’s |
| Trending computation | 128 article_ids × 1-hour windows | Count-Min Sketch uses fixed memory regardless of article cardinality; Redis sorted set scales horizontally | Breaking news: single article gets 10M reads/hour → CMS handles gracefully, Redis ZADD is atomic |
| Search | 100M article index | Elasticsearch 12 shards; expand to 24 shards with reindex | Hot shard: recent articles searched more → ILM hot-warm-cold tier; recent index on hot nodes |
| Personalized feed | 100M users × 5 recomputes/day | Redis cluster holds 100M keys × 10KB = 1TB → scale cluster; proactive rebuild for active users only | Popular user (many active sessions): each session triggers recompute → debounce: only rebuild if cache is cold |
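Two rows above lean on the Count-Min Sketch to keep trending memory bounded during read spikes. The following is a minimal, illustrative sketch (not the production Flink operator), using the width and depth defaults that appear later in Step 16.

```python
# Illustrative Count-Min Sketch; width/depth mirror the Step 16 defaults (2048 × 5).
import hashlib

class CountMinSketch:
    def __init__(self, width: int = 2048, depth: int = 5):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, key: str):
        # One independent hash per row, derived by salting the key with the row index.
        for row in range(self.depth):
            h = hashlib.blake2b(f"{row}:{key}".encode(), digest_size=8).digest()
            yield row, int.from_bytes(h, "big") % self.width

    def add(self, key: str, count: int = 1):
        for row, col in self._buckets(key):
            self.table[row][col] += count

    def estimate(self, key: str) -> int:
        # Only ever over-counts; error shrinks as width grows.
        return min(self.table[row][col] for row, col in self._buckets(key))

# Usage: add(article_id) per ReadEvent; estimate(article_id) feeds the Redis ZADD
# that maintains the trending sorted set for the current window.
cms = CountMinSketch()
cms.add("article:123")
assert cms.estimate("article:123") >= 1
```

Because the structure is fixed-size, a viral article increases counts but never memory, which is why the hotspot column above can dismiss the 10M-reads/hour case.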

Sharding Strategy #

| Object | Shard Key | Why |
|---|---|---|
| Article | ingested_at month (range partition) | Even distribution; enables time-based retention; no hotspot (new articles spread across sources) |
| ReadEvent | timestamp day (range partition) | Even distribution; enables fast deletion of old partitions for retention |
| Kafka read-events | user_id (hash) | Ensures all events for one user are in the same partition → Flink task receives all events for a user |
| Kafka new-articles | url hash | Ensures classifier and cluster workers process each article exactly once |
| Redis trending | window_epoch | Different windows are different keys; no hotspot |
| Redis feed cache | user_id hash (Redis Cluster) | Even distribution across Redis shards |
| Elasticsearch | Round-robin (12 shards) | Articles are uniformly distributed; no natural clustering needed |
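The user_id hash on the read-events topic is what lets a single Flink subtask own a user’s entire event stream. A small illustration of that property follows; Kafka’s default partitioner actually uses murmur2 over the key bytes, but any stable hash gives the same guarantee, and the partition count here is an assumption.

```python
# Illustration of key-based partitioning: events with the same user_id always
# land in the same partition, so one consumer/Flink subtask sees the full stream.
import zlib

NUM_PARTITIONS = 64  # assumed partition count for the read-events topic

def partition_for(user_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    return zlib.crc32(user_id.encode()) % num_partitions

events = [("u42", "article:1"), ("u42", "article:2"), ("u7", "article:1")]
assert partition_for("u42") == partition_for("u42")  # same user, same partition
```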

Auto-Scaling Triggers #

| Service | Scale-Out Trigger | Scale-In Trigger |
|---|---|---|
| CrawlWorkers | Kafka crawl-triggers consumer lag > 1000 messages | Lag < 100 for 5 minutes |
| ParseWorkers | Kafka article-fetches consumer lag > 500 messages | Lag < 50 for 5 minutes |
| ClassifierWorkers | Kafka new-articles consumer lag > 200 messages | Lag < 20 for 5 minutes |
| Feed Service | CPU > 70% | CPU < 30% for 10 minutes |
| Flink TaskManagers | Flink backpressure > 80% | Backpressure < 20% for 10 minutes |

Step 14 — Failure Model #

Failure Modes and Mitigations #

| Failure | Impact | Detection | Mitigation | Recovery |
|---|---|---|---|---|
| Postgres primary failure | Article ingest halts; ReadEvents queue in Kafka; feed reads degrade to cache | RDS health check; Postgres replication lag alert | RDS Multi-AZ automatic failover (30-90s); Kafka buffers events during failover | Resume consumption from Kafka after failover; no data loss (Kafka retained) |
| Kafka broker failure | Read events may queue; crawl triggers queue | Kafka broker health; consumer lag alert | Replication factor 3; one broker failure is transparent | Kafka auto-rebalances partitions to surviving brokers; no intervention |
| Flink job crash (trending) | TrendingScore stale for up to window slide period | Flink job health; trending Redis TTL expiry alert | Flink checkpointing every 30s to S3; auto-restart from checkpoint | Job restarts from checkpoint; replays from last Kafka offset; trending recovers within 30s |
| Flink job crash (interest update) | UserInterest stale; feed personalization degrades | Flink job health; consumer lag alert | Same as above; Kafka ReadEvents retained 7 days | Job restarts; replays from checkpoint offset; no data loss |
| Redis cluster failure | Feed cache miss → higher latency; trending unavailable | Redis health; cache hit rate alert | Redis ElastiCache cluster mode with 2 replicas per shard; automatic failover | Cache rebuilds on demand on cache miss; trending recovers when Flink re-populates |
| Elasticsearch failure | Search unavailable; feed candidate retrieval from ES fails | ES health; search error rate | 3-node ES with 1 replica per shard; coordinating node is stateless | ES auto-recovers; search unavailable during outage; feed service falls back to Postgres query |
| ML classifier service down | Articles not categorized; category browsing shows fewer results | Classifier health; Kafka new-articles consumer lag | ClassifierWorkers retry with exponential backoff; articles remain uncategorized until service recovers | Service recovers; Kafka backlog is processed; articles categorized with delay |
| CrawlWorker crash mid-fetch | ArticleFetch not published; source skipped for one interval | Crawl trigger timeout; next_crawl_at not updated for source | CrawlWorker uses Redis SETNX for in-flight dedup; if worker crashes, lock expires (TTL 600s); scheduler re-triggers | Scheduler re-triggers source after TTL expiry; at-most-one-interval gap in coverage |
| Debezium CDC lag | Elasticsearch stale beyond 60s SLO | CDC lag metric; ES vs Postgres article count difference | Debezium Kafka Connect with retries; dead-letter queue | Reprocess from dead-letter queue; or trigger full reindex for specific article_id range |
| S3 unavailable | Raw HTML archival fails; pipeline still works (S3 write is async) | S3 put error rate | CrawlWorker publishes to Kafka regardless of S3 success; S3 write is best-effort archive | Re-fetch from source if archival required; or accept loss of raw HTML for that fetch |
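The "CrawlWorker crash mid-fetch" row relies on a Redis SETNX-style lock with a TTL. A sketch with redis-py follows; the key format is an assumption, the 600s TTL comes from the table.

```python
# Sketch of the per-source in-flight lock used by CrawlWorkers.
# Key naming is illustrative; TTL follows the failure table above.
import redis

r = redis.Redis(host="localhost", port=6379)

def try_acquire_crawl_lock(source_id: str, ttl_seconds: int = 600) -> bool:
    # SET key value NX EX ttl: only the first worker gets True. If that worker
    # crashes, it never deletes the key, so the TTL releases the lock and the
    # scheduler re-triggers the source on a later tick.
    return bool(r.set(f"crawl:inflight:{source_id}", "1", nx=True, ex=ttl_seconds))

def release_crawl_lock(source_id: str) -> None:
    # On successful completion the worker may delete the key early; otherwise
    # it simply expires.
    r.delete(f"crawl:inflight:{source_id}")
```

Because the TTL is the only recovery mechanism after a crash, the worst case is the one-interval coverage gap stated in the table.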

Cascading Failure Analysis #

Scenario: Breaking news event — all 100K sources publish articles about the same story simultaneously

  1. CrawlWorkers fetch 100K articles in 1 hour window
  2. Kafka article-fetches topic receives 100K events — ParseWorkers scale out to handle burst
  3. Postgres receives 100K INSERT ON CONFLICT DO NOTHING — most are the same cluster (SimHash collision) → DB handles gracefully; conflicting URLs deduplicated at insert; 100K inserts → perhaps 5K unique articles (different sources)
  4. MinHash clustering pipeline is briefly overwhelmed → cluster assignment delayed; articles appear unclustered briefly
  5. ReadEvents spike (everyone reads breaking news) → TrendingScore updates in Redis within 30s → trending feed shows breaking news correctly
  6. Feed recompute for all users triggered → Redis cache invalidated at scale → Feed Service CPU spikes → HPA scales out
  7. Circuit breaker: if Postgres read replicas lag, Feed Service falls back to stale cache or returns partial results

Mitigation: Breaking news is the design-time scenario this system is built for. Key protections: URL dedup at Postgres level prevents duplicate storage; Count-Min Sketch handles high-cardinality trending without memory explosion; Redis cache absorbs read spike; Kafka buffers upstream bursts.


Step 15 — SLOs #

| SLO | Target | Measurement | Error Budget (30d) |
|---|---|---|---|
| Article ingest latency | P99 < 30s from publication to ingestion (for sources crawled every 15 min, article available within 15 min + 30s parse) | ingested_at - published_at for new articles | 0.1% of articles may exceed 30s |
| Feed load time | P99 < 500ms | API response time from request to first byte (with cache) | 0.1% of feed requests > 500ms |
| Feed load time (cache miss) | P99 < 3s | API response time on cache miss (full recompute) | 0.5% of recomputes > 3s |
| Search response time | P99 < 1s | Elasticsearch query + API response | 0.1% of searches > 1s |
| Trending freshness | 95% of articles in trending list within 30s of crossing engagement threshold | TrendingScore update lag | 5% of trending updates may lag |
| Personalized feed freshness | 99% of feeds reflect UserInterest as of ≤ 5 minutes ago | Cache TTL + interest update lag | 1% of feeds may be > 5min stale |
| Search index freshness | 99% of articles appear in search within 60s of ingestion | es_indexed_at - ingested_at | 1% of articles may lag > 60s |
| Availability (feed endpoint) | 99.9% uptime | HTTP 5xx rate < 0.1% | 43.2 minutes/month |
| Availability (search endpoint) | 99.5% uptime | HTTP 5xx rate < 0.5% | 3.6 hours/month |
| Crawl coverage | 99% of active sources crawled within 2× their crawl_interval | Sources where last_crawled_at < now() - 2 * crawl_interval | 1% may miss crawl window |
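The crawl coverage SLO maps directly onto the sources_overdue_ratio metric in Step 18. A sketch of the underlying query follows; the column names (last_crawled_at, crawl_interval, active) and their placement on the source table are assumptions consistent with the rest of this document.

```python
# Illustrative query behind sources_overdue_ratio, run by a monitoring job.
OVERDUE_RATIO_SQL = """
SELECT
  (COUNT(*) FILTER (WHERE last_crawled_at < NOW() - 2 * crawl_interval))::float
    / NULLIF(COUNT(*), 0) AS sources_overdue_ratio
FROM source
WHERE active;
"""
```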

Step 16 — Operational Parameters #

Tuneable Parameters #

| Parameter | Default | Range | Effect |
|---|---|---|---|
| crawl_interval (per source) | 15 minutes | 5 min – 24 hours | Frequency of source crawling; affects Postgres write load |
| min_crawl_interval (system floor) | 5 minutes | 1 min – 30 min | Prevents overloading any single source |
| MinHash bands | 16 | 8 – 32 | More bands = higher recall (fewer missed duplicates); more Redis memory |
| MinHash hashes per band | 8 | 4 – 16 | More hashes = higher precision (fewer false positives); slower computation |
| Jaccard similarity threshold | 0.85 | 0.7 – 0.95 | Lower = more aggressive clustering; higher = stricter near-dup detection |
| Trending window size | 1 hour | 15 min – 6 hours | Wider = smoother trending; narrower = more reactive |
| Trending slide interval | 30 seconds | 10 sec – 5 min | Smaller = fresher trending; more Redis writes |
| Count-Min Sketch width | 2048 | 512 – 8192 | Wider = lower error; more memory |
| Count-Min Sketch depth | 5 | 3 – 10 | Deeper = lower probability of error; more memory |
| Feed cache TTL | 300 seconds | 60 – 1800 sec | Shorter = fresher feed; more recompute load |
| UserInterest EMA learning rate (α) | 0.10 | 0.01 – 0.3 | Higher = faster adaptation to new interests; lower = more stable |
| UserInterest decay rate (γ) | 0.95 per day | 0.8 – 1.0 per day | Lower = faster forgetting of old interests; higher = longer memory |
| Feed candidate pool (articles per topic) | 100 | 20 – 500 | Larger pool = better ranking quality; more ES query load |
| Top-K topics for feed candidates | 5 | 2 – 10 | More topics = broader feed; fewer = more focused |
| Kafka ReadEvent retention | 7 days | 1 – 30 days | Longer = more replay capacity for Flink recovery; more storage cost |
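The α and γ rows above combine in the UserInterest update; the exact rule is specified with the algorithms in Step 8. The following is only a minimal sketch of how an EMA with daily decay typically behaves, with the topic vector simplified to a plain list.

```python
# Illustrative only: interaction of the EMA learning rate (α) and per-day decay (γ).
# The production update runs inside the Flink interest-update pipeline.
ALPHA = 0.10   # default EMA learning rate from the table above
GAMMA = 0.95   # default per-day decay from the table above

def update_interest(interest: list[float], article_topics: list[float],
                    days_since_last_update: float) -> list[float]:
    decay = GAMMA ** days_since_last_update
    return [(1 - ALPHA) * decay * old + ALPHA * new
            for old, new in zip(interest, article_topics)]
```

Read this as: a topic the user stops reading loses roughly 5% of its weight per idle day, while each new read pulls the vector 10% toward the article’s topic distribution.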

Circuit Breaker Configuration #

| Endpoint | Failure Threshold | Timeout | Half-Open Probe Interval |
|---|---|---|---|
| External source HTTP fetch | 5 failures in 60s | 30s per request | 60s |
| Postgres write (article insert) | 3 failures in 10s | 5s | 30s |
| Elasticsearch query (search/feed) | 5 failures in 30s | 3s | 15s |
| Redis read (feed cache) | 10 failures in 10s | 100ms | 10s |
| ML classifier service | 3 failures in 30s | 10s | 60s |
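A minimal sketch of a breaker matching the table’s knobs (failure threshold within a rolling window, half-open probe interval; the per-request timeout is enforced by the caller). Illustrative only; a production service would typically use an existing resilience library.

```python
# Illustrative circuit breaker: trips after N failures inside a rolling window,
# then allows half-open probes once the probe interval has elapsed.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int, window_s: float, probe_interval_s: float):
        self.failure_threshold = failure_threshold
        self.window_s = window_s
        self.probe_interval_s = probe_interval_s
        self.failures: list[float] = []      # timestamps of recent failures
        self.opened_at: float | None = None  # None = closed

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: allow a probe once the probe interval has elapsed.
        return time.monotonic() - self.opened_at >= self.probe_interval_s

    def record_success(self) -> None:
        self.failures.clear()
        self.opened_at = None

    def record_failure(self) -> None:
        now = time.monotonic()
        self.failures = [t for t in self.failures if now - t <= self.window_s]
        self.failures.append(now)
        if self.opened_at is not None or len(self.failures) >= self.failure_threshold:
            self.opened_at = now  # trip, or re-open after a failed half-open probe

# Example: the Elasticsearch query breaker from the table (5 failures in 30s, 15s probe).
es_breaker = CircuitBreaker(failure_threshold=5, window_s=30, probe_interval_s=15)
```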

Step 17 — Runbooks #

Runbook 1: Feed Latency Degradation (P99 > 500ms) #

Trigger: Alert on feed_api_latency_p99 > 500ms for 5 consecutive minutes.

Diagnosis steps:

  1. Check Redis feed cache hit rate: redis-cli info stats | grep keyspace_hits. If hit rate < 80%, cache is being evicted or TTL is too short.
  2. Check Feed Service CPU: kubectl top pods -l app=feed-service. If > 80%, scale out: kubectl scale deployment feed-service --replicas=10.
  3. Check Elasticsearch query latency: Kibana → Monitoring → Elasticsearch → Search latency. If P99 > 500ms, check for hot shards.
  4. Check Postgres read replica lag: CloudWatch → RDS → ReplicaLag. If > 5s, feed service is reading stale data; acceptable per SLO but investigate replication stall.
  5. Check UserInterest update lag: Kafka consumer group lag for interest-update-pipeline. If > 10,000 messages, Flink job may be struggling.

Remediation:

  • Redis eviction: Increase Redis cluster memory or reduce non-feed TTLs. Short-term: CONFIG SET maxmemory-policy allkeys-lru.
  • Feed Service overloaded: Scale out pods. If scaling doesn’t help, check for slow ES queries (see step 3).
  • Hot ES shards: Force-merge recent index or move hot shard to dedicated node.
  • Flink lag: Check Flink UI for backpressure. Scale TaskManagers if needed.

Runbook 2: Crawl Coverage Drop (sources missing crawl window) #

Trigger: Alert on sources_overdue_ratio > 0.01 (more than 1% of sources haven’t been crawled within 2× their interval).

Diagnosis steps:

  1. Check scheduler daemon health: kubectl get pods -l app=scheduler-daemon. If pod is not running, restart: kubectl rollout restart deployment scheduler-daemon.
  2. Check Kafka crawl-triggers consumer lag: if lag > 10,000, CrawlWorkers are overwhelmed.
  3. Check CrawlWorker pod count: kubectl get pods -l app=crawl-worker. Auto-scaling should have triggered.
  4. Check for circuit-breaker trips on external sources: look for high crawl.fetch_error metric rate. May indicate network issue or source outage.
  5. Check Postgres crawl_schedule table for IN_FLIGHT rows with old last_triggered_at (> 10 minutes): these are stuck crawls whose workers crashed without completing.

Remediation:

  • Stuck IN_FLIGHT rows: UPDATE crawl_schedule SET status = 'PENDING' WHERE status = 'IN_FLIGHT' AND last_triggered_at < NOW() - INTERVAL '10 minutes'; This re-queues them for the next scheduler tick.
  • CrawlWorker overload: scale out with kubectl scale deployment crawl-worker --replicas=50.
  • Network issue: check AWS VPC flow logs; check if specific source domains are down.

Runbook 3: Elasticsearch Index Divergence (search misses recent articles) #

Trigger: Alert on es_index_lag_seconds > 120 (CDC pipeline is more than 2× SLO behind).

Diagnosis steps:

  1. Check Debezium Kafka Connect status: curl http://kafka-connect:8083/connectors/postgres-source/status.
  2. Check Kafka article-changes topic consumer lag.
  3. Check Elasticsearch indexer Kafka Connect sink status.
  4. Check Elasticsearch cluster health: curl http://es:9200/_cluster/health.

Remediation:

  • Debezium connector failure: curl -X POST http://kafka-connect:8083/connectors/postgres-source/restart.
  • Elasticsearch cluster RED/YELLOW: check unassigned shards: curl http://es:9200/_cat/shards?v | grep UNASSIGNED. Force allocate if needed.
  • If lag is persistent (> 1 hour): trigger partial reindex for recently ingested articles:
    reindex_articles_since(timestamp=NOW() - INTERVAL '2 hours')
    

Runbook 4: Trending Data Missing (trending Redis key absent) #

Trigger: Alert on trending_redis_key_missing — the expected Redis key trending:{current_epoch} does not exist.

Diagnosis steps:

  1. Check Flink trending-job status in Flink UI.
  2. Check Kafka read-events consumer lag for trending-job consumer group.
  3. Check Redis cluster health.

Remediation:

  • Flink job down: flink run -d trending-job.jar (or trigger via Kubernetes job restart).
  • Redis cluster issue: check ElastiCache events in AWS console.
  • Cold start (no ReadEvents in past hour): this is expected during low-traffic periods; trending returns empty list — acceptable.

Step 18 — Observability #

Metrics #

| Metric | Type | Labels | Alert Threshold |
|---|---|---|---|
| articles_ingested_total | Counter | source_id, status (new/duplicate) | — |
| article_parse_latency_seconds | Histogram | feed_type | P99 > 5s |
| crawl_fetch_error_rate | Gauge | source_id | > 10% over 5 min |
| sources_overdue_ratio | Gauge | — | > 0.01 |
| kafka_consumer_lag | Gauge | consumer_group, topic | > 10,000 |
| feed_api_latency_seconds | Histogram | cache_status (hit/miss) | P99 > 500ms (hit), P99 > 3s (miss) |
| feed_cache_hit_rate | Gauge | — | < 80% |
| es_index_lag_seconds | Gauge | — | > 120s |
| trending_score_freshness_seconds | Gauge | — | > 60s |
| user_interest_update_lag_seconds | Gauge | — | > 300s |
| minhash_cluster_assignments_total | Counter | action (new_cluster/joined_cluster) | — |
| read_event_dedup_rate | Gauge | — | > 5% (indicates client retry loop) |
| flink_job_restarts_total | Counter | job_name | > 3 in 1 hour |
| circuit_breaker_open_total | Counter | service | > 0 per minute |
| classifier_latency_seconds | Histogram | — | P99 > 10s |
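A sketch of how a few of these metrics might be declared with prometheus_client; metric names and label sets follow the table, while the bucket boundaries are assumptions.

```python
# Illustrative metric declarations and usage for the Feed Service / ParseWorkers.
from prometheus_client import Counter, Gauge, Histogram

ARTICLES_INGESTED = Counter(
    "articles_ingested_total", "Articles ingested", ["source_id", "status"])
FEED_LATENCY = Histogram(
    "feed_api_latency_seconds", "Feed API latency", ["cache_status"],
    buckets=[0.05, 0.1, 0.25, 0.5, 1, 3, 5])
ES_INDEX_LAG = Gauge(
    "es_index_lag_seconds", "Postgres-to-Elasticsearch CDC lag")

# Example usage:
with FEED_LATENCY.labels(cache_status="hit").time():
    pass  # serve the feed from the Redis cache here
ARTICLES_INGESTED.labels(source_id="src-1", status="new").inc()
ES_INDEX_LAG.set(12.0)
```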

Traces #

Distributed tracing (OpenTelemetry) spans:

  • feed.request → spans: redis.get(feed_cache), es.search(candidates), postgres.read(user_interest), rank.compute, redis.set(feed_cache)
  • article.ingest → spans: kafka.consume(article-fetches), parse.html, postgres.insert(article), kafka.publish(new-articles)
  • crawl.trigger → spans: postgres.query(crawl_schedule), kafka.publish(crawl-trigger), postgres.update(next_crawl_at)
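A sketch of the feed.request trace using the OpenTelemetry Python API; span names follow the list above, the attribute name and placeholder bodies are assumptions, and exporter/provider setup is omitted.

```python
# Illustrative trace structure for feed.request (no exporter configured here).
from opentelemetry import trace

tracer = trace.get_tracer("feed-service")

def serve_feed(user_id: str):
    with tracer.start_as_current_span("feed.request") as span:
        span.set_attribute("user_id", user_id)
        with tracer.start_as_current_span("redis.get(feed_cache)"):
            cached = None  # placeholder: look up feed:{user_id} in Redis
        if cached is None:
            with tracer.start_as_current_span("es.search(candidates)"):
                pass  # placeholder: KNN candidate retrieval
            with tracer.start_as_current_span("postgres.read(user_interest)"):
                pass  # placeholder: load interest vector
            with tracer.start_as_current_span("rank.compute"):
                pass  # placeholder: score candidates against UserInterest
            with tracer.start_as_current_span("redis.set(feed_cache)"):
                pass  # placeholder: write back with the feed cache TTL
```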

Logs #

Structured JSON logs with fields: service, trace_id, span_id, level, msg, article_id, user_id, source_id, duration_ms.

Key log events:

  • article.ingested — every new article (not duplicate): {article_id, url, source_id, published_at, ingested_at}
  • article.duplicate — URL conflict on insert: {url, existing_article_id}
  • cluster.assigned — article assigned to cluster: {article_id, cluster_id, action, jaccard_similarity}
  • interest.updated — user interest vector updated: {user_id, top_topics, version}
  • feed.cache_miss — feed recomputed: {user_id, candidate_count, rank_duration_ms}
  • crawl.error — source fetch failed: {source_id, url, http_status, error}
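A minimal example of emitting one of these events (article.ingested) with the shared fields; the logging setup is illustrative rather than the services’ actual configuration.

```python
# Illustrative structured log emission for the article.ingested event.
import json, logging

log = logging.getLogger("parse-worker")

def log_article_ingested(trace_id, article_id, url, source_id, published_at, ingested_at):
    log.info(json.dumps({
        "service": "parse-worker", "trace_id": trace_id, "level": "info",
        "msg": "article.ingested", "article_id": article_id, "url": url,
        "source_id": source_id, "published_at": published_at,
        "ingested_at": ingested_at,
    }))
```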

Dashboards #

  1. Ingestion Health: articles/hour by source, parse error rate, duplicate rate, cluster assignment rate
  2. Pipeline Lag: Kafka consumer lag per consumer group, Flink checkpoint duration, CDC lag to Elasticsearch
  3. User Experience: feed P50/P95/P99 latency (cache hit vs miss), search latency, trending score freshness
  4. Crawl Coverage: sources overdue, crawl error rate by domain, CrawlWorker queue depth
  5. Content Quality: articles per cluster (clustering effectiveness), classification confidence distribution

Step 19 — Cost Model #

Monthly cost estimate at 100K sources, 5M articles/day, 50K reads/second peak.

| Component | Sizing | Monthly Cost (USD) | Notes |
|---|---|---|---|
| RDS Postgres (primary + 2 replicas) | db.r6g.2xlarge × 3 | ~$1,800 | 2TB gp3 storage; Multi-AZ |
| Kafka (MSK) | 3 × i3.xlarge | ~$900 | Managed Kafka; 3 AZ |
| Flink (EC2) | 3 JobManager (c6i.xl) + 6 TaskManager (r6i.2xl) | ~$2,400 | Spot for TaskManagers saves ~60% |
| Redis (ElastiCache) | 6 × r6g.xlarge (cluster) | ~$1,800 | — |
| Elasticsearch | 3 × r6g.2xlarge + 1 coord | ~$1,600 | 500GB gp3 each |
| CrawlWorkers (EC2 Spot) | avg 20 × c6i.large | ~$400 | Spot pricing |
| ParseWorkers (EC2 Spot) | avg 10 × c6i.large | ~$200 | — |
| ClassifierWorkers (GPU Spot) | avg 5 × g4dn.xlarge | ~$700 | GPU Spot |
| ClusterWorkers (EC2 Spot) | avg 5 × c6i.large | ~$100 | — |
| Feed/Search Services (EC2) | avg 6 × c6i.large | ~$360 | — |
| S3 (raw HTML archive) | 5M articles × 100KB avg × 30 days = 15TB/month | ~$340 | S3 Standard → Glacier after 7 days (~$150 effective) |
| Data transfer (outbound) | 100M users × 50 requests/month × 10KB = 50TB | ~$4,500 | CloudFront CDN reduces this by ~80% → ~$900 effective |
| CloudFront CDN | 50TB outbound × 80% cached | ~$600 | Reduces origin load and egress |
| Kubernetes (EKS) | Cluster overhead | ~$150 | Control plane |
| Debezium/Kafka Connect | 3 × c6i.large | ~$180 | — |
| Total (approximate) | — | ~$12,000 – $15,000/month | Spot discounts and CDN caching bring effective cost down |
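Two of the sizing figures above are pure data-volume arithmetic and easy to sanity-check; a short sketch of that math follows (unit prices deliberately omitted).

```python
# Back-of-envelope check of the S3 archive and outbound-transfer rows.
ARTICLES_PER_DAY = 5_000_000
AVG_RAW_HTML_KB = 100
raw_html_tb_per_month = ARTICLES_PER_DAY * AVG_RAW_HTML_KB * 30 / 1e9   # ≈ 15 TB

USERS = 100_000_000
REQUESTS_PER_USER_MONTH = 50
AVG_RESPONSE_KB = 10
egress_tb_per_month = USERS * REQUESTS_PER_USER_MONTH * AVG_RESPONSE_KB / 1e9  # ≈ 50 TB
```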

Cost optimization levers:

  1. Use Spot instances for all stateless workers (CrawlWorkers, ParseWorkers, ClassifierWorkers) — 60-70% savings
  2. S3 Lifecycle: raw HTML to Glacier after 7 days — reduces S3 cost by ~75%
  3. Flink TaskManagers on Spot with checkpointing to S3 — job restarts on Spot interruption (acceptable; checkpoint recovery < 30s)
  4. CloudFront for feed API responses: trending and category feeds are cacheable for 30s at edge — reduces origin load by 80%
  5. Postgres: use RDS Reserved Instances (1-year) — 40% discount → ~$700/month for DB

Step 20 — Evolution Stages #

Stage 1: Foundation (0 → 1M users, 10K sources) #

What is built:

  • Postgres for all primary state (article, source, crawl_schedule, read_event, user_interest)
  • Kafka with 3 topics: crawl-triggers, article-fetches, read-events
  • CrawlWorker pool + ParseWorker pool (manual scaling, 5 workers each)
  • Simple rule-based classifier (keyword matching, no ML) for category labels
  • No near-duplicate clustering (each URL is its own “cluster”)
  • Feed = most recent articles in user’s followed categories (no ML ranking)
  • Trending = COUNT(read_events in past 1h) per article, computed by cron job every 5 minutes (see the SQL sketch after this list)
  • Elasticsearch not yet deployed — search is a Postgres full-text query (tsvector)
  • No Redis — feed is computed on every request from Postgres
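A sketch of the Stage 1 trending query mentioned in the list above; the table and column names (read_event, read_at) and the result limit are assumptions consistent with the rest of the document.

```python
# Illustrative Stage 1 trending query: plain SQL run by a cron job, no Flink or Redis yet.
STAGE1_TRENDING_SQL = """
SELECT article_id, COUNT(*) AS reads_last_hour
FROM read_event
WHERE read_at > NOW() - INTERVAL '1 hour'
GROUP BY article_id
ORDER BY reads_last_hour DESC
LIMIT 100;
"""
```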

Compromises accepted: No personalization; no near-dup clustering; search is slow; trending is coarse.

Migration path to Stage 2: Add Redis for feed cache when P99 feed latency exceeds 1 second; add Flink when cron job trending lag becomes user-visible.


Stage 2: Personalization and Trending (1M → 10M users) #

What changes:

  • Add Flink (2 TaskManagers) for sliding-window trending score
  • Add Redis (3-node cluster) for trending sorted set and feed cache
  • Add ML classifier (FastText, CPU-only, 3 workers)
  • Add UserInterest (Flink EMA update pipeline)
  • PersonalizedFeed: top-K from Postgres by topic score dot product (no ES yet)
  • ReadEvent deduplication implemented (unique index + session_id)
  • CrawlSchedule eviction: FOR UPDATE SKIP LOCKED scheduler pattern implemented

New invariants enforced: I-5 (ReadEvent idempotency), I-6 (TrendingScore freshness), I-7 (feed personalization)


Stage 3: Search and Story Clustering (10M → 50M users, 100K sources) #

What changes:

  • Deploy Elasticsearch (3 nodes); Debezium CDC from Postgres
  • Replace Postgres full-text search with Elasticsearch full-text + dense vector KNN
  • Implement MinHash + LSH story clustering (ClusterWorker pool)
  • StoryCluster table added; Article.cluster_id populated
  • Feed candidate retrieval switches from Postgres to Elasticsearch KNN
  • PersonalizedFeed now uses topic_vector (128-dim) instead of category labels
  • Add Kafka new-articles and classified-articles topics
  • Flink interest-update pipeline achieves exactly-once semantics

New invariants enforced: I-4 (near-dedup clustering), I-8 (search freshness)


Stage 4: Multi-Region (50M → 200M users) #

What changes:

  • Deploy read-region in EU and APAC (Postgres read replicas + Redis + ES)
  • Feed Service reads from local Redis/ES/Postgres replica
  • Kafka MirrorMaker 2 for cross-region ReadEvent replication
  • UserInterest updates go to primary region; replicated asynchronously
  • Trending is computed per-region (regional trends) + globally (global trending)
  • CrawlWorkers deployed per region; each region crawls a partition of sources (avoid duplicate crawls via source → region assignment in crawl_schedule)

New invariant (cross-region): UserInterest may be stale by replication lag (typically < 1s); this is accepted as eventual consistency.


Stage 5: Advanced ML and Publisher Ecosystem (200M+ users) #

What changes:

  • Replace FastText classifier with fine-tuned BERT (GPU inference, batched)
  • Add publisher API: verified publishers can submit articles directly (no crawl needed) via REST API; articles are inserted into Postgres directly, bypassing CrawlWorker/ParseWorker
  • Add user feedback loop: explicit signals (save, share, dislike) are incorporated into UserInterest with higher weight than implicit reads
  • Add A/B testing framework: PersonalizedFeed ranking function is parameterized; different ranking models tested in shadow mode
  • Story clustering graduates to transformer-based embeddings (SBERT) instead of MinHash for higher accuracy; LSH index replaced by FAISS approximate nearest neighbor
  • Breaking news detection: articles with engagement velocity > 3× baseline trending threshold trigger push notifications via a separate notification service

New objects: FeedbackEvent (explicit user signal — append-only, higher weight than ReadEvent), PublisherSubmission (stable entity — verified publishers bypass crawl pipeline), NotificationSubscription (intent/future constraint — user opts in to breaking news alerts).


End of document. All 20 steps have produced explicit output artifacts. The design is complete.
