Data Engineering Patterns to Archetypes Mapping #

This note maps recurring data engineering patterns to the system design archetypes already used in the cheat sheets. It is not a replacement for the existing notes. It is a bridge:

  • from data pipeline pattern language
  • to product/infra archetype language
  • to interview-useful mechanism families

The source inspiration here is data_engineering_desing_patterns.txt, especially its patterns around ingestion, error handling, idempotency, flow orchestration, storage layout, data quality, and observability.

How To Read This Note #

For each data engineering pattern, ask:

  1. what archetype family is this closest to?
  2. what mechanism family does it imply?
  3. where does it show up in interviews?

Use this note when an interview problem drifts into:

  • CDC
  • dedup
  • checkpointing
  • late data
  • compaction
  • orchestration
  • stream processing
  • data quality
  • observability

1. Ingestion Patterns #

Full Loader #

  • Closest archetype:
    • batch snapshot ingestion
    • append log + projection refresh
  • Mechanism family:
    • periodic scan
    • batch overwrite or batch publish
  • Interview relevance:
    • initial bootstrap
    • periodic backfill
    • full rebuild of search/index/projection

Incremental Loader #

  • Closest archetype:
    • append-only child object
    • projection maintenance
  • Mechanism family:
    • watermark or high-water-mark checkpoint
    • incremental fetch by time/key
  • Interview relevance:
    • sync APIs
    • changes feed
    • delta pipelines
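
A minimal sketch of the watermark mechanism above, assuming the source exposes a monotonically meaningful `updated_at` column. The `Row` type and `incremental_load` function are hypothetical names for illustration, not part of any library.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Row:
    id: int
    updated_at: int  # epoch seconds; hypothetical change-tracking column


def incremental_load(source: Iterable[Row], watermark: int) -> tuple[list[Row], int]:
    """Return rows changed after `watermark` plus the new watermark to persist."""
    fresh = sorted(
        (r for r in source if r.updated_at > watermark),
        key=lambda r: r.updated_at,
    )
    # Advance the watermark only as far as the data actually fetched.
    new_watermark = fresh[-1].updated_at if fresh else watermark
    return fresh, new_watermark
```

A strict `>` comparison can miss rows that commit later with a timestamp equal to the stored watermark. Real loaders often re-read a small overlap and rely on an idempotent write to absorb the repeats.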

Change Data Capture #

  • Closest archetype:
    • append log + consumer progress
    • projection / replication pipeline
  • Mechanism family:
    • WAL/binlog/stream reader
    • downstream event consumers
  • Interview relevance:
    • search indexing
    • webhooks
    • replication
    • materialized projections

Passthrough Replicator #

  • Closest archetype:
    • append log replication
    • control plane + local snapshot
  • Mechanism family:
    • source stream to replica sink
    • minimal transformation
  • Interview relevance:
    • cache warm pipelines
    • region replication
    • mirror topics / mirror stores

Transformation Replicator #

  • Closest archetype:
    • derived projection
    • ETL/ELT pipeline
  • Mechanism family:
    • read source events
    • transform
    • write derived sink
  • Interview relevance:
    • denormalized read models
    • analytics sidecars
    • search documents

Compactor #

  • Closest archetype:
    • append log + compaction
    • versioned namespace + immutable content units
    • projection maintenance
  • Mechanism family:
    • merge many fragments into fewer bigger units
    • materialize compacted state
  • Interview relevance:
    • log compaction
    • small-files problem
    • snapshotting
    • storage/read amplification tradeoffs
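
As a rough illustration of "materialize compacted state", the sketch below keeps only the latest value per key from an append log. Treating `None` as a delete tombstone is an assumption borrowed from common log-compaction designs, and `compact` is a hypothetical helper.

```python
def compact(log):
    """Collapse an append log of (key, value) pairs into latest-value state.

    A value of None is treated as a tombstone and removes the key.
    """
    latest = {}
    for key, value in log:
        latest[key] = value  # later entries overwrite earlier ones
    return {k: v for k, v in latest.items() if v is not None}
```

The compacted result is smaller to read than the full log, at the cost of losing intermediate history.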

Readiness Marker #

  • Closest archetype:
    • workflow / orchestration gate
    • future constraint / external dependency release
  • Mechanism family:
    • sentinel file
    • completion marker
    • downstream gate
  • Interview relevance:
    • batch pipeline sequencing
    • multi-step ETL
    • “when is data safe to consume?”
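
One way to sketch the sentinel-file mechanism: the producer writes the marker last, and consumers refuse to read until it exists. The `_SUCCESS` name and both function names are assumptions for illustration; any agreed marker works.

```python
from pathlib import Path

MARKER = "_SUCCESS"  # assumed marker name; producers and consumers must agree on it


def publish_partition(partition_dir: Path, files: dict[str, str]) -> None:
    """Write all data files first, then the completion marker."""
    partition_dir.mkdir(parents=True, exist_ok=True)
    for name, content in files.items():
        (partition_dir / name).write_text(content)
    (partition_dir / MARKER).touch()  # written last, so its presence implies the data is complete


def is_ready(partition_dir: Path) -> bool:
    """Downstream gate: only consume a partition whose marker exists."""
    return (partition_dir / MARKER).exists()
```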

External Trigger #

  • Closest archetype:
    • workflow / lifecycle
    • future constraint + claimable run
  • Mechanism family:
    • webhook/event-triggered job start
    • external scheduler signal
  • Interview relevance:
    • event-driven pipelines
    • orchestration kickoffs

2. Error Management Patterns #

Dead-Letter #

  • Closest archetype:
    • append log + side-failure sink
    • workflow / retry path
  • Mechanism family:
    • failed record quarantine
    • replay queue / poison record isolation
  • Interview relevance:
    • queue consumers
    • webhook delivery
    • ETL bad-record handling
    • stream processing fault isolation
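
The quarantine idea can be sketched as a loop that isolates poison records instead of failing the whole batch. `process_with_dlq` is a hypothetical helper, and the in-memory list stands in for a real dead-letter queue or table.

```python
import json


def process_with_dlq(raw_records, handler):
    """Apply `handler` to each record; failed records go to a dead-letter list."""
    ok, dead_letters = [], []
    for raw in raw_records:
        try:
            ok.append(handler(raw))
        except Exception as exc:  # broad on purpose: any poison record is isolated
            # Keep the original record and the error so it can be replayed later.
            dead_letters.append({"record": raw, "error": repr(exc)})
    return ok, dead_letters
```

For example, passing `json.loads` as the handler lets valid JSON through and parks malformed lines for later inspection or replay.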

Windowed Deduplicator #

  • Closest archetype:
    • append-only ingestion with idempotency window
    • relation/event dedup
  • Mechanism family:
    • key + time window dedup state
    • watermark-aware duplicate suppression
  • Interview relevance:
    • event ingestion
    • click/impression streams
    • CDC duplicate suppression
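
A small sketch of key plus time-window dedup state, assuming integer event times in seconds. The class and method names are hypothetical, and a production version would keep this state in a durable or framework-managed store.

```python
class WindowedDeduplicator:
    def __init__(self, window_seconds: int):
        self.window = window_seconds
        self.last_seen: dict[str, int] = {}  # key -> event time of last accepted occurrence

    def accept(self, key: str, event_time: int) -> bool:
        """Return True for a new event, False for a duplicate inside the window."""
        seen_at = self.last_seen.get(key)
        if seen_at is not None and abs(event_time - seen_at) < self.window:
            return False
        self.last_seen[key] = event_time
        return True

    def expire(self, watermark: int) -> None:
        """Drop state for keys whose window has closed relative to the watermark."""
        self.last_seen = {
            k: t for k, t in self.last_seen.items() if watermark - t < self.window
        }
```

The window bounds state size: duplicates arriving after the window closes are not caught, which is the tradeoff to call out in an interview.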

Late Data Detector #

  • Closest archetype:
    • stateful streaming projection
    • temporal workflow correction
  • Mechanism family:
    • watermarking
    • lateness threshold
    • side-output for late arrivals
  • Interview relevance:
    • stream windows
    • ranking/trending
    • analytics freshness correctness
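
The watermark and lateness-threshold mechanism can be sketched as follows. This assumes the watermark is simply the maximum event time seen minus an allowed lateness; real engines use more elaborate watermark strategies, and `LateDataDetector` is a hypothetical name.

```python
class LateDataDetector:
    def __init__(self, allowed_lateness: int):
        self.allowed_lateness = allowed_lateness
        self.max_event_time = None  # highest event time observed so far

    @property
    def watermark(self):
        """Events older than this are considered late; None until data arrives."""
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.allowed_lateness

    def route(self, event_time: int) -> str:
        """Return 'late' for the side output, otherwise 'on_time'."""
        wm = self.watermark
        if wm is not None and event_time < wm:
            return "late"
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        return "on_time"
```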

Static Late Data Integrator #

  • Closest archetype:
    • batch repair / backfill workflow
  • Mechanism family:
    • fixed replay windows
    • periodic backfill job
  • Interview relevance:
    • warehouse partition repair
    • scheduled recomputation

Dynamic Late Data Integrator #

  • Closest archetype:
    • repair workflow with discovered worklist
    • frontier-like backfill set
  • Mechanism family:
    • state table of impacted partitions/windows
    • targeted reprocessing
  • Interview relevance:
    • selective replay
    • replay only affected partitions

Filter Interceptor #

  • Closest archetype:
    • projection / observability side-output
  • Mechanism family:
    • classify-and-count filtered records
    • branch bad/good records with metrics
  • Interview relevance:
    • data quality reporting
    • ETL auditability

Checkpointer #

  • Closest archetype:
    • append log + consumer progress
    • claimable run progress state
  • Mechanism family:
    • offset/state persistence
    • restart from last durable progress
  • Interview relevance:
    • Kafka consumers
    • Spark/Flink jobs
    • crawler/scheduler progress tracking
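
A sketch of offset persistence with restart from the last durable progress, using a local JSON file as a stand-in for a real checkpoint store. The function names and file layout are assumptions for illustration.

```python
import json
import os
from pathlib import Path


def save_checkpoint(path: Path, offset: int) -> None:
    """Persist the next offset to process; write-then-rename avoids torn files."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps({"offset": offset}))
    os.replace(tmp, path)  # atomic replace on the same filesystem


def load_checkpoint(path: Path) -> int:
    """Return the next offset to process, or 0 when no checkpoint exists."""
    if not path.exists():
        return 0
    return json.loads(path.read_text())["offset"]


def consume(records, path: Path, handler) -> None:
    """Process records starting from the last durable checkpoint."""
    for offset in range(load_checkpoint(path), len(records)):
        handler(records[offset])
        save_checkpoint(path, offset + 1)  # checkpoint only after the handler succeeds
```

Because the checkpoint is written after the handler runs, a crash between the two steps replays one record. That is at-least-once delivery, so the handler should be idempotent.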

3. Idempotency Patterns #

Data Overwrite #

  • Closest archetype:
    • current-value entity overwrite
    • partition overwrite in warehouse systems
  • Mechanism family:
    • replace known bounded slice atomically
  • Interview relevance:
    • warehouse partition refresh
    • current-state materialized datasets

Merger #

  • Closest archetype:
    • current-value entity
    • versioned state merge
  • Mechanism family:
    • merge/upsert based on business key
  • Interview relevance:
    • slowly changing dimensions
    • idempotent ingest
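
Merge/upsert on a business key can be sketched with a dict standing in for the target table. The `merge` helper and the `id` key column are hypothetical; a warehouse would express the same idea with a MERGE or upsert statement.

```python
def merge(target: dict, batch: list, key: str = "id") -> dict:
    """Upsert each row into `target` by business key; replays leave state unchanged."""
    for row in batch:
        existing = target.get(row[key], {})
        target[row[key]] = {**existing, **row}  # new fields win, untouched fields survive
    return target
```

Applying the same batch twice produces the same target, which is what makes the pattern safe under retries.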

Stateful Merger #

  • Closest archetype:
    • workflow/projection with accumulated state
  • Mechanism family:
    • key-state merge
    • previous state consulted during write
  • Interview relevance:
    • streaming upserts
    • incremental aggregates

Keyed Idempotency #

  • Closest archetype:
    • critical transaction process
    • append ingestion dedup
  • Mechanism family:
    • idempotency key table
    • dedup on stable request/event key
  • Interview relevance:
    • payments
    • webhook receivers
    • exactly-once-ish API semantics
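
A minimal sketch of an idempotency key table, assuming the caller supplies a stable key per logical request. The in-memory dict stands in for a durable table, and `IdempotentProcessor` is a hypothetical name.

```python
class IdempotentProcessor:
    def __init__(self):
        self.results = {}  # idempotency key -> stored result (stand-in for a durable table)
        self.balance = 0

    def charge(self, idempotency_key: str, amount: int) -> int:
        """Apply the charge once per key; retries return the stored result."""
        if idempotency_key in self.results:
            return self.results[idempotency_key]
        self.balance += amount
        self.results[idempotency_key] = self.balance
        return self.balance
```

In a real system the key lookup and the side effect need to commit together, otherwise a crash between them reopens the duplicate window.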

Transactional Writer #

  • Closest archetype:
    • critical transaction process
    • workflow state + durable commit
  • Mechanism family:
    • transactional sink write
    • atomic write of data and progress
  • Interview relevance:
    • outbox-like guarantees
    • exactly-once sink semantics
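
The "atomic write of data and progress" idea can be sketched with an in-memory SQLite database as the sink. The table names, column names, and helper functions are assumptions for illustration; the point is that rows and the offset commit in one transaction.

```python
import sqlite3


def open_sink() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (offset_id INTEGER PRIMARY KEY, payload TEXT)")
    conn.execute("CREATE TABLE progress (id INTEGER PRIMARY KEY CHECK (id = 1), next_offset INTEGER)")
    conn.execute("INSERT INTO progress VALUES (1, 0)")
    conn.commit()
    return conn


def write_batch(conn: sqlite3.Connection, start_offset: int, payloads: list) -> None:
    """Insert rows and advance progress together; both commit or neither does."""
    with conn:  # commits on success, rolls back if an exception escapes
        for i, payload in enumerate(payloads):
            if payload is None:
                raise ValueError("bad payload")
            conn.execute("INSERT INTO events VALUES (?, ?)", (start_offset + i, payload))
        conn.execute("UPDATE progress SET next_offset = ?", (start_offset + len(payloads),))


def next_offset(conn: sqlite3.Connection) -> int:
    return conn.execute("SELECT next_offset FROM progress").fetchone()[0]
```

A failed batch leaves neither partial rows nor an advanced offset, so a restart resumes from a consistent point.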

Proxy #

  • Closest archetype:
    • immutable dataset publish through mutable pointer
    • versioned namespace + head pointer
  • Mechanism family:
    • write new version elsewhere
    • publish by pointer switch
  • Interview relevance:
    • blue/green datasets
    • snapshot publishing
    • “don’t overwrite, publish new version”
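
A sketch of publish-by-pointer-switch, assuming versions are immutable once written. `VersionedDataset` and its methods are hypothetical names; in practice the head pointer might be a view definition, a symlink, or a metadata row.

```python
class VersionedDataset:
    def __init__(self):
        self.versions = {}  # version id -> immutable tuple of rows
        self.head = None    # pointer to the currently published version

    def write_version(self, version_id: str, rows) -> None:
        """Write a new version without touching what readers currently see."""
        if version_id in self.versions:
            raise ValueError(f"version {version_id} already exists")
        self.versions[version_id] = tuple(rows)

    def publish(self, version_id: str) -> None:
        """Switch the head pointer; also usable to roll back to an older version."""
        if version_id not in self.versions:
            raise KeyError(version_id)
        self.head = version_id

    def read(self):
        return self.versions[self.head]
```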

4. Data Value Patterns #

Static Joiner / Dynamic Joiner #

  • Closest archetype:
    • derived projection
    • enrichment pipeline
  • Mechanism family:
    • lookup join or stateful join
    • enrich stream/batch records
  • Interview relevance:
    • denormalized read models
    • recommendation/search features

Wrapper / Metadata Decorator #

  • Closest archetype:
    • derived projection
    • content envelope / event envelope
  • Mechanism family:
    • attach metadata around payload
  • Interview relevance:
    • event schema evolution
    • observability metadata
    • transport envelopes

Distributed Aggregator / Local Aggregator #

  • Closest archetype:
    • derived projection
    • ranking/leaderboard/trending
  • Mechanism family:
    • partial aggregation then combine
    • local pre-aggregation to reduce hot writes
  • Interview relevance:
    • counters
    • analytics
    • popularity
    • stream reductions
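
Partial aggregation followed by a combine step might look like this, using `collections.Counter` for per-partition counts. The two function names are hypothetical.

```python
from collections import Counter


def local_aggregate(events) -> Counter:
    """Pre-aggregate one partition's events into per-key counts."""
    return Counter(events)


def combine(partials) -> Counter:
    """Merge per-partition partial counts into a global result."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts rather than replacing them
    return total
```

Each partition emits one small summary instead of one write per event, which is what reduces pressure on a hot counter.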

Incremental Sessionizer / Stateful Sessionizer #

  • Closest archetype:
    • temporal grouping projection
    • stateful streaming workflow
  • Mechanism family:
    • keyed session state
    • inactivity timeout
  • Interview relevance:
    • user activity analytics
    • event stream segmentation
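
The inactivity-timeout rule can be sketched for a single key's events as below. This is a batch-style illustration with a hypothetical `sessionize` helper; a streaming version would keep the open session in keyed state.

```python
def sessionize(timestamps, gap: int):
    """Group event times into sessions; a gap larger than `gap` starts a new session."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)  # still within the inactivity timeout
        else:
            sessions.append([ts])    # timeout exceeded, or first event
    return sessions
```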

Bin Pack Orderer / FIFO Orderer #

  • Closest archetype:
    • batching / ordering pipeline
    • append log processing discipline
  • Mechanism family:
    • reorder-by-batch or strict FIFO
  • Interview relevance:
    • throughput vs ordering tradeoff
    • queue/stream consumption strategy

5. Data Flow Patterns #

Local Sequencer / Isolated Sequencer #

  • Closest archetype:
    • workflow / orchestration
    • control plane sequencing
  • Mechanism family:
    • explicit task order in same runner
    • isolated stages with explicit dependencies
  • Interview relevance:
    • DAG orchestration
    • multi-step processes

Aligned Fan-In / Unaligned Fan-In #

  • Closest archetype:
    • fan-in join / barrier coordination
  • Mechanism family:
    • wait for all aligned inputs
    • accept partial/misaligned arrival with coordination logic
  • Interview relevance:
    • stream joins
    • multi-source ETL synchronization

Parallel Split / Exclusive Choice #

  • Closest archetype:
    • fan-out branching
    • conditional workflow path selection
  • Mechanism family:
    • branch to many
    • route to one based on predicate
  • Interview relevance:
    • routing pipelines
    • rules engines
    • notification delivery paths

Single Runner / Concurrent Runner #

  • Closest archetype:
    • claimable run execution model
    • workflow concurrency control
  • Mechanism family:
    • one runner at a time
    • parallel task execution
  • Interview relevance:
    • schedulers
    • Airflow/Cadence/Temporal style orchestration

6. Data Storage Patterns #

Horizontal Partitioner #

  • Closest archetype:
    • partitioned storage
    • hot-range/hot-key scale control
  • Mechanism family:
    • split by time/key/tenant/hash
  • Interview relevance:
    • scaling writes
    • pruning reads
    • tenant isolation

Vertical Partitioner #

  • Closest archetype:
    • hot/cold split
    • metadata/content separation
  • Mechanism family:
    • split wide row/object by column or concern
  • Interview relevance:
    • blob metadata vs blob bytes
    • hot fields vs cold payload

Bucket #

  • Closest archetype:
    • write sharding
    • hot-key mitigation
  • Mechanism family:
    • split one logical hot key across many physical keys
  • Interview relevance:
    • hot tags
    • counters
    • leaderboard partials
    • DDB/Spanner hot partitions
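
A sketch of splitting one logical hot key across many physical keys, assuming random bucket selection on write and a scatter-gather read. The `ShardedCounter` class and the `key#bucket` naming scheme are illustrative assumptions.

```python
import random


class ShardedCounter:
    def __init__(self, num_buckets: int, rng=None):
        self.num_buckets = num_buckets
        self.store = {}  # physical key -> count (stand-in for a key-value table)
        self.rng = rng if rng is not None else random.Random()

    def increment(self, logical_key: str, amount: int = 1) -> None:
        """Spread writes for one logical key across several physical keys."""
        bucket = self.rng.randrange(self.num_buckets)
        physical_key = f"{logical_key}#{bucket}"
        self.store[physical_key] = self.store.get(physical_key, 0) + amount

    def read(self, logical_key: str) -> int:
        """Reads pay the cost: sum every bucket for the logical key."""
        return sum(
            self.store.get(f"{logical_key}#{b}", 0) for b in range(self.num_buckets)
        )
```

Writes get cheaper and reads get more expensive, so the bucket count is a tuning knob rather than a free win.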

Sorter #

  • Closest archetype:
    • storage layout optimization
    • ordered projection support
  • Mechanism family:
    • sorted write layout for scan/read efficiency
  • Interview relevance:
    • Parquet/warehouse read pruning
    • key-range scan efficiency

Metadata Enhancer #

  • Closest archetype:
    • derived metadata projection
  • Mechanism family:
    • attach summary/index metadata to improve read pruning
  • Interview relevance:
    • manifests
    • file statistics
    • pruning and skipping

Dataset Materializer #

  • Closest archetype:
    • derived projection
    • snapshot publish
  • Mechanism family:
    • precompute read-optimized dataset
  • Interview relevance:
    • dashboards
    • search/index docs
    • warehouse marts

Manifest #

  • Closest archetype:
    • versioned namespace + immutable content units
    • publish-through-metadata pointer
  • Mechanism family:
    • explicit list of immutable files/chunks forming a version
  • Interview relevance:
    • lakehouse metadata
    • Dropbox-like manifests
    • immutable version publish
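
One way to picture a manifest is as an ordered list of content hashes pointing at immutable chunks. The sketch below assumes SHA-256 content addressing and an in-memory dict as the chunk store; all function names are hypothetical.

```python
import hashlib


def put_chunk(store: dict, data: bytes) -> str:
    """Store a chunk under its content hash; identical chunks are stored once."""
    digest = hashlib.sha256(data).hexdigest()
    store.setdefault(digest, data)
    return digest


def write_manifest(store: dict, chunks) -> list:
    """A version is the ordered list of chunk hashes that make it up."""
    return [put_chunk(store, chunk) for chunk in chunks]


def read_version(store: dict, manifest) -> bytes:
    """Reassemble a version by following its manifest."""
    return b"".join(store[digest] for digest in manifest)
```

Two versions that share a chunk reference the same stored bytes, so publishing a new version only costs the changed chunks plus a new manifest.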

Normalizer / Denormalizer #

  • Closest archetype:
    • storage/read model shaping
    • normalized write truth vs denormalized read projection
  • Mechanism family:
    • normalize for integrity
    • denormalize for performance
  • Interview relevance:
    • read/write model split
    • OLTP vs OLAP shape

7. Data Quality Patterns #

Audit-Write-Audit-Publish #

  • Closest archetype:
    • critical publish workflow
    • validation gate before external visibility
  • Mechanism family:
    • audit input
    • write output
    • audit output
    • publish only if valid
  • Interview relevance:
    • regulated pipelines
    • quality gates before consumer exposure
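
The four steps above can be sketched as a single gate function. The `awap` helper, its callback parameters, and the use of a list as the published dataset are assumptions for illustration.

```python
def awap(input_rows, transform, audit_input, audit_output, published: list) -> bool:
    """Audit input, write to staging, audit output, then publish only if valid."""
    if not audit_input(input_rows):
        return False
    staged = [transform(row) for row in input_rows]  # staging area, not yet visible
    if not audit_output(staged):
        return False
    published[:] = staged  # publish step: swap the visible contents
    return True
```

A failed audit leaves the previously published data untouched, which is the property consumers rely on.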

Constraints Enforcer #

  • Closest archetype:
    • invariant enforcement
    • data contract enforcement
  • Mechanism family:
    • validate before accept/publish
  • Interview relevance:
    • schema/data constraints
    • business rule enforcement

Schema Compatibility Enforcer / Schema Migrator #

  • Closest archetype:
    • contract/version management
    • versioned namespace / schema evolution
  • Mechanism family:
    • compatibility checks
    • coordinated migration
  • Interview relevance:
    • event schemas
    • API evolution
    • CDC compatibility

Offline Observer / Online Observer #

  • Closest archetype:
    • observability / audit sidecar
  • Mechanism family:
    • batch or realtime quality monitors
  • Interview relevance:
    • data quality checks
    • pipeline guardrails

8. Data Observability Patterns #

Flow Interruption Detector #

  • Closest archetype:
    • liveness detector
    • pipeline monitoring
  • Mechanism family:
    • missing heartbeat / missing data detector
  • Interview relevance:
    • stuck pipeline alerting
    • “why is data not arriving?”

Skew Detector #

  • Closest archetype:
    • hot partition / skew observability
  • Mechanism family:
    • detect uneven load/data distribution
  • Interview relevance:
    • Flink/Spark/Kafka skew
    • DDB/Spanner hot partition symptoms

Lag Detector #

  • Closest archetype:
    • consumer progress monitoring
    • replication freshness monitoring
  • Mechanism family:
    • source offset/time minus consumer offset/time
  • Interview relevance:
    • Kafka lag
    • CDC lag
    • replication delay
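
Offset-based lag is a subtraction per partition. The sketch below assumes both offset maps are already available as plain dicts; fetching them from a broker is out of scope, and the function names are hypothetical.

```python
def partition_lag(end_offsets: dict, committed_offsets: dict) -> dict:
    """Lag per partition = latest source offset minus the consumer's committed offset."""
    return {
        partition: end - committed_offsets.get(partition, 0)
        for partition, end in end_offsets.items()
    }


def lagging(lag_by_partition: dict, threshold: int) -> list:
    """Return the partitions whose lag exceeds the alert threshold."""
    return sorted(p for p, lag in lag_by_partition.items() if lag > threshold)
```

Time-based lag follows the same shape with timestamps in place of offsets, and is often the more meaningful number for freshness SLAs.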

SLA Misses Detector #

  • Closest archetype:
    • workflow deadline monitoring
    • service level observability
  • Mechanism family:
    • compare expected completion/arrival vs actual
  • Interview relevance:
    • batch deadlines
    • delayed notifications
    • stale projections

Dataset Tracker / Fine-Grained Tracker #

  • Closest archetype:
    • lineage/provenance tracking
    • audit metadata graph
  • Mechanism family:
    • dataset-level lineage
    • record/file-level provenance
  • Interview relevance:
    • debugging data issues
    • governance/compliance
    • blast-radius analysis

Most Useful Patterns For Interview Prep #

If you only remember a small subset, remember these:

  1. Change Data Capture
  2. Dead-Letter
  3. Windowed Deduplicator
  4. Checkpointer
  5. Keyed Idempotency
  6. Compactor
  7. Bucket
  8. Manifest
  9. Readiness Marker
  10. Lag Detector

These cover a lot of the recurring “hard” infra/data questions in interviews.

How This Complements Existing Cheat Sheets #

Use this note alongside the existing cheat sheets. Those notes are archetype-first; this note is data-pattern-first.