Data Engineering Patterns to Archetypes Mapping #

This note maps recurring data engineering patterns to the system design archetypes already used in the cheat sheets. It is not a replacement for the existing notes. It is a bridge:

- from data pipeline pattern language
- to product/infra archetype language
- to interview-useful mechanism families

The source inspiration here is data_engineering_desing_patterns.txt, especially its patterns around ingestion, error handling, idempotency, flow orchestration, storage layout, data quality, and observability.

How To Read This Note #

For each data engineering pattern, ask:

- what archetype family is this closest to?
- what mechanism family does it imply?
- where does it show up in interviews?

Use this note when an interview problem drifts into:

- CDC
- dedup
- checkpointing
- late data
- compaction
- orchestration
- stream processing
- data quality
- observability

1. Ingestion Patterns #

Full Loader #

Closest archetype:
- batch snapshot ingestion
- append log + projection refresh
Mechanism family:
- periodic scan
- batch overwrite or batch publish
Interview relevance:
- initial bootstrap
- periodic backfill
- full rebuild of search/index/projection

Incremental Loader #

Closest archetype:
- append-only child object
- projection maintenance
Mechanism family:
- watermark or high-water-mark checkpoint
- incremental fetch by time/key
Interview relevance:
- sync APIs
- changes feed
- delta pipelines
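
The watermark checkpoint above can be sketched as follows. This is a minimal illustration, not a reference implementation: `Row`, `incremental_load`, and the `updated_at` field are hypothetical names, and the source is assumed to expose a change timestamp on every row.

```python
from dataclasses import dataclass


@dataclass
class Row:
    id: int
    updated_at: int  # assumption: epoch seconds set by the source on every change


def incremental_load(source: list[Row], watermark: int) -> tuple[list[Row], int]:
    """Return rows changed after `watermark`, plus the new high-water mark."""
    batch = sorted(
        (r for r in source if r.updated_at > watermark),
        key=lambda r: r.updated_at,
    )
    # If nothing changed, keep the old watermark so the next run starts from the same point.
    new_watermark = batch[-1].updated_at if batch else watermark
    return batch, new_watermark
```

One design caveat worth raising in an interview: a strict `>` comparison can miss rows that commit later with a timestamp equal to the watermark, so real pipelines often re-read a small overlap and rely on an idempotent sink.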

Change Data Capture #

Closest archetype:
- append log + consumer progress
- projection / replication pipeline
Mechanism family:
- WAL/binlog/stream reader
- downstream event consumers
Interview relevance:
- search indexing
- webhooks
- replication
- materialized projections
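
A downstream consumer of a change stream might look like the sketch below. The event shape (`lsn`, `op`, `key`, `after`) is an assumption made for illustration and does not match any specific CDC tool; `apply_changes` is a hypothetical helper.

```python
def apply_changes(projection: dict, changes: list[dict], last_lsn: int = 0) -> int:
    """Apply ordered change events to a projection and return the last applied LSN."""
    for change in changes:  # assumption: events arrive ordered by log sequence number
        if change["lsn"] <= last_lsn:
            continue  # already applied; makes replay after a restart safe
        if change["op"] in ("insert", "update"):
            projection[change["key"]] = change["after"]
        elif change["op"] == "delete":
            projection.pop(change["key"], None)
        else:
            raise ValueError(f"unknown op: {change['op']}")
        last_lsn = change["lsn"]
    return last_lsn
```

Returning the last applied LSN is what ties this to the "append log + consumer progress" archetype: the caller persists it and passes it back in after a restart.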

Passthrough Replicator #

Closest archetype:
- append log replication
- control plane + local snapshot
Mechanism family:
- source stream to replica sink
- minimal transformation
Interview relevance:
- cache warm pipelines
- region replication
- mirror topics / mirror stores

Transformation Replicator #

Closest archetype:
- derived projection
- ETL/ELT pipeline
Mechanism family:
- read source events
- transform
- write derived sink
Interview relevance:
- denormalized read models
- analytics sidecars
- search documents

Compactor #

Closest archetype:
- append log + compaction
- versioned namespace + immutable content units
- projection maintenance
Mechanism family:
- merge many fragments into fewer bigger units
- materialize compacted state
Interview relevance:
- log compaction
- small-files problem
- snapshotting
- storage/read amplification tradeoffs
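
As a rough illustration of "merge many fragments into fewer bigger units", the sketch below performs key-based log compaction in memory. The record shape and the use of `None` as a tombstone are assumptions for the example.

```python
from typing import Optional

Record = tuple[str, Optional[str]]  # (key, value); a value of None is a tombstone


def compact(segments: list[list[Record]]) -> list[tuple[str, str]]:
    """Merge many small segments into one, keeping only the latest value per key."""
    latest: dict[str, Optional[str]] = {}
    for segment in segments:        # assumption: segments are ordered oldest first
        for key, value in segment:  # later records overwrite earlier ones
            latest[key] = value
    # Drop keys whose latest record is a tombstone.
    return sorted((k, v) for k, v in latest.items() if v is not None)
```

The tradeoff is visible even at this scale: compaction spends write work now (rewriting data) to cut read amplification later (one segment to scan instead of many).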

Readiness Marker #

Closest archetype:
- workflow / orchestration gate
- future constraint / external dependency release
Mechanism family:
- sentinel file
- completion marker
- downstream gate
Interview relevance:
- batch pipeline sequencing
- multi-step ETL
- “when is data safe to consume?”
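
A sentinel-file gate can be sketched like this. The `_SUCCESS` marker name follows the Hadoop/Spark convention; the function names and file layout are hypothetical.

```python
import tempfile
from pathlib import Path

MARKER = "_SUCCESS"  # marker name borrowed from the Hadoop/Spark convention


def publish_partition(partition_dir: Path, rows: list[str]) -> None:
    """Write the data first, then the marker, so the marker implies complete data."""
    partition_dir.mkdir(parents=True, exist_ok=True)
    (partition_dir / "part-0000.txt").write_text("\n".join(rows))
    (partition_dir / MARKER).touch()  # written last, on purpose


def is_ready(partition_dir: Path) -> bool:
    """Downstream gate: consume only partitions that carry the marker."""
    return (partition_dir / MARKER).exists()
```

The ordering is the whole pattern: a consumer that checks `is_ready` never sees a half-written partition, as long as the producer writes the marker strictly after the data.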

External Trigger #

Closest archetype:
- workflow / lifecycle
- future constraint + claimable run
Mechanism family:
- webhook/event-triggered job start
- external scheduler signal
Interview relevance:
- event-driven pipelines
- orchestration kickoffs

2. Error Management Patterns #

Dead-Letter #

Closest archetype:
- append log + side-failure sink
- workflow / retry path
Mechanism family:
- failed record quarantine
- replay queue / poison record isolation
Interview relevance:
- queue consumers
- webhook delivery
- ETL bad-record handling
- stream processing fault isolation
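
Poison record isolation can be sketched as below. `process_with_dlq` is a hypothetical helper, and the dead-letter "queue" is just a list standing in for a real topic or table.

```python
from typing import Callable


def process_with_dlq(
    records: list[str], handler: Callable[[str], int]
) -> tuple[list[int], list[dict]]:
    """Process records one by one; quarantine failures instead of failing the batch."""
    results: list[int] = []
    dead_letters: list[dict] = []
    for record in records:
        try:
            results.append(handler(record))
        except Exception as exc:
            # Keep the original record and the error so it can be inspected and replayed.
            dead_letters.append({"record": record, "error": repr(exc)})
    return results, dead_letters
```

For example, `process_with_dlq(["1", "x", "3"], int)` processes the two valid records and quarantines `"x"` rather than blocking the records behind it.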

Windowed Deduplicator #

Closest archetype:
- append-only ingestion with idempotency window
- relation/event dedup
Mechanism family:
- key + time window dedup state
- watermark-aware duplicate suppression
Interview relevance:
- event ingestion
- click/impression streams
- CDC duplicate suppression
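
The key + time window state can be sketched as follows. This is a simplified, single-process model: the class name is hypothetical, and the watermark is assumed to be the maximum event time seen so far.

```python
class WindowedDeduplicator:
    """Suppress duplicate keys seen within `window_seconds` of the watermark."""

    def __init__(self, window_seconds: int):
        self.window = window_seconds
        self.seen: dict[str, int] = {}  # key -> event time when first accepted
        self.watermark = 0              # assumption: max event time observed so far

    def accept(self, key: str, event_time: int) -> bool:
        self.watermark = max(self.watermark, event_time)
        cutoff = self.watermark - self.window
        # Expire state older than the window so memory stays bounded.
        self.seen = {k: t for k, t in self.seen.items() if t >= cutoff}
        if key in self.seen:
            return False  # duplicate inside the window
        self.seen[key] = event_time
        return True
```

The bounded state is the tradeoff: a duplicate that arrives after its key has expired is accepted again, which is why the window must be sized against the producer's realistic retry horizon.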

Late Data Detector #

Closest archetype:
- stateful streaming projection
- temporal workflow correction
Mechanism family:
- watermarking
- lateness threshold
- side-output for late arrivals
Interview relevance:
- stream windows
- ranking/trending
- analytics freshness correctness
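
A minimal sketch of watermarking with a lateness threshold and a side-output. The watermark here is simply the maximum event time seen so far, which is an assumption; real engines usually derive it with more care.

```python
def split_late(
    events: list[tuple[int, str]], allowed_lateness: int
) -> tuple[list[tuple[int, str]], list[tuple[int, str]]]:
    """Route events (event_time, payload), given in arrival order, to on-time or late."""
    watermark = float("-inf")
    on_time: list[tuple[int, str]] = []
    late: list[tuple[int, str]] = []  # side-output for late arrivals
    for event_time, payload in events:
        watermark = max(watermark, event_time)
        if event_time < watermark - allowed_lateness:
            late.append((event_time, payload))
        else:
            on_time.append((event_time, payload))
    return on_time, late
```

The side-output is what the Static and Dynamic Late Data Integrator patterns consume: detection only classifies, while integration decides how to repair the affected windows.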

Static Late Data Integrator #

Closest archetype:
- batch repair / backfill workflow
Mechanism family:
- fixed replay windows
- periodic backfill job
Interview relevance:
- warehouse partition repair
- scheduled recomputation

Dynamic Late Data Integrator #

Closest archetype:
- repair workflow with discovered worklist
- frontier-like backfill set
Mechanism family:
- state table of impacted partitions/windows
- targeted reprocessing
Interview relevance:
- selective replay
- replay only affected partitions

Filter Interceptor #

Closest archetype:
- projection / observability side-output
Mechanism family:
- classify-and-count filtered records
- branch bad/good records with metrics
Interview relevance:
- data quality reporting
- ETL auditability

Checkpointer #

Closest archetype:
- append log + consumer progress
- claimable run progress state
Mechanism family:
- offset/state persistence
- restart from last durable progress
Interview relevance:
- Kafka consumers
- Spark/Flink jobs
- crawler/scheduler progress tracking
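
Offset persistence with restart from the last durable progress can be sketched as below. The checkpoint file layout and function names are hypothetical; the list `log` stands in for a partition of an append log.

```python
import json
import os
import tempfile
from pathlib import Path


def load_offset(path: Path) -> int:
    """Return the last durable offset, or 0 on first run."""
    return json.loads(path.read_text())["offset"] if path.exists() else 0


def save_offset(path: Path, offset: int) -> None:
    """Write to a temp file, then rename, so a crash never leaves a torn checkpoint."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps({"offset": offset}))
    os.replace(tmp, path)  # atomic replace on the same filesystem


def consume(log: list[str], path: Path, max_records: int) -> list[str]:
    """Read the next batch after the checkpoint, then advance the checkpoint."""
    start = load_offset(path)
    batch = log[start:start + max_records]
    # ... process the batch here, before the checkpoint is advanced ...
    save_offset(path, start + len(batch))
    return batch
```

Because the checkpoint is saved after processing, a crash between the two steps replays the batch on restart. That is at-least-once delivery, so the sink should be idempotent.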

3. Idempotency Patterns #

Data Overwrite #

Closest archetype:
- current-value entity overwrite
- partition overwrite in warehouse systems
Mechanism family:
- replace known bounded slice atomically
Interview relevance:
- warehouse partition refresh
- current-state materialized datasets

Merger #

Closest archetype:
- current-value entity
- versioned state merge
Mechanism family:
- merge/upsert based on business key
Interview relevance:
- slowly changing dimensions
- idempotent ingest

Stateful Merger #

Closest archetype:
- workflow/projection with accumulated state
Mechanism family:
- key-state merge
- previous state consulted during write
Interview relevance:
- streaming upserts
- incremental aggregates

Keyed Idempotency #

Closest archetype:
- critical transaction process
- append ingestion dedup
Mechanism family:
- idempotency key table
- dedup on stable request/event key
Interview relevance:
- payments
- webhook receivers
- exactly-once-ish API semantics
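
An idempotency key table can be sketched with the standard-library `sqlite3` module. The table schema, `make_store`, and `charge_once` are hypothetical names for illustration; the "charge" is a placeholder string rather than a real side effect.

```python
import sqlite3


def make_store() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE idempotency (key TEXT PRIMARY KEY, result TEXT NOT NULL)"
    )
    return conn


def charge_once(conn: sqlite3.Connection, idempotency_key: str, amount_cents: int) -> str:
    """Perform the charge at most once per key; replays return the stored result."""
    row = conn.execute(
        "SELECT result FROM idempotency WHERE key = ?", (idempotency_key,)
    ).fetchone()
    if row:
        return row[0]  # retry of a request we already handled
    result = f"charged:{amount_cents}"  # placeholder for the real side effect
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO idempotency (key, result) VALUES (?, ?)",
            (idempotency_key, result),
        )
    return result
```

The primary key is the real guard: if two concurrent requests both pass the `SELECT`, the second `INSERT` fails with an integrity error instead of recording a second charge.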

Transactional Writer #

Closest archetype:
- critical transaction process
- workflow state + durable commit
Mechanism family:
- transactional sink write
- atomic write of data and progress
Interview relevance:
- outbox-like guarantees
- exactly-once sink semantics
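
The "atomic write of data and progress" idea can be sketched with `sqlite3`, assuming the sink and the progress record live in the same transactional store. Table names, `setup`, and `write_batch` are hypothetical; the `fail` flag only simulates a crash for the example.

```python
import sqlite3


def setup() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sink (pos INTEGER PRIMARY KEY, value TEXT)")
    conn.execute("CREATE TABLE progress (id INTEGER PRIMARY KEY, next_pos INTEGER)")
    conn.execute("INSERT INTO progress VALUES (1, 0)")
    conn.commit()
    return conn


def write_batch(conn: sqlite3.Connection, source: list[str], size: int, fail: bool = False) -> int:
    """Write one batch and advance progress in a single transaction."""
    start = conn.execute("SELECT next_pos FROM progress WHERE id = 1").fetchone()[0]
    batch = source[start:start + size]
    with conn:  # both statements commit together, or neither does
        conn.executemany(
            "INSERT INTO sink (pos, value) VALUES (?, ?)",
            [(start + i, v) for i, v in enumerate(batch)],
        )
        if fail:
            raise RuntimeError("simulated crash before commit")
        conn.execute(
            "UPDATE progress SET next_pos = ? WHERE id = 1", (start + len(batch),)
        )
    return len(batch)
```

After a simulated failure the sink holds no partial rows and the progress pointer is unchanged, so the retry writes the same batch exactly once. This only works when data and progress share one transaction boundary; across two systems an outbox or two-phase approach is needed instead.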

Proxy #

Closest archetype:
- immutable dataset publish through mutable pointer
- versioned namespace + head pointer
Mechanism family:
- write new version elsewhere
- publish by pointer switch
Interview relevance:
- blue/green datasets
- snapshot publishing
- “don’t overwrite, publish new version”
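
Publishing by pointer switch can be sketched on a local filesystem. The `CURRENT` pointer file and the directory layout are assumptions for illustration; a real system might use a catalog entry, a view, or a symlink instead.

```python
import os
import tempfile
from pathlib import Path


def publish(root: Path, version: str, rows: list[str]) -> None:
    """Write an immutable new version, then flip the pointer to it."""
    version_dir = root / version
    version_dir.mkdir(parents=True)  # fails if the version exists: versions are immutable
    (version_dir / "data.txt").write_text("\n".join(rows))
    tmp = root / "CURRENT.tmp"
    tmp.write_text(version)
    os.replace(tmp, root / "CURRENT")  # the atomic pointer switch is the publish step


def read_current(root: Path) -> list[str]:
    """Readers resolve the pointer first, then read the version it names."""
    version = (root / "CURRENT").read_text()
    return (root / version / "data.txt").read_text().split("\n")
```

Rollback is the same operation in reverse: repoint `CURRENT` at the previous version, which still exists because nothing was overwritten.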

4. Data Value Patterns #

Static Joiner / Dynamic Joiner #

Closest archetype:
- derived projection
- enrichment pipeline
Mechanism family:
- lookup join or stateful join
- enrich stream/batch records
Interview relevance:
- denormalized read models
- recommendation/search features

Wrapper / Metadata Decorator #

Closest archetype:
- derived projection
- content envelope / event envelope
Mechanism family:
- attach metadata around payload
Interview relevance:
- event schema evolution
- observability metadata
- transport envelopes

Distributed Aggregator / Local Aggregator #

Closest archetype:
- derived projection
- ranking/leaderboard/trending
Mechanism family:
- partial aggregation then combine
- local pre-aggregation to reduce hot writes
Interview relevance:
- counters
- analytics
- popularity
- stream reductions
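
Partial aggregation followed by a combine step can be sketched with `collections.Counter`. The function names are hypothetical, and each inner list stands in for the events seen by one worker or partition.

```python
from collections import Counter


def local_aggregate(events: list[str]) -> Counter:
    """Pre-aggregate on the worker so only one row per key leaves the partition."""
    return Counter(events)


def combine(partials: list[Counter]) -> Counter:
    """Merge the per-partition partial counts into global totals."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts rather than replacing them
    return total
```

This works because counting is associative and commutative. Aggregates without that property, such as an exact median, cannot be split into partials this way.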

Incremental Sessionizer / Stateful Sessionizer #

Closest archetype:
- temporal grouping projection
- stateful streaming workflow
Mechanism family:
- keyed session state
- inactivity timeout
Interview relevance:
- user activity analytics
- event stream segmentation
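
Session splitting by inactivity timeout can be sketched for a single key as below. `sessionize` is a hypothetical batch-style helper; a streaming version would hold the open session in keyed state and close it with a timer.

```python
def sessionize(timestamps: list[int], gap: int) -> list[list[int]]:
    """Group one user's event times into sessions separated by more than `gap`."""
    sessions: list[list[int]] = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)  # still within the inactivity timeout
        else:
            sessions.append([ts])    # timeout exceeded: start a new session
    return sessions
```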

Bin Pack Orderer / FIFO Orderer #

Closest archetype:
- batching / ordering pipeline
- append log processing discipline
Mechanism family:
- reorder-by-batch or strict FIFO
Interview relevance:
- throughput vs ordering tradeoff
- queue/stream consumption strategy

5. Data Flow Patterns #

Local Sequencer / Isolated Sequencer #

Closest archetype:
- workflow / orchestration
- control plane sequencing
Mechanism family:
- explicit task order in same runner
- isolated stages with explicit dependencies
Interview relevance:
- DAG orchestration
- multi-step processes

Aligned Fan-In / Unaligned Fan-In #

Closest archetype:
- fan-in join / barrier coordination
Mechanism family:
- wait for all aligned inputs
- accept partial/misaligned arrival with coordination logic
Interview relevance:
- stream joins
- multi-source ETL synchronization

Parallel Split / Exclusive Choice #

Closest archetype:
- fan-out branching
- conditional workflow path selection
Mechanism family:
- branch to many
- route to one based on predicate
Interview relevance:
- routing pipelines
- rules engines
- notification delivery paths

Single Runner / Concurrent Runner #

Closest archetype:
- claimable run execution model
- workflow concurrency control
Mechanism family:
- one runner at a time
- parallel task execution
Interview relevance:
- schedulers
- Airflow/Cadence/Temporal-style orchestration

6. Data Storage Patterns #

Horizontal Partitioner #

Closest archetype:
- partitioned storage
- hot-range/hot-key scale control
Mechanism family:
- split by time/key/tenant/hash
Interview relevance:
- scaling writes
- pruning reads
- tenant isolation

Vertical Partitioner #

Closest archetype:
- hot/cold split
- metadata/content separation
Mechanism family:
- split wide row/object by column or concern
Interview relevance:
- blob metadata vs blob bytes
- hot fields vs cold payload

Bucket #

Closest archetype:
- write sharding
- hot-key mitigation
Mechanism family:
- split one logical hot key across many physical keys
Interview relevance:
- hot tags
- counters
- leaderboard partials
- DynamoDB/Spanner hot partitions
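
Splitting one logical hot key across many physical keys can be sketched as a sharded counter. The class name and the `key#shard` naming scheme are assumptions, and the dict stands in for a key-value table.

```python
import random


class ShardedCounter:
    """Spread writes for one logical key over `shards` physical keys."""

    def __init__(self, logical_key: str, shards: int):
        self.logical_key = logical_key
        self.shards = shards
        self.store: dict[str, int] = {}  # stand-in for a key-value table

    def increment(self, amount: int = 1) -> None:
        shard = random.randrange(self.shards)  # random shard spreads the write load
        physical_key = f"{self.logical_key}#{shard}"
        self.store[physical_key] = self.store.get(physical_key, 0) + amount

    def total(self) -> int:
        # Reads pay for the write scaling: they must fan in across every shard.
        return sum(
            self.store.get(f"{self.logical_key}#{i}", 0) for i in range(self.shards)
        )
```

The cost moves from writes to reads: one hot write path becomes `shards` cooler ones, while every read becomes a scatter-gather over all shards.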

Sorter #

Closest archetype:
- storage layout optimization
- ordered projection support
Mechanism family:
- sorted write layout for scan/read efficiency
Interview relevance:
- Parquet/warehouse read pruning
- key-range scan efficiency

Metadata Enhancer #

Closest archetype:
- derived metadata projection
Mechanism family:
- attach summary/index metadata to improve read pruning
Interview relevance:
- manifests
- file statistics
- pruning and skipping
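
File statistics used for read pruning can be sketched as min/max ranges per file. The function names are hypothetical, and each list of integers stands in for the values of one column in one file.

```python
def build_stats(files: dict[str, list[int]]) -> dict[str, tuple[int, int]]:
    """Record (min, max) of the column for each file at write time."""
    return {name: (min(values), max(values)) for name, values in files.items()}


def files_to_scan(stats: dict[str, tuple[int, int]], lo: int, hi: int) -> list[str]:
    """Keep only files whose [min, max] range overlaps the query range [lo, hi]."""
    return sorted(
        name for name, (mn, mx) in stats.items() if mx >= lo and mn <= hi
    )
```

Pruning is only as good as the layout: if values are scattered randomly across files, every range overlaps every query, which is why this pattern pairs naturally with Sorter.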

Dataset Materializer #

Closest archetype:
- derived projection
- snapshot publish
Mechanism family:
- precompute read-optimized dataset
Interview relevance:
- dashboards
- search/index docs
- warehouse marts

Manifest #

Closest archetype:
- versioned namespace + immutable content units
- publish-through-metadata pointer
Mechanism family:
- explicit list of immutable files/chunks forming a version
Interview relevance:
- lakehouse metadata
- Dropbox-like manifests
- immutable version publish

Normalizer / Denormalizer #

Closest archetype:
- storage/read model shaping
- normalized write truth vs denormalized read projection
Mechanism family:
- normalize for integrity
- denormalize for performance
Interview relevance:
- read/write model split
- OLTP vs OLAP shape

7. Data Quality Patterns #

Audit-Write-Audit-Publish #

Closest archetype:
- critical publish workflow
- validation gate before external visibility
Mechanism family:
- audit input
- write output
- audit output
- publish only if valid
Interview relevance:
- regulated pipelines
- quality gates before consumer exposure
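
The four steps above can be sketched as one gated function. The specific checks (required fields, unique ids, non-negative amounts) are invented for the example, and the `published` list stands in for the consumer-visible dataset.

```python
def audit_write_audit_publish(rows: list[dict], published: list[dict]) -> bool:
    """Return True and replace `published` only if both audits pass."""
    # 1. Audit input: reject empty batches and rows missing required fields.
    if not rows or any(not {"id", "amount"}.issubset(r) for r in rows):
        return False
    # 2. Write output to a staging area that consumers cannot see.
    staging = [dict(r) for r in rows]
    # 3. Audit output: ids must be unique and amounts non-negative.
    ids = {r["id"] for r in staging}
    if len(ids) != len(staging) or any(r["amount"] < 0 for r in staging):
        return False
    # 4. Publish only if valid.
    published[:] = staging
    return True
```

A failed audit leaves the previously published data untouched, which is the property that matters for consumers.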

Constraints Enforcer #

Closest archetype:
- invariant enforcement
- data contract enforcement
Mechanism family:
- validate before accept/publish
Interview relevance:
- schema/data constraints
- business rule enforcement

Schema Compatibility Enforcer / Schema Migrator #

Closest archetype:
- contract/version management
- versioned namespace / schema evolution
Mechanism family:
- compatibility checks
- coordinated migration
Interview relevance:
- event schemas
- API evolution
- CDC compatibility

Offline Observer / Online Observer #

Closest archetype:
- observability / audit sidecar
Mechanism family:
- batch or realtime quality monitors
Interview relevance:
- data quality checks
- pipeline guardrails

8. Data Observability Patterns #

Flow Interruption Detector #

Closest archetype:
- liveness detector
- pipeline monitoring
Mechanism family:
- missing heartbeat / missing data detector
Interview relevance:
- stuck pipeline alerting
- “why is data not arriving?”

Skew Detector #

Closest archetype:
- hot partition / skew observability
Mechanism family:
- detect uneven load/data distribution
Interview relevance:
- Flink/Spark/Kafka skew
- DynamoDB/Spanner hot partition symptoms

Lag Detector #

Closest archetype:
- consumer progress monitoring
- replication freshness monitoring
Mechanism family:
- source offset/time minus consumer offset/time
Interview relevance:
- Kafka lag
- CDC lag
- replication delay
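
Offset-based lag can be sketched per partition as below. The dict shapes and function names are assumptions; in a real deployment the end offsets and committed offsets would come from the broker or the replication source.

```python
def consumer_lag(end_offsets: dict[int, int], committed: dict[int, int]) -> dict[int, int]:
    """Lag per partition: newest source offset minus the consumer's committed offset."""
    # A partition with no committed offset is treated as not yet consumed at all.
    return {p: end - committed.get(p, 0) for p, end in end_offsets.items()}


def lagging_partitions(lag: dict[int, int], threshold: int) -> list[int]:
    """Partitions whose lag exceeds the alerting threshold."""
    return sorted(p for p, value in lag.items() if value > threshold)
```

Offset lag and time lag answer different questions. A lag of 10,000 records is harmless on a high-throughput topic and alarming on a slow one, so freshness alerts are often expressed in time instead.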

SLA Misses Detector #

Closest archetype:
- workflow deadline monitoring
- service level observability
Mechanism family:
- compare expected completion/arrival vs actual
Interview relevance:
- batch deadlines
- delayed notifications
- stale projections

Dataset Tracker / Fine-Grained Tracker #

Closest archetype:
- lineage/provenance tracking
- audit metadata graph
Mechanism family:
- dataset-level lineage
- record/file-level provenance
Interview relevance:
- debugging data issues
- governance/compliance
- blast-radius analysis

Most Useful Patterns For Interview Prep #

If you only remember a small subset, remember these:

- Change Data Capture
- Dead-Letter
- Windowed Deduplicator
- Checkpointer
- Keyed Idempotency
- Compactor
- Bucket
- Manifest
- Readiness Marker
- Lag Detector

These cover a lot of the recurring “hard” infra/data questions in interviews.

How This Complements Existing Cheat Sheets #

Use this note alongside the existing cheat sheets. The existing notes are archetype-first. This note is data-pattern-first.