Data Engineering Patterns to Archetypes Mapping #
This note maps recurring data engineering patterns to the system design archetypes already used in the cheat sheets. It is not a replacement for the existing notes. It is a bridge:
- from data pipeline pattern language
- to product/infra archetype language
- to interview-useful mechanism families
The source of inspiration is data_engineering_desing_patterns.txt, especially its patterns for ingestion, error handling, idempotency, flow orchestration, storage layout, data quality, and observability.
How To Read This Note #
For each data engineering pattern, ask:
- what archetype family is this closest to?
- what mechanism family does it imply?
- where does it show up in interviews?
Use this note when an interview problem drifts into:
- CDC
- dedup
- checkpointing
- late data
- compaction
- orchestration
- stream processing
- data quality
- observability
1. Ingestion Patterns #
Full Loader #
- Closest archetype:
- batch snapshot ingestion
- append log + projection refresh
- Mechanism family:
- periodic scan
- batch overwrite or batch publish
- Interview relevance:
- initial bootstrap
- periodic backfill
- full rebuild of search/index/projection
Incremental Loader #
- Closest archetype:
- append-only child object
- projection maintenance
- Mechanism family:
- watermark or high-water-mark checkpoint
- incremental fetch by time/key
- Interview relevance:
- sync APIs
- changes feed
- delta pipelines
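As a rough illustration of the watermark checkpoint idea, the sketch below fetches only rows changed after the last stored watermark and then advances it. The `Row` type, the `updated_at` column, and the in-memory `source` list are assumptions made for the example, not part of any real API.

```python
from dataclasses import dataclass


@dataclass
class Row:
    id: int
    updated_at: int  # hypothetical "last modified" column, epoch seconds


def incremental_load(source, watermark):
    """Return rows changed strictly after `watermark`, plus the new watermark."""
    batch = sorted(
        (r for r in source if r.updated_at > watermark),
        key=lambda r: r.updated_at,
    )
    # Advance the watermark only if something was actually read.
    new_watermark = batch[-1].updated_at if batch else watermark
    return batch, new_watermark


source = [Row(1, 100), Row(2, 105), Row(3, 110)]
batch1, wm = incremental_load(source, watermark=0)

source.append(Row(4, 120))  # a new change arrives after the first run
batch2, wm2 = incremental_load(source, wm)
```

One caveat worth raising in an interview: a strict `>` comparison can miss a row that commits later with the same timestamp as the watermark, so real pipelines often overlap the fetch window and deduplicate downstream.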
Change Data Capture #
- Closest archetype:
- append log + consumer progress
- projection / replication pipeline
- Mechanism family:
- WAL/binlog/stream reader
- downstream event consumers
- Interview relevance:
- search indexing
- webhooks
- replication
- materialized projections
Passthrough Replicator #
- Closest archetype:
- append log replication
- control plane + local snapshot
- Mechanism family:
- source stream to replica sink
- minimal transformation
- Interview relevance:
- cache warm pipelines
- region replication
- mirror topics / mirror stores
Transformation Replicator #
- Closest archetype:
- derived projection
- ETL/ELT pipeline
- Mechanism family:
- read source events
- transform
- write derived sink
- Interview relevance:
- denormalized read models
- analytics sidecars
- search documents
Compactor #
- Closest archetype:
- append log + compaction
- versioned namespace + immutable content units
- projection maintenance
- Mechanism family:
- merge many fragments into fewer bigger units
- materialize compacted state
- Interview relevance:
- log compaction
- small-files problem
- snapshotting
- storage/read amplification tradeoffs
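A minimal sketch of "merge many fragments into fewer bigger units", assuming fragments are ordered oldest first, each holds `(key, value)` pairs, and `None` is used as a delete tombstone. These conventions are illustrative choices, not a specific engine's format.

```python
def compact(fragments):
    """Merge fragments (oldest first) into one unit, keeping the last value per key."""
    latest = {}
    for fragment in fragments:
        for key, value in fragment:
            latest[key] = value  # later fragments win
    # Drop tombstoned keys and emit a single sorted unit.
    return sorted((k, v) for k, v in latest.items() if v is not None)


fragments = [
    [("a", 1), ("b", 2)],
    [("a", 3)],
    [("b", None), ("c", 4)],  # "b" is deleted here
]
compacted = compact(fragments)
```

Three small fragments collapse into one unit with two live keys, which is the read-amplification win; the cost is the rewrite work done by the compaction itself.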
Readiness Marker #
- Closest archetype:
- workflow / orchestration gate
- future constraint / external dependency release
- Mechanism family:
- sentinel file
- completion marker
- downstream gate
- Interview relevance:
- batch pipeline sequencing
- multi-step ETL
- “when is data safe to consume?”
External Trigger #
- Closest archetype:
- workflow / lifecycle
- future constraint + claimable run
- Mechanism family:
- webhook/event-triggered job start
- external scheduler signal
- Interview relevance:
- event-driven pipelines
- orchestration kickoffs
2. Error Management Patterns #
Dead-Letter #
- Closest archetype:
- append log + side-failure sink
- workflow / retry path
- Mechanism family:
- failed record quarantine
- replay queue / poison record isolation
- Interview relevance:
- queue consumers
- webhook delivery
- ETL bad-record handling
- stream processing fault isolation
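The quarantine idea can be sketched as a consumer loop that retries a record a bounded number of times and then moves it to a side sink instead of blocking the stream. The `consume` function, the `max_attempts` parameter, and the dead-letter record shape are assumptions for illustration.

```python
import json


def consume(records, handler, max_attempts=3):
    """Process records; isolate records that keep failing into a dead-letter list."""
    processed, dead_letters = [], []
    for raw in records:
        last_error = None
        for _attempt in range(max_attempts):
            try:
                processed.append(handler(raw))
                break
            except Exception as exc:  # broad on purpose: any failure counts
                last_error = exc
        else:
            # All attempts failed: quarantine with enough context to replay later.
            dead_letters.append(
                {"record": raw, "error": repr(last_error), "attempts": max_attempts}
            )
    return processed, dead_letters


def parse_amount(raw):
    return json.loads(raw)["amount"]


processed, dead_letters = consume(
    ['{"amount": 5}', "not json", '{"amount": 7}'], parse_amount
)
```

The poison record does not stop the two healthy records from being processed, and the dead-letter entry keeps the original payload so it can be replayed after a fix.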
Windowed Deduplicator #
- Closest archetype:
- append-only ingestion with idempotency window
- relation/event dedup
- Mechanism family:
- key + time window dedup state
- watermark-aware duplicate suppression
- Interview relevance:
- event ingestion
- click/impression streams
- CDC duplicate suppression
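A small sketch of key + time window dedup state, assuming event times are plain integers and the watermark is simply the maximum event time seen so far. The class and method names are hypothetical.

```python
class WindowedDeduplicator:
    """Suppress duplicate keys seen within `window_seconds` of the watermark."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.seen = {}       # key -> event time of first sighting
        self.watermark = 0   # max event time observed so far

    def accept(self, key, event_time):
        self.watermark = max(self.watermark, event_time)
        # Evict state that has fallen out of the window so memory stays bounded.
        cutoff = self.watermark - self.window_seconds
        self.seen = {k: t for k, t in self.seen.items() if t >= cutoff}
        if key in self.seen:
            return False  # duplicate inside the window
        self.seen[key] = event_time
        return True


dedup = WindowedDeduplicator(window_seconds=60)
results = [
    dedup.accept("e1", 0),
    dedup.accept("e1", 10),   # duplicate within the window
    dedup.accept("e2", 30),
    dedup.accept("e1", 100),  # same key, but the earlier state was evicted
]
```

The last call shows the tradeoff: bounded state means duplicates that arrive after the window are no longer detected.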
Late Data Detector #
- Closest archetype:
- stateful streaming projection
- temporal workflow correction
- Mechanism family:
- watermarking
- lateness threshold
- side-output for late arrivals
- Interview relevance:
- stream windows
- ranking/trending
- analytics freshness correctness
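The watermark and lateness threshold can be sketched as a router that sends events older than `watermark - allowed_lateness` to a side output. The function name and the tuple shape of events are assumptions for the example.

```python
def route_events(events, allowed_lateness):
    """Split (event_time, payload) pairs, given in arrival order, into on-time and late."""
    watermark = float("-inf")
    on_time, late = [], []
    for event_time, payload in events:
        if event_time < watermark - allowed_lateness:
            late.append((event_time, payload))  # side output for late arrivals
        else:
            on_time.append((event_time, payload))
        watermark = max(watermark, event_time)
    return on_time, late


on_time, late = route_events(
    [(100, "a"), (105, "b"), (90, "c"), (103, "d"), (120, "e"), (112, "f")],
    allowed_lateness=10,
)
```

Event `c` arrives more than 10 units behind the watermark and is diverted; `d` and `f` are out of order but still within the allowed lateness.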
Static Late Data Integrator #
- Closest archetype:
- batch repair / backfill workflow
- Mechanism family:
- fixed replay windows
- periodic backfill job
- Interview relevance:
- warehouse partition repair
- scheduled recomputation
Dynamic Late Data Integrator #
- Closest archetype:
- repair workflow with discovered worklist
- frontier-like backfill set
- Mechanism family:
- state table of impacted partitions/windows
- targeted reprocessing
- Interview relevance:
- selective replay
- replay only affected partitions
Filter Interceptor #
- Closest archetype:
- projection / observability side-output
- Mechanism family:
- classify-and-count filtered records
- branch bad/good records with metrics
- Interview relevance:
- data quality reporting
- ETL auditability
Checkpointer #
- Closest archetype:
- append log + consumer progress
- claimable run progress state
- Mechanism family:
- offset/state persistence
- restart from last durable progress
- Interview relevance:
- Kafka consumers
- Spark/Flink jobs
- crawler/scheduler progress tracking
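A minimal sketch of offset persistence and restart from last durable progress, assuming a list stands in for the log and a JSON file stands in for the checkpoint store. The file layout and function names are illustrative only.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    try:
        with open(path) as f:
            return json.load(f)["offset"]
    except FileNotFoundError:
        return 0  # no checkpoint yet: start from the beginning


def save_checkpoint(path, offset):
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"offset": offset}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # atomic rename, so readers never see a partial file


def run(log, path, max_records, sink):
    """Process up to `max_records` from the last durable offset."""
    offset = load_checkpoint(path)
    for record in log[offset:offset + max_records]:
        sink.append(record)
        offset += 1
        save_checkpoint(path, offset)
    return offset


log = ["r0", "r1", "r2", "r3", "r4"]
sink = []
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "checkpoint.json")
    first = run(log, path, max_records=2, sink=sink)    # simulates a run cut short
    second = run(log, path, max_records=10, sink=sink)  # resumes where it stopped
```

Because the sink write and the checkpoint write are separate steps, a crash between them would replay one record on restart. That is at-least-once delivery, which is why this pattern is usually paired with an idempotent or transactional sink.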
3. Idempotency Patterns #
Data Overwrite #
- Closest archetype:
- current-value entity overwrite
- partition overwrite in warehouse systems
- Mechanism family:
- replace known bounded slice atomically
- Interview relevance:
- warehouse partition refresh
- current-state materialized datasets
Merger #
- Closest archetype:
- current-value entity
- versioned state merge
- Mechanism family:
- merge/upsert based on business key
- Interview relevance:
- slowly changing dimensions
- idempotent ingest
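A sketch of merge/upsert on a business key, assuming a dict keyed by `id` stands in for the target table. The point it illustrates is that replaying the same batch leaves the target unchanged.

```python
def merge(target, batch, key="id"):
    """Upsert each row into `target` by business key; later fields overwrite earlier ones."""
    for row in batch:
        existing = target.get(row[key], {})
        target[row[key]] = {**existing, **row}
    return target


target = {1: {"id": 1, "name": "Ada", "tier": "free"}}
batch = [
    {"id": 1, "tier": "pro"},     # update of an existing key
    {"id": 2, "name": "Grace"},   # insert of a new key
]

merge(target, batch)
after_first = {k: dict(v) for k, v in target.items()}
merge(target, batch)  # replay the same batch
```

Running the merge twice yields the same state as running it once, which is what makes a retried load safe.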
Stateful Merger #
- Closest archetype:
- workflow/projection with accumulated state
- Mechanism family:
- key-state merge
- previous state consulted during write
- Interview relevance:
- streaming upserts
- incremental aggregates
Keyed Idempotency #
- Closest archetype:
- critical transaction process
- append ingestion dedup
- Mechanism family:
- idempotency key table
- dedup on stable request/event key
- Interview relevance:
- payments
- webhook receivers
- exactly-once-ish API semantics
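The idempotency key table can be sketched as a map from a stable request key to the stored result, consulted before any side effect runs. `PaymentService` and its fields are hypothetical, and a real system would need the lookup and the write to be atomic (for example a unique constraint), which this in-memory sketch does not show.

```python
class PaymentService:
    def __init__(self):
        self.results_by_key = {}  # idempotency key -> stored result
        self.charges = []         # side effects actually performed

    def charge(self, idempotency_key, account, amount):
        # A retry with a known key returns the original result and does nothing else.
        if idempotency_key in self.results_by_key:
            return self.results_by_key[idempotency_key]
        self.charges.append((account, amount))
        result = {"charge_id": len(self.charges), "status": "ok"}
        self.results_by_key[idempotency_key] = result
        return result


svc = PaymentService()
first = svc.charge("req-123", "acct-1", 50)
retry = svc.charge("req-123", "acct-1", 50)   # client retry, same key
other = svc.charge("req-456", "acct-1", 50)   # genuinely new request
```

The retry returns the first response without charging again, while a different key with identical arguments is treated as a new charge.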
Transactional Writer #
- Closest archetype:
- critical transaction process
- workflow state + durable commit
- Mechanism family:
- transactional sink write
- atomic write of data and progress
- Interview relevance:
- outbox-like guarantees
- exactly-once sink semantics
Proxy #
- Closest archetype:
- immutable dataset publish through mutable pointer
- versioned namespace + head pointer
- Mechanism family:
- write new version elsewhere
- publish by pointer switch
- Interview relevance:
- blue/green datasets
- snapshot publishing
- “don’t overwrite, publish new version”
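A minimal sketch of "write new version elsewhere, publish by pointer switch", assuming an in-memory store where `head` plays the role of the mutable pointer. The class name and methods are illustrative.

```python
class VersionedDataset:
    def __init__(self):
        self.versions = {}  # version id -> immutable rows
        self.head = None    # the only mutable piece: which version readers see

    def write_version(self, version_id, rows):
        if version_id in self.versions:
            raise ValueError("versions are immutable; write a new one instead")
        self.versions[version_id] = tuple(rows)

    def publish(self, version_id):
        if version_id not in self.versions:
            raise KeyError(version_id)
        self.head = version_id  # single pointer switch

    def read(self):
        return self.versions[self.head]


ds = VersionedDataset()
ds.write_version("v1", [1, 2])
ds.publish("v1")

ds.write_version("v2", [1, 2, 3])  # readers still see v1 while v2 is being written
before = ds.read()
ds.publish("v2")
after = ds.read()

ds.publish("v1")  # rollback is just another pointer switch
rolled_back = ds.read()
```

Readers never observe a half-written dataset, and rollback costs nothing more than repointing `head`.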
4. Data Value Patterns #
Static Joiner / Dynamic Joiner #
- Closest archetype:
- derived projection
- enrichment pipeline
- Mechanism family:
- lookup join or stateful join
- enrich stream/batch records
- Interview relevance:
- denormalized read models
- recommendation/search features
Wrapper / Metadata Decorator #
- Closest archetype:
- derived projection
- content envelope / event envelope
- Mechanism family:
- attach metadata around payload
- Interview relevance:
- event schema evolution
- observability metadata
- transport envelopes
Distributed Aggregator / Local Aggregator #
- Closest archetype:
- derived projection
- ranking/leaderboard/trending
- Mechanism family:
- partial aggregation then combine
- local pre-aggregation to reduce hot writes
- Interview relevance:
- counters
- analytics
- popularity
- stream reductions
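Partial aggregation followed by a combine step might look like the sketch below, where each worker pre-aggregates its own events and only the small partial results are merged. The worker inputs are made-up sample data.

```python
from collections import Counter


def local_aggregate(events):
    """Pre-aggregate on one worker so only one partial per key leaves it."""
    return Counter(events)


def combine(partials):
    """Merge per-worker partial counts into global totals."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


worker_inputs = [["a", "b", "a"], ["a", "c"], ["b", "b"]]
partials = [local_aggregate(events) for events in worker_inputs]
totals = combine(partials)
```

Seven raw events turn into at most a handful of per-key writes to the shared total, which is the mechanism behind reducing hot writes on popular keys.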
Incremental Sessionizer / Stateful Sessionizer #
- Closest archetype:
- temporal grouping projection
- stateful streaming workflow
- Mechanism family:
- keyed session state
- inactivity timeout
- Interview relevance:
- user activity analytics
- event stream segmentation
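Keyed session state with an inactivity timeout can be sketched in batch form as below. The `gap` parameter and the `(user, timestamp)` event shape are assumptions; a streaming version would keep the same per-key state and close sessions on a timer rather than by sorting.

```python
def sessionize(events, gap):
    """Group (user, timestamp) events into per-user sessions split by inactivity gaps."""
    sessions = {}
    for user, ts in sorted(events):
        user_sessions = sessions.setdefault(user, [])
        if user_sessions and ts - user_sessions[-1][-1] <= gap:
            user_sessions[-1].append(ts)  # still within the inactivity timeout
        else:
            user_sessions.append([ts])    # timeout exceeded: start a new session
    return sessions


sessions = sessionize(
    [("u1", 0), ("u1", 20), ("u2", 5), ("u1", 100), ("u1", 110)],
    gap=30,
)
```

User `u1` ends up with two sessions because the 80-unit pause exceeds the 30-unit timeout.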
Bin Pack Orderer / FIFO Orderer #
- Closest archetype:
- batching / ordering pipeline
- append log processing discipline
- Mechanism family:
- reorder-by-batch or strict FIFO
- Interview relevance:
- throughput vs ordering tradeoff
- queue/stream consumption strategy
5. Data Flow Patterns #
Local Sequencer / Isolated Sequencer #
- Closest archetype:
- workflow / orchestration
- control plane sequencing
- Mechanism family:
- explicit task order in same runner
- isolated stages with explicit dependencies
- Interview relevance:
- DAG orchestration
- multi-step processes
Aligned Fan-In / Unaligned Fan-In #
- Closest archetype:
- fan-in join / barrier coordination
- Mechanism family:
- wait for all aligned inputs
- accept partial/misaligned arrival with coordination logic
- Interview relevance:
- stream joins
- multi-source ETL synchronization
Parallel Split / Exclusive Choice #
- Closest archetype:
- fan-out branching
- conditional workflow path selection
- Mechanism family:
- branch to many
- route to one based on predicate
- Interview relevance:
- routing pipelines
- rules engines
- notification delivery paths
Single Runner / Concurrent Runner #
- Closest archetype:
- claimable run execution model
- workflow concurrency control
- Mechanism family:
- one runner at a time
- parallel task execution
- Interview relevance:
- schedulers
- Airflow/Cadence/Temporal style orchestration
6. Data Storage Patterns #
Horizontal Partitioner #
- Closest archetype:
- partitioned storage
- hot-range/hot-key scale control
- Mechanism family:
- split by time/key/tenant/hash
- Interview relevance:
- scaling writes
- pruning reads
- tenant isolation
Vertical Partitioner #
- Closest archetype:
- hot/cold split
- metadata/content separation
- Mechanism family:
- split wide row/object by column or concern
- Interview relevance:
- blob metadata vs blob bytes
- hot fields vs cold payload
Bucket #
- Closest archetype:
- write sharding
- hot-key mitigation
- Mechanism family:
- split one logical hot key across many physical keys
- Interview relevance:
- hot tags
- counters
- leaderboard partials
- DDB/Spanner hot partitions
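Splitting one logical hot key across many physical keys can be sketched as a sharded counter: writes pick a random shard, reads sum all shards. The `key#shard` naming and the dict-backed store are illustrative assumptions.

```python
import random


class ShardedCounter:
    def __init__(self, num_shards, rng=None):
        self.num_shards = num_shards
        self.store = {}  # physical key -> count (stands in for a KV store)
        self.rng = rng or random.Random()

    def increment(self, key, amount=1):
        # Spread writes for one logical key over several physical keys.
        shard = self.rng.randrange(self.num_shards)
        physical_key = f"{key}#{shard}"
        self.store[physical_key] = self.store.get(physical_key, 0) + amount

    def read(self, key):
        # Reads pay the cost: they must gather every shard.
        return sum(
            self.store.get(f"{key}#{shard}", 0) for shard in range(self.num_shards)
        )


counter = ShardedCounter(num_shards=8, rng=random.Random(0))
for _ in range(1000):
    counter.increment("hot-tag")
total = counter.read("hot-tag")
```

The design trades cheaper, less contended writes for a more expensive read, which suits counters that are written far more often than they are read.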
Sorter #
- Closest archetype:
- storage layout optimization
- ordered projection support
- Mechanism family:
- sorted write layout for scan/read efficiency
- Interview relevance:
- Parquet/warehouse read pruning
- key-range scan efficiency
Metadata Enhancer #
- Closest archetype:
- derived metadata projection
- Mechanism family:
- attach summary/index metadata to improve read pruning
- Interview relevance:
- manifests
- file statistics
- pruning and skipping
Dataset Materializer #
- Closest archetype:
- derived projection
- snapshot publish
- Mechanism family:
- precompute read-optimized dataset
- Interview relevance:
- dashboards
- search/index docs
- warehouse marts
Manifest #
- Closest archetype:
- versioned namespace + immutable content units
- publish-through-metadata pointer
- Mechanism family:
- explicit list of immutable files/chunks forming a version
- Interview relevance:
- lakehouse metadata
- Dropbox-like manifests
- immutable version publish
Normalizer / Denormalizer #
- Closest archetype:
- storage/read model shaping
- normalized write truth vs denormalized read projection
- Mechanism family:
- normalize for integrity
- denormalize for performance
- Interview relevance:
- read/write model split
- OLTP vs OLAP shape
7. Data Quality Patterns #
Audit-Write-Audit-Publish #
- Closest archetype:
- critical publish workflow
- validation gate before external visibility
- Mechanism family:
- audit input
- write output
- audit output
- publish only if valid
- Interview relevance:
- regulated pipelines
- quality gates before consumer exposure
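The four steps above can be sketched as a function that only swaps the published dataset when both audits pass. The audit rules, the `to_cents` transform, and the list standing in for the consumer-visible table are all assumptions for the example.

```python
def audit_write_audit_publish(rows, transform, audit_input, audit_output, published):
    """Publish transformed rows only if both the input and output audits pass."""
    if not audit_input(rows):
        return False
    staged = [transform(r) for r in rows]  # write to a staging area, not to consumers
    if not audit_output(rows, staged):
        return False
    published.clear()
    published.extend(staged)               # publish only after both audits pass
    return True


def to_cents(row):
    return {"id": row["id"], "amount_cents": round(row["amount"] * 100)}


def input_ok(rows):
    return bool(rows) and all(r.get("amount") is not None for r in rows)


def output_ok(rows, staged):
    return len(staged) == len(rows) and all(s["amount_cents"] >= 0 for s in staged)


published = []
good = audit_write_audit_publish(
    [{"id": 1, "amount": 2.5}], to_cents, input_ok, output_ok, published
)
bad = audit_write_audit_publish(
    [{"id": 2, "amount": None}], to_cents, input_ok, output_ok, published
)
```

The second batch fails the input audit, so consumers keep seeing the last good dataset rather than a partially valid one.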
Constraints Enforcer #
- Closest archetype:
- invariant enforcement
- data contract enforcement
- Mechanism family:
- validate before accept/publish
- Interview relevance:
- schema/data constraints
- business rule enforcement
Schema Compatibility Enforcer / Schema Migrator #
- Closest archetype:
- contract/version management
- versioned namespace / schema evolution
- Mechanism family:
- compatibility checks
- coordinated migration
- Interview relevance:
- event schemas
- API evolution
- CDC compatibility
Offline Observer / Online Observer #
- Closest archetype:
- observability / audit sidecar
- Mechanism family:
- batch or realtime quality monitors
- Interview relevance:
- data quality checks
- pipeline guardrails
8. Data Observability Patterns #
Flow Interruption Detector #
- Closest archetype:
- liveness detector
- pipeline monitoring
- Mechanism family:
- missing heartbeat / missing data detector
- Interview relevance:
- stuck pipeline alerting
- “why is data not arriving?”
Skew Detector #
- Closest archetype:
- hot partition / skew observability
- Mechanism family:
- detect uneven load/data distribution
- Interview relevance:
- Flink/Spark/Kafka skew
- DDB/Spanner hot partition symptoms
Lag Detector #
- Closest archetype:
- consumer progress monitoring
- replication freshness monitoring
- Mechanism family:
- source offset/time minus consumer offset/time
- Interview relevance:
- Kafka lag
- CDC lag
- replication delay
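Offset-based lag, "source offset minus consumer offset", can be computed per partition as in this sketch. The offset maps and the alert threshold are made-up sample values, and a partition with no committed offset is assumed to start at 0.

```python
def partition_lag(end_offsets, committed_offsets):
    """Lag per partition: latest source offset minus last committed consumer offset."""
    return {
        partition: end - committed_offsets.get(partition, 0)
        for partition, end in end_offsets.items()
    }


def lagging_partitions(lag, threshold):
    return sorted(p for p, value in lag.items() if value > threshold)


lag = partition_lag(
    end_offsets={0: 1000, 1: 1500, 2: 900},
    committed_offsets={0: 990, 1: 700},  # partition 2 has never committed
)
alerts = lagging_partitions(lag, threshold=100)
```

Reporting lag per partition rather than as a single total also helps separate a uniformly slow consumer from one stuck or skewed partition.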
SLA Misses Detector #
- Closest archetype:
- workflow deadline monitoring
- service level observability
- Mechanism family:
- compare expected completion/arrival vs actual
- Interview relevance:
- batch deadlines
- delayed notifications
- stale projections
Dataset Tracker / Fine-Grained Tracker #
- Closest archetype:
- lineage/provenance tracking
- audit metadata graph
- Mechanism family:
- dataset-level lineage
- record/file-level provenance
- Interview relevance:
- debugging data issues
- governance/compliance
- blast-radius analysis
Most Useful Patterns For Interview Prep #
If you only remember a small subset, remember these:
- Change Data Capture
- Dead-Letter
- Windowed Deduplicator
- Checkpointer
- Keyed Idempotency
- Compactor
- Bucket
- Manifest
- Readiness Marker
- Lag Detector
These cover a lot of the recurring “hard” infra/data questions in interviews.
How This Complements Existing Cheat Sheets #
Use this note alongside:
- archetype-state-actors-failure-scale-reference.md
- infra-actions-entities-mechanisms-by-archetype-cheat-sheet.md
- role-based-failure-and-mitigation-phrase-sheet.md
- role-based-scaling-bottleneck-and-mitigation-phrase-sheet.md
The existing notes are archetype-first. This note is data-pattern-first.