Data Engineering Design Patterns Cheat Sheet #

Source: Data Engineering Design Patterns by Bartosz Konieczny (O’Reilly, 2025).

This note extracts the patterns most useful for system design interviews, organized by concern. Each pattern includes the core idea, the key consequence/tradeoff, and which interview archetype it maps to.

1. Data Ingestion Patterns #

Full Loader #

overwrites the entire dataset each run
simple but costly at scale
consistency risk: consumers may see partial data during the overwrite window
mitigation: use a view/proxy abstraction — write to a staging table, then atomically swap the view

Maps to: any archetype with periodic bulk refresh (search index rebuild, cache warm)

Incremental Loader #

processes only new data since last run
two implementations: delta column (filter by ingestion_time) or partition-based (process next time partition)
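risk: using event_time as the delta column misses late data, because a late record carries an event_time older than the last processed value and is filtered out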
backfill risk: reprocessing a wide range accidentally becomes a full load
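
The delta-column variant can be sketched as follows. This is a minimal illustration, assuming in-memory dict records with an ingestion_time field; the function name load_increment and the record layout are hypothetical, not taken from the book.

```python
from typing import Dict, List, Tuple


def load_increment(records: List[Dict], last_ingestion_time: int) -> Tuple[List[Dict], int]:
    """Return records ingested after the last run, plus the new high-water mark."""
    # Filter on ingestion_time (when the row landed), not event_time (when it happened),
    # so a record that arrives late is still picked up by the next run.
    new_records = [r for r in records if r["ingestion_time"] > last_ingestion_time]
    new_mark = max((r["ingestion_time"] for r in new_records), default=last_ingestion_time)
    return new_records, new_mark


records = [
    {"id": 1, "event_time": 10, "ingestion_time": 100},
    {"id": 2, "event_time": 5, "ingestion_time": 120},  # late event, ingested later
]
batch, mark = load_increment(records, last_ingestion_time=100)
```

The returned mark would be persisted and passed to the next run. Filtering on event_time instead would have skipped record 2.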

Maps to: projection workers (archetype 10), CDC pipelines

Change Data Capture (CDC) #

streams changes directly from the database commit log (WAL)
captures inserts, updates, and hard deletes
lower latency than incremental loader
consequence: converts data-at-rest into data-in-motion — downstream must handle streaming semantics (e.g., temporal joins may produce no match because the other stream hasn’t arrived yet)
typical implementation: Debezium + Kafka Connect

Maps to: any event-sourced pipeline, CQRS read-model updates, archetype 10 (projection worker)

Compactor #

merges many small files into fewer large files
solves the small files problem: too many small files degrade listing, planning, and read performance
tradeoff: compaction adds write amplification and may temporarily double storage
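
One way to plan a compaction run is a greedy bin-pack over file sizes. The sketch below assumes file sizes are already known from a listing; plan_compaction and target_bytes are illustrative names, and real table formats ship their own compaction procedures.

```python
from typing import Dict, List


def plan_compaction(file_sizes: Dict[str, int], target_bytes: int) -> List[List[str]]:
    """Group small files into batches whose combined size stays within target_bytes."""
    batches: List[List[str]] = []
    current: List[str] = []
    current_size = 0

    def flush() -> None:
        # Rewriting a single file gains nothing, so only keep batches of two or more.
        if len(current) > 1:
            batches.append(list(current))

    # Largest-first keeps the greedy packing reasonably tight.
    for name, size in sorted(file_sizes.items(), key=lambda kv: kv[1], reverse=True):
        if size >= target_bytes:
            continue  # already large enough; leave it alone
        if current and current_size + size > target_bytes:
            flush()
            current, current_size = [], 0
        current.append(name)
        current_size += size
    flush()
    return batches
```

Each returned batch would be rewritten as one larger file, which is where the write amplification and temporary extra storage come from.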

Maps to: append log systems (archetype 22), time-series ingest (archetype 24)

Readiness Marker #

gate that prevents processing until upstream data is complete
implementations: sentinel file (_SUCCESS marker), partition metadata check, sensor task in orchestrator
without it: pipeline processes partial data and produces incorrect results
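
The sentinel-file variant reduces to a file existence check. A minimal sketch, assuming the upstream job writes an empty _SUCCESS file last; is_partition_ready is a hypothetical helper name.

```python
from pathlib import Path


def is_partition_ready(partition_dir: Path, marker_name: str = "_SUCCESS") -> bool:
    """True only when the upstream job has written its completion marker."""
    return (partition_dir / marker_name).is_file()


def run_if_ready(partition_dir: Path, job) -> bool:
    """Run the job only when the marker exists; otherwise skip and report False."""
    if not is_partition_ready(partition_dir):
        return False  # upstream not finished: do not process partial data
    job(partition_dir)
    return True
```

In an orchestrator this check would typically live in a sensor task that polls until it returns True.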

Maps to: any batch pipeline with upstream dependencies

2. Error Management Patterns #

Dead-Letter #

catch unprocessable records, route them to a separate store instead of failing the whole pipeline
the pipeline continues processing valid records
consequence — snowball backfilling: if you replay dead-lettered records, downstream consumers may need to reprocess too
consequence — ordering break: replayed records arrive out of order
the dead-letter store needs its own monitoring (count, rate, freshness)
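
A minimal sketch of the routing logic, assuming records are processed one at a time in memory; process_with_dead_letter and the dead-letter record layout are illustrative.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple


def process_with_dead_letter(
    records: Iterable[Any], transform: Callable[[Any], Any]
) -> Tuple[List[Any], List[Dict[str, Any]]]:
    """Apply transform to each record; route failures aside instead of failing the run."""
    processed: List[Any] = []
    dead_letters: List[Dict[str, Any]] = []
    for record in records:
        try:
            processed.append(transform(record))
        except Exception as exc:
            # Keep the original record and the reason so it can be inspected or replayed.
            dead_letters.append({"record": record, "error": repr(exc)})
    return processed, dead_letters


ok, failed = process_with_dead_letter(["1", "x", "3"], int)
```

The length of the dead-letter list is the natural first metric to monitor. Replaying it later is what triggers the ordering and backfill consequences listed above.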

Maps to: archetypes 10 (projection worker), 15 (fan-out delivery), 22 (append log + consumer)

Windowed Deduplicator #

eliminates duplicate records within a bounded window
batch: use DISTINCT or ROW_NUMBER() OVER (PARTITION BY …)
streaming: maintain a state store of seen keys for the window duration; evict after watermark advances
tradeoff — space vs time: longer dedup window catches more duplicates but requires more state
deduplication does NOT guarantee exactly-once delivery — retries after the dedup step can still produce duplicates downstream
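
The streaming variant can be sketched with an in-memory state store. This assumes integer event times and a single process; WindowedDeduplicator is a hypothetical class, and a real stream processor would keep this state in its managed, checkpointed store.

```python
from typing import Dict, Hashable


class WindowedDeduplicator:
    """Drop keys already seen within the last `window` units of event time."""

    def __init__(self, window: int) -> None:
        self.window = window
        self.seen: Dict[Hashable, int] = {}  # key -> event time when first seen
        self.watermark = float("-inf")

    def accept(self, key: Hashable, event_time: int) -> bool:
        """Return True if the record should be emitted, False if it is a duplicate."""
        # The watermark only moves forward.
        self.watermark = max(self.watermark, event_time - self.window)
        # Evict keys that fell behind the watermark; this bounds the state size.
        self.seen = {k: t for k, t in self.seen.items() if t >= self.watermark}
        if key in self.seen:
            return False
        self.seen[key] = event_time
        return True
```

A duplicate that arrives after its key has been evicted passes through, which is the space-versus-time tradeoff in practice.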

Maps to: any at-least-once pipeline, archetypes 10, 15, 22

Late Data Detector (Watermark) #

tracks the maximum event time seen so far, subtracts an allowed lateness to produce the watermark
any event older than the watermark is classified as late
watermark must be monotonically increasing (use MAX, not MIN, at partition level)
stuck-in-the-past risk: if MIN is used, late data can pull the watermark backward, reopening already-emitted windows and growing state unboundedly
MAX strategy risk: in skewed environments, fast partitions advance the watermark and cause slow partitions to be classified as late
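
A single-partition sketch of the detector, assuming numeric event times; WatermarkTracker is an illustrative name. It tracks the maximum event time, so the watermark cannot move backward.

```python
from typing import Optional


class WatermarkTracker:
    """Classify events as on-time or late against MAX(event_time) - allowed_lateness."""

    def __init__(self, allowed_lateness: int) -> None:
        self.allowed_lateness = allowed_lateness
        self.max_event_time: Optional[int] = None

    @property
    def watermark(self) -> Optional[int]:
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.allowed_lateness

    def observe(self, event_time: int) -> bool:
        """Return True if the event is on time, False if it is late."""
        current = self.watermark
        is_late = current is not None and event_time < current
        # Only a newer event time can advance the maximum, so the watermark is monotonic.
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        return not is_late
```

With several partitions, each would hold its own tracker, and the combining rule (MIN or MAX across partitions) decides between the two risks described above.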

Maps to: any windowed streaming system, archetypes 22, 24

Late Data Integrator (Static and Dynamic) #

static: reprocess a fixed historical window to absorb late records (simple but wasteful)
dynamic: track which versions have been processed; only reprocess partitions that received late data
dynamic variant requires a version state table and adds complexity
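
The dynamic variant boils down to comparing two version maps. A minimal sketch, assuming each partition exposes a version number that increases whenever data is written to it; the function and dictionary layout are hypothetical.

```python
from typing import Dict, List


def partitions_to_reprocess(
    current_versions: Dict[str, int], processed_versions: Dict[str, int]
) -> List[str]:
    """Return partitions whose current version differs from the last processed one."""
    return sorted(
        partition
        for partition, version in current_versions.items()
        if processed_versions.get(partition) != version
    )


current = {"2024-01-01": 3, "2024-01-02": 1, "2024-01-03": 1}
processed = {"2024-01-01": 2, "2024-01-02": 1}  # contents of the version state table
todo = partitions_to_reprocess(current, processed)
```

After a successful run, the state table would be updated to the current versions so that the next run only sees newly changed partitions.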

Maps to: projection workers that must be complete (archetype 10)

Checkpointer #

persists processing position (offsets) and computed state to durable storage
two implementations: framework-managed (Flink barriers, Spark checkpoint location) or SDK-managed (Kafka consumer commit)
tradeoff — delivery guarantee vs latency: frequent checkpoints = slower but less data to reprocess on failure; infrequent = faster but more reprocessing
exactly-once is an illusion: parallel tasks can fail between checkpoint intervals, causing replay of already-successful records
checkpointing alone gives at-least-once; combine it with an idempotency pattern to get effectively exactly-once results
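
The replay behavior can be shown with a small SDK-style sketch. It assumes the log is a Python list and the checkpoint is a local JSON file; load_offset, commit_offset, and consume are illustrative names, not a framework API.

```python
import json
from pathlib import Path
from typing import Callable, List


def load_offset(path: Path) -> int:
    """Read the last committed offset, defaulting to the start of the log."""
    return json.loads(path.read_text())["offset"] if path.exists() else 0


def commit_offset(path: Path, offset: int) -> None:
    """Write the offset to a temp file, then rename it over the checkpoint."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps({"offset": offset}))
    tmp.replace(path)


def consume(log: List[str], checkpoint: Path, handle: Callable[[str], None],
            checkpoint_every: int = 2) -> None:
    """Process first, commit afterwards: at-least-once delivery."""
    for position in range(load_offset(checkpoint), len(log)):
        handle(log[position])
        if (position + 1) % checkpoint_every == 0:
            commit_offset(checkpoint, position + 1)
    commit_offset(checkpoint, len(log))
```

If handle raises between two commits, the next call to consume restarts from the last committed offset and replays the records processed since then. That replay is why the handler needs to be idempotent.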

Maps to: archetype 22 (append log + consumer progress) — this IS the offset discipline

Filter Interceptor #

route records to different processing paths based on classification rules
useful for quality tiers: valid records proceed, suspect records get extra validation, invalid records go to dead-letter

Maps to: any pipeline with data quality gates

3. Idempotency Patterns #

Fast Metadata Cleaner #

use metadata operations (DROP PARTITION, TRUNCATE) instead of row-level DELETE for cleanup before rewrite
partition the dataset so each run’s output is isolated — then overwrite the partition atomically
much faster than scanning and deleting individual rows

Maps to: any batch write that must be rerunnable (archetypes 10, 15)

Data Overwrite #

full physical overwrite of the output dataset each run
simple idempotency but expensive for large datasets
consumers see no data during the overwrite window unless you use a proxy/view abstraction

Maps to: full-refresh projections, reference data loads

Merger (Upsert) #

MERGE / INSERT … ON CONFLICT … DO UPDATE for incremental idempotent writes
handles both new and updated records in one operation
consequence: MERGE is more expensive than INSERT because it must check for existing keys
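
A runnable sketch of the upsert form, using Python's built-in sqlite3 module. It assumes the bundled SQLite is version 3.24 or later (where ON CONFLICT ... DO UPDATE is available); the users table and upsert_users helper are made up for illustration.

```python
import sqlite3
from typing import Iterable, Tuple


def upsert_users(conn: sqlite3.Connection, rows: Iterable[Tuple[int, str]]) -> None:
    """Insert new users and update existing ones in a single statement."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO users (id, email) VALUES (?, ?) "
            "ON CONFLICT(id) DO UPDATE SET email = excluded.email",
            rows,
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
upsert_users(conn, [(1, "a@example.com"), (2, "b@example.com")])
upsert_users(conn, [(2, "b2@example.com"), (3, "c@example.com")])
# Rerunning the same batch leaves the table unchanged: the write is idempotent.
upsert_users(conn, [(2, "b2@example.com"), (3, "c@example.com")])
```

The key lookup that makes the rerun harmless is also the extra cost compared with a plain INSERT.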

Maps to: any incremental write path, CQRS read-model updates

Keyed Idempotency #

generate a deterministic key from immutable attributes so the same input always produces the same key
key-value stores naturally deduplicate on key (UPSERT semantics)
critical: build the key from immutable attributes (append_time, ingestion_time), not from values that can shift such as event_time; a late record can change a derived value like a session’s earliest event_time and break key stability
Kafka append-only log does not deduplicate at write time; relies on async compaction, so consumers may see temporary duplicates
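
A minimal sketch of key generation plus a key-value write, assuming a plain dict stands in for the store; idempotency_key, write_session, and the attribute choice (user_id plus append_time) are illustrative.

```python
import hashlib
from typing import Dict


def idempotency_key(user_id: str, append_time: int) -> str:
    """Derive a stable key from attributes that never change after ingestion."""
    raw = f"{user_id}|{append_time}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


def write_session(store: Dict[str, dict], user_id: str, append_time: int, payload: dict) -> None:
    """Writing the same input twice overwrites the same key instead of adding a row."""
    store[idempotency_key(user_id, append_time)] = payload


store: Dict[str, dict] = {}
write_session(store, "u1", 1700000000, {"clicks": 3})
write_session(store, "u1", 1700000000, {"clicks": 3})  # retry of the same write
```

After the retry the store still holds one entry, because both writes resolve to the same key.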

Maps to: session aggregation, any keyed state store, archetype 10

Transactional Writer #

wrap writes in a transaction so consumers only see complete results (commit) or nothing (rollback)
distributed variant: either per-task local transactions (weaker — job retry replays committed tasks) or whole-job transactions (stronger — needs coordinator)
all-or-nothing semantics prevent partial data exposure
idempotency is scoped to the current transaction — backfill creates a new transaction and may reinsert

Maps to: any write path where partial visibility is dangerous, archetype 10 (projection worker)

Proxy #

expose data through an abstraction (view, alias) that can be atomically swapped
write to a new version of the underlying table, then swap the proxy to point to the new version
consumers never see partial data during the write; they keep reading the previous complete version until the swap
enables immutable dataset design: each run creates a new version, old versions preserved for rollback
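
The swap can be sketched with SQLite views. This assumes a database where DDL can run inside a transaction (SQLite allows it); publish_version and the orders_v1/orders_v2 table names are hypothetical, and the identifiers are interpolated directly, so they must come from trusted code.

```python
import sqlite3


def publish_version(conn: sqlite3.Connection, view_name: str, table_name: str) -> None:
    """Repoint the view at a new underlying table inside one transaction."""
    conn.execute("BEGIN")
    try:
        conn.execute(f"DROP VIEW IF EXISTS {view_name}")
        conn.execute(f"CREATE VIEW {view_name} AS SELECT * FROM {table_name}")
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise


# isolation_level=None puts the connection in autocommit mode so BEGIN/COMMIT are explicit.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE orders_v1 (id INTEGER)")
conn.execute("INSERT INTO orders_v1 VALUES (1)")
publish_version(conn, "orders", "orders_v1")

# The next run writes a complete new version, then swaps the proxy.
conn.execute("CREATE TABLE orders_v2 (id INTEGER)")
conn.executemany("INSERT INTO orders_v2 VALUES (?)", [(1,), (2,)])
publish_version(conn, "orders", "orders_v2")
```

Because orders_v1 is left in place, rolling back is another call to publish_version pointing at the old table.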

Maps to: any system that serves reads during writes (search index swap, config rollout)

4. Data Value Patterns #

Distributed Aggregator (Shuffle + Reduce) #

bring related records together across machines via network exchange (shuffle), then aggregate
shuffle is the dominant latency and cost driver in distributed processing
data skew: if one key has disproportionately more records, its reducer becomes the bottleneck
mitigation: salting — add a random salt to the key, aggregate with salt, then aggregate again without salt (two-phase aggregation)
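
Two-phase aggregation with salting, sketched in a single process. The dictionaries stand in for what would be separate shuffle stages on a cluster; salted_sum, num_salts, and the seeded random generator are assumptions made for the example.

```python
import random
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def salted_sum(records: Iterable[Tuple[str, int]], num_salts: int, seed: int = 0) -> Dict[str, int]:
    """Sum values per key in two phases so a hot key is spread over several groups."""
    rng = random.Random(seed)

    # Phase 1: aggregate on (key, salt). A hot key is split across up to num_salts groups.
    partial: Dict[Tuple[str, int], int] = defaultdict(int)
    for key, value in records:
        partial[(key, rng.randrange(num_salts))] += value

    # Phase 2: drop the salt and combine the much smaller partial results.
    final: Dict[str, int] = defaultdict(int)
    for (key, _salt), subtotal in partial.items():
        final[key] += subtotal
    return dict(final)


records = [("hot", 1)] * 1000 + [("cold", 5)]
totals = salted_sum(records, num_salts=8)
```

This works for aggregations that can be combined from partial results, such as sums and counts. The price is a second aggregation step.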

Maps to: any fan-in aggregation, archetypes 6 (analytics read model), 24 (metrics pipeline)

Local Aggregator #

if data is already co-partitioned (e.g., Kafka partitions align with group-by keys), aggregate locally without shuffle
eliminates network exchange entirely
only works when input partitioning matches the aggregation key

Maps to: stream processing optimization, archetype 22 where partition key = aggregation key

Sessionization (Incremental and Stateful) #

incremental sessionizer: batch pattern — combine current input with pending sessions from previous run, emit completed sessions, carry forward open sessions
stateful sessionizer: streaming pattern — accumulate events per session key in state store, emit on inactivity timeout
three phases: initialization (session start), accumulation (new events), finalization (timeout/close event)
state management is the hard part: state grows with active sessions and must be checkpointed
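
A sketch of the incremental (batch) sessionizer, assuming integer timestamps and an inactivity gap as the only close condition; sessionize and its argument layout are illustrative.

```python
from typing import Dict, List, Tuple

Session = Tuple[str, List[int]]


def sessionize(events: List[Tuple[str, int]], pending: Dict[str, List[int]],
               gap: int, now: int) -> Tuple[List[Session], Dict[str, List[int]]]:
    """Merge this run's events with open sessions carried over from the previous run."""
    open_sessions = {user: list(ts) for user, ts in pending.items()}
    completed: List[Session] = []
    for user, ts in sorted(events, key=lambda e: e[1]):
        session = open_sessions.get(user)
        if session is not None and ts - session[-1] > gap:
            completed.append((user, session))  # inactivity gap exceeded: close it
            session = None
        if session is None:
            session = []  # initialization: start a new session
            open_sessions[user] = session
        session.append(ts)  # accumulation
    # Finalization: close sessions that have been idle longer than the gap.
    for user in list(open_sessions):
        if now - open_sessions[user][-1] > gap:
            completed.append((user, open_sessions.pop(user)))
    return completed, open_sessions


done, still_open = sessionize(
    events=[("u1", 110), ("u2", 200), ("u1", 500)],
    pending={"u1": [100]}, gap=60, now=520,
)
```

The second return value is the state that must be persisted and fed into the next run as pending.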

Maps to: archetypes 6 (analytics), 10 (projection worker), user session tracking

Data Ordering (Bin Pack and FIFO) #

bin pack orderer: organize output files by size for efficient downstream reading (avoid small files)
FIFO orderer: preserve input ordering through the pipeline
ordering adds cost — only enforce when downstream correctness requires it

Maps to: any pipeline where ordering matters (archetype 22 per-partition ordering)

5. Data Flow Patterns #

Local Sequencer #

decompose a monolithic pipeline into sequential tasks within the same workflow
benefit: individual task retry without re-running the entire pipeline
rule of thumb: put boundaries at restart points and between expensive operations

Maps to: any multi-step pipeline design

Isolated Sequencer #

cross-pipeline dependencies: pipeline B triggers only after pipeline A completes
implementations: shared data readiness markers, event-driven triggers, orchestrator cross-DAG sensors
enables team-level ownership boundaries

Maps to: any system with multiple teams producing/consuming data

Fan-In (Aligned and Unaligned) #

aligned: all upstream dependencies must complete before the downstream task runs
unaligned: downstream task runs as soon as any upstream completes
aligned guarantees completeness; unaligned reduces latency but may process partial inputs

Maps to: any aggregation that depends on multiple sources

Fan-Out (Parallel Split and Exclusive Choice) #

parallel split: one input feeds multiple independent downstream tasks
exclusive choice: route input to exactly one downstream based on a condition
parallel split is the write fan-out pattern; exclusive choice is content-based routing

Maps to: archetype 15 (fan-out delivery worker), event-driven architectures

6. Data Storage Patterns #

Horizontal Partitioner #

divide data by a low-cardinality attribute (date, region) into physically isolated partitions
enables partition pruning: queries touching one date only scan that partition
risk: high-cardinality partition key creates too many small partitions (metadata overhead, small files)
risk: skewed partitions — one partition much larger than others becomes the bottleneck
mutability: changing the partition key normally requires rewriting all historical data (Apache Iceberg’s partition evolution changes only metadata, and existing files keep their old layout)
horizontal partitioning ≠ sharding: partitioning divides a dataset within one storage system; sharding distributes the data across separate machines

Maps to: any system that stores time-series or event data

Bucket #

hash-based grouping of records into a fixed number of buckets within a partition
useful for high-cardinality keys where partitioning would create too many directories
enables sort-merge joins without shuffle when both tables use the same bucketing scheme
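
Bucket assignment is a stable hash modulo the bucket count. A minimal sketch, assuming string keys and CRC32 as the hash; bucket_for is a hypothetical helper, and engines such as Spark or Hive use their own hash functions.

```python
import zlib


def bucket_for(key: str, num_buckets: int) -> int:
    """Map a key to one of num_buckets buckets, identically on every run and machine."""
    # zlib.crc32 is deterministic, unlike Python's built-in hash() for strings.
    return zlib.crc32(key.encode("utf-8")) % num_buckets
```

Two tables bucketed with the same function and the same bucket count place matching keys in the same bucket number, which is what lets a join skip the shuffle.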

Maps to: archetype 24 (metrics pipeline — high-cardinality time-series keys)

Dataset Materializer #

pre-compute and persist expensive query results for fast repeated access
tradeoff: storage cost and freshness lag vs query performance
materialized datasets may become stale — need a refresh strategy

Maps to: archetypes 6 (analytics read model), CQRS projections

Normalizer vs Denormalizer #

normalizer: eliminate redundancy, enforce consistency, minimize storage — but queries require joins
denormalizer: pre-join and duplicate data for fast reads — but updates must propagate to all copies
write-heavy systems favor normalization; read-heavy systems favor denormalization

Maps to: the fundamental read-path vs write-path tradeoff in every archetype

7. Data Quality Patterns #

Audit-Write-Audit-Publish (AWAP) #

four-phase pipeline: audit input → write output → audit output → publish to consumers
pre-write audit catches bad input before expensive processing
post-write audit catches processing bugs before consumers see the data
publish is the atomic visibility step (view swap, partition promotion)
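
The four phases can be sketched as one function with pluggable checks. This assumes everything fits in memory; run_awap and the callback signatures are illustrative.

```python
from typing import Any, Callable, List


def run_awap(rows: List[Any],
             transform: Callable[[Any], Any],
             audit_input: Callable[[List[Any]], bool],
             audit_output: Callable[[List[Any]], bool],
             publish: Callable[[List[Any]], None]) -> List[Any]:
    """Audit input, write to staging, audit output, and only then publish."""
    if not audit_input(rows):
        raise ValueError("input audit failed")   # stop before expensive processing
    staged = [transform(r) for r in rows]        # write phase: staging only
    if not audit_output(staged):
        raise ValueError("output audit failed")  # consumers never see this output
    publish(staged)                              # the single visibility step
    return staged


published: List[int] = []
run_awap([1, 2], lambda x: x * 2,
         audit_input=lambda rs: len(rs) > 0,
         audit_output=lambda rs: all(r > 0 for r in rs),
         publish=published.extend)
```

In a real pipeline, publish would be the view swap or partition promotion rather than a list append.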

Maps to: any pipeline where data quality directly affects users

Schema Compatibility Enforcer #

validate that upstream schema changes are backward-compatible before allowing ingestion
prevents breaking downstream consumers with unexpected column drops or type changes
typical implementations: schema registries (Confluent Schema Registry, AWS Glue Schema Registry)
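
A deliberately simplified compatibility check, assuming schemas are flat column-to-type dictionaries; breaking_changes is a hypothetical helper. Real schema registries apply richer rules, for example around defaults and type promotion.

```python
from typing import Dict, List


def breaking_changes(old_schema: Dict[str, str], new_schema: Dict[str, str]) -> List[str]:
    """List changes that would break a consumer written against old_schema."""
    problems: List[str] = []
    for column, old_type in old_schema.items():
        if column not in new_schema:
            problems.append(f"dropped column: {column}")
        elif new_schema[column] != old_type:
            problems.append(f"type change: {column} {old_type} -> {new_schema[column]}")
    # Added columns are ignored here: existing readers can skip them.
    return problems


old = {"id": "int", "email": "string"}
new = {"id": "long", "name": "string"}
issues = breaking_changes(old, new)
```

An ingestion gate would reject the new schema whenever the returned list is non-empty.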

Maps to: any event-driven system with schema evolution (archetypes 22, 10)

8. Data Observability Patterns #

Flow Interruption Detector #

detect when a data source stops producing data
continuous delivery: alert if no records arrive within expected interval
irregular delivery: alert if the gap exceeds the configured no-data window
implementations: metadata layer (last modification time), data layer (row count comparison), storage layer (file modification time)

Maps to: any pipeline with SLA on data freshness

Skew Detector #

compare current data volume against previous window — alert if difference exceeds threshold
catch incomplete datasets before processing
risk: seasonality creates false positives (for example, a normal weekend drop in volume looks like missing data)
risk: fatality loop — if today’s skewed data becomes tomorrow’s comparison baseline, valid data triggers a false alert
mitigation: compare against most recent successful run, not just the previous run
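
A sketch of the volume comparison and the baseline choice, assuming row counts are recorded per run along with a pass/fail flag; the function names and the history layout are illustrative.

```python
from typing import List, Optional, Tuple


def volume_skew(current_rows: int, baseline_rows: int, threshold: float) -> bool:
    """True when the relative volume change against the baseline exceeds the threshold."""
    if baseline_rows == 0:
        return current_rows > 0
    return abs(current_rows - baseline_rows) / baseline_rows > threshold


def last_successful_baseline(history: List[Tuple[int, bool]]) -> Optional[int]:
    """history holds (row_count, passed) per run, oldest first."""
    for rows, passed in reversed(history):
        if passed:
            return rows
    return None


history = [(1000, True), (400, False)]  # yesterday's run was skewed and rejected
baseline = last_successful_baseline(history)
```

Comparing today's 980 rows against the rejected run's 400 would raise an alert on valid data; comparing against the last successful run's 1000 does not.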

Maps to: any batch pipeline, archetype 24 (metrics ingest volume monitoring)

Lag Detector #

measure how far behind a consumer is from its producer
increasing lag = consumer can’t keep up = upcoming freshness/availability problem
critical metric for streaming systems
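
Lag is the difference between the producer's end offset and the consumer's committed offset, per partition. A minimal sketch, assuming offsets have already been fetched into dictionaries; consumer_lag and is_falling_behind are hypothetical helpers, not a client API.

```python
from typing import Dict, List


def consumer_lag(end_offsets: Dict[int, int], committed_offsets: Dict[int, int]) -> Dict[int, int]:
    """Per-partition lag: records written but not yet committed by the consumer."""
    return {
        partition: end - committed_offsets.get(partition, 0)
        for partition, end in end_offsets.items()
    }


def is_falling_behind(lag_samples: List[int]) -> bool:
    """True when lag grew at every consecutive sample, i.e. the consumer cannot keep up."""
    return len(lag_samples) >= 2 and all(
        later > earlier for earlier, later in zip(lag_samples, lag_samples[1:])
    )
```

Alerting on the trend rather than on a single value avoids paging on short bursts that the consumer recovers from on its own.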

Maps to: archetype 22 (consumer lag is the primary operational metric)

SLA Misses Detector #

track end-to-end pipeline completion time against committed SLA
alert before the SLA is breached, not after
useful for pipelines with downstream business commitments

Maps to: any pipeline with contractual freshness guarantees

Data Lineage (Dataset Tracker and Fine-Grained Tracker) #

dataset tracker: record which pipeline produced which dataset at which time
fine-grained tracker: record column-level lineage (which input columns feed which output columns)
essential for debugging data quality issues and understanding blast radius of upstream changes

Maps to: any system where you need to answer “where did this data come from?”

Cross-Cutting Concepts #

Delivery Semantics #

at-most-once: checkpoint before processing — lose data on failure
at-least-once: checkpoint after processing — duplicate data on failure
exactly-once: illusion built from at-least-once + idempotent writes

The book emphasizes: exactly-once is always a combination of checkpointing + idempotency, never checkpointing alone.

The Watermark #

watermark = MAX(event_time) − allowed_lateness
monotonically increasing — never moves backward
controls three things simultaneously: late data detection, state eviction, window firing
too aggressive (small lateness) = drops too many valid events
too lenient (large lateness) = state grows, windows fire late, freshness suffers

Data Skew #

unbalanced distribution of records across partitions or keys
the skewed partition/key determines overall job duration (stragglers)
mitigations: salting (two-phase aggregation), adaptive query execution, backpressure buffers
skew is both a processing concern (hot task) and a storage concern (hot partition)

Small Files Problem #

too many small files = slow listing, excessive metadata, poor read throughput
caused by: high-frequency ingestion, over-partitioning, streaming micro-batches
mitigation: compaction (merge small files into large files), bin-packing, coalesce before write

Pattern-to-Archetype Quick Reference #

Pattern	Most Relevant Archetypes
Full Loader	search index rebuild, cache warm, reference data
Incremental Loader	projection workers (10), CDC pipelines
CDC	event-sourced pipelines, CQRS, archetype 10
Compactor	append log (22), time-series (24)
Dead-Letter	projection (10), fan-out (15), append log (22)
Windowed Deduplicator	any at-least-once pipeline (10, 15, 22)
Watermark / Late Data	windowed streaming (22, 24)
Checkpointer	append log + consumer progress (22)
Keyed Idempotency	session aggregation, keyed state stores (10)
Transactional Writer	any write path with partial-visibility risk (10)
Distributed Aggregator	analytics (6), metrics pipeline (24)
Sessionization	analytics (6), projection (10)
Horizontal Partitioner	any time-series or event store
Bucket	high-cardinality keys, metrics (24)
Skew Detector	batch pipelines, metrics ingest (24)
Lag Detector	append log consumer monitoring (22)
AWAP	any pipeline with data quality SLA

Interview One-Liner #

For data pipeline design I think in terms of ingestion shape (full, incremental, CDC), error discipline (dead-letter, dedup, late data), idempotency strategy (overwrite, keyed, transactional), and observability (lag, skew, flow interruption). Each choice is a tradeoff between latency, completeness, and complexity.