Skip to main content
  1. System Design Components/

Data Engineering Design Patterns Cheat Sheet

Data Engineering Design Patterns Cheat Sheet #

Source: Data Engineering Design Patterns by Bartosz Konieczny (O’Reilly, 2025).

This note extracts the patterns most useful for system design interviews, organized by concern. Each pattern includes the core idea, the key consequence/tradeoff, and which interview archetype it maps to.


1. Data Ingestion Patterns #

Full Loader #

  1. overwrites the entire dataset each run
  2. simple but costly at scale
  3. consistency risk: consumers may see partial data during the overwrite window
  4. mitigation: use a view/proxy abstraction — write to a staging table, then atomically swap the view

Maps to: any archetype with periodic bulk refresh (search index rebuild, cache warm)

Incremental Loader #

  1. processes only new data since last run
  2. two implementations: delta column (filter by ingestion_time) or partition-based (process next time partition)
  3. risk: using event_time as delta column misses late data
  4. backfill risk: reprocessing a wide range accidentally becomes a full load

Maps to: projection workers (archetype 10), CDC pipelines

Change Data Capture (CDC) #

  1. streams changes directly from the database commit log (WAL)
  2. captures inserts, updates, and hard deletes
  3. lower latency than incremental loader
  4. consequence: converts data-at-rest into data-in-motion — downstream must handle streaming semantics (e.g., temporal joins may produce no match because the other stream hasn’t arrived yet)
  5. typical implementation: Debezium + Kafka Connect

Maps to: any event-sourced pipeline, CQRS read-model updates, archetype 10 (projection worker)

Compactor #

  1. merges many small files into fewer large files
  2. solves the small files problem: too many small files degrade listing, planning, and read performance
  3. tradeoff: compaction adds write amplification and may temporarily double storage

Maps to: append log systems (archetype 22), time-series ingest (archetype 24)

Readiness Marker #

  1. gate that prevents processing until upstream data is complete
  2. implementations: sentinel file (_SUCCESS marker), partition metadata check, sensor task in orchestrator
  3. without it: pipeline processes partial data and produces incorrect results

Maps to: any batch pipeline with upstream dependencies


2. Error Management Patterns #

Dead-Letter #

  1. catch unprocessable records, route them to a separate store instead of failing the whole pipeline
  2. the pipeline continues processing valid records
  3. consequence — snowball backfilling: if you replay dead-lettered records, downstream consumers may need to reprocess too
  4. consequence — ordering break: replayed records arrive out of order
  5. the dead-letter store needs its own monitoring (count, rate, freshness)

Maps to: archetypes 10 (projection worker), 15 (fan-out delivery), 22 (append log + consumer)

Windowed Deduplicator #

  1. eliminates duplicate records within a bounded window
  2. batch: use DISTINCT or ROW_NUMBER() OVER (PARTITION BY …)
  3. streaming: maintain a state store of seen keys for the window duration; evict after watermark advances
  4. tradeoff — space vs time: longer dedup window catches more duplicates but requires more state
  5. deduplication does NOT guarantee exactly-once delivery — retries after the dedup step can still produce duplicates downstream

Maps to: any at-least-once pipeline, archetypes 10, 15, 22

Late Data Detector (Watermark) #

  1. tracks the maximum event time seen so far, subtracts an allowed lateness to produce the watermark
  2. any event older than the watermark is classified as late
  3. watermark must be monotonically increasing (use MAX, not MIN, at partition level)
  4. stuck-in-the-past risk: if MIN is used, late data can pull the watermark backward, reopening already-emitted windows and growing state unboundedly
  5. MAX strategy risk: in skewed environments, fast partitions advance the watermark and cause slow partitions to be classified as late

Maps to: any windowed streaming system, archetypes 22, 24

Late Data Integrator (Static and Dynamic) #

  1. static: reprocess a fixed historical window to absorb late records (simple but wasteful)
  2. dynamic: track which versions have been processed; only reprocess partitions that received late data
  3. dynamic variant requires a version state table and adds complexity

Maps to: projection workers that must be complete (archetype 10)

Checkpointer #

  1. persists processing position (offsets) and computed state to durable storage
  2. two implementations: framework-managed (Flink barriers, Spark checkpoint location) or SDK-managed (Kafka consumer commit)
  3. tradeoff — delivery guarantee vs latency: frequent checkpoints = slower but less data to reprocess on failure; infrequent = faster but more reprocessing
  4. exactly-once is an illusion: parallel tasks can fail between checkpoint intervals, causing replay of already-successful records
  5. checkpoint alone gives at-least-once; combine with idempotency pattern for exactly-once feeling

Maps to: archetype 22 (append log + consumer progress) — this IS the offset discipline

Filter Interceptor #

  1. route records to different processing paths based on classification rules
  2. useful for quality tiers: valid records proceed, suspect records get extra validation, invalid records go to dead-letter

Maps to: any pipeline with data quality gates


3. Idempotency Patterns #

Fast Metadata Cleaner #

  1. use metadata operations (DROP PARTITION, TRUNCATE) instead of row-level DELETE for cleanup before rewrite
  2. partition the dataset so each run’s output is isolated — then overwrite the partition atomically
  3. much faster than scanning and deleting individual rows

Maps to: any batch write that must be rerunnable (archetypes 10, 15)

Data Overwrite #

  1. full physical overwrite of the output dataset each run
  2. simple idempotency but expensive for large datasets
  3. consumers see no data during the overwrite window unless you use a proxy/view abstraction

Maps to: full-refresh projections, reference data loads

Merger (Upsert) #

  1. MERGE / INSERT … ON CONFLICT UPDATE for incremental idempotent writes
  2. handles both new and updated records in one operation
  3. consequence: MERGE is more expensive than INSERT because it must check for existing keys

Maps to: any incremental write path, CQRS read-model updates

Keyed Idempotency #

  1. generate a deterministic key from immutable attributes so the same input always produces the same key
  2. key-value stores naturally deduplicate on key (UPSERT semantics)
  3. critical: use immutable attributes (append_time, ingestion_time) not mutable ones (event_time) — late data can change event_time and break key stability
  4. Kafka append-only log does not deduplicate at write time; relies on async compaction, so consumers may see temporary duplicates

Maps to: session aggregation, any keyed state store, archetype 10

Transactional Writer #

  1. wrap writes in a transaction so consumers only see complete results (commit) or nothing (rollback)
  2. distributed variant: either per-task local transactions (weaker — job retry replays committed tasks) or whole-job transactions (stronger — needs coordinator)
  3. all-or-nothing semantics prevent partial data exposure
  4. idempotency is scoped to the current transaction — backfill creates a new transaction and may reinsert

Maps to: any write path where partial visibility is dangerous, archetype 10 (projection worker)

Proxy #

  1. expose data through an abstraction (view, alias) that can be atomically swapped
  2. write to a new version of the underlying table, then swap the proxy to point to the new version
  3. consumers never see partial or stale data during the write
  4. enables immutable dataset design: each run creates a new version, old versions preserved for rollback

Maps to: any system that serves reads during writes (search index swap, config rollout)


4. Data Value Patterns #

Distributed Aggregator (Shuffle + Reduce) #

  1. bring related records together across machines via network exchange (shuffle), then aggregate
  2. shuffle is the dominant latency and cost driver in distributed processing
  3. data skew: if one key has disproportionately more records, its reducer becomes the bottleneck
  4. mitigation: salting — add a random salt to the key, aggregate with salt, then aggregate again without salt (two-phase aggregation)

Maps to: any fan-in aggregation, archetypes 6 (analytics read model), 24 (metrics pipeline)

Local Aggregator #

  1. if data is already co-partitioned (e.g., Kafka partitions align with group-by keys), aggregate locally without shuffle
  2. eliminates network exchange entirely
  3. only works when input partitioning matches the aggregation key

Maps to: stream processing optimization, archetype 22 where partition key = aggregation key

Sessionization (Incremental and Stateful) #

  1. incremental sessionizer: batch pattern — combine current input with pending sessions from previous run, emit completed sessions, carry forward open sessions
  2. stateful sessionizer: streaming pattern — accumulate events per session key in state store, emit on inactivity timeout
  3. three phases: initialization (session start), accumulation (new events), finalization (timeout/close event)
  4. state management is the hard part: state grows with active sessions and must be checkpointed

Maps to: archetypes 6 (analytics), 10 (projection worker), user session tracking

Data Ordering (Bin Pack and FIFO) #

  1. bin pack orderer: organize output files by size for efficient downstream reading (avoid small files)
  2. FIFO orderer: preserve input ordering through the pipeline
  3. ordering adds cost — only enforce when downstream correctness requires it

Maps to: any pipeline where ordering matters (archetype 22 per-partition ordering)


5. Data Flow Patterns #

Local Sequencer #

  1. decompose a monolithic pipeline into sequential tasks within the same workflow
  2. benefit: individual task retry without re-running the entire pipeline
  3. rule of thumb: put boundaries at restart points and between expensive operations

Maps to: any multi-step pipeline design

Isolated Sequencer #

  1. cross-pipeline dependencies: pipeline B triggers only after pipeline A completes
  2. implementations: shared data readiness markers, event-driven triggers, orchestrator cross-DAG sensors
  3. enables team-level ownership boundaries

Maps to: any system with multiple teams producing/consuming data

Fan-In (Aligned and Unaligned) #

  1. aligned: all upstream dependencies must complete before the downstream task runs
  2. unaligned: downstream task runs as soon as any upstream completes
  3. aligned guarantees completeness; unaligned reduces latency but may process partial inputs

Maps to: any aggregation that depends on multiple sources

Fan-Out (Parallel Split and Exclusive Choice) #

  1. parallel split: one input feeds multiple independent downstream tasks
  2. exclusive choice: route input to exactly one downstream based on a condition
  3. parallel split is the write fan-out pattern; exclusive choice is content-based routing

Maps to: archetype 15 (fan-out delivery worker), event-driven architectures


6. Data Storage Patterns #

Horizontal Partitioner #

  1. divide data by a low-cardinality attribute (date, region) into physically isolated partitions
  2. enables partition pruning: queries touching one date only scan that partition
  3. risk: high-cardinality partition key creates too many small partitions (metadata overhead, small files)
  4. risk: skewed partitions — one partition much larger than others becomes the bottleneck
  5. mutability: changing the partition key requires rewriting all historical data (Apache Iceberg handles this at metadata level only)
  6. horizontal partitioning ≠ sharding: partitioning is logical division; sharding is physical division across machines

Maps to: any system that stores time-series or event data

Bucket #

  1. hash-based grouping of records into a fixed number of buckets within a partition
  2. useful for high-cardinality keys where partitioning would create too many directories
  3. enables sort-merge joins without shuffle when both tables use the same bucketing scheme

Maps to: archetype 24 (metrics pipeline — high-cardinality time-series keys)

Dataset Materializer #

  1. pre-compute and persist expensive query results for fast repeated access
  2. tradeoff: storage cost and freshness lag vs query performance
  3. materialized datasets may become stale — need a refresh strategy

Maps to: archetypes 6 (analytics read model), CQRS projections

Normalizer vs Denormalizer #

  1. normalizer: eliminate redundancy, enforce consistency, minimize storage — but queries require joins
  2. denormalizer: pre-join and duplicate data for fast reads — but updates must propagate to all copies
  3. write-heavy systems favor normalization; read-heavy systems favor denormalization

Maps to: the fundamental read-path vs write-path tradeoff in every archetype


7. Data Quality Patterns #

Audit-Write-Audit-Publish (AWAP) #

  1. four-phase pipeline: audit input → write output → audit output → publish to consumers
  2. pre-write audit catches bad input before expensive processing
  3. post-write audit catches processing bugs before consumers see the data
  4. publish is the atomic visibility step (view swap, partition promotion)

Maps to: any pipeline where data quality directly affects users

Schema Compatibility Enforcer #

  1. validate that upstream schema changes are backward-compatible before allowing ingestion
  2. prevents breaking downstream consumers with unexpected column drops or type changes
  3. typical implementations: schema registries (Confluent, AWS Glue)

Maps to: any event-driven system with schema evolution (archetypes 22, 10)


8. Data Observability Patterns #

Flow Interruption Detector #

  1. detect when a data source stops producing data
  2. continuous delivery: alert if no records arrive within expected interval
  3. irregular delivery: alert if the gap exceeds the configured no-data window
  4. implementations: metadata layer (last modification time), data layer (row count comparison), storage layer (file modification time)

Maps to: any pipeline with SLA on data freshness

Skew Detector #

  1. compare current data volume against previous window — alert if difference exceeds threshold
  2. catch incomplete datasets before processing
  3. risk: seasonality creates false positives
  4. risk: fatality loop — if today’s skewed data becomes tomorrow’s comparison baseline, valid data triggers a false alert
  5. mitigation: compare against most recent successful run, not just the previous run

Maps to: any batch pipeline, archetype 24 (metrics ingest volume monitoring)

Lag Detector #

  1. measure how far behind a consumer is from its producer
  2. increasing lag = consumer can’t keep up = upcoming freshness/availability problem
  3. critical metric for streaming systems

Maps to: archetype 22 (consumer lag is the primary operational metric)

SLA Misses Detector #

  1. track end-to-end pipeline completion time against committed SLA
  2. alert before the SLA is breached, not after
  3. useful for pipelines with downstream business commitments

Maps to: any pipeline with contractual freshness guarantees

Data Lineage (Dataset Tracker and Fine-Grained Tracker) #

  1. dataset tracker: record which pipeline produced which dataset at which time
  2. fine-grained tracker: record column-level lineage (which input columns feed which output columns)
  3. essential for debugging data quality issues and understanding blast radius of upstream changes

Maps to: any system where you need to answer “where did this data come from?”


Cross-Cutting Concepts #

Delivery Semantics #

  1. at-most-once: checkpoint before processing — lose data on failure
  2. at-least-once: checkpoint after processing — duplicate data on failure
  3. exactly-once: illusion built from at-least-once + idempotent writes

The book emphasizes: exactly-once is always a combination of checkpointing + idempotency, never checkpointing alone.

The Watermark #

  1. watermark = MAX(event_time) − allowed_lateness
  2. monotonically increasing — never moves backward
  3. controls three things simultaneously: late data detection, state eviction, window firing
  4. too aggressive (small lateness) = drops too many valid events
  5. too lenient (large lateness) = state grows, windows fire late, freshness suffers

Data Skew #

  1. unbalanced distribution of records across partitions or keys
  2. the skewed partition/key determines overall job duration (stragglers)
  3. mitigations: salting (two-phase aggregation), adaptive query execution, backpressure buffers
  4. skew is both a processing concern (hot task) and a storage concern (hot partition)

Small Files Problem #

  1. too many small files = slow listing, excessive metadata, poor read throughput
  2. caused by: high-frequency ingestion, over-partitioning, streaming micro-batches
  3. mitigation: compaction (merge small files into large files), bin-packing, coalesce before write

Pattern-to-Archetype Quick Reference #

PatternMost Relevant Archetypes
Full Loadersearch index rebuild, cache warm, reference data
Incremental Loaderprojection workers (10), CDC pipelines
CDCevent-sourced pipelines, CQRS, archetype 10
Compactorappend log (22), time-series (24)
Dead-Letterprojection (10), fan-out (15), append log (22)
Windowed Deduplicatorany at-least-once pipeline (10, 15, 22)
Watermark / Late Datawindowed streaming (22, 24)
Checkpointerappend log + consumer progress (22)
Keyed Idempotencysession aggregation, keyed state stores (10)
Transactional Writerany write path with partial-visibility risk (10)
Distributed Aggregatoranalytics (6), metrics pipeline (24)
Sessionizationanalytics (6), projection (10)
Horizontal Partitionerany time-series or event store
Buckethigh-cardinality keys, metrics (24)
Skew Detectorbatch pipelines, metrics ingest (24)
Lag Detectorappend log consumer monitoring (22)
AWAPany pipeline with data quality SLA

Interview One-Liner #

For data pipeline design I think in terms of ingestion shape (full, incremental, CDC), error discipline (dead-letter, dedup, late data), idempotency strategy (overwrite, keyed, transactional), and observability (lag, skew, flow interruption). Each choice is a tradeoff between latency, completeness, and complexity.