
Data Governance

The Schema Registry as Control Plane #

A distributed log without schema enforcement is an untyped byte stream. Producers write whatever they want. Consumers decode whatever they receive. When schemas drift — a field renamed, a type changed, a required field removed — the failure is silent until a consumer crashes deserializing a record it no longer understands. By then, the damage is in the log and the consumer may be hours behind.

The Schema Registry solves this by separating the schema from the data and enforcing compatibility contracts at produce time — before bad data enters the log. It is not a passive store. It is the control plane for your data contracts.

The Wire Format: Magic Byte and Schema ID #

[Figure: Schema Registry produce/consume flow with wire format]

When a producer serializes a record using the Schema Registry, it does not embed the full schema in the message — that would make every record orders of magnitude larger. Instead, it uses a compact wire format:

[0x00] [schema_id: 4 bytes] [serialized payload]
  • Magic byte (0x00) — signals that this record uses Schema Registry encoding. A consumer that receives a record without this magic byte knows the record was not encoded with the registry and can reject it immediately.
  • Schema ID (4 bytes) — an integer referencing the registered schema version. The consumer looks up this ID against the registry, retrieves the writer schema, and uses it to deserialize the payload.

This design means the schema is fetched once and cached. The wire overhead per record is exactly 5 bytes regardless of schema complexity.
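The framing described above is simple enough to sketch directly. The following is a minimal illustration of the 5-byte prefix (not the actual client library code — a toy encoder/decoder using Python's `struct`):

```python
import struct

MAGIC_BYTE = 0x00

def encode(schema_id: int, payload: bytes) -> bytes:
    """Prefix the payload with the magic byte and a 4-byte big-endian schema ID."""
    return struct.pack(">bI", MAGIC_BYTE, schema_id) + payload

def decode(record: bytes) -> tuple:
    """Split a wire-format record into (schema_id, payload).

    A record that does not start with the magic byte was not produced
    through the registry and is rejected immediately.
    """
    magic, schema_id = struct.unpack(">bI", record[:5])
    if magic != MAGIC_BYTE:
        raise ValueError("not Schema Registry wire format")
    return schema_id, record[5:]
```

Note that the overhead is fixed: `encode(42, payload)` is always exactly 5 bytes longer than `payload`, regardless of how complex schema 42 is.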

Subject Naming: TopicNameStrategy vs RecordNameStrategy #

The registry organizes schemas into subjects — named containers for schema versions. The naming strategy determines how subjects are assigned:

  • TopicNameStrategy (default) — one subject per topic, suffixed with -key or -value. All records on orders use the orders-value subject. Simple and predictable, but forces all producers on a topic to use the same schema.

  • RecordNameStrategy — one subject per fully-qualified record type, independent of topic. A com.example.OrderPlaced event has the same subject whether it is written to orders or orders-archive. Enables multiple record types on one topic (a union topic) but requires consumers to handle type dispatch.

The choice is an architectural decision about topic design. Single-type topics with TopicNameStrategy are easier to reason about. Union topics with RecordNameStrategy offer more flexibility but introduce consumer-side complexity.
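The two strategies reduce to two subject-derivation functions. A sketch (function names are illustrative, not the registry client's API):

```python
def topic_name_subject(topic: str, record_fqn: str, is_key: bool = False) -> str:
    """TopicNameStrategy: the subject depends only on the topic."""
    return f"{topic}-{'key' if is_key else 'value'}"

def record_name_subject(topic: str, record_fqn: str, is_key: bool = False) -> str:
    """RecordNameStrategy: the subject is the fully-qualified record name,
    independent of which topic the record is written to."""
    return record_fqn
```

Under TopicNameStrategy, every record on `orders` shares the `orders-value` subject; under RecordNameStrategy, `com.example.OrderPlaced` maps to the same subject on `orders` and `orders-archive` alike.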

Avro vs Protobuf: Governance Tradeoffs #

Both formats support schema evolution. The tradeoffs are in how they handle compatibility, code generation, and cross-language support.

| Property | Avro | Protobuf |
|---|---|---|
| Schema definition | JSON / Avro IDL | `.proto` files |
| Default values | Required for backward compat | Optional fields have implicit defaults |
| Field identification | By name (resolved at read time) | By field number (stable across renames) |
| Rename safety | Rename breaks compatibility | Rename is safe — field number is stable |
| Schema in registry | Required for deserialization | Optional — self-describing with `Any` |
| Cross-language | Strong | Strong |
| Governance strictness | Higher — schema required | Lower — field numbers can drift |

Avro’s name-based resolution means a field rename is a breaking change. Protobuf’s number-based resolution means you can rename a field without affecting existing consumers — only the field number matters. For long-lived schemas with many teams producing and consuming, Protobuf’s stability guarantees are operationally safer. For tightly controlled pipelines where the schema is owned by one team, Avro’s strictness catches contract violations earlier.
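The rename difference can be shown with a toy model of the two resolution strategies (dicts standing in for serialized records — not how either format actually encodes data):

```python
def read_by_name(writer_record: dict, reader_fields: set) -> dict:
    """Avro-style resolution: writer and reader fields are matched by name."""
    return {f: writer_record[f] for f in reader_fields if f in writer_record}

def read_by_number(writer_record: dict, reader_schema: dict) -> dict:
    """Protobuf-style resolution: the record is keyed by field number;
    reader_schema maps field number -> the reader's local field name."""
    return {name: writer_record[num]
            for num, name in reader_schema.items() if num in writer_record}

# The writer serialized under the old name; the reader has since renamed the field.
avro_record = {"user_id": 7}
proto_record = {1: 7}  # field number 1, name not on the wire

name_result = read_by_name(avro_record, {"customer_id"})        # rename loses the field
number_result = read_by_number(proto_record, {1: "customer_id"})  # rename is invisible
```

After the rename, name-based resolution returns an empty record while number-based resolution still recovers the value under the new local name.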

Schema Evolution: Compatibility Modes #

The registry enforces compatibility between schema versions. Every new schema version submitted for a subject is validated against the registered compatibility mode before being accepted.

The Six Compatibility Modes #

[Figure: Compatibility modes hierarchy — transitive vs non-transitive]

  • BACKWARD — new schema can read data written with the previous schema. Consumers can be upgraded before producers. Safe default for most consumer-driven workflows.

  • FORWARD — old schema can read data written with the new schema. Producers can be upgraded before consumers. Required when consumers cannot be upgraded immediately.

  • FULL — both backward and forward compatible. The strictest mode. New schema is a safe drop-in replacement in both directions.

  • BACKWARD_TRANSITIVE — new schema is backward compatible with all previous versions, not just the most recent. Prevents compatibility from becoming a chain where v3 is compatible with v2 but not v1.

  • FORWARD_TRANSITIVE — old schemas can read data written by any newer version.

  • FULL_TRANSITIVE — fully compatible with all registered versions in both directions.

The transitive variants are the correct default for production systems. Non-transitive compatibility creates a false sense of safety: v3 passes compatibility checks against v2, but a consumer still running v1 cannot read v3 data. Transitive modes enforce the invariant across the entire version history.
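The v1/v2/v3 failure mode above can be demonstrated with a deliberately simplified compatibility checker (schemas modeled as `{field: (type, has_default)}` dicts — real Avro resolution covers far more cases):

```python
def backward_compatible(new: dict, old: dict) -> bool:
    """True if a reader on `new` can decode data written with `old`:
    fields added in `new` must carry defaults; shared fields must keep their type."""
    for field, (ftype, has_default) in new.items():
        if field not in old:
            if not has_default:
                return False          # new required field: old data cannot supply it
        elif old[field][0] != ftype:
            return False              # type changed: old data undecodable
    return True

def backward_transitive(new: dict, history: list) -> bool:
    """BACKWARD_TRANSITIVE: checked against every registered version, not just the last."""
    return all(backward_compatible(new, old) for old in history)

v1 = {"id": ("long", False), "name": ("string", False)}
v2 = {"id": ("long", False)}                          # dropping a field: backward compatible
v3 = {"id": ("long", False), "name": ("int", True)}   # re-adds `name` with a new type
```

Here `v3` passes a non-transitive check (against `v2` alone) but fails against `v1`, which wrote `name` as a string — exactly the gap the transitive modes close.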

Breaking Changes: The Dual-Write Pattern #

[Figure: Dual-write migration — parallel topics with gradual consumer migration]

Some schema changes are inherently breaking — removing a required field, changing a field type, renaming a field in Avro. These cannot be made compatible. The correct pattern is a dual-write migration:

  1. Create a new topic with the new schema (orders-v2).
  2. Update producers to write to both orders (old schema) and orders-v2 (new schema) simultaneously.
  3. Migrate consumers to orders-v2 one by one.
  4. Once all consumers are on orders-v2, stop writing to orders.
  5. Decommission orders after its retention window expires.

This gives every consumer an explicit migration window without any forced cutover. The cost is temporary write amplification and operational overhead of managing two topics. The alternative — a forced cutover — requires all producers and consumers to be updated simultaneously, which is rarely achievable in distributed systems.
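Step 2 is the heart of the pattern: one produce call fans out to both schemas. A sketch with in-memory lists standing in for topics, and a hypothetical breaking change (float `amount` replaced by integer `amount_cents`):

```python
topics = {"orders": [], "orders-v2": []}

def to_v2(order: dict) -> dict:
    """Translate an old-schema order into the new, incompatible schema."""
    return {"order_id": order["order_id"],
            "amount_cents": round(order["amount"] * 100)}

def dual_write(order: dict) -> None:
    """Every order lands on both topics until all consumers have migrated."""
    topics["orders"].append(order)            # old schema, for unmigrated consumers
    topics["orders-v2"].append(to_v2(order))  # new schema

dual_write({"order_id": "o-1", "amount": 19.99})
```

The write amplification is explicit in the code: every order costs two produces until step 4 retires the old topic.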

Retention and Compaction as Data Products #

Retention and compaction are not just storage management policies — they define what the log represents as a data product.

Time and Size Retention: The Floating Horizon #

Time-based retention creates a floating horizon: the log always contains the last N days of events. New consumers that join after day 1 cannot reconstruct the full history — they can only see events within the retention window. This is correct for ephemeral event streams (click events, metrics, logs) but wrong for stateful streams (user profiles, account balances, configuration state) where a new consumer needs to bootstrap full state from the log.

The hazard is subtle. A consumer that is offline for longer than the retention period returns to find its last committed offset below the Log Start Offset. It receives OFFSET_OUT_OF_RANGE and must reset — either to the beginning of the available window (losing the gap) or to the end (skipping everything it missed). Neither is correct for a stateful consumer. Retention windows must be set with the slowest consumer’s recovery time in mind.
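The reset decision can be made concrete. A sketch of where a resuming consumer actually lands (not the Kafka client API — a simplified model of its `auto.offset.reset` behavior):

```python
def resolve_offset(committed: int, log_start: int, log_end: int,
                   reset: str = "earliest") -> int:
    """If retention has advanced the Log Start Offset past the committed offset,
    the broker reports OFFSET_OUT_OF_RANGE and the client applies its reset policy."""
    if log_start <= committed <= log_end:
        return committed
    # The records in [committed, log_start) are gone. "earliest" replays from
    # log_start (the gap is lost anyway); "latest" jumps to log_end (skips everything).
    return log_start if reset == "earliest" else log_end
```

For a stateful consumer, both branches of the fallback are wrong — which is the argument for sizing retention against the slowest consumer's recovery time.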

Compaction: Compaction Lag and the Tombstone Race #

Log compaction retains only the latest value per key — semantically equivalent to a database table. A new consumer replaying a compacted log from the beginning gets a complete snapshot of current state.

Two operational hazards:

Compaction lag — the compaction process is asynchronous. There is always a window between when a newer value is written and when the older value is physically removed. During this window, the log contains both values. A consumer reading during compaction may see the old value before it is cleaned. This is not a correctness issue for well-designed consumers (the newer value will be read eventually) but can cause transient state inconsistencies in consumers that materialize state incrementally.

[Figure: Tombstone race condition — consumer offline during compaction misses the deletion]

Tombstone race condition — a tombstone (a record with a null value, signaling deletion) is retained through one compaction pass to ensure all consumers see the deletion signal before it is removed. If a consumer is offline during the entire window between the tombstone being written and the tombstone being removed by compaction, it will never see the deletion. The consumer’s materialized state will contain a key that the log has logically deleted.
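The race is easier to see in miniature. A toy compaction pass over a `(key, value)` log, with `None` as the tombstone (real compaction retains tombstones for a configurable window; this sketch drops them in one pass to expose the hazard):

```python
def apply(state: dict, records: list) -> dict:
    """Materialize a log into key/value state; a None value deletes the key."""
    for key, value in records:
        if value is None:
            state.pop(key, None)
        else:
            state[key] = value
    return state

def compact(log: list) -> list:
    """One compaction pass: keep the latest value per key, then drop tombstones."""
    latest = {}
    for key, value in log:
        latest[key] = value
    return [(k, v) for k, v in latest.items() if v is not None]

# A consumer materializes the log, then goes offline.
offline_state = apply({}, [("user1", "a"), ("user2", "b")])
# While it is offline, a tombstone for user1 is written and then compacted away.
log = compact([("user1", "a"), ("user2", "b"), ("user1", None)])
# A fresh consumer replaying the compacted log gets the correct state...
fresh_state = apply({}, log)
# ...but the offline consumer never sees the tombstone: user1 lingers in its state.
```

The offline consumer's state still contains `user1`; the fresh consumer's does not — the two materializations have permanently diverged.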

Topic Lifecycle: Matching Policy to Data Product #

| Data Product | Retention Policy | Compaction | Use Case |
|---|---|---|---|
| Event stream | Time-based (7–30 days) | No | Click events, logs, metrics |
| State changelog | Indefinite or compacted | Yes | User profiles, account state |
| Audit log | Long time-based (1–7 years) | No | Compliance, forensics |
| Work queue | Short time-based (1–3 days) | No | Task distribution |
| Snapshot topic | Compact + delete | Yes + time | Hybrid: full state + bounded storage |

The compact,delete cleanup policy combines both: compaction retains the latest value per key, then time-based deletion removes keys whose latest update is older than the retention threshold. This creates a bounded compacted log — full current state within a time window, automatically pruned beyond it.
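A sketch of the combined policy over a timestamped log (tombstone handling omitted for brevity; timestamps and keys are illustrative):

```python
def compact_and_delete(log: list, retention_s: float, now: float) -> dict:
    """`compact,delete` cleanup: keep the latest value per key, then drop keys
    whose latest update falls outside the retention window."""
    latest = {}
    for ts, key, value in log:  # log is in offset order, so later entries win
        latest[key] = (ts, value)
    return {k: v for k, (ts, v) in latest.items() if now - ts <= retention_s}

log = [
    (100.0, "cart:1", "A"),
    (200.0, "cart:2", "B"),
    (900.0, "cart:1", "C"),  # cart:1 updated recently; cart:2 has gone stale
]
state = compact_and_delete(log, retention_s=300.0, now=1000.0)
```

With 300 seconds of retention at `t=1000`, `cart:1` survives compacted to its latest value while `cart:2` is pruned — full current state inside the window, nothing beyond it.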