Data Governance
The Schema Registry as Control Plane #
A distributed log without schema enforcement is an untyped byte stream. Producers write whatever they want. Consumers decode whatever they receive. When schemas drift — a field renamed, a type changed, a required field removed — the failure is silent until a consumer crashes deserializing a record it no longer understands. By then, the damage is in the log and the consumer may be hours behind.
The Schema Registry solves this by separating the schema from the data and enforcing compatibility contracts at produce time — before bad data enters the log. It is not a passive store. It is the control plane for your data contracts.
The Wire Format: Magic Byte and Schema ID #
When a producer serializes a record using the Schema Registry, it does not embed the full schema in the message. That would make every record orders of magnitude larger. Instead, it uses a compact wire format:
```
[0x00] [schema_id: 4 bytes] [serialized payload]
```
- Magic byte (`0x00`) — signals that this record uses Schema Registry encoding. A consumer that receives a record without this magic byte knows the record was not encoded with the registry and can reject it immediately.
- Schema ID (4 bytes) — an integer referencing the registered schema version. The consumer looks up this ID against the registry, retrieves the writer schema, and uses it to deserialize the payload.
This design means the schema is fetched once and cached. The wire overhead per record is exactly 5 bytes regardless of schema complexity.
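As a minimal sketch, the 5-byte header can be split off with nothing but Python's standard library (the payload bytes here are illustrative, not a real Avro encoding):

```python
import struct

MAGIC_BYTE = 0

def parse_wire_format(record: bytes) -> tuple[int, bytes]:
    """Split a Schema Registry-framed record into (schema_id, payload)."""
    if len(record) < 5:
        raise ValueError("record shorter than the 5-byte header")
    # Big-endian: 1 magic byte followed by a 4-byte schema ID.
    magic, schema_id = struct.unpack(">bI", record[:5])
    if magic != MAGIC_BYTE:
        raise ValueError(f"unknown magic byte {magic}: not registry-encoded")
    return schema_id, record[5:]

# Frame a fake payload under schema ID 42, then parse it back.
framed = struct.pack(">bI", MAGIC_BYTE, 42) + b"\x02payload"
schema_id, payload = parse_wire_format(framed)
```

In practice the deserializer would use `schema_id` as a cache key, fetching the writer schema from the registry only on a miss.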
Subject Naming: TopicNameStrategy vs RecordNameStrategy #
The registry organizes schemas into subjects — named containers for schema versions. The naming strategy determines how subjects are assigned:
- TopicNameStrategy (default) — one subject per topic, suffixed with `-key` or `-value`. All records on `orders` use the `orders-value` subject. Simple and predictable, but forces all producers on a topic to use the same schema.
- RecordNameStrategy — one subject per fully-qualified record type, independent of topic. A `com.example.OrderPlaced` event has the same subject whether it is written to `orders` or `orders-archive`. Enables multiple record types on one topic (a union topic) but requires consumers to handle type dispatch.
The choice is an architectural decision about topic design. Single-type topics with TopicNameStrategy are easier to reason about. Union topics with RecordNameStrategy offer more flexibility but introduce consumer-side complexity.
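The two strategies reduce to two tiny subject-naming functions. A sketch, with example topic and record names chosen for illustration:

```python
def topic_name_strategy(topic: str, is_key: bool) -> str:
    # One subject per topic: "<topic>-key" or "<topic>-value".
    return f"{topic}-{'key' if is_key else 'value'}"

def record_name_strategy(record_fqn: str) -> str:
    # One subject per fully-qualified record type, topic-independent.
    return record_fqn

value_subject = topic_name_strategy("orders", is_key=False)
union_subject = record_name_strategy("com.example.OrderPlaced")
```

Note that `record_name_strategy` ignores the topic entirely, which is exactly what allows one record type to flow through many topics under a single compatibility contract.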
Avro vs Protobuf: Governance Tradeoffs #
Both formats support schema evolution. The tradeoffs are in how they handle compatibility, code generation, and cross-language support.
| Property | Avro | Protobuf |
|---|---|---|
| Schema definition | JSON / Avro IDL | .proto files |
| Default values | Required for backward compat | Optional fields have implicit defaults |
| Field identification | By name (resolved at read time) | By field number (stable across renames) |
| Rename safety | Rename breaks compatibility | Rename is safe — field number is stable |
| Schema in registry | Required for deserialization | Optional — self-describing with Any |
| Cross-language | Strong | Strong |
| Governance strictness | Higher — schema required | Lower — field numbers can drift |
Avro’s name-based resolution means a field rename is a breaking change. Protobuf’s number-based resolution means you can rename a field without affecting existing consumers — only the field number matters. For long-lived schemas with many teams producing and consuming, Protobuf’s stability guarantees are operationally safer. For tightly controlled pipelines where the schema is owned by one team, Avro’s strictness catches contract violations earlier.
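A toy model (not the real Avro or Protobuf machinery) makes the rename difference concrete: an Avro-style reader matches writer fields by name, a Protobuf-style reader matches by field number, so only the latter survives a rename:

```python
# Writer schema v1 and v2: identical data, one field renamed.
writer_v1 = {"fields": [("customer_id", 1), ("total", 2)]}
writer_v2 = {"fields": [("buyer_id", 1), ("total", 2)]}  # customer_id renamed

def resolve_by_name(reader, writer):
    # Avro-style: a writer field matches only if the reader knows its name.
    reader_names = {name for name, _ in reader["fields"]}
    return [name for name, _ in writer["fields"] if name in reader_names]

def resolve_by_number(reader, writer):
    # Protobuf-style: the field number is the identity; names are cosmetic.
    reader_numbers = {num for _, num in reader["fields"]}
    return [num for _, num in writer["fields"] if num in reader_numbers]

reader = writer_v1  # a consumer still on the old schema
matched_names = resolve_by_name(reader, writer_v2)      # loses the renamed field
matched_numbers = resolve_by_number(reader, writer_v2)  # keeps both fields
```

The name-based reader silently drops `buyer_id`; the number-based reader still binds field 1 correctly.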
Schema Evolution: Compatibility Modes #
The registry enforces compatibility between schema versions. Every new schema version submitted for a subject is validated against the registered compatibility mode before being accepted.
The Six Compatibility Modes #
BACKWARD — new schema can read data written with the previous schema. Consumers can be upgraded before producers. Safe default for most consumer-driven workflows.
FORWARD — old schema can read data written with the new schema. Producers can be upgraded before consumers. Required when consumers cannot be upgraded immediately.
FULL — both backward and forward compatible. The strictest mode. New schema is a safe drop-in replacement in both directions.
BACKWARD_TRANSITIVE — new schema is backward compatible with all previous versions, not just the most recent. Prevents compatibility from becoming a chain where v3 is compatible with v2 but not v1.
FORWARD_TRANSITIVE — old schemas can read data written by any newer version.
FULL_TRANSITIVE — fully compatible with all registered versions in both directions.
The transitive variants are the correct default for production systems. Non-transitive compatibility creates a false sense of safety: v3 passes compatibility checks against v2, but a consumer still running v1 cannot read v3 data. Transitive modes enforce the invariant across the entire version history.
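The false-safety failure mode can be shown with a toy compatibility check. This sketch models a schema as a mapping of field name to default value (`None` meaning no default), which is a simplification of real Avro resolution:

```python
def backward_compatible(new_schema: dict, old_schema: dict) -> bool:
    """Toy check: the new schema can read old data if every field it
    declares either exists in the old schema or carries a default."""
    return all(
        name in old_schema or default is not None
        for name, default in new_schema.items()
    )

def backward_transitive(new_schema: dict, history: list) -> bool:
    # Validate against ALL registered versions, not just the latest.
    return all(backward_compatible(new_schema, old) for old in history)

v1 = {"id": None}
v2 = {"id": None, "email": ""}    # email added WITH a default: safe
v3 = {"id": None, "email": None}  # default dropped: email now required

latest_only = backward_compatible(v3, v2)          # passes against v2 alone
full_history = backward_transitive(v3, [v1, v2])   # fails: v1 has no email
```

v3 clears the non-transitive check because v2 happens to contain `email`, yet a consumer on v3 cannot read v1 data. Only the transitive check catches it.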
Breaking Changes: The Dual-Write Pattern #
Some schema changes are inherently breaking — removing a required field, changing a field type, renaming a field in Avro. These cannot be made compatible. The correct pattern is a dual-write migration:
1. Create a new topic with the new schema (`orders-v2`).
2. Update producers to write to both `orders` (old schema) and `orders-v2` (new schema) simultaneously.
3. Migrate consumers to `orders-v2` one by one.
4. Once all consumers are on `orders-v2`, stop writing to `orders`.
5. Decommission `orders` after its retention window expires.
This gives every consumer an explicit migration window without any forced cutover. The cost is temporary write amplification and operational overhead of managing two topics. The alternative — a forced cutover — requires all producers and consumers to be updated simultaneously, which is rarely achievable in distributed systems.
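The dual-write step can be sketched with in-memory stand-ins for the two topics; the field rename in `to_v2` is a hypothetical breaking change chosen for illustration:

```python
# In-memory stand-ins for the two topics during the migration window.
topics: dict[str, list] = {"orders": [], "orders-v2": []}

def to_v2(record: dict) -> dict:
    # Illustrative breaking change: the id field is renamed in v2.
    return {"order_id": record["id"], "total": record["total"]}

def dual_write(record: dict) -> None:
    # Migration step 2: every produce goes to BOTH topics until all
    # consumers have moved to orders-v2.
    topics["orders"].append(record)
    topics["orders-v2"].append(to_v2(record))

dual_write({"id": 1, "total": 99})
```

A real implementation would issue two produce calls to the broker; the write amplification mentioned above is visible here as the doubled append.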
Retention and Compaction as Data Products #
Retention and compaction are not just storage management policies — they define what the log represents as a data product.
Time and Size Retention: The Floating Horizon #
Time-based retention creates a floating horizon: the log always contains the last N days of events. New consumers that join after day 1 cannot reconstruct the full history — they can only see events within the retention window. This is correct for ephemeral event streams (click events, metrics, logs) but wrong for stateful streams (user profiles, account balances, configuration state) where a new consumer needs to bootstrap full state from the log.
The hazard is subtle. A consumer that is offline for longer than the retention period returns to find its last committed offset below the Log Start Offset. It receives OFFSET_OUT_OF_RANGE and must reset — either to the beginning of the available window (losing the gap) or to the end (skipping everything it missed). Neither is correct for a stateful consumer. Retention windows must be set with the slowest consumer’s recovery time in mind.
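The reset decision reduces to a range check against the log's retained window. A sketch, with illustrative offsets:

```python
def resume_offset(committed: int, log_start: int, log_end: int,
                  reset_policy: str) -> int:
    """Where a returning consumer resumes. If retention has advanced the
    log start offset past its commit, it must reset and data is lost."""
    if log_start <= committed <= log_end:
        return committed  # normal resume, nothing lost
    # OFFSET_OUT_OF_RANGE: the committed position was retained away.
    # "earliest" loses the gap; "latest" skips everything missed.
    return log_start if reset_policy == "earliest" else log_end

ok = resume_offset(committed=500, log_start=100, log_end=900, reset_policy="earliest")
gap = resume_offset(committed=50, log_start=100, log_end=900, reset_policy="earliest")
skip = resume_offset(committed=50, log_start=100, log_end=900, reset_policy="latest")
```

Both reset branches are lossy for a stateful consumer, which is why the retention window, not the reset policy, is the real knob.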
Compaction: Compaction Lag and the Tombstone Race #
Log compaction retains only the latest value per key — semantically equivalent to a database table. A new consumer replaying a compacted log from the beginning gets a complete snapshot of current state.
Two operational hazards:
Compaction lag — the compaction process is asynchronous. There is always a window between when a newer value is written and when the older value is physically removed. During this window, the log contains both values. A consumer reading during compaction may see the old value before it is cleaned. This is not a correctness issue for well-designed consumers (the newer value will be read eventually) but can cause transient state inconsistencies in consumers that materialize state incrementally.
Tombstone race condition — a tombstone (a record with a null value, signaling deletion) is retained through one compaction pass to ensure all consumers see the deletion signal before it is removed. If a consumer is offline during the entire window between the tombstone being written and the tombstone being removed by compaction, it will never see the deletion. The consumer’s materialized state will contain a key that the log has logically deleted.
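The race can be simulated with a toy materializer. An online consumer sees the tombstone and applies the delete; an offline consumer resumes after compaction has removed both the old value and the tombstone, so the delete never reaches it:

```python
def materialize(state: dict, records) -> dict:
    """Apply (key, value) records to a consumer's materialized state;
    a None value is a tombstone and deletes the key."""
    for key, value in records:
        if value is None:
            state.pop(key, None)
        else:
            state[key] = value
    return state

# (offset, key, value): u1 is written, then deleted at offset 2.
log = [(0, "u1", "alice"), (1, "u2", "bob"), (2, "u1", None)]
compacted = [(1, "u2", "bob")]  # after compaction: tombstone gone too

# Online consumer read everything, including the tombstone.
online = materialize({}, [(k, v) for _, k, v in log])

# Offline consumer committed offset 1, returns after compaction:
# nothing past offset 1 survives, so the delete is silently lost.
offline = materialize({"u1": "alice", "u2": "bob"},
                      [(k, v) for off, k, v in compacted if off > 1])
```

The offline consumer's state retains `u1` even though the log has logically deleted it, which is exactly the race described above. (Kafka mitigates this with a configurable tombstone retention window rather than a single pass, but the window is still finite.)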
Topic Lifecycle: Matching Policy to Data Product #
| Data Product | Retention Policy | Compaction | Use Case |
|---|---|---|---|
| Event stream | Time-based (7–30 days) | No | Click events, logs, metrics |
| State changelog | Indefinite or compacted | Yes | User profiles, account state |
| Audit log | Long time-based (1–7 years) | No | Compliance, forensics |
| Work queue | Short time-based (1–3 days) | No | Task distribution |
| Snapshot topic | Compact + delete | Yes + time | Hybrid: full state + bounded storage |
The `compact,delete` cleanup policy combines both: compaction retains the latest value per key, then time-based deletion removes keys whose latest update is older than the retention threshold. This creates a bounded compacted log — full current state within a time window, automatically pruned beyond it.
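As a toy model of that combined policy (timestamps and retention values are illustrative), compaction keeps the latest record per key and deletion then prunes keys whose latest update has aged out:

```python
def compact_and_delete(log, now: int, retention: int) -> dict:
    """Toy compact,delete: keep the latest record per key, then drop
    keys whose most recent update is older than the retention window."""
    latest: dict = {}
    for timestamp, key, value in log:  # later records overwrite earlier
        latest[key] = (timestamp, value)
    return {
        key: value
        for key, (timestamp, value) in latest.items()
        if now - timestamp <= retention
    }

log = [
    (100, "u1", "alice"),   # stale: last touched long before the window
    (100, "u2", "bob"),
    (950, "u2", "bobby"),   # fresh update keeps u2 alive
]
state = compact_and_delete(log, now=1000, retention=500)
```

`u2` survives because its latest update falls inside the window; `u1` is pruned entirely, which is the "automatically pruned beyond it" behavior.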