Cluster and Metadata Substrate — Khepri and Mnesia #
The cluster/metadata substrate is the shared ground truth for everything that must be consistent across all broker nodes: exchange definitions, queue records, bindings, vhost configurations, user credentials, permissions, and policies. Every other substrate reads from it. Changes to it must be durable and consistent — losing a binding or user record is a correctness failure, not just a performance problem.
RabbitMQ has two implementations of this substrate: Mnesia (the legacy backend, an Erlang-native distributed database) and Khepri (the new backend, a Ra-based hierarchical key-value store). Both are live in the current codebase, selectable via the khepri_db feature flag.
Mnesia: The Legacy Backend #
Mnesia is an Erlang/OTP built-in distributed database. RabbitMQ has used it since version 1.0. Its key properties:
- Replicated tables: each Mnesia table can be declared as `disc_copies` (replicated on disk on all nodes), `ram_copies` (in-memory only), or `disc_only_copies`.
- Transactions: ACID transactions using two-phase locking.
- Schema: tables are Erlang record types. Schema changes require coordinated cluster-wide operations.
- Partition handling: Mnesia uses a simple “island” partition model — each partition continues independently. On heal, one island wins; the other loses its changes. No quorum requirement.
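In plain Mnesia terms, a table declaration and a transactional write look like this (a generic sketch with an invented `demo_user` record, not RabbitMQ's actual schema):

```erlang
%% Generic Mnesia sketch: record and table names are invented for
%% illustration, not RabbitMQ's schema.
-record(demo_user, {name, tags}).

create_table() ->
    %% disc_copies: replicated on disk on every node listed.
    mnesia:create_table(demo_user,
                        [{attributes, record_info(fields, demo_user)},
                         {disc_copies, [node()]}]).

add_user(Name, Tags) ->
    %% The fun runs under two-phase locking; the write commits on all
    %% replicas or aborts everywhere.
    mnesia:transaction(fun() ->
        mnesia:write(#demo_user{name = Name, tags = Tags})
    end).
```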
Mnesia’s partition behavior is the critical weakness. A network partition can cause two nodes to independently accept conflicting writes (e.g., two different users both create a queue named orders). On partition heal, Mnesia detects the conflict but cannot automatically resolve it — the broker enters an indeterminate state requiring operator intervention.
Khepri: The Raft-Based Replacement #
Khepri is a new metadata store built on Ra (the same Raft library as quorum queues). Its key properties:
- Hierarchical tree: data is stored in a tree, addressed by path (e.g., `[rabbit, exchanges, <<"/">>, <<"amq.topic">>]`). Think of it as a file system path where each node in the tree can hold a value.
- Raft consistency: all writes go through the Raft leader. Writes are committed only after a majority of nodes acknowledge. No split-brain.
- Projections: Khepri can maintain ETS projections — derived views of the tree maintained incrementally as the tree changes.
- Feature flag controlled: the `khepri_db` feature flag controls which backend is active. Migration happens live, with a one-time data transfer from Mnesia tables to the Khepri tree.
%% rabbit_khepri.erl: starting Khepri
khepri:start(?RA_SYSTEM, RaServerConfig)
%% RaServerConfig includes the Ra cluster name, log settings, snapshot config
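The basic read/write API against the tree looks roughly like this (a sketch assuming khepri's documented `put/3` and `get/2`; the store id and value are placeholders):

```erlang
%% Sketch: writing and reading one tree node. The store id and the
%% value stored here are placeholders, not what RabbitMQ actually keeps.
demo(StoreId) ->
    Path = [rabbit, vhosts, <<"/">>],
    %% put/3 is a Raft write: acknowledged only after a majority commits.
    ok = khepri:put(StoreId, Path, #{description => <<"default vhost">>}),
    {ok, Value} = khepri:get(StoreId, Path),
    Value.
```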
The rabbit_db_* Abstraction Layer #
The rabbit_db_* modules provide a backend-agnostic API:
%% rabbit_db_exchange.erl: backend-agnostic exchange store
get(XName) ->
    rabbit_khepri:handle_fallback(#{
        mnesia => fun() -> mnesia_get(XName) end,
        khepri => fun() -> khepri_get(XName) end
    }).

create_or_get(X) ->
    rabbit_khepri:handle_fallback(#{
        mnesia => fun() -> mnesia_create_or_get(X) end,
        khepri => fun() -> khepri_create_or_get(X) end
    }).
`rabbit_khepri:handle_fallback/1` checks the `khepri_db` feature flag:
- If enabled: calls the `khepri` function.
- If disabled: calls the `mnesia` function.
- If migrating (transitional state): watches for Mnesia `no_exists` exceptions and retries or switches to Khepri.
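The dispatch can be pictured as a simplified sketch (the real `rabbit_khepri:handle_fallback/1` also implements the migration-window retry described above):

```erlang
%% Simplified sketch of the dispatch, not the actual implementation.
handle_fallback(#{mnesia := MnesiaFun, khepri := KhepriFun}) ->
    case rabbit_feature_flags:is_enabled(khepri_db) of
        true  -> KhepriFun();
        false -> MnesiaFun()
    end.
```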
This dispatch pattern appears in:
- `rabbit_db_exchange` — exchange definitions
- `rabbit_db_binding` — binding rules
- `rabbit_db_queue` — queue records
- `rabbit_db_vhost` — vhost definitions
- `rabbit_db_user` — user credentials
- `rabbit_db_policy` — policies and runtime parameters
- `rabbit_db_topic_exchange` — topic trie (special: Khepri projection, not a direct store)
Khepri Paths and the Tree Structure #
Khepri stores data at tree paths. The RabbitMQ-specific paths:
[rabbit, exchanges, VHost, ExchangeName] → exchange record
[rabbit, queues, VHost, QueueName] → queue record
[rabbit, bindings, XName, DestinationType, DestName, RoutingKey] → binding
[rabbit, vhosts, VHostName] → vhost record
[rabbit, users, UserName] → user record
[rabbit, permissions, VHost, UserName] → permission record
[rabbit, policies, VHost, PolicyName] → policy record
[rabbit, topic_trie, XName, NodeId, Word] → topic trie edge (projection target)
Each path node can have a value (the record data) and child nodes. Khepri transactions operate on the tree atomically.
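An atomic multi-path update could be sketched like this (assuming khepri's transaction API, `khepri:transaction/2` with `khepri_tx` calls inside the fun; the paths and function are illustrative):

```erlang
%% Sketch: move a record between two tree paths atomically.
%% Both writes commit together or not at all.
move_queue(StoreId, OldName, NewName) ->
    khepri:transaction(StoreId, fun() ->
        case khepri_tx:get([rabbit, queues, <<"/">>, OldName]) of
            {ok, Q} ->
                ok = khepri_tx:put([rabbit, queues, <<"/">>, NewName], Q),
                ok = khepri_tx:delete([rabbit, queues, <<"/">>, OldName]);
            _ ->
                ok
        end
    end).
```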
Khepri Projections: ETS for Read Performance #
The critical insight: Raft reads are expensive (they require the leader to confirm quorum before responding). Routing, consumer delivery, and other hot paths cannot afford a Raft read on every operation. Khepri solves this with projections — ETS tables derived from the Khepri tree, updated automatically when the tree changes.
%% Registering a projection in rabbit_khepri.erl
register_projections() ->
ok = rabbit_db_topic_exchange:register_khepri_projection(),
ok = rabbit_db_binding:register_khepri_projection(),
...
A projection is defined as:
- A path pattern in the Khepri tree to watch (e.g., `[rabbit, topic_trie, '_', '_', '_']`).
- A function that maps tree changes to ETS table operations.
When Khepri commits a change to a watched path, it calls the projection function on all nodes. The projection function updates the ETS table. Reads from the ETS table are then local, in-memory, and uncoordinated.
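Registration might look like the following sketch (assuming khepri's `khepri_projection:new/2` and `khepri:register_projection/3`; the projection name and mapping fun are invented, and the `'_'` wildcard notation follows the pattern shown above):

```erlang
%% Sketch: register an ETS projection over a subtree. Name, pattern,
%% and mapping fun are illustrative.
register_demo(StoreId) ->
    %% Maps each committed tree entry to the ETS record to store.
    MapFun = fun(_Path, Edge) -> Edge end,
    Projection = khepri_projection:new(demo_topic_trie, MapFun),
    PathPattern = [rabbit, topic_trie, '_', '_', '_'],
    khepri:register_projection(StoreId, PathPattern, Projection).
```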
%% Topic trie projection
-define(KHEPRI_PROJECTION, rabbit_khepri_topic_trie_v3).
%% trie_child_in_khepri in rabbit_db_topic_exchange.erl
trie_child_in_khepri(X, Node, Word) ->
case ets:lookup(
?KHEPRI_PROJECTION,
#trie_edge{exchange_name = X, node_id = Node, word = Word}) of
[#topic_trie_edge_v2{node_id = NextNode}] -> {ok, NextNode};
[] -> error
end.
The projection name rabbit_khepri_topic_trie_v3 includes a version (v3) — when the projection schema changes (e.g., adding fields), the version is bumped. Old projection ETS tables are unregistered and new ones populated from the current Khepri tree state.
Peer Discovery #
When a RabbitMQ node starts, it must find other nodes in its cluster. rabbit_peer_discovery implements the peer discovery substrate, with pluggable backends:
- `rabbit_peer_discovery_classic_config` — node list in rabbitmq.conf (`cluster_formation.classic_config.nodes`)
- `rabbit_peer_discovery_dns` — DNS SRV or A records
- `rabbit_peer_discovery_k8s` — Kubernetes API (list pods by label selector)
- `rabbit_peer_discovery_consul` — HashiCorp Consul service registry
- `rabbit_peer_discovery_etcd` — etcd key registration
Discovery finds a peer list. The node then joins the Khepri cluster (khepri_cluster:join/2) or Mnesia cluster (rabbit_mnesia:join_cluster/2).
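For example, the classic_config backend takes a static node list from rabbitmq.conf (hostnames here are placeholders):

```ini
# rabbitmq.conf: static peer list for cluster formation
cluster_formation.peer_discovery_backend = classic_config
cluster_formation.classic_config.nodes.1 = rabbit@host1
cluster_formation.classic_config.nodes.2 = rabbit@host2
cluster_formation.classic_config.nodes.3 = rabbit@host3
```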
On cluster join:
- The joining node adds itself to the Ra cluster (for Khepri) or copies Mnesia tables.
- Khepri streams the full Raft log (or a snapshot) to the joining node.
- The node registers its Khepri projections and begins receiving tree change events.
- Quorum queue membership reconciliation runs: `rabbit_quorum_queue_periodic_membership_reconciliation` adds the new node to all eligible quorum queues.
Mnesia to Khepri Migration #
The migration runs live, without downtime:
- Operator enables the `khepri_db` feature flag on one node.
- All nodes negotiate the feature flag via the feature flag mechanism — they all agree to enable simultaneously or not at all.
- `khepri_db_migration_enable/1` is called: it reads all Mnesia tables and writes them to Khepri.
- During migration, `handle_fallback` watches for Mnesia `no_exists` exceptions (which occur when a table has been transferred) and retries against Khepri.
- Once migration completes, `khepri_db` is enabled and all operations go through Khepri. Mnesia tables are retained for a period to support rollback.
The migration is designed to be safe under concurrent operations: the handle_fallback retry logic ensures that operations that started against Mnesia and hit the migration window are transparently retried against Khepri.
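The retry window can be sketched as follows (a simplified illustration of the pattern, not the actual rabbit_khepri code; the exact exception shape is an assumption):

```erlang
%% Simplified sketch of the migration-window fallback. If the Mnesia
%% table has already been moved, the Mnesia path fails with no_exists
%% and the call falls through to Khepri.
with_fallback(MnesiaFun, KhepriFun) ->
    try
        MnesiaFun()
    catch
        exit:{aborted, {no_exists, _Table}} ->
            KhepriFun()
    end.
```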
Cluster State and Split-Brain Behavior #
Mnesia: network partition → two independent Mnesia islands → conflicting writes possible → operator intervention required on heal.
Khepri/Ra: network partition → minority partition loses quorum → Raft refuses writes (returns {error, not_leader}) → minority nodes are read-only for metadata → no conflicting writes → on heal, minority rejoins majority and replays the Raft log. No split-brain possible.
The pause-minority partition handling strategy in RabbitMQ (for Mnesia clusters) is a workaround for Mnesia’s lack of quorum: the smaller partition pauses itself to avoid divergence. Khepri makes this strategy unnecessary — quorum is inherent.
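That Mnesia-era strategy is enabled in rabbitmq.conf:

```ini
# rabbitmq.conf: pause the minority side of a partition (Mnesia clusters)
cluster_partition_handling = pause_minority
```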
Summary #
The cluster/metadata substrate stores all broker topology (exchanges, queues, bindings, users, permissions, policies) durably and consistently. Khepri provides strong consistency via Raft — no split-brain, writes require majority. ETS projections (populated from Khepri change events) give read performance to the routing and delivery hot paths without Raft round-trips. The rabbit_db_* abstraction layer dispatches to Mnesia or Khepri based on the khepri_db feature flag, supporting live migration. Peer discovery is pluggable (DNS, K8s, Consul, etcd, static config). Mnesia’s partition weakness (split-brain) is eliminated by Khepri’s Raft-based design.