Cluster and Metadata Substrate — Khepri and Mnesia #
The cluster/metadata substrate is the shared ground truth for everything that must be consistent across all broker nodes: exchange definitions, queue records, bindings, vhost configurations, user credentials, permissions, and policies. Every other substrate reads from it. Changes to it must be durable and consistent — losing a binding or user record is a correctness failure, not just a performance problem.
RabbitMQ has two implementations of this substrate: Mnesia (the legacy backend, an Erlang-native distributed database) and Khepri (the new backend, a Ra-based hierarchical key-value store). Both are live in the current codebase, selectable via the khepri_db feature flag.
Mnesia: The Legacy Backend #
Mnesia is an Erlang/OTP built-in distributed database. RabbitMQ has used it since version 1.0. Its key properties:
- Replicated tables: each Mnesia table can be declared as `disc_copies` (replicated on disk on all nodes), `ram_copies` (in-memory only), or `disc_only_copies`.
- Transactions: ACID transactions using two-phase locking.
- Schema: tables are Erlang record types. Schema changes require coordinated cluster-wide operations.
- Partition handling: Mnesia uses a simple “island” partition model — each partition continues independently. On heal, one island wins; the other loses its changes. No quorum requirement.
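In plain Mnesia terms, a table declaration and a transactional write look like this (a generic sketch with an invented `demo_user` record, not RabbitMQ's actual schema):

```erlang
%% Generic Mnesia sketch: record and table names are invented for
%% illustration, not RabbitMQ's schema.
-record(demo_user, {name, tags}).

create_table() ->
    %% disc_copies: replicated on disk on every node listed.
    mnesia:create_table(demo_user,
                        [{attributes, record_info(fields, demo_user)},
                         {disc_copies, [node()]}]).

add_user(Name, Tags) ->
    %% The fun runs under two-phase locking; the write commits on all
    %% replicas or aborts everywhere.
    mnesia:transaction(fun() ->
        mnesia:write(#demo_user{name = Name, tags = Tags})
    end).
```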
Mnesia’s partition behavior is the critical weakness. A network partition can cause two nodes to independently accept conflicting writes (e.g., two different users both create a queue named orders). On partition heal, Mnesia detects the conflict but cannot automatically resolve it — the broker enters an indeterminate state requiring operator intervention.
Khepri: The Raft-Based Replacement #
Khepri is a new metadata store built on Ra (the same Raft library as quorum queues). Its key properties:
- Hierarchical tree: data is stored in a tree, addressed by path (e.g., `[rabbit, exchanges, <<"/">>, <<"amq.topic">>]`). Think of it as a file system path where each node in the tree can hold a value.
- Raft consistency: all writes go through the Raft leader. Writes are committed only after a majority of nodes acknowledge. No split-brain.
- Projections: Khepri can maintain ETS projections — derived views of the tree maintained incrementally as the tree changes.
- Feature flag controlled: the `khepri_db` feature flag controls which backend is active. Migration happens live, with a one-time data transfer from Mnesia tables to the Khepri tree.
%% rabbit_khepri.erl: starting Khepri
khepri:start(?RA_SYSTEM, RaServerConfig)
%% RaServerConfig includes the Ra cluster name, log settings, snapshot config
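The basic read/write API against the tree looks roughly like this (a sketch assuming khepri's documented `put/3` and `get/2`; the store id and value are placeholders):

```erlang
%% Sketch: writing and reading one tree node. The store id and the
%% value stored here are placeholders, not what RabbitMQ actually keeps.
demo(StoreId) ->
    Path = [rabbit, vhosts, <<"/">>],
    %% put/3 is a Raft write: acknowledged only after a majority commits.
    ok = khepri:put(StoreId, Path, #{description => <<"default vhost">>}),
    {ok, Value} = khepri:get(StoreId, Path),
    Value.
```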
The rabbit_db_* Abstraction Layer #
The rabbit_db_* modules provide a backend-agnostic API:
%% rabbit_db_exchange.erl: backend-agnostic exchange store
get(XName) ->
    rabbit_khepri:handle_fallback(#{
        mnesia => fun() -> mnesia_get(XName) end,
        khepri => fun() -> khepri_get(XName) end
    }).

create_or_get(X) ->
    rabbit_khepri:handle_fallback(#{
        mnesia => fun() -> mnesia_create_or_get(X) end,
        khepri => fun() -> khepri_create_or_get(X) end
    }).
`rabbit_khepri:handle_fallback/1` checks the `khepri_db` feature flag:
- If enabled: calls the `khepri` function.
- If disabled: calls the `mnesia` function.
- If migrating (transitional state): watches for Mnesia `no_exists` exceptions and retries or switches to Khepri.
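The dispatch can be pictured as a simplified sketch (the real `rabbit_khepri:handle_fallback/1` also implements the migration-window retry described above):

```erlang
%% Simplified sketch of the dispatch, not the actual implementation.
handle_fallback(#{mnesia := MnesiaFun, khepri := KhepriFun}) ->
    case rabbit_feature_flags:is_enabled(khepri_db) of
        true  -> KhepriFun();
        false -> MnesiaFun()
    end.
```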
This dispatch pattern appears in:
- `rabbit_db_exchange` — exchange definitions
- `rabbit_db_binding` — binding rules
- `rabbit_db_queue` — queue records
- `rabbit_db_vhost` — vhost definitions
- `rabbit_db_user` — user credentials
- `rabbit_db_policy` — policies and runtime parameters
- `rabbit_db_topic_exchange` — topic trie (special: Khepri projection, not a direct store)
Khepri Paths and the Tree Structure #
Khepri stores data at tree paths. The RabbitMQ-specific paths:
[rabbit, exchanges, VHost, ExchangeName] → exchange record
[rabbit, queues, VHost, QueueName] → queue record
[rabbit, bindings, XName, DestinationType, DestName, RoutingKey] → binding
[rabbit, vhosts, VHostName] → vhost record
[rabbit, users, UserName] → user record
[rabbit, permissions, VHost, UserName] → permission record
[rabbit, policies, VHost, PolicyName] → policy record
[rabbit, topic_trie, XName, NodeId, Word] → topic trie edge (projection target)
Each path node can have a value (the record data) and child nodes. Khepri transactions operate on the tree atomically.
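An atomic multi-path update could be sketched like this (assuming khepri's transaction API, `khepri:transaction/2` with `khepri_tx` calls inside the fun; the paths and function are illustrative):

```erlang
%% Sketch: move a record between two tree paths atomically.
%% Both writes commit together or not at all.
move_queue(StoreId, OldName, NewName) ->
    khepri:transaction(StoreId, fun() ->
        case khepri_tx:get([rabbit, queues, <<"/">>, OldName]) of
            {ok, Q} ->
                ok = khepri_tx:put([rabbit, queues, <<"/">>, NewName], Q),
                ok = khepri_tx:delete([rabbit, queues, <<"/">>, OldName]);
            _ ->
                ok
        end
    end).
```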
Khepri Projections: ETS for Read Performance #
The critical insight: Raft reads are expensive (they require the leader to confirm quorum before responding). Routing, consumer delivery, and other hot paths cannot afford a Raft read on every operation. Khepri solves this with projections — ETS tables derived from the Khepri tree, updated automatically when the tree changes.
%% Registering a projection in rabbit_khepri.erl
register_projections() ->
ok = rabbit_db_topic_exchange:register_khepri_projection(),
ok = rabbit_db_binding:register_khepri_projection(),
...
A projection is defined as:
- A path pattern in the Khepri tree to watch (e.g., `[rabbit, topic_trie, '_', '_', '_']`).
- A function that maps tree changes to ETS table operations.
When Khepri commits a change to a watched path, it calls the projection function on all nodes. The projection function updates the ETS table. Reads from the ETS table are then local, in-memory, and uncoordinated.
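Registration might look like the following sketch (assuming khepri's `khepri_projection:new/2` and `khepri:register_projection/3`; the projection name and mapping fun are invented, and the `'_'` wildcard notation follows the pattern shown above):

```erlang
%% Sketch: register an ETS projection over a subtree. Name, pattern,
%% and mapping fun are illustrative.
register_demo(StoreId) ->
    %% Maps each committed tree entry to the ETS record to store.
    MapFun = fun(_Path, Edge) -> Edge end,
    Projection = khepri_projection:new(demo_topic_trie, MapFun),
    PathPattern = [rabbit, topic_trie, '_', '_', '_'],
    khepri:register_projection(StoreId, PathPattern, Projection).
```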
%% Topic trie projection
-define(KHEPRI_PROJECTION, rabbit_khepri_topic_trie_v3).
%% trie_child_in_khepri in rabbit_db_topic_exchange.erl
trie_child_in_khepri(X, Node, Word) ->
case ets:lookup(
?KHEPRI_PROJECTION,
#trie_edge{exchange_name = X, node_id = Node, word = Word}) of
[#topic_trie_edge_v2{node_id = NextNode}] -> {ok, NextNode};
[] -> error
end.
The projection name rabbit_khepri_topic_trie_v3 includes a version (v3) — when the projection schema changes (e.g., adding fields), the version is bumped. Old projection ETS tables are unregistered and new ones populated from the current Khepri tree state.
Peer Discovery #
When a RabbitMQ node starts, it must find other nodes in its cluster. rabbit_peer_discovery implements the peer discovery substrate, with pluggable backends:
- `rabbit_peer_discovery_classic_config` — node list in rabbitmq.conf (`cluster_formation.classic_config.nodes`)
- `rabbit_peer_discovery_dns` — DNS SRV or A records
- `rabbit_peer_discovery_k8s` — Kubernetes API (list pods by label selector)
- `rabbit_peer_discovery_consul` — HashiCorp Consul service registry
- `rabbit_peer_discovery_etcd` — etcd key registration
Discovery finds a peer list. The node then joins the Khepri cluster (khepri_cluster:join/2) or Mnesia cluster (rabbit_mnesia:join_cluster/2).
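For example, the classic_config backend takes a static node list from rabbitmq.conf (hostnames here are placeholders):

```ini
# rabbitmq.conf: static peer list for cluster formation
cluster_formation.peer_discovery_backend = classic_config
cluster_formation.classic_config.nodes.1 = rabbit@host1
cluster_formation.classic_config.nodes.2 = rabbit@host2
cluster_formation.classic_config.nodes.3 = rabbit@host3
```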
On cluster join:
- The joining node adds itself to the Ra cluster (for Khepri) or copies Mnesia tables.
- Khepri streams the full Raft log (or a snapshot) to the joining node.
- The node registers its Khepri projections and begins receiving tree change events.
- Quorum queue membership reconciliation runs: `rabbit_quorum_queue_periodic_membership_reconciliation` adds the new node to all eligible quorum queues.
Mnesia to Khepri Migration #
The migration runs live, without downtime:
- Operator enables the `khepri_db` feature flag on one node.
- All nodes negotiate the feature flag via the feature flag mechanism — they all agree to enable simultaneously or not at all.
- `khepri_db_migration_enable/1` is called: it reads all Mnesia tables and writes them to Khepri.
- During migration, `handle_fallback` watches for Mnesia `no_exists` exceptions (which occur when a table has been transferred) and retries against Khepri.
- Once migration completes, `khepri_db` is enabled and all operations go through Khepri. Mnesia tables are retained for a period to support rollback.
The migration is designed to be safe under concurrent operations: the handle_fallback retry logic ensures that operations that started against Mnesia and hit the migration window are transparently retried against Khepri.
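The retry window can be sketched as follows (a simplified illustration of the pattern, not the actual rabbit_khepri code; the exact exception shape is an assumption):

```erlang
%% Simplified sketch of the migration-window fallback. If the Mnesia
%% table has already been moved, the Mnesia path fails with no_exists
%% and the call falls through to Khepri.
with_fallback(MnesiaFun, KhepriFun) ->
    try
        MnesiaFun()
    catch
        exit:{aborted, {no_exists, _Table}} ->
            KhepriFun()
    end.
```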
Cluster State and Split-Brain Behavior #
Mnesia: network partition → two independent Mnesia islands → conflicting writes possible → operator intervention required on heal.
Khepri/Ra: network partition → minority partition loses quorum → Raft refuses writes (returns {error, not_leader}) → minority nodes are read-only for metadata → no conflicting writes → on heal, minority rejoins majority and replays the Raft log. No split-brain possible.
The pause-minority partition handling strategy in RabbitMQ (for Mnesia clusters) is a workaround for Mnesia’s lack of quorum: the smaller partition pauses itself to avoid divergence. Khepri makes this strategy unnecessary — quorum is inherent.
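That Mnesia-era strategy is enabled in rabbitmq.conf:

```ini
# rabbitmq.conf: pause the minority side of a partition (Mnesia clusters)
cluster_partition_handling = pause_minority
```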
Summary #
The cluster/metadata substrate stores all broker topology (exchanges, queues, bindings, users, permissions, policies) durably and consistently. Khepri provides strong consistency via Raft — no split-brain, writes require majority. ETS projections (populated from Khepri change events) give read performance to the routing and delivery hot paths without Raft round-trips. The rabbit_db_* abstraction layer dispatches to Mnesia or Khepri based on the khepri_db feature flag, supporting live migration. Peer discovery is pluggable (DNS, K8s, Consul, etcd, static config). Mnesia’s partition weakness (split-brain) is eliminated by Khepri’s Raft-based design.