
Split Storage Substrate — BundleStorage and the Split Bundle Format #

Quickwit splits are immutable objects on object storage. Once a split is uploaded, it is never modified — merges produce new splits; old splits are deleted. This immutability enables the storage substrate to use simple object storage (S3, GCS, Azure Blob) as its durable backend, with no need for locking or coordination on the storage layer itself.

The Split Bundle Format #

A split is a single file on object storage: {split_id}.split. It is a concatenated bundle of files with metadata appended at the end. The layout (from quickwit/docs/internals/split-format.md):

[ Tantivy segment files... ]
[ split_fields (serialized split metadata) ]
[ BundleStorageFileOffsets (JSON) ]
[ BundleStorageFileOffsets length (u32 LE) ]
[ HotCache bytes ]
[ HotCache length (u32 LE) ]

The footer-within-footer structure means the file can be opened in two fetches:

  1. Footer fetch: read the last N bytes (known from the split’s footer offsets stored in the metastore). This retrieves the hotcache and bundle metadata in one I/O.
  2. Data fetch: use the BundleStorageFileOffsets to do range reads for specific Tantivy segment files as needed during search.
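The footer decoding amounts to reading two little-endian length prefixes from the end of the file. A minimal sketch of that arithmetic (a hypothetical helper operating on raw bytes, not Quickwit's actual code):

```rust
use std::ops::Range;

/// Hypothetical helper illustrating the footer layout above:
/// [ body ][ offsets JSON ][ offsets len: u32 LE ][ hotcache ][ hotcache len: u32 LE ]
/// Returns the byte ranges of the offsets JSON and the hotcache within `data`.
fn split_footer_ranges(data: &[u8]) -> Option<(Range<usize>, Range<usize>)> {
    let end = data.len();
    // Last 4 bytes: length of the hotcache blob.
    let hotcache_len =
        u32::from_le_bytes(data.get(end.checked_sub(4)?..)?.try_into().ok()?) as usize;
    let hotcache_start = end.checked_sub(4 + hotcache_len)?;
    // The 4 bytes just before the hotcache: length of the offsets JSON.
    let offsets_len_pos = hotcache_start.checked_sub(4)?;
    let offsets_len =
        u32::from_le_bytes(data[offsets_len_pos..hotcache_start].try_into().ok()?) as usize;
    let offsets_start = offsets_len_pos.checked_sub(offsets_len)?;
    Some((offsets_start..offsets_len_pos, hotcache_start..end - 4))
}

fn main() {
    // Synthetic bundle: 4 body bytes, "{}" as offsets JSON, "HOT" as hotcache.
    let mut data = b"BODY".to_vec();
    data.extend_from_slice(b"{}");
    data.extend_from_slice(&2u32.to_le_bytes());
    data.extend_from_slice(b"HOT");
    data.extend_from_slice(&3u32.to_le_bytes());
    let (offsets, hotcache) = split_footer_ranges(&data).unwrap();
    assert_eq!(&data[offsets], &b"{}"[..]);
    assert_eq!(&data[hotcache], &b"HOT"[..]);
    println!("offsets json and hotcache recovered");
}
```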

BundleStorageFileOffsets #

#[derive(Debug, Default, Serialize, Deserialize, Clone)]
pub struct BundleStorageFileOffsets {
    /// The files and their offsets in the body
    pub files: HashMap<PathBuf, Range<u64>>,
}

impl BundleStorageFileOffsets {
    /// The `FileSlice` must contain the full split data, with the hotcache at the end:
    /// [Files, FileMetadata, FileMetadata Len, HotCache, HotCache Len]
    /// Returns (HotCache, Self).
    fn open_from_split_data(file: FileSlice) -> anyhow::Result<(FileSlice, Self)> {
        let (bundle_and_hotcache_bytes, hotcache_num_bytes_data) =
            file.split_from_end(SPLIT_HOTBYTES_FOOTER_LENGTH_NUM_BYTES);
        let hotcache_num_bytes: u32 = u32::from_le_bytes(
            hotcache_num_bytes_data
                .read_bytes()?
                .as_ref()
                .try_into()
                .unwrap(),
        );
        // ... reads bundle metadata length, then deserializes JSON offsets
    }
}

BundleStorageFileOffsets is a JSON map from file path to Range<u64> byte offset within the bundle. When BundleStorage receives a read request for a specific file (e.g., segment.fast for fast fields), it translates it to a range read on the object storage key.
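The translation is plain offset arithmetic. A minimal sketch (hypothetical helper; the file map and paths are illustrative):

```rust
use std::collections::HashMap;
use std::ops::Range;
use std::path::{Path, PathBuf};

/// Hypothetical sketch of the lookup BundleStorage performs: a read of
/// `byte_range` within `path` becomes a range on the bundle object itself.
fn resolve_range(
    files: &HashMap<PathBuf, Range<u64>>,
    path: &Path,
    byte_range: Range<u64>,
) -> Option<Range<u64>> {
    let file_range = files.get(path)?;
    let start = file_range.start + byte_range.start;
    let end = file_range.start + byte_range.end;
    if end > file_range.end {
        return None; // read past the end of the bundled file
    }
    Some(start..end)
}

fn main() {
    let mut files = HashMap::new();
    files.insert(PathBuf::from("segment.fast"), 1_000u64..5_000u64);
    // Reading bytes 100..200 of segment.fast hits bytes 1_100..1_200 of the bundle.
    let range = resolve_range(&files, Path::new("segment.fast"), 100..200).unwrap();
    assert_eq!(range, 1_100..1_200);
    println!("{range:?}");
}
```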

The versioning system:

#[derive(Copy, Clone, Default)]
#[repr(u32)]
pub enum BundleStorageFileOffsetsVersions {
    #[default]
    V1 = 1,
}

impl VersionedComponent for BundleStorageFileOffsetsVersions {
    const MAGIC_NUMBER: u32 = 403_881_646u32;
    // ...
}

The magic number allows the reader to detect version mismatches. If the magic number does not match, the bundle was written by an incompatible version.
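A reader-side check might look like this sketch (the header layout and byte order here are assumptions for illustration, not the actual VersionedComponent wire format):

```rust
const MAGIC_NUMBER: u32 = 403_881_646;

/// Hypothetical header check: a magic number followed by a version tag.
/// Returns the version on success.
fn check_versioned_header(bytes: &[u8]) -> Result<u32, String> {
    if bytes.len() < 8 {
        return Err("header too short".to_string());
    }
    let magic = u32::from_le_bytes(bytes[0..4].try_into().unwrap());
    if magic != MAGIC_NUMBER {
        return Err(format!("unknown magic number: {magic}"));
    }
    Ok(u32::from_le_bytes(bytes[4..8].try_into().unwrap()))
}

fn main() {
    let mut header = MAGIC_NUMBER.to_le_bytes().to_vec();
    header.extend_from_slice(&1u32.to_le_bytes()); // V1
    assert_eq!(check_versioned_header(&header), Ok(1));
    assert!(check_versioned_header(&[0u8; 8]).is_err());
    println!("version check ok");
}
```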

BundleStorage #

pub struct BundleStorage {
    storage: Arc<dyn Storage>,
    /// The file path of the bundle in the storage.
    bundle_filepath: PathBuf,
    metadata: BundleStorageFileOffsets,
}

impl BundleStorage {
    pub fn open_from_split_data(
        storage: Arc<dyn Storage>,
        bundle_filepath: PathBuf,
        split_data: FileSlice,
    ) -> anyhow::Result<(FileSlice, Self)> {
        let (hotcache, metadata) = BundleStorageFileOffsets::open_from_split_data(split_data)?;
        Ok((
            hotcache,
            BundleStorage {
                storage,
                bundle_filepath,
                metadata,
            },
        ))
    }

    pub fn iter_files(&self) -> impl Iterator<Item = &PathBuf> {
        self.metadata.files.keys()
    }
}

BundleStorage implements the Storage trait. When Tantivy’s segment reader requests a file (e.g., the .term dictionary file), BundleStorage looks up the file’s byte range in metadata.files and issues a range-get on the underlying object storage. Tantivy remains unaware that it is reading from a remote bundle: it sees an ordinary Storage interface.

The Hotcache #

The hotcache is the most important optimization in the split storage substrate. Tantivy’s file format places a footer at the end of each segment file. The footer contains the field-level statistics and block offsets needed to open the segment. Without the hotcache, opening a split would require N separate range-reads (one per segment component file) just to read footers.

The hotcache bundles all the footers (and some hot data, like the beginning of fast field files) into a single blob appended to the split bundle. When the split footer is fetched (a single range-read), the hotcache is retrieved with it. Tantivy’s HotDirectory serves footer reads from this in-memory blob, eliminating extra I/O for split open.

In leaf.rs:

pub(crate) async fn open_split_bundle(
    searcher_context: &SearcherContext,
    index_storage: Arc<dyn Storage>,
    split_and_footer_offsets: &SplitIdAndFooterOffsets,
) -> anyhow::Result<(FileSlice, BundleStorage)> {
    let split_file = PathBuf::from(format!("{}.split", split_and_footer_offsets.split_id));
    let footer_data = get_split_footer_from_cache_or_fetch(
        index_storage.clone(),
        split_and_footer_offsets,
        &searcher_context.split_footer_cache,
    )
    .await?;

    // footer_data contains: bundle metadata + hotcache
    // open_from_split_data_with_owned_bytes splits it into (hotcache_FileSlice, BundleStorage)
    let (hotcache, bundle_storage) =
        BundleStorage::open_from_split_data_with_owned_bytes(
            index_storage,
            split_file,
            footer_data,
        )?;
    Ok((hotcache, bundle_storage))
}

The returned hotcache (a FileSlice) is then used to create a HotDirectory wrapping the BundleStorage, so Tantivy reads footer bytes from memory rather than object storage.

SplitPayloadBuilder: Building the Bundle #

SplitPayloadBuilder constructs the bundle during upload:

#[derive(Default)]
pub struct SplitPayloadBuilder {
    payloads: Vec<(String, Box<dyn PutPayload>, Range<u64>)>,
    current_offset: usize,
}

impl SplitPayloadBuilder {
    pub fn get_split_payload(
        split_files: &[PathBuf],
        serialized_split_fields: &[u8],
        hotcache: &[u8],
    ) -> anyhow::Result<SplitPayload> {
        let mut split_payload_builder = SplitPayloadBuilder::default();
        for file in split_files {
            split_payload_builder.add_file(file)?;
        }
        split_payload_builder.add_payload(
            SPLIT_FIELDS_FILE_NAME.to_string(),
            Box::new(serialized_split_fields.to_vec()),
            // ... tracks current_offset to build Range<u64> for each file
        );
        // appends BundleStorageFileOffsets JSON
        // appends hotcache
        // returns SplitPayload with the full byte stream and footer_range
    }
}

pub struct SplitPayload {
    payloads: Vec<Box<dyn PutPayload>>,
    pub footer_range: Range<u64>,
}

SplitPayload implements PutPayload (a streaming upload interface). The Uploader actor calls index_split_store.store_split(split_payload), which streams the bundle to object storage. The footer_range is recorded in the split’s metastore entry so that searchers know how much to fetch from the end of the file.
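The offset bookkeeping can be illustrated with a toy builder (hypothetical; the real SplitPayloadBuilder streams PutPayload objects and encodes the offsets as JSON rather than buffering everything in memory):

```rust
use std::collections::HashMap;
use std::ops::Range;
use std::path::PathBuf;

/// Toy version of the bundle assembly: append each file, record its byte
/// range, then append the serialized offsets so the footer can be located.
fn build_bundle(
    files: &[(&str, &[u8])],
) -> (Vec<u8>, HashMap<PathBuf, Range<u64>>, Range<u64>) {
    let mut body = Vec::new();
    let mut offsets = HashMap::new();
    for (name, bytes) in files {
        let start = body.len() as u64;
        body.extend_from_slice(bytes);
        offsets.insert(PathBuf::from(name), start..body.len() as u64);
    }
    // Everything after the last file body belongs to the footer
    // (offsets JSON + lengths + hotcache in the real format).
    let footer_start = body.len() as u64;
    let serialized_offsets = format!("{offsets:?}"); // stand-in for the JSON encoding
    body.extend_from_slice(serialized_offsets.as_bytes());
    let footer_range = footer_start..body.len() as u64;
    (body, offsets, footer_range)
}

fn main() {
    let (bundle, offsets, footer_range) = build_bundle(&[
        ("seg.term", b"TERMS".as_slice()),
        ("seg.fast", b"FASTFIELDS".as_slice()),
    ]);
    assert_eq!(offsets[&PathBuf::from("seg.term")], 0..5);
    assert_eq!(offsets[&PathBuf::from("seg.fast")], 5..15);
    assert_eq!(footer_range.start, 15);
    assert_eq!(footer_range.end, bundle.len() as u64);
    println!("footer_range = {footer_range:?}");
}
```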

Upload Semaphore #

static CONCURRENT_UPLOAD_PERMITS_INDEX: OnceCell<Semaphore> = OnceCell::new();
static CONCURRENT_UPLOAD_PERMITS_MERGE: OnceCell<Semaphore> = OnceCell::new();

The upload budget (max_concurrent_split_uploads) is split between indexing and merge pipelines. Indexing gets the larger share (to minimize indexing latency); merging gets the remainder. This prevents merge activity from starving indexing uploads during periods of high merge pressure.
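The budget arithmetic can be sketched as follows (the exact ratio here is an illustrative assumption, not Quickwit's formula):

```rust
/// Hypothetical split of `max_concurrent_split_uploads` between the indexing
/// and merge pipelines: indexing takes the larger half, merge the remainder.
fn split_upload_budget(max_concurrent_split_uploads: usize) -> (usize, usize) {
    let indexing = max_concurrent_split_uploads.div_ceil(2);
    let merge = max_concurrent_split_uploads - indexing;
    (indexing, merge)
}

fn main() {
    assert_eq!(split_upload_budget(12), (6, 6));
    assert_eq!(split_upload_budget(5), (3, 2)); // indexing gets the larger share
    println!("budget split ok");
}
```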

Each Uploader actor acquires a semaphore permit before starting an upload and releases it once both staging and the upload complete. Holding the permit across both steps ensures a split never appears in the metastore in the Staged state without a corresponding upload in progress.

SplitCache: Node-Local Caching #

For repeated searches on the same splits, quickwit-storage provides SplitCache: a node-local cache of recently accessed .split files. When a searcher node receives a search request, it checks SplitCache before going to object storage:

// In leaf.rs (paraphrased)
let index_storage_with_split_cache = wrap_storage_with_cache(
    index_storage.clone(),
    searcher_context.split_cache.clone(),
);

SplitCache stores full split files or partial ranges on local disk. Cache eviction is size-bounded. Splits frequently queried on a node will be served from local disk rather than S3, dramatically reducing search latency and object storage costs.

The SearchJobPlacer (chapter 5) uses rendezvous hashing to route queries for the same split to the same node, maximizing cache hit rates: a split present in a node’s local SplitCache will consistently be queried on that node.
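Rendezvous (highest-random-weight) hashing itself fits in a few lines. A minimal sketch (illustrative hash function and node names, not Quickwit's actual implementation):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Pick the node for a split: hash every (node, split_id) pair and take the
/// node with the highest score. Every node computes the same answer, so no
/// coordination is needed, and removing one node only remaps the splits
/// that were assigned to it.
fn pick_node<'a>(nodes: &[&'a str], split_id: &str) -> Option<&'a str> {
    nodes
        .iter()
        .max_by_key(|node| {
            let mut hasher = DefaultHasher::new();
            (node, split_id).hash(&mut hasher);
            hasher.finish()
        })
        .copied()
}

fn main() {
    let nodes = ["searcher-1", "searcher-2", "searcher-3"];
    let chosen = pick_node(&nodes, "split-42").unwrap();
    // The same split always routes to the same node, so its SplitCache stays warm.
    assert_eq!(pick_node(&nodes, "split-42"), Some(chosen));
    println!("split routed to {chosen}");
}
```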

Summary #

A Quickwit split is a self-describing bundle on object storage: segment files concatenated with JSON file offsets and a hotcache containing all footers. BundleStorage translates Tantivy’s file reads to range-gets on the bundle. The hotcache eliminates per-segment footer fetches during split open. SplitPayloadBuilder assembles the bundle in order during upload. Upload concurrency is bounded by semaphores split between indexing and merge. Node-local SplitCache with Rendezvous-hashing affinity routing maximizes cache hit rates across the cluster.