Quickwit Internals: A Substrate Decomposition

Search Leaf Substrate — Split Opening, Hotcache, and Tantivy Search

Search Leaf Substrate — leaf.rs #

The search leaf substrate executes search against a set of assigned splits on a single node. The key challenge is efficiently opening Tantivy indexes stored as remote split bundles while minimizing object storage I/O. Three layers solve this: the split footer cache (avoids re-fetching bundle metadata), the hotcache (avoids per-segment footer I/O), and the SplitCache (avoids full bundle re-download for hot splits).

Entry Point: leaf_search #

When a LeafSearchRequest arrives, the leaf executes search against each assigned split. To balance work across batches, the leaf groups splits using a greedy LPT (longest-processing-time) heuristic:

fn greedy_batch_split<T>(
    items: Vec<T>,
    weight_fn: impl Fn(&T) -> u64,
    max_items_per_batch: NonZeroUsize,
) -> Vec<Vec<T>> { ... }

Splits are batched to bound the number of concurrent Tantivy index opens. Each batch runs under a search permit — a semaphore that limits concurrent split searches to prevent memory exhaustion.
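The batching step can be sketched as a standalone function. This is a hedged reconstruction, not Quickwit's exact implementation: it derives the batch count from `max_items_per_batch`, sorts items by descending weight, and assigns each item to the currently lightest batch that still has room.

```rust
use std::cmp::Reverse;
use std::num::NonZeroUsize;

/// Illustrative greedy LPT batching: heaviest items are placed first,
/// each into the lightest batch with spare capacity.
fn greedy_batch_split<T>(
    items: Vec<T>,
    weight_fn: impl Fn(&T) -> u64,
    max_items_per_batch: NonZeroUsize,
) -> Vec<Vec<T>> {
    let max = max_items_per_batch.get();
    let num_batches = (items.len() + max - 1) / max;
    let mut batches: Vec<(u64, Vec<T>)> =
        (0..num_batches).map(|_| (0u64, Vec::new())).collect();
    let mut items = items;
    // LPT: process heaviest items first.
    items.sort_by_key(|item| Reverse(weight_fn(item)));
    for item in items {
        // Pick the lightest batch that is not yet full.
        let (weight, batch) = batches
            .iter_mut()
            .filter(|(_, batch)| batch.len() < max)
            .min_by_key(|(weight, _)| *weight)
            .expect("num_batches guarantees spare capacity");
        *weight += weight_fn(&item);
        batch.push(item);
    }
    batches.into_iter().map(|(_, batch)| batch).collect()
}

fn main() {
    // Six splits with byte sizes as weights, at most three per batch.
    let splits: Vec<u64> = vec![80, 10, 30, 50, 20, 60];
    let batches = greedy_batch_split(splits, |size| *size, NonZeroUsize::new(3).unwrap());
    assert_eq!(batches.len(), 2);
    assert!(batches.iter().all(|b| b.len() <= 3));
    // Total weight is preserved across batches.
    let total: u64 = batches.iter().flatten().sum();
    assert_eq!(total, 250);
}
```

Sorting heaviest-first before assignment is what keeps batch weights close to even; appending items in arrival order instead can leave one batch with all the large splits.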

Step 1: Fetching the Split Footer #

For each split, the leaf first fetches the split footer, consulting an in-memory cache before touching object storage:

async fn get_split_footer_from_cache_or_fetch(
    index_storage: Arc<dyn Storage>,
    split_and_footer_offsets: &SplitIdAndFooterOffsets,
    footer_cache: &MemorySizedCache<String>,
) -> anyhow::Result<OwnedBytes> {
    {
        let possible_val = footer_cache.get(&split_and_footer_offsets.split_id);
        if let Some(footer_data) = possible_val {
            return Ok(footer_data);
        }
    }
    let split_file = PathBuf::from(format!("{}.split", split_and_footer_offsets.split_id));
    let footer_data = index_storage
        .get_slice(
            &split_file,
            split_and_footer_offsets.split_footer_start as usize
                ..split_and_footer_offsets.split_footer_end as usize,
        )
        .await?;

    footer_cache.put(
        split_and_footer_offsets.split_id.to_owned(),
        footer_data.clone(),
    );

    Ok(footer_data)
}

MemorySizedCache<String> is a size-bounded LRU cache keyed by split ID. The footer bytes contain both the BundleStorageFileOffsets (file range metadata) and the HotCache (Tantivy footers for all segment components). Once cached, subsequent searches on the same split skip the object storage range-read entirely.
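The behavior of a size-bounded LRU can be illustrated with a minimal sketch. The type and field names below are invented for illustration, not the MemorySizedCache implementation: entries are evicted least-recently-used first once the total stored bytes exceed the capacity.

```rust
use std::collections::HashMap;

/// Illustrative size-bounded LRU keyed by split ID.
struct SizedLruCache {
    capacity_bytes: usize,
    total_bytes: usize,
    map: HashMap<String, Vec<u8>>,
    // Keys ordered from least to most recently used (O(n) upkeep is fine for a sketch).
    lru_order: Vec<String>,
}

impl SizedLruCache {
    fn new(capacity_bytes: usize) -> Self {
        Self { capacity_bytes, total_bytes: 0, map: HashMap::new(), lru_order: Vec::new() }
    }

    fn get(&mut self, key: &str) -> Option<&Vec<u8>> {
        if self.map.contains_key(key) {
            // Touch: move the key to the most-recently-used position.
            self.lru_order.retain(|k| k != key);
            self.lru_order.push(key.to_string());
            self.map.get(key)
        } else {
            None
        }
    }

    fn put(&mut self, key: String, value: Vec<u8>) {
        if let Some(old) = self.map.remove(&key) {
            self.total_bytes -= old.len();
            self.lru_order.retain(|k| k != &key);
        }
        self.total_bytes += value.len();
        self.map.insert(key.clone(), value);
        self.lru_order.push(key);
        // Evict least-recently-used entries until the cache fits again.
        while self.total_bytes > self.capacity_bytes && self.lru_order.len() > 1 {
            let evicted = self.lru_order.remove(0);
            if let Some(bytes) = self.map.remove(&evicted) {
                self.total_bytes -= bytes.len();
            }
        }
    }
}

fn main() {
    let mut cache = SizedLruCache::new(100);
    cache.put("split-a".to_string(), vec![0u8; 60]);
    cache.put("split-b".to_string(), vec![0u8; 50]); // exceeds 100 bytes: "split-a" is evicted
    assert!(cache.get("split-a").is_none());
    assert!(cache.get("split-b").is_some());
}
```

The point of sizing by bytes rather than entry count is that split footers vary widely in size; a count-bounded cache could blow past memory limits with a few very large footers.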

The footer offsets (split_footer_start, split_footer_end) come from the metastore — they are recorded when the split is uploaded and stored alongside the split metadata. The root passes them to the leaf in SplitIdAndFooterOffsets.

Step 2: Opening the Split Bundle #

pub(crate) async fn open_split_bundle(
    searcher_context: &SearcherContext,
    index_storage: Arc<dyn Storage>,
    split_and_footer_offsets: &SplitIdAndFooterOffsets,
) -> anyhow::Result<(FileSlice, BundleStorage)> {
    let split_file = PathBuf::from(format!("{}.split", split_and_footer_offsets.split_id));
    let footer_data = get_split_footer_from_cache_or_fetch(
        index_storage.clone(),
        split_and_footer_offsets,
        &searcher_context.split_footer_cache,
    )
    .await?;

    let (hotcache, bundle_storage) =
        BundleStorage::open_from_split_data_with_owned_bytes(
            index_storage,
            split_file,
            footer_data,
        )?;
    Ok((hotcache, bundle_storage))
}

BundleStorage::open_from_split_data_with_owned_bytes parses the footer bytes to extract the BundleStorageFileOffsets and the HotCache. It returns:

  • hotcache: FileSlice — the raw hotcache bytes as a FileSlice (Tantivy’s lazy byte slice abstraction).
  • BundleStorage — the object storage backend with file range lookup.

Step 3: Layered Directory Stack #

Tantivy accesses data through a Directory abstraction. The leaf builds a layered directory stack:

HotDirectory (serves footer reads from hotcache bytes)
    └── CachingDirectory (byte-range cache for repeated sub-file reads)
        └── wrap_storage_with_cache(SplitCache wrapper)
            └── BundleStorage (translates file reads to S3 range-gets)

Each layer serves what it can from its cache; misses fall through to the next layer:

  1. HotDirectory: the outermost layer, wrapping the CachingDirectory. For any file read, it checks whether the requested byte range is covered by the hotcache. The hotcache covers all segment footers and the beginning of fast field files, the most commonly read ranges. Hits are served from in-memory bytes.

  2. CachingDirectory (byte-range cache): the next layer down, wrapping the SplitCache-backed storage. For reads that miss the hotcache (e.g., reading the middle of a large .term dictionary), the fetched bytes are cached in a ByteRangeCache. This avoids repeated S3 reads for the same byte range within a single search.

  3. SplitCache wrapper: wraps the base storage. If the entire split file is cached locally on disk, reads are served from the local disk file instead of S3.

  4. BundleStorage: the base layer. Translates file path + byte range to storage.get_slice(bundle_filepath, offset_range) on the object storage backend.
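The fall-through pattern shared by these layers can be sketched generically. The trait and type names below are illustrative, not Quickwit's: each layer answers from its own cache when it can and otherwise delegates downward, recording the bytes on the way back up.

```rust
use std::collections::HashMap;
use std::ops::Range;

/// Illustrative abstraction: anything that can serve a byte range of a file.
trait ByteSource {
    fn read_bytes(&mut self, path: &str, range: Range<usize>) -> Vec<u8>;
}

/// Base layer: stands in for BundleStorage issuing object storage range-gets.
struct FakeRemoteStorage {
    fetch_count: usize,
}

impl ByteSource for FakeRemoteStorage {
    fn read_bytes(&mut self, _path: &str, range: Range<usize>) -> Vec<u8> {
        self.fetch_count += 1; // each call models one S3 range-get
        vec![0u8; range.len()]
    }
}

/// Caching layer: serves repeated identical range reads from memory.
struct ByteRangeCacheLayer<S: ByteSource> {
    inner: S,
    cache: HashMap<(String, usize, usize), Vec<u8>>,
}

impl<S: ByteSource> ByteSource for ByteRangeCacheLayer<S> {
    fn read_bytes(&mut self, path: &str, range: Range<usize>) -> Vec<u8> {
        let key = (path.to_string(), range.start, range.end);
        if let Some(bytes) = self.cache.get(&key) {
            return bytes.clone(); // hit: no fall-through
        }
        let bytes = self.inner.read_bytes(path, range); // miss: delegate downward
        self.cache.insert(key, bytes.clone());
        bytes
    }
}

fn main() {
    let mut dir = ByteRangeCacheLayer {
        inner: FakeRemoteStorage { fetch_count: 0 },
        cache: HashMap::new(),
    };
    dir.read_bytes("postings", 0..128);
    dir.read_bytes("postings", 0..128); // second read served from the cache
    assert_eq!(dir.inner.fetch_count, 1);
}
```

Because each layer implements the same read interface, layers compose by plain wrapping; adding or removing a cache layer does not change the code above or below it.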

Step 4: Tantivy Index Open and Warmup #

With the directory stack assembled, the leaf opens the Tantivy index:

let index = Index::open(hot_directory)?;
let reader = index.reader_builder()
    .reload_policy(ReloadPolicy::Manual)
    .try_into()?;
let searcher = reader.searcher();

After opening, the leaf performs a warmup phase. Warmup prefetches data from object storage into the CachingDirectory before the actual query execution. The WarmupInfo (derived from the query’s DocMapper analysis) specifies:

  • Which fast fields to preload (needed for sorting, aggregations, tag filtering).
  • Which term dictionary ranges to preload (needed for term queries).
  • Whether to preload the entire field norm data.

Warmup issues multiple prefetch I/Os concurrently, hiding storage latency. Without warmup, query execution would stall repeatedly waiting for individual byte ranges.

// Paraphrased from leaf.rs
let warmup_info: WarmupInfo = query_ast.collect_warmup_info(&doc_mapper)?;
warmup(&searcher, &warmup_info, ctx.clone()).await?;
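The latency-hiding effect can be illustrated with a simplified thread-based sketch. The real warmup is async and the helper names here are invented: N prefetches issued concurrently complete in roughly one storage round-trip instead of N sequential ones.

```rust
use std::thread;
use std::time::Duration;

/// Stand-in for one object storage range read with a fixed round-trip cost.
fn prefetch_range(range: (u64, u64)) -> usize {
    thread::sleep(Duration::from_millis(20)); // simulated I/O latency
    (range.1 - range.0) as usize
}

/// Issue all prefetches concurrently and wait for them to finish.
fn warmup_concurrently(ranges: Vec<(u64, u64)>) -> usize {
    let handles: Vec<_> = ranges
        .into_iter()
        .map(|range| thread::spawn(move || prefetch_range(range)))
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}

fn main() {
    // Three ranges prefetched in parallel: wall time is close to one
    // round-trip, not three, while all bytes still arrive.
    let total_bytes = warmup_concurrently(vec![(0, 100), (100, 400), (400, 450)]);
    assert_eq!(total_bytes, 450);
}
```

The same overlap argument is why warmup is a distinct phase: issuing the byte-range reads lazily during collection would serialize them behind query execution.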

Step 5: Query Execution #

After warmup, the leaf executes the Tantivy query:

let collector = make_collector_for_split(
    split_id,
    &search_request,
    &aggregations,
    &searcher,
)?;

let split_search_result = searcher.search(&query, &collector)?;

make_collector_for_split builds the appropriate Tantivy collector:

  • Top-K hits: TopScoreCollector or sort-based collector.
  • Aggregations: QuickwitAggregations collector wrapping Tantivy’s aggregation framework.
  • Count: CountCollector.

The collector produces a LeafSearchResponse with PartialHits (document IDs + sort values) and intermediate aggregation results.

IncrementalCollector and Early Termination #

For sorted queries, the leaf uses IncrementalCollector. As it processes segments within a split, it can skip a segment entirely if the segment's best possible value for the sort criterion cannot beat the global top-K threshold accumulated so far. This prunes large fractions of documents in the common case where results are concentrated in recent (highest-scored) segments.
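The pruning logic can be modeled with a min-heap over the current top-K. This is a simplified sketch, not the IncrementalCollector itself (which also handles sort orders and tie-breaking): a segment whose best possible score cannot beat the current K-th best is skipped without scoring a single document.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Simplified top-K with per-segment early termination. Returns the
/// top-K scores and the number of segments skipped entirely.
fn top_k_with_pruning(segments: &[Vec<u64>], k: usize) -> (Vec<u64>, usize) {
    // Min-heap holding the current top-K scores; the root is the threshold.
    let mut heap: BinaryHeap<Reverse<u64>> = BinaryHeap::new();
    let mut pruned_segments = 0;
    for segment in segments {
        // In a real index this upper bound would come from segment
        // metadata, not from scanning the documents.
        let seg_best = segment.iter().copied().max().unwrap_or(0);
        if heap.len() == k {
            let threshold = heap.peek().unwrap().0;
            if seg_best <= threshold {
                pruned_segments += 1; // no doc here can enter the top-K
                continue;
            }
        }
        for &score in segment {
            if heap.len() < k {
                heap.push(Reverse(score));
            } else if score > heap.peek().unwrap().0 {
                heap.pop();
                heap.push(Reverse(score));
            }
        }
    }
    let mut top: Vec<u64> = heap.into_iter().map(|r| r.0).collect();
    top.sort_by(|a, b| b.cmp(a));
    (top, pruned_segments)
}

fn main() {
    let segments = vec![vec![90u64, 80, 70], vec![60, 50], vec![95, 85]];
    let (top, pruned) = top_k_with_pruning(&segments, 2);
    assert_eq!(top, vec![95, 90]);
    assert_eq!(pruned, 1); // the [60, 50] segment was skipped entirely
}
```

The threshold only tightens as segments are processed, which is why visiting the strongest segments first maximizes how much later work gets pruned.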

Search Permits and Memory Management #

pub struct SearchPermit {
    _permit: SemaphorePermit<'static>,
    memory_allocation: ByteSize,
}

Before opening a split, the leaf acquires a SearchPermit from the SearchPermitProvider. The permit bounds two resources:

  • Concurrency: the semaphore limits concurrent open Tantivy indexes.
  • Memory: each permit carries a memory_allocation estimate based on the split’s size. The provider tracks total allocated memory and slows permit issuance if total allocation approaches the node’s searcher_memory_limit.

compute_initial_memory_allocation estimates memory from split metadata (num docs, data size) before the split is opened. This is a heuristic, but it prevents memory exhaustion when many large splits are searched simultaneously.
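A hedged sketch of such a heuristic is below. The constants and formula are invented for illustration, not Quickwit's actual numbers: estimate from both the document count and the on-disk size, then clamp to a per-split ceiling.

```rust
// All three constants are illustrative, not Quickwit's real tuning values.
const EST_BYTES_PER_DOC: u64 = 8;
const EST_PERCENT_OF_SPLIT_SIZE: u64 = 5;
const PER_SPLIT_CEILING_BYTES: u64 = 1 << 30; // 1 GiB

/// Estimate the memory a split search may need, before the split is opened.
fn compute_initial_memory_allocation(num_docs: u64, split_size_bytes: u64) -> u64 {
    let by_docs = num_docs * EST_BYTES_PER_DOC;
    let by_size = split_size_bytes * EST_PERCENT_OF_SPLIT_SIZE / 100;
    // Take the larger of the two signals, but never exceed the ceiling.
    by_docs.max(by_size).min(PER_SPLIT_CEILING_BYTES)
}

fn main() {
    // 10M docs in a 2 GiB split: here the size-based estimate dominates.
    let est = compute_initial_memory_allocation(10_000_000, 2u64 << 30);
    assert_eq!(est, 107_374_182);
}
```

Using the maximum of two independent signals hedges against splits that are atypical on one axis, such as few but very large documents, while the ceiling keeps a single pathological split from starving the whole node of permits.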

Leaf Cache: Result Caching #

quickwit-search includes a LeafSearchCache for caching full LeafSearchResponses. For identical queries on the same split, the cached result is returned immediately without re-executing the Tantivy query. Cache keys include the query, split ID, and search parameters. This is effective for dashboard queries that repeatedly run the same aggregation.
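The lookup can be modeled as a hash map keyed on (split ID, query, parameters). The struct and field names below are illustrative, not the actual LeafSearchCache types: any difference in the key fields produces a distinct cache entry.

```rust
use std::collections::HashMap;

/// Illustrative cache key: identical queries on the same split with the
/// same parameters map to the same cached response.
#[derive(Clone, Hash, PartialEq, Eq)]
struct LeafCacheKey {
    split_id: String,
    query_json: String,
    start_timestamp: Option<i64>,
    end_timestamp: Option<i64>,
}

/// Stand-in for a serialized LeafSearchResponse.
type CachedResponse = Vec<u8>;

struct LeafResultCache {
    map: HashMap<LeafCacheKey, CachedResponse>,
}

impl LeafResultCache {
    fn new() -> Self {
        Self { map: HashMap::new() }
    }
    fn get(&self, key: &LeafCacheKey) -> Option<&CachedResponse> {
        self.map.get(key)
    }
    fn put(&mut self, key: LeafCacheKey, response: CachedResponse) {
        self.map.insert(key, response);
    }
}

fn main() {
    let mut cache = LeafResultCache::new();
    let key = LeafCacheKey {
        split_id: "split-01".to_string(),
        query_json: r#"{"term":{"field":"status","value":"error"}}"#.to_string(),
        start_timestamp: None,
        end_timestamp: None,
    };
    cache.put(key.clone(), b"partial-hits".to_vec());
    // Same split + same query + same parameters: served from cache.
    assert!(cache.get(&key).is_some());
    // Different time range: different key, cache miss.
    let other = LeafCacheKey { end_timestamp: Some(1_700_000_000), ..key };
    assert!(cache.get(&other).is_none());
}
```

Caching per split (rather than per whole request) is what makes this useful for dashboards: a refreshed dashboard reuses cached results for all splits except the one still receiving new data.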

Summary #

The search leaf substrate implements a five-step split search: (1) fetch the split footer from cache or S3, (2) open BundleStorage with file range metadata, (3) build a layered directory stack (HotDirectory → CachingDirectory → SplitCache → BundleStorage), (4) warm up by prefetching fast field and term dictionary ranges concurrently, (5) execute the Tantivy query. Search permits bound concurrency and memory. The hotcache eliminates per-segment footer fetches; the CachingDirectory amortizes repeated range reads within a query; the SplitCache eliminates full-bundle re-downloads for frequently queried splits.