Metastore Substrate — Split Lifecycle and Checkpoint Management #
The metastore is Quickwit’s control-plane database. It is the authoritative record of what data exists: which indexes are configured, which splits have been published, and how far each source has been consumed. Every other substrate reads from or writes to the metastore to coordinate work. The metastore is the only stateful service that requires durable, consistent storage — everything else (the WAL, splits on S3) is either ephemeral or managed via metastore records.
MetastoreService Trait #
The metastore exposes a gRPC-based API defined by the MetastoreService trait (generated from protobuf). The key operations:
```rust
// Index management
async fn create_index(&mut self, req: CreateIndexRequest) -> MetastoreResult<CreateIndexResponse>;
async fn index_metadata(&mut self, req: IndexMetadataRequest) -> MetastoreResult<IndexMetadataResponse>;
async fn list_indexes_metadata(&mut self, req: ListIndexesMetadataRequest) -> MetastoreResult<ListIndexesMetadataResponse>;
async fn delete_index(&mut self, req: DeleteIndexRequest) -> MetastoreResult<EmptyResponse>;

// Split management
async fn stage_splits(&mut self, req: StageSplitsRequest) -> MetastoreResult<EmptyResponse>;
async fn publish_splits(&mut self, req: PublishSplitsRequest) -> MetastoreResult<EmptyResponse>;
async fn list_splits(&mut self, req: ListSplitsRequest) -> MetastoreResult<MetastoreServiceStream<ListSplitsResponse>>;
async fn mark_splits_for_deletion(&mut self, req: MarkSplitsForDeletionRequest) -> MetastoreResult<EmptyResponse>;
async fn delete_splits(&mut self, req: DeleteSplitsRequest) -> MetastoreResult<EmptyResponse>;

// Source management
async fn add_source(&mut self, req: AddSourceRequest) -> MetastoreResult<EmptyResponse>;
async fn reset_source_checkpoint(&mut self, req: ResetSourceCheckpointRequest) -> MetastoreResult<EmptyResponse>;
```
MetastoreServiceExt adds convenience methods:
```rust
#[async_trait]
pub trait MetastoreServiceExt: MetastoreService {
    async fn index_exists(&mut self, index_id: &str) -> MetastoreResult<bool> {
        let request = IndexMetadataRequest::for_index_id(index_id.to_string());
        match self.index_metadata(request).await {
            Ok(_) => Ok(true),
            Err(MetastoreError::NotFound { .. }) => Ok(false),
            Err(error) => Err(error),
        }
    }
}
```
The metastore is accessed via MetastoreServiceClient — a gRPC client wrapper. The indexing pipeline’s Publisher, the merge pipeline, the Janitor, and the control plane all hold MetastoreServiceClient instances. The metastore itself runs as a separate service (or in-process for single-node deployments).
Split Lifecycle: The State Machine #
A split passes through four states, the last of which is terminal removal:

```
Staged → Published → ScheduledForDelete → (Deleted)
```

A Published split enters ScheduledForDelete in two ways: it is replaced by a merge, or a retention policy expires it.
Staged: the split has been allocated a split ID and its metadata has been written to the metastore, but its data has not been fully uploaded to object storage. Staged splits are not visible to searches. This state exists to reserve the metastore record before the upload starts — if the uploader crashes after staging but before finishing the upload, the Janitor cleans up orphaned staged splits.
Published: the split’s data is fully on object storage and it is visible to searches. The transition from Staged to Published is atomic: publish_splits transitions the split state and advances the source checkpoint in a single transaction:
```rust
// From publisher.rs
let publish_splits_request = PublishSplitsRequest {
    index_uid: Some(index_uid),
    staged_split_ids: split_ids.clone(),
    replaced_split_ids: replaced_split_ids.clone(), // IDs of splits being replaced by merges
    index_checkpoint_delta_json_opt, // checkpoint delta to commit atomically
    publish_token_opt,               // fencing token to prevent stale publishes
    ..
};

// This is a single atomic write: transitions splits AND advances the checkpoint.
metastore.publish_splits(publish_splits_request).await?;
```
The replaced_split_ids field handles merges: when N small splits are merged into one large split, the new split is published and the old splits are atomically transitioned to ScheduledForDelete in the same call.
ScheduledForDelete: the split is no longer visible to searches. It is waiting for the Janitor to confirm that no running query is using it before issuing the actual delete from object storage.
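The lifecycle above can be sketched as a small enum with an explicit transition check. This is a hypothetical model for illustration, not Quickwit's actual types:

```rust
// Hypothetical sketch of the split lifecycle described above (not
// Quickwit's actual types): each arm names one legal transition.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SplitState {
    Staged,
    Published,
    ScheduledForDelete,
}

impl SplitState {
    /// Returns true if a split may move from `self` to `next`.
    fn can_transition_to(self, next: SplitState) -> bool {
        use SplitState::*;
        matches!(
            (self, next),
            // publish_splits: upload finished, split becomes searchable
            (Staged, Published)
                // replaced by a merge, or expired by a retention policy
                | (Published, ScheduledForDelete)
        )
    }
}

fn main() {
    assert!(SplitState::Staged.can_transition_to(SplitState::Published));
    // Staged splits never reach ScheduledForDelete through the normal
    // lifecycle; the Janitor deletes orphaned staged splits directly.
    assert!(!SplitState::Staged.can_transition_to(SplitState::ScheduledForDelete));
}
```

Encoding the transitions this way makes the invariant explicit: the only path to deletion runs through ScheduledForDelete, which is what gives the Janitor its grace-period window.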
IndexCheckpointDelta: Source Position Tracking #
The metastore tracks, per (index_uid, source_id), how far the source has been consumed. This enables crash recovery: when a pipeline restarts, it reads the metastore checkpoint to know where to resume reading from the source.
```rust
// From quickwit-metastore/src/checkpoint.rs (inferred from usage)
pub struct IndexCheckpointDelta {
    pub source_id: SourceId,
    pub source_delta: SourceCheckpointDelta,
}

pub struct SourceCheckpointDelta {
    // partition_id → (old_position, new_position)
    pub per_partition: HashMap<PartitionId, (Position, Position)>,
}
```
The source_delta records the old and new position for each partition. The metastore validates that the old position matches the currently stored checkpoint before applying the new position — this is the optimistic concurrency check that prevents stale pipeline generations from overwriting a newer checkpoint.
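The old-position check can be sketched as follows. This is a minimal model assuming integer positions and a flat map; Quickwit's actual `Position` and checkpoint types are richer:

```rust
use std::collections::HashMap;

// Simplified stand-ins for Quickwit's checkpoint types.
type PartitionId = String;
type Position = u64;

/// Applies a checkpoint delta only if, for every partition, the delta's
/// `old` position matches the stored checkpoint -- the optimistic
/// concurrency check described above.
fn try_apply_delta(
    checkpoint: &mut HashMap<PartitionId, Position>,
    delta: &HashMap<PartitionId, (Position, Position)>,
) -> Result<(), String> {
    // Validate every partition before mutating anything, so a rejected
    // delta leaves the checkpoint untouched.
    for (partition, (old, _)) in delta {
        let stored = checkpoint.get(partition).copied().unwrap_or(0);
        if stored != *old {
            return Err(format!(
                "stale delta for partition {partition}: stored {stored}, delta expects {old}"
            ));
        }
    }
    for (partition, (_, new)) in delta {
        checkpoint.insert(partition.clone(), *new);
    }
    Ok(())
}

fn main() {
    let mut checkpoint = HashMap::from([("p0".to_string(), 100u64)]);
    let delta = HashMap::from([("p0".to_string(), (100u64, 150u64))]);
    assert!(try_apply_delta(&mut checkpoint, &delta).is_ok());
    // Replaying the same delta now fails: the stored position moved to 150.
    assert!(try_apply_delta(&mut checkpoint, &delta).is_err());
}
```

The replay failure in `main` is exactly the crash-recovery property: a duplicate publish from a restarted or lagging pipeline is rejected rather than silently rewinding the checkpoint.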
PublishToken: Fencing #
```rust
pub type PublishToken = String; // a UUID generated per pipeline generation
```
When a new IndexingPipeline is spawned (initial or after restart), the control plane allocates a new PublishToken (UUID). This token is embedded in every PublishSplitsRequest. The metastore validates the token: if the token doesn’t match the current token for the (index_uid, source_id), the publish is rejected with MetastoreError::InvalidArgument.
This prevents a stale pipeline generation (e.g., a zombie from before a restart) from racing with the current generation to publish splits. The control plane invalidates the old token when it issues a new one.
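The fencing check can be pictured like this. `TokenRegistry` and its methods are illustrative names, not Quickwit's; the point is that one valid token is stored per (index_uid, source_id) and rotation implicitly invalidates the previous one:

```rust
use std::collections::HashMap;

// Minimal sketch of publish-token fencing; names are illustrative.
struct TokenRegistry {
    // (index_uid, source_id) -> currently valid publish token
    current: HashMap<(String, String), String>,
}

impl TokenRegistry {
    /// Called when the control plane spawns a new pipeline generation;
    /// installing the new token implicitly invalidates the old one.
    fn rotate(&mut self, index_uid: &str, source_id: &str, token: String) {
        self.current
            .insert((index_uid.to_string(), source_id.to_string()), token);
    }

    /// Checked on every publish: a stale (zombie) generation fails here.
    fn validate(&self, index_uid: &str, source_id: &str, token: &str) -> bool {
        self.current
            .get(&(index_uid.to_string(), source_id.to_string()))
            .map_or(false, |current| current.as_str() == token)
    }
}

fn main() {
    let mut registry = TokenRegistry { current: HashMap::new() };
    registry.rotate("logs", "kafka-source", "token-gen-1".to_string());
    assert!(registry.validate("logs", "kafka-source", "token-gen-1"));

    // A restart rotates the token; the old generation can no longer publish.
    registry.rotate("logs", "kafka-source", "token-gen-2".to_string());
    assert!(!registry.validate("logs", "kafka-source", "token-gen-1"));
}
```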
StageSplitsRequest and StagedSplit Cleanup #
When the Uploader stages a split before uploading its data, it calls stage_splits:
```rust
let stage_splits_request = StageSplitsRequest {
    index_uid: Some(index_uid.clone()),
    split_metadata_list_serialized_json: split_metadatas_json,
};
metastore.stage_splits(stage_splits_request).await?;
```
The staged split record contains SplitMetadata with all the split’s attributes: split ID, index UID, source ID, time range, doc count, uncompressed size, and the footer_range needed by searchers.
Orphaned staged splits (staged but never published, e.g., due to uploader crash) are detected by the Janitor. Any split that has been in Staged state for more than a grace period (e.g., 1 hour) is assumed orphaned and deleted.
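The staleness test is simple to sketch. The one-hour grace period and the field names below are assumptions for illustration, not Quickwit constants:

```rust
use std::time::{Duration, SystemTime};

// Assumed grace period; the real value is configurable.
const STAGED_GRACE_PERIOD: Duration = Duration::from_secs(60 * 60);

struct StagedSplit {
    split_id: String,
    staged_at: SystemTime,
}

/// A staged split older than the grace period is presumed orphaned (its
/// uploader crashed before publishing) and is safe to delete.
fn is_orphaned(split: &StagedSplit, now: SystemTime) -> bool {
    now.duration_since(split.staged_at)
        .map_or(false, |age| age > STAGED_GRACE_PERIOD)
}

fn main() {
    let now = SystemTime::now();
    let fresh = StagedSplit { split_id: "split-a".into(), staged_at: now };
    let stale = StagedSplit {
        split_id: "split-b".into(),
        staged_at: now - Duration::from_secs(2 * 60 * 60),
    };
    assert!(!is_orphaned(&fresh, now));
    assert!(is_orphaned(&stale, now));
    println!("{} is presumed orphaned", stale.split_id);
}
```

The grace period must comfortably exceed the longest plausible upload, since a slow-but-alive upload that outlives it would be incorrectly reaped.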
list_splits: Streaming Large Result Sets #
```rust
async fn list_splits(
    &mut self,
    req: ListSplitsRequest,
) -> MetastoreResult<MetastoreServiceStream<ListSplitsResponse>>;
```
list_splits returns a streaming response rather than a single response. For indexes with millions of splits (common in log analytics), loading all splits into memory at once would be prohibitive. The stream is consumed page by page:
```rust
#[async_trait]
pub trait MetastoreServiceStreamSplitsExt {
    async fn collect_splits(mut self) -> MetastoreResult<Vec<Split>>;
    async fn collect_splits_metadata(mut self) -> MetastoreResult<Vec<SplitMetadata>>;
    async fn collect_split_ids(mut self) -> MetastoreResult<Vec<SplitId>>;
}

// collect_splits drains the stream page by page:
async fn collect_splits(mut self) -> MetastoreResult<Vec<Split>> {
    let mut all_splits = Vec::new();
    while let Some(list_splits_response) = self.try_next().await? {
        let splits = list_splits_response.deserialize_splits().await?;
        all_splits.extend(splits);
    }
    Ok(all_splits)
}
```
ListSplitsRequest supports filtering by split state, time range, tags, and source ID. The search root uses this to list relevant splits for a query; the merge planner uses it to identify splits that should be merged.
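The state and time-range filters can be pictured with simplified types. The struct below is an illustrative stand-in for `SplitMetadata`, modeling only the fields the filter needs:

```rust
// Illustrative stand-ins for Quickwit's split types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SplitState {
    Staged,
    Published,
    ScheduledForDelete,
}

struct SplitMeta {
    state: SplitState,
    // (start, end) timestamps, inclusive; None if the index has no timestamp field
    time_range: Option<(i64, i64)>,
}

/// Keeps published splits whose time range overlaps [query_start, query_end].
/// Splits without a time range can never be pruned, so they are always kept.
fn prune_splits(splits: &[SplitMeta], query_start: i64, query_end: i64) -> Vec<&SplitMeta> {
    splits
        .iter()
        .filter(|split| split.state == SplitState::Published)
        .filter(|split| match split.time_range {
            Some((start, end)) => start <= query_end && end >= query_start,
            None => true,
        })
        .collect()
}

fn main() {
    let splits = [
        SplitMeta { state: SplitState::Published, time_range: Some((0, 100)) },
        SplitMeta { state: SplitState::Published, time_range: Some((200, 300)) },
        SplitMeta { state: SplitState::Staged, time_range: Some((0, 100)) },
    ];
    // Only the first split is published AND overlaps [50, 150].
    assert_eq!(prune_splits(&splits, 50, 150).len(), 1);
}
```

In the real system this filtering happens server-side in the metastore query, which is what keeps the streamed result set small for queries over narrow time windows.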
Backends: PostgreSQL and File #
The MetastoreService trait has two primary backends:
PostgreSQL backend: used in production. Splits, indexes, sources, and checkpoints are stored in PostgreSQL tables. PostgreSQL provides ACID transactions for the critical publish_splits atomicity (split state transition + checkpoint advance). The connection pool uses sqlx.
File backend: used for local development and testing. Splits are stored as JSON files in a directory. No ACID guarantees, but sufficient for single-node operation where concurrent writes do not race.
The backend is selected at startup via the metastore_uri configuration:
- `postgres://...` → PostgreSQL backend
- `file:///...` → file backend
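The scheme dispatch can be sketched as follows; the enum and parsing are illustrative, but the scheme-to-backend mapping follows the description above:

```rust
// Sketch of backend selection from the metastore_uri scheme.
#[derive(Debug, PartialEq, Eq)]
enum MetastoreBackend {
    Postgres,
    File,
}

fn backend_for_uri(uri: &str) -> Result<MetastoreBackend, String> {
    match uri.split_once("://") {
        Some(("postgres", _)) => Ok(MetastoreBackend::Postgres),
        Some(("file", _)) => Ok(MetastoreBackend::File),
        _ => Err(format!("unsupported metastore URI: {uri}")),
    }
}

fn main() {
    assert_eq!(
        backend_for_uri("postgres://quickwit@localhost/metastore"),
        Ok(MetastoreBackend::Postgres)
    );
    assert_eq!(
        backend_for_uri("file:///var/lib/quickwit"),
        Ok(MetastoreBackend::File)
    );
    assert!(backend_for_uri("not-a-uri").is_err());
}
```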
The Janitor Service #
The Janitor is a background service that runs cleanup tasks:
- Orphaned staged split cleanup: deletes splits that have been `Staged` for longer than a grace period.
- Scheduled-for-deletion execution: for splits in `ScheduledForDelete`, the Janitor:
  a. lists splits in the `ScheduledForDelete` state;
  b. waits for the `split_deletion_grace_period` (typically 2 hours) to ensure no running query is using them;
  c. issues `storage.delete(split_file)` for each split;
  d. calls `metastore.delete_splits(...)` to remove the metastore record.
- Retention policy enforcement: applies retention policies (e.g., delete splits older than 30 days) by transitioning qualifying published splits to `ScheduledForDelete`.
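The grace-period gate in step (b) can be sketched like this; the types and the partitioning helper are illustrative, not Quickwit's:

```rust
use std::time::{Duration, SystemTime};

// Assumed value of split_deletion_grace_period (typically 2 hours).
const DELETION_GRACE_PERIOD: Duration = Duration::from_secs(2 * 60 * 60);

struct ScheduledSplit {
    split_id: String,
    scheduled_at: SystemTime, // when the split entered ScheduledForDelete
}

/// Partitions scheduled splits into (ready to delete, still in grace period).
fn split_ready(
    splits: Vec<ScheduledSplit>,
    now: SystemTime,
) -> (Vec<ScheduledSplit>, Vec<ScheduledSplit>) {
    splits.into_iter().partition(|split| {
        now.duration_since(split.scheduled_at)
            .map_or(false, |age| age >= DELETION_GRACE_PERIOD)
    })
}

fn main() {
    let now = SystemTime::now();
    let splits = vec![
        ScheduledSplit {
            split_id: "old-split".into(),
            scheduled_at: now - Duration::from_secs(3 * 60 * 60),
        },
        ScheduledSplit { split_id: "new-split".into(), scheduled_at: now },
    ];
    let (ready, waiting) = split_ready(splits, now);
    // Only "old-split" has cleared the 2-hour grace period.
    assert_eq!(ready[0].split_id, "old-split");
    assert_eq!(waiting.len(), 1);
}
```

For each split in the ready set, the Janitor would then perform steps (c) and (d): the storage delete followed by the metastore record removal.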
The 2-hour grace period matches the max_scroll_ttl in the search root:
```rust
fn max_scroll_ttl() -> Duration {
    // split_deletion_grace_period - 2 minutes safety margin
    split_deletion_grace_period - Duration::from_secs(60 * 2)
}
```
This ensures that a scroll query started just before a split enters ScheduledForDelete will complete before the split is deleted from storage.
Merge Planner and Metastore Interaction #
The MergePlanner actor reads the published split list from the metastore and applies a merge policy (Quickwit uses a logarithmic merge policy similar to Lucene’s LogByteSizeMergePolicy) to select candidates for merging. The criteria:
- Splits below a minimum size threshold are candidates.
- Candidates are grouped by time range (splits covering the same time window are merged together to preserve range-query pruning effectiveness).
- The merge policy produces `MergeTask` objects that feed the `MergeExecutor`.
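A toy version of the first two criteria, assuming discrete time buckets and illustrative thresholds (Quickwit's actual merge policy is considerably more elaborate):

```rust
use std::collections::BTreeMap;

/// Groups undersized splits by time bucket and emits merge groups with at
/// least `min_group` members. `min_size` and the bucketing are assumptions.
fn plan_merges(
    splits: &[(String, u64, i64)], // (split_id, size_bytes, time_bucket)
    min_size: u64,
    min_group: usize,
) -> Vec<Vec<String>> {
    let mut by_bucket: BTreeMap<i64, Vec<String>> = BTreeMap::new();
    for (split_id, size_bytes, bucket) in splits {
        // Only undersized splits are merge candidates.
        if *size_bytes < min_size {
            by_bucket.entry(*bucket).or_default().push(split_id.clone());
        }
    }
    by_bucket
        .into_values()
        .filter(|group| group.len() >= min_group)
        .collect()
}

fn main() {
    let splits = vec![
        ("a".to_string(), 10, 0),
        ("b".to_string(), 20, 0),
        ("c".to_string(), 10_000, 0), // already large: not a candidate
        ("d".to_string(), 5, 1),      // alone in its bucket: no merge partner
    ];
    let groups = plan_merges(&splits, 1_000, 2);
    assert_eq!(groups, vec![vec!["a".to_string(), "b".to_string()]]);
}
```

Each emitted group would correspond to one merge task handed to the executor; grouping by bucket is what preserves the time-range pruning described above.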
After a merge completes, the merge pipeline’s Publisher calls publish_splits with the new merged split as a staged_split_id and the merged-away splits as replaced_split_ids — the atomic transition ensures no documents are lost or double-counted.
Summary #
The metastore substrate is the control-plane database for Quickwit: it owns the split lifecycle state machine (Staged → Published → ScheduledForDelete), the source checkpoint per (index_uid, source_id, partition), and the index/source configuration. publish_splits is the critical atomic operation that transitions split state and advances checkpoints in one transaction, preventing partial commits. PublishToken fences stale pipeline generations. list_splits is a streaming API to handle indexes with millions of splits. The Janitor enforces retention policies and cleans up orphaned splits, with a grace period matching the maximum scroll TTL to prevent queries from reading deleted splits.