
The Write Path: From Acknowledged to Searchable #

A write to a search index crosses two distinct boundaries before it fulfills its full contract. The first boundary is durability — the write is safe from a process crash. The second boundary is visibility — the write is searchable. These two boundaries are separated by design, and the gap between them is the core operational tradeoff of distributed search.

The Translog: Durability Before Searchability #

Every write to an OpenSearch shard first goes to the transaction log (translog) — an append-only durability journal. The translog is flushed to disk with fsync before the write is acknowledged to the client. This is the moment the write becomes durable.

The write is not yet searchable. It exists only in the translog and in an in-memory Lucene buffer using appendable data structures (hash maps). These structures are optimized for fast insertion, not for search. A query arriving at this moment will not find the document.

The sequence is:

  1. Write arrives at the primary shard.
  2. Written to the translog; translog is fsynced to disk.
  3. Client receives acknowledgment — write is durable.
  4. Write is added to the in-memory Lucene buffer.
  5. [Refresh boundary] — Lucene converts the in-memory buffer into sorted, searchable segment files; a new index reader is opened; document becomes searchable.
  6. [Flush boundary] — segment files are fsynced to disk; translog entries covering those segments are purged.

The gap between steps 3 and 5 is the NRT (Near Real-Time) window — controlled by refresh_interval, default 1 second. A write acknowledged by OpenSearch may be invisible to search for up to one refresh_interval.
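The six steps above can be condensed into a toy model. This is plain Python with no OpenSearch involved — `Shard`, `index`, and `refresh` here are illustrative names for the mechanics described above, not client APIs:

```python
class Shard:
    """Toy model of one shard's write path: durability vs. visibility."""

    def __init__(self):
        self.translog = []   # fsynced journal: durable once appended (step 2)
        self.buffer = []     # in-memory Lucene write buffer (step 4)
        self.segments = []   # immutable, searchable segment files (step 5)

    def index(self, doc):
        self.translog.append(doc)   # step 2: durable from this point on
        self.buffer.append(doc)     # step 4: buffered, NOT yet searchable
        return "acknowledged"       # step 3: the client sees success here

    def refresh(self):
        # step 5: convert the buffer into a searchable segment
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer.clear()

    def search(self, doc):
        # queries only see segments, never the translog or the buffer
        return any(doc in seg for seg in self.segments)


shard = Shard()
assert shard.index({"order_id": "A123"}) == "acknowledged"
assert shard.search({"order_id": "A123"}) is False   # durable but invisible
shard.refresh()
assert shard.search({"order_id": "A123"}) is True    # now searchable
```

The assertion in the middle is the NRT window made concrete: the write has been acknowledged, yet a search misses it until `refresh` runs.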

Refresh: The Visibility Boundary #

[Figure: the write path, showing the translog, refresh, and flush boundaries]

A refresh converts the in-memory Lucene write buffer into a set of immutable segment files on disk and opens a new index reader over them. This is sometimes called a “soft commit” — the data is on disk, but not yet guaranteed durable because no fsync has occurred. The segment files could be lost in a crash. The translog still covers them.

POST /orders/_refresh

The default refresh_interval: 1s means every second, all buffered writes become visible. Lowering it toward 100ms shortens the visibility lag, at the cost of more frequent, smaller segments (which increases merge pressure). Setting it to -1 disables automatic refresh entirely — useful for bulk indexing where you want maximum throughput and will trigger a manual refresh at the end.
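The cost side of that tradeoff is easy to put a number on: under a continuous write load, every refresh that finds buffered documents cuts a new segment, so the segment creation rate is bounded by the refresh rate. A back-of-the-envelope sketch (the function name is ours, not an OpenSearch API):

```python
def max_segments_per_hour(refresh_interval_s: float) -> float:
    """Upper bound on new segments cut per hour when writes never pause.

    Illustrative arithmetic only: assumes at least one buffered write
    per interval, and ignores background merging.
    """
    return 3600 / refresh_interval_s


assert max_segments_per_hour(1.0) == 3600     # default 1s refresh
assert max_segments_per_hour(0.1) == 36000    # 100ms: 10x the merge pressure
assert max_segments_per_hour(30.0) == 120     # relaxed interval
```

This is why -1 plus a single manual refresh at the end is the standard bulk-load pattern: one large batch produces a handful of segments instead of thousands of tiny ones.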

A critical failure scenario:

  1. Client writes document. Write is acknowledged (translog durable).
  2. refresh_interval has not elapsed — document is not yet in a segment.
  3. Node crashes before refresh.
  4. Node restarts — it replays the translog, recovers the write, eventually refreshes it into a segment.
  5. Document reappears in search. No data loss, but a visibility gap occurred.
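That failure sequence can be replayed with a toy model (plain Python, illustrative names only): crash the node between acknowledgment and refresh, then recover from the translog.

```python
def recover(translog):
    """Toy crash recovery: replay the durable journal into a fresh buffer."""
    return list(translog)   # every acknowledged write is re-applied


# Steps 1-2: write acknowledged (translog durable), refresh has not run yet.
translog = ["doc-1"]   # fsynced to disk, so it survives the crash
buffer = ["doc-1"]     # in-memory only
segments = []          # nothing searchable yet

# Step 3: the crash wipes all volatile state; only the translog survives.
buffer, segments = [], []

# Step 4: restart replays the translog...
buffer = recover(translog)

# ...and an eventual refresh makes the write searchable again (step 5).
segments.append(tuple(buffer))
assert "doc-1" in segments[0]   # no data loss, only a visibility gap
```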

This is the contract: acknowledged means durable, not immediately searchable.

Flush: The Durability Checkpoint #

A flush performs two operations: it fsyncs all current Lucene segment files to disk, then truncates the translog for all operations now represented in those durable segments. After a flush, the segments themselves are the source of truth — the translog no longer needs to cover them.

Flushes happen automatically when the translog grows too large. The translog size is bounded to prevent unbounded recovery time on restart — the larger the translog, the longer it takes to replay on crash recovery. OpenSearch triggers a flush when the translog reaches its configured size limit (default 512MB) or age limit (default 30 minutes).

Translog  → covers all un-flushed writes
Flush     → segments fsynced → translog truncated
After flush: recovery reads segments, not translog
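A minimal sketch of the flush decision and its effect — the 512MB and 30-minute defaults come from the text above; the function and constant names are illustrative, not OpenSearch settings keys:

```python
FLUSH_SIZE_BYTES = 512 * 1024 * 1024   # translog size threshold (default 512MB)
FLUSH_AGE_SECONDS = 30 * 60            # translog age threshold (default 30 min)


def should_flush(translog_bytes: int, translog_age_s: float) -> bool:
    """Flush when the translog hits either its size limit or its age limit."""
    return (translog_bytes >= FLUSH_SIZE_BYTES
            or translog_age_s >= FLUSH_AGE_SECONDS)


def flush(segments, translog):
    """fsync the current segments, then drop the translog entries they cover."""
    fsynced = list(segments)   # the segments are now the source of truth
    translog.clear()           # recovery will read segments, not the translog
    return fsynced


assert should_flush(600 * 1024 * 1024, 0) is True   # size-triggered
assert should_flush(0, 31 * 60) is True             # age-triggered
assert should_flush(1024, 60) is False              # neither limit hit

translog = ["op-1", "op-2"]
assert flush(["seg-0"], translog) == ["seg-0"]
assert translog == []                               # truncated after the flush
```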

Primary-Replica Replication: The Write Fan-Out #

[Figure: primary-replica write fan-out with sequence numbers]

Once the primary shard acknowledges a write to its translog, it fans the operation out to all replica shards in parallel. This is document replication — the primary forwards the full write operation to each replica, which independently applies it to its own translog and Lucene buffer.

The sequence for a replicated write:

  1. Write arrives at coordinating node, forwarded to primary shard.
  2. Primary validates the operation (mapping compatibility, routing).
  3. Primary writes to its own translog and in-memory buffer.
  4. Primary forwards the operation to all replicas in parallel.
  5. Each replica writes to its own translog.
  6. Primary waits for acknowledgment from all in-sync replicas (ISR).
  7. Primary acknowledges to the coordinating node; client receives response.

Replicas that fall too far behind are removed from the ISR. A write is only as durable as the replicas in the ISR — if the ISR shrinks to just the primary, the write is acknowledged with single-copy durability.
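The fan-out and the ISR-shrink rule can be sketched together. This is a deliberately simplified toy (in reality the primary retries and the cluster manager arbitrates ISR membership; here an unresponsive replica is dropped immediately):

```python
class Replica:
    """Toy replica: applies forwarded ops to its own translog, or fails."""

    def __init__(self, healthy=True):
        self.healthy = healthy
        self.translog = []

    def apply(self, op):
        if self.healthy:
            self.translog.append(op)
        return self.healthy


def replicate(op, primary_translog, replicas, in_sync):
    """Toy primary: write locally, fan out, ack once every ISR member acks."""
    primary_translog.append(op)                            # step 3
    acked = {n for n in in_sync if replicas[n].apply(op)}  # steps 4-5
    in_sync.intersection_update(acked)                     # laggards leave the ISR
    return "acknowledged"                                  # step 7


replicas = {"r1": Replica(), "r2": Replica(healthy=False)}
in_sync = {"r1", "r2"}
primary = []
replicate("op-1", primary, replicas, in_sync)
assert in_sync == {"r1"}                   # r2 dropped from the ISR
assert replicas["r1"].translog == ["op-1"]
```

The last assertion is the durability caveat from the text: after the ISR shrinks, the acknowledged write exists on the primary and r1 only.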

Sequence Numbers and Checkpoints: Tracking Replication Position #

OpenSearch tracks replication progress with two per-shard counters:

  • Local checkpoint — the highest sequence number for which all operations up to that point have been acknowledged on this shard. Operations may arrive out of order; the local checkpoint only advances when there are no gaps.
  • Global checkpoint — the highest sequence number for which all operations up to that point have been acknowledged on all active replicas. The global checkpoint advances when all in-sync replicas report their local checkpoints to the primary.
Primary:  seq 1, 2, 3, 4, 5  → local checkpoint = 5
Replica1: seq 1, 2, 3         → local checkpoint = 3
Replica2: seq 1, 2, 3, 4      → local checkpoint = 4

Global checkpoint = min(5, 3, 4) = 3

The global checkpoint is the safe truncation point for the translog across all copies. Operations below the global checkpoint are no longer needed for replica recovery — every replica already has them. Operations above the global checkpoint must be retained for replica catch-up.

This is the mechanism that enables fast replica recovery: when a replica rejoins after a brief outage, it does not need to copy all segments from the primary. It requests only the operations above its last known local checkpoint, replaying from the translog rather than re-copying the full index.
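Both counters are easy to state precisely in code. A sketch (seq nos start at 1 here to match the example above; a real shard tracks acknowledged seq nos in a compact bitset-like structure, not a Python set):

```python
def local_checkpoint(acked_seq_nos):
    """Highest seq no below which this shard has no gaps."""
    cp = 0
    while cp + 1 in acked_seq_nos:   # ops can arrive out of order;
        cp += 1                      # the checkpoint stops at the first gap
    return cp


def global_checkpoint(local_checkpoints):
    """Safe translog truncation point: the min over all in-sync copies."""
    return min(local_checkpoints)


assert local_checkpoint({1, 2, 3, 4, 5}) == 5   # primary in the example
assert local_checkpoint({1, 2, 3, 5}) == 3      # a gap at 4 holds it back
assert global_checkpoint([5, 3, 4]) == 3        # the example from the text
```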

Segment Replication: An Alternative Model #

OpenSearch 2.x introduced segment replication as an alternative to document replication. Instead of each replica independently applying write operations, the primary builds its Lucene segments and copies the finished segment files directly to replicas.

[Figure: segment replication vs document replication]

|                        | Document replication            | Segment replication                        |
| ---------------------- | ------------------------------- | ------------------------------------------ |
| CPU on replicas        | High — each replica re-indexes  | Low — replicas copy pre-built files        |
| Write throughput       | Lower                           | Higher (primary does all indexing work)    |
| Replica visibility lag | Same as primary                 | Slightly higher — wait for segment copy    |
| Replica recovery       | Replay translog from checkpoint | Copy segments from primary or remote store |

Segment replication shifts CPU cost from replicas to the primary and increases the primary’s network I/O. It is most beneficial for read-heavy workloads with many replicas where replica indexing CPU was the bottleneck.

The NRT Visibility Gap: What refresh=wait_for Solves #

By default, a write acknowledged by OpenSearch is not immediately searchable. For most workloads this is acceptable — a 1-second visibility lag is invisible to users. For use cases where a write must be immediately queryable (write-then-read workflows), OpenSearch provides the refresh parameter on index requests:

PUT /orders/_doc/1?refresh=wait_for
{
  "order_id": "A123",
  "status": "pending"
}

| refresh value   | Behavior                                                               | Use case                                  |
| --------------- | ---------------------------------------------------------------------- | ----------------------------------------- |
| false (default) | Acknowledged without refresh. Visible within the next refresh_interval.| All normal writes                         |
| true            | Forces an immediate refresh on all receiving shards. Visible instantly.| Low-throughput, correctness-critical writes |
| wait_for        | Blocks until the next scheduled refresh includes this write.           | Write-then-read workflows at scale        |

refresh=wait_for is the correct choice for write-then-read workflows at scale. refresh=true is a correctness hammer that creates a new segment on every write — use it indiscriminately and write throughput collapses.
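The throughput difference between the two "immediately searchable" options comes down to segment count. A toy model of a bulk load under each policy (illustrative names; real refresh scheduling and merging are more involved):

```python
def bulk_index(docs, refresh):
    """Toy shard: count segments produced by N writes under each policy."""
    buffer, segments = [], 0
    for doc in docs:
        buffer.append(doc)
        if refresh == "true":
            segments += 1        # refresh=true cuts one tiny segment per
            buffer.clear()       # write: maximum merge pressure
    if refresh == "wait_for" and buffer:
        segments += 1            # wait_for riders share the next scheduled
        buffer.clear()           # refresh: one segment for the whole batch
    return segments


assert bulk_index(range(1000), refresh="true") == 1000
assert bulk_index(range(1000), refresh="wait_for") == 1
```

One thousand tiny segments versus one is the "correctness hammer" in numbers: the writes are equally durable either way; only the segment churn differs.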