The Index as a Distributed Object
A search index is not a file. It is not a table. It is a distributed agreement — a set of guarantees about where documents live, how many copies exist, and what happens when a node disappears. Understanding OpenSearch requires internalizing this distinction first, because every other protocol in the system derives from it.
The Document as the Unit of State #
The fundamental unit in OpenSearch is the document — a JSON object with a set of fields. Documents are not rows in a mutable table. They are immutable once written to a segment. An “update” is actually a new version written to a new segment; the old version is marked deleted and eventually reclaimed during segment merging. This immutability is not an implementation detail — it is what makes the read path safe without locks.
{
  "name": "John Doe",
  "gpa": 3.89,
  "grad_year": 2022
}
The Index as a Policy Envelope #
An index is not merely a collection of documents. It is a policy envelope — a set of guarantees that apply uniformly to every document within it. When you create an index, you are making three commitments:
- number_of_shards — how many parallel units of storage the index is split into. Fixed at creation time. Cannot be changed without reindexing.
- number_of_replicas — how many copies of each shard exist. Changeable at runtime. Controls both durability and read throughput.
- refresh_interval — how frequently writes become visible to search. Controls the latency/throughput tradeoff on the write path.
PUT /orders
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1,
    "index.refresh_interval": "1s"
  }
}
These three parameters are not tuning knobs — they are protocol commitments. number_of_shards determines the routing function. number_of_replicas determines the write acknowledgment quorum. refresh_interval determines when writes cross the visibility boundary. Changing them mid-flight changes the protocol, which is why number_of_shards is immutable.
Shards: The Unit of Distribution, Ordering, and Failure #
A shard is to OpenSearch what a partition is to Kafka: the unit at which ordering is preserved, work is distributed, and failure is isolated. Each shard is a complete Lucene index — a running process consuming CPU and memory. This is a critical implication: shard count is not free. A 400GB index split into 1,000 shards creates 1,000 Lucene instances. The practical constraint is shard size between 10–50GB, with the cluster manager enforcing a maximum of 1,000 shards per node by default.
Routing: how a document finds its shard. Every document is assigned to a shard deterministically:
shard = hash(document_id) % number_of_shards
This function is what makes number_of_shards immutable. If you change the denominator, every document routes to a different shard — the entire index must be rebuilt. Custom routing keys allow you to override this, co-locating related documents on the same shard. This matters when queries consistently filter by a specific field: if all documents for customer_id=42 land on shard 2, a query filtered by that customer can be directed to shard 2 alone rather than fanning out to all shards.
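A quick way to see why the denominator is load-bearing is to simulate the routing function. The sketch below uses MD5 purely as a stable stand-in hash (OpenSearch actually hashes the routing key with Murmur3); the modulo step, not the hash choice, is what matters:

```python
import hashlib

def route(doc_id: str, number_of_shards: int) -> int:
    # Stable hash of the routing key (the document id by default).
    h = int(hashlib.md5(doc_id.encode()).hexdigest(), 16)
    return h % number_of_shards

docs = [f"order-{i}" for i in range(1000)]

# Route every document under 3 shards, then again under 4.
before = {d: route(d, 3) for d in docs}
after = {d: route(d, 4) for d in docs}

moved = sum(1 for d in docs if before[d] != after[d])
print(f"{moved}/1000 documents change shards")  # roughly three in four move
```

Changing the shard count from 3 to 4 reroutes roughly 75% of documents — which is why the only way to change number_of_shards is to rebuild the index.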
Primary and Replica: Not Just Backup #
The distinction between primary and replica shards is commonly misread as “primary writes, replica is a backup.” The reality is more precise:
- Primary shard — the authority for writes. All indexing operations go to the primary first. The primary is responsible for validating, applying, and forwarding writes to replicas.
- Replica shard — a synchronized copy that can serve reads. Replicas are not read-only caches; they receive every write the primary receives and maintain an identical state.
The implication: number_of_replicas directly controls read throughput. A search-heavy workload with number_of_replicas: 2 has three copies of every shard available to serve queries — three times the read parallelism of a zero-replica index. Adding replicas is the search equivalent of read replicas in a database, except the replication is synchronous on the write path.
OpenSearch guarantees that primary and replica shards of the same index are never placed on the same node. A two-node cluster with one replica per shard survives the loss of either node with no data loss. A single-node cluster with replicas configured will have all replicas remain UNASSIGNED — the cluster manager will not violate placement constraints even if it means replicas go unallocated.
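The placement constraint can be made concrete with a toy allocator. This is a sketch under my own simplified model — round-robin placement with one rule, never put two copies of the same shard on one node — not the real OpenSearch allocation algorithm:

```python
def allocate(shards: int, replicas: int, nodes: list) -> dict:
    """Place primary + replica copies of each shard across nodes.
    Copies of the same shard never share a node; copies that cannot
    be placed are simply left unassigned (illustrative sketch only)."""
    placement = {}
    for s in range(shards):
        used = []
        for copy in range(1 + replicas):
            # Candidates are nodes not already holding a copy of this shard.
            candidates = [n for n in nodes if n not in used]
            if candidates:
                used.append(candidates[(s + copy) % len(candidates)])
            # else: this copy stays UNASSIGNED rather than violate the rule
        placement[s] = used
    return placement

# Two nodes, one replica: every shard gets two copies, one per node.
print(allocate(3, 1, ["node-a", "node-b"]))
# One node, one replica: each replica stays unassigned.
print(allocate(3, 1, ["node-a"]))
```

The single-node case reproduces the behavior described above: the allocator would rather leave replicas unassigned than co-locate them with their primary.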
The Cluster Manager: Coordinator, Not Bottleneck #
The cluster manager node holds the cluster state — the authoritative map of which shards live on which nodes, which nodes are alive, and what the current index mappings are. It does not participate in indexing or search. All data-plane operations (writes, queries) bypass it entirely.
The cluster manager’s role is:
- Shard allocation — deciding where to place shards when indexes are created or nodes join/leave
- Metadata management — propagating mapping changes and index settings to all nodes
- Failure response — detecting node loss and promoting replica shards to primary when primaries go missing
Any node in the cluster can receive a client request. If it does not hold the relevant shard, it acts as a coordinating node — forwarding the request to the shard’s node, collecting the response, and returning it to the client. The cluster manager is not in this path.
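The coordinating role is a scatter-gather loop. Here is a minimal sketch of that loop with a fake per-shard search backend standing in for the data nodes — `coordinate`, `search_one_shard`, and the toy data are all mine, and real coordination also handles per-shard sorting, pagination, and failover between copies:

```python
def coordinate(query, shard_copies, search_one_shard):
    # Scatter: every shard must be consulted exactly once, via any copy.
    partials = [search_one_shard(copies[0], shard, query)
                for shard, copies in shard_copies.items()]
    # Gather: merge the per-shard hit lists into one globally sorted result.
    hits = [h for p in partials for h in p]
    return sorted(hits, key=lambda h: h["score"], reverse=True)

# Toy backend: each shard holds its own slice of the index.
fake_index = {
    0: [{"id": "a", "score": 0.9}],
    1: [{"id": "b", "score": 1.4}, {"id": "c", "score": 0.2}],
}
def search_one_shard(node, shard, query):
    return fake_index[shard]

# Shard -> list of nodes holding a copy (primary first, say).
shard_copies = {0: ["node-a", "node-b"], 1: ["node-b", "node-a"]}
result = coordinate("q", shard_copies, search_one_shard)
print([h["id"] for h in result])  # prints ['b', 'a', 'c']
```

Note what is absent from the sketch: the cluster manager. The coordinating node only needs the shard-to-node map, which every node holds a copy of.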
wait_for_active_shards: The Durability Ladder #
When you create an index (or issue any write) with wait_for_active_shards set, OpenSearch does not return until the required number of shard copies are active:
PUT /orders?wait_for_active_shards=all
wait_for_active_shards=1 (the default) means only the primary must be active — replicas may still be initializing. A write acknowledged with this setting is durable only to the degree the primary is durable. wait_for_active_shards=all means every configured copy must be active before the operation proceeds — stronger durability guarantee, higher latency cost during node recovery.
This is the search equivalent of Kafka’s acks ladder: the same tradeoff between write latency and durability guarantee, expressed as the number of copies that must acknowledge before the client receives a response.
| wait_for_active_shards | Durability guarantee | Cost |
|---|---|---|
| 1 (default) | Primary only — replica loss is undetected | Lowest latency |
| 2 | Primary + 1 replica | Survives single replica failure |
| all | All configured copies | Highest durability, blocks on slow replicas |
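The acknowledgment rule in the table reduces to a single comparison. This is a sketch of the check only (`enough_active` is my name, not an OpenSearch API), and the real write path waits for copies to activate and times out rather than failing immediately:

```python
def enough_active(wait_for: str, active_copies: int, number_of_replicas: int) -> bool:
    """May a write proceed under this wait_for_active_shards setting?
    active_copies counts the primary plus currently-active replicas."""
    total = 1 + number_of_replicas          # primary + configured replicas
    required = total if wait_for == "all" else int(wait_for)
    return active_copies >= required

# With one replica still initializing (2 of 3 copies active):
print(enough_active("1", 2, 2))    # default: primary is up, proceed
print(enough_active("all", 2, 2))  # "all": block until the third copy is active
```

The "higher latency during node recovery" cost follows directly: under `all`, a single initializing replica is enough to stall every write to that shard.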