Query DSL as a Semantic Protocol #

The OpenSearch Query DSL is a JSON-encoded protocol for expressing search intent. It distinguishes between two fundamentally different operations: filtering (does this document match?) and scoring (how well does this document match?). Confusing the two is a common source of trouble: scoring where filtering was intended wastes work on every matching document, and filtering where scoring was needed discards the ranking signal.

Query Context vs Filter Context #

Every clause in a search request executes in one of two contexts:

Query context: the clause both determines whether a document matches and contributes to the relevance score (_score). This is the more expensive path, because a score must be computed for each matching document and the result cannot be reduced to a reusable bitset.

Filter context: the clause only determines whether a document matches. No score is computed. The result is a bitset per segment, a binary mask indicating which documents pass. Bitsets for frequently used filters can be cached and reused across queries that share the same filter clause.

The practical rule: use filter context for binary conditions (date ranges, status codes, boolean flags, IDs). Use query context only where score contribution matters (the user-visible ranked result).

GET /orders/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "description": "payment failed" } }
      ],
      "filter": [
        { "term":  { "status": "pending" } },
        { "range": { "created": { "gte": "2024-01-01" } } }
      ]
    }
  }
}

The description match runs in query context, so it contributes to _score. The status and created clauses run in filter context: they prune without scoring, and their results are eligible for caching.
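The caching behaviour can be pictured with a small model. The sketch below is a toy, not OpenSearch's implementation: the Segment class, the string cache key, and the evaluations counter are all hypothetical names introduced for illustration. It only shows why a yes/no result per document is reusable while a score is not.

```python
from typing import Callable, Dict, List

Doc = Dict[str, object]


class Segment:
    """Toy stand-in for an immutable index segment (hypothetical model)."""

    def __init__(self, docs: List[Doc]) -> None:
        self.docs = docs
        self.cache: Dict[str, List[bool]] = {}  # filter key -> bitset
        self.evaluations = 0  # how many times a filter was really evaluated

    def filter_bitset(self, key: str, predicate: Callable[[Doc], bool]) -> List[bool]:
        # A filter answers only yes/no per document, so the answer can be
        # stored once per segment and reused by any query with the same key.
        if key not in self.cache:
            self.evaluations += 1
            self.cache[key] = [predicate(doc) for doc in self.docs]
        return self.cache[key]


def is_pending(doc: Doc) -> bool:
    return doc["status"] == "pending"


segment = Segment([
    {"status": "pending", "description": "payment failed"},
    {"status": "shipped", "description": "payment failed twice"},
    {"status": "pending", "description": "refund issued"},
])

first = segment.filter_bitset("term:status=pending", is_pending)
second = segment.filter_bitset("term:status=pending", is_pending)
```

The second call returns the stored bitset without touching the documents. A scoring clause has no equivalent shortcut, because its output is a number that depends on the query text.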

Term-Level Queries: Exact Match Without Analysis #

Term-level queries operate on the indexed value directly. They do not analyze the query string. Use them for keyword, numeric, date, and boolean fields.

// Exact term match
{ "term": { "status": { "value": "pending" } } }

// Multiple exact values (OR)
{ "terms": { "status": ["pending", "processing"] } }

// Numeric or date range
{ "range": { "amount": { "gte": 100, "lte": 500 } } }

// Field existence
{ "exists": { "field": "cancelled_at" } }

// Prefix on keyword field
{ "prefix": { "order_id": { "value": "ORD-2024" } } }

// Wildcard — expensive, avoid on high-cardinality fields
{ "wildcard": { "order_id": { "value": "ORD-2024-*" } } }

Sending a term query to a text field compares the query value to indexed tokens. If the field was lowercased during analysis but the query string is "Pending" (capitalized), the term query finds nothing — the indexed token is pending. This is why term queries on analyzed fields produce surprising results: the analysis chain is not applied to the query value.
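The mismatch is easy to reproduce with a toy inverted index. This is a minimal sketch under stated assumptions: analyze is a hypothetical analyzer that splits on non-alphanumeric characters and lowercases, and term_query and match_query are simplified stand-ins for the real queries, not their implementation.

```python
import re
from collections import defaultdict


def analyze(text):
    """Toy analyzer (assumption): split on non-alphanumerics, then lowercase."""
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]


def build_index(docs):
    """Map each indexed token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyze(text):
            index[token].add(doc_id)
    return index


def term_query(index, value):
    # The value is looked up verbatim: no analysis is applied.
    return index.get(value, set())


def match_query(index, text):
    # The query string goes through the same analyzer as the indexed text.
    hits = set()
    for token in analyze(text):
        hits |= index.get(token, set())
    return hits


index = build_index({1: "Pending review", 2: "Shipped"})

term_hits_capitalized = term_query(index, "Pending")  # no indexed token "Pending"
term_hits_lowercase = term_query(index, "pending")
match_hits = match_query(index, "Pending")
```

Only the capitalized term lookup comes back empty: the index holds pending, and nothing lowercases the query value on the way in.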

Full-Text Queries: Analysis Before Matching #

Full-text queries analyze the query string before executing the search. By default they use the analyzer configured for the target field, or its search_analyzer if one is set. They are designed for text fields.

match — the most common full-text query. The query string is analyzed; the resulting tokens are searched with OR semantics by default.

{
  "match": {
    "description": {
      "query": "payment processing failed",
      "operator": "and"
    }
  }
}

With operator: "and", all tokens must appear. With operator: "or" (default), any token matching raises the score. The document does not need to contain all tokens.

match_phrase — tokens must appear in order with no gaps (or bounded gaps with slop).

{ "match_phrase": { "description": { "query": "payment failed", "slop": 1 } } }

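slop: 1 allows a term to sit one position away from where the phrase expects it, so “payment has failed” also matches. Swapping two adjacent terms costs two moves: “failed payment” matches only with slop: 2 or higher.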
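Slop can be read as a budget of position moves. The sketch below is a simplified model of sloppy phrase matching, not Lucene's actual code: required_slop is a hypothetical helper that shifts each matched position by the term's offset in the query and measures how far apart the shifted values are. Under this model a reversed two-term phrase costs 2.

```python
from itertools import product


def required_slop(query_terms, doc_terms):
    """Smallest slop at which the phrase matches, or None if a term is absent.

    Simplified model: repeated query terms are not handled specially.
    """
    positions = [
        [i for i, token in enumerate(doc_terms) if token == term]
        for term in query_terms
    ]
    if any(not found for found in positions):
        return None
    best = None
    for combo in product(*positions):
        # Subtract each term's offset in the query from its document position.
        # For an exact phrase every shifted value is identical (width 0).
        shifted = [pos - offset for offset, pos in enumerate(combo)]
        width = max(shifted) - min(shifted)
        best = width if best is None else min(best, width)
    return best


exact = required_slop(["payment", "failed"], ["payment", "failed"])
one_gap = required_slop(["payment", "failed"], ["payment", "has", "failed"])
reversed_order = required_slop(["payment", "failed"], ["failed", "payment"])
```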

multi_match — apply a match query across multiple fields.

{
  "multi_match": {
    "query": "payment failed",
    "fields": ["description", "notes^2", "error_message"],
    "type": "best_fields"
  }
}

^2 boosts the notes field’s contribution. type: best_fields scores by the single best-matching field (useful when documents may have the query in any of several fields but you want to rank by the strongest signal). type: most_fields sums scores across fields. type: cross_fields treats fields as one combined field (useful for person names split across first_name and last_name).
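The difference between best_fields and most_fields comes down to how per-field scores are combined. The sketch below assumes the per-field scores have already been computed (the numbers are made up for illustration) and models only the combination step. tie_breaker is the real multi_match parameter that lets the non-winning fields contribute a fraction of their score; cross_fields is not modelled here.

```python
def best_fields(field_scores, tie_breaker=0.0):
    """Take the strongest field; optionally add a fraction of the others."""
    top = max(field_scores.values())
    rest = sum(field_scores.values()) - top
    return top + tie_breaker * rest


def most_fields(field_scores):
    """Add up the score from every matching field."""
    return sum(field_scores.values())


# Hypothetical per-field scores for one document, boosts already applied.
scores = {"description": 1.5, "notes": 4.0, "error_message": 0.5}

best = best_fields(scores)
best_with_tie = best_fields(scores, tie_breaker=0.25)
most = most_fields(scores)
```

With best_fields, a document that matches strongly in one field outranks one that matches weakly in several. With most_fields, breadth of matching is rewarded.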

The Bool Query: Algebraic Composition #

The bool query is the primary composition mechanism. It has four clauses:

Bool query clause semantics and scoring

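| Clause   | Semantics                                       | Scoring | Filter cache |
|----------|-------------------------------------------------|---------|--------------|
| must     | AND: document must match, contributes to score  | Yes     | No           |
| should   | OR: boosts score if it matches                  | Yes     | No           |
| must_not | NOT: document must not match                    | No      | Yes          |
| filter   | AND: document must match, no scoring            | No      | Yes          |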

must and filter both require the document to match — the difference is purely whether the clause contributes to _score. For binary conditions that do not affect ranking, always use filter.

should without must or filter at the top level requires at least one should clause to match (controlled by minimum_should_match, default 1 for a standalone should). When must or filter clauses are present, should becomes optional — it only adds to the score without gating inclusion.

{
  "bool": {
    "must": [
      { "match": { "description": "payment" } }
    ],
    "should": [
      { "term": { "priority": "high" } },
      { "range": { "amount": { "gte": 1000 } } }
    ],
    "must_not": [
      { "term": { "status": "cancelled" } }
    ],
    "filter": [
      { "range": { "created": { "gte": "now-30d" } } }
    ],
    "minimum_should_match": 1
  }
}

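Here minimum_should_match: 1 overrides the default: even though must and filter clauses are present, at least one should clause has to match for the document to be included. Omit it and the should clauses only influence the score.

bool queries nest: any clause can itself be a bool, enabling arbitrary logical expressions.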
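The inclusion rules (ignoring scoring) fit in a few lines. This is a sketch of the semantics described above, not OpenSearch code: bool_matches is a hypothetical helper, clauses are modelled as plain Python predicates, and the parameter is spelled filter_ only to avoid shadowing the Python builtin.

```python
def bool_matches(doc, must=(), should=(), must_not=(), filter_=(),
                 minimum_should_match=None):
    """Decide whether doc is included by a bool query (scoring is ignored)."""
    if minimum_should_match is None:
        # Default: should clauses gate inclusion only when nothing else does.
        minimum_should_match = 1 if should and not (must or filter_) else 0
    if not all(clause(doc) for clause in list(must) + list(filter_)):
        return False
    if any(clause(doc) for clause in must_not):
        return False
    matched_should = sum(1 for clause in should if clause(doc))
    return matched_should >= minimum_should_match


doc = {"description": "payment retry", "priority": "low",
       "amount": 50, "status": "pending"}

has_payment = lambda d: "payment" in d["description"].split()
is_high = lambda d: d["priority"] == "high"
is_large = lambda d: d["amount"] >= 1000
is_cancelled = lambda d: d["status"] == "cancelled"

# should is optional next to must: the document is included.
optional_should = bool_matches(doc, must=[has_payment],
                               should=[is_high, is_large],
                               must_not=[is_cancelled])

# An explicit minimum_should_match: 1 makes one should clause mandatory.
required_should = bool_matches(doc, must=[has_payment],
                               should=[is_high, is_large],
                               must_not=[is_cancelled],
                               minimum_should_match=1)

# A standalone should needs at least one clause to match.
standalone_should = bool_matches(doc, should=[is_high, is_large])
```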

Relevance Scoring: BM25 #

The default scoring algorithm is BM25, a probabilistic ranking function that weighs:

  • TF (Term Frequency) — how often the term appears in the document. More occurrences → higher score, but with diminishing returns (controlled by parameter k1, default 1.2).
  • IDF (Inverse Document Frequency) — how rare the term is across the index. Rarer terms → higher score.
  • Field length normalization — shorter fields score higher for the same term count (controlled by parameter b, default 0.75).
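The three factors combine into a per-term score. The function below is a sketch of the textbook BM25 formula with a Lucene-style IDF; it is an assumption that it matches any given OpenSearch version exactly, since implementations differ in constant factors (for example whether the k1 + 1 numerator is kept), which changes absolute scores but not ranking. The name bm25_term_score and its parameters are illustrative.

```python
import math


def bm25_term_score(freq, doc_len, avg_doc_len, doc_count, docs_with_term,
                    k1=1.2, b=0.75):
    """Score one term in one document using textbook BM25."""
    # IDF: rarer terms (small docs_with_term) get a larger weight.
    idf = math.log(1 + (doc_count - docs_with_term + 0.5) / (docs_with_term + 0.5))
    # Length normalization: b controls how strongly long fields are penalized.
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    # TF saturation: each extra occurrence adds less than the previous one.
    tf = freq * (k1 + 1) / (freq + norm)
    return idf * tf


once = bm25_term_score(1, doc_len=10, avg_doc_len=10, doc_count=1000, docs_with_term=50)
twice = bm25_term_score(2, doc_len=10, avg_doc_len=10, doc_count=1000, docs_with_term=50)
thrice = bm25_term_score(3, doc_len=10, avg_doc_len=10, doc_count=1000, docs_with_term=50)
short_field = bm25_term_score(1, doc_len=5, avg_doc_len=10, doc_count=1000, docs_with_term=50)
rare_term = bm25_term_score(1, doc_len=10, avg_doc_len=10, doc_count=1000, docs_with_term=5)
```

Comparing the values shows each effect in isolation: the gain from the second occurrence is larger than the gain from the third, the shorter field outscores the average-length one, and the rarer term outscores the common one.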

The explain parameter exposes the full scoring calculation:

GET /orders/_search
{
  "explain": true,
  "query": { "match": { "description": "payment" } }
}

This returns a tree of score contributions per clause per document — indispensable for debugging unexpected ranking.

Controlling Relevance: boost, function_score, script_score #

boost — multiplies the score contribution of a specific clause.

{
  "bool": {
    "should": [
      { "match": { "title": { "query": "payment", "boost": 3 } } },
      { "match": { "description": { "query": "payment" } } }
    ]
  }
}

The score of the title clause is multiplied by 3 before the two clause scores are summed, so a match in title weighs three times as much as it would unboosted.

function_score — wraps a query and modifies scores using one or more functions. The canonical use case is recency decay: documents become less relevant as they age.

{
  "function_score": {
    "query": { "match": { "description": "payment" } },
    "functions": [
      {
        "gauss": {
          "created": {
            "origin": "now",
            "scale": "7d",
            "decay": 0.5
          }
        }
      }
    ],
    "boost_mode": "multiply"
  }
}

The gauss decay function produces a score multiplier that falls from 1.0 at the origin to 0.5 at the scale boundary. boost_mode: multiply multiplies the BM25 score by the decay factor.
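The shape of the curve follows from choosing the variance so that the multiplier equals decay exactly at a distance of scale from the origin. A minimal sketch of that formula, with distances expressed in days as an illustrative assumption (gauss_decay is a hypothetical helper, and offset mirrors the optional parameter of the same name):

```python
import math


def gauss_decay(distance, scale, decay=0.5, offset=0.0):
    """Gaussian decay multiplier for a value `distance` away from the origin."""
    # Pick sigma^2 so that the multiplier is exactly `decay` at distance == scale.
    sigma_sq = -(scale ** 2) / (2 * math.log(decay))
    # Values within `offset` of the origin are not penalized at all.
    effective = max(0.0, abs(distance) - offset)
    return math.exp(-(effective ** 2) / (2 * sigma_sq))


at_origin = gauss_decay(0, scale=7)
at_scale = gauss_decay(7, scale=7)
at_double_scale = gauss_decay(14, scale=7)
```

Because the exponent grows with the square of the distance, a document twice as old as the scale keeps decay to the fourth power of its score (0.0625 here), not a quarter.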

script_score — arbitrary score computation in Painless script. Flexible but expensive — scripts execute per matching document on each shard.

Nested Queries: Querying Correlated Object Arrays #

Arrays of objects in OpenSearch are flattened into the parent document by default. For a document with multiple line items:

{ "order_id": "A1", "items": [
  { "sku": "X", "qty": 2 },
  { "sku": "Y", "qty": 5 }
]}

A query for items.sku == "X" AND items.qty > 4 would match this document even though no single item satisfies both conditions: sku X appears in one item and qty > 4 appears in another. The correlation between sku and qty within the same item is lost in the flat inverted index.

The nested field type preserves the per-item relationship by indexing each array element as a hidden separate document:

PUT /orders/_mapping
{
  "properties": {
    "items": {
      "type": "nested",
      "properties": {
        "sku": { "type": "keyword" },
        "qty": { "type": "integer" }
      }
    }
  }
}

A nested query then targets that path and evaluates both conditions against each item:

{
  "nested": {
    "path": "items",
    "query": {
      "bool": {
        "must": [
          { "term":  { "items.sku": "X" } },
          { "range": { "items.qty": { "gt": 4 } } }
        ]
      }
    }
  }
}

The nested query executes against each hidden nested document independently. Only items where both conditions hold on the same item contribute to a match. The cost: nested documents are stored separately in Lucene, increasing index size and query overhead.
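The difference between the two behaviours can be reproduced on the example order. This is a toy model, not how Lucene stores data: flatten, flat_match, and nested_match are hypothetical helpers that mimic the flattened field view and the per-item evaluation.

```python
def flatten(order):
    """Mimic default object-array indexing: one value list per dotted field."""
    flat = {}
    for item in order["items"]:
        for key, value in item.items():
            flat.setdefault(f"items.{key}", []).append(value)
    return flat


def flat_match(order):
    # Each condition is checked against the whole value list independently,
    # so the two conditions may be satisfied by different items.
    flat = flatten(order)
    return "X" in flat["items.sku"] and any(qty > 4 for qty in flat["items.qty"])


def nested_match(order):
    # Both conditions must hold on the same item.
    return any(item["sku"] == "X" and item["qty"] > 4 for item in order["items"])


order = {"order_id": "A1", "items": [
    {"sku": "X", "qty": 2},
    {"sku": "Y", "qty": 5},
]}

flattened = flatten(order)
matches_when_flat = flat_match(order)
matches_when_nested = nested_match(order)
```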

Constant Score and Boosting Queries #

constant_score — wraps a filter and assigns a fixed score to all matching documents. Useful for ranking multiple conditions at discrete score levels without BM25 interference.

{
  "bool": {
    "should": [
      {
        "constant_score": {
          "filter": { "term": { "tier": "premium" } },
          "boost": 10
        }
      },
      {
        "constant_score": {
          "filter": { "term": { "tier": "standard" } },
          "boost": 5
        }
      },
      { "match": { "description": "payment" } }
    ]
  }
}

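Premium-tier documents score 10 points from the tier clause and standard-tier documents score 5. Any document that also matches description gets its BM25 score added on top. Because the bool contains only should clauses, a document must match at least one of the three to be returned.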

boosting query — promotes documents matching a positive query while demoting documents also matching a negative query, using a negative_boost multiplier.

{
  "boosting": {
    "positive": { "match": { "description": "payment" } },
    "negative": { "term": { "status": "test" } },
    "negative_boost": 0.1
  }
}

Documents flagged as test still appear in results but with 10% of their normal score — they are pushed toward the bottom rather than excluded.