Document Summarization, RAG, and Grounding in Rust — No Legacy Is So Rich as Honesty
The previous post built autonomous agents that could chain tool calls and retrieve context from a vector store. That infrastructure holds for well-scoped queries against a curated document set. Then the enterprise use cases arrived: a hundred-page policy PDF, a corporate wiki with thousands of articles, and a user who trusted a confidently stated AI answer that turned out to be entirely fabricated.
Each of these is a different failure mode. This post addresses all three.
Act I: The 100-Page PDF
The first attempt was the obvious one. A user uploaded a dense technical specification — about a hundred pages — and asked for a summary. We extracted the text and sent it to the model in a single prompt. The API returned context_length_exceeded before the model saw a word of it.
The second attempt was sequential summarization: split the document into chunks, summarise each chunk in order, combine. This worked. It also took four and a half minutes for a hundred pages, at which point the HTTP connection had timed out and the user had refreshed the page.
The chunks are independent. Each one can be summarised without knowing what the others say. This is a parallel workload being processed sequentially — the simplest kind of performance problem to fix.
The MapReduce approach maps that structure directly. The “Map” step summarises each chunk concurrently. The “Reduce” step combines the intermediate summaries into a final overview. Because each chunk goes to the model independently, all of them can be in flight simultaneously:
```rust
use rig::providers::openai;
use futures::future::join_all;

pub async fn map_reduce_summarize(document_chunks: Vec<&str>) -> String {
    let client = openai::Client::from_env();
    let model = client.agent("gpt-4o").build();

    // MAP: summarise each chunk concurrently. Each future borrows the same
    // agent; all futures are awaited before `model` goes out of scope.
    let map_futures = document_chunks.into_iter().map(|chunk| {
        let agent = &model;
        async move {
            agent
                .prompt(&format!("Summarize this section concisely: {}", chunk))
                .await
                .unwrap()
        }
    });
    let chunk_summaries: Vec<String> = join_all(map_futures).await;

    // REDUCE: combine into a final cohesive summary
    let combined = chunk_summaries.join("\n\n");
    model
        .prompt(&format!(
            "Create a final, cohesive summary from these section notes:\n\n{}",
            combined
        ))
        .await
        .unwrap()
}
```
join_all from the futures crate runs all the map tasks concurrently under tokio. A hundred-page document that took four and a half minutes sequentially finishes in roughly the time it takes to summarise the longest single chunk. The reduce step is one additional call. No framework, no worker pool configuration — just async tasks and a combinator.
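The map step assumes the document has already been split into chunks. A minimal chunker can split on paragraph boundaries under a character budget — a sketch, where the function name and the budget are illustrative choices, not part of rig-rs:

```rust
/// Split raw text into chunks of at most `max_chars` characters,
/// breaking on paragraph boundaries. A single paragraph longer than
/// the budget is kept whole rather than split mid-sentence.
pub fn chunk_text(text: &str, max_chars: usize) -> Vec<String> {
    let mut chunks = Vec::new();
    let mut current = String::new();
    for paragraph in text.split("\n\n") {
        // Close the current chunk if this paragraph would exceed the budget.
        if !current.is_empty() && current.len() + paragraph.len() + 2 > max_chars {
            chunks.push(std::mem::take(&mut current));
        }
        if !current.is_empty() {
            current.push_str("\n\n");
        }
        current.push_str(paragraph);
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}
```

A token-aware splitter (counting model tokens rather than characters) would be more precise, but the paragraph-boundary heuristic keeps sections semantically intact, which matters more for summary quality than hitting an exact size.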
Act II: From Summaries to Questions
Summarisation solved the throughput problem. It did not solve the question-answering problem.
Users did not just want summaries. They wanted to ask specific questions: “What does section 4.2 say about data retention?” “Is there a policy on remote access for contractors?” Summarisation collapses the document — it answers “what is this about” but discards the detail needed to answer “what exactly does it say about X.”
The previous post introduced RAG for a small document set. The corporate wiki case is the same problem at a different scale: thousands of articles, updated continuously, queries that could touch any of them. The in-memory vector store from before is not viable. The retrieval architecture needs to be robust.
rig-rs handles the full pipeline — ingestion, embedding, storage, retrieval, augmented generation — without an orchestration framework:
```rust
use rig::providers::openai;
use rig::vector_store::{InMemoryVectorStore, VectorStore};
use rig::embeddings::EmbeddingsBuilder;

pub async fn chat_with_knowledge_base(user_query: &str) -> String {
    let client = openai::Client::from_env();
    let embedding_model = client.embedding_model("text-embedding-3-small");

    // Index: convert document chunks into semantic embeddings
    let embeddings = EmbeddingsBuilder::new(embedding_model.clone())
        .document("Servers are backed up nightly at 3:00 AM UTC.")
        .unwrap()
        .document("Remote access for contractors requires a signed NDA and IT approval.")
        .unwrap()
        .build()
        .await
        .unwrap();

    // Store: in production, swap for rig-qdrant or rig-lancedb
    let mut vector_store = InMemoryVectorStore::default();
    vector_store.add_documents(embeddings).await.unwrap();

    // Retrieve and generate: the agent fetches the top 2 relevant chunks
    // and injects them into the prompt context automatically
    let rag_agent = client.context_rag_agent("gpt-4o")
        .preamble("You are a helpful IT assistant. Answer questions based on the context provided.")
        .dynamic_context(2, vector_store.index(embedding_model))
        .build();

    rag_agent.prompt(user_query).await.unwrap()
}
```
The dynamic_context call handles retrieval and augmentation. On each prompt, it embeds the query, fetches the top N matching document chunks from the vector store, and injects them into the context before the model generates a response. For production scale, replacing InMemoryVectorStore with the rig-qdrant or rig-lancedb extension is a single line in the initialisation — the rest of the pipeline is unchanged.
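Under the hood, the retrieval step reduces to a top-k nearest-neighbour search over embedding vectors, typically by cosine similarity. A standalone sketch of that core operation — the toy vectors stand in for real model embeddings, and the function names are mine, not rig-rs API:

```rust
/// Cosine similarity between two equal-length vectors.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}

/// Return the indices of the `k` documents most similar to the query.
fn top_k(query: &[f32], docs: &[Vec<f32>], k: usize) -> Vec<usize> {
    let mut scored: Vec<(usize, f32)> = docs
        .iter()
        .enumerate()
        .map(|(i, d)| (i, cosine_similarity(query, d)))
        .collect();
    // Sort by similarity, highest first.
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    scored.into_iter().take(k).map(|(i, _)| i).collect()
}
```

This linear scan is exactly what an in-memory store does; what rig-qdrant and rig-lancedb add is an approximate-nearest-neighbour index so the search stays fast at thousands or millions of vectors.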
Act III: The Answer That Should Not Have Been
The RAG system was working. Queries returned relevant answers. Then a user filed a support ticket: they had asked the system about a security exception policy, received a detailed and confidently worded answer, acted on it, and discovered the policy described did not exist. The knowledge base contained nothing on that topic. The model had filled the gap with a plausible-sounding fabrication.
This is hallucination, and retrieval-augmented systems are not immune to it. When the retrieved chunks do not contain the answer, the model defaults to its training — and its training is full of text that sounds authoritative regardless of whether it is accurate for your specific domain.
The fix is grounding: explicitly constraining the model to answer only from the provided context, and requiring it to acknowledge when that context is insufficient. Rather than hoping the model will self-limit, we make the instruction concrete and the fallback response explicit:
```rust
use rig::providers::openai;

pub async fn strictly_grounded_agent(context: &str, user_query: &str) -> String {
    let client = openai::Client::from_env();

    let system_prompt = format!(
        "You are a meticulous librarian. Answer the user's question ONLY using the provided context.

Context:
{}

CRITICAL INSTRUCTION:
If the context does not contain sufficient information to answer the question,
you must reply EXACTLY with:
'I do not have sufficient context to answer this question. Would you like to rephrase or provide more detail?'
Do not speculate. Do not use knowledge outside the provided context.",
        context
    );

    let grounded_agent = client.agent("gpt-4o")
        .preamble(&system_prompt)
        .build();

    grounded_agent.prompt(user_query).await.unwrap()
}
```
The system prompt does two things. It establishes the constraint — answer only from context — and it specifies the exact fallback response the model must produce when the constraint cannot be satisfied. The user receives a clear admission of uncertainty rather than a confident wrong answer. “I do not have sufficient context” is useful. A fabricated policy is dangerous.
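Because the fallback is an exact string, the calling application can detect it and branch — escalate to a human, log the coverage gap, or invite the user to rephrase. A hypothetical helper along those lines (the enum and function are illustrative; only the sentinel string comes from the prompt above):

```rust
/// The exact fallback response demanded by the grounding prompt.
const INSUFFICIENT_CONTEXT: &str =
    "I do not have sufficient context to answer this question. \
     Would you like to rephrase or provide more detail?";

/// Outcome of a grounded query, as seen by the calling application.
pub enum GroundedAnswer {
    /// The model answered from the provided context.
    Answered(String),
    /// The model declined: the context did not cover the question.
    InsufficientContext,
}

/// Classify a model response by comparing it against the exact fallback string.
pub fn classify_response(response: String) -> GroundedAnswer {
    if response.trim() == INSUFFICIENT_CONTEXT {
        GroundedAnswer::InsufficientContext
    } else {
        GroundedAnswer::Answered(response)
    }
}
```

Exact-match sentinels are brittle if the model paraphrases, so the prompt's "reply EXACTLY" instruction and this check have to agree character for character — another reason to keep the fallback sentence short and distinctive.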
The title of this post comes from All’s Well That Ends Well: “No legacy is so rich as honesty.” A knowledge base is a legacy — it encodes what an organisation knows and believes. The grounding mechanism is what makes that legacy trustworthy. A system that admits the limits of its knowledge is more valuable than one that fills every gap with a confident guess.
What the Three Pieces Do Together
These capabilities address different points in the same pipeline:
- MapReduce summarisation handles documents too large for a single context window, using tokio concurrency to process chunks in parallel rather than sequentially
- RAG with rig-rs enables question-answering against a live knowledge base, with rig-qdrant and rig-lancedb providing production-grade storage behind the same interface
- Grounding prevents the model from substituting fabrication for missing context, surfacing uncertainty rather than hiding it
Rust’s async model makes the parallel summarisation straightforward. The rig-rs abstractions make the RAG pipeline composable without an orchestration framework. The grounding is a prompt constraint — but writing it explicitly, rather than assuming the model will self-regulate, is the difference between a system that is reliable and one that is merely optimistic.