
RAG, Memory, and Autonomous Agents in Rust — All the World's a Stage

7 mins

The previous post introduced embeddings and semantic search. We had a model that could understand meaning and a vector store to search it. The obvious next question was: what do we build with it?

The answer, in enterprise AI, is usually the same set of capabilities: chain operations together, retrieve documents on demand, remember what the user said three turns ago, and let the model decide on its own when to call which tool. In the JavaScript ecosystem these are LangChain problems. In Rust they turned out to be smaller problems than expected — and the solution was to stop looking for a framework and start writing code.

Act I: The Search for a Rust LangChain #

The first instinct when we needed to chain AI operations was to look for a Rust equivalent of LangChain. We found half-finished ports, abandoned crates, and a lot of README files that promised LCEL-style pipe operators and delivered nothing that compiled against current rig-rs.

We spent two days on this before stepping back. LangChain exists because JavaScript has no native way to express typed sequential async pipelines cleanly. Rust does. The async/await model with strict return types is exactly the primitive you need to chain AI operations — you just write the steps in order.

A “chain” in rig-rs is just an extractor feeding a typed struct into an agent:

use rig::completion::Prompt;
use rig::providers::openai;
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize, Debug)]
struct TranslationTask {
    target_language: String,
    text: String,
}

pub async fn run_translation_chain(user_input: &str) {
    let client = openai::Client::from_env();

    // Step 1: Parse intent into a typed struct
    let extractor = client.extractor::<TranslationTask>("gpt-4o").build();
    let task = extractor.extract(user_input).await.expect("Failed to parse task");

    // Step 2: Act on the parsed struct
    let agent = client.agent("gpt-4o")
        .preamble(&format!("Translate the following text into {}", task.target_language))
        .build();

    let result = agent.prompt(&task.text).await.unwrap();
    println!("Chain Output: {}", result);
}

The extractor’s output type is the next step’s input. The compiler enforces the contract. No DSL, no framework — just sequential await calls that Rust already knows how to check.
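The same compiler-enforced handoff works with any typed steps. A minimal sketch with plain synchronous functions standing in for the extractor and the agent (`parse_task` and `translate` are hypothetical stand-ins, not rig-rs APIs; the toy parser expects `"<language>: <text>"`):

```rust
// Stand-in for the typed struct the extractor produces.
struct TranslationTask {
    target_language: String,
    text: String,
}

// Step 1: parse free text into the typed struct (stand-in for extractor.extract).
fn parse_task(input: &str) -> Option<TranslationTask> {
    let (lang, text) = input.split_once(": ")?;
    Some(TranslationTask {
        target_language: lang.to_string(),
        text: text.to_string(),
    })
}

// Step 2: act on the struct (stand-in for agent.prompt).
fn translate(task: &TranslationTask) -> String {
    format!("[{}] {}", task.target_language, task.text)
}

// The chain: each step's output type is the next step's input type.
// A type mismatch between steps is a compile error, not a runtime surprise.
fn run_chain(input: &str) -> Option<String> {
    let task = parse_task(input)?; // &str -> TranslationTask
    Some(translate(&task))        // TranslationTask -> String
}
```

Swap the toy functions for the async rig-rs calls above and the shape is identical, with `.await` on each step.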

Act II: The Context Window and the 200-Page PDF #

The next problem arrived as a support ticket: a user had uploaded a large technical document and asked the AI to answer questions about it. We tried the naive approach — put the whole document in the context — and got context_length_exceeded before the model ever saw the question.

We fell back to the cosine similarity approach from the previous post: chunk the document manually, embed each chunk, search at query time. It worked, but it was brittle. Every document upload required custom chunking logic. The similarity search was a hand-rolled loop over a Vec. Adding a second document type meant touching the ingestion code in three places.
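For reference, the brittle part was roughly this shape (a sketch, not our production code): a naive fixed-size chunker over whitespace-split words, with optional overlap between chunks so context isn't cut mid-thought.

```rust
// Naive fixed-size chunker: split on whitespace, group into chunks of
// `chunk_size` words with `overlap` words shared between adjacent chunks.
fn chunk_words(text: &str, chunk_size: usize, overlap: usize) -> Vec<String> {
    assert!(overlap < chunk_size, "overlap must be smaller than chunk_size");
    let words: Vec<&str> = text.split_whitespace().collect();
    let step = chunk_size - overlap;
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < words.len() {
        let end = (start + chunk_size).min(words.len());
        chunks.push(words[start..end].join(" "));
        if end == words.len() {
            break;
        }
        start += step;
    }
    chunks
}
```

Every document type that didn't fit this shape (tables, code listings, headers) needed its own variant, which is exactly why the ingestion code kept growing.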

We were building RAG infrastructure when rig-rs already had it.

rig-rs provides a VectorStore trait and an EmbeddingsBuilder that handle ingestion, and a context_rag_agent builder that wires retrieval into the agent automatically. The pipeline that took three hundred lines to hand-roll collapses to this:

use rig::completion::Prompt;
use rig::embeddings::EmbeddingsBuilder;
use rig::providers::openai;
use rig::vector_store::{InMemoryVectorStore, VectorStore};

pub async fn build_rag_pipeline() {
    let client = openai::Client::from_env();
    let embedding_model = client.embedding_model("text-embedding-3-small");

    // Ingest document chunks and generate embeddings
    let embeddings = EmbeddingsBuilder::new(embedding_model.clone())
        .document("The deployment process requires a staging sign-off.")
        .unwrap()
        .document("Production rollbacks are initiated via the ops runbook.")
        .unwrap()
        .build()
        .await
        .unwrap();

    // Store embeddings — swap InMemoryVectorStore for rig-qdrant or rig-lancedb in production
    let mut vector_store = InMemoryVectorStore::default();
    vector_store.add_documents(embeddings).await.unwrap();

    // The agent retrieves the top 2 relevant chunks automatically on every query
    let rag_agent = client.context_rag_agent("gpt-4o")
        .preamble("You are a knowledgeable assistant. Answer based on the provided context.")
        .dynamic_context(2, vector_store.index(embedding_model))
        .build();

    let answer = rag_agent
        .prompt("How do I roll back a production deployment?")
        .await
        .unwrap();

    println!("{}", answer);
}

The dynamic_context call is doing the work we were doing by hand: on every prompt, it embeds the query, retrieves the top N matching document chunks, and injects them into the context before the model sees the question. Swapping the in-memory store for Qdrant or LanceDB via the rig-qdrant or rig-lancedb extension crates is a one-line change in initialization — the rest of the pipeline is unchanged.
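The retrieval step itself is the same arithmetic as the hand-rolled loop from before. A sketch of what top-N selection over embeddings looks like (plain Rust, no rig-rs; `top_n` is an illustrative helper, not a rig API):

```rust
// Cosine similarity between two equal-length embedding vectors.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb)
}

// Indices of the `n` chunks most similar to the query embedding, best first --
// the selection that happens on every prompt before the model sees the question.
fn top_n(query: &[f32], chunks: &[Vec<f32>], n: usize) -> Vec<usize> {
    let mut scored: Vec<(usize, f32)> = chunks
        .iter()
        .enumerate()
        .map(|(i, c)| (i, cosine(query, c)))
        .collect();
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    scored.into_iter().take(n).map(|(i, _)| i).collect()
}
```

The value of the framework is not that this loop is hard to write; it's that ingestion, storage, retrieval, and context injection stay behind one interface while the backing store changes.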

Act III: The Forgetful Model #

Two turns into any conversation, the model forgot what the user had said in turn one. We had built streaming, state management, a dual-state architecture — and the AI still answered each message as if it were the first.

This is the nature of LLMs: they are stateless. Each API call is independent. If the model needs to remember the conversation, we have to send the conversation history with every request.

The first attempt was to store history in the frontend and send it back on each turn. This worked until conversations grew long, and suddenly we were back at the context window problem — this time from the message history rather than document content.

The cleaner model is to keep history in the Axum backend state as a Vec<rig::completion::Message>, apply the token budget logic from the previous post to prune old messages if needed, and pass what remains directly to rig-rs on each request:

use rig::completion::{Chat, Message};
use rig::providers::openai;

pub async fn chat_with_memory(history: &mut Vec<Message>, user_input: &str) {
    let client = openai::Client::from_env();
    let agent = client.agent("gpt-4o").build();

    // Send the new message along with the prior history
    let response_text = agent.chat(user_input, history.clone()).await.unwrap();

    // Record both sides of the turn so the next request sees them
    history.push(Message::user(user_input));
    history.push(Message::assistant(&response_text));

    println!("{}", response_text);
}

No memory module, no buffer abstraction. A Vec is the memory. The server owns it; the client never sees the full history. This is the same dual-state separation from the architecture posts — AI state on the server, UI state on the client — applied to conversation history.
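The pruning step mentioned above can be sketched in a few lines. This assumes the rough chars/4 token heuristic; `prune_history` is an illustrative helper (with `(role, content)` tuples standing in for rig's `Message`), not a rig-rs API:

```rust
// Rough token estimate: ~4 characters per token (a common heuristic,
// good enough for budget enforcement, not for billing).
fn estimate_tokens(text: &str) -> usize {
    text.chars().count() / 4 + 1
}

// Drop the oldest messages until the estimated total fits the budget.
// Always keeps at least the most recent message.
fn prune_history(history: &mut Vec<(String, String)>, budget: usize) {
    let mut total: usize = history.iter().map(|(_, c)| estimate_tokens(c)).sum();
    while total > budget && history.len() > 1 {
        let (_, content) = history.remove(0); // oldest first
        total -= estimate_tokens(&content);
    }
}
```

A real implementation would count tokens with the model's tokenizer and might summarise dropped turns instead of discarding them, but the server-side ownership is the point: the client never makes this decision.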

Act IV: The Agent That Stopped at One Tool #

The final capability was the hardest to get right. We wanted the model to answer a question like “What is our current inventory level for product X, and how does that compare to last month?” — a question that requires a database lookup, a second database lookup, and arithmetic. The model had access to all three tools. It would pick one, return an answer based only on that one, and stop.

Getting an agent to chain tool calls — use one tool, observe the result, decide whether to call another, eventually synthesise a final answer — required us to write our own loop. We tracked intermediate results in a Vec, fed them back to the model as tool outputs, and iterated until the model stopped requesting tools. It worked but it was fragile, and it duplicated logic that any serious agentic framework handles internally.
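The shape of that hand-rolled loop, reduced to a sketch (the `Step` enum and the mock model/tool closures are illustrative, not our production types):

```rust
// A toy model decision: either request a named tool or give a final answer.
enum Step {
    CallTool { name: String, input: String },
    FinalAnswer(String),
}

// The hand-rolled loop: ask the model, run any requested tool, feed the
// observation back, repeat until the model produces a final answer
// (or give up after `max_steps` to avoid spinning forever).
fn agent_loop(
    mut model: impl FnMut(&[String]) -> Step,
    run_tool: impl Fn(&str, &str) -> String,
    max_steps: usize,
) -> Option<String> {
    let mut observations: Vec<String> = Vec::new();
    for _ in 0..max_steps {
        match model(&observations) {
            Step::CallTool { name, input } => {
                let result = run_tool(&name, &input);
                observations.push(format!("{name} -> {result}"));
            }
            Step::FinalAnswer(answer) => return Some(answer),
        }
    }
    None // the model never stopped requesting tools
}
```

Every fragile detail lives in this loop: the step cap, the observation format, error handling for failed tools. That is the logic a framework should own.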

rig-rs handles the loop. Register the tools, prompt the agent, and the framework manages the observe-act cycle:

use rig::providers::openai;
use rig_derive::tool;

#[tool(description = "Search the web for real-time information")]
async fn web_search(query: String) -> String {
    // In production, calls an actual search API
    format!("The current stock price of ACME Corp is $150.")
}

#[tool(description = "Evaluate a mathematical expression")]
fn calculator(expression: String) -> String {
    // In production, parses and evaluates the expression
    format!("75")
}

pub async fn run_autonomous_agent(user_query: &str) {
    let client = openai::Client::from_env();

    let agent = client.agent("gpt-4o")
        .preamble("You are an autonomous agent. Use your tools to find the answer.")
        .tool(web_search)
        .tool(calculator)
        .build();

    // "What is ACME Corp's stock price divided by 2?"
    // The agent calls web_search, observes the result,
    // calls calculator, observes the result, returns the final answer.
    let final_answer = agent.prompt(user_query).await.unwrap();

    println!("{}", final_answer);
}

The #[tool] macro generates the schema the model uses to decide when to call each function. The framework runs the loop: detect tool call request, execute the function, feed the result back, continue generation until the model produces a final answer with no further tool requests. The application sees one prompt call and one response. The multi-step reasoning is contained entirely inside the agent.

This is where the title finds its meaning. Jaques in As You Like It describes the world as a stage on which every person plays many parts across the acts of a life. An autonomous agent plays many parts too — it searches, calculates, retrieves, remembers — all within a single response. The user sees only the final answer. The performance happens behind the curtain.

What This Unlocks #

Combined with the infrastructure from earlier posts, the application now supports:

  • Chaining: sequential typed operations using native async/await, no framework DSL required
  • RAG: document ingestion, embedding, and context-aware retrieval via rig-rs built-ins, backed by production vector stores through extension crates
  • Memory: conversation history as a server-side Vec<Message>, pruned by token budget, never exposed to the client
  • Autonomous agents: multi-tool, multi-step reasoning managed by the rig-rs execution loop

The Rust ecosystem does not need a LangChain. It needs traits, async functions, and a crate that knows how to talk to the models. The rest is just code.
