Prompt Engineering and Embeddings in Rust — What's in a Name?
The previous posts got the architecture right. Provider abstraction, streaming, server functions, dual-state management — the plumbing was solid. But the responses the model was producing were generic. Users could tell they were talking to a system that had been pointed at a model and left to its own devices. The AI was technically functional and practically unsatisfying.
There are two layers to fixing this. The first is prompt engineering: learning to speak to the model precisely enough that it behaves the way you intend. The second is embeddings: giving the model a way to reason about meaning rather than matching keywords. Both are invisible to the user. Both make the difference between an AI feature people use once and one they come back to.
Act I: The Context Window Cliff #
The first hard failure was a 400 error we did not expect.
The application had been running well in development, where conversations were short. In production, a user ran a long support session — forty exchanges, detailed error messages pasted in full. The next message came back with context_length_exceeded. We had been sending the entire conversation history with each request, and at some point the total crossed the model’s token limit without us noticing.
We had no instrumentation on prompt size. We did not know how many tokens our system prompt consumed, how fast a typical conversation grew, or where the cliff was. We were flying blind until we fell off the edge.
The fix starts with measurement. LLMs do not read text — they process tokens, numerical representations of text chunks. Providers charge by the token and enforce hard limits on how many a single request can contain. Before we send anything to the model, we need to know how large it is.
In Rust, the tiktoken-rs crate gives us the same tokeniser the OpenAI models use:
use tiktoken_rs::cl100k_base;

pub fn calculate_prompt_tokens(system_prompt: &str, user_prompt: &str) -> usize {
    // cl100k_base is the tokeniser used by GPT-4-era OpenAI models
    let bpe = cl100k_base().unwrap();
    let system_tokens = bpe.encode_with_special_tokens(system_prompt);
    let user_tokens = bpe.encode_with_special_tokens(user_prompt);
    let total = system_tokens.len() + user_tokens.len();
    println!("Total tokens to be processed: {}", total);
    total
}
With token counts available, we can make deliberate decisions before a request goes out: drop the oldest messages from history if the total approaches the limit, summarise earlier turns, or warn the user that the context window is nearly full. The cliff is still there — now we can see it coming.
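The first of those decisions — dropping the oldest turns — can be sketched in a few lines. This is a simplified illustration, not the production code: it substitutes a rough characters-per-token estimate for a real tokeniser so it stays dependency-free, and both helper names are mine, not from tiktoken-rs.

```rust
/// Rough token estimate: ~4 characters per token is a common rule of thumb
/// for English text (a stand-in for a real tokeniser like tiktoken-rs).
fn approx_tokens(text: &str) -> usize {
    text.chars().count() / 4 + 1
}

/// Drop the oldest conversation turns until the total fits the budget.
/// The system prompt is always kept; only history is trimmed.
fn trim_history(system_prompt: &str, mut history: Vec<String>, budget: usize) -> Vec<String> {
    let mut total = approx_tokens(system_prompt)
        + history.iter().map(|m| approx_tokens(m)).sum::<usize>();
    while total > budget && !history.is_empty() {
        let dropped = history.remove(0); // oldest turn goes first
        total -= approx_tokens(&dropped);
    }
    history
}
```

Summarising earlier turns instead of dropping them follows the same shape: replace the removed messages with a single model-generated summary message, at the cost of an extra LLM call.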
Act II: The Generic Response Problem #
Token management kept us from crashing. It did not make the responses better.
The model was answering questions correctly in a narrow sense — factually accurate, grammatically fine — and uselessly in practice. Asked to help diagnose a crashing application, it would produce a five-paragraph essay covering every possible cause, hedged with disclaimers, ending with “please contact support.” Users wanted three actionable steps. The model had no way of knowing that.
We tried writing clearer system prompts. “Be concise.” “Respond in bullet points.” “Focus on the most likely cause.” The model would follow these instructions for simple inputs and quietly abandon them when the query got complex. Instructions without examples are aspirational. The model would interpret them loosely, and “loosely” meant differently from what we intended.
Two techniques close the gap. The first is few-shot learning: rather than describing the desired output, show the model exactly what it looks like. The second is chain-of-thought prompting: rather than asking for an answer, ask the model to reason step by step. A model that writes out its reasoning catches more of its own mistakes.
Together, they go into the system prompt as a concrete template:
use rig::providers::openai;

pub async fn specialized_support_agent(user_query: &str) -> String {
    let client = openai::Client::from_env();

    let system_prompt = "
You are a technical support assistant.
Task: Resolve user issues by thinking step-by-step.

Example (Few-Shot):
User: 'My app is crashing on startup.'
Assistant:
Step 1: Identify the OS and app version.
Step 2: Check for known outage reports.
Final Answer: 'I see your app is crashing. Could you provide your OS version so we can investigate?'
";

    let agent = client.agent("gpt-4o")
        .preamble(system_prompt)
        .build();

    agent.prompt(user_query).await.expect("Failed to generate response")
}
The example does more than the instruction. The model sees the format, the reasoning style, and the level of specificity expected — all at once. Responses became consistent and scoped. Users stopped getting essays and started getting steps.
Act III: The Keyword Search Dead End #
With prompts under control, the next request was a knowledge base. The team had several hundred internal FAQ documents. Users should be able to ask a question in natural language and get the most relevant document back.
The first attempt was keyword matching: extract significant words from the query, search for documents containing those words. It worked for exact questions. It failed for everything else. A user asking “my laptop won’t boot” got no results because the FAQ entry said “computer fails to start.” A question about “slow internet” missed the document titled “network performance degradation.” The words did not overlap. The meaning was identical.
Keyword search treats text as a bag of characters. What we needed was a way to search by meaning.
This is what embeddings are for. When text is converted to an embedding, it becomes a high-dimensional vector — a list of floating-point numbers — that encodes its semantic content. Texts with similar meaning end up near each other in that high-dimensional space, regardless of whether they share any words. “My laptop won’t boot” and “computer fails to start” map to nearby vectors. The geometry captures what the keywords miss.
Using rig-rs, generating an embedding is a single call:
use rig::providers::openai;

pub async fn generate_embedding(text: &str) -> Vec<f64> {
    let client = openai::Client::from_env();
    let embedding_model = client.embedding_model("text-embedding-3-small");
    let embeddings = embedding_model.embed_text(text).await.unwrap();
    embeddings.first().unwrap().vec.clone()
}
The title earns itself here. Juliet asks what’s in a name — whether a rose called by another name would still smell as sweet. The answer, for embeddings, is yes. “Spaghetti carbonara” and “fettuccine alfredo” end up near each other in the vector space not because they share words but because they share meaning: pasta, cream, Italian. The exact words are incidental. The semantic content is what the vector captures.
Act IV: Putting It Together — Semantic Search #
With embeddings available, the knowledge base becomes a geometric search problem. Pre-compute embeddings for all FAQ documents and store them. When a query arrives, embed it and find the stored vector closest to it. Closeness in vector space corresponds to similarity in meaning.
The standard measure of closeness for embeddings is cosine similarity — it measures the angle between two vectors rather than the distance, which makes it robust to differences in text length:
fn cosine_similarity(vec1: &[f64], vec2: &[f64]) -> f64 {
    let dot_product: f64 = vec1.iter().zip(vec2.iter()).map(|(a, b)| a * b).sum();
    let norm1: f64 = vec1.iter().map(|a| a * a).sum::<f64>().sqrt();
    let norm2: f64 = vec2.iter().map(|b| b * b).sum::<f64>().sqrt();
    dot_product / (norm1 * norm2)
}

pub async fn find_relevant_answer(user_query: &str) -> String {
    let query_vector = generate_embedding(user_query).await;

    // In production these vectors are pre-computed at indexing time
    let faqs: Vec<(&str, Vec<f64>)> = vec![
        ("How do I reset my password?", vec![/* stored vector */]),
        ("My computer won't turn on.", vec![/* stored vector */]),
    ];

    let mut best_match = "";
    let mut highest_score = -1.0_f64;

    for (question, stored_vector) in faqs.iter() {
        let score = cosine_similarity(&query_vector, stored_vector);
        if score > highest_score {
            highest_score = score;
            best_match = *question; // iterating by reference yields &&str, so deref
        }
    }

    format!("Most relevant FAQ: {}", best_match)
}
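One refinement worth knowing about: because cosine similarity divides by both norms, vectors can be normalised to unit length once when they are stored, after which query-time scoring reduces to a plain dot product. A minimal sketch of that idea — the helper names here are mine, not part of rig:

```rust
/// Scale a vector to unit length. Done once at indexing time, this lets the
/// query-time score be a plain dot product instead of a full cosine.
fn normalize(v: &[f64]) -> Vec<f64> {
    let norm = v.iter().map(|x| x * x).sum::<f64>().sqrt();
    if norm == 0.0 {
        return v.to_vec(); // avoid dividing by zero for an all-zero vector
    }
    v.iter().map(|x| x / norm).collect()
}

/// Dot product of two equal-length vectors; on unit vectors this equals
/// their cosine similarity.
fn dot(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b.iter()).map(|(x, y)| x * y).sum()
}
```

Production vector databases apply the same trick internally, which is one reason they stay fast at scale.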
The in-memory approach is sufficient for prototyping, but rig-rs ships official extension crates for production vector stores so you do not have to roll your own storage or similarity search. rig-qdrant integrates with Qdrant, rig-lancedb with LanceDB, and rig-mongodb brings vector search to MongoDB Atlas. Each extension implements the same VectorStore trait from rig-core, so the retrieval logic in your application stays identical regardless of which backend you choose — swapping from an in-memory store to Qdrant in production is a one-line change in the initialization code.
// Switching from in-memory to Qdrant is a backend swap, not a logic change
use rig_qdrant::QdrantVectorStore;
let vector_store = QdrantVectorStore::new(qdrant_client, embedding_model, collection_name);
let results = vector_store.top_n::<Document>("my laptop won't boot", 3).await?;
A user asking “my laptop won’t boot” now finds “My computer won’t turn on.” without needing a single word in common — and that search runs against a collection of any size.
What Changed #
The architecture from the previous posts handles the mechanics of an AI application: how requests travel, how state is managed, how the model is called securely. This post handles the quality layer on top of it:
- Token counting prevents silent context overflow
- Few-shot examples make the model’s output format reliable without fine-tuning
- Chain-of-thought improves reasoning on complex inputs
- Embeddings replace brittle keyword matching with search by meaning
- rig vector store extensions bring production-grade storage without changing retrieval logic
None of these are visible in the UI. Users do not see token budgets or embedding vectors. They see an assistant that gives consistent, useful answers and finds the right document even when they cannot remember the exact words for what they are looking for.