
Debugging, Profiling, and Testing AI in Rust — The Readiness Is All

7 mins

The previous posts built the happy path: streaming, RAG, memory, autonomous agents, grounding. In development, against a stable API with short sessions, all of it worked. Then production arrived.

A streamed response was failing silently partway through. Users reported the AI “getting stuck.” Memory on the server was climbing after a week without a restart. A Monday morning traffic spike triggered a cascade of 429s from the AI provider that the error handling was not equipped for. And the CI pipeline was spending forty seconds and real money on OpenAI API calls for every test run.

None of these are AI problems. They are engineering problems that happen to involve AI. The tools to address them are the same tools used for any async Rust service — applied carefully to the specific shapes that AI workloads produce.

Act I: The Silent Failure at the Boundary #

The streaming failure was the hardest to diagnose because it left no trace. The Leptos frontend showed a spinner; the Axum logs showed a completed request; the user saw a response that stopped mid-sentence. The failure was happening somewhere between the rig-rs stream and the SSE encoder, or possibly at the network layer, and we had no instrumentation that crossed the Wasm boundary cleanly.

println! debugging in async code is nearly useless — the output ordering tells you nothing about the actual execution order across tasks. What we needed was structured, contextual logging that could trace a single request from the moment the Leptos component called the server function to the moment the last SSE event was sent.

The tracing crate gives us this. The #[instrument] macro creates a span around each function invocation, capturing arguments and propagating context through async boundaries. The failure, once we could see the span tree, was immediately visible: the SSE stream was completing successfully, but the final Event::default().data(text) was being dropped on a connection that the client had already closed — and we were not handling that case.

use leptos::prelude::*; // #[server] macro, ServerFnError
use rig::providers::openai;
use tracing::{error, info, instrument};

#[server(ContinueConversation, "/api")]
#[instrument(skip(history))]
pub async fn continue_conversation(
    input: String,
    history: Vec<Message>,
) -> Result<String, ServerFnError> {
    info!("Starting AI generation for input length: {}", input.len());

    let client = openai::Client::from_env();
    let agent = client.agent("gpt-4o").build();

    match agent.prompt(&input).await {
        Ok(response) => {
            info!("Generation succeeded.");
            Ok(response)
        }
        Err(e) => {
            error!(error = ?e, "AI generation failed.");
            Err(ServerFnError::ServerError("AI provider failed".into()))
        }
    }
}

With tracing-subscriber configured to output structured JSON, these spans pipe directly into Elastic Stack or Datadog. More importantly, they give you a request-scoped view of what happened and when — which is the only way to debug failures that span async tasks, network hops, and Wasm boundaries.

Act II: The Climbing Memory #

After a week in production, the server’s RSS was climbing steadily. It was not dramatic — no OOM kills — but the graph had a clear upward slope with no floor. Something was not being dropped.

Our first guess was conversation history: Vec<Message> values accumulating in the Axum state without cleanup. We added explicit cleanup after sessions expired. The slope did not change.

The actual cause was SSE connections that were not closing. When a user navigated away mid-stream, the browser closed the connection, but the async_stream on the server side was still polling the rig-rs stream and yielding events into a dropped receiver. The task was alive, holding the model’s stream open, until the model finished generating — which for long prompts could be thirty seconds after the user had left.

In the JavaScript world you would reach for a heap profiler. In async Rust, the right tool is tokio-console. Adding console_subscriber::init() to the Axum startup gives you a real-time terminal dashboard of every live async task: how long it has been running, whether it is polling or sleeping, and what it is waiting on.

The hanging tasks were immediately visible — a column of stream-polling tasks that should have exited seconds earlier. The fix was a tokio::select! that races the SSE stream against an abort signal tied to the connection lifecycle. When the client disconnects, the signal fires and the stream task exits cleanly.

tokio-console is cheap to keep available, though not free: the instrumentation requires building with `RUSTFLAGS="--cfg tokio_unstable"` and adds some runtime overhead, so gate `console_subscriber::init()` behind a Cargo feature rather than enabling it unconditionally in production. It is still the first tool to reach for when async memory does not behave.

Act III: The Monday Morning 429 #

The rate limit failure was straightforward in cause and expensive in effect. A traffic spike hit the API provider’s per-minute token limit, the provider returned HTTP 429, and the error propagated as an unhandled Err that the SSE handler converted into an abrupt stream close. The user saw an empty response. The frontend showed no error. Support tickets arrived.

Rate limits and provider outages are not edge cases — they are scheduled maintenance windows, peak traffic, and billing surprises. The application needs a fallback strategy that is expressed in the error handling, not bolted on after the first incident.

Rust’s Result and match make the fallback chain readable:

use rig::providers::{openai, gemini};

pub async fn resilient_prompt(input: &str) -> String {
    let primary = openai::Client::from_env().agent("gpt-4o").build();

    match primary.prompt(input).await {
        Ok(response) => response,
        Err(primary_err) => {
            tracing::warn!(
                error = ?primary_err,
                "Primary provider failed. Attempting fallback."
            );

            let fallback = gemini::Client::from_env()
                .agent("gemini-2.5-flash")
                .build();

            match fallback.prompt(input).await {
                Ok(fallback_response) => fallback_response,
                Err(fallback_err) => {
                    tracing::error!(
                        error = ?fallback_err,
                        "Fallback provider also failed."
                    );
                    "I am experiencing high traffic at the moment. Please try again shortly."
                        .to_string()
                }
            }
        }
    }
}

The warning log captures the primary failure. The error log captures the double failure. If both providers are down the user receives a human-readable message rather than a broken stream. The tracing spans from Act I mean the failure is visible in the monitoring dashboard without a user report.

Act IV: The Forty-Second Test Suite #

The CI pipeline was making real OpenAI API calls. This was not intentional — it was the path of least resistance when writing the first integration tests. The consequences accumulated slowly: forty seconds per run, flaky failures when the CI runner hit rate limits, and a non-trivial monthly bill from the test suite alone.

The underlying problem is that LLMs are non-deterministic and expensive. You cannot write a test that asserts an exact output string, because the model will answer the same question differently on different runs. And you should not make real API calls in unit or integration tests — you are testing your code’s behaviour, not the model’s.

The solution is to mock the LLM boundary. Replace the actual provider call with a function that returns a fixed string, and test the logic that wraps it: input sanitisation, output parsing, context injection, error handling. The mock does not need to be smart — it needs to be fast and deterministic.

#[cfg(test)]
mod tests {
    use super::*;

    async fn mock_llm(_prompt: &str) -> String {
        "Regular exercise significantly improves physical health and fitness.".to_string()
    }

    fn validate_constraints(text: &str, required_words: &[&str]) -> bool {
        required_words
            .iter()
            .all(|&word| text.to_lowercase().contains(&word.to_lowercase()))
    }

    #[tokio::test]
    async fn test_constrained_output_parsing() {
        let response = mock_llm("Summarize benefits of exercise.").await;
        let valid = validate_constraints(&response, &["health", "fitness"]);
        assert!(valid, "Pipeline failed to enforce required output constraints.");
    }

    #[tokio::test]
    async fn test_rag_context_injection() {
        // Test that the RAG pipeline correctly formats the context before calling the LLM,
        // without ever touching the network
        let context = "Servers are backed up nightly at 3:00 AM UTC.";
        let query = "When do backups run?";
        let prompt = format!("Context:\n{}\n\nQuestion: {}", context, query);

        let response = mock_llm(&prompt).await;
        assert!(!response.is_empty(), "Pipeline produced no output.");
    }
}

The mock tests the pipeline’s contract: does the context get injected correctly, does the output get parsed correctly, does the fallback trigger on an error? The model’s actual reasoning is tested separately, manually, against real prompts during development — not in CI. The test suite went from forty seconds to under one second and stopped appearing in the billing dashboard.

What Resilience Looks Like #

Hamlet’s observation — “the readiness is all” — is about accepting that failure will arrive and preparing for it rather than preventing it. Rate limits will fire. Connections will drop. Memory will accumulate if tasks are not cleaned up. The model will occasionally produce output the parser cannot handle.

The infrastructure in this post does not prevent those events. It makes the system ready for them:

  • tracing with #[instrument] makes async failures visible across Wasm and server boundaries
  • tokio-console surfaces hanging tasks before they become memory problems
  • Fallback chains with Result matching ensure provider failures degrade gracefully rather than silently
  • Mocked LLM boundaries make the test suite fast, deterministic, and cheap to run

Each capability is small in isolation. Together they are the difference between an AI application that works in development and one that holds up in production.

References #