Adding Eyes to the AI — O Brave New World
In the previous post, we built out provider abstraction and streaming for our Rust-based AI application. The model could respond to text and stream its output token by token. It was working well. Then a user complaint arrived that was hard to ignore.
“I pasted a screenshot of the error and the AI ignored it.”
They had dragged an image into the chat input, expected the model to look at it, and received a response that made clear nothing had been seen. We had to tell them the feature did not support images. Meanwhile, the models we were already calling had supported images for months. The gap was not in the model. It was in our pipeline — we were silently throwing away everything that was not text.
The Struggle #
The first attempt was the obvious one: read the uploaded file in the frontend, attach it as a raw binary blob to the request, handle it on the Axum side. This ran into two problems immediately.
Multipart form data from a Wasm frontend is more involved than it looks. The browser’s File API is asynchronous and callback-based, which does not map naturally onto Rust’s async model without careful bridging via web-sys and wasm-bindgen-futures. We got something working, but the Axum handler had to deal with a streaming multipart body, extract the image part, buffer it, and then construct the LLM request — all before the model saw a single byte.
The second problem was size. Raw image uploads let users send anything. A 12MB HEIC photo from an iPhone crashed the handler the first time it arrived. We needed a contract between the frontend and backend about what would be transmitted — not raw bytes, but a structured payload both sides agreed on.
The solution that stuck was simpler: encode the image to Base64 in the browser before it leaves the Wasm layer, and send it as a plain JSON string alongside the text prompt. The backend never touches multipart encoding. The LLM receives a structured compound message. The contract is a Rust struct with serde.
The Implementation #
On the Leptos frontend, we use web-sys to reach the browser’s FileReader API. Reading the file as a data URL hands us a Base64-encoded string directly, so there is no manual byte buffering before the payload is sent:
use wasm_bindgen::{prelude::*, JsCast};
use wasm_bindgen_futures::JsFuture;
use web_sys::FileReader;

// FileReader is callback-based, so bridge its `onload` into a Promise we can await.
async fn read_image_as_base64(file: web_sys::File) -> Result<String, JsValue> {
    let reader = FileReader::new()?;
    let target = reader.clone();
    let loaded = js_sys::Promise::new(&mut |resolve, _reject| {
        let onload = Closure::once_into_js(move || {
            resolve.call0(&JsValue::NULL).ok();
        });
        target.set_onload(Some(onload.unchecked_ref()));
    });
    reader.read_as_data_url(&file)?;
    JsFuture::from(loaded).await?;
    // The result is a data URL: "data:image/png;base64,…"
    reader
        .result()?
        .as_string()
        .ok_or_else(|| JsValue::from_str("read failed"))
}
The encoded string and the text prompt travel together in a single JSON payload:
use serde::{Deserialize, Serialize};
#[derive(Serialize, Deserialize)]
pub struct MultimodalRequest {
pub text_prompt: String,
pub image_base64: Option<String>,
}
On the Axum backend, the handler reconstructs the compound message and forwards both modalities to the LLM:
use axum::{http::StatusCode, Json};

async fn process_multimodal_request(
    Json(payload): Json<MultimodalRequest>,
) -> StatusCode {
    if let Some(image_data) = payload.image_base64 {
        // Drop the "data:image/...;base64," prefix the browser adds, then
        // construct a compound message with both text and image and pass
        // it to the rig-rs agent.
        let raw_base64 = image_data
            .split_once(',')
            .map_or(image_data.as_str(), |(_, b64)| b64);
        println!(
            "Processing Base64 image payload ({} bytes) alongside prompt: {}",
            raw_base64.len(),
            payload.text_prompt
        );
    }
    StatusCode::OK
}
A few constraints are worth stating plainly. The model struggles with low-contrast images and poor lighting — the visual equivalent of mumbled speech. Very large images exceed payload limits or make each request expensive to run. We enforce a 4MB cap in the Leptos layer before encoding, return a visible error to the user rather than silently dropping the image, and strip the data:image/...;base64, prefix before forwarding to the model.
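The cap and the prefix handling reduce to a small pure function. This is an illustrative sketch (the names are mine, not the app’s); in the real Leptos layer the size would come from web_sys::File::size() before encoding begins:

```rust
/// Illustrative sketch: enforce the 4MB cap and strip the data-URL prefix.
const MAX_IMAGE_BYTES: u64 = 4 * 1024 * 1024;

fn validate_and_strip(file_size: u64, data_url: &str) -> Result<String, String> {
    if file_size > MAX_IMAGE_BYTES {
        // Surface the failure to the user instead of silently dropping the image.
        return Err(format!("image is {file_size} bytes; the limit is 4MB"));
    }
    // "data:image/png;base64,AAAA…" -> "AAAA…"
    data_url
        .split_once(',')
        .map(|(_, b64)| b64.to_string())
        .ok_or_else(|| "malformed data URL".to_string())
}
```

Doing the check before encoding matters: Base64 inflates the payload by roughly a third, so a file that passes the cap still produces a noticeably larger JSON body.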
The Result #
Once this was in place, the complaint inverted. Users started sending us examples of things they could do that surprised them — screenshots of error logs, photos of hardware they wanted identified, wireframes they wanted reviewed for accessibility. The title earns itself here: Miranda, isolated on Prospero’s island and having never seen the wider world, calls it a brave new world when it finally comes into view. Our application, having processed only text for its entire existence, now opens its eyes.
The implementation is straightforward once the encoding contract is clear. The interesting part is what becomes possible: the text channel and the visual channel collapse into a single conversation, and the model reasons over both as naturally as a person reading a message that comes with an attachment.