March 19, 2026 · AI · 11 min read

Building LLM agents that don't hallucinate (much)

Tool use, retrieval, evals, guardrails — the unglamorous engineering that turns a demo agent into something you can put in front of a paying customer.

LLM · Agents · RAG · Evals · Anthropic · OpenAI

The gap between an agent that demos well and one you'd ship to production is enormous. It's not the model — Claude and GPT and Gemini are all good enough now. It's the engineering around the model.

Retrieval first

If your agent needs facts, give it facts. Don't trust pretraining. A small embedding index over your own docs will usually do more for accuracy than any prompt-engineering trick.
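
To make the idea concrete, here is a minimal sketch of a retrieval step, assuming a handful of in-memory docs. The `embed` function is a toy word-count stand-in for a real embedding model, and `DocIndex`, `build_prompt`, and the sample docs are hypothetical names invented for illustration. The shape is what matters: embed the docs once, rank them against the question, and put the winners in the prompt.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class DocIndex:
    """In-memory index: embed every doc once, rank by similarity at query time."""

    def __init__(self, docs: list[str]) -> None:
        self.docs = docs
        self.vectors = [embed(d) for d in docs]

    def search(self, query: str, k: int = 2) -> list[str]:
        q = embed(query)
        ranked = sorted(
            zip(self.docs, self.vectors),
            key=lambda pair: cosine(q, pair[1]),
            reverse=True,
        )
        return [doc for doc, _ in ranked[:k]]


def build_prompt(question: str, index: DocIndex) -> str:
    """Put retrieved passages in front of the model instead of trusting its memory."""
    context = "\n".join(f"- {doc}" for doc in index.search(question))
    return (
        "Answer using only the context below. "
        "If the answer is not there, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Hypothetical sample docs, standing in for your own documentation.
DOCS = [
    "Refunds are issued within 14 days of purchase.",
    "Support is available on weekdays from 9 to 5.",
    "The Pro plan includes 10 seats.",
]
index = DocIndex(DOCS)
top = index.search("How many days do refunds take?", k=1)
print(top)
```

In a real system you would swap `embed` for calls to an embedding model and the list scan for a vector store, but the prompt-building step stays much the same.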

Tools, not text

Anywhere the model needs to look up a value, take an action, or hit an API, give it a tool. Free-text "the user's email is X" is a hallucination waiting to happen. `get_user_email()` returning a real value is not.
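
A rough sketch of what that can look like, assuming the model emits tool calls as a dict with `name` and `arguments` keys. That shape, the `TOOLS` registry, the `dispatch` helper, and the `USERS` store are all hypothetical and not any specific vendor's format. The model never types the email itself; it asks for the tool, and your code returns the real value or an explicit error.

```python
import json

# Hypothetical stand-in for a real user database.
USERS = {"u_123": {"email": "ada@example.com"}}


def get_user_email(user_id: str) -> str:
    """Return the email on file, or raise if the user does not exist."""
    user = USERS.get(user_id)
    if user is None:
        raise LookupError(f"no user with id {user_id}")
    return user["email"]


# Registry the agent loop exposes to the model: a description, a JSON-Schema-style
# parameter spec, and the Python function that does the actual work.
TOOLS = {
    "get_user_email": {
        "fn": get_user_email,
        "description": "Look up the email address on file for a user ID.",
        "parameters": {
            "type": "object",
            "properties": {"user_id": {"type": "string"}},
            "required": ["user_id"],
        },
    }
}


def dispatch(tool_call: dict) -> str:
    """Run one model-requested tool call and return a JSON string for the model."""
    name = tool_call.get("name")
    args = tool_call.get("arguments", {})
    tool = TOOLS.get(name)
    if tool is None:
        return json.dumps({"error": f"unknown tool: {name}"})
    missing = [p for p in tool["parameters"]["required"] if p not in args]
    if missing:
        return json.dumps({"error": f"missing arguments: {missing}"})
    try:
        return json.dumps({"result": tool["fn"](**args)})
    except Exception as exc:  # report failures to the model instead of crashing
        return json.dumps({"error": str(exc)})


print(dispatch({"name": "get_user_email", "arguments": {"user_id": "u_123"}}))
print(dispatch({"name": "get_user_email", "arguments": {"user_id": "u_999"}}))
```

Returning errors as data gives the model something truthful to relay ("I couldn't find that user") instead of a gap it might fill with a guess.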

Evals before prompts

Write 50 test cases before you write the prompt. Score them automatically. When you tweak the prompt, re-run the suite. Without evals you're vibes-coding a system your customers depend on.
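
A minimal sketch of such a suite, assuming the agent can be called as a plain function from prompt to answer. `Case`, `passes`, `run_suite`, and `fake_agent` are hypothetical names, and substring checks are only the crudest possible scorer; the point is that the score comes from code, not from eyeballing outputs.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Case:
    prompt: str
    must_contain: list[str]
    must_not_contain: list[str] = field(default_factory=list)


def passes(output: str, case: Case) -> bool:
    """A case passes if every required phrase appears and no forbidden one does."""
    text = output.lower()
    has_required = all(s.lower() in text for s in case.must_contain)
    has_forbidden = any(s.lower() in text for s in case.must_not_contain)
    return has_required and not has_forbidden


def run_suite(
    agent: Callable[[str], str], cases: list[Case]
) -> tuple[float, list[Case]]:
    """Run every case through the agent; return the pass rate and the failures."""
    failures = [c for c in cases if not passes(agent(c.prompt), c)]
    return (len(cases) - len(failures)) / len(cases), failures


def fake_agent(prompt: str) -> str:
    """Stand-in for the real agent so the harness runs without a model."""
    if "refund" in prompt.lower():
        return "Refunds are issued within 14 days."
    return "I don't know."


# Three hypothetical cases; a real suite would have dozens.
CASES = [
    Case("What is the refund window?", ["14 days"]),
    Case("What is the CEO's home address?", ["don't know"], ["street"]),
    Case("How many seats are in the Pro plan?", ["10 seats"]),
]

pass_rate, failures = run_suite(fake_agent, CASES)
print(f"{pass_rate:.0%} passed")
for case in failures:
    print("FAIL:", case.prompt)
```

Run it on every prompt change and track the pass rate over time; a tweak that fixes one case and quietly breaks three shows up immediately.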

Guardrails are part of the product

Output validation. Refusal handling. Cost caps. Token limits. PII redaction on the way in. These aren't nice-to-haves. They're the difference between a feature and a lawsuit.
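
Three of those guardrails can be sketched in a few lines each. This is an illustration under stated assumptions, not a complete safety layer: the regexes catch only obvious emails and US-style SSNs, the token counts are whatever your provider reports, and `redact_pii`, `CostCap`, `BudgetExceeded`, and `validate_output` are hypothetical names.

```python
import json
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact_pii(text: str) -> str:
    """Mask obvious PII before the text ever reaches the model."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return SSN_RE.sub("[SSN]", text)


class BudgetExceeded(RuntimeError):
    """Raised when a request would push usage past the cap."""


class CostCap:
    """Hard ceiling on tokens spent per conversation (or per user, per day)."""

    def __init__(self, max_tokens: int) -> None:
        self.max_tokens = max_tokens
        self.used = 0

    def charge(self, tokens: int) -> None:
        if self.used + tokens > self.max_tokens:
            raise BudgetExceeded(
                f"cap of {self.max_tokens} tokens would be exceeded"
            )
        self.used += tokens


def validate_output(raw: str, required_keys: set[str]) -> dict:
    """Accept model output only if it is a JSON object with the expected keys."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError("output is not valid JSON") from exc
    if not isinstance(data, dict):
        raise ValueError("output is not a JSON object")
    missing = required_keys.difference(data)
    if missing:
        raise ValueError(f"output is missing keys: {sorted(missing)}")
    return data


clean = redact_pii("Contact ada@example.com, SSN 123-45-6789")
print(clean)

cap = CostCap(max_tokens=100)
cap.charge(60)  # a second charge of 50 would raise BudgetExceeded

reply = validate_output('{"answer": "14 days", "sources": []}', {"answer", "sources"})
print(reply["answer"])
```

Each check fails loudly, so the calling code decides what happens next: retry, fall back to a canned response, or hand off to a human.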
