A software team would not merge a payment change without tests. The same standard has to apply when an AI agent can read customer records, draft replies, update a CRM, reconcile an invoice, or trigger a handoff. Agent evaluations are the QA suite for work that now includes judgment.
The early phase of AI adoption rewarded demos. A prompt looked smart, a workflow completed once, and the team moved on to the next use case. Production punishes that habit. The agent sees messier inputs, stale records, partial outages, contradictory instructions, and edge cases nobody remembered to put in the demo script.
That is why evals should be treated as engineering infrastructure, not a research notebook. They decide whether the agent is still fit for the job after a model upgrade, a prompt edit, a new tool permission, or a change to a business rule.
Test the work, not the personality
Many AI tests start in the wrong place. They ask whether the response sounds good. Useful agent evals ask whether the work was done correctly, with the right evidence, inside the approved boundary.
A customer-support agent should be scored on whether it classified the request correctly, cited the right account record, escalated high-risk cases, avoided unsupported promises, and left the right note behind. A warehouse reconciliation agent should be scored on matching purchase orders, bills of lading, receiving exceptions, and vendor invoices. A marketing operations agent should be scored on intake completeness, approval routing, launch readiness, and reporting accuracy.
The question is not "did the model answer nicely?" The question is "would we let this run again tomorrow without creating cleanup work?"
Start with golden tasks
A golden task is a known piece of work with a known-good outcome. It might be an inbound email, a support ticket, an invoice exception, or a CRM update with private details removed. The important part is that the business can say what good looks like before the agent runs.
Golden tasks should cover normal work and the awkward cases. Include the clean intake form, but also the customer who sent three emails, the order with a missing field, the ambiguous refund request, the duplicate record, and the case that should stop for human review. A small set of sharp examples beats a huge set of vague prompts.
Each task needs expected behavior. Which fields should be extracted? Which source should be cited? Which tool calls are allowed? What should be blocked? What should be escalated? The answer can be deterministic, rubric-scored, or human-reviewed, but it cannot be left to taste.
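As a rough illustration, a golden task and its deterministic checks could be written down like this. It is a minimal sketch, not a prescribed schema: the names GoldenTask, AgentResult, and check_task, and every field on them, are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class GoldenTask:
    """A known piece of work with a known-good outcome (hypothetical shape)."""
    task_id: str
    input_text: str
    expected_fields: dict        # fields the agent should extract
    allowed_tools: set           # tool calls the agent is allowed to make
    must_escalate: bool = False  # True if the case should stop for human review


@dataclass
class AgentResult:
    """What the agent actually did on one task (hypothetical shape)."""
    fields: dict
    tool_calls: list
    escalated: bool


def check_task(task: GoldenTask, result: AgentResult) -> list:
    """Return failure messages; an empty list means the task passed."""
    failures = []
    for name, expected in task.expected_fields.items():
        actual = result.fields.get(name)
        if actual != expected:
            failures.append(f"field {name!r}: expected {expected!r}, got {actual!r}")
    for call in result.tool_calls:
        if call not in task.allowed_tools:
            failures.append(f"disallowed tool call: {call}")
    if task.must_escalate and not result.escalated:
        failures.append("expected escalation to human review")
    return failures
```

Rubric-scored or human-reviewed expectations would need an extra field and a different checker. The point of the sketch is only that the expected behavior is written down before the agent runs.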
Replay traces catch regressions
The agent run itself is evidence. Inputs, retrieved documents, tool calls, policy decisions, intermediate notes, final output, approval status, and downstream changes should be captured in a trace. Without that record, a bad run becomes an argument about what probably happened.
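One way to picture such a trace is a single record per run, stored as one JSON line so it can be appended to a log and loaded again later. This is a sketch under assumptions: the Trace class and its field names are hypothetical and simply mirror the items listed above.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Trace:
    """One agent run, captured as evidence (hypothetical field names)."""
    run_id: str
    inputs: dict
    retrieved_docs: list = field(default_factory=list)
    tool_calls: list = field(default_factory=list)
    policy_decisions: list = field(default_factory=list)
    intermediate_notes: list = field(default_factory=list)
    final_output: str = ""
    approval_status: str = "pending"
    downstream_changes: list = field(default_factory=list)

    def to_json_line(self) -> str:
        """Serialize to a single line suitable for an append-only log."""
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json_line(cls, line: str) -> "Trace":
        """Rebuild a trace from a line written by to_json_line."""
        return cls(**json.loads(line))
```

A record like this only holds JSON-friendly values (strings, numbers, lists, dicts). Anything richer would need its own serialization step.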
Replay makes the trace useful. When the prompt changes, replay yesterday's tasks. When the model changes, replay the tasks that used to pass. When a connector gets a new field, replay the cases that depend on that system. The agent does not need to be frozen in place, but every change should answer one simple question: did we make the work better or worse?
This is where agentic systems start to look like mature software. You can test the new version against the old version. You can see which cases improved, which cases regressed, and which failures are worth accepting because the business tradeoff changed.
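The old-versus-new comparison can be sketched in a few lines. The example assumes each version has already been replayed over the same tasks and reduced to a pass or fail per task; the function name compare_runs and the report keys are hypothetical.

```python
def compare_runs(baseline: dict, candidate: dict) -> dict:
    """Compare pass/fail results per task between two agent versions.

    Both arguments map task_id -> bool (True means the task passed).
    """
    report = {"improved": [], "regressed": [], "unchanged": [], "missing": []}
    for task_id in sorted(baseline):
        if task_id not in candidate:
            # The new version was never replayed on this task.
            report["missing"].append(task_id)
        elif candidate[task_id] and not baseline[task_id]:
            report["improved"].append(task_id)
        elif baseline[task_id] and not candidate[task_id]:
            report["regressed"].append(task_id)
        else:
            report["unchanged"].append(task_id)
    return report
```

The regressed list is the one a reviewer would read first. Whether a given regression is acceptable remains a business decision that the report cannot make.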
Evals belong beside workflow code
If the agent workflow lives in one place and the evals live in a separate spreadsheet, the evals will drift. They will be updated late, skipped during urgent changes, and forgotten when the person who made them moves on.
The better pattern is boring on purpose. Keep eval fixtures, expected outputs, policy checks, and replay commands near the workflow code. Run them in pull requests for changes that touch prompts, tools, routing, permissions, retrieval, or business rules. Make the test result part of the deployment decision.
Some checks should be hard gates. The agent must not send payment instructions to a new recipient. It must not update a closed opportunity. It must not attach a confidential file to an external reply. Other checks can be scored and reviewed, such as tone, completeness, or judgment calls where a human may accept a tradeoff.
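The three hard gates above might look something like this in code. It is a sketch only: ProposedAction, the action kinds, and the keys inside details are invented for the example and would depend on how a real workflow represents its actions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProposedAction:
    """An action the agent wants to take (hypothetical shape)."""
    kind: str
    details: dict


def gate_violations(action: ProposedAction, known_recipients: set) -> list:
    """Return the hard-gate rules this action would break; empty means allowed."""
    violations = []
    d = action.details
    if action.kind == "send_payment_instructions":
        if d.get("recipient") not in known_recipients:
            violations.append("payment instructions to a new recipient")
    if action.kind == "update_opportunity":
        if d.get("current_stage") == "closed":
            violations.append("update to a closed opportunity")
    if action.kind == "send_reply" and d.get("external"):
        attachments = d.get("attachments", [])
        if any(a.get("confidential") for a in attachments):
            violations.append("confidential file attached to an external reply")
    return violations
```

A hard gate of this kind fails the run outright when the list is non-empty. Scored checks such as tone or completeness would sit in a separate path with a human in the loop.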
Production evals need business metrics
Offline tests are not enough. The production system should report how often the agent escalates, how often humans accept its drafts, how often policy blocks a risky action, how many runs require correction, and how much time the workflow saves after cleanup is counted.
Those metrics keep the team honest. An agent that completes more tasks but creates more rework is not improving. An agent that escalates less by guessing more is not improving. A useful agent reduces drag while staying inside the company's risk tolerance.
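A small sketch shows how those numbers could be rolled up from run records, with cleanup time subtracted from the time saved. The record keys and the function name summarize_runs are assumptions for illustration, not a reporting standard.

```python
def summarize_runs(runs: list) -> dict:
    """Roll up production runs into rates plus net time saved.

    Each run is a dict with the hypothetical keys used below.
    """
    total = len(runs)
    if total == 0:
        raise ValueError("no runs to summarize")

    def rate(key: str) -> float:
        # Share of runs where the flag is true.
        return sum(1 for r in runs if r[key]) / total

    saved = sum(r["minutes_saved"] for r in runs)
    cleanup = sum(r["cleanup_minutes"] for r in runs)
    return {
        "escalation_rate": rate("escalated"),
        "draft_acceptance_rate": rate("draft_accepted"),
        "policy_block_rate": rate("policy_blocked"),
        "correction_rate": rate("needed_correction"),
        "net_minutes_saved": saved - cleanup,  # savings after cleanup is counted
    }
```

Read together, the rates expose the failure modes described above: a falling escalation rate paired with a rising correction rate suggests the agent is guessing rather than improving.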
This also gives operators a practical feedback loop. The cases they correct become new golden tasks. The confusing policy becomes a clearer rule. The missed retrieval becomes a better index. The eval suite grows from the work the business sees every week.
The release checklist changes
Once evals exist, the question before release is no longer "does the agent work?" It is more specific. Did the golden tasks pass? Did the replay set hold without regressions? Did high-risk actions stay blocked? Did the new version improve the cases it was meant to improve? Is there an audit trail for what changed?
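Those questions can be turned into a checklist that either comes back empty or names what is blocking the release. The sketch below assumes the evidence has already been collected by earlier steps; ReleaseEvidence, its fields, and release_blockers are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class ReleaseEvidence:
    """Inputs to the release decision (hypothetical shape)."""
    golden_failures: list   # golden tasks that failed
    regressed_tasks: list   # replayed tasks that passed before and fail now
    gate_violations: list   # high-risk actions that were not blocked
    target_tasks: set       # cases this change was meant to improve
    improved_tasks: set     # cases that actually improved
    change_log: str         # audit trail of what changed


def release_blockers(evidence: ReleaseEvidence) -> list:
    """Return the reasons not to release; an empty list means go."""
    blockers = []
    if evidence.golden_failures:
        blockers.append(f"{len(evidence.golden_failures)} golden task(s) failed")
    if evidence.regressed_tasks:
        blockers.append(f"{len(evidence.regressed_tasks)} replayed task(s) regressed")
    if evidence.gate_violations:
        blockers.append(f"{len(evidence.gate_violations)} high-risk action(s) not blocked")
    not_improved = evidence.target_tasks - evidence.improved_tasks
    if not_improved:
        blockers.append("target cases not improved: " + ", ".join(sorted(not_improved)))
    if not evidence.change_log.strip():
        blockers.append("no audit trail for what changed")
    return blockers
```

A regression the business has explicitly accepted would be removed from regressed_tasks before this check runs, so the decision to accept it stays visible in review rather than hidden in the code.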
That standard is especially important as teams move from single agents to coordinated workflows. One agent may classify the request, another may retrieve context, another may draft the response, and another may update a system of record. Evals make sure the handoffs are tested as a workflow, not admired as a diagram.
Foundation builds agent workflows with evals from the start: golden tasks, replay traces, regression checks, policy gates, and metrics tied to the work. If your agents are moving past demos and into production, talk to us.
