Building an Agentic AI Eval Harness Before You Ship
The single artefact that separates production-grade agentic AI from a demo that gets applauded in a board pack is the eval harness. Without it, the team cannot tell whether the system is improving or regressing as changes ship. With it, the firm has an objective basis for confidence, and an artefact that satisfies the regulator, the second line, and the accountable senior manager. This piece walks through how we build eval harnesses for regulated agentic AI deployments and what good actually looks like.
Why agentic AI specifically needs an eval harness
Single-step AI automation can often be evaluated through informal sampling, a reviewer looks at the recent outputs and judges them. The work is bounded, the failure modes are familiar, and the consequences of any individual failure are usually contained. Agentic AI is different. The agent is making sequences of decisions, calling sequences of tools, and adapting based on intermediate results. Sampling outputs is no longer enough; the failure modes are emergent, the chain of reasoning matters, and a single mis-step early in a sequence cascades across everything that follows.
The practical answer is automated evaluation against a curated test set, run continuously as the system evolves. The eval is the firm's objective view of whether the agent is doing what it is meant to. Without it, confidence in the system erodes month by month, and once that confidence is gone, it is very hard to rebuild without taking the system down and rebuilding the trust from scratch.
What goes in the test set
A good eval test set has four categories of cases, in roughly the proportions below.
Happy path cases (40-50%). Cases that represent the most common, most well-understood version of the workflow. These are the cases the agent must handle with very high reliability, typically >95% pass rate, because they account for the volume. Failure on a happy path case is unacceptable in production.
Edge cases (20-30%). Cases that are unusual but not pathological. The agent should handle these correctly when it can, and escalate to a human when it cannot - the failure mode here is silent confidence on a case the agent should have escalated. The eval is what catches that failure mode.
Failure-mode cases (15-20%). Cases specifically designed to trigger the failure modes the team is worried about. Out-of-scope requests. Confusing inputs. Cases where the right answer is “decline to act” or “escalate immediately.” The pass criterion here is not that the agent gets the right answer, it is that the agent recognises the case and behaves appropriately.
Adversarial cases (10-15%). Cases designed to probe security and policy boundaries. Prompt injection attempts. Attempts to get the agent to call tools it should not. Attempts to get the agent to act outside its defined scope. The pass criterion is that the agent stays inside its boundaries; the fail mode is any case where the agent is talked into doing something it should not.
How to build the test set
The test set is built during the design phase of the engagement, before the agent is built. This sequencing matters: the test set is what the agent is being built to pass. Inverting the order, build first, eval second, produces an eval that is shaped around what the agent already does rather than what the firm wants the agent to do. We have rebuilt agentic systems where the original eval was effectively post-hoc validation, and the cost of that rework is substantial.
Sources for the test set: historic real cases (anonymised where appropriate), edge cases identified during the Evolve Workflow Audit, regulatory failure modes the firm's second line wants tested, and adversarial cases drawn from current public threat intelligence on agentic systems. The test set is a living document, we add cases as the firm encounters new edge cases in production, and the discipline of doing so is what keeps the system honest year on year.
What the eval actually measures
Three categories of measurement, run on every change.
Pass rate per case category. The headline number. Happy-path pass rate should be >95% in production. Edge-case pass rate target depends on the workflow but typically 80-90%. Failure-mode pass rate should be very high, these are designed-in failure modes the agent must recognise. Adversarial pass rate should be near 100%.
Step-level correctness. Beyond the headline pass rate, the eval looks at whether the agent took the right intermediate steps. An agent can produce the right final output through the wrong process; in regulated environments this is often itself a failure. The eval captures the tool calls, the reasoning, and the intermediate outputs, and the scoring includes step-level correctness rather than only end-state.
Behavioural drift. Comparison of the current run against the previous run, flagging any case where the agent's behaviour has changed. Even when both runs pass, unexplained behavioural drift is a signal, usually that something in the underlying model, prompt, or data has changed in a way the team needs to understand.
How often the eval runs
Three trigger points. Every change to the agent specification, prompts, or tool boundaries. The eval runs in CI; nothing ships without a green eval. Every change to the underlying model. Provider model updates can change behaviour subtly; the eval is what catches that. Quarterly, regardless of changes.New edge cases get added, the test set is refreshed, and the agent is re-validated against the current production reality.
What to do when the eval fails
When the eval fails on a change, the change does not ship. That is the discipline. The team investigates, fixes, and re-runs. When the eval fails on a quarterly run with no change, that is a signal that production reality has drifted from the test set, usually a meaningful product or operational signal that needs investigation regardless of the AI question.
When the eval fails on a model update from the provider, there are usually three responses depending on severity: pin to the previous model version, adjust prompts to compensate for the behavioural change, or accept the new behaviour and update the test set. None is always correct; the firm's second line is the right judge of which is appropriate.
The accountability artefact
The eval harness is also the artefact the senior manager accountable for the agentic system relies on. Quarterly eval results, behavioural drift summary, and any failed-case investigations form the core of the control pack the senior manager reviews. Without that artefact, the SM&CR conversation is uncomfortable; with it, the senior manager has an objective basis for the answers they are asked.
For more on agentic AI engineering for regulated UK industries, see the agentic AI pillar, the operator's guide Agentic AI explained, or the practical companion Human-in-the-loop patterns for agentic AI.
Ready to transform your business with AI?
Book a free strategy session to discuss how Evolve AI can help your organisation harness AI safely and compliantly.
Book Strategy Session