
From Pilot to Production: A 12-Week Pattern for Agentic AI Deployments

8 May 2026 | 11 min read

Most failed agentic AI projects fail because they jump from proof-of-concept to production without the structure in between. The proof-of-concept looked impressive in a demo; the production deployment never arrived, or arrived in a state that made the firm uncomfortable, or arrived and was quietly switched off six months later when a senior manager could no longer answer the supervisor's questions about it. This piece walks through the twelve-week pattern we use to take agentic AI from concept to governed production at UK mid-market firms, and what each phase looks like in practice.

Why twelve weeks

Twelve weeks is long enough to do the work properly and short enough that the engagement does not lose momentum or the sponsor's attention. Shorter than this, and the eval, governance, and pilot phases get compressed to a point that makes the firm's second line uncomfortable. Longer than this, and the firm starts to wonder whether the project will ever deliver, which is usually a sign that the scope was wrong rather than the timeline.

The pattern below is what most engagements look like at a 200-500-person UK mid-market firm in financial services, legal, or healthcare. Larger scopes are sequenced into multiple twelve-week cycles rather than scaled into one big-bang launch. The first cycle delivers a working agentic system into production; subsequent cycles extend it.

Weeks 1-4: Workflow Audit and design

The first four weeks apply the Evolve Workflow Audit to two jobs: confirming that the candidate workflow is genuinely suited to an agentic approach, and designing the agent specification, the eval test set, and the governance plan.

Week 1: discovery sessions. Structured time with the people doing the work, the people supervising the work, and the people accountable for the work. The aim is to map how the workflow actually runs: the workarounds, the exceptions, and the points where the firm currently relies on tacit judgement.

Week 2: workflow mapping and confirmation. This week produces the detailed map of the workflow as it runs, the confirmation that the workflow is genuinely multi-step (not single-step in disguise), the identification of the right human-in-the-loop checkpoints, and the first draft of the agent specification: what the agent does, what tools it can call, and what its boundaries are.

Week 3: eval test set construction. A curated test set covering happy-path, edge-case, failure-mode, and adversarial cases. The test set is the artefact the agent will be built to pass; it is constructed before any code is written so the build is shaped around the right target. Sources include historic real cases (anonymised where appropriate), edge cases identified during discovery, and regulatory failure modes the firm's second line wants tested.
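As an illustration only, the four case types and the three sources described above could be captured in a small data structure that also flags coverage gaps. This is a minimal sketch: the names `EvalCase`, `CaseCategory`, and `missing_categories`, the source labels, and the example cases are all hypothetical and not part of any specific tooling.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class CaseCategory(Enum):
    HAPPY_PATH = "happy_path"
    EDGE_CASE = "edge_case"
    FAILURE_MODE = "failure_mode"
    ADVERSARIAL = "adversarial"


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    category: CaseCategory
    source: str            # hypothetical labels: "historic", "discovery", "second_line"
    input_text: str
    expected_outcome: str  # what a correct agent run should end in


def missing_categories(cases):
    """Return the categories with no cases yet, in declaration order."""
    present = Counter(case.category for case in cases)
    return [category for category in CaseCategory if present[category] == 0]


# Illustrative cases only; a real test set would be far larger.
cases = [
    EvalCase("c-001", CaseCategory.HAPPY_PATH, "historic",
             "standard request", "complete"),
    EvalCase("c-002", CaseCategory.EDGE_CASE, "discovery",
             "request missing a field", "escalate"),
    EvalCase("c-003", CaseCategory.ADVERSARIAL, "second_line",
             "instruction hidden in a document", "refuse"),
]
gaps = missing_categories(cases)  # here, only the failure-mode category is empty
```

A check like this is cheap to run before design sign-off, so that a missing category is noticed before the build is shaped around an incomplete target.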

Week 4: governance plan and design sign-off. The governance plan covering accountability, audit-trail design, escalation thresholds, monitoring, rollback path, and the artefacts the senior manager will own. Design sign-off from the firm's second line, the accountable senior manager, and (where relevant) the firm's compliance and clinical safety functions. No build starts until design sign-off is in place.

Weeks 5-8: build and eval

Four weeks of focused build with the eval harness running continuously. The eval is what gates progress; we do not move on until the agent is reliably above the bar set in design.
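One way to make "the eval gates progress" concrete is a per-category pass-rate check against the bar agreed in design. The sketch below assumes results arrive as simple (category, passed) pairs; the function name `eval_gate` and the example bars are hypothetical, and real bars would come from the design sign-off.

```python
def eval_gate(results, bars):
    """Check eval results against per-category minimum pass rates.

    results: iterable of (category, passed) pairs.
    bars: mapping of category -> minimum pass rate between 0 and 1.
    Returns (ok, failing), where failing is a sorted list of categories below bar.
    """
    totals, passes = {}, {}
    for category, passed in results:
        totals[category] = totals.get(category, 0) + 1
        passes[category] = passes.get(category, 0) + (1 if passed else 0)
    rates = {category: passes[category] / totals[category] for category in totals}
    # A category that has a bar but no results is treated as failing.
    failing = sorted(c for c, bar in bars.items() if rates.get(c, 0.0) < bar)
    return (not failing, failing)


# Hypothetical run: 19 of 20 happy-path cases pass, 9 of 10 adversarial cases pass.
results = (
    [("happy_path", True)] * 19 + [("happy_path", False)]
    + [("adversarial", True)] * 9 + [("adversarial", False)]
)
bars = {"happy_path": 0.95, "adversarial": 1.0}  # assumed bars, for illustration
ok, failing = eval_gate(results, bars)
```

Gating per category rather than on a single overall score keeps a strong happy-path result from masking a weak adversarial one.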

Week 5: skeleton agent and tool integrations. The agent is built in its first end-to-end form, with the tool integrations connecting to the firm's real systems (CRM, document management, case management, whatever the workflow requires). The first eval run happens at the end of this week. Typically the agent is not yet meeting the bar, but the eval is now running.

Week 6: audit logging and human-in-the-loop checkpoints. The step-level audit logging is built in. The human-in-the-loop checkpoint UI is built: the surface through which reviewers see the agent's output, its inputs, its reasoning, and the uncertainty signals it surfaced. The eval is run again; behaviour should be improving.
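A step-level audit record can be as simple as one JSON line per agent step. The following is a minimal sketch under assumed field names (`case_id`, `step`, `action`, `inputs`, `output`, `uncertainty`); the actual schema would be set by the audit-trail design in the governance plan, and the example values are invented.

```python
import json
from datetime import datetime, timezone


def audit_record(case_id, step, action, inputs, output, uncertainty=None, now=None):
    """Serialise one agent step as a JSON line for an append-only audit log."""
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    record = {
        "case_id": case_id,
        "step": step,
        "action": action,            # e.g. the tool called at this step
        "inputs": inputs,
        "output": output,
        "uncertainty": uncertainty,  # whatever signal the agent surfaced, if any
        "timestamp": timestamp,
    }
    # sort_keys keeps the line stable, which makes logs easier to diff.
    return json.dumps(record, sort_keys=True)


# Hypothetical step from a hypothetical case.
line = audit_record(
    case_id="case-42",
    step=2,
    action="lookup_client_record",
    inputs={"client_ref": "example-ref"},
    output={"status": "found"},
    uncertainty="low",
    now=datetime(2026, 5, 8, 9, 0, tzinfo=timezone.utc),
)
```

Because each record carries the inputs and the uncertainty signal alongside the output, the same data can feed the reviewer's checkpoint view and the later audit.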

Week 7: refinement against eval results. The most analytical week of the engagement. The eval results from the previous runs are analysed for failure patterns; prompts are refined, tool boundaries adjusted, the agent specification updated. By the end of the week the agent should be passing the eval at production-ready levels.

Week 8: stress-testing and rollback rehearsal. The adversarial test cases are run again, the boundaries are probed, and the rollback path is rehearsed. The control pack (model documentation, current prompts, eval results, monitoring dashboards, rollback procedure) is assembled. The accountable senior manager reviews the control pack and signs off that the system is ready for pilot.

Weeks 9-12: pilot and production

Four weeks of running the agent against real work, refining based on what production reveals, and rolling out under monitoring with the rollback path in place.

Week 9: controlled pilot. The agent runs on a slice of real work: typically a single team, a defined case type, or a percentage of inbound volume. Every case the agent handles is logged step by step. Every human-in-the-loop checkpoint is observed to see whether the review is real or a rubber stamp. The first production-reality eval cases are added to the test set.
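Where the pilot slice is a percentage of inbound volume, one common approach is to hash a stable case identifier into a bucket, so that the same case is always routed the same way. This is a sketch of that general technique, not a description of any particular deployment; `in_pilot` and the 10% figure are hypothetical.

```python
import hashlib


def in_pilot(case_id, pilot_percent):
    """Deterministically assign a case to the pilot slice.

    The case ID is hashed into one of 100 buckets; buckets below
    pilot_percent go to the agent, the rest follow the existing process.
    """
    digest = hashlib.sha256(case_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < pilot_percent


# Hypothetical usage: route roughly 10% of inbound cases to the agent.
routed = [cid for cid in ("case-1", "case-2", "case-3") if in_pilot(cid, 10)]
```

Deterministic assignment means a case does not switch between the agent and the existing process mid-flight, which keeps the step-level log for each pilot case complete.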

Week 10: refinement against pilot data. Analysis of what the agent did on the pilot cases. Confidence thresholds are usually adjusted based on the confidence distribution observed in the pilot rather than the pre-pilot estimate. Prompts are refined. New edge cases identified during pilot are added to the eval. The agent is re-evaluated before any wider rollout.
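To illustrate how a threshold might be re-derived from pilot data, the sketch below assumes each pilot case yields a confidence score and a reviewer verdict on whether the agent was correct. It picks the lowest threshold at which the cases at or above it still meet a target accuracy. The function name `pick_threshold`, the data, and the target are all hypothetical.

```python
def pick_threshold(observations, target_accuracy):
    """Return the lowest confidence threshold that still meets the target.

    observations: list of (confidence, was_correct) pairs from pilot cases.
    Returns None if no threshold meets the target.
    """
    ranked = sorted(observations, key=lambda obs: obs[0], reverse=True)
    best = None
    correct = 0
    for seen, (confidence, was_correct) in enumerate(ranked, start=1):
        correct += 1 if was_correct else 0
        # Only evaluate where the confidence value changes, so that tied
        # cases are always kept on the same side of the threshold.
        at_boundary = seen == len(ranked) or ranked[seen][0] != confidence
        if at_boundary and correct / seen >= target_accuracy:
            best = confidence
    return best


# Invented pilot observations: (agent confidence, reviewer judged it correct).
pilot = [(0.95, True), (0.9, True), (0.85, True),
         (0.8, False), (0.7, True), (0.6, False)]
threshold = pick_threshold(pilot, target_accuracy=0.75)
```

Cases below the chosen threshold would go to human review. With a small pilot sample the estimate is noisy, so the result is a starting point for discussion with the second line rather than a final setting.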

Week 11: rollout under monitoring. The agent is rolled out to the full intended scope, with monitoring dashboards live and the second line watching. The rollback path is on standby. Production volume rises gradually, and the team is on hand for the first cases that fall outside the patterns the eval covered.

Week 12: handover and quarterly cadence. The system is in production with the firm's own team operating it day-to-day. The quarterly governance cadence is established: the eval refresh, the control pack update, and the second-line review. The engagement formally closes; ongoing support shifts to a quarterly governance and improvement rhythm.

What gets cut when timelines slip

We see three patterns at firms that try to compress this into fewer than twelve weeks.

The eval gets cut. This is the most common compression. The team builds without an eval test set, intends to add one later, and never does. The cost is a slow loss of confidence in the system and, eventually, an uncomfortable conversation with the second line about what good looks like.

The pilot gets cut. The team goes from build straight to production. The cost is that production reality reveals failure modes the eval did not cover, and reveals them in front of customers rather than in a controlled environment.

The governance plan gets cut. The team builds first and intends to write the governance plan when the system is shipped. The cost is a much more expensive retrofit when the regulator, the second line, or the senior manager starts asking questions.

What gets cut when scopes are too ambitious

The other compression mode is keeping the timeline but inflating the scope. The pattern we recommend instead: ship the smallest agentic system that delivers the headline value, then extend in subsequent cycles. Firms that ship a working agentic system to production at week 12 are in a much better position to extend it; firms that miss week 12 because the scope was too big tend to lose momentum and end up with an unfinished system that quietly stops being a priority.

For more on agentic AI for regulated UK industries, see the agentic AI pillar, the practical companion "Building an agentic AI eval harness", and "Human-in-the-loop patterns for agentic AI in regulated industries".