From Demo to Durable: The Production Checklist That Saves the 40% of Agentic AI Projects About to Fail

Gartner predicts that more than 40% of agentic AI projects will be canceled by the end of 2027. The reasons it cites (escalating costs, unclear value, inadequate risk controls) are real, but they are not the actual cause of death. The actual cause of death is that the project was greenlit on the strength of a demo, and the demo did not survive contact with production.

A demo is a closed system. Inputs are known. The agent has a happy path. The audience claps. Production is an open system: weird inputs, partial outages, retries, edge cases at 2 a.m., somebody renames a field in the upstream CRM and nobody tells you. Most agentic AI projects fail in the gap between those two environments, and the gap is bigger than people expect — typically 3× the demo budget, occasionally more.

This post is the checklist we use at Foundation AI before we let an agent take real work. Five gates. If your project hasn’t cleared all five, you have a demo, not a system.

Gate 1: Authentication and authorization that survives an audit

The demo agent had a personal access token in an environment variable. That’s fine for a demo. It is not fine in production.

What production requires:

  • The agent authenticates as itself, not as a borrowed human. Service accounts, OAuth client credentials, signed JWTs — anything that doesn’t tie the audit trail to a specific employee.
  • Scopes are narrow. If the agent only needs read on Jobs and write on Notes, it gets exactly that — not Admin.
  • Credentials live in a secret store with rotation. Hardcoded keys, plaintext .env files in shared repos, and tokens pasted into prompt templates are disqualifying.
  • Every action the agent takes is traceable to a request ID, a timestamp, and the upstream caller (human or system) that initiated it. If you can’t tell, after the fact, why the agent did a thing, you can’t defend it when someone asks.
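
A minimal sketch of the last two ideas in the list above: a narrow scope check plus an audit record carrying a request ID, a timestamp, and the upstream caller. The scope strings, the agent name, and the `authorize` helper are illustrative assumptions, not the API of any particular system.

```python
import json
import uuid
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

# Hypothetical scope names; real ones come from the system being integrated.
AGENT_SCOPES = frozenset({"jobs:read", "notes:write"})


@dataclass(frozen=True)
class AuditRecord:
    request_id: str
    timestamp: str
    agent_id: str
    initiated_by: str  # upstream caller: a human user or a system name
    action: str
    allowed: bool


def authorize(action, initiated_by, agent_id="dispatch-agent", granted=AGENT_SCOPES):
    """Check the action against the agent's scopes and return an audit record."""
    return AuditRecord(
        request_id=str(uuid.uuid4()),
        timestamp=datetime.now(timezone.utc).isoformat(),
        agent_id=agent_id,
        initiated_by=initiated_by,
        action=action,
        allowed=action in granted,
    )


if __name__ == "__main__":
    # One permitted action and one that falls outside the granted scopes.
    for action in ("notes:write", "jobs:delete"):
        record = authorize(action, initiated_by="scheduler:after-hours")
        print(json.dumps(asdict(record)))
```

The record is written whether or not the action is allowed, so denied attempts show up in the trail too.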

This sounds like security hygiene because it is. The reason it kills projects is that nobody scoped it into the original budget, and bolting it on after the agent is already taking real actions is expensive and politically painful.

Gate 2: Observability you’d trust at 3 a.m.

When the agent does something wrong in production, you have minutes — not hours — to understand what happened. That requires telemetry that was designed in, not retrofitted.

Minimum bar:

  • Structured logs for every tool call, with the prompt, the response, the latency, the model version, the cost, and the request ID. Plaintext logs aren’t enough; you need to query them.
  • Distributed traces so you can follow an incoming request from the trigger through every tool call and external API hit. Without traces, your debugging strategy is “stare at the logs and guess.” That strategy doesn’t survive a real incident.
  • A model-cost dashboard broken down by use case, by agent, and by customer if you’re running multi-tenant. Surprise model bills are one of the top three causes of project cancellation. Make them visible early.
  • Failure-mode dashboards: what percentage of runs fail, where, why, and whether the rate is trending up. If your only failure signal is “users complain,” you will fail.
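
As one way to picture the structured-log requirement, here is a sketch of a wrapper that emits a single JSON line per tool call using Python's standard `logging` module. The wrapper name, the field names, and the idea of passing `cost_usd` in from the caller are assumptions made for illustration.

```python
import json
import logging
import time

logger = logging.getLogger("agent.tools")


def logged_tool_call(tool, *, request_id, model_version, prompt, cost_usd=0.0, **kwargs):
    """Run tool(**kwargs) and emit one JSON log line describing the call."""
    started = time.perf_counter()
    entry = {
        "request_id": request_id,
        "tool": getattr(tool, "__name__", repr(tool)),
        "model_version": model_version,
        "prompt": prompt,
        "cost_usd": cost_usd,  # assumed to be computed by the caller
    }
    try:
        result = tool(**kwargs)
        entry.update(status="ok", response=result)
        return result
    except Exception as exc:
        entry.update(status="error", error=repr(exc))
        raise
    finally:
        # Runs on success and on failure, so every call leaves a log line.
        entry["latency_ms"] = round((time.perf_counter() - started) * 1000, 2)
        logger.info(json.dumps(entry, default=str))
```

Because each line is a JSON object, the log store can filter on `request_id` or aggregate on `cost_usd` instead of relying on text search.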

If you can’t answer “what did the agent do for customer X today, and how much did it cost?” in under two minutes, you are not ready for production.
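
To make the two-minute question concrete, here is a sketch that answers it from JSON log lines. It assumes each line carries `customer_id`, `timestamp` (ISO 8601), `tool`, and `cost_usd` fields. Those names are hypothetical, and a real deployment would run the equivalent query in its log store rather than in Python.

```python
import json
from collections import defaultdict


def summarize_customer(log_lines, customer_id, day):
    """Return (action counts, total cost in USD) for one customer on one date (YYYY-MM-DD)."""
    actions = defaultdict(int)
    total = 0.0
    for line in log_lines:
        entry = json.loads(line)
        if entry["customer_id"] == customer_id and entry["timestamp"].startswith(day):
            actions[entry["tool"]] += 1
            total += entry["cost_usd"]
    return dict(actions), round(total, 6)
```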

Gate 3: Idempotency and at-most-once semantics for anything that touches the outside world

Agents retry. Tools time out. Networks fail. If your agent sends an email, posts to Slack, creates a record in the CRM, or charges a credit card — and the call times out — what happens on the retry?

In the best case, nothing. In the worst case, the customer gets the same dunning email three times, the Slack channel gets spammed, the CRM has duplicate records, and the credit card is charged twice. Then you have a trust incident, and the project gets put on hold.

The fix is not interesting and not optional:

  • Every external write goes through an idempotency key derived from the run ID plus the action ID.
  • The receiving system either deduplicates on that key, or you wrap it in a local guard that does.
  • You build the assumption that the agent will retry into every tool, not just the ones you’re worried about today.
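
The first two bullets can be sketched as follows. The key is a hash of the run ID plus the action ID, and the guard is the "local guard" fallback for receivers that cannot deduplicate. `LocalGuard` is a hypothetical in-memory class used only for illustration; a real guard needs a durable, shared store.

```python
import hashlib


def idempotency_key(run_id: str, action_id: str) -> str:
    """Stable key for one external write within one agent run."""
    return hashlib.sha256(f"{run_id}:{action_id}".encode("utf-8")).hexdigest()


class LocalGuard:
    """In-memory dedup guard (illustration only; state is lost on restart)."""

    def __init__(self):
        self._results = {}

    def run_once(self, key, side_effect):
        if key in self._results:
            return self._results[key]  # retry: replay the recorded result
        result = side_effect()
        self._results[key] = result
        return result


if __name__ == "__main__":
    sent = []
    guard = LocalGuard()
    key = idempotency_key("run-42", "send-dunning-email")
    for _ in range(3):  # simulate two retries of the same action
        guard.run_once(key, lambda: sent.append("email") or "sent")
    print(len(sent))  # 1
```

A local guard only records calls it saw succeed. If a request times out after the remote side has already acted, the retry will still go through, which is why passing the key to a receiving system that deduplicates on it is the stronger option.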

This is the gate that most demos haven’t touched. It is also the gate that, when it fails in production, embarrasses you in front of the customer.

Gate 4: Rollback and a kill switch a non-engineer can hit

When the agent is misbehaving, you need to stop it now — not after the engineer who built it gets back from lunch. The hard requirements:

  • A kill switch for each agent, accessible to operations staff (not just engineering), that halts new runs and gracefully drains in-flight ones.
  • A rollback path to a previous prompt/model/tooling version, with clear semantics about what happens to in-flight work.
  • Feature flags controlling which customers, accounts, or workflows the agent touches, so you can scope the blast radius of a regression to one tenant rather than all of them.
  • A documented incident runbook — who gets paged, what’s the first thing they do, when do you escalate, when do you notify the customer.
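
A rough sketch of the kill-switch and feature-flag semantics described above: a kill stops new runs while in-flight runs drain, and only enabled tenants can start runs at all. The `AgentControls` class is a hypothetical stand-in. In practice this state would live in a shared flag store that operations staff can change without a deploy.

```python
from dataclasses import dataclass, field


@dataclass
class AgentControls:
    """Hypothetical control state for one agent."""
    killed: bool = False
    enabled_tenants: set = field(default_factory=set)
    in_flight: set = field(default_factory=set)

    def start(self, run_id: str, tenant_id: str) -> bool:
        """Admit a run only if the agent is live and the tenant is flagged on."""
        if self.killed or tenant_id not in self.enabled_tenants:
            return False
        self.in_flight.add(run_id)
        return True

    def finish(self, run_id: str) -> None:
        self.in_flight.discard(run_id)

    def kill(self) -> None:
        """Stop admitting new runs; in-flight runs are left to finish."""
        self.killed = True

    def drained(self) -> bool:
        """True once the switch is thrown and no runs remain in flight."""
        return self.killed and not self.in_flight
```

Removing a single tenant from `enabled_tenants` limits a regression to that tenant without touching the global switch.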

The point is not to avoid failure. The point is to make failure recoverable. Projects die when failures are unrecoverable and the team’s only option is to keep apologizing.

Gate 5: An eval harness that runs continuously, not just at launch

Here’s the gate everyone skips, because it doesn’t show up in the demo and it’s annoying to build. You will regret skipping it.

Models drift. Vendors silently retrain. The world changes. The prompt that produced flawless ServiceTitan job notes in March produces subtly worse notes in May, and nobody notices until a customer complains. By then you have a credibility problem.

The eval harness:

  • A labeled test set of representative inputs and expected behaviors, curated by someone who actually knows the domain — not generated by the LLM you’re testing.
  • Automated grading that runs every time you change a prompt, a model, or a tool. The grading is partly programmatic (did the right tool get called, in the right order, with the right arguments) and partly LLM-as-judge with calibrated rubrics, with a sample of the judged runs routed to humans for spot checks.
  • Production sampling — a random 1-5% of real production runs are scored against the same rubric, so you catch drift as it happens, not weeks later.
  • A regression gate in CI so a bad prompt change can’t reach production without somebody overriding the failure intentionally.
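
The programmatic half of the harness can be sketched in a few functions: an exact-match grader for tool calls, a pass rate over the labeled set, a sampler for production runs, and a gate that returns an exit code for CI. All names and the default 2% rate are assumptions. The sampler hashes the run ID instead of drawing a random number, a deliberate substitution so that the same run is always either in or out of the sample. The LLM-as-judge half is out of scope here.

```python
import hashlib


def grade_tool_calls(expected, actual):
    """Pass only if the same tools were called, in the same order, with the same arguments."""
    return expected == actual


def pass_rate(cases, run_agent):
    """cases: list of (input, expected_calls); run_agent(input) -> list of (tool, args)."""
    passed = sum(grade_tool_calls(exp, run_agent(inp)) for inp, exp in cases)
    return passed / len(cases)


def in_production_sample(run_id: str, rate: float = 0.02) -> bool:
    """Deterministically select roughly `rate` of runs by hashing the run ID."""
    bucket = int(hashlib.sha256(run_id.encode("utf-8")).hexdigest(), 16) % 10_000
    return bucket < rate * 10_000


def regression_gate(rate_now: float, baseline: float, tolerance: float = 0.0) -> int:
    """Return an exit code: 0 lets CI proceed, 1 blocks the change."""
    return 0 if rate_now >= baseline - tolerance else 1
```

A CI job would compute `pass_rate` for the candidate prompt or model and exit with the value of `regression_gate`, so a drop below the stored baseline fails the build unless someone overrides it on purpose.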

The eval harness is the difference between an agent that gets better over time and one that quietly rots.

What we learned shipping a ServiceTitan agent at Capstone Plumbing

We built an agent for Capstone Plumbing that lives inside their ServiceTitan installation, handles after-hours dispatch summaries, and triages customer follow-up. It’s been running daily for months. It works.

Three things from that build that surprised us, and surprise nearly every client:

  1. The integration was 70% of the work. The agent logic — the prompt, the tool use, the model selection — was a couple of weeks. The ServiceTitan API plumbing, the field-mapping edge cases, the customer-name normalization, the timezone handling — that was months. This ratio is normal. Plan for it.
  2. The eval harness paid for itself in the first model update. When the underlying model got a quiet revision and the dispatch summaries started omitting a key field, our continuous eval caught it the same day. Without it we would have shipped degraded output to a customer for weeks before anyone noticed.
  3. The kill switch got used exactly once, in week three, when an upstream rate-limit change caused a retry storm. It worked. Nothing burned down. That single use justified the entire effort of building it.

Budgeting the unsexy work

If you’re scoping an agentic project right now, here is the rough cost breakdown we use to size proposals:

  • Demo / prototype: 1× (the baseline)
  • Integration with real systems: 2-3×
  • Production hardening (the five gates above): 1-2×
  • Monitoring, tuning, and eval maintenance: 1× in the first year, and 1× per year after that

Total, before you count the human time on your side and with the demo itself counted as 1×: 5-7× the demo cost in year one. If the project sponsor signed off on 1× and expects production at that price, the project will be in Gartner’s 40% by 2027.

Foundation AI builds agents that ship, then runs them, monitors them, and tunes them. The five gates above aren’t optional checklist items for us — they’re the thing we sell. If your team is great at the demo and stuck on the production gap, that’s exactly where we add value. Start a conversation — or see what production looks like in our showcase.