Engineering

How we ship agentic workflows

Preview environments, reversible deploys, and why every tool call is an event — our build and release pipeline for AI colleagues.

Shipping an AI colleague is harder than shipping a web app. A web app fails visibly; an AI colleague fails politely. It will keep answering, keep executing, keep calling tools — and you might not notice for a week.

So the question we started with was not “how do we deploy fast” but “how do we deploy safely and reverse quickly when we’re wrong.” Everything in our pipeline is built around that inversion.

One event log, everywhere

Every tool call, every message, every memory write is an event in a single append-only log. We wrote this twice before getting it right. The first time we put events in Postgres and it was fine for a week and miserable for the rest — the moment we started doing replay at scale, every query turned into a long-running join across three tables. The second time we put them in a real log system (with a per-conversation partition and a per-workspace retention policy) and stopped thinking about it.

The schema has been stable for six months:

interface ToolCall {
  id: string;
  colleague: string;     // which colleague made the call
  conversation: string;  // which conversation triggered it
  tool: string;          // e.g. "salesforce.contacts.search"
  args: unknown;
  result: unknown;
  startedAt: Date;
  endedAt: Date;
  spanId: string;        // for tracing across nested calls
  status: 'ok' | 'error' | 'timeout';
  tokensIn: number;
  tokensOut: number;
}

The big win is that replaying a production bug is free. You grab the events for the conversation, feed them into a local colleague wired to the same tools in dry-run mode, and step through. No seeding databases, no synthetic fixtures. The bug is the data.
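A minimal sketch of what such a replay loop might look like. All names here are illustrative, not our actual internal API; the real harness wires dry-run tools to the same integrations as production.

```typescript
// Step through recorded tool calls, invoking a dry-run stub for each
// tool and reporting where its behavior diverges from production.
interface RecordedCall {
  tool: string;
  args: unknown;
  result: unknown;   // what production returned
}

type DryRunTool = (args: unknown) => unknown;

function replay(
  events: RecordedCall[],
  tools: Record<string, DryRunTool>,
): { index: number; expected: unknown; actual: unknown }[] {
  const divergences: { index: number; expected: unknown; actual: unknown }[] = [];
  events.forEach((event, index) => {
    const tool = tools[event.tool];
    if (!tool) return; // no dry-run stub registered for this tool; skip
    const actual = tool(event.args);
    // Naive structural comparison; good enough for stepping through a bug.
    if (JSON.stringify(actual) !== JSON.stringify(event.result)) {
      divergences.push({ index, expected: event.result, actual });
    }
  });
  return divergences;
}
```

The first divergence is usually the bug, or at least the first place worth putting a breakpoint.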

The secondary win is that product, support, and eng share one source of truth. When support asks “what did Lisa say to this customer on Tuesday at 3pm?”, we don’t have to go spelunking across three systems. It’s one query against the log.

Preview environments per PR

Every PR gets its own preview tenant. New database, new agents, new webhook URLs, new Slack workspace wired in. You can click a link in the PR description and talk to the proposed version of Lisa in Slack before any reviewer has to read the diff.

We wrote this with a lot of fear. Provisioning a whole tenant per PR sounded expensive. In practice it costs us about $4 per PR per day and has paid for itself every week since we turned it on. The two biggest wins:

  • Reviewing a colleague’s change is actually doing the work, not reading code. A change to Ruby’s reporting tone is reviewed by asking Ruby for a report. A change to how Lisa qualifies leads is reviewed by sending Lisa a pipeline and seeing how she ranks it. Diffs are necessary but not sufficient.
  • Customer bug reports reproduce on preview. When a customer says “Lisa got the email address wrong”, we attach their (sanitized) event log to the preview tenant and the PR author can literally watch the bug happen. One week in, we stopped asking “how do we repro this?”
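The per-PR isolation boils down to deriving every resource name from the PR number. A toy sketch, with entirely made-up URLs (the real provisioning also creates databases, registers webhooks, and wires up a Slack workspace):

```typescript
// Derive an isolated preview tenant's config from a PR number.
// Hosts and URL shapes are hypothetical placeholders.
interface PreviewTenant {
  name: string;
  databaseUrl: string;
  webhookBase: string;
}

function previewTenantFor(prNumber: number): PreviewTenant {
  const name = `preview-pr-${prNumber}`;
  return {
    name,
    databaseUrl: `postgres://preview-host/${name}`,
    webhookBase: `https://hooks.example.com/${name}`,
  };
}
```

Because everything hangs off one deterministic name, teardown is the same function in reverse: delete everything matching the tenant name when the PR closes.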

Reversible deploys

The hard rule we ended up with: if you cannot roll this change back in one minute, it is not a deploy, it is a migration. Migrations get their own track — separate PR, separate review, separate playbook. They go behind a feature flag, they get a dashboard, they get a pre-ship rehearsal in staging.

This rule has not made us faster. It has made us less scared, which is the same thing in the long run.

Concretely, what “reversible” means for us:

  • Model version rollbacks are a config change, not a deploy. If we see a regression on a new model, we flip back without shipping code.
  • Prompt changes are versioned per colleague and can be rolled back independently. A conversation that started on prompt v12 will finish on v12, even if we roll forward to v13 mid-conversation.
  • Tool integrations are gated behind per-workspace flags. If Salesforce breaks us, we turn Salesforce off for new calls and existing conversations finish on the old integration.
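The prompt-pinning rule above can be sketched as follows. Names are hypothetical; the point is that a conversation captures the current version on its first message and keeps it for life:

```typescript
// Pin a conversation to the prompt version it started on, even if
// the colleague's "current" prompt has since rolled forward.
interface PromptStore {
  current: number;                 // latest deployed version
  versions: Map<number, string>;   // version -> prompt text
}

interface Conversation {
  id: string;
  pinnedPromptVersion?: number;    // set on first message, then immutable
}

function promptFor(conversation: Conversation, store: PromptStore): string {
  if (conversation.pinnedPromptVersion === undefined) {
    // First message: pin to whatever is current right now.
    conversation.pinnedPromptVersion = store.current;
  }
  const prompt = store.versions.get(conversation.pinnedPromptVersion);
  if (prompt === undefined) {
    throw new Error(`missing prompt v${conversation.pinnedPromptVersion}`);
  }
  return prompt;
}
```

The corollary is that old prompt versions must stay in the store until every conversation pinned to them has ended, which is a retention question rather than a deploy question.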

Observability is deterministic, outputs are not

The thing that took us longest to internalize: you cannot use the model’s output as the signal that the colleague is working. Outputs are non-deterministic and sometimes wrong in ways that look right. We gate alerting on the deterministic signals around the output — tool call success rate, latency, token costs, conversation completion rate, human-override rate — and treat the output itself as data for the eval suite, not the monitor.
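A toy version of that separation, computed straight off the event log. The thresholds are invented for illustration; the shape of the idea is that alerting consumes only deterministic fields, never output text:

```typescript
// Compute deterministic health signals from tool-call events and
// alert on the envelope, not on model output. Thresholds are made up.
interface ToolEvent {
  status: "ok" | "error" | "timeout";
  startedAt: number;  // ms since epoch
  endedAt: number;
}

interface HealthSignals {
  successRate: number;
  p50LatencyMs: number;
}

function healthSignals(events: ToolEvent[]): HealthSignals {
  const ok = events.filter((e) => e.status === "ok").length;
  const latencies = events
    .map((e) => e.endedAt - e.startedAt)
    .sort((a, b) => a - b);
  return {
    successRate: events.length ? ok / events.length : 1,
    p50LatencyMs: latencies[Math.floor(latencies.length / 2)] ?? 0,
  };
}

function shouldAlert(s: HealthSignals): boolean {
  return s.successRate < 0.95 || s.p50LatencyMs > 5_000;
}
```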

Our eval suite is the other half of the pipeline. Every PR triggers a replay across ~2,000 historical conversations that were human-annotated for correctness. We look at two things: the regression rate (conversations that passed before and fail now) and the fix rate (conversations that failed before and now pass). A PR that improves fix rate by 3% but regresses 1% of previously-passing conversations doesn’t ship without discussion — the people who were being helped by the old behavior don’t care about the average.
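The two rates fall out of a simple before/after comparison per conversation, something like this (a sketch, assuming pass/fail annotations keyed by conversation id):

```typescript
// Compare eval results before and after a PR: regression rate
// (passed -> fails) and fix rate (failed -> passes), both as a
// fraction of all evaluated conversations.
function evalDelta(
  before: Map<string, boolean>,  // conversation id -> passed?
  after: Map<string, boolean>,
): { regressionRate: number; fixRate: number } {
  let regressed = 0;
  let fixed = 0;
  for (const [id, passedBefore] of before) {
    const passedAfter = after.get(id) ?? false;
    if (passedBefore && !passedAfter) regressed++;
    if (!passedBefore && passedAfter) fixed++;
  }
  return {
    regressionRate: regressed / before.size,
    fixRate: fixed / before.size,
  };
}
```

Reporting the two numbers separately, rather than a net score, is what forces the "doesn't ship without discussion" conversation.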

What we’d do differently

If we were starting over:

  • Build the replay harness on day one, not day thirty. We delayed it because it felt like infrastructure, not product. It is product — it is the tool that lets engineering sleep at night.
  • Make the event log the contract. We spent a long time thinking of the DB as the source of truth and the event log as a mirror. Once we inverted it — the log is primary; the DB is a projection — everything about debugging, replay, and rollback got simpler.
  • Version prompts like code, not like config. Prompts are code. They have behavior, edge cases, regressions. Treating them as strings in a config file meant they didn’t get reviewed, tested, or rolled back like the rest of our system. That caused real incidents.
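"The log is primary; the DB is a projection" has a concrete meaning: any table can be rebuilt by folding over the event history. A minimal sketch with invented event shapes:

```typescript
// Rebuild current state purely from the event log. Event shapes
// here are illustrative, not our real schema.
type MemoryEvent =
  | { kind: "memory.write"; key: string; value: string }
  | { kind: "memory.delete"; key: string };

// Fold the full history into the state a database table would hold.
function project(events: MemoryEvent[]): Map<string, string> {
  const state = new Map<string, string>();
  for (const event of events) {
    if (event.kind === "memory.write") state.set(event.key, event.value);
    else state.delete(event.key);
  }
  return state;
}
```

Once this holds, "restore the DB" stops being a backup problem and becomes "re-run the projection", which is the property that makes rollback and replay cheap.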

We’ll keep writing up pieces of this as we refine them. If your team is shipping agentic workflows and doing any of this differently, we’d genuinely love to compare notes.
