Part 4: Agent Orchestration at Scale - Building an Enterprise Platform

2025

AI-for-Work Agents Platform

The moment an agent can call tools, the problem stops being prompt design and becomes distributed systems. You now need to schedule tasks, isolate execution, manage credentials, stream state, recover from failure, and explain afterward what happened. None of this is new computer science. What's new is that the worker in the middle of the workflow is probabilistic and occasionally very confident about the wrong thing.

A runtime, not a long chat

The biggest mental shift is to stop modeling agent work as a conversation and start modeling it as workflow infrastructure. "Let the model keep thinking" is not an operational model once the model can touch real systems. The platform needs the unglamorous machinery: task ids, queues, leases, retries, idempotency keys, artifact storage, tool scopes, approvals, and cancellation that actually cancels.
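To make a couple of those pieces concrete, here is a minimal sketch of leases and cancellation. Everything in it is hypothetical: `LeaseQueue`, `Task`, and `should_stop` are illustrative names, and the in-memory dict stands in for a durable queue. A worker holds a task only until its lease expires, and cancellation is a flag the worker must check between steps.

```python
import time
import uuid
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Task:
    task_id: str
    payload: dict
    lease_owner: Optional[str] = None
    lease_expires_at: float = 0.0
    cancel_requested: bool = False


class LeaseQueue:
    """In-memory stand-in for a durable task queue (illustrative only)."""

    def __init__(self, lease_seconds: float = 30.0):
        self.lease_seconds = lease_seconds
        self.tasks: Dict[str, Task] = {}

    def submit(self, payload: dict) -> str:
        task_id = str(uuid.uuid4())
        self.tasks[task_id] = Task(task_id, payload)
        return task_id

    def claim(self, worker_id: str, now: Optional[float] = None) -> Optional[Task]:
        """Hand out a task that is unleased or whose lease has expired."""
        now = time.monotonic() if now is None else now
        for task in self.tasks.values():
            if task.cancel_requested:
                continue
            if task.lease_owner is None or task.lease_expires_at <= now:
                task.lease_owner = worker_id
                task.lease_expires_at = now + self.lease_seconds
                return task
        return None

    def cancel(self, task_id: str) -> None:
        self.tasks[task_id].cancel_requested = True

    def should_stop(self, task_id: str, worker_id: str, now: Optional[float] = None) -> bool:
        """Workers call this between steps; True means stop touching the task."""
        now = time.monotonic() if now is None else now
        task = self.tasks[task_id]
        return (
            task.cancel_requested
            or task.lease_owner != worker_id
            or task.lease_expires_at <= now
        )
```

The design choice worth noting is that a crashed worker needs no cleanup: its lease simply expires and another worker can claim the task. That is also why duplicate execution becomes possible, and why the idempotency keys matter.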

Concretely, a production agent run should move through explicit states — created, context-loaded, planning, waiting-for-approval, executing, blocked, completed, failed, canceled, rolled-back — and every transition should be logged with the actor, the model version, the policy version, and the evidence behind it. That sounds like bureaucracy until the first time you have to resume a half-finished run, or explain to a security review why an agent did what it did. Then it sounds like the whole point.
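As a rough sketch of what that looks like in code: the states below are the ones listed above, but the transition table is my own assumption about which moves are legal, and `AgentRun` and `Transition` are hypothetical names. The point is only that an illegal move is rejected and every legal one leaves a record.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Dict, List, Set


class RunState(str, Enum):
    CREATED = "created"
    CONTEXT_LOADED = "context-loaded"
    PLANNING = "planning"
    WAITING_FOR_APPROVAL = "waiting-for-approval"
    EXECUTING = "executing"
    BLOCKED = "blocked"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELED = "canceled"
    ROLLED_BACK = "rolled-back"


S = RunState
# Assumed transition table; a real platform would define its own.
ALLOWED: Dict[RunState, Set[RunState]] = {
    S.CREATED: {S.CONTEXT_LOADED, S.FAILED, S.CANCELED},
    S.CONTEXT_LOADED: {S.PLANNING, S.FAILED, S.CANCELED},
    S.PLANNING: {S.WAITING_FOR_APPROVAL, S.EXECUTING, S.FAILED, S.CANCELED},
    S.WAITING_FOR_APPROVAL: {S.EXECUTING, S.FAILED, S.CANCELED},
    S.EXECUTING: {S.WAITING_FOR_APPROVAL, S.BLOCKED, S.COMPLETED, S.FAILED, S.CANCELED},
    S.BLOCKED: {S.EXECUTING, S.FAILED, S.CANCELED},
    S.FAILED: {S.ROLLED_BACK},
    S.CANCELED: {S.ROLLED_BACK},
    S.COMPLETED: set(),
    S.ROLLED_BACK: set(),
}


@dataclass(frozen=True)
class Transition:
    from_state: RunState
    to_state: RunState
    actor: str
    model_version: str
    policy_version: str
    evidence: str
    at: str


@dataclass
class AgentRun:
    run_id: str
    state: RunState = RunState.CREATED
    log: List[Transition] = field(default_factory=list)

    def transition(self, to_state: RunState, *, actor: str, model_version: str,
                   policy_version: str, evidence: str) -> None:
        """Move to a new state, or raise if the move is not in ALLOWED."""
        if to_state not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition {self.state.value} -> {to_state.value}")
        self.log.append(Transition(
            self.state, to_state, actor, model_version, policy_version, evidence,
            datetime.now(timezone.utc).isoformat(),
        ))
        self.state = to_state
```

Resuming a half-finished run then amounts to reading the last logged state and continuing from there, and the security review gets the log itself rather than a reconstruction.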

Design for the failure modes

Most of the runtime's value is in the failures it makes survivable. The ones I'd design for first:

Duplicate execution after a retry or worker crash — this is why idempotency keys are not optional.
A tool call that succeeds while the artifact it produced never gets persisted.
Partial completion that leaves an external system in an ambiguous state with no record of how far the agent got.
An approval granted for one narrow action, quietly reused for a broader one.
Runaway loops that burn model budget without ever advancing the task state.
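Two of these are easy to sketch. The code below is a minimal illustration under stated assumptions, not a production design: `idempotency_key`, `ToolExecutor`, and `ProgressBudget` are hypothetical names, and a plain dict stands in for a durable result store. The first part derives a stable key per tool call so a retry returns the recorded result instead of running the tool again; the second stops a run that keeps taking steps without changing task state.

```python
import hashlib
import json
from typing import Any, Callable, Dict, Optional


def idempotency_key(task_id: str, step: int, tool: str, args: Dict[str, Any]) -> str:
    """Same task, step, tool, and arguments always produce the same key."""
    blob = json.dumps(
        {"task": task_id, "step": step, "tool": tool, "args": args},
        sort_keys=True,
    )
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


class ToolExecutor:
    def __init__(self) -> None:
        # Stand-in for durable storage; a real store must survive worker crashes.
        self.results: Dict[str, Any] = {}

    def run(self, key: str, tool_fn: Callable[..., Any], args: Dict[str, Any]) -> Any:
        if key in self.results:
            # Retry or duplicate delivery: return the recorded result.
            return self.results[key]
        result = tool_fn(**args)
        # Record the result before reporting success to the caller.
        self.results[key] = result
        return result


class ProgressBudget:
    """Raise once the task state has not changed for too many steps."""

    def __init__(self, max_stalled_steps: int) -> None:
        self.max_stalled_steps = max_stalled_steps
        self.stalled = 0
        self.last_state: Optional[str] = None

    def record(self, task_state: str) -> None:
        if task_state == self.last_state:
            self.stalled += 1
        else:
            self.stalled = 0
            self.last_state = task_state
        if self.stalled >= self.max_stalled_steps:
            raise RuntimeError(f"no progress after {self.stalled} repeated steps")
```

This does not close every gap. A crash between the tool call and the write to `results` still leaves the outcome ambiguous, so where the external system accepts an idempotency key of its own, passing the same key through is the safer arrangement.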

The approval failure, where a grant stretches beyond what was approved, is the one that connects back to the gateway from Part 2. Permissions need to live in both the runtime and the gateway, scoped per task, not granted to "the agent" as a standing identity. An agent doesn't need the access its busiest day requires; it needs the access this task requires.
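One way to picture task-scoped permission is an approval record that names exactly what was approved and is checked on every call. This is a sketch under my own assumptions: `Approval` and `is_authorized` are hypothetical, and the tool and resource strings are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Approval:
    task_id: str       # the one task this approval belongs to
    tool: str          # the one tool it covers
    resource: str      # the one resource it covers
    expires_at: float  # approvals should not outlive the task


def is_authorized(approval: Approval, task_id: str, tool: str,
                  resource: str, now: float) -> bool:
    """Exact match on every field; anything broader is a new approval request."""
    return (
        approval.task_id == task_id
        and approval.tool == tool
        and approval.resource == resource
        and now < approval.expires_at
    )


# Usage: approved to update one account, for one task.
grant = Approval("task-1", "crm.update", "account/42", expires_at=100.0)
same_action = is_authorized(grant, "task-1", "crm.update", "account/42", now=50.0)
other_account = is_authorized(grant, "task-1", "crm.update", "account/43", now=50.0)
```

Here `same_action` is true and `other_account` is false. Exact matching is deliberately strict: the approval cannot be reused for a neighboring resource, a different task, or after it expires.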

I don't think the agent future depends on one perfect agent. It depends on the runtime that makes many imperfect agents useful enough to trust — observable, resumable, auditable, and constrained. Build the boring machinery and the demos turn into infrastructure. Skip it and the demos stay demos.

This is Part 4 of the AI-for-Work series. Next: Part 5: Preserving Institutional Knowledge.