AI Labs

Decision guide

Agent vs durable workflow

Chat agents for ambiguity and tool selection; durable workflows for steps that must survive retries, time, and human approval.

Business sponsorsTechnical leaders

Why the distinction matters for enterprise AI

Vendors use agent for many things. In architecture terms, an agent is a model loop that chooses tools from conversation context. A durable workflow is a persisted sequence of steps with known handoffs, timers, and audit events. See AI SDK agents and durable workflow orchestration for contrasting patterns.

Confusing the two leads to either rigid chatbots that cannot handle exceptions, or autonomous loops that cannot survive restarts, prove SLAs, or satisfy regulators. Map both to the AI SDK agents decision record before build.

Most production systems are hybrid: workflow for intake, logging, and approvals; agent for the human-facing exception path. The streaming agent example and workflow orchestration example demo each piece for workshops.

Sponsors should ask for a timeline of steps, not only a chat transcript. Engineers should ask whether the happy path fits a flowchart without "model decides." Review production readiness conversation before regulated writes.

  • Agent: next step not known in advance; user drives path
  • Workflow: sequence defined upfront; system drives state. See AWS Step Functions.
  • Hybrid: workflow backbone + agent surface for exceptions
  • Workshop question: "What must still be true after a server restart mid-case?"

What an agent is good at

An agent runs a loop: model proposes tool calls, runtime executes, results return to context, model continues. The user steers with natural language; troubleshooting and multi-turn clarification are natural fits.

Agents shine when ambiguity is real: which CRM object, which knowledge article, whether to escalate. They struggle when the process is already drawn on a whiteboard with fixed SLAs. OpenAI function calling docs explain the loop mechanics.

Streaming UX matters for agents. Users tolerate thinking time if they see partial reasoning and tool status. Batch-only agents feel broken in operations centres. Instrument with AI SDK telemetry.

Start read-only: lookup tools, search, draft text. Add writes only behind OpenAI safety best practices patterns demonstrated in the HITL approval example.

  • Operational copilots with CRM, ITSM, or knowledge-base tools. See Bedrock Agents.
  • Ad-hoc analysis with escalation on low confidence
  • Exploratory Q&A where tool selection depends on the question. Demo Anthropic tool use.
  • Poor fit: fixed compliance pipeline with mandatory timestamps
  • What good looks like: tool allow list and trace per session

What a durable workflow is good at

A workflow encodes known steps with persistence: intake, classify, route, notify, wait for approval. Steps retry, pause for days, and emit audit events. If your server restarts mid-process, state survives. durable workflow orchestration and Azure Logic Apps both model this pattern.

Use workflows when the sequence is defined upfront and compliance needs timestamps for every transition. Legal and operations teams think in case states, not chat turns. Align with NIST AI RMF Manage for audit evidence.

Document ingestion, index refresh, and regulated approvals are classic workflow workloads. The model may classify or draft inside a step, but the orchestration stays deterministic. See Foundry orchestration example.

Azure Logic Apps, AWS Step Functions, and durable workflow orchestration represent different anchors; pick the control plane your organisation already operates. Cross-check the choosing cloud anchor guide.

  • Transformation intake: parse, risk score, recommendation
  • Document ingestion, chunking, and index refresh jobs. Link to Azure RAG solution guide.
  • Regulated approvals with SLA timers and escalation. Pair with OpenAI safety best practices.
  • Batch handoffs between teams (legal, security, architecture)
  • What good looks like: every transition logged with actor and time

Safety: agents need guardrails

Agents with write tools are high risk on day one. Production patterns almost always start read-only, add human-in-the-loop for writes via the OpenAI safety best practices, then narrow automated writes to low blast-radius actions.

Combine tool allow lists, argument validation, rate limits, and escalation queues. Prompt pleading alone does not stop tool abuse or prompt injection via retrieved content. Layer Azure prompt shields where appropriate.

Run kill-switch drills: disable all tools without taking down read-only Q&A. Security and operations should practise this before go-live. Document drills in the AWS AI compliance.

Pen-test tool endpoints directly. Parameter tampering on internal APIs is a common gap when only the chat UI is tested. Review OWASP LLM Top 10 for agent-specific threats.

  • Allow list of tool names; reject unknown tools at runtime
  • Schema validation on every tool argument. Follow OpenAI safety best practices.
  • Global disable-tools flag wired to incident runbook
  • Log tool name, args (redacted), approver, outcome
  • Common mistake: CRM write on first sprint

Combine agents and workflows in production

The most common enterprise shape is hybrid. A workflow backbone receives the ticket, records state, and enforces SLAs. An agent surface helps the human draft, propose actions, or explore edge cases. Foundry agents can sit inside either layer.

Example: workflow receives a ticket, agent drafts a response and proposes a CRM update, workflow records the proposal, supervisor approves, workflow executes the write and closes the loop. Model the HITL approval example.

Keep proposal and execution separate so idempotency and retries stay predictable. The agent should not call write APIs directly in regulated paths unless policy explicitly allows it. See OpenAI safety best practices propose, approve, execute language.

Compare Azure Foundry agents and durable workflow orchestration documentation when demoing the split without overbuilding on day one.

  • Workflow owns case ID, timestamps, and approval queue
  • Agent owns language, retrieval, and draft quality. Link RAG patterns where needed.
  • Execution step is deterministic API or integration
  • What good looks like: replay a case from logs without chat guesswork

Decision checklist

Ask these in architecture review before picking agent-only. Document answers in the charter; "we will see" is not a decision. Use OpenAI evals to score agent proposals on golden cases.

If three or more answers point to workflow, do not ship agent-only because the demo was faster to build. Escalate through the NIST AI RMF Govern if stakeholders disagree.

Council exceptions for agent-only writes need an expiry and named risk owner. Align with AI security controls.

  • Can you draw the happy path as a flowchart without model decides? → workflow
  • Must the process survive restarts and multi-day waits? → workflow
  • Is the user exploring or following a case? → agent with tools
  • Do writes touch customers or money? → workflow + approval; agent proposes only
  • Do regulators require step-level audit? → workflow
  • Workshop question: "Where does a human always intervene?"

Operating model and ownership

Agents need an on-call owner for tool failures, prompt drift, and escalation queues. Workflows need a process owner for step changes and SLA breaches. Both need OpenAI evals metrics on one dashboard.

In pilots, name who approves new tools, who reviews override rates, and who signs off before writes go live. Unowned queues become silent failure modes. Champions route questions through the NIST AI RMF Govern.

Platform engineering owns SDK versions, keys, routing, and observability via AI Gateway or cloud-native logging. Business owns outcomes and escalation SLAs. Risk owns allow lists and data classes.

Do not make champions shadow admins for API keys or logging bypass. That anti-pattern appears in OWASP LLM Top 10 every quarter.

  • Business owner: workflow outcomes and escalation SLAs
  • Platform owner: gateway, telemetry, deploy pipeline
  • Risk owner: tool and data class allow lists. See security controls.
  • Champion: adoption signals, not production on-call unless funded

SLAs, timers, and human queues

Workflows express SLAs explicitly: wait 48 hours for approval, escalate to L2, notify sponsor. Agents rarely enforce time without workflow wrapping them. durable workflow orchestration timers model this cleanly.

Design approval UIs with one-click approve/reject, policy flags, citations, and audit export. Supervisors will not read long chat threads under pressure. Follow the OpenAI safety best practices.

Measure queue depth, time-to-approve, and override reasons. Rising overrides mean prompts, tools, or retrieval need tuning, not more autonomy. Track in the OpenAI evals dashboard.

Pair with the OpenAI safety best practices for propose, approve, execute language sponsors understand.

  • Pending approval SLA with escalation path
  • After-hours behaviour: queue vs auto-decline writes
  • Idempotent execute step to survive retries
  • What good looks like: monthly queue health in steering deck

What to show in a steering forum

Bring a timeline of steps (workflow) or a live trace (agent) rather than a chat transcript alone. Executives want to see where humans intervene and what gets logged. Export Bedrock invocation logs or SDK telemetry samples.

Show one incident drill narrative: disable tools, fall back to human queue, preserve logs. Confidence in operations beats demo eloquence. Reference production readiness conversation checklist items.

Compare cost per successful case for agent-heavy vs workflow-heavy paths. Long agent loops are expensive and hard to cap without workflow guardrails. Use unified model gateway forecasting.

Link metrics to the OpenAI evals guide so finance and engineering share one page.

  • Diagram: states, transitions, approval points
  • Sample log line: correlation ID through tools and workflow. See OpenTelemetry.
  • Override rate and top three override reasons
  • Common mistake: steering demo with no logging story

Failure modes and anti-patterns

Naming failures early builds trust with sceptical IT and risk leaders. Publish them alongside OWASP LLM Top 10 the council already tracks.

Agent-only on regulated writes fails the first audit. Workflow-only with no agent surface frustrates users when cases do not fit the flowchart. Hybrid patterns need named owners per layer.

Chat as database: storing case state only in conversation history without workflow IDs is fragile and non-compliant. Persist state in a workflow engine such as durable workflow orchestration.

Tool sprawl: every new integration added as an agent tool without review doubles attack surface. Require security controls review for each tool.

  • Agent-only for multi-day approvals: state lost on restart
  • Workflow-only for exploratory analysis: users work around in shadow AI. Offer a read-only streaming agent.
  • Unlogged tool calls: impossible incident response
  • Infinite agent loops: no step cap or cost cap
  • What good looks like: documented hybrid with owners per layer

Workshop: map one real case

Bring a real ticket or intake record from last month. Draw states on a whiteboard: received, classified, drafted, approved, executed, closed. Use NIST AI RMF charter format for scope.

Mark which steps are deterministic (workflow) and which need language or tool choice (agent). Mark mandatory human gates aligned with OpenAI safety best practices.

End with a one-page architecture: components, data stores, logging, and first pilot scope (read-only week 1 to 2). Attach to the AWS AI compliance.

Assign owners before anyone leaves. Unowned workshops generate slides, not systems. Schedule follow-up with the NIST AI RMF Govern.

  • 0:00 to 0:20: Read real case aloud, list systems touched
  • 0:20 to 0:50: State diagram and human gates
  • 0:50 to 1:10: Tool allow list and logging plan. Review Foundry agents.
  • 1:10 to 1:30: Pilot scope and stop rules

Metrics that fit each pattern

Agents: tool success rate, loops per session, escalation rate, tokens per resolved session, safety refusal rate. Instrument via AI SDK telemetry.

Workflows: step latency, SLA breach count, approval time, retry count, stuck cases. Export to the same BI layer as OpenAI evals KPIs.

Hybrids: attribute cost and quality to the layer that failed. Blaming "the AI" without separation slows improvement. Run golden evals per layer.

Export to the same dashboard sponsors see weekly during pilots. Include cost per task alongside quality.

  • Agent metric: proposals rejected by policy gate
  • Workflow metric: cases stuck in approval > 72 hours
  • Shared metric: cost per successful case
  • Workshop question: "Which metric would make us pause the pilot?"

Next step

Talk about your next pilot

Patterns, metrics, and runnable demos for architecture reviews and pilots, from The Ops Toolbox.

Prefer the web form? The Ops Toolbox.

  • One workflow, clear metrics
  • Your cloud, your keys
  • Written handoff, not dependency