AI Labs by The Ops Toolbox

Business sponsorsTechnical leaders

Why the distinction matters for enterprise AI

Vendors use agent for many things. In architecture terms, an agent is a model loop that chooses tools from conversation context. A durable workflow is a persisted sequence of steps with known handoffs, timers, and audit events. See AI SDK agents and durable workflow orchestration for contrasting patterns.

Confusing the two leads to either rigid chatbots that cannot handle exceptions, or autonomous loops that cannot survive restarts, prove SLAs, or satisfy regulators. Map both to the AI SDK agents decision record before build.

Most production systems are hybrid: workflow for intake, logging, and approvals; agent for the human-facing exception path. The streaming agent example and workflow orchestration example demo each piece for workshops.

Sponsors should ask for a timeline of steps, not only a chat transcript. Engineers should ask whether the happy path fits a flowchart without "model decides." Review production readiness conversation before regulated writes.

Agent: next step not known in advance; user drives path
Workflow: sequence defined upfront; system drives state. See AWS Step Functions.
Hybrid: workflow backbone + agent surface for exceptions
Workshop question: "What must still be true after a server restart mid-case?"

What an agent is good at

An agent runs a loop: model proposes tool calls, runtime executes, results return to context, model continues. The user steers with natural language; troubleshooting and multi-turn clarification are natural fits.

Agents shine when ambiguity is real: which CRM object, which knowledge article, whether to escalate. They struggle when the process is already drawn on a whiteboard with fixed SLAs. OpenAI function calling docs explain the loop mechanics.

Streaming UX matters for agents. Users tolerate thinking time if they see partial reasoning and tool status. Batch-only agents feel broken in operations centres. Instrument with AI SDK telemetry.

Start read-only: lookup tools, search, draft text. Add writes only behind OpenAI safety best practices patterns demonstrated in the HITL approval example.

Operational copilots with CRM, ITSM, or knowledge-base tools. See Bedrock Agents.
Ad-hoc analysis with escalation on low confidence
Exploratory Q&A where tool selection depends on the question. Demo Anthropic tool use.
Poor fit: fixed compliance pipeline with mandatory timestamps
What good looks like: tool allow list and trace per session

What a durable workflow is good at

A workflow encodes known steps with persistence: intake, classify, route, notify, wait for approval. Steps retry, pause for days, and emit audit events. If your server restarts mid-process, state survives. durable workflow orchestration and Azure Logic Apps both model this pattern.

Use workflows when the sequence is defined upfront and compliance needs timestamps for every transition. Legal and operations teams think in case states, not chat turns. Align with NIST AI RMF Manage for audit evidence.

Document ingestion, index refresh, and regulated approvals are classic workflow workloads. The model may classify or draft inside a step, but the orchestration stays deterministic. See Foundry orchestration example.

Azure Logic Apps, AWS Step Functions, and durable workflow orchestration represent different anchors; pick the control plane your organisation already operates. Cross-check the choosing cloud anchor guide.

Transformation intake: parse, risk score, recommendation
Document ingestion, chunking, and index refresh jobs. Link to Azure RAG solution guide.
Regulated approvals with SLA timers and escalation. Pair with OpenAI safety best practices.
Batch handoffs between teams (legal, security, architecture)
What good looks like: every transition logged with actor and time

Safety: agents need guardrails

Agents with write tools are high risk on day one. Production patterns almost always start read-only, add human-in-the-loop for writes via the OpenAI safety best practices, then narrow automated writes to low blast-radius actions.

Combine tool allow lists, argument validation, rate limits, and escalation queues. Prompt pleading alone does not stop tool abuse or prompt injection via retrieved content. Layer Azure prompt shields where appropriate.

Run kill-switch drills: disable all tools without taking down read-only Q&A. Security and operations should practise this before go-live. Document drills in the AWS AI compliance.

Pen-test tool endpoints directly. Parameter tampering on internal APIs is a common gap when only the chat UI is tested. Review OWASP LLM Top 10 for agent-specific threats.

Allow list of tool names; reject unknown tools at runtime
Schema validation on every tool argument. Follow OpenAI safety best practices.
Global disable-tools flag wired to incident runbook
Log tool name, args (redacted), approver, outcome
Common mistake: CRM write on first sprint

Combine agents and workflows in production

The most common enterprise shape is hybrid. A workflow backbone receives the ticket, records state, and enforces SLAs. An agent surface helps the human draft, propose actions, or explore edge cases. Foundry agents can sit inside either layer.

Example: workflow receives a ticket, agent drafts a response and proposes a CRM update, workflow records the proposal, supervisor approves, workflow executes the write and closes the loop. Model the HITL approval example.

Keep proposal and execution separate so idempotency and retries stay predictable. The agent should not call write APIs directly in regulated paths unless policy explicitly allows it. See OpenAI safety best practices propose, approve, execute language.

Compare Azure Foundry agents and durable workflow orchestration documentation when demoing the split without overbuilding on day one.

Workflow owns case ID, timestamps, and approval queue
Agent owns language, retrieval, and draft quality. Link RAG patterns where needed.
Execution step is deterministic API or integration
What good looks like: replay a case from logs without chat guesswork

Decision checklist

Ask these in architecture review before picking agent-only. Document answers in the charter; "we will see" is not a decision. Use OpenAI evals to score agent proposals on golden cases.

If three or more answers point to workflow, do not ship agent-only because the demo was faster to build. Escalate through the NIST AI RMF Govern if stakeholders disagree.

Council exceptions for agent-only writes need an expiry and named risk owner. Align with AI security controls.

Can you draw the happy path as a flowchart without model decides? → workflow
Must the process survive restarts and multi-day waits? → workflow
Is the user exploring or following a case? → agent with tools
Do writes touch customers or money? → workflow + approval; agent proposes only
Do regulators require step-level audit? → workflow
Workshop question: "Where does a human always intervene?"

Operating model and ownership

Agents need an on-call owner for tool failures, prompt drift, and escalation queues. Workflows need a process owner for step changes and SLA breaches. Both need OpenAI evals metrics on one dashboard.

In pilots, name who approves new tools, who reviews override rates, and who signs off before writes go live. Unowned queues become silent failure modes. Champions route questions through the NIST AI RMF Govern.

Platform engineering owns SDK versions, keys, routing, and observability via AI Gateway or cloud-native logging. Business owns outcomes and escalation SLAs. Risk owns allow lists and data classes.

Do not make champions shadow admins for API keys or logging bypass. That anti-pattern appears in OWASP LLM Top 10 every quarter.

Business owner: workflow outcomes and escalation SLAs
Platform owner: gateway, telemetry, deploy pipeline
Risk owner: tool and data class allow lists. See security controls.
Champion: adoption signals, not production on-call unless funded

SLAs, timers, and human queues

Workflows express SLAs explicitly: wait 48 hours for approval, escalate to L2, notify sponsor. Agents rarely enforce time without workflow wrapping them. durable workflow orchestration timers model this cleanly.

Design approval UIs with one-click approve/reject, policy flags, citations, and audit export. Supervisors will not read long chat threads under pressure. Follow the OpenAI safety best practices.

Measure queue depth, time-to-approve, and override reasons. Rising overrides mean prompts, tools, or retrieval need tuning, not more autonomy. Track in the OpenAI evals dashboard.

Pair with the OpenAI safety best practices for propose, approve, execute language sponsors understand.

Pending approval SLA with escalation path
After-hours behaviour: queue vs auto-decline writes
Idempotent execute step to survive retries
What good looks like: monthly queue health in steering deck

What to show in a steering forum

Bring a timeline of steps (workflow) or a live trace (agent) rather than a chat transcript alone. Executives want to see where humans intervene and what gets logged. Export Bedrock invocation logs or SDK telemetry samples.

Show one incident drill narrative: disable tools, fall back to human queue, preserve logs. Confidence in operations beats demo eloquence. Reference production readiness conversation checklist items.

Compare cost per successful case for agent-heavy vs workflow-heavy paths. Long agent loops are expensive and hard to cap without workflow guardrails. Use unified model gateway forecasting.

Link metrics to the OpenAI evals guide so finance and engineering share one page.

Diagram: states, transitions, approval points
Sample log line: correlation ID through tools and workflow. See OpenTelemetry.
Override rate and top three override reasons
Common mistake: steering demo with no logging story

Failure modes and anti-patterns

Naming failures early builds trust with sceptical IT and risk leaders. Publish them alongside OWASP LLM Top 10 the council already tracks.

Agent-only on regulated writes fails the first audit. Workflow-only with no agent surface frustrates users when cases do not fit the flowchart. Hybrid patterns need named owners per layer.

Chat as database: storing case state only in conversation history without workflow IDs is fragile and non-compliant. Persist state in a workflow engine such as durable workflow orchestration.

Tool sprawl: every new integration added as an agent tool without review doubles attack surface. Require security controls review for each tool.

Agent-only for multi-day approvals: state lost on restart
Workflow-only for exploratory analysis: users work around in shadow AI. Offer a read-only streaming agent.
Unlogged tool calls: impossible incident response
Infinite agent loops: no step cap or cost cap
What good looks like: documented hybrid with owners per layer

Workshop: map one real case

Bring a real ticket or intake record from last month. Draw states on a whiteboard: received, classified, drafted, approved, executed, closed. Use NIST AI RMF charter format for scope.

Mark which steps are deterministic (workflow) and which need language or tool choice (agent). Mark mandatory human gates aligned with OpenAI safety best practices.

End with a one-page architecture: components, data stores, logging, and first pilot scope (read-only week 1 to 2). Attach to the AWS AI compliance.

Assign owners before anyone leaves. Unowned workshops generate slides, not systems. Schedule follow-up with the NIST AI RMF Govern.

0:00 to 0:20: Read real case aloud, list systems touched
0:20 to 0:50: State diagram and human gates
0:50 to 1:10: Tool allow list and logging plan. Review Foundry agents.
1:10 to 1:30: Pilot scope and stop rules

Metrics that fit each pattern

Agents: tool success rate, loops per session, escalation rate, tokens per resolved session, safety refusal rate. Instrument via AI SDK telemetry.

Workflows: step latency, SLA breach count, approval time, retry count, stuck cases. Export to the same BI layer as OpenAI evals KPIs.

Hybrids: attribute cost and quality to the layer that failed. Blaming "the AI" without separation slows improvement. Run golden evals per layer.

Export to the same dashboard sponsors see weekly during pilots. Include cost per task alongside quality.

Agent metric: proposals rejected by policy gate
Workflow metric: cases stuck in approval > 72 hours
Shared metric: cost per successful case
Workshop question: "Which metric would make us pause the pilot?"

Agent vs durable workflow

Why the distinction matters for enterprise AI

What an agent is good at

What a durable workflow is good at

Safety: agents need guardrails

Combine agents and workflows in production

Decision checklist

Operating model and ownership

SLAs, timers, and human queues

What to show in a steering forum

Failure modes and anti-patterns

Workshop: map one real case

Metrics that fit each pattern

Plan your next pilot

Agent vs durable workflow

Executive summary

Why the distinction matters for enterprise AI

What an agent is good at

What a durable workflow is good at

Safety: agents need guardrails

Combine agents and workflows in production

Decision checklist

Operating model and ownership

SLAs, timers, and human queues

What to show in a steering forum

Failure modes and anti-patterns

Workshop: map one real case

Metrics that fit each pattern

Plan your next pilot