AI Labs by The Ops Toolbox

Why pilots fail before they start

Most enterprise AI pilots fail in the charter, not the model. Sponsors approve vague exploration budgets, teams build impressive demos, and six months later nobody can say whether handle time improved or risk increased.

A well scoped pilot answers one business question with one primary metric, a named owner, and stop rules written before the first API call. That discipline protects champions from endless beta status and gives finance a clear scale, pivot, or stop decision.

This guide assumes you already have a NIST AI RMF Govern intake process and an assigned champion. The focus here is turning an approved idea into a six-week execution plan that both business and engineering can sign.

Failure pattern: prove transformation across five departments (OWASP LLM Top 10)
Failure pattern: no baseline week (OpenAI evals)
Failure pattern: production writes on day three (OpenAI safety best practices)
Success pattern: one workflow, one metric, stop rules
Workshop question: what changes in week seven? (NIST AI RMF Govern readout)

One workflow, one metric

Pick one end-to-end workflow where users repeat the same task weekly. Policy Q&A, ITSM triage, procurement intake, and CRM research are all valid candidates, provided you commit to exactly one per pilot.

Choose one primary metric the steering forum will recognise (OpenAI evals). Secondary metrics support the story but must not dilute the headline.

Write the metric in plain language with a target range. "Reduce tier-1 policy lookup from 12 to 10 minutes across 30 cases per week" beats vague "employee experience" goals.

Align the metric with how the organisation already reports performance. If ServiceNow (ITSM platform overview) dashboards track first-contact resolution, anchor there rather than inventing a bespoke AI score.

Good charter: 15% policy lookup reduction (RAG with citations)
Good charter: 20% L1 triage with spot-check (golden set)
Weak charter: explore generative AI for the enterprise
Weak charter: chat usage without OpenAI evals sampling
Checklist: owner, users, baseline week (NIST AI RMF)
Checklist: metric formula agreed with finance
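
A metric formula agreed with finance can be as small as the sketch below. It assumes the primary metric is mean handle time in minutes, measured once in the baseline week and once during the pilot. The function names and the sample data are illustrative, not part of any standard.

```python
from statistics import mean

def percent_reduction(baseline_minutes, pilot_minutes):
    """Percentage reduction in mean handle time versus the baseline week."""
    if not baseline_minutes or not pilot_minutes:
        raise ValueError("need at least one case in each sample")
    base = mean(baseline_minutes)
    pilot = mean(pilot_minutes)
    return (base - pilot) / base * 100.0

def meets_target(baseline_minutes, pilot_minutes, target_percent):
    """True when the measured reduction reaches the charter target."""
    return percent_reduction(baseline_minutes, pilot_minutes) >= target_percent

# Illustrative data: 30 cases per week, 12 minutes before and 10 minutes after.
baseline = [12.0] * 30
pilot = [10.0] * 30
print(round(percent_reduction(baseline, pilot), 1))  # 16.7
print(meets_target(baseline, pilot, 15.0))           # True
```

Writing the formula down this early keeps the week-six readout from turning into a debate about how the number was calculated.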

Stop rules and kill criteria

Stop rules define when the pilot ends early. They protect sponsors from sunk-cost pressure and give risk teams confidence that unsafe paths will not linger in production disguise.

Typical kill triggers: citation rate below threshold, Content Safety incidents, cost ceiling, adoption floor, or legal blockers.

Stop rules are not punishments. A clean stop with learnings beats a permanent beta. Document pivot options (agent vs workflow).

Review stop rules with legal and AI security controls in week one before build accelerates.

Citation below 85% on golden set policy questions two weeks running
Any unapproved write (OpenAI safety best practices)
Cost per case exceeds 2x pilot ceiling (cost controls)
Adoption below 40% in weeks five and six
Content Safety block rate above threshold
Workshop: evidence to stop funding in week three?
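
The stop rules listed above can be checked mechanically each week. The sketch below is one possible encoding, not a prescribed implementation. The 85% citation floor, 2x cost multiplier, and 40% adoption floor come from the list. The cost ceiling and the Content Safety block-rate threshold are hypothetical placeholders, because the guide does not state them.

```python
from dataclasses import dataclass

@dataclass
class WeekSnapshot:
    week: int
    citation_rate: float      # fraction of golden-set answers with a citation
    unapproved_writes: int    # writes that bypassed approval
    cost_per_case: float
    adoption_rate: float      # fraction of the cohort active that week
    safety_block_rate: float  # fraction of requests blocked by Content Safety

@dataclass
class StopThresholds:
    min_citation_rate: float = 0.85
    cost_ceiling: float = 1.0             # hypothetical pilot ceiling per case
    cost_multiplier: float = 2.0
    min_adoption_rate: float = 0.40
    max_safety_block_rate: float = 0.05   # hypothetical threshold

def triggered_stop_rules(history, thresholds=None):
    """Return the names of stop rules triggered by the weekly snapshots."""
    t = thresholds or StopThresholds()
    weeks = sorted(history, key=lambda s: s.week)
    triggered = []
    # Citation rate below the floor two consecutive weeks.
    for prev, cur in zip(weeks, weeks[1:]):
        if (cur.week == prev.week + 1
                and prev.citation_rate < t.min_citation_rate
                and cur.citation_rate < t.min_citation_rate):
            triggered.append("citation_rate")
            break
    # Any unapproved write at all.
    if any(s.unapproved_writes > 0 for s in weeks):
        triggered.append("unapproved_write")
    # Latest cost per case above the multiplied ceiling.
    if weeks and weeks[-1].cost_per_case > t.cost_multiplier * t.cost_ceiling:
        triggered.append("cost")
    # Adoption below the floor in both week five and week six.
    by_week = {s.week: s for s in weeks}
    if all(w in by_week and by_week[w].adoption_rate < t.min_adoption_rate
           for w in (5, 6)):
        triggered.append("adoption")
    # Latest Content Safety block rate above the threshold.
    if weeks and weeks[-1].safety_block_rate > t.max_safety_block_rate:
        triggered.append("safety_block_rate")
    return triggered
```

An empty list means no rule has fired. Any non-empty result is a prompt for the sponsor conversation rather than an automatic shutdown.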

Choosing the right workflow

Strong pilot workflows have high volume, repeatable steps, available data, and tolerable error cost with human fallback.

Policy Q&A works with approved corpus and cite-only RAG. Triage works with stable categories and supervisor spot-checks.

Avoid workflows requiring invented facts, reputational single-error risk, or data scattered across systems you cannot connect in six weeks.

Use an intake scorecard: volume, data readiness, risk, sponsor strength, and reuse of reference architecture patterns.

HR policy Q&A: high volume, RAG citations
ITSM L1 triage: deflection, needs golden set tickets
Procurement intake: structured extraction (function calling)
CRM research assist: read-only first, writes only after the HITL pattern (OpenAI safety best practices) is proven
Poor fit: executive compensation decisions, clinical diagnosis, one-off M&A analysis
What good looks like: champion can name 20 real cases from last month
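
One way to make the intake scorecard comparable across candidate workflows is a simple weighted average. The sketch below assumes each criterion is scored from 1 to 5 and that risk is inverted so that higher risk lowers the total. The scale, the equal default weights, and the example scores are all assumptions for illustration.

```python
CRITERIA = ("volume", "data_readiness", "risk", "sponsor_strength", "pattern_reuse")

def scorecard_total(scores, weights=None):
    """Weighted average on a 1-5 scale; risk is inverted (5 = highest risk)."""
    weights = weights or {c: 1.0 for c in CRITERIA}
    missing = [c for c in CRITERIA if c not in scores]
    if missing:
        raise ValueError(f"missing criteria: {missing}")
    total = 0.0
    for criterion in CRITERIA:
        value = scores[criterion]
        if not 1 <= value <= 5:
            raise ValueError(f"{criterion} must be between 1 and 5")
        if criterion == "risk":
            value = 6 - value  # a risk score of 5 contributes only 1 point
        total += weights[criterion] * value
    return total / sum(weights[c] for c in CRITERIA)

# Illustrative scores only, not measured data.
hr_policy_qa = {"volume": 5, "data_readiness": 4, "risk": 2,
                "sponsor_strength": 4, "pattern_reuse": 4}
one_off_analysis = {"volume": 1, "data_readiness": 2, "risk": 5,
                    "sponsor_strength": 3, "pattern_reuse": 1}
print(scorecard_total(hr_policy_qa))      # 4.2
print(scorecard_total(one_off_analysis))  # 1.6
```

The number is a conversation aid for the workshop, not a substitute for the sponsor's judgement.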

Scope boundaries: in vs out

Every charter needs an explicit in scope and out of scope list. Ambiguity here is how pilots absorb adjacent requests until timelines collapse.

In scope for six weeks: one user cohort, one data source or index, one primary model route, logging, evals on golden questions (OpenAI evals), and read-only or propose-only integrations. Out of scope: multi-language, every region, full CRM automation, mobile apps, and custom fine-tuning (OpenAI guide) unless already justified.

Publish scope before demo day. Extra features trigger a NIST AI RMF Govern charter amendment, not a Slack negotiation.

Technical leads translate scope into API boundaries: which indexes, which tools, which writes disabled until week five.

In: 200 policy documents, English, vector index
Out: contractor policies, archives pre-2020, auto ticket closure
In: propose CRM note, HITL approves in UI
Out: autonomous CRM updates, bulk email
Scope page signed by sponsor and engineering lead
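
Translating the scope page into API boundaries might look like the sketch below: an allow-list of indexes and tools, with write tools blocked until week five and gated on human approval after that. The index and tool names are hypothetical, and the check is a simplified illustration rather than a complete authorisation layer.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PilotScope:
    indexes: frozenset
    read_tools: frozenset
    propose_tools: frozenset
    write_tools: frozenset = frozenset()
    writes_enabled_from_week: int = 5

def is_call_allowed(scope, tool, index, week, hitl_approved=False):
    """Return True only for calls that the signed scope page permits."""
    if index is not None and index not in scope.indexes:
        return False  # out-of-scope data source
    if tool in scope.read_tools or tool in scope.propose_tools:
        return True
    if tool in scope.write_tools:
        # Writes stay disabled until the agreed week and always need approval.
        return week >= scope.writes_enabled_from_week and hitl_approved
    return False  # anything not listed is out of scope

# Hypothetical names for an English policy corpus and a CRM note workflow.
scope = PilotScope(
    indexes=frozenset({"policy_docs_en"}),
    read_tools=frozenset({"search_policy"}),
    propose_tools=frozenset({"propose_crm_note"}),
    write_tools=frozenset({"apply_crm_note"}),
)
print(is_call_allowed(scope, "search_policy", "policy_docs_en", week=1))         # True
print(is_call_allowed(scope, "search_policy", "contractor_policies", week=1))    # False
print(is_call_allowed(scope, "apply_crm_note", None, week=3, hitl_approved=True))  # False
```

Defaulting to deny means a new request has to appear in the charter before it can appear in the code.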

What to build vs what to simulate

Use real APIs on the critical path: RAG, generation, telemetry, and Content Safety. Stakeholders need representative latency and cost.

Simulate downstream writes until governance signs off. Propose-and-approve UI beats silent production integration per OpenAI safety best practices.

Reference architecture patterns show live paths and honest unavailable states. Prefer an honest partial integration over fake screenshots.

If data access slips, narrow the corpus rather than faking retrieval. Smaller approved vector index with citations beats hallucinated demos.

Build real: vector index (Azure vector search), embed pipeline, auth, structured logs
Build real: golden set (OpenAI evals) eval harness in CI or weekly job
Simulate: CRM write until HITL (OpenAI safety best practices) pattern reviewed
Simulate: billing actions always during pilot
Common mistake: skipping logging because "it's just a pilot"
What good looks like: same observability defaults as the production readiness guide (checklist guide)
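
A simulated write path with real logging could be sketched as follows. The queue holds proposed CRM notes in memory, records a structured log line for each step, and never calls a CRM. The class name, field names, and logger name are hypothetical.

```python
import json
import logging
import uuid
from datetime import datetime, timezone

logger = logging.getLogger("pilot.crm")

class ProposedWriteQueue:
    """Holds proposed CRM notes for human review; nothing reaches a real CRM."""

    def __init__(self):
        self._pending = {}
        self.approved = []  # simulated outcome, kept for the week-six readout

    def propose(self, record_id, note, user):
        proposal_id = str(uuid.uuid4())
        self._pending[proposal_id] = {
            "record_id": record_id, "note": note, "proposed_by": user,
        }
        self._log("proposed", proposal_id, user)
        return proposal_id

    def approve(self, proposal_id, reviewer):
        # Raises KeyError if the proposal is unknown or already decided.
        proposal = self._pending.pop(proposal_id)
        self.approved.append({**proposal, "approved_by": reviewer})
        self._log("approved_simulated", proposal_id, reviewer)
        return proposal

    def _log(self, event, proposal_id, actor):
        # One JSON object per line keeps pilot logs queryable later.
        logger.info(json.dumps({
            "ts": datetime.now(timezone.utc).isoformat(),
            "event": event,
            "proposal_id": proposal_id,
            "actor": actor,
        }))

queue = ProposedWriteQueue()
pid = queue.propose("ACME-1", "Customer asked about renewal terms.", "assistant")
queue.approve(pid, "supervisor@example.com")
print(len(queue.approved))  # 1
```

Swapping the in-memory list for a real CRM call later is a small change, and the log format stays the same.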

Week-by-week execution shape

Weeks 1 to 2 focus on discovery and baseline. Finalise golden questions (OpenAI evals), measure current handle time or triage accuracy, secure data access, and stand up retrieval with cite-only generation. No hero integrations yet.

Weeks 3 to 4 add Content Safety, evals, and narrow tooling. Introduce propose-only actions if required.

Weeks 5 to 6 put the workflow in front of users, measure against baseline, and draft the scale, pivot, or stop recommendation per OpenAI evals. Reserve the last three days for the sponsor readout.

Mid-pilot checkpoint at week three. If stop rules trend red, escalate to NIST AI RMF Govern early.

Week 1: charter signed, baseline, vector index v1
Week 2: golden set scored, citation rate visible
Week 3: safety on, eval job, NIST AI RMF Govern checkpoint
Week 4: stable UI, HITL path, cost per task
Week 5: cohort live, weekly metric email
Week 6: metrics, deck, production readiness conversation handoff list

Roles and RACI for six weeks

Name owners for business outcome, technical delivery, data access, risk sign-off, and user adoption per NIST AI RMF RACI.

Business owner approves scope and attends OpenAI evals reviews. Engineering owns build, telemetry, and evals. Champion recruits users.

Legal reviews data privacy before indexing. Security reviews tool allow lists and Entra integration in week one.

Changes to the cost cap or to write tools go to the NIST AI RMF Govern chair plus the security liaison.

Sponsor: metric, stop rules, steering readout
Engineering: architecture, evals, cost controls
Champion: users, feedback, shadow-AI signals
Risk liaison: Content Safety, kill switch
Workshop question: who is on call if the RAG index is empty on Monday at 9am?

Data, identity, and environment decisions

Decide in week one which data classification enters the vector index, which Entra groups gate access, and sandbox vs production-adjacent tenants.

Prefer sandbox with telemetry and Content Safety matching production over production with logging disabled.

Index only documents the cohort may access. The ACL model used in the pilot must be the one you would run at scale (Microsoft Copilot data protection).

Record model region, retention, subprocessors in charter appendix (Copilot privacy as reference).

Classify sources: public, internal, restricted
Map SSO before launch (Entra conditional access)
Document zero-retention options (production readiness conversation)
Common mistake: personal API keys
Secrets in vault (AI security controls)
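
The rule to index only what the cohort may access can be enforced before anything is embedded. The sketch below assumes each document carries a classification label and a set of allowed groups, and that every cohort member belongs to all the listed cohort groups. The field names, group names, and the choice to keep restricted documents out of the index are assumptions.

```python
from dataclasses import dataclass

# Assumption: restricted documents stay out of the pilot index.
ALLOWED_CLASSIFICATIONS = frozenset({"public", "internal"})

@dataclass(frozen=True)
class Document:
    doc_id: str
    classification: str        # "public", "internal", or "restricted"
    allowed_groups: frozenset  # directory groups that may read the document

def indexable(docs, cohort_groups, allowed=ALLOWED_CLASSIFICATIONS):
    """Keep documents whose classification and ACL both admit the cohort."""
    cohort = set(cohort_groups)
    return [
        d for d in docs
        if d.classification in allowed and d.allowed_groups & cohort
    ]

docs = [
    Document("leave-policy", "internal", frozenset({"hr-pilot-cohort"})),
    Document("exec-comp", "restricted", frozenset({"hr-pilot-cohort"})),
    Document("legal-memo", "internal", frozenset({"legal-team"})),
]
print([d.doc_id for d in indexable(docs, {"hr-pilot-cohort"})])  # ['leave-policy']
```

Filtering at ingestion is a floor, not a ceiling: a cohort with mixed access would still need the same ACL check at query time.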

Deliverables sponsors expect at week six

Executives fund decisions. Deliver a one-page outcome summary, a scale, pivot, or stop recommendation, a cost forecast, residual risk, and named owners.

Include a before-and-after chart on the primary metric, with sample size caveats per OpenAI evals.

Attach architecture, golden set eval summary, incident log, and production readiness conversation gap list if scaling.

If scale: name workflow, go-live, budget. If stop: what must change (RAG vs fine-tune) for retry.

Executive summary with headline metric and decision
Architecture and data-flow (AWS AI compliance)
Golden set report: citation and unknown-answer rates
Cost per successful task: pilot vs steady state (AI Gateway)
Open risks: retention, writes, region (data privacy)
Workshop: fund phase two tomorrow? (NIST AI RMF Govern)
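
Two of the numbers in the list above reduce to simple arithmetic. The sketch below assumes that a "successful task" is whatever the charter defines as success, and that each golden-set result records whether the answer carried a citation and whether the system replied that it did not know. The field names and sample figures are illustrative.

```python
def cost_per_successful_task(total_cost, successful_tasks):
    """Total pilot spend divided by tasks that met the charter's success bar."""
    if successful_tasks <= 0:
        raise ValueError("no successful tasks to divide by")
    return total_cost / successful_tasks

def golden_set_rates(results):
    """Citation rate and unknown-answer rate over golden-set results.

    Each result is a dict with boolean keys 'cited' and 'answered_unknown'.
    """
    n = len(results)
    if n == 0:
        raise ValueError("golden set is empty")
    return {
        "citation_rate": sum(r["cited"] for r in results) / n,
        "unknown_answer_rate": sum(r["answered_unknown"] for r in results) / n,
    }

# Illustrative figures only.
print(cost_per_successful_task(120.0, 300))  # 0.4
results = (
    [{"cited": True, "answered_unknown": False}] * 17
    + [{"cited": False, "answered_unknown": True}] * 2
    + [{"cited": False, "answered_unknown": False}] * 1
)
print(golden_set_rates(results))  # {'citation_rate': 0.85, 'unknown_answer_rate': 0.1}
```

Reporting the unknown-answer rate next to the citation rate shows whether missing citations come from honest refusals or from uncited answers.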

Common mistakes and anti-patterns

Naming anti-patterns early prevents repeat failures (OWASP LLM Top 10 in vendor documentation).

Scope creep should trigger NIST AI RMF Govern charter amendment with timeline impact.

Skipping evals because the demo feels good fails by week five when real phrasing hits the golden set.

Silent production writes erase trust in one incident per OpenAI safety best practices.

No baseline week: cannot prove value (OpenAI evals)
Multiple primary metrics: steering argues
Fine-tuning for facts instead of RAG: refresh cost surprise
No stop rules: orphan beta service
Ignoring Copilot coexistence: duplicate email pilots
Demo-only logging: blind at production scale (production readiness conversation)
NIST AI RMF Govern cites stop rule status monthly

Workshop agenda: half-day pilot charter

Run before sprint one. Attendees: sponsor, champion, engineering lead, risk liaison, ops analytics, NIST AI RMF Govern chair optional.

Morning: workflow scorecard and metric. Afternoon: stop rules, scope, week plan, and RACI (NIST AI RMF).

End with signed one-page charter. No charter, no sprint.

Capture open questions with owners before room clears.

0:00–0:30: Problem statement and user interviews
0:30–1:00: Workflow scorecard (reference architecture patterns)
1:00–1:45: Metric, baseline, data privacy class
1:45–2:15: Stop rules and scope in/out
2:15–3:00: Build vs simulate, RACI, NIST AI RMF Govern checkpoints
3:00–3:30: Readout and steering date

Scoping a six-week pilot