AI Labs

Decision guide

Scoping a six-week pilot

How to pick one workflow, define stop rules, and leave with metrics executives will fund, pivot, or kill.

Business sponsors · Technical leaders

01

Why pilots fail before they start

Most enterprise AI pilots fail in the charter, not the model. Sponsors approve vague exploration budgets, teams build impressive demos, and six months later nobody can say whether handle time improved or risk increased.

A well-scoped pilot answers one business question with one primary metric, a named owner, and stop rules written before the first API call. That discipline protects champions from endless beta status and gives finance a clear scale, pivot, or stop decision.

This guide assumes you already have NIST AI RMF Govern intake and a champion assigned. The focus here is turning an approved idea into a six-week execution plan both business and engineering can sign.

02

One workflow, one metric

Pick one end-to-end workflow where users repeat the same task weekly. Policy Q&A, ITSM triage, procurement intake, and CRM research are all valid candidates, provided you hold to the pilot scoping discipline of one workflow per pilot.

Choose one primary metric the steering forum will recognise (see OpenAI evals). Secondary metrics support the story but must not dilute the headline.

Write the metric in plain language with a target range. "Reduce tier-1 policy lookup from 12 to 10 minutes across 30 cases per week" beats vague "employee experience" goals.

Align the metric with how the organisation already reports performance. If ServiceNow (ITSM platform overview) dashboards track first-contact resolution, anchor there rather than inventing a bespoke AI score.

  • Good charter: 15% reduction in policy lookup time (RAG with citations)
  • Good charter: 20% L1 triage deflection with spot-checks (golden set)
  • Weak charter: explore generative AI for the enterprise
  • Weak charter: chat usage without OpenAI evals sampling
  • Checklist: owner, users, baseline week (NIST AI RMF)
  • Checklist: metric formula agreed with finance
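
Agreeing the metric formula with finance is easier when it is written down as arithmetic before baseline week. The following is a minimal sketch, assuming handle time is logged in minutes per case. The function name and the sample values are illustrative and are not data from any real pilot.

```python
from statistics import mean


def percent_reduction(baseline_minutes, pilot_minutes):
    """Percentage drop in mean handle time from baseline to pilot."""
    baseline = mean(baseline_minutes)
    pilot = mean(pilot_minutes)
    if baseline <= 0:
        raise ValueError("baseline mean must be positive")
    return (baseline - pilot) / baseline * 100.0


# Hypothetical samples mirroring the 12-minute to 10-minute example above.
baseline_week = [12.0, 11.0, 13.0, 12.0]
pilot_week = [10.0, 9.5, 10.5, 10.0]

reduction = percent_reduction(baseline_week, pilot_week)
target_met = reduction >= 15.0  # 15% taken from the good-charter example
print(f"{reduction:.1f}% reduction, target met: {target_met}")
```

Fixing the formula in this form means the week-six readout uses the same calculation as the baseline, which removes one source of argument at steering.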

03

Stop rules and kill criteria

Stop rules define when the pilot ends early. They protect sponsors from sunk-cost pressure and give risk teams confidence that unsafe paths will not linger in production under a pilot label.

Typical kill triggers: citation rate below threshold, Content Safety incidents, cost ceiling, adoption floor, or legal blockers.

Stop rules are not punishments. A clean stop with learnings beats a permanent beta. Document pivot options (agent vs workflow).

Review stop rules with legal and against your AI security controls in week one, before the build accelerates.
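
Stop rules are easier to enforce when they are written as thresholds that a weekly job can check. The sketch below is one possible shape, assuming the five kill triggers listed above. Every threshold, field name, and sample value is a placeholder that the charter would replace.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StopRules:
    # Placeholder thresholds; the signed charter sets the real values.
    min_citation_rate: float = 0.90
    max_safety_incidents: int = 0
    cost_ceiling: float = 5000.0
    min_weekly_active_users: int = 10


@dataclass(frozen=True)
class WeeklySnapshot:
    citation_rate: float
    safety_incidents: int
    cost_to_date: float
    weekly_active_users: int
    legal_blocker: bool = False


def triggered_stop_rules(rules, snap):
    """Return the name of every kill trigger the snapshot trips."""
    tripped = []
    if snap.citation_rate < rules.min_citation_rate:
        tripped.append("citation_rate")
    if snap.safety_incidents > rules.max_safety_incidents:
        tripped.append("safety_incidents")
    if snap.cost_to_date > rules.cost_ceiling:
        tripped.append("cost_ceiling")
    if snap.weekly_active_users < rules.min_weekly_active_users:
        tripped.append("adoption_floor")
    if snap.legal_blocker:
        tripped.append("legal_blocker")
    return tripped


# Hypothetical week-three snapshot: only the citation rate is below threshold.
week3 = WeeklySnapshot(citation_rate=0.84, safety_incidents=0,
                       cost_to_date=2100.0, weekly_active_users=14)
week3_triggers = triggered_stop_rules(StopRules(), week3)
print(week3_triggers)
```

An empty list means the pilot continues. Any entry is a prompt for the sponsor conversation, not an automatic shutdown, unless the charter says otherwise.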

04

Choosing the right workflow

Strong pilot workflows have high volume, repeatable steps, available data, and tolerable error cost with human fallback.

Policy Q&A works with approved corpus and cite-only RAG. Triage works with stable categories and supervisor spot-checks.

Avoid workflows requiring invented facts, reputational single-error risk, or data scattered across systems you cannot connect in six weeks.

Use an intake scorecard: volume, data readiness, risk, sponsor strength, and reuse of reference architecture patterns.

  • HR policy Q&A: high volume, RAG citations
  • ITSM L1 triage: deflection, needs golden set tickets
  • Procurement intake: structured extraction (function calling)
  • CRM research assist: read-only first, writes only after the HITL pattern (OpenAI safety best practices) is proven
  • Poor fit: executive compensation decisions, clinical diagnosis, one-off M&A analysis
  • What good looks like: champion can name 20 real cases from last month
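
The intake scorecard can be reduced to a weighted average so that candidate workflows are compared on the same scale. This is a sketch under stated assumptions: the five criteria come from the scorecard above, but the weights, the 1 to 5 rating scale, and the example ratings are invented for illustration.

```python
# Illustrative weights that sum to 1.0; adjust them in the charter workshop.
WEIGHTS = {
    "volume": 0.25,
    "data_readiness": 0.25,
    "risk": 0.20,              # rated so that 5 means lowest risk
    "sponsor_strength": 0.15,
    "pattern_reuse": 0.15,
}


def scorecard(ratings):
    """Weighted average of 1-5 ratings; rejects missing or out-of-range input."""
    missing = set(WEIGHTS) - set(ratings)
    if missing:
        raise ValueError(f"missing ratings: {sorted(missing)}")
    for name in WEIGHTS:
        if not 1 <= ratings[name] <= 5:
            raise ValueError(f"{name} must be between 1 and 5")
    return sum(WEIGHTS[name] * ratings[name] for name in WEIGHTS)


# Hypothetical ratings for one strong and one poor-fit candidate.
candidates = {
    "hr_policy_qa": {"volume": 5, "data_readiness": 4, "risk": 4,
                     "sponsor_strength": 4, "pattern_reuse": 5},
    "one_off_m_and_a": {"volume": 1, "data_readiness": 2, "risk": 1,
                        "sponsor_strength": 3, "pattern_reuse": 1},
}

ranked = sorted(candidates, key=lambda name: scorecard(candidates[name]),
                reverse=True)
for name in ranked:
    print(f"{name}: {scorecard(candidates[name]):.2f}")
```

The score is a conversation aid rather than a decision rule. A workflow with a high total but a risk rating of 1 still deserves a separate discussion.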

05

Scope boundaries: in vs out

Every charter needs an explicit in scope and out of scope list. Ambiguity here is how pilots absorb adjacent requests until timelines collapse.

In scope for six weeks: one user cohort, one data source or index, one primary model route, logging, evals on golden questions (OpenAI evals), and read-only or propose-only integrations. Out of scope: multi-language, every region, full CRM automation, mobile apps, and custom fine-tuning (OpenAI guide) unless already justified.

Publish scope before demo day. Extra features trigger a NIST AI RMF Govern charter amendment, not a Slack negotiation.

Technical leads translate scope into API boundaries: which indexes, which tools, which writes disabled until week five.

  • In: 200 policy documents, English, vector index
  • Out: contractor policies, archives pre-2020, auto ticket closure
  • In: propose CRM note, HITL approves in UI
  • Out: autonomous CRM updates, bulk email
  • Scope page signed by sponsor and engineering lead
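
One way for a technical lead to translate the scope page into API boundaries is a small allow list checked on every request. The sketch below assumes a single index and three tool tiers (read, propose, write). All index and tool names are hypothetical and stand in for whatever the signed scope page lists.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PilotScope:
    # Hypothetical names; replace them with the entries on the signed scope page.
    allowed_indexes: frozenset = frozenset({"hr-policies-en"})
    read_tools: frozenset = frozenset({"search_policies"})
    propose_tools: frozenset = frozenset({"propose_crm_note"})
    write_tools: frozenset = frozenset({"update_crm_record"})
    writes_enabled: bool = False  # changed only through a charter amendment


def check_request(scope, index, tool):
    """Return (allowed, reason) for a requested index and tool."""
    if index not in scope.allowed_indexes:
        return False, f"index {index!r} is out of scope"
    if tool in scope.read_tools or tool in scope.propose_tools:
        return True, "in scope"
    if tool in scope.write_tools:
        if scope.writes_enabled:
            return True, "write enabled by charter amendment"
        return False, f"tool {tool!r} is a write and writes are disabled"
    return False, f"tool {tool!r} is not on the allow list"


scope = PilotScope()
print(check_request(scope, "hr-policies-en", "search_policies"))
print(check_request(scope, "hr-policies-en", "update_crm_record"))
print(check_request(scope, "contractor-policies", "search_policies"))
```

Keeping the scope in one immutable object means a scope change shows up as a reviewed code or config change, which matches the charter amendment route described above.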

06

What to build vs what to simulate

Use real APIs on the critical path: RAG, generation, telemetry, and Content Safety. Stakeholders need representative latency and cost.

Simulate downstream writes until governance signs off. Propose-and-approve UI beats silent production integration per OpenAI safety best practices.

Reference architecture patterns show live paths and configure-keys states. Prefer honest partial integration over fake screenshots.

If data access slips, narrow the corpus rather than faking retrieval. Smaller approved vector index with citations beats hallucinated demos.

  • Build real: vector index (Azure vector search), embed pipeline, auth, structured logs
  • Build real: golden set (OpenAI evals) eval harness in CI or weekly job
  • Simulate: CRM write until HITL (OpenAI safety best practices) pattern reviewed
  • Simulate: billing actions always during pilot
  • Common mistake: skipping logging because "it's just a pilot"
  • What good looks like: same observability defaults as the production readiness checklist guide
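
A simulated downstream write can be as simple as a queue that records proposals and reviewer decisions without ever calling the target system. The following is a minimal sketch of that propose-and-approve shape. The class names, fields, and sample payload are assumptions for illustration, and a real pilot would persist the queue and the audit log.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ProposedWrite:
    target: str
    payload: dict
    status: str = "pending"  # pending -> approved | rejected
    reviewer: Optional[str] = None


@dataclass
class ProposalQueue:
    """Holds proposed writes; nothing reaches the downstream system."""
    items: list = field(default_factory=list)
    audit_log: list = field(default_factory=list)

    def propose(self, target, payload):
        item = ProposedWrite(target, payload)
        self.items.append(item)
        self.audit_log.append(("proposed", target))
        return item

    def review(self, item, reviewer, approve):
        if item.status != "pending":
            raise ValueError("proposal already reviewed")
        item.status = "approved" if approve else "rejected"
        item.reviewer = reviewer
        # Simulated: log the decision instead of calling the real CRM.
        self.audit_log.append((item.status, item.target))
        return item


queue = ProposalQueue()
note = queue.propose("crm_note", {"account": "ACME", "text": "Renewal call summary"})
queue.review(note, reviewer="supervisor@example.com", approve=True)
print(note.status, queue.audit_log)
```

Because the approval step already exists, moving from simulation to a real write later only changes what happens after approval, not the user-facing flow.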

07

Week-by-week execution shape

Weeks 1 to 2 focus on discovery and baseline. Finalise golden questions (OpenAI evals), measure current handle time or triage accuracy, secure data access, and stand up retrieval with cite-only generation. No hero integrations yet.

Weeks 3 to 4 add Content Safety, evals, and narrow tooling. Introduce propose-only actions if required.

Weeks 5 to 6 put the workflow in front of users, measure against baseline, and draft the scale, pivot, or stop recommendation from the eval results (OpenAI evals). Reserve the last three days for the sponsor readout.

Hold a mid-pilot checkpoint at week three. If stop rules trend red, escalate to NIST AI RMF Govern early.
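
The golden questions finalised in weeks 1 to 2 can feed a small harness that is rerun at each checkpoint. The sketch below assumes each golden case names an expected source and a phrase the answer must contain. The answer function is a stub with canned responses, standing in for the real retrieval and generation call, and all questions and document names are invented.

```python
def run_golden_set(golden_set, answer_fn):
    """Score each answer on two checks: expected citation and key phrase."""
    if not golden_set:
        raise ValueError("golden set is empty")
    results = []
    for case in golden_set:
        answer = answer_fn(case["question"])
        cited = case["expected_source"] in answer["citations"]
        correct = case["must_contain"].lower() in answer["text"].lower()
        results.append({"question": case["question"],
                        "cited": cited, "correct": correct})
    total = len(results)
    return {
        "citation_rate": sum(r["cited"] for r in results) / total,
        "accuracy": sum(r["correct"] for r in results) / total,
        "failures": [r["question"] for r in results
                     if not (r["cited"] and r["correct"])],
    }


def stub_answer(question):
    # Stub only: canned responses in place of the pilot's real pipeline.
    canned = {
        "How long is parental leave?": {
            "text": "Staff receive 16 weeks.", "citations": ["leave-policy"]},
        "Can I expense a home monitor?": {
            "text": "Yes, up to the equipment cap.", "citations": []},
    }
    return canned[question]


golden_set = [
    {"question": "How long is parental leave?",
     "expected_source": "leave-policy", "must_contain": "16 weeks"},
    {"question": "Can I expense a home monitor?",
     "expected_source": "expenses-policy", "must_contain": "equipment cap"},
]

report = run_golden_set(golden_set, stub_answer)
print(report)
```

Substring matching is a deliberately crude correctness check. It is enough to track a trend week over week, and supervisor spot-checks cover what it misses.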

08

Roles and RACI for six weeks

Name owners for business outcome, technical delivery, data access, risk sign-off, and user adoption in a RACI aligned with NIST AI RMF.

The business owner approves scope and attends eval reviews (OpenAI evals). Engineering owns build, telemetry, and evals. The champion recruits users.

Legal reviews data privacy before indexing. Security reviews tool allow lists and Entra integration in week one.

Changes to the cost cap or to write tools need the NIST AI RMF Govern chair plus the security liaison.

  • Sponsor: metric, stop rules, steering readout
  • Engineering: architecture, evals, cost controls
  • Champion: users, feedback, shadow-AI signals
  • Risk liaison: Content Safety, kill switch
  • Open question: who is on call when the RAG index is empty at 9am on Monday?

09

Data, identity, and environment decisions

Decide in week one which data classification enters the vector index, which Entra groups gate access, and sandbox vs production-adjacent tenants.

Prefer sandbox with telemetry and Content Safety matching production over production with logging disabled.

Index only documents the cohort may access. The ACL model must match the one you would run at scale (see Microsoft Copilot data protection).
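
The indexing rule can be applied as a filter in the embed pipeline before anything reaches the vector index. This is a conservative sketch: it keeps a document only if its classification is approved and every cohort group already has access to it. The field names, group names, and documents are hypothetical, and a real pipeline would read ACLs from the source system.

```python
def documents_to_index(documents, cohort_groups, allowed_classifications):
    """Return ids of documents that every cohort group may already read."""
    selected = []
    for doc in documents:
        if doc["classification"] not in allowed_classifications:
            continue
        # Subset test: all cohort groups must appear in the document's ACL.
        if cohort_groups <= doc["allowed_groups"]:
            selected.append(doc["id"])
    return selected


# Hypothetical corpus with ACLs expressed as sets of group names.
documents = [
    {"id": "leave-policy", "classification": "internal",
     "allowed_groups": {"hr-pilot-users", "all-staff"}},
    {"id": "exec-comp", "classification": "confidential",
     "allowed_groups": {"hr-leadership"}},
    {"id": "contractor-terms", "classification": "internal",
     "allowed_groups": {"procurement"}},
]

indexable = documents_to_index(documents,
                               cohort_groups={"hr-pilot-users"},
                               allowed_classifications={"internal"})
print(indexable)
```

Filtering at index time suits a single-cohort pilot. Scaling to several cohorts usually needs per-query access checks as well, which is a design decision for the charter appendix.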

Record model region, retention, and subprocessors in a charter appendix (Copilot privacy as reference).

10

Deliverables sponsors expect at week six

Executives fund decisions. Deliver a one-page outcome summary with a scale, pivot, or stop recommendation, a cost forecast, residual risk, and owners.

Include a before-and-after chart on the primary metric, with sample size caveats (see OpenAI evals).
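
The sample size caveat can be shown as numbers next to the chart. The sketch below reports the mean reduction with a rough 95% interval, assuming independent samples and a normal approximation, which is only indicative at pilot-sized samples. The function name and the handle-time values are illustrative.

```python
from math import sqrt
from statistics import mean, stdev


def before_after_summary(before, after, z=1.96):
    """Mean difference with a rough 95% interval (normal approximation)."""
    if len(before) < 2 or len(after) < 2:
        raise ValueError("need at least two observations per period")
    diff = mean(before) - mean(after)
    # Standard error of the difference between two independent sample means.
    se = sqrt(stdev(before) ** 2 / len(before) + stdev(after) ** 2 / len(after))
    return {
        "n_before": len(before),
        "n_after": len(after),
        "mean_before": mean(before),
        "mean_after": mean(after),
        "reduction": diff,
        "interval": (diff - z * se, diff + z * se),
    }


# Hypothetical handle times in minutes for baseline and pilot weeks.
before = [12.0, 13.0, 11.0, 12.0, 14.0, 10.0]
after = [10.0, 11.0, 9.0, 10.0, 12.0, 8.0]

summary = before_after_summary(before, after)
low, high = summary["interval"]
print(f"reduction {summary['reduction']:.1f} min "
      f"(n={summary['n_before']} vs {summary['n_after']}), "
      f"interval {low:.2f} to {high:.2f}")
```

If the interval includes zero, the honest readout is that the pilot has not yet shown an effect at this sample size, which is useful input to a pivot or extend decision.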

Attach the architecture, golden set eval summary, incident log, and, if scaling, a gap list for the production readiness conversation.

If scale: name workflow, go-live, budget. If stop: what must change (RAG vs fine-tune) for retry.

11

Common mistakes and anti-patterns

Naming anti-patterns early prevents repeat failures (see the OWASP LLM Top 10 in vendor documentation).

Scope creep should trigger NIST AI RMF Govern charter amendment with timeline impact.

Skipping evals because the demo feels good fails by week five when real phrasing hits the golden set.

A single incident caused by a silent production write can erase trust (see OpenAI safety best practices).

12

Workshop agenda: half-day pilot charter

Run before sprint one. Attendees: sponsor, champion, engineering lead, risk liaison, ops analytics, NIST AI RMF Govern chair optional.

Morning: workflow scorecard and metric. Afternoon: stop rules, scope, week plan, and a RACI aligned with NIST AI RMF.

End with signed one-page charter. No charter, no sprint.

Capture open questions with owners before room clears.

  • 0:00–0:30: Problem statement and user interviews
  • 0:30–1:00: Workflow scorecard (reference architecture patterns)
  • 1:00–1:45: Metric, baseline, data privacy class
  • 1:45–2:15: Stop rules and scope in/out
  • 2:15–3:00: Build vs simulate, RACI, NIST AI RMF Govern checkpoints
  • 3:00–3:30: Readout and steering date

Provider & framework documentation

Official docs referenced in this guide. Use these in architecture reviews and security questionnaires.
