For business sponsors
Executive summary
Skim this before the full guide. Technical detail follows in the sections below.
- Decision: Which single workflow gets a six-week pilot budget and what success means.
- Primary metric: One headline operational metric (e.g. handle time, deflection with quality sample).
- Stop rule: End pilot if citation rate, safety, cost per case, or adoption misses written thresholds.
Related worked example
Policy Q&A with Seed RAG on Vercel
01
Why pilots fail before they start
Most enterprise AI pilots fail in the charter, not the model. Sponsors approve vague exploration budgets, teams build impressive demos, and six months later nobody can say whether handle time improved or risk increased.
A well scoped pilot answers one business question with one primary metric, a named owner, and stop rules written before the first API call. That discipline protects champions from endless beta status and gives finance a clear scale, pivot, or stop decision.
This guide assumes a NIST AI RMF Govern intake process is already in place and a champion has been assigned. The focus here is turning an approved idea into a six-week execution plan that both business and engineering can sign.
- Failure pattern: prove transformation across five departments (OWASP LLM Top 10)
- Failure pattern: no baseline week (OpenAI evals)
- Failure pattern: production writes on day three (OpenAI safety best practices)
- Success pattern: one workflow, one metric, stop rules
- Workshop question: what changes in week seven? (NIST AI RMF Govern readout)
02
One workflow, one metric
Pick one end-to-end workflow where users repeat the same task weekly. Policy Q&A, ITSM triage, procurement intake, and CRM research are all valid candidates, provided you commit to exactly one per pilot (pilot scoping discipline).
Choose one primary metric the steering forum will recognise (OpenAI evals). Secondary metrics support the story but must not dilute the headline.
Write the metric in plain language with a target range. "Reduce tier-1 policy lookup from 12 to 10 minutes across 30 cases per week" beats vague "employee experience" goals.
Align the metric with how the organisation already reports performance. If ServiceNow (ITSM platform overview) dashboards track first-contact resolution, anchor there rather than inventing a bespoke AI score.
- Good charter: 15% policy lookup reduction (RAG with citations)
- Good charter: 20% L1 triage with spot-check (golden set)
- Weak charter: explore generative AI for the enterprise
- Weak charter: chat usage without OpenAI evals sampling
- Checklist: owner, users, baseline week (NIST AI RMF)
- Checklist: metric formula agreed with finance
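The plain-language metric above can also be written down as a small record so that the formula agreed with finance is unambiguous. The following is a minimal sketch, not a prescribed format: the `PrimaryMetric` class and its field names are hypothetical, and it assumes a lower-is-better metric such as lookup minutes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrimaryMetric:
    """One headline metric with a baseline and a target (hypothetical structure)."""
    name: str
    baseline: float       # value measured during the baseline week
    target: float         # value the charter commits to
    cases_per_week: int   # sample size behind each weekly measurement

    def reduction_pct(self, observed: float) -> float:
        """Percentage reduction of an observed value relative to the baseline."""
        return (self.baseline - observed) / self.baseline * 100

    def meets_target(self, observed: float) -> bool:
        # Assumes lower is better, as with a time-based metric.
        return observed <= self.target


# Figures taken from the example charter sentence in this section.
metric = PrimaryMetric(
    name="tier-1 policy lookup (minutes)",
    baseline=12.0,
    target=10.0,
    cases_per_week=30,
)
print(round(metric.reduction_pct(metric.target), 1))  # 16.7
```

Writing the target as both an absolute value and a derived percentage makes it harder for the charter to quote a headline reduction that disagrees with its own baseline.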
03
Stop rules and kill criteria
Stop rules define when the pilot ends early. They protect sponsors from sunk-cost pressure and give risk teams confidence that unsafe paths will not linger in production disguise.
Typical kill triggers: citation rate below threshold, Content Safety incidents, cost ceiling, adoption floor, or legal blockers.
Stop rules are not punishments. A clean stop with learnings beats a permanent beta. Document pivot options (agent vs workflow).
Review stop rules with legal and the owners of AI security controls in week one, before the build accelerates.
- Citation rate below 85% on golden set policy questions two weeks running
- Any unapproved write (OpenAI safety best practices)
- Cost per case exceeds 2x pilot ceiling (cost controls)
- Adoption below 40% in weeks five and six
- Content Safety block rate above threshold
- Workshop: evidence to stop funding in week three?
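Stop rules are easier to enforce when they are checked by a script rather than debated in a meeting. The sketch below encodes four of the triggers listed above. The `WeekStats` record, the rule labels, and the choice to check cost against the most recent week only are assumptions made for illustration; the 85%, 2x, and 40% figures come from the list.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class WeekStats:
    """Weekly pilot measurements (hypothetical field names)."""
    week: int
    citation_rate: float     # fraction of golden set answers with a valid citation
    unapproved_writes: int   # writes that bypassed approval
    cost_per_case: float
    adoption_rate: float     # fraction of the cohort active that week


def triggered_stop_rules(history: List[WeekStats], cost_ceiling: float) -> List[str]:
    """Return labels of the stop rules tripped by the history (oldest week first)."""
    triggered = []
    last_two = history[-2:]
    if len(last_two) == 2 and all(w.citation_rate < 0.85 for w in last_two):
        triggered.append("citation_below_85_two_weeks")
    if any(w.unapproved_writes > 0 for w in history):
        triggered.append("unapproved_write")
    # Assumption: the cost rule is evaluated on the most recent week.
    if history and history[-1].cost_per_case > 2 * cost_ceiling:
        triggered.append("cost_over_2x_ceiling")
    late = [w for w in history if w.week in (5, 6)]
    if len(late) == 2 and all(w.adoption_rate < 0.40 for w in late):
        triggered.append("adoption_below_40_weeks_5_6")
    return triggered


history = [
    WeekStats(week=2, citation_rate=0.80, unapproved_writes=0, cost_per_case=1.0, adoption_rate=0.5),
    WeekStats(week=3, citation_rate=0.82, unapproved_writes=0, cost_per_case=1.2, adoption_rate=0.5),
]
print(triggered_stop_rules(history, cost_ceiling=1.0))  # ['citation_below_85_two_weeks']
```

Running a check like this in the weekly job gives the week-three checkpoint a factual starting point.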
04
Choosing the right workflow
Strong pilot workflows have high volume, repeatable steps, available data, and tolerable error cost with human fallback.
Policy Q&A works with approved corpus and cite-only RAG. Triage works with stable categories and supervisor spot-checks.
Avoid workflows requiring invented facts, reputational single-error risk, or data scattered across systems you cannot connect in six weeks.
Use an intake scorecard: volume, data readiness, risk, sponsor strength, and reuse of reference architecture patterns.
- HR policy Q&A: high volume, RAG citations
- ITSM L1 triage: deflection, needs golden set tickets
- Procurement intake: structured extraction (function calling)
- CRM research assist: read-only first, writes only after the HITL pattern (OpenAI safety best practices) is proven
- Poor fit: executive compensation decisions, clinical diagnosis, one-off M&A analysis
- What good looks like: champion can name 20 real cases from last month
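One way to make the intake scorecard comparable across candidate workflows is a weighted score. This is a sketch only: the five criteria come from the scorecard above, but the weights, the 1 to 5 rating scale, and the example ratings are assumptions that a steering forum would need to set for itself. Risk is expressed as `risk_tolerance`, so a higher rating means a more tolerable error cost.

```python
# Assumed weights; they must sum to 1.0. Adjust to your organisation's priorities.
CRITERIA_WEIGHTS = {
    "volume": 0.25,
    "data_readiness": 0.25,
    "risk_tolerance": 0.20,
    "sponsor_strength": 0.15,
    "pattern_reuse": 0.15,
}


def score_workflow(ratings: dict) -> float:
    """Weighted score on a 1-5 scale; raises if a criterion is missing or out of range."""
    missing = set(CRITERIA_WEIGHTS) - set(ratings)
    if missing:
        raise ValueError(f"missing ratings: {sorted(missing)}")
    for name in CRITERIA_WEIGHTS:
        if not 1 <= ratings[name] <= 5:
            raise ValueError(f"{name} must be between 1 and 5")
    total = sum(CRITERIA_WEIGHTS[name] * ratings[name] for name in CRITERIA_WEIGHTS)
    return round(total, 2)


# Illustrative ratings, not measured data.
hr_policy_qa = {
    "volume": 5,
    "data_readiness": 4,
    "risk_tolerance": 4,
    "sponsor_strength": 4,
    "pattern_reuse": 4,
}
print(score_workflow(hr_policy_qa))  # 4.25
```

The score is a conversation aid, not a decision rule: a workflow listed as a poor fit above should be excluded regardless of its total.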
05
Scope boundaries: in vs out
Every charter needs an explicit in scope and out of scope list. Ambiguity here is how pilots absorb adjacent requests until timelines collapse.
In scope for six weeks: one user cohort, one data source or index, one primary model route, logging, evals on golden questions (OpenAI evals), and read-only or propose-only integrations. Out of scope: multi-language, every region, full CRM automation, mobile apps, and custom fine-tuning (OpenAI guide) unless already justified.
Publish scope before demo day. Extra features trigger a charter amendment through NIST AI RMF Govern, not a Slack negotiation.
Technical leads translate scope into API boundaries: which indexes, which tools, and which writes stay disabled until week five.
- In: 200 policy documents, English, vector index
- Out: contractor policies, archives pre-2020, auto ticket closure
- In: propose CRM note, HITL approves in UI
- Out: autonomous CRM updates, bulk email
- Scope page signed by sponsor and engineering lead
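The translation from scope page to API boundaries can be sketched as a deny-by-default allow-list. Everything named here (the `PilotScope` class, the index name, the tool names) is hypothetical; the only details taken from the text are that reads and proposals are in scope, and that writes stay disabled until week five.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class PilotScope:
    """Machine-readable version of the signed scope page (hypothetical structure)."""
    allowed_indexes: FrozenSet[str]
    read_tools: FrozenSet[str]
    propose_tools: FrozenSet[str]   # create a draft that a human must approve
    write_tools: FrozenSet[str]     # direct writes, gated by pilot week
    writes_enabled_from_week: int

    def index_allowed(self, index: str) -> bool:
        return index in self.allowed_indexes

    def tool_allowed(self, tool: str, week: int) -> bool:
        if tool in self.read_tools or tool in self.propose_tools:
            return True
        if tool in self.write_tools:
            return week >= self.writes_enabled_from_week
        return False  # anything not listed is out of scope


scope = PilotScope(
    allowed_indexes=frozenset({"hr-policies-en"}),
    read_tools=frozenset({"search_policies"}),
    propose_tools=frozenset({"propose_crm_note"}),
    write_tools=frozenset({"save_approved_crm_note"}),
    writes_enabled_from_week=5,
)
print(scope.tool_allowed("save_approved_crm_note", week=4))  # False
print(scope.tool_allowed("send_bulk_email", week=6))         # False
```

Because unknown tools are rejected, adding a feature means editing this record, which is a natural point to require the charter amendment.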
06
What to build vs what to simulate
Use real APIs on the critical path: RAG, generation, telemetry, and Content Safety. Stakeholders need representative latency and cost.
Simulate downstream writes until governance signs off. A propose-and-approve UI beats silent production integration (OpenAI safety best practices).
The reference architecture patterns show both live paths and configure-keys states. Prefer honest partial integration over fake screenshots.
If data access slips, narrow the corpus rather than faking retrieval. A smaller approved vector index with citations beats a hallucinated demo.
- Build real: vector index (Azure vector search), embed pipeline, auth, structured logs
- Build real: golden set (OpenAI evals) eval harness in CI or weekly job
- Simulate: CRM write until HITL (OpenAI safety best practices) pattern reviewed
- Simulate: billing actions always during pilot
- Common mistake: skipping logging because "it's just a pilot"
- What good looks like: same observability defaults as the production readiness guide (checklist guide)
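The propose-and-approve idea can be illustrated with an in-memory queue in which an approved proposal is only recorded, never sent downstream. This is a sketch under stated assumptions: the class names, the `crm_note` target label, and the status strings are hypothetical, and a real pilot would persist the queue and log every review.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Proposal:
    proposal_id: int
    target: str           # hypothetical label for the downstream system
    payload: dict
    status: str = "pending"
    reviewer: str = ""


@dataclass
class ProposalQueue:
    """Holds proposed writes; nothing reaches a downstream system from here."""
    proposals: Dict[int, Proposal] = field(default_factory=dict)
    simulated_writes: List[Proposal] = field(default_factory=list)

    def propose(self, target: str, payload: dict) -> Proposal:
        proposal = Proposal(len(self.proposals) + 1, target, payload)
        self.proposals[proposal.proposal_id] = proposal
        return proposal

    def review(self, proposal_id: int, reviewer: str, approve: bool) -> Proposal:
        proposal = self.proposals[proposal_id]
        if proposal.status != "pending":
            raise ValueError(f"proposal {proposal_id} is already {proposal.status}")
        proposal.status = "approved" if approve else "rejected"
        proposal.reviewer = reviewer
        if approve:
            # Simulated write: record what would be sent instead of calling the CRM.
            self.simulated_writes.append(proposal)
        return proposal


queue = ProposalQueue()
note = queue.propose("crm_note", {"account": "example", "text": "Follow up next week"})
queue.review(note.proposal_id, reviewer="supervisor", approve=True)
print(note.status, len(queue.simulated_writes))  # approved 1
```

Keeping the simulated write log lets the week-six readout show exactly what would have been written, which is useful evidence when governance decides whether to enable real writes.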
07
Week-by-week execution shape
Weeks 1 to 2 focus on discovery and baseline. Finalise golden questions (OpenAI evals), measure current handle time or triage accuracy, secure data access, and stand up retrieval with cite-only generation. No hero integrations yet.
Weeks 3 to 4 add Content Safety, evals, and narrow tooling. Introduce propose-only actions if required.
Weeks 5 to 6 put the workflow in front of users, measure against baseline, and draft the scale, pivot, or stop recommendation (OpenAI evals). Reserve the last three days for the sponsor readout.
Hold a mid-pilot checkpoint at week three. If stop rules trend red, escalate to NIST AI RMF Govern early.
- Week 1: charter signed, baseline, vector index v1
- Week 2: golden set scored, citation rate visible
- Week 3: safety on, eval job, NIST AI RMF Govern checkpoint
- Week 4: stable UI, HITL path, cost per task
- Week 5: cohort live, weekly metric email
- Week 6: metrics, deck, handoff list for the production readiness conversation
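The week 2 milestone, a scored golden set with a visible citation rate, needs only a small scoring function. The sketch below assumes one simple definition: an answer counts as cited when it references at least one expected source document, and as unknown when the system declined to answer. The `GoldenResult` record and these definitions are assumptions, not an established standard.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class GoldenResult:
    """Outcome for one golden question (hypothetical structure)."""
    question_id: str
    expected_sources: FrozenSet[str]   # document ids a correct answer should cite
    cited_sources: FrozenSet[str]      # document ids the system actually cited
    answered: bool                     # False when the system replied "unknown"


def golden_set_report(results: List[GoldenResult]) -> Dict[str, float]:
    """Citation rate and unknown-answer rate over the whole golden set."""
    if not results:
        raise ValueError("golden set is empty")
    total = len(results)
    cited_ok = sum(
        1 for r in results if r.answered and r.cited_sources & r.expected_sources
    )
    unknown = sum(1 for r in results if not r.answered)
    return {"citation_rate": cited_ok / total, "unknown_rate": unknown / total}


# Illustrative results, not measured data.
results = [
    GoldenResult("q1", frozenset({"leave-policy"}), frozenset({"leave-policy"}), True),
    GoldenResult("q2", frozenset({"travel-policy"}), frozenset({"travel-policy", "faq"}), True),
    GoldenResult("q3", frozenset({"expenses"}), frozenset({"expenses"}), True),
    GoldenResult("q4", frozenset({"remote-work"}), frozenset(), False),
]
print(golden_set_report(results))  # {'citation_rate': 0.75, 'unknown_rate': 0.25}
```

Tracking the two rates separately matters: an honest "unknown" lowers the citation rate but is usually a safer outcome than an uncited answer.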
08
Roles and RACI for six weeks
Name owners for business outcome, technical delivery, data access, risk sign-off, and user adoption, and record them in a RACI (NIST AI RMF).
The business owner approves scope and attends eval reviews (OpenAI evals). Engineering owns build, telemetry, and evals. The champion recruits users.
Legal reviews data privacy before indexing. Security reviews tool allow lists and Entra integration in week one.
Changes to the cost cap or to write tools need approval from the NIST AI RMF Govern chair plus the security liaison.
- Sponsor: metric, stop rules, steering readout
- Engineering: architecture, evals, cost controls
- Champion: users, feedback, shadow-AI signals
- Risk liaison: Content Safety, kill switch
- Workshop question: who is on call when the RAG index is empty at 9am on Monday?
09
Data, identity, and environment decisions
Decide in week one which data classifications enter the vector index, which Entra groups gate access, and whether the pilot runs in a sandbox or a production-adjacent tenant.
Prefer a sandbox whose telemetry and Content Safety settings match production over a production environment with logging disabled.
Index only documents the cohort may access. The ACL model must be the one you would use at scale (Microsoft Copilot data protection).
Record model region, retention, and subprocessors in a charter appendix (Copilot privacy as reference).
- Classify sources: public, internal, restricted
- Map SSO before launch (Entra conditional access)
- Document zero-retention options (production readiness conversation)
- Common mistake: personal API keys
- Secrets in vault (AI security controls)
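The rule to index only documents the cohort may access can be applied as a filter before anything is embedded. This sketch assumes that restricted documents stay out of the pilot index entirely and that a non-public document is indexable only when every cohort group is among its readers. The `SourceDoc` record and the group names are hypothetical; in practice the groups would come from the identity provider.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List

# Assumption: restricted sources are excluded from the pilot index.
ALLOWED_CLASSIFICATIONS = frozenset({"public", "internal"})


@dataclass(frozen=True)
class SourceDoc:
    doc_id: str
    classification: str             # "public", "internal", or "restricted"
    reader_groups: FrozenSet[str]   # directory groups permitted to read the document


def select_indexable(docs: Iterable[SourceDoc], cohort_groups: FrozenSet[str]) -> List[SourceDoc]:
    """Keep documents whose classification is allowed and that the whole cohort can read."""
    selected = []
    for doc in docs:
        if doc.classification not in ALLOWED_CLASSIFICATIONS:
            continue
        # Public documents need no group check; others require every cohort group as a reader.
        if doc.classification == "public" or cohort_groups <= doc.reader_groups:
            selected.append(doc)
    return selected


docs = [
    SourceDoc("holiday-calendar", "public", frozenset()),
    SourceDoc("leave-policy", "internal", frozenset({"hr-pilot", "all-staff"})),
    SourceDoc("exec-comp", "restricted", frozenset({"hr-pilot"})),
    SourceDoc("finance-only", "internal", frozenset({"finance"})),
]
cohort = frozenset({"hr-pilot"})
print([d.doc_id for d in select_indexable(docs, cohort)])  # ['holiday-calendar', 'leave-policy']
```

Filtering at index time is a pilot shortcut that works for a single cohort; a scaled deployment would typically also need per-user filtering at query time.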
10
Deliverables sponsors expect at week six
Executives fund decisions. Deliver a one-page outcome summary covering the scale, pivot, or stop recommendation, cost forecast, residual risk, and owners.
Include a before-and-after chart on the primary metric with sample size caveats (OpenAI evals).
Attach the architecture, golden set eval summary, incident log, and, if scaling, the gap list for the production readiness conversation.
If scale: name the workflow, go-live date, and budget. If stop: state what must change (RAG vs fine-tune) before a retry.
- Executive summary with headline metric and decision
- Architecture and data-flow (AWS AI compliance)
- Golden set report: citation and unknown-answer rates
- Cost per successful task: pilot vs steady state (AI Gateway)
- Open risks: retention, writes, region (data privacy)
- Workshop: fund phase two tomorrow? (NIST AI RMF Govern)
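Cost per successful task is total spend divided by the number of tasks that met the success definition, not by the number of requests. The sketch below shows that arithmetic for a pilot figure and a steady-state forecast. All numbers are illustrative, and what counts as a successful task is an assumption the charter has to define.

```python
def cost_per_successful_task(total_cost: float, successful_tasks: int) -> float:
    """Total spend divided by tasks that met the charter's success definition."""
    if successful_tasks <= 0:
        raise ValueError("at least one successful task is required")
    return total_cost / successful_tasks


# Illustrative figures only: pilot actuals versus a steady-state forecast.
pilot = cost_per_successful_task(total_cost=180.0, successful_tasks=120)
steady_state = cost_per_successful_task(total_cost=900.0, successful_tasks=1000)
print(pilot, steady_state)  # 1.5 0.9
```

Dividing by successful tasks rather than all requests keeps failed or abandoned attempts visible in the unit cost, which is the figure finance will compare against the ceiling in the stop rules.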
11
Common mistakes and anti-patterns
Naming anti-patterns early prevents repeat failures (OWASP LLM Top 10 in vendor documentation).
Scope creep should trigger a NIST AI RMF Govern charter amendment that states the timeline impact.
Skipping evals because the demo feels good fails by week five when real phrasing hits the golden set.
Silent production writes can erase trust in a single incident (OpenAI safety best practices).
- No baseline week: cannot prove value (OpenAI evals)
- Multiple primary metrics: steering argues
- Fine-tuning for facts instead of RAG: refresh cost surprise
- No stop rules: orphan beta service
- Ignoring Copilot coexistence: duplicate email pilots
- Demo-only logging: blind at scale (production readiness conversation)
- NIST AI RMF Govern cites stop rule status monthly
12
Workshop agenda: half-day pilot charter
Run this before sprint one. Attendees: sponsor, champion, engineering lead, risk liaison, ops analytics, and optionally the NIST AI RMF Govern chair.
Morning: workflow scorecard and metric. Afternoon: stop rules, scope, week plan, and RACI (NIST AI RMF).
End with signed one-page charter. No charter, no sprint.
Capture open questions with owners before the room clears.
- 0:00–0:30: Problem statement and user interviews
- 0:30–1:00: Workflow scorecard (reference architecture patterns)
- 1:00–1:45: Metric, baseline, data privacy class
- 1:45–2:15: Stop rules and scope in/out
- 2:15–3:00: Build vs simulate, RACI, NIST AI RMF Govern checkpoints
- 3:00–3:30: Readout and steering date
Provider & framework documentation
Official docs referenced in this guide. Use these in architecture reviews and security questionnaires.