Production readiness checklist

What pilots skip, and what security, legal, and operations ask for before go-live.

Business sponsors, Technical leaders

Production readiness is a programme gate, not a slide

Pilots optimise for learning speed. Production optimises for accountability: who owns incidents, what gets logged, how quality regressions are caught, and how humans override unsafe behaviour, in line with the NIST AI RMF Manage function.

Security, legal, and operations often ask the same questions late because engineering shipped a demo with console logging and no runbook. This guide sets the minimum bar before go-live and is meant to be paired with a production readiness conversation.

Readiness is not perfection. It is documented owners, tested controls, and honest gaps with dates. Sponsors sign off on scope, limits, and 90-day metrics (OpenAI evals), not a vague AI launch.

Pair this checklist with the OWASP LLM Top 10 and an evidence pack when an InfoSec review is scheduled.

Observability and audit

Demos log to the console. Production logs model ID, latency, token usage, retrieval hits, tool calls, safety scores, and user/session correlation to your SIEM via Azure Foundry monitor or equivalent.

Without structured logs you cannot debug incidents, prove ROI to finance, or answer audit questions about who received which answer from which document. The production readiness conversation should spell out the minimum fields.

Dashboards should be operational, not vanity. Latency p95, error rate, empty retrieval rate, and cost per workflow matter more than total messages on an OpenAI evals dashboard.
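As a rough illustration of the operational metrics named above, the sketch below computes latency p95, error rate, and empty retrieval rate from a batch of request records. The record fields (`latency_ms`, `error`, `retrieval_hits`) are assumptions made for this example, not a required schema.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    # Hypothetical fields; map them to whatever your structured logs carry.
    latency_ms: float
    error: bool
    retrieval_hits: int


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    # Integer arithmetic for ceil(0.95 * n), avoiding float rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def operational_metrics(records):
    """Summarise one batch of requests into dashboard-ready numbers."""
    total = len(records)
    return {
        "latency_p95_ms": p95([r.latency_ms for r in records]),
        "error_rate": sum(r.error for r in records) / total,
        "empty_retrieval_rate": sum(r.retrieval_hits == 0 for r in records) / total,
    }
```

Nearest-rank is only one of several percentile definitions; whichever you choose, use the same one on every dashboard so trends stay comparable.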

Retention on logs must match data privacy policy. Redact secrets and minimise PII in log payloads while keeping correlation IDs.

  • Structured logs per request with correlation IDs (AI SDK telemetry)
  • Dashboards: latency p95, error rate, cost per workflow (AI Gateway tags)
  • RAG (Azure RAG concepts): retrieval hit rate and empty-result rate
  • Tools: invocation audit trail with approver identity
  • Workshop question: "Can we reconstruct one answer end to end from logs?"
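To make the logging bullets concrete, here is a minimal sketch of one structured log record per request, with a correlation ID and basic redaction before the payload leaves the application. The field names and the two regular expressions are illustrative assumptions; real secret and PII detection usually needs a dedicated scanner.

```python
import json
import re
import uuid

# Illustrative patterns only: an "sk-..." style key and a simple email shape.
SECRET_PATTERN = re.compile(r"sk-[A-Za-z0-9]{8,}")
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact(text):
    """Replace secret-like and email-like substrings with placeholders."""
    text = SECRET_PATTERN.sub("[REDACTED_SECRET]", text)
    return EMAIL_PATTERN.sub("[REDACTED_EMAIL]", text)


def build_log_record(model_id, latency_ms, total_tokens, retrieved_doc_ids,
                     tool_calls, prompt_preview, correlation_id=None):
    """Assemble one log record; a correlation ID is generated if none is passed."""
    return {
        "correlation_id": correlation_id or str(uuid.uuid4()),
        "model_id": model_id,
        "latency_ms": latency_ms,
        "total_tokens": total_tokens,
        "retrieved_doc_ids": list(retrieved_doc_ids),
        "tool_calls": list(tool_calls),
        "prompt_preview": redact(prompt_preview),
    }


def to_json_line(record):
    """Serialise a record as a single JSON line for shipping to a log pipeline."""
    return json.dumps(record, sort_keys=True)
```

With `retrieved_doc_ids` and `correlation_id` on every record, the workshop question above becomes a query rather than an investigation.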

Evaluation before merge

Treat prompts, retrieval config, and tool schemas like application code. Run golden-set evals (OpenAI evals) in CI: same questions, rubric-scored answers, regression alerts when quality drops.

Evals are not a one-off pilot activity. Every prompt or index change should trigger the harness via Azure prompt flow evaluation, or you accept silent quality drift.

Include safety cases in the golden set (OpenAI evals): prompt injection attempts, empty retrieval, and requests for prohibited actions. Quality is not only ROUGE scores.
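One way to wire this into CI is a gate that compares the candidate run against a stored baseline. The sketch below is a generic illustration and does not use the OpenAI evals API; `GoldenCase`, the `max_drop` tolerance, and the `safety_floor` are hypothetical names and thresholds, and scores are assumed to come from your own rubric grader.

```python
from dataclasses import dataclass


@dataclass
class GoldenCase:
    case_id: str
    category: str  # assumed labels: "quality" or "safety"
    score: float   # rubric score in [0, 1] from your grader


def regression_gate(baseline, candidate, max_drop=0.02, safety_floor=1.0):
    """Return a list of failure reasons; an empty list means the gate passes."""
    failures = []
    base_mean = sum(c.score for c in baseline) / len(baseline)
    cand_mean = sum(c.score for c in candidate) / len(candidate)
    if base_mean - cand_mean > max_drop:
        failures.append(
            f"mean score dropped from {base_mean:.2f} to {cand_mean:.2f}"
        )
    # Safety cases are checked individually so an average cannot hide them.
    for case in candidate:
        if case.category == "safety" and case.score < safety_floor:
            failures.append(f"safety case {case.case_id} scored {case.score:.2f}")
    return failures
```

In a CI job, a non-empty result would typically fail the build and print the reasons.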

Publish eval trends monthly to champions and to the governance forum (NIST AI RMF Govern) so the business sees engineering discipline, not black boxes.

Safety and human oversight

Any write to customer-facing systems needs human approval (per OpenAI safety best practices) or a deterministic policy gate, not model discretion alone. Azure Content Safety classifiers, PII redaction, and budget guardrails belong in the request path.

Read-only first is a policy for regulated programmes, not a delay tactic. A propose, approve, execute flow separates agent creativity from system accountability per NIST Manage.
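The propose, approve, execute split can be sketched as a small gate object. This is a minimal illustration under assumed names (`ProposedAction`, `ApprovalGate`); a production version would persist the audit trail and authenticate approvers through your identity provider.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProposedAction:
    action_id: str
    description: str
    approved_by: Optional[str] = None
    executed: bool = False


class ApprovalGate:
    """Refuses to run any action that a named approver has not signed off."""

    def __init__(self, allowed_approvers):
        self.allowed_approvers = set(allowed_approvers)
        self.audit_trail = []  # in-memory only for this sketch

    def approve(self, action, approver):
        if approver not in self.allowed_approvers:
            raise PermissionError(f"{approver} may not approve actions")
        action.approved_by = approver
        self.audit_trail.append(("approved", action.action_id, approver))

    def execute(self, action, run):
        # The model proposes; only an approved action reaches the write path.
        if action.approved_by is None:
            raise PermissionError("action has not been approved")
        result = run()
        action.executed = True
        self.audit_trail.append(("executed", action.action_id, action.approved_by))
        return result
```

The approver identity recorded here is the same field the tool invocation audit trail needs.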

Test incident runbooks: disable tool calling, route to human queue, preserve logs. Drill before executives ask whether you have practised it.

Align the channel matrix with Copilot coexistence: where employees may use Microsoft 365 Copilot and where citation-required apps apply.

Identity, secrets, and environments

Production binds sessions to corporate SSO via Entra conditional access. Service principals or managed identities hold model keys; personal API keys are a shadow-AI vector.

Separate dev, test, and prod keys and indexes. Dev must not point at production corpora per data privacy classification.

Rotate secrets on offboarding and quarterly. Scan repos and CI for leaked keys, as agreed in the production readiness conversation.

Document which roles may deploy prompt changes and which may deploy infrastructure. Uncontrolled prompt deploys are production incidents waiting to happen and belong in scope for the OWASP LLM review.

  • SSO for users; scoped roles for approvers and admins (Entra ID)
  • Secrets in vault; break-glass MFA for admin actions (OWASP LLM Top 10)
  • No production customer data in developer laptops (Copilot privacy)
  • Model version pinning with promotion process (AI Gateway routes)
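A simple pre-deploy check can catch the most common environment mistake, a non-production environment pointing at production resources. The sketch below assumes a hypothetical config shape in which each environment names a vault key reference (`key_ref`) and an index (`index`); it is not tied to any particular cloud or vault product.

```python
def validate_environments(envs):
    """envs maps environment name -> {"key_ref": str, "index": str}.

    Returns a list of problems; an empty list means the separation holds.
    """
    problems = []
    prod = envs.get("prod", {})
    for name, cfg in envs.items():
        if name == "prod":
            continue
        if cfg["index"] == prod.get("index"):
            problems.append(f"{name} points at the production index")
        if cfg["key_ref"] == prod.get("key_ref"):
            problems.append(f"{name} shares the production key reference")
    return problems
```

Running this in CI before promotion keeps the dev, test, and prod split enforced rather than merely documented.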

Data, retention, and regional boundaries

Classify corpora before indexing into a vector index. Restricted data needs stronger controls or exclusion from Microsoft 365 Copilot.

Align retention on prompts, logs, and vector stores with legal holds, per Microsoft Copilot data protection guidance. Pilots that ignored retention create expensive rework.

Document model region, subprocessors, and what happens when users travel across jurisdictions. AWS AI compliance pages help with procurement Q&A.

See Azure responsible AI overview for sponsor-facing policy language alongside retention schedules.

Cost and capacity controls

Production without caps is an open invoice. Per-tenant and per-workflow limits, with graceful degradation, belong in the request path, enforced at the unified model gateway.
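A per-workflow cap with graceful degradation might look like the sketch below. The three outcomes (`allow`, `degrade`, `reject`) and the 80% soft threshold are assumptions chosen for illustration; "degrade" could mean routing to a cheaper model or a cached answer, depending on your gateway.

```python
from dataclasses import dataclass


@dataclass
class WorkflowBudget:
    monthly_limit_usd: float
    spent_usd: float = 0.0
    soft_threshold: float = 0.8  # fraction of the limit at which to degrade

    def decide(self, estimated_cost_usd):
        """Decide how to handle a request before it is sent to the model."""
        projected = self.spent_usd + estimated_cost_usd
        if projected > self.monthly_limit_usd:
            return "reject"
        if projected > self.soft_threshold * self.monthly_limit_usd:
            return "degrade"
        return "allow"

    def record(self, actual_cost_usd):
        """Add the actual cost of a completed request to the running total."""
        self.spent_usd += actual_cost_usd
```

Checking the budget before the call, rather than after, is what keeps the cap in the request path instead of in a month-end report.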

Tag logs with team, workflow, and model ID for showback. Finance should see cost per successful task on the OpenAI evals dashboard, not only monthly token totals.
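Cost per successful task falls out of tagged logs directly. The sketch below assumes each log record carries `team`, `workflow`, `cost_usd`, and a boolean `success` field; those names are illustrative, and how "success" is defined is a decision for the sponsor and workflow owner.

```python
from collections import defaultdict


def cost_per_successful_task(log_records):
    """Group cost by (team, workflow) and divide by the number of successes.

    Returns None for a group with spend but no successful tasks, so that
    zero-success workflows stand out instead of raising a division error.
    """
    totals = defaultdict(lambda: {"cost": 0.0, "successes": 0})
    for rec in log_records:
        key = (rec["team"], rec["workflow"])
        totals[key]["cost"] += rec["cost_usd"]
        totals[key]["successes"] += 1 if rec["success"] else 0
    return {
        key: (t["cost"] / t["successes"] if t["successes"] else None)
        for key, t in totals.items()
    }
```

Failed attempts still count toward cost, which is the point: a workflow that retries often looks cheap per call and expensive per outcome.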

Load-test RAG retrieval and agent loops before wide rollout. Burst adoption can throttle shared indexes or AI Gateway.

Link to unified model gateway and embeddings docs for routing, caching, and refresh cost patterns.

Operating rhythm after launch

Assign owners for taxonomy changes, prompt versions, vector index refreshes, and tool allow lists. Review override rates monthly per OpenAI safety best practices.

Schedule index rebuilds when source systems change. Version embeddings and chunking config alongside application deploys.
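One lightweight way to version embedding and chunking config is to fingerprint it and store the fingerprint with the index. The sketch below is an assumption about how that could work, with hypothetical config keys; it is not a feature of any specific vector store.

```python
import hashlib
import json


def index_config_fingerprint(config):
    """Stable short hash of the config, independent of key order."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def needs_rebuild(deployed_fingerprint, config):
    """True when the index was built with a different config than the app expects."""
    return deployed_fingerprint != index_config_fingerprint(config)
```

For example, a deploy step could compare the fingerprint stored with the index against the one computed from the application's config and refuse to promote on a mismatch.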

NIST AI RMF Govern should receive a standard monthly pack: incidents, eval trend, cost, adoption, open risks.

Decommissioning matters too: document how to drain traffic, archive logs, and delete indexes when a workflow ends (NIST Manage).

Minimum bar checklist

Use this as a steering-forum gate before production. Red items block launch; amber items need dated remediation plans signed by the sponsor per NIST AI RMF Playbook.
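The red and amber rule above can be expressed as a small scoring function. The item fields (`status`, `remediation_date`, `signed_by_sponsor`) are hypothetical names for this sketch; the logic simply encodes "red blocks, amber needs a dated and signed plan".

```python
def gate_decision(items):
    """items: list of dicts with 'name', 'status' ('red' | 'amber' | 'green'),
    and, for amber items, 'remediation_date' and 'signed_by_sponsor'.
    """
    blockers = [i["name"] for i in items if i["status"] == "red"]
    amber_without_plan = [
        i["name"]
        for i in items
        if i["status"] == "amber"
        and not (i.get("remediation_date") and i.get("signed_by_sponsor"))
    ]
    return {
        "launch_allowed": not blockers and not amber_without_plan,
        "blockers": blockers,
        "amber_without_plan": amber_without_plan,
    }
```

Keeping the decision mechanical makes it harder to waive a red item informally under deadline pressure.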

Do not waive logging or write controls for executive deadlines. Shortcuts here dominate post-incident reviews cited in OWASP LLM Top 10.

Attach evidence links (dashboard, eval report, runbook drill date), not assertions, to the compliance evidence pack (see AWS AI compliance).

Production gate for sponsors

Sponsors should sign off on scope (which workflows), limits (what the system will not do), and metrics (OpenAI evals for 90 days post-launch).

Without that, engineering ships a demo and operations inherits an unowned service. Name the on-call business owner in the governance charter (NIST AI RMF Govern), not only the vendor.

Include an escalation path for when the model is wrong or unsafe. Employees need a human review queue (per OpenAI safety best practices), not a dead-end chat.

Scale-up funding should depend on hitting metric thresholds and staying within stop rules, not demo applause.

  • 90-day primary KPI and supporting metrics agreed
  • Stop rules (NIST AI RMF) for post-launch quality or cost breaches
  • Communication plan for users and help desk
  • Budget for steady-state ops, not only build
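Stop rules are easiest to enforce when they are written down as data. The sketch below assumes a hypothetical rule format with a metric name, a direction (`min` or `max`), and a threshold; the metric names and numbers in the usage example are placeholders, not recommended values.

```python
def breached_stop_rules(metrics, rules):
    """Return the names of rules whose thresholds the current metrics breach.

    rules: list of dicts with 'name', 'metric', 'direction', 'threshold'.
    direction 'min' means the metric must stay at or above the threshold;
    'max' means it must stay at or below it.
    """
    breaches = []
    for rule in rules:
        value = metrics[rule["metric"]]
        if rule["direction"] == "min" and value < rule["threshold"]:
            breaches.append(rule["name"])
        elif rule["direction"] == "max" and value > rule["threshold"]:
            breaches.append(rule["name"])
    return breaches


# Placeholder rules for illustration only.
example_rules = [
    {"name": "quality floor", "metric": "eval_score", "direction": "min", "threshold": 0.8},
    {"name": "cost ceiling", "metric": "cost_per_task_usd", "direction": "max", "threshold": 2.0},
]
```

A monthly review can then report breaches by name instead of debating whether a trend counts as one.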

Handoff to internal teams

Deliver runbooks, architecture diagrams, eval datasets, and a recorded walkthrough of the disable-tools and human-fallback procedures. End with artefacts your team can operate (treat reference architecture patterns as patterns).

Train help desk on empty RAG retrieval, approval delays, and Content Safety blocks. Scripts reduce shadow workarounds.

List open production gaps with owners and dates. Honest amber items build trust; hidden gaps destroy credibility at the first security incident.

Schedule a 30-day post-launch review with the same attendees as the readiness gate (use the pilot scoping readout format).

Workshop: readiness review in two hours

Attendees: sponsor, engineering lead, security liaison, ops/SRE, champion. Bring pilot scoping metrics, architecture diagram, and draft runbook.

Score each checklist section red/amber/green against NIST AI RMF Playbook. Agree launch date or remediation plan with owners.

Run a 15-minute tabletop: RAG returns empty on a high-profile question. Trace response, logging, and comms.

Output: signed scorecard stored where NIST AI RMF Govern and audit can find it.

Next step

Talk about your next pilot

Patterns, metrics, and runnable demos for architecture reviews and pilots, from The Ops Toolbox.

  • One workflow, clear metrics
  • Your cloud, your keys
  • Written handoff, not dependency