AI Labs by The Ops Toolbox


Production readiness is a programme gate, not a slide

Pilots optimise for learning speed. Production optimises for accountability: who owns incidents, what gets logged, how quality regressions are caught, and how humans override unsafe behaviour, in line with the NIST AI RMF Manage themes.

Security, legal, and operations often ask the same questions late because engineering shipped a demo with console logging and no runbook. This guide sets the minimum bar before go-live and is meant to be paired with a production readiness conversation.

Readiness is not perfection. It is documented owners, tested controls, and honest gaps with dates. Sponsors sign scope, limits, and 90-day metrics (OpenAI evals), not a vague AI launch.

Pair this checklist with the OWASP LLM Top 10 and an evidence pack when an InfoSec review is scheduled.

Gate: steering forum or NIST AI RMF Govern, not only engineering sign-off
Deliverable: one-page readiness scorecard mapped to NIST AI RMF Playbook controls
Common mistake: promoting pilot logging to production without Bedrock logging or SIEM export
What good looks like: production runs on the same telemetry defaults that were promised at the gate (NIST AI RMF)

Observability and audit

Demos log to the console. Production logs model ID, latency, token usage, retrieval hits, tool calls, safety scores, and user/session correlation to your SIEM via Azure Foundry monitor or equivalent.

Without structured logs you cannot debug incidents, prove ROI to finance, or answer audit questions about who received which answer from which document. The production readiness conversation should spell out the minimum fields.

Dashboards should be operational, not vanity. Latency p95, error rate, empty retrieval rate, and cost per workflow matter more than total messages on an OpenAI evals dashboard.

Retention on logs must match data privacy policy. Redact secrets and minimise PII in log payloads while keeping correlation IDs.
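As a minimal sketch of the fields described above, one request could be written as a single JSON line. The field names, the `redact` helper, and both regular expressions are illustrative assumptions, not a SIEM schema; real redaction rules need to follow your own data privacy policy.

```python
import json
import re
import time

# Hypothetical patterns; extend them to match your own secret and PII formats.
SECRET_PATTERN = re.compile(r"(?i)\b(api[_-]?key|token|password)\s*[=:]\s*\S+")
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact(text: str) -> str:
    """Mask obvious secrets and email addresses before the text is logged."""
    text = SECRET_PATTERN.sub(lambda m: f"{m.group(1)}=[REDACTED]", text)
    return EMAIL_PATTERN.sub("[EMAIL]", text)


def build_log_record(correlation_id: str, session_id: str, model_id: str,
                     latency_ms: int, prompt_tokens: int, completion_tokens: int,
                     retrieval_hits: list, tool_calls: list,
                     safety_score: float, prompt_preview: str) -> str:
    """Return one JSON line with the per-request fields named above."""
    record = {
        "timestamp": time.time(),
        "correlation_id": correlation_id,   # kept intact so incidents can be traced
        "session_id": session_id,
        "model_id": model_id,
        "latency_ms": latency_ms,
        "tokens": {"prompt": prompt_tokens, "completion": completion_tokens},
        "retrieval_hits": retrieval_hits,   # document IDs, not document text
        "tool_calls": tool_calls,
        "safety_score": safety_score,
        "prompt_preview": redact(prompt_preview)[:200],  # minimised payload
    }
    return json.dumps(record, sort_keys=True)
```

The correlation ID survives redaction on purpose: it is what lets an investigator join the model call, the retrieval step, and any tool invocation without storing the full prompt.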

Structured logs per request with correlation IDs (AI SDK telemetry)
Dashboards: latency p95, error rate, cost per workflow (AI Gateway tags)
RAG (Azure RAG concepts): retrieval hit rate and empty-result rate
Tools: invocation audit trail with approver identity
Workshop question: "Can we reconstruct one answer end to end from logs?"
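The dashboard metrics listed above can be derived from those per-request logs. The following is a rough sketch that assumes each record is a dict with hypothetical keys `workflow`, `latency_ms`, `error`, `retrieval_hits`, and `cost_usd`; a production dashboard would normally compute these in the observability platform rather than in application code.

```python
from collections import defaultdict


def p95(values: list) -> float:
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100   # integer ceil(0.95 * n)
    return ordered[rank - 1]


def summarise(records: list) -> dict:
    """Aggregate request logs into the four operational metrics."""
    if not records:
        raise ValueError("no records to summarise")
    total = len(records)
    cost_per_workflow = defaultdict(float)
    for r in records:
        cost_per_workflow[r["workflow"]] += r["cost_usd"]
    return {
        "latency_p95_ms": p95([r["latency_ms"] for r in records]),
        "error_rate": sum(bool(r["error"]) for r in records) / total,
        # An empty hit list means retrieval found nothing for that request.
        "empty_retrieval_rate": sum(not r["retrieval_hits"] for r in records) / total,
        "cost_per_workflow": dict(cost_per_workflow),
    }
```

Nearest-rank is used here because it always returns an observed latency; other percentile definitions interpolate and will give slightly different numbers.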

Evaluation before merge

Treat prompts, retrieval config, and tool schemas like application code. Run golden-set evals (OpenAI evals) in CI: same questions, rubric-scored answers, regression alerts when quality drops.

Evals are not a one-off pilot activity. Every prompt or index change should trigger the harness via Azure prompt flow evaluation, or you accept silent quality drift.

Include safety cases in the golden set (OpenAI evals): prompt injection attempts, empty retrieval, and requests for prohibited actions. Quality is not only ROUGE scores.

Publish eval trends to champions and NIST AI RMF Govern monthly so business sees engineering discipline, not black boxes.

20 to 50 golden questions (OpenAI evals) with Anthropic eval rubrics where helpful
CI or scheduled job blocks deploy on regression threshold, citation cases included (Azure RAG solution guide)
Version prompts and chunking config in git (Azure RAG concepts)
Post-incident: add failure cases to golden set (OpenAI evals) within one week
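A regression gate of the kind described above could look roughly like this. It assumes rubric scores are already normalised to the range 0 to 1 and keyed by question ID; the `max_drop` threshold, the `safety_ids` set, and the rule that safety cases must score a full 1.0 are illustrative choices, not values taken from any eval framework.

```python
from statistics import mean


def regression_gate(baseline: dict, candidate: dict,
                    max_drop: float = 0.05, safety_ids=frozenset()):
    """Compare candidate rubric scores with the baseline run.

    Returns (passed, reasons). Any reason means the deploy should be blocked.
    """
    reasons = []
    missing = set(baseline) - set(candidate)
    if missing:
        reasons.append(f"missing cases: {sorted(missing)}")
    shared = [qid for qid in baseline if qid in candidate]
    if shared:
        drop = mean(baseline[q] for q in shared) - mean(candidate[q] for q in shared)
        if drop > max_drop:
            reasons.append(f"mean score dropped by {drop:.3f}")
    # Safety cases (injection, empty retrieval, prohibited actions) are pass/fail.
    for qid in sorted(safety_ids):
        if candidate.get(qid, 0.0) < 1.0:
            reasons.append(f"safety case failed: {qid}")
    return (not reasons, reasons)
```

In a CI job the caller would typically print the reasons and exit non-zero when `passed` is false, so the pipeline stops before the prompt or index change is promoted.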

Safety and human oversight

Any write to customer-facing systems needs human approval (OpenAI safety best practices) or a deterministic policy gate, not model discretion alone. Azure Content Safety classifiers, PII redaction, and budget guardrails belong in the request path.

Read-only first is policy for regulated programmes, not a delay tactic. A propose, approve, execute flow separates agent creativity from system accountability per NIST Manage.
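One way to picture the propose, approve, execute split is the sketch below. The `Proposal` class, the `READ_ONLY_TOOLS` allow list, and the tool names are hypothetical; the point is only that a write cannot run until a named approver is recorded, and that the approver identity lands in the audit trail.

```python
from dataclasses import dataclass
from typing import Optional


class ApprovalRequired(Exception):
    """Raised when a write tool is executed without a recorded approver."""


@dataclass
class Proposal:
    tool: str
    args: dict
    proposed_by: str                    # agent or model identifier
    approved_by: Optional[str] = None   # human approver, set by approve()


READ_ONLY_TOOLS = {"search_kb"}  # hypothetical allow list of tools that never write


def approve(proposal: Proposal, approver: str) -> Proposal:
    """Record the approver; the proposer may not approve its own action."""
    if approver == proposal.proposed_by:
        raise ValueError("proposer cannot approve its own action")
    proposal.approved_by = approver
    return proposal


def execute(proposal: Proposal, handlers: dict, audit_log: list):
    """Run the tool only if it is read-only or has an approver, then audit it."""
    if proposal.tool not in READ_ONLY_TOOLS and proposal.approved_by is None:
        raise ApprovalRequired(f"{proposal.tool} needs approval before execution")
    result = handlers[proposal.tool](**proposal.args)
    audit_log.append({
        "tool": proposal.tool,
        "args": proposal.args,
        "proposed_by": proposal.proposed_by,
        "approved_by": proposal.approved_by,
    })
    return result
```

The gate is deterministic: nothing in the model output can set `approved_by`, which is what keeps accountability with the system rather than with the agent.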

Test incident runbooks: disable tool calling, route to human queue, preserve logs. Drill before executives ask whether you have practised it.

Align the channel matrix with Copilot coexistence: where employees may use Microsoft 365 Copilot and where citation-required apps apply.

Human approval for CRM, billing, and external email writes (HITL guide)
Content Safety and Bedrock Guardrails on inputs and outputs
Rate limits and token/cost caps per tenant (unified model gateway)
Kill switch tested in last 90 days (AI security controls)
An "unknown" answer when RAG retrieval is empty, not an invention
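The kill switch and the empty-retrieval rule can share one request path. This is a sketch under assumptions: `RuntimeFlags`, `UNKNOWN_ANSWER`, and the `retrieve` and `generate` callables are placeholders for whatever your stack provides, and the flags would normally come from a configuration service that operators can flip during an incident.

```python
from dataclasses import dataclass

UNKNOWN_ANSWER = "I could not find this in the approved sources."


@dataclass
class RuntimeFlags:
    tools_enabled: bool = True     # kill switch for tool calling
    route_to_human: bool = False   # send every request to the human queue


def answer(question: str, retrieve, generate, flags: RuntimeFlags) -> dict:
    """Answer from retrieved passages only, honouring the incident flags."""
    if flags.route_to_human:
        return {"status": "human_queue", "answer": None, "sources": []}
    passages = retrieve(question)
    if not passages:
        # Empty retrieval: return a fixed message instead of calling the model.
        return {"status": "unknown", "answer": UNKNOWN_ANSWER, "sources": []}
    text = generate(question, passages, allow_tools=flags.tools_enabled)
    return {"status": "answered", "answer": text,
            "sources": [p["doc_id"] for p in passages]}
```

A runbook drill then reduces to flipping the two flags and checking that responses change status while logging continues.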

Identity, secrets, and environments

Production binds sessions to corporate SSO via Entra conditional access. Service principals or managed identities hold model keys; personal API keys are a shadow-AI vector.

Separate dev, test, and prod keys and indexes. Dev must not point at production corpora per data privacy classification.

Rotate secrets on offboarding and quarterly. Scan repos and CI for leaked keys, as agreed in the production readiness conversation.

Document which roles may deploy prompt changes vs infrastructure. Uncontrolled prompt deploys are production incidents waiting for OWASP LLM review.

SSO for users; scoped roles for approvers and admins (Entra ID)
Secrets in vault; break-glass MFA for admin actions (OWASP LLM Top 10)
No production customer data on developer laptops (Copilot privacy)
Model version pinning with promotion process (AI Gateway routes)
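Environment separation and model pinning can be checked mechanically before deploy. The config keys and the "<env>-" naming convention below are assumptions made for illustration; adapt the checks to however your indexes and vault references are actually named.

```python
def validate_environment(config: dict) -> list:
    """Return a list of violations; an empty list means the config passes."""
    env = config["environment"]
    problems = []
    # Assumed convention: index names and secret references start with "<env>-",
    # so a dev deployment pointing at a prod index is caught here.
    for key in ("index_name", "secret_ref"):
        if not config[key].startswith(f"{env}-"):
            problems.append(f"{key} '{config[key]}' does not belong to '{env}'")
    if config["model_version"] in ("", "latest"):
        problems.append("model_version must be pinned, not floating")
    if config.get("api_key"):
        problems.append("inline api_key found; reference the vault instead")
    return problems
```

Running a check like this in the deploy pipeline turns the separation rule into a failing build rather than a review comment.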

Data, retention, and regional boundaries

Classify corpora before indexing into a vector index. Restricted data needs stronger controls or exclusion from Microsoft 365 Copilot.

Align retention on prompts, logs, and vector stores with legal holds per the Microsoft Copilot data protection guidance. Pilots that ignored retention create expensive rework.

Document model region, subprocessors, and what happens when users travel across jurisdictions. AWS AI compliance pages help with procurement Q&A.

See Azure responsible AI overview for sponsor-facing policy language alongside retention schedules.

Index ACL matches source system permissions (Azure AI Search)
Retention schedule signed and implemented in config (Microsoft Copilot data protection)
DPA and subprocessors register current for chosen models (vendor selection guide)
Redaction before send where minimisation required (PII handling in production)
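The first two items can be illustrated with a short sketch: trimming retrieval results to the caller's groups, and flagging stored items that have outlived the retention schedule. The `allowed_groups` metadata field and the function names are hypothetical; many search services offer equivalent filtering natively, which is usually preferable to filtering in application code.

```python
from datetime import datetime, timedelta, timezone


def filter_by_acl(chunks: list, user_groups: set) -> list:
    """Keep only chunks whose source ACL shares a group with the user."""
    visible = []
    for chunk in chunks:
        allowed = set(chunk.get("allowed_groups", []))  # missing ACL means deny
        if allowed & set(user_groups):
            visible.append(chunk)
    return visible


def is_expired(created_at: datetime, retention_days: int, now: datetime) -> bool:
    """True when an item is older than the signed retention period."""
    return now - created_at > timedelta(days=retention_days)
```

Denying chunks that carry no ACL metadata is a deliberate default: an indexing bug then hides content instead of exposing it.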

Cost and capacity controls

Production without caps is an open invoice. Per-tenant and per-workflow limits, with graceful degradation, belong in the request path per unified model gateway.

Tag logs with team, workflow, and model ID for showback. Finance should see cost per successful task on the OpenAI evals dashboard, not only monthly token totals.

Load-test RAG retrieval and agent loops before wide rollout. Burst adoption can throttle shared indexes or AI Gateway.

Link to the unified model gateway and embeddings docs for routing, caching, and refresh cost patterns.

Hard cap vs soft alert documented with approver (unified model gateway)
Cost per successful task on steering dashboard (OpenAI evals)
AI Gateway or central key management for model routes
Batch jobs sized to avoid surprise embedding bills
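The difference between a soft alert and a hard cap can be shown in a few lines. The `Budget` class, the three return values, and the idea of passing in an estimated cost are all assumptions for illustration; a model gateway would typically enforce the same logic per tenant or per workflow.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    soft_alert_usd: float   # crossing this notifies the owner, requests continue
    hard_cap_usd: float     # crossing this triggers graceful degradation
    spent_usd: float = 0.0


def check_budget(budget: Budget, estimated_cost_usd: float) -> str:
    """Decide what to do with a request before it reaches the model."""
    projected = budget.spent_usd + estimated_cost_usd
    if projected > budget.hard_cap_usd:
        # Degrade rather than fail: cached answer, smaller model, or human queue.
        return "degrade"
    if projected > budget.soft_alert_usd:
        return "allow_with_alert"
    return "allow"


def record_spend(budget: Budget, actual_cost_usd: float) -> None:
    """Add the real cost once the request has completed."""
    budget.spent_usd += actual_cost_usd
```

Checking the projected spend before the call, rather than the actual spend after it, is what stops a single expensive agent loop from overshooting the cap.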


Operating rhythm after launch

Assign owners for taxonomy changes, prompt versions, vector index refreshes, and tool allow lists. Review override rates monthly per OpenAI safety best practices.

Schedule index rebuilds when source systems change. Version embeddings and chunking config alongside application deploys.

NIST AI RMF Govern should receive a standard monthly pack: incidents, eval trend, cost, adoption, open risks.

Decommissioning matters: how to drain traffic, archive logs, and delete indexes when a workflow ends (NIST Manage).

Named owner for prompts, indexes, and on-call (NIST AI RMF Govern roster)
Monthly review: quality, cost, incidents, overrides (OpenAI evals)
Quarterly: model route review, MITRE ATLAS findings, champion feedback
Runbook for disable-tools and human fallback

Minimum bar checklist

Use this as a steering-forum gate before production. Red items block launch; amber items need dated remediation plans signed by the sponsor per NIST AI RMF Playbook.

Do not waive logging or write controls for executive deadlines. Shortcuts here dominate post-incident reviews cited in OWASP LLM Top 10.

Attach evidence links (dashboard, eval report, runbook drill date), not assertions, to the compliance record (AWS AI compliance).

Logging and dashboards live in ops tools (Foundry monitor)
Golden-set evals (OpenAI evals) run on prompt or index changes
Write tools behind human approval (OpenAI safety best practices) or policy gates
Rate limits and cost caps configured (unified model gateway)
Runbook tested for disable-tools and human fallback (security controls)
Data retention and PII handling documented (data privacy)
Security liaison signed ISO 42001-aligned control checklist
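The gate rule above (red blocks launch, amber needs a dated plan signed by the sponsor) is simple enough to encode. The item fields `status`, `remediation_date`, and `signed_by` are hypothetical names for whatever the scorecard records.

```python
def launch_decision(items: list):
    """Apply the red/amber/green gate and return (decision, blocking item names)."""
    red = [i["name"] for i in items if i["status"] == "red"]
    if red:
        return ("no-go", red)
    # Amber is acceptable only with a remediation date and a sponsor signature.
    unplanned = [
        i["name"] for i in items
        if i["status"] == "amber"
        and not (i.get("remediation_date") and i.get("signed_by"))
    ]
    if unplanned:
        return ("no-go", unplanned)
    return ("go", [])
```

Encoding the rule keeps the steering forum honest: an amber item without a date or a signature is treated the same as a red one.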

Production gate for sponsors

Sponsors should sign off on scope (which workflows), limits (what the system will not do), and metrics (OpenAI evals for 90 days post-launch).

Without that, engineering ships a demo and operations inherits an unowned service. Name the business owner on-call per NIST AI RMF Govern charter, not only the vendor.

Include an escalation path for when the model is wrong or unsafe. Employees need a human review queue (OpenAI safety best practices), not a dead-end chat.

Scale funding should depend on hitting the agreed metric thresholds without tripping stop rules, not on demo applause.

90-day primary KPI and supporting metrics agreed
Stop rules (NIST AI RMF) for post-launch quality or cost breaches
Communication plan for users and help desk
Budget for steady-state ops, not only build

Handoff to internal teams

Deliver runbooks, architecture diagrams, eval datasets, and a recorded walkthrough of disable-tools and human-fallback. End with artefacts your team can operate (treat reference architecture patterns as patterns).

Train help desk on empty RAG retrieval, approval delays, and Content Safety blocks. Scripts reduce shadow workarounds.

List open production gaps with owners and dates. Honest amber items build trust; hidden gaps destroy credibility at the first security incident.

Schedule a 30-day post-launch review with the same attendees as the readiness gate (scoping pilot readout format).

Runbook PDF plus link to live dashboards (telemetry)
Eval dataset location and how to add golden set cases
IAM matrix and data-flow diagram current (AWS AI compliance)
Champion briefing on what changed at go-live (NIST AI RMF Govern)

Workshop: readiness review in two hours

Attendees: sponsor, engineering lead, security liaison, ops/SRE, champion. Bring pilot scoping metrics, architecture diagram, and draft runbook.

Score each checklist section red/amber/green against the NIST AI RMF Playbook. Agree a launch date or a remediation plan with owners.

Run a 15-minute tabletop: RAG returns empty on a high-profile question. Trace response, logging, and comms.

Output: signed scorecard stored where NIST AI RMF Govern and audit can find it.

0:00 to 0:30: Observability and eval evidence
0:30 to 1:00: Safety, Entra, data privacy
1:00 to 1:30: Cost caps and operating rhythm
1:30 to 2:00: Tabletop and go/no-go (NIST AI RMF)
