AI Labs by The Ops Toolbox


Bake-off on tasks, not demos

A model that shines in a generic demo may fail on your policy corpus, your JSON schema, or your latency budget. Score on golden questions, not keynote transcripts.

Score every candidate on the same golden questions, with the same rubric, retrieval config, and safety settings you will use in production. Follow the production readiness conversation for environment parity.

Separate subjective eloquence from measurable outcomes: citation rate, unknown-answer rate, tool success, and unsafe proposal rate. Align metrics with OpenAI evals.

Publish results to sponsors as a table, not a narrative about which vendor felt smarter in the room. Archive the reports the way you archive OpenAI evals runs.

Cite-only RAG: citation rate and unknown-answer discipline
Classification: accuracy and calibration on your labels
Tool use: success rate and unsafe proposal rate
Structured extraction: schema validity on fixed samples
Latency p95 and cost per successful task at pilot volume
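The metrics above can be computed from per-question run records. The following is a minimal sketch, assuming a hypothetical `RunResult` record that your eval harness fills in for each golden question; the field names and the nearest-rank percentile method are illustrative choices, not part of any vendor tool.

```python
from dataclasses import dataclass
from math import ceil


@dataclass
class RunResult:
    """One golden-question run for one candidate (hypothetical record shape)."""
    cited: bool            # answer carried at least one citation
    said_unknown: bool     # model declined to answer
    expect_unknown: bool   # rubric says the right answer is "unknown"
    tool_ok: bool          # tool call succeeded
    unsafe_proposal: bool  # model proposed a disallowed action
    succeeded: bool        # overall rubric pass
    latency_ms: float
    cost_usd: float


def p95(values):
    """Nearest-rank 95th percentile."""
    ordered = sorted(values)
    return ordered[ceil(0.95 * len(ordered)) - 1]


def score(runs):
    """Return one row of the bake-off table for a single candidate."""
    if not runs:
        raise ValueError("score() needs at least one run")
    n = len(runs)
    unknown_cases = [r for r in runs if r.expect_unknown]
    successes = sum(r.succeeded for r in runs)
    total_cost = sum(r.cost_usd for r in runs)
    return {
        "citation_rate": sum(r.cited for r in runs) / n,
        # Share of should-be-unknown questions where the model actually declined.
        "unknown_discipline": (
            sum(r.said_unknown for r in unknown_cases) / len(unknown_cases)
            if unknown_cases else None
        ),
        "tool_success_rate": sum(r.tool_ok for r in runs) / n,
        "unsafe_proposal_rate": sum(r.unsafe_proposal for r in runs) / n,
        "latency_p95_ms": p95([r.latency_ms for r in runs]),
        "cost_per_success_usd": total_cost / successes if successes else None,
    }
```

Calling `score` once per candidate over the same golden set yields rows that can go straight into the sponsor table.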

Define task types before model names

Enterprises rarely need one model for everything. They need defaults per task type: conversational Q&A, classification, embedding, and long-document summarisation.

Write task definitions with input limits, expected output shape, and whether citations are mandatory. Link cite-only tasks to the Azure RAG solution guide.

Map each task to evaluation metrics that legal and the business already understand, such as must-cite or must-refuse. Use Anthropic eval guidance for rubric design.

Once the pilot standardises, stop teams from picking models ad hoc through personal API keys. Route requests through AI Gateway or cloud-native routing.

Chat with tools versus chat retrieve-only
Batch classification with fixed label set
Embedding model tied to search index version
Meeting summarisation with redaction rules
Vendor questionnaire extraction with schema validation
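A task definition of the kind described above can live in a small typed structure. This is a sketch under assumptions: the `TaskDefinition` fields, the example task names, the token limits, and the metric names are all hypothetical placeholders to replace with your own.

```python
from dataclasses import dataclass
from enum import Enum


class OutputShape(Enum):
    FREE_TEXT = "free_text"
    LABEL = "label"
    JSON = "json"
    VECTOR = "vector"


@dataclass(frozen=True)
class TaskDefinition:
    name: str
    max_input_tokens: int          # input limit agreed for the task
    output_shape: OutputShape      # expected output shape
    citations_mandatory: bool      # must-cite tasks fail without citations
    metrics: tuple                 # metric names the task is scored on


# Illustrative entries only; limits and metric names are assumptions.
TASKS = [
    TaskDefinition("chat_retrieve_only", 8000, OutputShape.FREE_TEXT, True,
                   ("citation_rate", "unknown_answer_rate")),
    TaskDefinition("batch_classification", 2000, OutputShape.LABEL, False,
                   ("accuracy", "calibration")),
    TaskDefinition("questionnaire_extraction", 16000, OutputShape.JSON, False,
                   ("schema_validity",)),
]


def validate(task):
    """Return a list of problems; an empty list means the definition is usable."""
    problems = []
    if task.max_input_tokens <= 0:
        problems.append("max_input_tokens must be positive")
    if not task.metrics:
        problems.append("at least one metric is required")
    if task.citations_mandatory and "citation_rate" not in task.metrics:
        problems.append("must-cite task needs citation_rate in its metrics")
    return problems
```

Running `validate` over every entry before the workshop catches must-cite tasks that nobody planned to score for citations.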

Contracts and data terms

Legal cares about training use, retention, indemnities, subprocessors, and region. Engineering cares about rate limits, failover, and observability. Resolve both sets of concerns before declaring a default in your Azure Well-Architected standards.

Ask whether prompts and outputs may be used for vendor training, and whether zero-retention options exist for your tier. Review AWS AI compliance and Azure DPAs.

Attach the executed terms to the AWS AI compliance record so reviewers do not chase email threads. Update them when you add gateway failover.

Include data privacy and retention language sponsors can sign without engineering translation.

Data processing agreement and subprocessor list
Retention period for prompts, outputs, and logs. Align with Microsoft Copilot data protection.
Regional availability and cross-border failover
Indemnity and acceptable use policy alignment
Support SLA and incident notification commitments

Primary and fallback

Production needs a fallback model or human queue when the primary is down, rate limited, or degraded. AI Gateway and cloud deploy guides both support routing tables.

Gateways let you switch providers without rewriting application logic, provided your prompts and schemas stay portable. Demo a unified model gateway in selection workshops.

Test failover monthly. Teams that only configure primary routes discover gaps during the first regional outage. Log failover events through Bedrock logging or SDK telemetry.

Document which tasks may run at degraded quality on the fallback and which must pause with a clear user message. When both models refuse, escalate to a human review queue in line with OpenAI safety best practices.

Primary and fallback model IDs in config, not code forks
Queue or ticket path when both models refuse. See OpenAI safety best practices.
Spend caps per route to prevent runaway failover cost
Eval suite run against both models before promotion
Incident comms template when degrading to fallback
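The routing rules above can be expressed as data plus one small decision function. This is a sketch, not the configuration format of AI Gateway or any cloud router: the `Route` fields, the `model-a` and `model-b` IDs, the cap values, and the `HUMAN_QUEUE` sentinel are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

HUMAN_QUEUE = "human_queue"  # hypothetical sentinel for the queue or ticket path


@dataclass
class Route:
    primary: str
    fallback: Optional[str]
    spend_cap_usd: float
    may_degrade: bool        # may this task run on the fallback at lower quality?
    spent_usd: float = 0.0


# Model IDs live in config, not in code forks. Values are placeholders.
ROUTES = {
    "policy_qa": Route("model-a", "model-b", spend_cap_usd=500.0, may_degrade=False),
    "meeting_summary": Route("model-a", "model-b", spend_cap_usd=200.0, may_degrade=True),
}


def choose_target(route, healthy):
    """Pick the model for a request, or the human queue if none is acceptable."""
    if route.spent_usd >= route.spend_cap_usd:
        return HUMAN_QUEUE   # spend cap stops runaway failover cost
    if route.primary in healthy:
        return route.primary
    if route.may_degrade and route.fallback in healthy:
        return route.fallback
    return HUMAN_QUEUE       # task must pause rather than degrade
```

A monthly failover drill can call `choose_target` with the primary removed from `healthy` and confirm each task lands where the documented policy says it should.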

RAG and model choice

For organisational knowledge workloads, retrieval quality often dominates raw model IQ. Fix chunking and citations before comparing models.

Compare models only after chunking, metadata filters, and citation formatting are fixed. Otherwise you tune the wrong variable. Use the Azure RAG solution guide as a baseline.

Measure hallucination rate when retrieval returns low confidence. A stronger model can still invent policy if citations are optional. Apply the guardrails from the reducing hallucinations guide.

Include re-index cost and time in the selection narrative when documents change weekly. Factor embedding refresh into TCO.

Same index version across all model candidates. Demo Azure Foundry RAG.
Citation required in system prompt and eval rubric
Unknown answer when retrieval score below threshold
Regional filter tests for AU-only content
Refresh playbook when legal updates a clause. Link RAG solution guide.
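The unknown-answer and regional-filter checks in this list can be sketched as a guard in front of generation. The `Chunk` shape, the 0.5 threshold, the `AU` region code, and the `generate` callback are assumptions for illustration; real retrieval scores are not comparable across search engines, so the threshold needs tuning per index.

```python
from dataclasses import dataclass

UNKNOWN = "I don't know based on the available documents."


@dataclass
class Chunk:
    doc_id: str
    text: str
    score: float   # retrieval relevance score (scale depends on your search engine)
    region: str    # metadata used for regional filtering


def select_context(chunks, min_score=0.5, region=None):
    """Keep chunks that clear the score threshold and the optional region filter."""
    kept = [
        c for c in chunks
        if c.score >= min_score and (region is None or c.region == region)
    ]
    return sorted(kept, key=lambda c: c.score, reverse=True)


def answer_or_unknown(chunks, generate, min_score=0.5, region=None):
    """Call the model only when retrieval produced usable context."""
    context = select_context(chunks, min_score, region)
    if not context:
        return UNKNOWN   # refuse instead of letting the model invent policy
    return generate(context)
```

Because the guard sits outside the model, every candidate in the bake-off is held to the same unknown-answer rule.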

Safety and tool-use scoring

Include adversarial prompts in the bake-off: prompt injection attempts, requests to exfiltrate secrets, and proposals to email customers without approval.

Score whether the model proposes disallowed tools, and whether guardrails or Content Safety block before execution.

For CRM or ITSM writes, keep the model in propose-only mode and measure that behaviour until the human approval step described in OpenAI safety best practices is wired in. Review OWASP LLM Top 10.

Share safety results with risk teams first, so business sponsors do not see quality improvements in isolation. Attach the results to the security controls review.

Prompt injection cases in golden set
Tool allow list enforced server side
Output filter for PII and secrets. See Microsoft Copilot data protection.
Human approval on write tools
Disable-tools drill recorded for audits
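A server-side gate covering the allow list, human approval on writes, and the disable-tools drill might look like the sketch below. The tool names, the read/write labels, and the `Decision` values are hypothetical; the point is that the check runs on the server regardless of what the model proposes.

```python
from enum import Enum


class Decision(Enum):
    EXECUTE = "execute"
    NEEDS_APPROVAL = "needs_approval"
    BLOCKED = "blocked"


# Hypothetical allow list: tool name -> access mode.
ALLOWED_TOOLS = {
    "search_kb": "read",
    "create_ticket": "write",
}


def gate(tool_name, approved=False, tools_enabled=True):
    """Decide what happens to a tool call the model has proposed."""
    if not tools_enabled:
        return Decision.BLOCKED          # kill switch used in the disable-tools drill
    mode = ALLOWED_TOOLS.get(tool_name)
    if mode is None:
        return Decision.BLOCKED          # not on the allow list
    if mode == "write" and not approved:
        return Decision.NEEDS_APPROVAL   # propose-only until a human signs off
    return Decision.EXECUTE
```

In a bake-off, counting `BLOCKED` decisions per candidate gives the unsafe proposal rate without ever executing a disallowed call.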

Cost and capacity planning

Token price is only part of the story. Include embedding cost, search units, safety API calls, and engineering time for regression evals. Use the unified model gateway for forecasting.

Model size affects latency and user experience in interactive chat. Batch workloads may tolerate slower, cheaper routes. Demo the unified model gateway.

Forecast at 2x and 10x pilot volume so finance can set guardrails before viral internal adoption. Tie the forecast to your OpenAI evals KPIs.

Revisit the forecast when you add tools, longer context, or image inputs.

Cost per successful task, not per thousand tokens alone
Separate batch and interactive budgets
Spend alerts tied to project or cost centre
Monthly review with finops and NIST AI RMF Govern
Document exception process for premium models via NIST AI RMF Govern
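The arithmetic behind cost per successful task and the 2x and 10x forecast is simple enough to sketch. The figures in the usage note are made up, and the forecast assumes cost scales linearly with volume, which ignores volume discounts and rate-limit effects.

```python
def cost_per_successful_task(total_cost_usd, successful_tasks):
    """All-in spend divided by tasks that passed the rubric."""
    if successful_tasks <= 0:
        raise ValueError("no successful tasks to divide by")
    return total_cost_usd / successful_tasks


def forecast(pilot_monthly_tasks, cost_per_task_usd, multipliers=(1, 2, 10)):
    """Monthly cost at each volume multiplier, assuming linear scaling."""
    return {m: pilot_monthly_tasks * m * cost_per_task_usd for m in multipliers}


def breaches(monthly_forecast, monthly_cap_usd):
    """Multipliers at which the forecast exceeds the agreed spend cap."""
    return [m for m, cost in monthly_forecast.items() if cost > monthly_cap_usd]
```

For example, a pilot that spent 30 USD on 120 successful tasks costs 0.25 USD per success; at 1,000 tasks a month, `forecast(1000, 0.25)` gives 250, 500, and 2,500 USD, and `breaches` with a 1,000 USD cap flags only the 10x scenario.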

Workshop facilitation

Run selection workshops with procurement, platform engineering, risk, and a business owner in the same room. Use NIST AI RMF timing for prep.

Show live runs from the golden set on a projector. Hidden spreadsheets breed distrust when results differ from memory. Run Azure prompt flow evaluation side by side.

End with a decision log: default models per task, fallback rules, and who approves exceptions. Store it with the production readiness conversation register.

Schedule a 90-day review because vendor roadmaps and pricing change quickly. Re-run OpenAI evals on each model bump.

Pre-read: task definitions and rubric one week ahead
Live scoring sheet visible to all attendees
No new candidates introduced on workshop day
Decision owner named before adjournment. Escalate to NIST AI RMF Govern if blocked.
Calendar invite for quarterly model register review

When to standardise

Standardise after the pilot proves value on one workflow, not when every team has a different favourite. Gate on the production readiness conversation.

Publish a model register with version, owner, approved tasks, and retirement date. Link it to your Azure Well-Architected decisions.

Allow exceptions through architecture review with a short risk note, not shadow API keys. Log them in AWS AI compliance.

Retire models on a schedule. Old endpoints linger and become security debt. Run regression evals before deprecation.

Default per task type in internal standards
Register entry for each approved deployment
Exception template with sponsor and risk sign-off
Deprecation notice 30 days before switch-off
Regression eval gate on every version bump
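A model register entry with the fields named above, plus the 30-day deprecation notice, can be sketched as follows. The `RegisterEntry` shape and the example values are hypothetical; a real register would more likely live in a database or a reviewed config file.

```python
from dataclasses import dataclass
from datetime import date, timedelta

NOTICE_DAYS = 30  # deprecation notice period from the checklist above


@dataclass(frozen=True)
class RegisterEntry:
    model_id: str
    version: str
    owner: str
    approved_tasks: tuple
    retirement_date: date


def notice_due(entry):
    """Date by which the deprecation notice must go out."""
    return entry.retirement_date - timedelta(days=NOTICE_DAYS)


def is_approved(entry, task, today):
    """A deployment is valid only for approved tasks and before retirement."""
    return task in entry.approved_tasks and today < entry.retirement_date
```

A scheduled job can compare `notice_due` with today's date to send deprecation notices, and a deployment check can call `is_approved` to reject tasks outside the register.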

What good looks like

Good looks like reproducible eval reports checked into CI for every prompt or index change. Follow OpenAI evals patterns.

Good looks like legal, engineering, and procurement citing the same model register and NIST AI RMF Playbook.

Good looks like sponsors understanding fallback and cost caps, not only headline quality scores.

Good looks like treating the reference architectures in vendor documentation as teaching patterns while your production config remains independently reviewed.

Golden set owned by business and engineering jointly
Bake-off results archived with version numbers
Fallback tested in last quarterly drill
No unmanaged API keys in production tenants
Steering committee brief uses task metrics, not hype
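The eval reports checked into CI can double as a regression gate on each prompt, index, or model change. This is a sketch under assumptions: the metric names, the split into higher-is-better and lower-is-better, and the 5% relative tolerance are illustrative, and it is not the API of OpenAI evals or any other framework.

```python
HIGHER_IS_BETTER = {"citation_rate", "tool_success_rate"}
LOWER_IS_BETTER = {"unsafe_proposal_rate", "latency_p95_ms"}


def regressions(baseline, candidate, tolerance=0.05):
    """Return metric names where the candidate is worse than baseline beyond tolerance."""
    failed = []
    for name, base in baseline.items():
        new = candidate.get(name)
        if new is None:
            failed.append(name)   # a missing metric counts as a failure
        elif name in HIGHER_IS_BETTER and new < base * (1 - tolerance):
            failed.append(name)
        elif name in LOWER_IS_BETTER and new > base * (1 + tolerance):
            failed.append(name)
    return failed


def gate_passes(baseline, candidate, tolerance=0.05):
    """True when no tracked metric regressed; use as the CI exit condition."""
    return not regressions(baseline, candidate, tolerance)
```

With a relative tolerance, a baseline unsafe proposal rate of zero allows no increase at all, which is usually the behaviour a risk team wants.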

Common mistakes

Teams choose the model from the best sales demo, then discover retrieval was never built. Fix corpus and chunking first.

Teams fine-tune for facts that change weekly instead of fixing chunking and citations. Read the fit criteria in the fine-tuning guide.

Teams skip legal review on subprocessors until procurement blocks go-live. Attach terms to AWS AI compliance in week one.

Teams standardise too early on a single model for incompatible task types. Split tasks using the task-type definitions from workshop prep.

Comparing models with different retrieval configs
No unknown-answer cases in the eval rubric
Fallback never tested under load
Personal keys bypassing spend caps
Evals run only once before launch
