Vendor and model selection

How to run a defensible bake-off when procurement, platform, and the business each have a favourite model.

For business sponsors and technical leaders

Bake-off on tasks, not demos

A model that shines in a generic demo may fail on your policy corpus, your JSON schema, or your latency budget. Score on golden questions, not keynote transcripts.

Score every candidate on the same golden questions, with the same rubric, retrieval config, and safety settings you will use in production. Follow the production readiness conversation for environment parity.

Separate subjective eloquence from measurable outcomes: citation rate, unknown-answer rate, tool success, and unsafe proposal rate. Align metrics with OpenAI evals.

Publish results to sponsors as a table, not a narrative about which vendor felt smarter in the room. Archive each report alongside its eval runs, in the manner of OpenAI evals.

  • Cite-only RAG: citation rate and unknown-answer discipline
  • Classification: accuracy and calibration on your labels
  • Tool use: success rate and unsafe proposal rate
  • Structured extraction: schema validity on fixed samples
  • Latency p95 and cost per successful task at pilot volume
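As a minimal sketch of how the metrics above could be turned into one table row per candidate, the following assumes a hypothetical `RunResult` record per golden question. The field names, the nearest-rank percentile method, and the sample values are illustrative assumptions, not measured data or a prescribed schema.

```python
from dataclasses import dataclass
from math import ceil


@dataclass
class RunResult:
    """One golden-question run for one candidate (hypothetical record shape)."""
    cited: bool             # answer carried at least one citation
    answered_unknown: bool  # model explicitly declined to answer
    succeeded: bool         # rubric judged the task successful
    latency_ms: float
    cost_usd: float


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    return ordered[ceil(0.95 * len(ordered)) - 1]


def summarise(results):
    """Turn raw runs into one row of the sponsor-facing table."""
    if not results:
        raise ValueError("no runs to summarise")
    n = len(results)
    successes = sum(r.succeeded for r in results)
    total_cost = sum(r.cost_usd for r in results)
    return {
        "citation_rate": sum(r.cited for r in results) / n,
        "unknown_answer_rate": sum(r.answered_unknown for r in results) / n,
        "success_rate": successes / n,
        "latency_p95_ms": p95([r.latency_ms for r in results]),
        # None when nothing succeeded, so the table shows a gap, not a zero.
        "cost_per_success_usd": total_cost / successes if successes else None,
    }


# Illustrative runs only; not measured data.
sample_runs = [
    RunResult(True, False, True, 800.0, 0.02),
    RunResult(True, False, True, 1200.0, 0.03),
    RunResult(False, True, False, 400.0, 0.01),
    RunResult(True, False, True, 2000.0, 0.04),
]
sample_report = summarise(sample_runs)
```

Dividing total cost by successful tasks, rather than by all tasks, keeps a cheap model that fails often from looking like a bargain.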

Define task types before model names

Enterprises rarely need one model for everything. They need defaults per task type: conversational Q&A, classification, embedding, and long-document summarisation.

Write task definitions with input limits, expected output shape, and whether citations are mandatory. Link cite-only tasks to the Azure RAG solution guide.

Map each task to evaluation metrics that legal and the business already understand, such as must-cite or must-refuse. Use Anthropic eval guidance for rubric design.

Once the pilot has set standards, do not let teams pick models ad hoc via personal API keys. Route requests through AI Gateway or cloud-native routing.

  • Chat with tools versus chat retrieve-only
  • Batch classification with fixed label set
  • Embedding model tied to search index version
  • Meeting summarisation with redaction rules
  • Vendor questionnaire extraction with schema validation
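One way to keep task definitions reviewable is to hold them as data and validate them before any model name is discussed. The sketch below is an assumption about shape only: `TaskDefinition`, its fields, the token limits, and the metric names are hypothetical placeholders to be replaced with your own.

```python
from dataclasses import dataclass

OUTPUT_SHAPES = {"free_text", "label", "json", "vector"}


@dataclass(frozen=True)
class TaskDefinition:
    name: str
    max_input_tokens: int      # input limit agreed with the business
    output_shape: str          # one of OUTPUT_SHAPES
    citations_mandatory: bool
    metrics: tuple             # metric names the rubric will report


def validate(task):
    """Return a list of problems; an empty list means the definition is usable."""
    problems = []
    if task.max_input_tokens <= 0:
        problems.append("max_input_tokens must be positive")
    if task.output_shape not in OUTPUT_SHAPES:
        problems.append(f"unknown output_shape: {task.output_shape}")
    if not task.metrics:
        problems.append("at least one metric is required")
    if task.citations_mandatory and "citation_rate" not in task.metrics:
        problems.append("cite-only task must report citation_rate")
    return problems


# Hypothetical entries; limits and metric names are placeholders.
TASKS = [
    TaskDefinition("policy_qa_retrieve_only", 8000, "free_text", True,
                   ("citation_rate", "unknown_answer_rate")),
    TaskDefinition("batch_classification", 2000, "label", False,
                   ("accuracy", "calibration_error")),
    TaskDefinition("vendor_questionnaire_extraction", 16000, "json", False,
                   ("schema_validity",)),
]

# A cite-only task that forgot its citation metric is caught before the bake-off.
bad_task = TaskDefinition("meeting_summary", 32000, "free_text", True,
                          ("redaction_pass_rate",))
```

Running `validate` over the whole list in a pre-read check surfaces gaps such as a must-cite task with no citation metric before workshop day.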

Contracts and data terms

Legal cares about training use, retention, indemnities, subprocessors, and region. Engineering cares about rate limits, failover, and observability. Resolve both sets of concerns before declaring a default in your Azure Well-Architected standards.

Ask whether prompts and outputs may be used for vendor training, and whether zero-retention options exist for your tier. Review AWS AI compliance and Azure DPAs.

Attach the executed terms to the AWS AI compliance record so reviewers do not chase email threads. Update the record when you add gateway failover.

Include data privacy and retention language sponsors can sign without engineering translation.

Primary and fallback

Production needs a fallback model or human queue when the primary is down, rate limited, or degraded. AI Gateway and cloud deploy guides both support routing tables.

Gateways let you switch providers without rewriting application logic, provided your prompts and schemas stay portable. Demo the unified model gateway in selection workshops.

Test failover monthly. Teams that only configure primary routes discover gaps during the first regional outage. Log failover events per Bedrock logging or SDK telemetry.

Document which tasks may run at degraded quality on the fallback and which must pause with a clear user message. When both models refuse, escalate to a human queue in line with OpenAI safety best practices.

  • Primary and fallback model IDs in config, not code forks
  • Queue or ticket path when both models refuse. See OpenAI safety best practices.
  • Spend caps per route to prevent runaway failover cost
  • Eval suite run against both models before promotion
  • Incident comms template when degrading to fallback
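The checklist above can be sketched as a small routing function. This is an illustration under stated assumptions, not the API of any gateway product: `Route`, `ModelUnavailable`, `call_with_fallback`, the model IDs, and the cost figures are all hypothetical, and `fake_client` stands in for a real SDK call.

```python
from dataclasses import dataclass


class ModelUnavailable(Exception):
    """Raised by the caller's client when a model is down, rate limited, or degraded."""


@dataclass
class Route:
    primary: str
    fallback: str
    may_degrade: bool            # False means pause rather than fall back
    fallback_cap_usd: float      # spend cap for this route's fallback traffic
    fallback_spent_usd: float = 0.0


PAUSED = "This feature is temporarily paused. Your request has been queued for a person."


def call_with_fallback(route, prompt, call_model, est_cost_usd, events):
    """Try the primary; on failure, fall back only if policy and budget allow."""
    try:
        return call_model(route.primary, prompt)
    except ModelUnavailable:
        events.append(("primary_failed", route.primary))
    if not route.may_degrade:
        events.append(("paused", "fallback not permitted for this task"))
        return PAUSED
    if route.fallback_spent_usd + est_cost_usd > route.fallback_cap_usd:
        events.append(("paused", "fallback spend cap reached"))
        return PAUSED
    route.fallback_spent_usd += est_cost_usd
    events.append(("fallback_used", route.fallback))
    try:
        return call_model(route.fallback, prompt)
    except ModelUnavailable:
        events.append(("paused", "fallback also unavailable"))
        return PAUSED


def fake_client(model_id, prompt):
    """Stand-in for a real SDK call: the primary is always down."""
    if model_id == "primary-model":
        raise ModelUnavailable(model_id)
    return f"[{model_id}] answer to: {prompt}"


events = []
chat_route = Route("primary-model", "fallback-model", True, fallback_cap_usd=1.0)
reply = call_with_fallback(chat_route, "What is our leave policy?", fake_client, 0.4, events)
```

Keeping model IDs, the degrade flag, and the cap in a `Route` value means a failover drill changes configuration only, and the `events` list gives the drill something concrete to log.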

RAG and model choice

For organisational knowledge workloads, retrieval quality often dominates raw model IQ. Fix chunking and citations before comparing models.

Compare models only after chunking, metadata filters, and citation formatting are fixed. Otherwise you tune the wrong variable. Use the Azure RAG solution guide as a bar.

Measure hallucination rate when retrieval returns low confidence. A stronger model can still invent policy if citations are optional. Apply guardrails for reducing hallucinations.

Include re-index cost and time in the selection narrative when documents change weekly. Factor embedding refresh into TCO.

  • Same index version across all model candidates. Demo Azure Foundry RAG.
  • Citation required in system prompt and eval rubric
  • Unknown answer when retrieval score below threshold
  • Regional filter tests for AU-only content
  • Refresh playbook when legal updates a clause. Link RAG solution guide.
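The "unknown answer below threshold" rule can be enforced outside the model, so it behaves identically for every candidate. The sketch below assumes a hypothetical hit shape (`doc_id`, `score`, `text`) and a placeholder threshold of 0.5; both need to be set from your own retriever and golden set.

```python
UNKNOWN = "I can't answer that from the approved sources."


def answer_or_unknown(hits, generate, min_score=0.5):
    """Answer only from hits at or above the threshold; otherwise return UNKNOWN.

    hits: list of dicts with "doc_id", "score" and "text" (hypothetical shape).
    generate: callable that drafts an answer from the kept passages.
    min_score: placeholder threshold; tune it on your own golden set.
    """
    kept = [h for h in hits if h["score"] >= min_score]
    if not kept:
        return {"answer": UNKNOWN, "citations": []}
    return {
        "answer": generate([h["text"] for h in kept]),
        "citations": [h["doc_id"] for h in kept],
    }


def stub_generate(passages):
    """Stand-in for the model call."""
    return " ".join(passages)


weak_hits = [{"doc_id": "hr-12", "score": 0.21, "text": "Unrelated clause."}]
strong_hits = [{"doc_id": "hr-07", "score": 0.83, "text": "Leave accrues monthly."}] + weak_hits
```

Because the gate runs before generation, the model never sees low-confidence passages, and every answer that is produced carries the IDs of the passages it was drafted from.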

Safety and tool-use scoring

Include adversarial prompts in the bake-off: prompt injection attempts, requests to exfiltrate secrets, and proposals to email customers without approval.

Score whether the model proposes disallowed tools, and whether guardrails or Content Safety block before execution.

For CRM or ITSM writes, measure propose-only behaviour until human approval, as recommended in OpenAI safety best practices, is wired in. Review the OWASP LLM Top 10.

Share safety results with risk teams before business sponsors see only quality improvements. Attach to security controls review.
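A rough sketch of the tool-use safety score follows. It assumes each adversarial run is logged with the tools the model proposed and whether a guardrail blocked the call before execution; the allow-list, tool names, and log shape are hypothetical.

```python
ALLOWED_TOOLS = {"search_kb", "draft_reply"}  # hypothetical allow-list


def score_adversarial(runs):
    """Score adversarial runs.

    runs: list of dicts with "proposed_tools" (list of str) and
    "blocked_before_execution" (bool); a hypothetical log shape.
    """
    if not runs:
        raise ValueError("no adversarial runs to score")
    # A run is unsafe if it proposes any tool outside the allow-list.
    unsafe = [r for r in runs if set(r["proposed_tools"]) - ALLOWED_TOOLS]
    blocked = [r for r in unsafe if r["blocked_before_execution"]]
    return {
        "unsafe_proposal_rate": len(unsafe) / len(runs),
        # None when there were no unsafe proposals to block.
        "blocked_share_of_unsafe": len(blocked) / len(unsafe) if unsafe else None,
    }


# Illustrative log entries only.
adversarial_runs = [
    {"proposed_tools": ["search_kb"], "blocked_before_execution": False},
    {"proposed_tools": ["send_email"], "blocked_before_execution": True},
    {"proposed_tools": ["search_kb", "export_secrets"], "blocked_before_execution": False},
    {"proposed_tools": [], "blocked_before_execution": False},
]
safety_report = score_adversarial(adversarial_runs)
```

Reporting the two numbers separately shows risk teams both how often the model misbehaves and how much of that the guardrails actually catch.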

Cost and capacity planning

Token price is only part of the story. Include embedding cost, search units, safety API calls, and engineering time for regression evals. Use unified model gateway forecasting.

Model size affects latency and user experience in interactive chat. Batch workloads may tolerate slower, cheaper routes. Demo unified model gateway.

Forecast at 2x and 10x pilot volume so finance can set guardrails before viral internal adoption. Tie to OpenAI evals KPIs.

Revisit the forecast when you add tools, longer context, or image inputs.
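A back-of-envelope forecast at pilot, 2x, and 10x volume might look like the sketch below. Every figure is a placeholder rather than vendor pricing, and the model assumes token and safety-call costs scale with requests while search units, embedding refresh, and eval runs are treated as a fixed monthly amount, which will not hold at every scale.

```python
def monthly_cost_usd(requests, tokens_in, tokens_out,
                     usd_per_m_in, usd_per_m_out,
                     usd_per_safety_call=0.0, fixed_usd=0.0):
    """Estimate monthly spend from per-request token counts and per-million-token prices."""
    token_cost = requests * (tokens_in * usd_per_m_in + tokens_out * usd_per_m_out) / 1_000_000
    return token_cost + requests * usd_per_safety_call + fixed_usd


def forecast(pilot_requests, multipliers=(1, 2, 10), **assumptions):
    """Map each volume multiplier to an estimated monthly cost."""
    return {m: monthly_cost_usd(pilot_requests * m, **assumptions) for m in multipliers}


# Placeholder figures, not vendor pricing.
pilot_forecast = forecast(
    10_000,
    tokens_in=2_000,
    tokens_out=500,
    usd_per_m_in=1.0,
    usd_per_m_out=4.0,
    usd_per_safety_call=0.001,
    fixed_usd=200.0,  # search units, embedding refresh, regression eval runs
)

for multiplier, cost in pilot_forecast.items():
    print(f"{multiplier}x pilot volume: about ${cost:,.0f} per month")
```

Adding tools, longer context, or image inputs changes `tokens_in` and `tokens_out`, which is why the forecast needs re-running when those arrive.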

Workshop facilitation

Run selection workshops with procurement, platform engineering, risk, and a business owner in the same room. Use NIST AI RMF timing for prep.

Show live runs from the golden set on a projector. Hidden spreadsheets breed distrust when results differ from memory. Run Azure prompt flow evaluation side by side.

End with a decision log: default models per task, fallback rules, and who approves exceptions. Store it with the production readiness conversation register.

Schedule a 90-day review because vendor roadmaps and pricing change quickly. Re-run OpenAI evals on each model bump.

  • Pre-read: task definitions and rubric one week ahead
  • Live scoring sheet visible to all attendees
  • No new candidates introduced on workshop day
  • Decision owner named before adjournment. Escalate to NIST AI RMF Govern if blocked.
  • Calendar invite for quarterly model register review

When to standardise

Standardise after the pilot proves value on one workflow, not when every team has a different favourite. Gate standardisation on the production readiness conversation.

Publish a model register with version, owner, approved tasks, and retirement date. Link to Azure Well-Architected decisions.

Allow exceptions through architecture review with a short risk note, not shadow API keys. Log in AWS AI compliance.

Retire models on a schedule. Old endpoints linger and become security debt. Run regression evals before deprecation.

  • Default per task type in internal standards
  • Register entry for each approved deployment
  • Exception template with sponsor and risk sign-off
  • Deprecation notice 30 days before switch-off
  • Regression eval gate on every version bump
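A model register can start as a few typed records. The sketch below assumes a hypothetical `RegisterEntry` carrying the fields named above and derives the 30-day deprecation notice date from the retirement date; the model ID, owner, and dates are illustrative.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class RegisterEntry:
    model_id: str
    version: str
    owner: str
    approved_tasks: tuple
    retirement_date: date


def notice_due(entry, notice_days=30):
    """Date by which the deprecation notice must go out."""
    return entry.retirement_date - timedelta(days=notice_days)


def is_approved(entry, task, today):
    """A deployment is approved only for listed tasks and only before retirement."""
    return task in entry.approved_tasks and today < entry.retirement_date


# Hypothetical entry for illustration.
entry = RegisterEntry(
    model_id="example-chat-model",
    version="2025-01",
    owner="platform-engineering",
    approved_tasks=("policy_qa_retrieve_only", "meeting_summarisation"),
    retirement_date=date(2025, 6, 30),
)
```

An exception request then becomes a proposed new entry with a sponsor and risk sign-off attached, rather than a shadow API key.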

What good looks like

Good looks like reproducible eval reports checked into CI for every prompt or index change. Follow OpenAI evals patterns.

Good looks like legal, engineering, and procurement citing the same model register and NIST AI RMF Playbook.

Good looks like sponsors understanding fallback and cost caps, not only headline quality scores.

Good looks like treating the reference architectures in vendor documentation as teaching patterns while your production config remains independently reviewed.

  • Golden set owned by business and engineering jointly
  • Bake-off results archived with version numbers
  • Fallback tested in last quarterly drill
  • No unmanaged API keys in production tenants
  • Steering committee brief uses task metrics, not hype
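A regression gate for CI could be as small as the comparison below. It is a sketch under assumptions: the metric names, the direction sets, and the 2% relative tolerance are placeholders, and a real pipeline would load `baseline` and `candidate` from archived eval reports.

```python
HIGHER_IS_BETTER = {"citation_rate", "success_rate"}
LOWER_IS_BETTER = {"unsafe_proposal_rate", "latency_p95_ms"}


def regressions(baseline, candidate, tolerance=0.02):
    """Return the metrics where the candidate is worse than baseline beyond a relative tolerance."""
    failed = []
    for name, base in baseline.items():
        new = candidate.get(name)
        if new is None:
            failed.append(name)  # a missing metric counts as a failure
        elif name in HIGHER_IS_BETTER and new < base * (1 - tolerance):
            failed.append(name)
        elif name in LOWER_IS_BETTER and new > base * (1 + tolerance):
            failed.append(name)
    return sorted(failed)


# Illustrative numbers only.
baseline = {"citation_rate": 0.90, "success_rate": 0.80,
            "unsafe_proposal_rate": 0.00, "latency_p95_ms": 1500}
candidate = {"citation_rate": 0.91, "success_rate": 0.70,
             "unsafe_proposal_rate": 0.01, "latency_p95_ms": 1510}
failures = regressions(baseline, candidate)
```

In a CI job, a non-empty `failures` list would fail the build, so a prompt, index, or model version change cannot be promoted without someone looking at the regressed metrics.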

Common mistakes

Teams choose the model from the best sales demo, then discover retrieval was never built. Fix corpus and chunking first.

Teams fine-tune for facts that change weekly instead of fixing chunking and citations. Read the fit criteria in the fine-tuning guide.

Teams skip legal review on subprocessors until procurement blocks go-live. Attach terms to AWS AI compliance in week one.

Teams standardise too early on a single model for incompatible task types. Split tasks using the task-type definitions from workshop prep.
