For business sponsors
Executive summary
Skim this before the full guide. Technical detail follows in the sections below.
- Decision: Which default models and vendors are approved per task type.
- Primary metric: Golden-set scores, latency p95, and cost per successful task on your data.
- Stop rule: Reject models that win demos but fail your rubric or data-processing terms.
Related worked example
Multi-model routing via unified gateway
Bake-off on tasks, not demos
A model that shines in a generic demo may fail on your policy corpus, your JSON schema, or your latency budget. Score on golden questions, not keynote transcripts.
Score candidates on the same golden questions with the same rubric, retrieval config, and safety settings you will use in production. Follow production readiness conversation for environment parity.
Separate subjective eloquence from measurable outcomes: citation rate, unknown-answer rate, tool success, and unsafe proposal rate. Align metrics with OpenAI evals.
Publish results to sponsors as a table, not a narrative about which vendor felt smarter in the room. Archive reports like the OpenAI evals.
- Cite-only RAG: citation rate and unknown-answer discipline
- Classification: accuracy and calibration on your labels
- Tool use: success rate and unsafe proposal rate
- Structured extraction: schema validity on fixed samples
- Latency p95 and cost per successful task at pilot volume
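The last two measures in the list above are easy to compute inconsistently, so it can help to pin down the arithmetic before the bake-off. The following is a minimal sketch, assuming each golden-set run is recorded with a latency, a cost, and a rubric pass/fail flag. The `RunResult` shape and the nearest-rank percentile method are illustrative assumptions, not a prescribed standard.

```python
import math
from dataclasses import dataclass


@dataclass
class RunResult:
    """One golden-set run for one candidate (hypothetical record shape)."""
    latency_ms: float
    cost_usd: float
    succeeded: bool  # True when the run passed the rubric for its task type


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list of numbers."""
    if not values:
        raise ValueError("p95 needs at least one value")
    ordered = sorted(values)
    # Integer arithmetic for ceil(0.95 * n) avoids floating-point surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def cost_per_successful_task(runs):
    """Total spend divided by successes; failed runs still count as spend."""
    successes = sum(1 for r in runs if r.succeeded)
    if successes == 0:
        return math.inf
    return sum(r.cost_usd for r in runs) / successes
```

Dividing total spend by successes, rather than by all runs, means a cheap model that fails often can end up costing more per useful result than a pricier one that rarely fails.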
Define task types before model names
Enterprises rarely need one model for everything. They need defaults per task type: conversational Q&A, classification, embedding, and long-document summarisation.
Write task definitions with input limits, expected output shape, and whether citations are mandatory. Link cite-only tasks to the Azure RAG solution guide.
Map each task to evaluation metrics that legal and the business already understand, such as must-cite or must-refuse. Use Anthropic eval guidance for rubric design.
Once the pilot has set a standard, do not let teams pick models ad hoc via personal API keys. Route through AI Gateway or cloud-native routing.
- Chat with tools versus chat retrieve-only
- Batch classification with fixed label set
- Embedding model tied to search index version
- Meeting summarisation with redaction rules
- Vendor questionnaire extraction with schema validation
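Task definitions are easier to review when they are written down as data rather than prose. The sketch below is one possible shape, assuming a small fixed set of output shapes and metric names. Every field name, token limit, and metric label here is hypothetical and should be replaced with your own.

```python
from dataclasses import dataclass

# Hypothetical set of output shapes; extend to match your own task types.
ALLOWED_SHAPES = {"free_text", "label", "json", "vector"}


@dataclass(frozen=True)
class TaskDefinition:
    name: str
    max_input_tokens: int      # input limit agreed for this task
    output_shape: str          # one of ALLOWED_SHAPES
    citations_required: bool   # True for cite-only tasks
    metrics: tuple             # metric names the rubric will score


def validate(task):
    """Return a list of problems; an empty list means the definition is usable."""
    problems = []
    if task.max_input_tokens <= 0:
        problems.append("input limit must be positive")
    if task.output_shape not in ALLOWED_SHAPES:
        problems.append(f"unknown output shape: {task.output_shape}")
    if not task.metrics:
        problems.append("at least one metric is required")
    if task.citations_required and "citation_rate" not in task.metrics:
        problems.append("cite-only tasks must score citation_rate")
    return problems


# Illustrative entries only; the limits are placeholders, not recommendations.
TASKS = [
    TaskDefinition("chat_retrieve_only", 8000, "free_text", True,
                   ("citation_rate", "unknown_answer_rate")),
    TaskDefinition("batch_classification", 2000, "label", False,
                   ("accuracy", "calibration")),
    TaskDefinition("vendor_questionnaire_extraction", 16000, "json", False,
                   ("schema_validity",)),
]
```

Running `validate` over the list in the workshop pre-read catches tasks that claim to be cite-only but have no citation metric attached.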
Contracts and data terms
Legal cares about training use, retention, indemnities, subprocessors, and region. Engineering cares about rate limits, failover, and observability. Resolve both before declaring a default in Azure Well-Architected standards.
Ask whether prompts and outputs may be used for vendor training, and whether zero-retention options exist for your tier. Review AWS AI compliance and Azure DPAs.
Attach the executed terms to AWS AI compliance so reviewers do not chase email threads. Update them when you add gateway failover.
Include data privacy and retention language sponsors can sign without engineering translation.
- Data processing agreement and subprocessor list
- Retention period for prompts, outputs, and logs. Align with Microsoft Copilot data protection.
- Regional availability and cross-border failover
- Indemnity and acceptable use policy alignment
- Support SLA and incident notification commitments
Primary and fallback
Production needs a fallback model or human queue when the primary is down, rate limited, or degraded. AI Gateway and cloud deploy guides both support routing tables.
Gateways let you switch providers without rewriting application logic, provided your prompts and schemas stay portable. Demo unified model gateway in selection workshops.
Test failover monthly. Teams that only configure primary routes discover gaps during the first regional outage. Log failover events per Bedrock logging or SDK telemetry.
Document which tasks may degrade quality in fallback versus which must pause with a clear user message. Escalate to OpenAI safety best practices queues when both models refuse.
- Primary and fallback model IDs in config, not code forks
- Queue or ticket path when both models refuse. See OpenAI safety best practices.
- Spend caps per route to prevent runaway failover cost
- Eval suite run against both models before promotion
- Incident comms template when degrading to fallback
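The checklist above can be expressed as a small routing table plus one function. This is a sketch under stated assumptions: the model IDs, the `ModelUnavailable` exception, the `call_model` callable, and the event tuple format are all hypothetical stand-ins for whatever your gateway or SDK provides.

```python
from dataclasses import dataclass
from typing import Optional


class ModelUnavailable(Exception):
    """Hypothetical error raised when a model is down, rate limited, or degraded."""


@dataclass(frozen=True)
class Route:
    primary: str
    fallback: Optional[str]
    may_degrade: bool  # False means the task must pause rather than fall back


# Routes live in config, not in code forks. The IDs are placeholders.
ROUTES = {
    "chat_retrieve_only": Route("model-a", "model-b", may_degrade=True),
    "vendor_extraction": Route("model-a", "model-b", may_degrade=False),
}

PAUSE_MESSAGE = "This service is temporarily paused. Your request has been queued."


def complete(task, prompt, call_model, events):
    """Try the primary, then the fallback if allowed; record every failover event."""
    route = ROUTES[task]
    try:
        return call_model(route.primary, prompt)
    except ModelUnavailable:
        events.append((task, route.primary, "primary_unavailable"))
    if route.fallback is None or not route.may_degrade:
        events.append((task, route.primary, "paused"))
        return PAUSE_MESSAGE
    try:
        return call_model(route.fallback, prompt)
    except ModelUnavailable:
        events.append((task, route.fallback, "fallback_unavailable"))
        return PAUSE_MESSAGE
```

A monthly failover test can call `complete` with a `call_model` stub that raises for the primary and then check that the expected events were recorded.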
RAG and model choice
For organisational knowledge workloads, retrieval quality often dominates raw model IQ.
Compare models only after chunking, metadata filters, and citation formatting are fixed. Otherwise you tune the wrong variable. Use the Azure RAG solution guide as a bar.
Measure hallucination rate when retrieval returns low confidence. A stronger model can still invent policy if citations are optional. Apply reducing hallucinations guardrails.
Include re-index cost and time in the selection narrative when documents change weekly. Factor embedding refresh into TCO.
- Same index version across all model candidates. Demo Azure Foundry RAG.
- Citation required in system prompt and eval rubric
- Unknown answer when retrieval score below threshold
- Regional filter tests for AU-only content
- Refresh playbook when legal updates a clause. Link RAG solution guide.
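Three of the checks above (required citations, an unknown answer below a retrieval threshold, and a regional filter) can be enforced outside the model. The sketch below assumes retrieval returns scored chunks tagged with a region. The `Chunk` shape, the 0.5 threshold, the `[doc_id]` citation format, and the `generate` callable are illustrative assumptions.

```python
from dataclasses import dataclass

UNKNOWN = "I don't know based on the approved sources."


@dataclass
class Chunk:
    doc_id: str
    text: str
    score: float  # retrieval confidence, assumed to be in the range 0 to 1
    region: str


def select_context(chunks, min_score=0.5, region="AU"):
    """Keep only chunks that clear the score threshold and match the region."""
    return [c for c in chunks if c.score >= min_score and c.region == region]


def answer(question, chunks, generate, min_score=0.5, region="AU"):
    """Return the model's draft only if retrieval was usable and a source is cited."""
    context = select_context(chunks, min_score, region)
    if not context:
        return UNKNOWN  # low-confidence retrieval: do not ask the model at all
    draft = generate(question, context)
    cited = any(f"[{c.doc_id}]" in draft for c in context)
    return draft if cited else UNKNOWN
```

Because the guard sits outside the model, every candidate in the bake-off is held to the same unknown-answer rule regardless of how well it follows the system prompt.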
Safety and tool-use scoring
Include adversarial prompts in the bake-off: prompt injection attempts, requests to exfiltrate secrets, and proposals to email customers without approval.
Score whether the model proposes disallowed tools, and whether guardrails or Content Safety block before execution.
For CRM or ITSM writes, measure propose-only behaviour until OpenAI safety best practices is wired. Review OWASP LLM Top 10.
Share safety results with risk teams before business sponsors see only quality improvements. Attach to security controls review.
- Prompt injection cases in golden set
- Tool allow list enforced server side
- Output filter for PII and secrets. See Microsoft Copilot data protection.
- Human approval on write tools
- Disable-tools drill recorded for audits
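A server-side allow list and a human approval step on write tools can be sketched in a few lines. The tool names and the read/write split below are hypothetical. The point is that the decision is made by your code, not by the model.

```python
# Hypothetical allow list: tool name mapped to access mode.
ALLOWED_TOOLS = {
    "search_kb": "read",
    "create_ticket": "write",
}


def review_tool_call(tool_name, approved_by=None):
    """Decide server side whether a proposed tool call may run."""
    mode = ALLOWED_TOOLS.get(tool_name)
    if mode is None:
        return "blocked"          # not on the allow list at all
    if mode == "write" and approved_by is None:
        return "needs_approval"   # propose-only until a human signs off
    return "allowed"


def unsafe_proposal_rate(proposed_tools):
    """Share of proposed tool calls that name a tool outside the allow list."""
    if not proposed_tools:
        return 0.0
    unsafe = sum(1 for name in proposed_tools if name not in ALLOWED_TOOLS)
    return unsafe / len(proposed_tools)
```

Running the adversarial prompts from the golden set through each candidate and feeding the proposed tool names to `unsafe_proposal_rate` gives a comparable safety figure per model.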
Cost and capacity planning
Token price is only part of the story. Include embedding cost, search units, safety API calls, and engineering time for regression evals. Use unified model gateway forecasting.
Model size affects latency and user experience in interactive chat. Batch workloads may tolerate slower, cheaper routes. Demo unified model gateway.
Forecast at 2x and 10x pilot volume so finance can set guardrails before viral internal adoption. Tie to OpenAI evals KPIs.
Revisit the forecast when you add tools, longer context, or image inputs.
- Cost per successful task, not per thousand tokens alone
- Separate batch and interactive budgets
- Spend alerts tied to project or cost centre
- Monthly review with finops and NIST AI RMF Govern
- Document exception process for premium models via NIST AI RMF Govern
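A simple forecast at pilot, 2x, and 10x volume is enough to start the finance conversation. The sketch below assumes per-task variable cost scales linearly with volume and that fixed costs (search units, eval engineering time) stay constant. Both are simplifying assumptions, and the function and parameter names are hypothetical.

```python
def monthly_forecast(tasks_per_month, success_rate, variable_cost_per_task_usd,
                     fixed_monthly_usd, multipliers=(1, 2, 10)):
    """Forecast total spend and cost per successful task at each volume multiple."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    rows = []
    for m in multipliers:
        volume = tasks_per_month * m
        # Assumption: variable cost is linear in volume; fixed cost does not move.
        total = volume * variable_cost_per_task_usd + fixed_monthly_usd
        successes = volume * success_rate
        rows.append({
            "multiplier": m,
            "total_usd": total,
            "cost_per_success_usd": total / successes,
        })
    return rows
```

With 10,000 pilot tasks a month, an 80% success rate, 0.02 USD variable cost per task, and 500 USD of fixed cost, the total comes to 700 USD at pilot volume and 2,500 USD at 10x, while cost per successful task falls as the fixed cost is spread over more tasks.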
Workshop facilitation
Run selection workshops with procurement, platform engineering, risk, and a business owner in the same room. Use NIST AI RMF timing for prep.
Show live runs from the golden set on a projector. Hidden spreadsheets breed distrust when results differ from memory. Run Azure prompt flow evaluation side by side.
End with a decision log: default models per task, fallback rules, and who approves exceptions. Store with production readiness conversation register.
Schedule a 90-day review because vendor roadmaps and pricing change quickly. Re-run OpenAI evals on each model bump.
- Pre-read: task definitions and rubric one week ahead
- Live scoring sheet visible to all attendees
- No new candidates introduced on workshop day
- Decision owner named before adjournment. Escalate to NIST AI RMF Govern if blocked.
- Calendar invite for quarterly model register review
When to standardise
Standardise after the pilot proves value on one workflow, not when every team has a different favourite. Gate on production readiness conversation.
Publish a model register with version, owner, approved tasks, and retirement date. Link to Azure Well-Architected decisions.
Allow exceptions through architecture review with a short risk note, not shadow API keys. Log in AWS AI compliance.
Retire models on a schedule. Old endpoints linger and become security debt. Run regression evals before deprecation.
- Default per task type in internal standards
- Register entry for each approved deployment
- Exception template with sponsor and risk sign-off
- Deprecation notice 30 days before switch-off
- Regression eval gate on every version bump
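A register entry with the fields named above (version, owner, approved tasks, retirement date) can be kept as structured data so the deprecation notice date is derived rather than remembered. This is a minimal sketch. The field names and example values are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta

NOTICE_DAYS = 30  # deprecation notice period from the checklist above


@dataclass(frozen=True)
class RegisterEntry:
    model_id: str
    version: str
    owner: str
    approved_tasks: tuple
    retirement_date: date


def notice_due(entry):
    """Date by which the deprecation notice must be sent."""
    return entry.retirement_date - timedelta(days=NOTICE_DAYS)


def is_approved(entry, task, today):
    """A model is usable for a task only if listed and not yet retired."""
    return task in entry.approved_tasks and today < entry.retirement_date


# Illustrative entry only.
ENTRY = RegisterEntry(
    model_id="model-a",
    version="2025-01",
    owner="platform-engineering",
    approved_tasks=("chat_retrieve_only", "batch_classification"),
    retirement_date=date(2025, 7, 1),
)
```

A gateway or CI check can call `is_approved` before routing traffic, which turns the register from a document into an enforced control.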
What good looks like
Good looks like reproducible eval reports checked into CI for every prompt or index change. Follow OpenAI evals patterns.
Good looks like legal, engineering, and procurement citing the same model register and NIST AI RMF Playbook.
Good looks like sponsors understanding fallback and cost caps, not only headline quality scores.
Good looks like treating reference architectures in vendor documentation as teaching patterns while your production config remains independently reviewed.
- Golden set owned by business and engineering jointly
- Bake-off results archived with version numbers
- Fallback tested in last quarterly drill
- No unmanaged API keys in production tenants
- Steering committee brief uses task metrics, not hype
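A regression gate in CI can be as simple as comparing the new eval scores with the archived baseline and failing the build on any meaningful drop. The sketch below assumes every metric is "higher is better" and uses a hypothetical tolerance of 0.02. Invert lower-is-better metrics such as unsafe proposal rate before passing them in.

```python
def regression_gate(baseline, candidate, tolerance=0.02):
    """Return a list of failures; an empty list means the change may be promoted.

    Assumption: all metrics are higher-is-better scores on the same golden set.
    """
    failures = []
    for metric, base_score in baseline.items():
        new_score = candidate.get(metric)
        if new_score is None:
            failures.append(f"{metric}: missing from candidate report")
        elif new_score < base_score - tolerance:
            failures.append(f"{metric}: {base_score:.3f} -> {new_score:.3f}")
    return failures
```

In a CI job, a non-empty result would typically be printed and followed by a non-zero exit code so the prompt, index, or model change cannot merge unreviewed.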
Common mistakes
Teams choose the model from the best sales demo, then discover retrieval was never built. Fix corpus and chunking first.
Teams fine-tune for facts that change weekly instead of fixing chunking and citations. Read the fine-tuning guide's fit criteria.
Teams skip legal review on subprocessors until procurement blocks go-live. Attach terms to AWS AI compliance in week one.
Teams standardise too early on a single model for incompatible task types. Split tasks using the task-type definitions from workshop prep.
- Comparing models with different retrieval configs
- No unknown-answer cases in the eval rubric
- Fallback never tested under load
- Personal keys bypassing spend caps
- Evals run only once before launch