For business sponsors
Executive summary
Skim this before the full guide. Technical detail follows in the sections below.
- Decision: whether organisational knowledge should live in retrieval or custom model weights.
- Primary metric: citation rate and time to update answers when source policy changes.
- Stop rule: do not fine-tune for facts that change monthly; prefer RAG unless task shape is fixed.
Related worked example
Enterprise RAG on Azure AI Foundry
01
What RAG actually does
Retrieval-augmented generation (Azure RAG concepts) keeps knowledge outside the model. You chunk documents, embed them with a model such as OpenAI embeddings, store vectors in a search index, retrieve the top matches at query time, and pass those passages into the prompt. When policy changes, you re-index; you do not retrain.
Production RAG is more than uploading PDFs. You need chunking strategy, metadata filters (region, product line, effective date), citation formatting, and an explicit unknown answer when retrieval confidence is low. Microsoft's Azure RAG solution guide documents how those pieces fit together.
The model's job in a well-designed RAG system is synthesis and refusal, not memorisation of your corpus. That separation is what makes answers auditable and refreshable. Pair cite-only behaviour with the reducing hallucinations guidance when evals show fluent wrong answers.
Workshop participants often conflate "we added documents" with "we built RAG." Use a simple diagram: source systems, index, retrieve, cite, generate. If any box is missing, you are not ready for production per OpenAI production best practices.
- Index refresh is a scheduled operational task, not a one-off upload. See AWS Bedrock Knowledge Bases for managed ingestion patterns.
- Every factual claim in regulated Q&A should trace to a passage ID
- Empty retrieval must map to a scripted unknown response
- Workshop question: "If legal edits one paragraph tomorrow, what is our update path?"
- What good looks like: versioned index builds with rollback and Vercel embeddings pinned in config
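The flow described above (embed, index, retrieve, then either build a prompt or return a scripted unknown) can be sketched in a few lines. This is a toy illustration, not a production design: `embed` is a bag-of-words stand-in for a real embedding model, and the passage IDs, the `min_score` threshold, and the unknown message are all hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical scripted response used when retrieval finds nothing usable.
UNKNOWN = "I can't find this in the approved documents."

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words term count."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_index(passages):
    """Map passage ID -> (text, vector). Re-run when a source document changes."""
    return {pid: (text, embed(text)) for pid, text in passages.items()}

def retrieve(index, query, top_k=2, min_score=0.2):
    """Return up to top_k (passage_id, text) pairs scoring at least min_score."""
    q = embed(query)
    scored = sorted(((cosine(q, vec), pid, text) for pid, (text, vec) in index.items()), reverse=True)
    return [(pid, text) for score, pid, text in scored[:top_k] if score >= min_score]

def build_prompt(index, query):
    """Return a cite-only prompt, or None so the caller can reply with UNKNOWN."""
    hits = retrieve(index, query)
    if not hits:
        return None
    context = "\n".join(f"[{pid}] {text}" for pid, text in hits)
    return f"Answer only from the passages below and cite passage IDs.\n{context}\n\nQuestion: {query}"
```

Returning `None` on empty retrieval keeps the unknown path outside the model call, so the scripted response cannot be overridden by a fluent guess.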
02
Default to RAG for organisational knowledge
If answers must reflect your policies, products, or procedures, and change when documents change, RAG plus cite-only generation is usually the right first move. Fine-tuning embeds knowledge into weights that are expensive to refresh and hard to audit.
RAG gives legal and compliance a defensible trail: the model answered from passage X in document Y. That matters in regulated Q&A, procurement, and HR policy where "the model said so" is not evidence. Align retention with the Microsoft Copilot data protection guidance.
Start with the smallest approved corpus that covers 80% of questions. Expand metadata and sources after citation rate and unknown-answer discipline are stable. Run golden questions weekly, not only before launch.
Sponsors should hear RAG described as answers tied to approved documents, not as a cheaper chatbot. The value is governance and refresh speed, not novelty. Point sceptics to the Azure RAG solution guide for enterprise design patterns.
- Policy and compliance Q&A with mandatory citations
- Product documentation that updates weekly or monthly. Demo with Azure Foundry RAG.
- Multi-source answers (SharePoint + Confluence + PDF archive)
- Regional or role-based views via search filters
- Common mistake: allowing the model to answer when retrieval is empty
- What good looks like: unknown-answer rate reported alongside citation rate in the OpenAI evals dashboard
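Reporting unknown-answer rate alongside citation rate needs an agreed definition of each. The sketch below assumes one possible definition (citation rate is the share of non-unknown answers that cite at least one passage); the `AnswerRecord` fields are hypothetical and not taken from any dashboard product.

```python
from dataclasses import dataclass, field

@dataclass
class AnswerRecord:
    """One logged answer; field names are illustrative."""
    question_id: str
    passage_ids: list = field(default_factory=list)
    is_unknown: bool = False

def weekly_rates(records):
    """Summarise citation and unknown-answer discipline for a batch of answers."""
    total = len(records)
    if total == 0:
        return {"citation_rate": 0.0, "unknown_rate": 0.0, "uncited_answers": 0}
    answered = [r for r in records if not r.is_unknown]
    cited = sum(1 for r in answered if r.passage_ids)
    return {
        # Assumed definition: cited answers divided by answers that were not "unknown".
        "citation_rate": cited / len(answered) if answered else 0.0,
        "unknown_rate": (total - len(answered)) / total,
        # Answers given with no supporting passage are the ones to investigate first.
        "uncited_answers": len(answered) - cited,
    }
```

A non-zero `uncited_answers` count is the direct signal for the common mistake above: the model answered although retrieval returned nothing.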
03
Chunking, ingestion, and citations
Chunk size and overlap drive recall. Whole PDFs stuffed into context waste tokens and bury the right paragraph. Tune chunking on your documents with retrieval metrics and OpenAI evals, not blog defaults.
Ingestion pipelines need virus scanning, type allow lists, ACL sync from source systems, and deduplication when the same policy exists in three repositories. Treat malicious PDF instructions as a prompt injection vector via retrieved content.
Citations must be human readable: document title, section, effective date, and link back to the source system where possible. Engineers may log chunk IDs; business users need recognisable references. Follow OpenAI retrieval patterns for passage formatting.
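One way to keep both audiences served is to store the chunk ID for logs and render a separate reader-facing string. The following is a minimal sketch; the `Passage` fields and the output format are assumptions chosen for illustration, not a prescribed citation style.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass(frozen=True)
class Passage:
    """Retrieved passage with the metadata a business reader recognises."""
    chunk_id: str            # logged for engineers, not shown to users
    title: str
    section: str
    effective_from: date
    source_url: Optional[str] = None

def format_citation(p):
    """Render title, section, and effective date, plus a source link if known."""
    text = f"{p.title}, {p.section}, effective {p.effective_from.isoformat()}"
    return f"{text} ({p.source_url})" if p.source_url else text
```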
The Azure RAG solution guide in vendor documentation shows how refresh jobs, citation formatting, and eval hooks fit together. Treat that pattern as the engineering bar, not optional polish.
- Chunk by heading structure where documents have predictable layout
- Store metadata: classification, region, product, effective_from, effective_to
- Re-embed on chunking config change; version the embedding model
- Pen-test ingestion: malicious PDF instructions are a retrieval attack. See OWASP LLM Top 10.
- Workshop question: "Can we show the exact passage in the source system?"
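As a rough illustration of the first two bullets, the sketch below splits a document on headings and attaches metadata to every chunk. It assumes source text with Markdown-style `#` headings, which your documents may not have; the metadata keys mirror the list above, and `in_force` is a hypothetical helper for effective-date filtering.

```python
import re
from datetime import date

# Assumption: headings look like "# Title" through "###### Title".
HEADING = re.compile(r"^#{1,6}\s+(.*)$")

def chunk_by_heading(doc_text, metadata):
    """Split text into one chunk per heading and copy metadata onto each chunk."""
    sections = []
    current = ["Preamble", []]  # text before the first heading
    for line in doc_text.splitlines():
        match = HEADING.match(line)
        if match:
            sections.append(current)
            current = [match.group(1).strip(), []]
        else:
            current[1].append(line)
    sections.append(current)

    chunks = []
    for title, lines in sections:
        body = "\n".join(lines).strip()
        if body:  # skip empty sections rather than indexing blank chunks
            chunks.append({"section": title, "text": body, **metadata})
    return chunks

def in_force(chunk, on):
    """True if the chunk's effective window covers the given date."""
    start, end = chunk["effective_from"], chunk.get("effective_to")
    return start <= on and (end is None or on <= end)
```

Filtering with `in_force` before ranking keeps superseded policy text out of the prompt even when it is still present in the index.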
04
When fine-tuning earns its cost
Fine-tuning adjusts model weights on your examples. It helps when the task shape is stable but prompt engineering becomes unwieldy, not when facts change frequently.
Use fine-tuning when you have hundreds to thousands of labelled examples, a clear evaluation rubric, and a retraining cadence when labels drift. Without evals, fine-tuning is an expensive way to hide prompt debt.
Regional language variants, clinical phrasing, or strict JSON schemas are common justified cases. Facts about your products are usually not. Read Azure fine-tuning considerations before committing budget.
Azure, AWS custom models, and OpenAI each expose fine-tuning with different data residency and approval steps. Base the decision on your cloud anchor guide, not on a vendor demo.
- Stable output format (same JSON schema or label set every time)
- Style or tone consistency at scale (brand voice, clinical phrasing)
- Latency-sensitive classification with a fixed, small label set
- Domain terminology where retrieval noise stays high despite good chunking. Compare against RAG vs fine-tuning checklist first.
- Poor fit: policy corpus that changes monthly
- Poor fit: proving ROI on factual Q&A without citations
05
Common failure modes
Teams often pick fine-tuning because demos feel smarter, then discover refresh cost and regression risk. Name these patterns in architecture review before spend is committed.
RAG failures are usually operational: bad chunking, missing ACL filters, or no unknown path. Fine-tuning failures are usually eval and label drift: the model learned last quarter's exceptions.
The most expensive combined mistake is fine-tuning for facts while also running RAG, without clear division of responsibility. Pick a primary knowledge path per use case. Document the choice in your OWASP LLM Top 10 register.
Skipping golden set comparison lets both approaches look successful in isolation and fail together in production. Run the same rubric through Azure prompt flow evaluation.
- Fine-tuning for facts: knowledge goes stale; use RAG or structured extraction
- RAG without chunking discipline: whole PDFs in context; wrong paragraph wins
- No citation requirement: fluent wrong answers pass review until audit
- Skipping evals: neither approach measured on the same questions. See OpenAI evals.
- Index without ACLs: users see documents they cannot open in source systems
- What good looks like: weekly citation and empty-retrieval dashboard
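The "index without ACLs" failure can be illustrated with a deny-by-default filter on retrieved chunks. This is a sketch under assumptions: the `allowed_groups` field is hypothetical and would need to be synced from the source system. Many search services can apply such a filter inside the query itself; filtering after retrieval is shown here only to keep the example self-contained.

```python
def filter_by_acl(hits, user_groups):
    """Keep only chunks the user could also open in the source system.

    A chunk with no 'allowed_groups' entry is dropped (deny by default).
    """
    allowed = set(user_groups)
    return [h for h in hits if allowed & set(h.get("allowed_groups", ()))]
```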
06
Hybrid patterns that work in enterprise
RAG plus a small fine-tuned router or classifier is a legitimate hybrid: the classifier picks intent or document family; RAG answers with citations. Keep the classifier's labels stable and eval them separately with OpenAI evals.
Structured extraction (forms, questionnaires) often beats both RAG and fine-tuning when fields are fixed. Use the model to map text to schema, then validate deterministically. See intent routing example.
Do not hybridise without documenting which component owns factual correctness. Auditors will ask; ambiguity creates programme risk. Align with OpenAI safety best practices for any write path.
If Copilot already covers broad M365 search, custom RAG should focus on corpora and citations Copilot cannot provide. Read the Microsoft 365 Copilot overview.
- RAG for answers, fine-tuning for intent routing only
- RAG for policy, deterministic rules for eligibility thresholds
- Extraction to JSON, OpenAI safety best practices, then system write
- Common mistake: two knowledge paths with no single owner
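The "map text to schema, then validate deterministically" pattern can be sketched as a plain validator that sits between the model's JSON output and any system write. The schema, field names, and currency allow list below are hypothetical examples for an expense-claim form, not part of any real system.

```python
import json

# Hypothetical schema: field name -> accepted Python types after JSON parsing.
SCHEMA = {"claim_id": (str,), "amount": (int, float), "currency": (str,)}
ALLOWED_CURRENCIES = {"AUD", "USD", "EUR"}  # illustrative allow list

def validate_extraction(raw_json):
    """Return (record, errors); record is None unless every check passes."""
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError:
        return None, ["not valid JSON"]
    if not isinstance(data, dict):
        return None, ["expected a JSON object"]

    errors = []
    for name, types in SCHEMA.items():
        value = data.get(name)
        if name not in data:
            errors.append(f"missing field: {name}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(value, bool) or not isinstance(value, types):
            errors.append(f"wrong type for {name}")
    for name in sorted(set(data) - set(SCHEMA)):
        errors.append(f"unexpected field: {name}")

    # Business rules run only once the shape is known to be correct.
    if not errors:
        if data["amount"] <= 0:
            errors.append("amount must be positive")
        if data["currency"] not in ALLOWED_CURRENCIES:
            errors.append(f"unsupported currency: {data['currency']}")
    return (data, []) if not errors else (None, errors)
```

Because the validator is deterministic, it has a single clear owner for correctness of the write path, independent of which model produced the JSON.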
07
Pilot discipline: compare on the same questions
Run the same golden questions through baseline chat, RAG, and (only if needed) a fine-tuned endpoint. Measure citation rate, hallucination rate, latency, token cost, and refresh cost when source documents change.
Keep prompts and retrieval config in version control. Treat index refreshes and prompt changes like code deploys with rollback. Follow production best practices for staged promotion.
Score unknown-answer discipline explicitly. A system that refuses when unsure often beats one that impresses executives in week one and fails audit in week ten. Use the NIST AI RMF stop rules.
Publish side-by-side results to the council: not which model is smarter, but which pattern meets defensibility and TCO targets. Demo baseline chat with the Azure OpenAI example.
- 20 to 50 golden questions agreed with legal or ops before build
- Same rubric for baseline, RAG, and fine-tuning variants
- Measure cost per successful task, not cost per demo. Link to Vercel AI Gateway.
- Document refresh labour hours when a policy paragraph changes
- Workshop question: "What regression blocks a prompt or index deploy?"
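One possible answer to the workshop question is a small gate that scores each variant on the golden set and blocks a deploy on regression. The sketch below assumes each golden-question result carries a rubric verdict (`passed`), a latency, and a token count; those field names and both tolerances are hypothetical and should be agreed with legal or ops.

```python
def score_variant(results):
    """Aggregate per-question results for one variant (baseline, RAG, or fine-tuned)."""
    if not results:
        raise ValueError("no results to score")
    n = len(results)
    return {
        "pass_rate": sum(1 for r in results if r["passed"]) / n,
        "mean_latency_ms": sum(r["latency_ms"] for r in results) / n,
        "mean_tokens": sum(r["tokens"] for r in results) / n,
    }

def deploy_blockers(current, candidate, max_pass_drop=0.02, max_latency_ratio=1.5):
    """List the reasons a candidate prompt or index build should not be promoted."""
    reasons = []
    if candidate["pass_rate"] < current["pass_rate"] - max_pass_drop:
        reasons.append("pass_rate regression")
    if candidate["mean_latency_ms"] > current["mean_latency_ms"] * max_latency_ratio:
        reasons.append("latency regression")
    return reasons
```

Running the same function over every variant is what keeps the comparison on one rubric rather than on separate demos.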
08
For business sponsors
Sponsors rarely need model names; they need defensibility (citations, audit trail), time-to-update when policy changes, and a clear refresh cost line in the budget. The AWS AI compliance guidance explains what auditors expect.
Position RAG as answers tied to approved documents and fine-tuning as stable task format at scale, not as competing vanity projects.
Ask three questions in every steering readout: How fast can we update answers after a policy change? What evidence do we show when an answer is challenged? What is our cost per successful task compared to baseline? Frame answers using OpenAI evals metrics.
Pilot success is citation rate plus unknown-answer rate, not "it felt smarter" in a room of executives. Tie funding to the program charter.
- Budget line for index storage, embedding refresh, and search ops
- Legal sign-off on cite-only behaviour and retention
- Stop rule: citation rate below threshold for two weeks
- Scale criterion: same metrics hold on 3x question volume
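The stop rule above becomes unambiguous once it is written as a check. This sketch reads "two weeks" as the two most recent consecutive weekly readings; that reading and the default threshold of 0.9 are assumptions to be replaced by whatever the charter specifies.

```python
def stop_rule_triggered(weekly_citation_rates, threshold=0.9, weeks=2):
    """True if the most recent `weeks` readings are all below the threshold.

    Expects rates ordered oldest to newest; returns False if there is not
    yet enough history to judge.
    """
    recent = weekly_citation_rates[-weeks:]
    return len(recent) == weeks and all(rate < threshold for rate in recent)
```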
09
Total cost of ownership (TCO)
RAG TCO includes search index, embedding calls, retrieval traffic, re-index labour when corpora change, and observability. Fine-tuning TCO includes labelling, training runs, retraining when labels drift, and regression evals on every refresh.
For most enterprise knowledge workloads, RAG wins TCO unless you have a narrow, stable label task with evidence that prompts cannot reach. Model cost controls apply to both patterns.
Include human review minutes in TCO. A cheaper model that doubles escalations is not a win. Factor HITL queues into fully loaded cost.
Finance should see a one-page comparison: pilot actuals, steady-state forecast, and sensitivity if adoption doubles. Use production readiness conversation bake-off tables as a template.
- RAG: index + embed + query + ops time for refresh
- Fine-tuning: labels + train + eval + retrain on drift
- Hidden cost: incident response when citations fail. See AI security controls.
- What good looks like: cost per successful task on steering dashboard
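Cost per successful task, with human review minutes included, reduces to simple arithmetic. The figures in the usage note are invented purely to show how a cheaper model can still lose once escalations are counted; the parameter names are illustrative.

```python
def cost_per_successful_task(model_cost, infra_cost, review_minutes,
                             reviewer_rate_per_hour, successful_tasks):
    """Fully loaded cost for a period divided by tasks that met the rubric."""
    if successful_tasks <= 0:
        raise ValueError("successful_tasks must be positive")
    review_cost = review_minutes / 60 * reviewer_rate_per_hour
    return (model_cost + infra_cost + review_cost) / successful_tasks
```

With made-up numbers: a setup costing 200 for the model, 300 for infrastructure, and 600 review minutes at 60 per hour comes to 1.10 per task over 1,000 successful tasks. Halving model cost to 100 while doubling review minutes to 1,200 raises it to 1.60.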
10
Governance, privacy, and audit
Index only data classes approved for the use case. Restricted HR or customer data in a general copilot index is a common programme-ending mistake. Apply Azure responsible AI review before production.
Align retention on prompts, logs, and vector stores with privacy and legal holds. RAG does not remove GDPR, APP, or sector retention obligations. Follow the Microsoft Copilot data protection guidance.
Log retrieval IDs and model version per answer. When regulators or internal audit ask why an answer was given, you need reconstructability, not a chat screenshot. Enable Foundry monitoring where available.
Pair this guide with data privacy and security controls guides for sponsor-facing policy language.
- ACL on index matches source system permissions
- Classification labels enforced at query time
- Retention schedule signed before production readiness conversation
- Subprocessor register updated when embedding model changes
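Reconstructability is easier to reason about with a concrete record shape. The sketch below logs retrieval IDs and model version per answer, as described above; the `index_version` and `prompt_version` fields are additional assumptions, included because the guide elsewhere recommends versioning both, and all names are hypothetical.

```python
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class AnswerAuditRecord:
    """What was retrieved, and which versions produced the answer."""
    answer_id: str
    question: str
    retrieval_ids: tuple     # passage or chunk IDs passed into the prompt
    model_version: str
    index_version: str       # assumption: index builds are versioned
    prompt_version: str      # assumption: prompts live in version control
    created_at: str          # ISO 8601 timestamp supplied by the caller

def to_log_line(record):
    """Serialise to one JSON line with stable key order for diffing and search."""
    return json.dumps(asdict(record), sort_keys=True)
```

Retention of these log lines falls under the same schedule as prompts and vector stores, so the record should carry only what audit needs.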
11
Workshop: 90-minute pattern bake-off
Attendees: sponsor, engineering lead, ops analytics, legal or privacy delegate, champion. Bring golden questions and one policy change scenario (paragraph edit).
First 45 minutes: recap the charter, then run baseline chat and RAG on the same questions; score citations and unknowns. Second 45 minutes: policy-change refresh exercise, TCO sketch, and decision: RAG only, fine-tuning only for task shape, or hybrid with named owners. Use the Vercel seed RAG example as a teaching prop.
End with a written recommendation and stop rules, not a verbal preference for the shiniest demo. Anchor stop rules to the NIST AI RMF.
Capture open risks (ACL, ingestion, eval CI) with owners before sprint two. Log exceptions through the NIST AI RMF Govern function.
- 0:00 to 0:15: Charter recap and golden set review
- 0:15 to 0:45: Live scoring baseline vs RAG
- 0:45 to 1:05: Policy-change refresh exercise
- 1:05 to 1:25: TCO and governance gaps
- 1:25 to 1:30: Decision and next steps
12
Architecture decision checklist
Use this checklist in review before committing budget to fine-tuning or a large index build. Cross-check against the Azure RAG solution guide.
If more than two answers are "no" or "unknown," pause the pattern and fix foundations first. Run a production readiness conversation gap review.
Escalate exceptions to the NIST AI RMF Govern function with an expiry date, not silent team-level workarounds.
- Do answers need to change when documents change without retraining? → RAG
- Is factual defensibility required in audit? → RAG with citations
- Is the task a stable label or schema with abundant examples? → consider fine-tuning
- Do we have golden evals and CI for every prompt or index change? → required either way
- Is ACL enforced on retrieval? → required for RAG
- What good looks like: one-page decision record attached to charter
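The "more than two answers" rule can be recorded next to the decision itself. The following is a minimal sketch, assuming checklist answers are captured as free-text "yes", "no", or "unknown" against hypothetical question keys.

```python
def review_outcome(answers, max_gaps=2):
    """Return 'pause' if more than max_gaps answers are 'no' or 'unknown'."""
    gaps = [q for q, a in answers.items() if a.strip().lower() in {"no", "unknown"}]
    return ("pause" if len(gaps) > max_gaps else "proceed"), gaps
```

Returning the list of gaps alongside the outcome gives the decision record its open items without a second pass over the checklist.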
Provider & framework documentation
Official docs referenced in this guide. Use these in architecture reviews and security questionnaires.
Microsoft Azure
OpenAI
Anthropic