AI Labs by The Ops Toolbox

What RAG actually does

Retrieval-augmented generation (RAG; see Azure RAG concepts) keeps knowledge outside the model. You chunk documents, embed them with a model such as OpenAI embeddings, store the vectors in a search index, retrieve the top matches at query time, and pass those passages into the prompt. When policy changes, you re-index; you do not retrain.

Production RAG is more than uploading PDFs. You need chunking strategy, metadata filters (region, product line, effective date), citation formatting, and an explicit unknown answer when retrieval confidence is low. Microsoft's Azure RAG solution guide documents how those pieces fit together.

The model's job in a well-designed RAG system is synthesis and refusal, not memorisation of your corpus. That separation is what makes answers auditable and refreshable. Pair cite-only behaviour with hallucination-reduction guidance when evals show fluent wrong answers.

Workshop participants often conflate "we added documents" with "we built RAG." Use a simple diagram: source systems, index, retrieve, cite, generate. If any box is missing, you are not ready for production per OpenAI production best practices.

Index refresh is a scheduled operational task, not a one-off upload. See AWS Bedrock Knowledge Bases for managed ingestion patterns.

Every factual claim in regulated Q&A should trace to a passage ID
Empty retrieval must map to a scripted unknown response
Workshop question: "If legal edits one paragraph tomorrow, what is our update path?"
What good looks like: versioned index builds with rollback and embedding models pinned in config
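
The retrieve, cite, and refuse steps above can be sketched in a few lines. This is a minimal illustration, not a production design: the word-count `embed` function is a toy stand-in for a real embedding model, and the `min_score` threshold, passage IDs, and wording of the scripted unknown response are all assumptions.

```python
import math
import re
from collections import Counter

# Scripted unknown response; the wording is an assumption for illustration.
UNKNOWN = "I don't know based on the approved documents."

def embed(text):
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, index, top_k=2, min_score=0.2):
    """Return (passage_id, text, score) for the best matches at or above min_score."""
    q = embed(query)
    scored = [(pid, text, cosine(q, embed(text))) for pid, text in index.items()]
    scored = [s for s in scored if s[2] >= min_score]
    return sorted(scored, key=lambda s: s[2], reverse=True)[:top_k]

def build_prompt(query, index):
    """Build a cite-only prompt, or return the scripted unknown answer on empty retrieval."""
    hits = retrieve(query, index)
    if not hits:
        return UNKNOWN
    passages = "\n".join(f"[{pid}] {text}" for pid, text, _ in hits)
    return f"Answer only from these passages and cite their IDs.\n{passages}\nQuestion: {query}"

# Hypothetical two-passage index keyed by passage ID.
index = {
    "leave-policy#3": "Employees accrue annual leave at 1.5 days per month.",
    "expenses#1": "Travel expenses require manager approval before booking.",
}
print(build_prompt("How is annual leave accrued?", index))
print(build_prompt("What is the parking fee?", index))
```

The second call finds no passage above the threshold, so the scripted unknown response is returned without the model being asked at all.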

Default to RAG for organisational knowledge

If answers must reflect your policies, products, or procedures, and change when documents change, RAG plus cite-only generation is usually the right first move. Fine-tuning embeds knowledge into weights that are expensive to refresh and hard to audit.

RAG gives legal and compliance a defensible trail: the model answered from passage X in document Y. That matters in regulated Q&A, procurement, and HR policy, where "the model said so" is not evidence. Align retention with the Microsoft Copilot data protection guidance.

Start with the smallest approved corpus that covers 80% of questions. Expand metadata and sources after citation rate and unknown-answer discipline are stable. Run golden questions weekly, not only before launch.

Sponsors should hear RAG described as answers tied to approved documents, not as a cheaper chatbot. The value is governance and refresh speed, not novelty. Point sceptics to the Azure RAG solution guide for enterprise design patterns.

Policy and compliance Q&A with mandatory citations
Product documentation that updates weekly or monthly. Demo with Azure Foundry RAG.
Multi-source answers (SharePoint + Confluence + PDF archive)
Regional or role-based views via search filters
Common mistake: allowing the model to answer when retrieval is empty
What good looks like: unknown-answer rate reported alongside citation rate in the OpenAI evals dashboard
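
Regional views and effective dates can be enforced as a filter applied before ranking. The sketch below assumes each chunk carries `region`, `effective_from`, and `effective_to` metadata; the `Chunk` class and the sample IDs are hypothetical, and a real search service would apply the same condition through its own filter syntax.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class Chunk:
    passage_id: str
    region: str
    effective_from: date
    effective_to: Optional[date]  # None means the passage is still in force

def in_scope(chunk, region, on):
    """True when the chunk applies to the region and is in force on the given date."""
    if chunk.region != region or on < chunk.effective_from:
        return False
    return chunk.effective_to is None or on <= chunk.effective_to

def filter_chunks(chunks, region, on):
    """Apply the metadata filter before any similarity ranking."""
    return [c for c in chunks if in_scope(c, region, on)]

# Hypothetical corpus: a superseded AU policy, its replacement, and a UK policy.
chunks = [
    Chunk("leave-au#0", "AU", date(2022, 1, 1), date(2023, 12, 31)),
    Chunk("leave-au#1", "AU", date(2024, 1, 1), None),
    Chunk("leave-uk#1", "UK", date(2024, 1, 1), None),
]
print([c.passage_id for c in filter_chunks(chunks, "AU", date(2024, 6, 1))])
```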

Chunking, ingestion, and citations

Chunk size and overlap drive recall. Whole PDFs stuffed into context waste tokens and bury the right paragraph. Tune chunking on your documents with retrieval metrics and OpenAI evals, not blog defaults.

Ingestion pipelines need virus scanning, type allow lists, ACL sync from source systems, and deduplication when the same policy exists in three repositories. Treat malicious PDF instructions as a prompt injection vector via retrieved content.

Citations must be human readable: document title, section, effective date, and link back to the source system where possible. Engineers may log chunk IDs; business users need recognisable references. Follow OpenAI retrieval patterns for passage formatting.
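
One way to keep both audiences served is to store the chunk ID for logs and render a separate reader-facing string. The sketch below is illustrative only: the `Passage` fields and the example URL are assumptions, not a format prescribed by any vendor guide.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class Passage:
    chunk_id: str          # kept for engineering logs, not shown to business users
    title: str
    section: str
    effective_from: date
    source_url: Optional[str] = None

def format_citation(p):
    """Render title, section, and effective date, with a source link when one exists."""
    text = f"{p.title}, section {p.section}, effective {p.effective_from.isoformat()}"
    return f"{text} ({p.source_url})" if p.source_url else text

# Hypothetical passage for illustration.
passage = Passage("c-17", "Travel Policy", "4.2", date(2024, 7, 1),
                  "https://intranet.example/travel#4.2")
print(format_citation(passage))
```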

The Azure RAG solution guide in vendor documentation shows how refresh jobs, citation formatting, and eval hooks fit together. Treat that pattern as the engineering bar, not optional polish.

Chunk by heading structure where documents have predictable layout
Store metadata: classification, region, product, effective_from, effective_to
Re-embed on chunking config change; version the embedding model
Pen-test ingestion: malicious PDF instructions are a retrieval attack. See OWASP LLM Top 10.
Workshop question: "Can we show the exact passage in the source system?"
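
Heading-based chunking can be sketched as follows. The example assumes documents have already been converted to text with Markdown-style `#` headings, which will not hold for every source; the passage ID scheme (`doc_id#n`) is also an assumption.

```python
import re

def chunk_by_heading(text, doc_id):
    """Split text into one chunk per heading, keeping the heading as metadata."""
    chunks, heading, lines = [], "Preamble", []

    def flush():
        body = "\n".join(lines).strip()
        if body:  # skip empty sections
            chunks.append({
                "passage_id": f"{doc_id}#{len(chunks) + 1}",
                "heading": heading,
                "text": body,
            })

    for line in text.splitlines():
        match = re.match(r"^#{1,6}\s+(.*)", line)
        if match:
            flush()  # close the previous section before starting a new one
            heading, lines = match.group(1).strip(), []
        else:
            lines.append(line)
    flush()
    return chunks

sample = "# Leave\nAccrual is monthly.\n\n# Expenses\nApproval needed.\n"
for chunk in chunk_by_heading(sample, "hr"):
    print(chunk)
```

Metadata such as classification, region, and effective dates would be attached to each chunk dictionary at this stage, so that a chunking config change triggers a full re-embed of the affected documents.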

When fine-tuning earns its cost

Fine-tuning adjusts model weights on your examples. It helps when the task shape is stable but prompt engineering becomes unwieldy, not when facts change frequently.

Use fine-tuning when you have hundreds to thousands of labelled examples, a clear evaluation rubric, and a retraining cadence when labels drift. Without evals, fine-tuning is an expensive way to hide prompt debt.

Regional language variants, clinical phrasing, or strict JSON schemas are common justified cases. Facts about your products are usually not. Read Azure fine-tuning considerations before committing budget.

Azure, AWS custom models, and OpenAI each expose fine-tuning with different data residency and approval steps. Base the decision on your cloud anchor guide, not on a vendor demo.

Stable output format (same JSON schema or label set every time)
Style or tone consistency at scale (brand voice, clinical phrasing)
Latency-sensitive classification with a fixed, small label set
Domain terminology where retrieval noise stays high despite good chunking. Compare against RAG vs fine-tuning checklist first.
Poor fit: policy corpus that changes monthly
Poor fit: proving ROI on factual Q&A without citations

Common failure modes

Teams often pick fine-tuning because demos feel smarter, then discover refresh cost and regression risk. Name these patterns in architecture review before spend is committed.

RAG failures are usually operational: bad chunking, missing ACL filters, or no unknown path. Fine-tuning failures are usually eval and label drift: the model learned last quarter's exceptions.

The most expensive combined mistake is fine-tuning for facts while also running RAG, without a clear division of responsibility. Pick a primary knowledge path per use case. Document the choice in your risk register, mapped to the OWASP LLM Top 10.

Skipping golden set comparison lets both approaches look successful in isolation and fail together in production. Run the same rubric through Azure prompt flow evaluation.

Fine-tuning for facts: knowledge goes stale; use RAG or structured extraction
RAG without chunking discipline: whole PDFs in context; wrong paragraph wins
No citation requirement: fluent wrong answers pass review until audit
Skipping evals: neither approach measured on the same questions. See OpenAI evals.
Index without ACLs: users see documents they cannot open in source systems
What good looks like: weekly citation and empty-retrieval dashboard

Hybrid patterns that work in enterprise

RAG plus a small fine-tuned router or classifier is a legitimate hybrid: the classifier picks the intent or document family; RAG answers with citations. Keep the classifier's labels stable and evaluate them separately with OpenAI evals.

Structured extraction (forms, questionnaires) often beats both RAG and fine-tuning when fields are fixed. Use the model to map text to schema, then validate deterministically. See intent routing example.

Do not hybridise without documenting which component owns factual correctness. Auditors will ask; ambiguity creates programme risk. Align with OpenAI safety best practices for any write path.

If Copilot already covers broad M365 search, custom RAG should focus on corpora and citations Copilot cannot provide. Read the Microsoft 365 Copilot overview.

RAG for answers, fine-tuning for intent routing only
RAG for policy, deterministic rules for eligibility thresholds
Extraction to JSON, validation per OpenAI safety best practices, then system write
Common mistake: two knowledge paths with no single owner
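
The "map text to schema, then validate deterministically" pattern can be illustrated with plain code sitting between the model and any system write. The field names and allowed currencies below are a hypothetical schema invented for this sketch.

```python
import json

# Hypothetical fixed schema for an expense-claim extraction.
FIELDS = {"claim_id", "amount", "currency"}
ALLOWED_CURRENCIES = {"AUD", "USD", "EUR"}

def validate_extraction(raw):
    """Parse model output and reject anything outside the fixed schema."""
    data = json.loads(raw)  # malformed JSON raises a ValueError subclass
    if not isinstance(data, dict) or set(data) != FIELDS:
        raise ValueError("output does not match the fixed schema")
    if not isinstance(data["claim_id"], str) or not data["claim_id"]:
        raise ValueError("claim_id must be a non-empty string")
    amount = data["amount"]
    if isinstance(amount, bool) or not isinstance(amount, (int, float)) or amount <= 0:
        raise ValueError("amount must be a positive number")
    if data["currency"] not in ALLOWED_CURRENCIES:
        raise ValueError("currency not allowed")
    return data

print(validate_extraction('{"claim_id": "C-101", "amount": 120.5, "currency": "AUD"}'))
```

Only output that passes these checks would proceed to the write path; everything else goes to a human queue or is retried, so the model never owns correctness of the stored record.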

Pilot discipline: compare on the same questions

Run the same golden questions through baseline chat, RAG, and (only if needed) a fine-tuned endpoint. Measure citation rate, hallucination rate, latency, token cost, and refresh cost when source documents change.

Keep prompts and retrieval config in version control. Treat index refreshes and prompt changes like code deploys with rollback. Follow production best practices for staged promotion.

Score unknown-answer discipline explicitly. A system that refuses when unsure often beats one that impresses executives in week one and fails audit in week ten. Anchor stop rules to the NIST AI RMF.

Publish side-by-side results to the council: not which model is smarter, but which pattern meets defensibility and TCO targets. Demo baseline chat with the Azure OpenAI example.

20 to 50 golden questions agreed with legal or ops before build
Same rubric for baseline, RAG, and fine-tuning variants
Measure cost per successful task, not cost per demo. Link to unified model gateway.
Document refresh labour hours when a policy paragraph changes
Workshop question: "What regression blocks a prompt or index deploy?"
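
Scoring each variant with one function keeps the comparison honest. The sketch below is one possible set of definitions, stated as assumptions: citation rate is measured over answered questions only, a correct refusal on an out-of-scope question counts as a success, and the sample results are invented.

```python
from dataclasses import dataclass

@dataclass
class Result:
    question_id: str
    correct: bool   # judged against the agreed rubric
    cited: bool     # answer included at least one valid citation
    unknown: bool   # system returned the scripted unknown answer
    cost: float     # token cost for this question, in any currency unit

def summarise(results):
    """Compute the same three metrics for any variant (baseline, RAG, fine-tuned)."""
    n = len(results)
    answered = [r for r in results if not r.unknown]
    successes = sum(r.correct for r in results)
    return {
        "citation_rate": sum(r.cited for r in answered) / len(answered) if answered else 0.0,
        "unknown_rate": sum(r.unknown for r in results) / n if n else 0.0,
        "cost_per_success": sum(r.cost for r in results) / successes if successes else float("inf"),
    }

# Invented results for a four-question golden set run through RAG.
rag_results = [
    Result("q1", correct=True, cited=True, unknown=False, cost=0.02),
    Result("q2", correct=True, cited=True, unknown=False, cost=0.02),
    Result("q3", correct=False, cited=False, unknown=False, cost=0.02),
    Result("q4", correct=True, cited=False, unknown=True, cost=0.01),
]
print(summarise(rag_results))
```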

For business sponsors

Sponsors rarely need model names; they need defensibility (citations, audit trail), time-to-update when policy changes, and a clear refresh cost line in the budget. The AWS AI compliance guidance explains what auditors expect.

Position RAG as answers tied to approved documents and fine-tuning as stable task format at scale, not as competing vanity projects.

Ask three questions in every steering readout: How fast can we update answers after a policy change? What evidence do we show when an answer is challenged? What is our cost per successful task compared to baseline? Frame answers using OpenAI evals metrics.

Pilot success is citation rate plus unknown-answer rate, not "felt smarter" in a room of executives. Tie funding to the program charter.

Budget line for index storage, embedding refresh, and search ops
Legal sign-off on cite-only behaviour and retention
Stop rule: citation rate below threshold for two weeks
Scale criterion: same metrics hold on 3x question volume
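
A stop rule is easier to enforce when it is written as a check rather than a sentence. This sketch assumes one citation-rate reading per week, oldest first; the 0.9 threshold is a placeholder, not a recommended value.

```python
def stop_rule_triggered(weekly_citation_rates, threshold=0.9, weeks=2):
    """True when the most recent `weeks` readings are all below the threshold."""
    recent = weekly_citation_rates[-weeks:]
    return len(recent) == weeks and all(rate < threshold for rate in recent)

print(stop_rule_triggered([0.95, 0.88, 0.85]))  # last two weeks both below 0.9
print(stop_rule_triggered([0.88, 0.95, 0.85]))  # only the latest week is below
```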

Total cost of ownership (TCO)

RAG TCO includes search index, embedding calls, retrieval traffic, re-index labour when corpora change, and observability. Fine-tuning TCO includes labelling, training runs, retraining when labels drift, and regression evals on every refresh.

For most enterprise knowledge workloads, RAG wins on TCO unless you have a narrow, stable label task with evidence that prompting alone cannot reach the required quality. Model cost controls apply to both patterns.

Include human review minutes in TCO. A cheaper model that doubles escalations is not a win. Factor HITL queues into fully loaded cost.

Finance should see a one-page comparison: pilot actuals, steady-state forecast, and sensitivity if adoption doubles. Use the bake-off tables from the production readiness conversation as a template.

RAG: index + embed + query + ops time for refresh
Fine-tuning: labels + train + eval + retrain on drift
Hidden cost: incident response when citations fail. See AI security controls.
What good looks like: cost per successful task on steering dashboard
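
The point about human review minutes can be made concrete with a fully loaded cost calculation. All figures below are invented for illustration, and the function signature is an assumption about which inputs a team would track.

```python
def cost_per_successful_task(platform_cost, tasks, success_rate,
                             review_minutes_per_task, reviewer_hourly_rate):
    """Fully loaded cost per success: platform spend plus human review time."""
    review_cost = tasks * review_minutes_per_task / 60 * reviewer_hourly_rate
    successes = tasks * success_rate
    if successes <= 0:
        raise ValueError("no successful tasks to divide by")
    return (platform_cost + review_cost) / successes

# Invented scenario: option B has a lower platform bill but doubles review time.
option_a = cost_per_successful_task(2000, 10_000, 0.9, 0.5, 60)
option_b = cost_per_successful_task(1200, 10_000, 0.9, 1.0, 60)
print(round(option_a, 2), round(option_b, 2))
```

In this invented scenario the option with the smaller platform bill ends up more expensive per successful task once review time is counted.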

Governance, privacy, and audit

Index only data classes approved for the use case. Restricted HR or customer data in a general copilot index is a common programme-ending mistake. Apply Azure responsible AI review before production.

Align retention on prompts, logs, and vector stores with privacy and legal holds. RAG does not remove GDPR, APP, or sector retention obligations. Follow the Microsoft Copilot data protection guidance.

Log retrieval IDs and model version per answer. When regulators or internal audit ask why an answer was given, you need reconstructability, not a chat screenshot. Enable Foundry monitoring where available.

Pair this guide with data privacy and security controls guides for sponsor-facing policy language.

ACL on index matches source system permissions
Classification labels enforced at query time
Retention schedule signed before production readiness conversation
Subprocessor register updated when embedding model changes
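
ACL enforcement and reconstructable logging can be sketched together. The passage structure (`passage_id`, `acl`), group names, and version strings below are assumptions; a real deployment would sync ACLs from the source system and write the record to an append-only store under the agreed retention schedule.

```python
import json
from datetime import datetime, timezone

def allowed_passages(passages, user_groups):
    """Keep only passages whose ACL shares at least one group with the user."""
    return [p for p in passages if set(p["acl"]) & set(user_groups)]

def audit_record(question, passages, model_version, index_version):
    """Serialise what is needed to reconstruct why an answer was given."""
    return json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "question": question,
        "retrieval_ids": [p["passage_id"] for p in passages],
        "model_version": model_version,
        "index_version": index_version,
    })

# Hypothetical retrieved passages with group-based ACLs.
retrieved = [
    {"passage_id": "hr-policy#4", "acl": ["all-staff"]},
    {"passage_id": "hr-cases#12", "acl": ["hr-restricted"]},
]
visible = allowed_passages(retrieved, ["all-staff", "finance"])
print(audit_record("What is the leave policy?", visible, "model-2024-06", "index-v12"))
```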

Workshop: 90-minute pattern bake-off

Attendees: sponsor, engineering lead, ops analytics, legal or privacy delegate, champion. Bring golden questions and one policy change scenario (paragraph edit).

First 45 minutes: recap the charter, then run baseline chat and RAG on the same questions; score citations and unknowns. Remaining 45 minutes: policy-change refresh exercise, TCO sketch, and decision: RAG only, fine-tuning only for task shape, or hybrid with named owners. Use the seed RAG example as a teaching prop.

End with a written recommendation and stop rules, not a verbal preference for the shiniest demo. Anchor stop rules to the NIST AI RMF.

Capture open risks (ACL, ingestion, eval CI) with owners before sprint two. Log exceptions through the NIST AI RMF Govern function.

0:00 to 0:15: Charter recap and golden set review
0:15 to 0:45: Live scoring baseline vs RAG
0:45 to 1:05: Policy-change refresh exercise
1:05 to 1:25: TCO and governance gaps
1:25 to 1:30: Decision and next steps

Architecture decision checklist

Use this checklist in review before committing budget to fine-tuning or a large index build. Cross-check against the Azure RAG solution guide.

If more than two answers are "no" or "unknown," pause the pattern and fix foundations first. Run a gap review as part of the production readiness conversation.

Escalate exceptions through the NIST AI RMF Govern function with an expiry date, not silent team-level workarounds.

Do answers need to change when documents change without retraining? → RAG
Is factual defensibility required in audit? → RAG with citations
Is the task a stable label or schema with abundant examples? → consider fine-tuning
Do we have golden evals and CI for every prompt or index change? → required either way
Is ACL enforced on retrieval? → required for RAG
What good looks like: one-page decision record attached to charter
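
The pause rule for the checklist can be encoded so the decision record is reproducible. The short question keys below are hypothetical labels for the checklist items, and a missing answer is treated as "unknown".

```python
# Hypothetical short keys for the checklist questions.
CHECKLIST = [
    "answers_change_with_documents",
    "audit_defensibility_required",
    "stable_label_or_schema_task",
    "golden_evals_and_ci",
    "acl_enforced_on_retrieval",
]

def review(answers):
    """Return 'pause' when more than two answers are 'no' or 'unknown'."""
    gaps = [q for q in CHECKLIST if answers.get(q, "unknown") in {"no", "unknown"}]
    return ("pause" if len(gaps) > 2 else "proceed"), gaps

decision, gaps = review({
    "answers_change_with_documents": "yes",
    "audit_defensibility_required": "yes",
    "golden_evals_and_ci": "no",
})
print(decision, gaps)
```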

RAG vs fine-tuning

Executive summary