For business sponsors
Executive summary
Skim this before the full guide. Technical detail follows in the sections below.
- Decision: What may be indexed, logged, and sent to model providers.
- Primary metric: Zero ACL violations in sampling; retention aligned to legal holds.
- Stop rule: Halt indexing of a data class until classification and DPA are signed.
Related worked example
PII Redaction Pre-Processor
Privacy is a design constraint, not a late gate
Generative AI systems touch **prompts, retrieved documents, logs, embeddings (OpenAI embeddings guide), and tool outputs**. Each layer can persist personal data longer or wider than users expect. Privacy review in week five of a pilot forces rework or shutdown.
Legal and privacy teams need plain-language answers: what is collected, where it is stored, who can access it, how long it is kept, and whether providers train on customer content. Engineering implements those answers in architecture, not in footnotes.
Sensible pilots proceed when **data classification (Microsoft Copilot data protection), minimisation, and retention schedules** are documented before indexing. Blocking everything blocks learning; indexing everything blocks compliance.
- Privacy sign-off in pilot week one, not week five
- Document subprocessors and regions per provider route
- Workshop question: "Would an employee expect this chat to be stored five years?"
- What good looks like: one-page privacy appendix in every charter
Classify data before you index
Not every document belongs in a **vector index** (Azure vector search) or general copilot. Separate **public, internal, and restricted** corpora with different access controls, retention, and model routes.
HR files, customer PII, health information, and legal privilege material often require exclusion, redaction, or dedicated indexes with stricter IAM. Default deny beats default include.
Classification must match source system ACLs. If SharePoint denies a user the file, RAG (Azure RAG concepts) must not surface it. Pilots that ignore ACLs create audit findings at scale.
Champions often push for "just add one more folder." Council should require classification label and owner signature per new source.
- Public: marketing brochures, external FAQs
- Internal: operating procedures without personal data
- Restricted: HR cases, customer records, legal matters
- Metadata: region, business unit, effective date, classification tag
- Common mistake: indexing entire drives without inventory
- Checklist: data owner sign-off per corpus
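The default-deny rule above can be sketched as a small gate that runs before any document reaches an index. This is a minimal illustration, not a reference implementation: the index names, the `SourceDoc` fields, and the `target_index` function are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical mapping from classification label to a dedicated index.
INDEX_BY_CLASS = {
    "public": "idx-public",
    "internal": "idx-internal",
    "restricted": "idx-restricted",
}


@dataclass
class SourceDoc:
    path: str
    classification: Optional[str]  # None means the document is unlabelled
    owner_signoff: bool            # data owner has signed off this corpus


def target_index(doc: SourceDoc) -> Optional[str]:
    """Return the index for a document, or None to exclude it (default deny)."""
    if doc.classification not in INDEX_BY_CLASS:
        return None  # unlabelled or unknown label: do not index
    if not doc.owner_signoff:
        return None  # no data owner sign-off: do not index
    return INDEX_BY_CLASS[doc.classification]
```

Anything without a recognised label and an owner sign-off is excluded, so "just add one more folder" fails safely until the inventory work is done.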
PII and sensitive data in prompts
Users paste names, employee IDs, customer account numbers, and medical details into chat despite training. Assume prompts contain sensitive data unless proven otherwise.
Minimisation means redacting or blocking patterns before model send where policy requires. Detectors are imperfect; combine automated redaction with user warnings and refusal for high-risk workflows.
Structured workflows should collect identifiers in forms with explicit purpose, not free-text chat, when possible. Forms enable validation and audit.
Tool calls can leak PII into logs and third-party systems. Scrub tool arguments in logs and restrict tool scopes to least privilege.
- Redact email, phone, tax file number patterns before provider send
- Warn users not to paste customer records in general copilots
- Separate restricted workflow with stronger controls and training
- Log redaction events for audit without logging raw PII
- What good looks like: redaction tested on golden prompts with PII
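A rough sketch of pattern-based redaction before provider send follows. The regular expressions are illustrative assumptions (an Australian mobile format and a nine-digit tax file number layout), not a vetted detector, and the `redact` function name is hypothetical. It returns per-pattern counts so a redaction event can be logged without logging raw PII.

```python
import re

# Illustrative patterns only; real detectors need tuning against golden prompts.
# Order matters: phone numbers are replaced before the looser TFN pattern runs.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"(?<!\d)(?:\+61[ -]?|0)4\d{2}[ -]?\d{3}[ -]?\d{3}(?!\d)"),
    "TFN": re.compile(r"\b\d{3}[ -]?\d{3}[ -]?\d{3}\b"),
}


def redact(text):
    """Replace matches with a label and return (redacted_text, counts)."""
    counts = {}
    for label, pattern in PATTERNS.items():
        text, n = pattern.subn(f"[{label}]", text)
        if n:
            counts[label] = n  # count only; never store the matched value
    return text, counts


redacted, event = redact("Contact jane@example.com or 0412 345 678, TFN 123 456 789.")
print(redacted)  # Contact [EMAIL] or [PHONE], TFN [TFN].
print(event)     # {'EMAIL': 1, 'PHONE': 1, 'TFN': 1}
```

Because detectors like this miss cases and produce false positives, they complement user warnings and refusal for high-risk workflows rather than replacing them.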
Prompt, output, and conversation retention
Decide explicitly whether you store full prompts, model outputs, hashes only, or redacted excerpts. Each choice affects debugging, eval replay, and legal discovery.
Align retention with DPAs, employment law, and sector rules. Some providers offer zero-retention or no-training options for eligible tiers. Document what you purchased vs what you assume.
Shorter retention reduces risk but limits incident investigation. Typical pattern: 30 to 90 days for redacted conversation logs in production, longer for aggregated metrics only.
User-facing privacy notices must match actual retention. If logs exist for security, say so plainly.
- Retention schedule (Microsoft Copilot data protection) per data type: prompts, outputs, embeddings (OpenAI embeddings guide), logs
- Legal hold process pauses deletion without silent exceptions
- Zero-retention route for highest sensitivity workflows if available
- Deletion job tested quarterly, not only documented
- Workshop question: "What must we produce in litigation discovery?"
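One way to make the retention schedule and legal hold rule testable is to express them as a single deletion decision. The sketch below assumes hypothetical data type names and placeholder day counts; the real values come from your DPAs and legal advice.

```python
from datetime import datetime, timedelta, timezone

# Placeholder schedule in days per data type; values are assumptions.
RETENTION_DAYS = {
    "prompt_redacted": 90,
    "output_redacted": 90,
    "tool_log": 30,
}


def should_delete(data_type, created_at, now, on_legal_hold):
    """Decide whether a record is due for deletion under the schedule."""
    if on_legal_hold:
        return False  # a hold pauses deletion explicitly, with no silent exceptions
    if data_type not in RETENTION_DAYS:
        # Unknown types fail loudly so they are added to the schedule on purpose.
        raise KeyError(f"no retention rule for data type: {data_type}")
    return now - created_at > timedelta(days=RETENTION_DAYS[data_type])


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
print(should_delete("prompt_redacted", now - timedelta(days=91), now, False))  # True
print(should_delete("prompt_redacted", now - timedelta(days=91), now, True))   # False
```

A quarterly deletion test can then assert on this function and on the job that calls it, rather than on the policy document.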
Indexes, embeddings, and derived data
Vector indexes store **embeddings (OpenAI embeddings guide) derived from source documents**. Even when raw text is not stored in the index, embeddings can enable reconstruction or inference in some threat models. Treat indexes as sensitive assets.
Re-index when documents change classification or are deleted. Stale chunks in an index are a retention violation if source deletion should remove access.
Separate indexes by classification and region. Do not mix ANZ HR policies with EU customer data in one searchable pool without legal review.
Backup and disaster recovery copies inherit the same retention and access rules as primary indexes.
- Index encryption at rest and in transit
- ACL (Azure RAG concepts) sync job from source systems on schedule
- Tombstone or purge pipeline when source document deleted
- Inventory: which indexes exist, owner, classification, region
- Common mistake: dev index copied from prod without scrubbing
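The tombstone or purge step can be sketched as a reconciliation between the index and the source system. The data shapes here are assumptions for illustration: a dictionary of chunk IDs to `(source_id, classification)` pairs, and a dictionary of live source IDs to their current classification.

```python
def stale_chunk_ids(index_chunks, live_sources):
    """Return chunk IDs to purge: source deleted or classification changed.

    index_chunks: dict of chunk_id -> (source_id, classification_at_index_time)
    live_sources: dict of source_id -> current classification
    """
    stale = []
    for chunk_id, (source_id, indexed_class) in index_chunks.items():
        current_class = live_sources.get(source_id)
        if current_class is None:
            stale.append(chunk_id)  # source deleted: chunk must not stay searchable
        elif current_class != indexed_class:
            stale.append(chunk_id)  # reclassified: purge, then re-index in the right pool
    return sorted(stale)


index_chunks = {
    "c1": ("doc-a", "internal"),
    "c2": ("doc-b", "internal"),
    "c3": ("doc-c", "public"),
}
live_sources = {"doc-a": "internal", "doc-b": "restricted"}
print(stale_chunk_ids(index_chunks, live_sources))  # ['c2', 'c3']
```

The same check applies to backups and any dev copy of the index, since they inherit the retention and access rules of the primary.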
Regional residency and cross-border flows
Data residency commitments often drive cloud anchor choice. Document which regions host models, search, logs, and backups. User travel and remote work can complicate jurisdiction assumptions.
Cross-border transfer requires legal mechanism: standard contractual clauses, adequacy decisions, or binding corporate rules. Engineering cannot resolve this alone.
Failover to another region for availability may violate residency if not disclosed. Gateway failover rules need legal review, not only SRE review.
Australian organisations often require APAC regions with a clear subprocessor list. Publish the region map in the security evidence pack (review pack guide).
- Region per component: inference, index, logs, object storage
- Subprocessor (production readiness conversation) register updated when provider changes data handling
- Block routing to non-approved regions in gateway config
- Scenario: EU employee queries ANZ-only index, what happens?
- What good looks like: architecture diagram with region labels
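Blocking non-approved regions, including during failover, can be sketched as an allow-list that fails closed. The component names, region names, and the `pick_region` function are hypothetical examples; an actual gateway would express this in its own configuration format.

```python
# Hypothetical allow-list: approved regions per component, agreed with legal.
APPROVED_REGIONS = {
    "inference": frozenset({"australiaeast", "australiasoutheast"}),
    "index": frozenset({"australiaeast"}),
    "logs": frozenset({"australiaeast"}),
}


class RegionNotApproved(Exception):
    """Raised when no approved region is available for a component."""


def pick_region(component, healthy_regions):
    """Return the first healthy region that is approved, or fail closed."""
    allowed = APPROVED_REGIONS.get(component, frozenset())
    for region in healthy_regions:
        if region in allowed:
            return region
    # Failing closed trades availability for residency; that is the point.
    raise RegionNotApproved(f"{component}: no approved region among {healthy_regions}")


print(pick_region("inference", ["westeurope", "australiasoutheast"]))  # australiasoutheast
```

With this shape, a failover to an undisclosed region becomes an outage to investigate rather than a silent residency breach.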
Provider training and subprocessors
Vendor questionnaires ask whether customer content is used to train foundation models. Answers vary by product tier, configuration, and date. Version your answers when providers update terms.
Maintain a **subprocessor (production readiness conversation) list**: model providers, embedding services, safety APIs, observability (AI SDK telemetry) vendors, and log storage. Procurement and privacy rely on the same register.
Enterprise agreements may add contractual terms beyond public policies. Track which workloads are covered by which agreement.
When switching models via gateway, subprocessors change. Treat model promotion as a privacy change requiring review if data handling differs.
- Document opt-out or zero-retention flags per environment
- Annual review of provider trust centre and DPA (Microsoft Copilot data protection)
- Notify privacy when adding new tool integration
- Common mistake: assuming all Azure OpenAI configs behave identically
Scenario: HR policy Q&A without overexposure
A bank pilots HR policy Q&A for 2,000 staff. Corpus excludes individual case files and performance reviews. Only published policies with effective dates enter the index.
Users receive notice that questions may be logged in redacted form for 90 days. PII redaction (Azure Content Safety) runs on outbound prompts. Escalation to HR advisers for personal cases is mandatory when questions include individual identifiers.
Legal accepts pilot because citations tie to approved documents and restricted data never entered the index. Scale requires quarterly ACL (Azure RAG concepts) sync and privacy impact assessment update.
Lesson: narrow corpus and clear escalation beat broad "ask anything HR" scope.
- Corpus: published policies only, versioned
- No individual employee records in index
- Redaction plus escalation to HR advisers for personal cases
- Retention: 90-day redacted logs, aggregated metrics longer
Scenario: CRM assist with customer data
A telco builds **CRM (AI SDK agents) research assist** for account managers. Customer names and account numbers appear in tool responses. Logs scrub arguments; only internal user IDs correlate sessions.
Customer PII never enters the general copilot index. CRM (AI SDK agents) tool reads live with OAuth scoped to the user’s accounts. Outputs stay inside CRM UI, not emailed externally by default.
Retention on CRM (AI SDK agents) tool audit logs follows customer record policy, often seven years. The ephemeral chat layer retains redacted logs for 30 days.
Privacy sign-off requires DPIA referencing both CRM (AI SDK agents) DPA (Microsoft Copilot data protection) and model provider DPA.
- Live CRM (AI SDK agents) read, no bulk export to index
- Scoped OAuth per user, not service account god mode
- Separate retention tiers for chat vs CRM (AI SDK agents) audit
- Block copy-to-clipboard external share without DLP (Microsoft Copilot data protection) where required
Access control and logging visibility
Who may read conversation logs, index contents, and eval datasets? Restrict to break-glass roles with MFA (OWASP LLM Top 10) and audit trail.
Support staff debugging production need redacted views by default. Full prompt access requires ticket and manager approval.
Champions and sponsors should not browse employee chats casually. Programmes lose trust quickly when measurement feels like surveillance.
Align access model with existing SIEM (AI SDK telemetry) and ticketing roles rather than inventing parallel admin groups.
- Role matrix: user, support, admin, auditor, break-glass
- MFA (OWASP LLM Top 10) for log and index admin access
- Audit log of who viewed which conversation record
- Workshop question: "Who should never see full prompts?"
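The role matrix and audit trail can be sketched as one authorisation function that records every view decision. The role names follow the list above; the decision rules (which roles get redacted views, and what break-glass needs for full access) are assumptions to be replaced by your own matrix.

```python
from datetime import datetime, timezone

# Assumption: these roles see redacted views by default.
REDACTED_ROLES = {"support", "admin", "auditor"}

audit_log = []  # in practice this would go to the existing SIEM, not a list


def authorise_view(viewer, role, record_id, mfa_verified,
                   ticket_id=None, manager_approved=False):
    """Return 'full', 'redacted', or 'deny' and record who asked for what."""
    if not mfa_verified:
        decision = "deny"
    elif role == "break-glass" and ticket_id and manager_approved:
        decision = "full"  # full prompts need a ticket and manager approval
    elif role in REDACTED_ROLES or role == "break-glass":
        decision = "redacted"
    else:
        decision = "deny"  # ordinary users do not browse conversation logs
    audit_log.append({
        "at": datetime.now(timezone.utc).isoformat(),
        "viewer": viewer,
        "role": role,
        "record_id": record_id,
        "ticket_id": ticket_id,
        "decision": decision,
    })
    return decision
```

Denied attempts are logged as well as granted ones, which gives auditors a view of who tried to read conversation records, not only who succeeded.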
Content safety and harmful outputs
Privacy intersects safety when outputs expose third-party personal data or confidential material from retrieved documents. Content safety (Azure Content Safety) filters reduce harm but do not replace access control.
Configure safety thresholds with legal for regulated industries. Log safety scores without storing blocked harmful content verbatim when possible.
Incident response for privacy breach via model output mirrors traditional data breach playbooks: contain, notify, preserve evidence, remediate index or prompt path.
Azure Content Safety (overview) and similar services are subprocessors with their own data handling terms; see the Azure Content Safety overview.
- Input and output filtering for harassment and leakage patterns
- Refuse when retrieval would expose wrong ACL (Azure RAG concepts) document
- Breach runbook linked from production readiness conversation guide
- Test: prompt injection (OpenAI mitigations) attempting to exfiltrate other users' data
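ACL-aware retrieval can be sketched as a filter applied to retrieved chunks before any text reaches the prompt. The `Chunk` shape and group names are hypothetical; the assumption is that each chunk carries the groups synced from the source system's ACL.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    text: str
    allowed_groups: frozenset  # groups synced from the source system ACL


def filter_by_acl(chunks, user_groups):
    """Keep only chunks the user could open in the source system."""
    groups = set(user_groups)
    # A chunk with no allowed groups matches nobody, so it is denied by default.
    return [c for c in chunks if c.allowed_groups & groups]


chunks = [
    Chunk("policy-1", "Leave policy text", frozenset({"all-staff"})),
    Chunk("case-9", "HR case notes", frozenset({"hr-advisers"})),
    Chunk("orphan", "No ACL synced", frozenset()),
]
visible = filter_by_acl(chunks, ["all-staff"])
print([c.doc_id for c in visible])  # ['policy-1']
```

Filtering after retrieval and before prompt assembly means a prompt injection cannot talk the model into revealing a document it never received.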
Vendor due diligence and evidence pack
Security and privacy reviews ask for data-flow diagrams, retention schedules, **subprocessor (production readiness conversation) lists, and sample redacted logs**. Prepare these once and version per release.
Answer questionnaires with specific configuration facts: region, retention days, training opt-out status, encryption modes. Avoid generic "we use Azure" responses.
Link controls to demo patterns in vendor documentation (Azure AI Foundry documentation) as reference implementations, not proof your production config is compliant.
When auditors visit, show deletion job success metrics and ACL (Azure RAG concepts) sync lag dashboards, not only policy PDFs.
- Artefact: data-flow diagram (Microsoft Copilot data protection) with classification colours
- Artefact: retention table by data type
- Artefact: subprocessor (production readiness conversation) register with review date
- Artefact: sample redacted log line with field legend
- Checklist: DPA (Microsoft Copilot data protection) signed before prod customer data
Pilot minimum vs production privacy bar
Pilots may use synthetic or anonymised data and smaller cohorts, but should still implement classification, ACL (Azure RAG concepts)-aware retrieval, and documented retention. "Pilot" is not an excuse for production customer PII in dev tenants.
Production bar adds tested deletion, legal hold integration, DPIA (Microsoft Copilot data protection) or PIA on file, user notice, and quarterly access reviews.
Graduating pilot to production triggers privacy change assessment if scope, region, or data classes expand.
Stop rules (NIST AI RMF) from the scoping guide apply: if privacy blockers cannot close in two weeks, pause rather than bypass.
- Pilot minimum: classified corpus, ACL (Azure RAG concepts) sync, redaction on send, 90-day log cap
- Production: deletion jobs, hold process, notices, access reviews, DPIA (Microsoft Copilot data protection)
- Never: personal API keys (OWASP LLM Top 10) with customer data
- Never: skip notice because "internal only"
- What good looks like: privacy sign-off recorded in council minutes