AI Labs by The Ops Toolbox

Business sponsors | Technical leaders

Controls vs evidence

InfoSec (OWASP LLM Top 10) reviews ask for proof that controls exist, not assertions that the team is responsible.

The AI security controls guide defines what to implement. This guide packages what to submit so reviewers can trace each control to its artefact quickly.

Treat the pack as a living folder updated each release, not a one-off PowerPoint assembled before go-live (production readiness conversation).

Sponsors should know which artefacts gate production promotion in the steering committee (AI council guide).

Control statement in one sentence
Owner role and escalation contact
Artefact file name and last updated date
Link to demo or log sample
Exception register if control is partial
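The five fields above can be held as one record per control and checked for gaps before submission. The following is a minimal sketch, not a prescribed schema: the class name, field names, and the `problems` helper are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ControlEntry:
    """One control in the evidence pack; all field names are illustrative."""
    statement: str                 # control statement in one sentence
    owner_role: str
    escalation_contact: str
    artefact_file: str
    last_updated: date
    sample_link: str               # link to demo or log sample
    partial: bool = False          # True if the control is only partly in place
    exception_ref: Optional[str] = None  # exception register ID for partial controls


def problems(entry: ControlEntry) -> list[str]:
    """Return the gaps a reviewer would otherwise have to chase."""
    issues = []
    for name in ("statement", "owner_role", "escalation_contact",
                 "artefact_file", "sample_link"):
        if not getattr(entry, name).strip():
            issues.append(f"missing {name}")
    if entry.partial and not entry.exception_ref:
        issues.append("partial control without exception register entry")
    return issues
```

Running `problems` over every entry before the dry run gives a short list of missing items rather than a surprise in the review meeting.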

Questions you should expect

Reviewers will ask about data flows, subprocessors, authentication, logging, prompt injection (OpenAI mitigations), tool abuse, model change management, and incident response.

Prepare short written answers with diagrams attached. Verbal assurances do not survive turnover in security teams.

Procurement (production readiness conversation) may join with contract questions on training use and retention. Align answers with the executed DPA (Microsoft Copilot data protection).

Schedule a dry run with your internal security partner before external review to catch missing artefacts early. Use the evidence pack checklist as a script.

Where do prompts and outputs persist?
Who can access logs and indexes?
How are API keys (OWASP LLM Top 10) stored and rotated?
What happens if retrieval returns nothing?
How are models and prompts versioned?
What is the disable-tools procedure?

Architecture artefacts

Provide a data-flow diagram (Microsoft Copilot data protection), network diagram where applicable, IAM matrix (OWASP LLM Top 10), and list of third-party APIs.

Link each control on the diagram to the OWASP LLM Top 10 item it addresses or to a redacted log sample from your pilot environment.

Show trust boundaries between user browser, application, model API, search index, and approval queue (OpenAI safety best practices).

Version diagrams when you add tools or change regions. Reviewers compare submissions across releases.

Context diagram with actors and systems
Sequence diagram for propose-approve-execute
IAM matrix (OWASP LLM Top 10): role, scope, data access
Subprocessor (production readiness conversation) and region table
Environment separation: dev, test, prod

Identity and access evidence

Demonstrate SSO with Entra ID (Entra conditional access) or your corporate IdP, least-privilege app roles, and the absence of shared admin keys in application code.

Export a redacted sample of authentication logs showing user identity on each model call where applicable.

Document break-glass access for engineers and how it is time-boxed and reviewed.

Search indexes must respect the same entitlements as source systems per RAG ACL guidance. Include a test case in the pack.

SSO (Entra conditional access) configuration screenshot or runbook excerpt
Role matrix mapped to business functions
Sample JWT claims in logs, redacted
Key rotation calendar and last run date
Access review ticket closed this quarter
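The entitlement test case for search indexes can be small. The sketch below assumes each indexed document carries the groups allowed to read it in the source system and that results are trimmed after retrieval; `Doc`, `trim_by_entitlement`, and the group names are hypothetical, and your index may enforce this with a query-time filter instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str
    allowed_groups: frozenset[str]  # copied from the source system ACL at index time


def trim_by_entitlement(results: list[Doc], user_groups: set[str]) -> list[Doc]:
    """Keep only documents the user could also open in the source system."""
    return [d for d in results if d.allowed_groups & user_groups]


def test_hr_doc_hidden_from_sales_user() -> None:
    results = [
        Doc("handbook", frozenset({"all-staff"})),
        Doc("salary-bands", frozenset({"hr"})),
    ]
    visible = trim_by_entitlement(results, {"all-staff", "sales"})
    assert [d.doc_id for d in visible] == ["handbook"]
```

A passing run of a test like this, with the date and the commit it ran against, is a stronger artefact than a statement that the index respects permissions.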

Data classification and retention

State which data classes may enter prompts and which are prohibited, with examples that HR and legal have agreed on.

Attach retention schedules for prompts, outputs, logs, embeddings (OpenAI embeddings guide), and approval records.

Describe deletion and export procedures for subject rights requests.

If you redact before the model call, include before-and-after samples in the pack that show which fields were removed (reference: Azure Content Safety).

Data classification (Microsoft Copilot data protection) policy excerpt
Retention table per store
Redaction rules for PII and secrets
Index rebuild process when sources delete content
Backup and restore test date
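A before-and-after redaction sample is easier to trust when it is generated by the same rules that run in the application. The sketch below is illustrative only: the two regular expressions, the `sk-` key prefix, and the rule labels are assumptions, and real rules would come from your classification policy or a managed service.

```python
import re

# Illustrative patterns; the "sk-" secret format is a made-up example.
REDACTION_RULES = [
    ("EMAIL", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("SECRET", re.compile(r"\bsk-[A-Za-z0-9]{16,}\b")),
]


def redact(text: str) -> tuple[str, list[str]]:
    """Return the redacted text plus the labels of the rules that fired."""
    fired = []
    for label, pattern in REDACTION_RULES:
        text, count = pattern.subn(f"[{label}]", text)
        if count:
            fired.append(label)
    return text, fired


before = "Contact jane.doe@example.com, key sk-abcdefghijklmnop1234"
after, fired = redact(before)
print(after)   # Contact [EMAIL], key [SECRET]
print(fired)   # ['EMAIL', 'SECRET']
```

Storing the list of fired rules alongside the sample lets a reviewer see which rule removed which field without reading the patterns.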

Safety and abuse

Document input and output filtering, rate limits, and tool allow lists.

Run a lightweight penetration test (OWASP LLM Top 10) on tool endpoints, not only the chat user interface.

Include prompt injection (OpenAI mitigations) test results and what changed after remediation.

Show human approval samples for write actions, with audit trail exports (OpenAI safety best practices).

Content filter configuration export
Rate limit and abuse alert thresholds
Tool allow list file in version control
Red-team summary with severity counts
HITL (OpenAI safety best practices) approval screenshot and log extract
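A tool allow list with human approval for write actions can be expressed in a few lines, which also makes it easy to export as evidence. This is a sketch under stated assumptions: the tool names, the `writes` flag, and the `check_tool_call` function are hypothetical, and in practice the list would be loaded from the version-controlled file rather than defined inline.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical allow list; normally loaded from a file in version control.
ALLOW_LIST = {
    "search_tickets": {"writes": False},
    "update_ticket": {"writes": True},
}


@dataclass
class Decision:
    allowed: bool
    reason: str


def check_tool_call(tool: str, approved_by: Optional[str] = None) -> Decision:
    """Reject unlisted tools; require a named approver for write tools."""
    spec = ALLOW_LIST.get(tool)
    if spec is None:
        return Decision(False, "tool not on allow list")
    if spec["writes"] and not approved_by:
        return Decision(False, "write action needs human approval")
    return Decision(True, "ok")
```

Logging each `Decision` with its reason produces the audit extract reviewers ask for, including the denied calls that show the control working.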

Control-to-artefact matrix

Attach one primary artefact per control so reviewers do not chase scattered Confluence pages.

Use a spreadsheet or table with columns for control ID, owner, artefact link, and test date.

Mark partial controls honestly with compensating measures and target dates.

Steering committees promote pilots when critical controls are green or formally risk-accepted with a named owner, in line with the NIST AI RMF Govern function.

Identity: SSO (Entra conditional access) config, role matrix, sample auth logs
Data: classification policy, index ACL (Azure RAG concepts) diagram, retention schedule (Microsoft Copilot data protection)
Injection and tools: red-team summary, allow list export, HITL (OpenAI safety best practices) audit sample
Runtime: dashboard link, rate-limit config, incident runbook drill date
Change: CI eval report, model version (production readiness conversation) register, rollback record
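The promotion rule in this section (critical controls green, or risk accepted with an owner) can be checked mechanically from the matrix. The sketch below assumes three status values and a free-text owner field; the names `ControlStatus` and `blocking_controls` and the control IDs are illustrative, not part of any standard.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ControlStatus:
    control_id: str
    critical: bool
    status: str                            # assumed values: "green", "partial", "red"
    risk_accepted_by: Optional[str] = None  # named owner who accepted the risk


def blocking_controls(rows: list[ControlStatus]) -> list[str]:
    """Critical controls that are neither green nor risk-accepted by a named owner."""
    return [
        r.control_id
        for r in rows
        if r.critical and r.status != "green" and not r.risk_accepted_by
    ]


matrix = [
    ControlStatus("ID-01", critical=True, status="green"),
    ControlStatus("DATA-02", critical=True, status="partial", risk_accepted_by="CISO"),
    ControlStatus("INJ-03", critical=True, status="partial"),
    ControlStatus("RUN-04", critical=False, status="red"),
]
print(blocking_controls(matrix))  # ['INJ-03']
```

An empty result does not replace the committee's judgement; it only confirms that the matrix meets the stated rule.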

Logging, monitoring, and incident response

Show which events are logged: prompt hash, model version (production readiness conversation), retrieval IDs, tool proposals, approvals, and errors.

Connect logs to your SIEM or observability (AI SDK telemetry) platform with retention matching policy.

Include an incident runbook for model outage, data leak suspicion, and unsafe tool execution.

Record the date of the last tabletop exercise in the pack cover sheet.

Log field dictionary with sensitivity labels
Sample dashboard for latency, errors, spend
Alert routes to on-call and security
Incident severity definitions for AI
Post-incident template with root cause fields
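The logged events listed in this section map naturally onto one structured record per model call. The sketch below is an assumption about shape, not a required format: the field names and the `build_log_event` function are hypothetical. It stores a SHA-256 hash of the prompt rather than the prompt text, so identical prompts can be correlated without persisting their content.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Optional


def build_log_event(
    prompt: str,
    model_version: str,
    retrieval_ids: list[str],
    tool_proposals: list[str],
    approved_by: Optional[str] = None,
    error: Optional[str] = None,
) -> str:
    """Serialise one model call as a JSON log line without the raw prompt."""
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "model_version": model_version,
        "retrieval_ids": retrieval_ids,
        "tool_proposals": tool_proposals,
        "approved_by": approved_by,
        "error": error,
    }
    return json.dumps(event)
```

Note that a hash of a short or predictable prompt can be guessed by hashing candidates, so the sensitivity label for this field in the log dictionary should reflect that.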

Change management and evals

Prompts and indexes are code. Show version control, peer review, and CI eval gates before promotion.

Attach the latest golden set (OpenAI evals) eval report with pass/fail thresholds defined upfront.

Document how you roll back prompt or index changes within one business day.

Model version (production readiness conversation) changes should trigger regression evals (OpenAI evals) even when prompts are unchanged.

Git tag or release note per production deploy
Eval threshold document signed by sponsor
Rollback runbook tested this quarter
Model register with approval dates
Diff of prompt changes in last release
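A CI eval gate with thresholds defined upfront reduces to comparing the eval report against a fixed table. The sketch below is illustrative: the metric names, the threshold values, and the `gate` function are assumptions, and your eval harness will have its own report format.

```python
# Hypothetical metric names and floors; the real ones belong in the signed threshold document.
THRESHOLDS = {"groundedness": 0.90, "refusal_accuracy": 0.95}


def gate(results: dict[str, float], thresholds: dict[str, float] = THRESHOLDS) -> list[str]:
    """Return failures; an empty list means the eval gate passes."""
    failures = []
    for metric, floor in thresholds.items():
        score = results.get(metric)
        if score is None:
            failures.append(f"{metric}: missing from eval report")
        elif score < floor:
            failures.append(f"{metric}: {score:.2f} below threshold {floor:.2f}")
    return failures


print(gate({"groundedness": 0.93, "refusal_accuracy": 0.91}))
# ['refusal_accuracy: 0.91 below threshold 0.95']
```

Treating a missing metric as a failure matters for model version changes: a report that silently drops a metric should not pass the gate.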

Procurement and subprocessors

Include executed contracts or order forms with data processing terms.

List subprocessors, regions, and training use flags in a table aligned to your architecture diagram.

Note marketplace purchases versus direct enterprise agreements to avoid wrong support contacts.

Update the table when you add gateway routes or new model providers.

DPA (Microsoft Copilot data protection) and acceptable use attachments
Subprocessor (production readiness conversation) list with purpose per vendor
Region commitments and exceptions
Insurance or indemnity clauses referenced
Renewal dates and contract owners

Using this site in reviews

Examples in vendor documentation (Azure AI Foundry documentation) include implementation paths, environment variable lists, and architecture diagrams.

Use them as reference patterns to accelerate workshops. Your production configuration and IAM must still be reviewed on their own merits.

Cite example slugs in the pack index so reviewers can reproduce demos in a sandbox. Start with OWASP LLM Top 10.

Clearly label which controls are demonstrated versus planned for phase two.

Index of related example slugs per control
Sandbox tenant separate from production
Demo script with expected safe outcomes
Gap list for phase two with dates
No copy-paste of sample keys into prod

What good looks like

Good looks like a reviewer opening one folder and finding diagrams, matrices, logs, and evals without a chase thread.

Good looks like sponsors signing promotion when critical controls are evidenced, not when demos felt impressive.

Good looks like the pack updating within five business days of each production release.

Good looks like alignment between legal answers and engineering configuration per Microsoft Copilot data protection.

Cover sheet with version, owner, date
All critical controls green or risk accepted
Dry run completed with internal security
Steering committee (AI council guide) packet includes pack link
Post-go-live review scheduled at 90 days
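The five-business-day update target can be checked automatically from the cover sheet date and the release date. The sketch below assumes a Monday-to-Friday working week and ignores public holidays; the function names and the `limit` parameter are hypothetical.

```python
from datetime import date, timedelta


def business_days_between(start: date, end: date) -> int:
    """Count Monday-to-Friday days after `start`, up to and including `end`."""
    days = 0
    current = start
    while current < end:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            days += 1
    return days


def pack_overdue(release: date, pack_updated: date, today: date, limit: int = 5) -> bool:
    """True if the pack has not been updated since the release and the window has closed."""
    if pack_updated >= release:
        return False
    return business_days_between(release, today) > limit


release = date(2024, 1, 8)        # a Monday
stale_pack = date(2024, 1, 5)     # last updated before the release
print(pack_overdue(release, stale_pack, today=date(2024, 1, 15)))  # False
print(pack_overdue(release, stale_pack, today=date(2024, 1, 16)))  # True
```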

Security review evidence pack