AI Labs by The Ops Toolbox

Why measurement must be agreed before launch

Teams that measure AI only after launch argue about definitions while executives lose patience. The baseline week, primary KPI, and **quality sampling method** should be in the pilot charter before users touch the system.

Business sponsors care about operational outcomes: handle time, rework, cycle time, audit findings, and cost avoidance. Engineering cares about latency, error rates, retrieval quality, and eval regressions. Both views belong on one page updated weekly during pilots.

Measurement is not surveillance of employees. Frame metrics as workflow health and system quality, with anonymised aggregates where unions or works councils require it.

Agree metric definitions with ops analytics or finance in week one
Never report deflection without quality sampling
Separate vanity metrics (message count) from outcome metrics
Workshop question: "What would finance accept as proof of ROI?"
What good looks like: the same dashboard link appears in the steering deck every month

Business metrics that hold up in steering forums

Choose metrics tied to repeatable work the organisation already tracks. Introducing a novel "AI confidence score" without business meaning creates friction in every review.

For policy Q&A, measure average time to verified answer and escalation rate to tier-2. For ITSM triage, measure correct category assignment against supervisor spot-checks. For procurement intake, measure cycle time from submission to routed approval.

Express targets as ranges with sample sizes. "10 to 15% handle time reduction across 120 cases" is defensible. "Saved thousands of hours" without methodology is not.
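One way to arrive at a defensible range is to bootstrap the reduction in median handle time between baseline and pilot cases. The sketch below is illustrative only: the function name `reduction_range`, the 90% interval, the resample count, and the sample data are all assumptions, not an agreed methodology.

```python
import random
import statistics


def reduction_range(baseline, pilot, n_boot=2000, seed=0):
    """Percent reduction in median handle time with a 90% bootstrap interval.

    `baseline` and `pilot` are lists of handle times in the same unit.
    The interval width and resample count are illustrative assumptions.
    """
    rng = random.Random(seed)  # fixed seed so the reported range is reproducible

    def pct_reduction(b, p):
        base_median = statistics.median(b)
        return 100.0 * (base_median - statistics.median(p)) / base_median

    # Resample both groups with replacement and recompute the reduction each time.
    resampled = sorted(
        pct_reduction(
            rng.choices(baseline, k=len(baseline)),
            rng.choices(pilot, k=len(pilot)),
        )
        for _ in range(n_boot)
    )
    return {
        "point": pct_reduction(baseline, pilot),
        "low": resampled[int(0.05 * n_boot)],
        "high": resampled[int(0.95 * n_boot) - 1],
        "n_baseline": len(baseline),
        "n_pilot": len(pilot),
    }


# Hypothetical data: 120 cases per group, handle time in minutes.
baseline_minutes = [float(m) for m in range(10, 130)]
pilot_minutes = [m * 0.88 for m in baseline_minutes]
result = reduction_range(baseline_minutes, pilot_minutes)
```

Reporting `low`, `high`, and both sample sizes together keeps the claim in the "range with sample size" form that steering forums can challenge and verify.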

Include risk-adjusted metrics: override rate, incident count, empty-retrieval rate, and percentage of answers with valid citations. Speed improvements that increase errors are not success.

Handle time: start to verified answer, not chat session length
First-contact resolution for support workflows
Rework rate: tickets reopened or answers challenged
Cycle time: intake to decision for approvals
Audit findings on policy answers quarter over quarter
Cost per successful task
Common mistake: claiming deflection when users gave up

Technical metrics engineers should own

Platform teams instrument every request with **model ID, latency, token usage, retrieval hit count, safety scores, and tool invocations**. Without structured logs, business metrics cannot be explained when they move.

Track latency p50 and p95 separately. Users tolerate occasional slow answers; sustained p95 breaches erode adoption even when averages look fine.

Monitor empty retrieval rate and **citation rate** for RAG apps. A fast wrong answer often traces to retrieval failure, not model quality.
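As a rough illustration, both rates can be derived from the per-request log once retrieval hits and citations are recorded. The `RequestLog` fields and the `rag_health` helper below are hypothetical names chosen for this sketch; a real platform would map them to its own logging schema.

```python
from dataclasses import dataclass


@dataclass
class RequestLog:
    """One structured log record per request (field names are assumptions)."""
    model_id: str
    latency_ms: float
    tokens: int
    retrieval_hits: int  # number of passages returned by search
    citations: int       # number of citations in the final answer
    tool_calls: int = 0


def rag_health(logs):
    """Share of requests with no retrieved passages and share with a citation."""
    total = len(logs)
    if total == 0:
        return {"empty_retrieval_rate": 0.0, "citation_rate": 0.0}
    empty = sum(1 for record in logs if record.retrieval_hits == 0)
    cited = sum(1 for record in logs if record.citations > 0)
    return {"empty_retrieval_rate": empty / total, "citation_rate": cited / total}


# Hypothetical sample of four requests.
sample_logs = [
    RequestLog("model-a", 820.0, 950, retrieval_hits=4, citations=2),
    RequestLog("model-a", 640.0, 700, retrieval_hits=3, citations=1),
    RequestLog("model-a", 300.0, 210, retrieval_hits=0, citations=0),
    RequestLog("model-b", 910.0, 1200, retrieval_hits=5, citations=1),
]
health = rag_health(sample_logs)
```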

Export technical metrics to the same BI tool sponsors use, or provide a weekly CSV with consistent definitions. Fragmented reporting kills trust between teams.

Latency p95, error rate, timeout rate
Tokens per successful task by workflow and model
Retrieval hit rate, empty-result rate, top-k size
Tool success and failure rate with error taxonomy
Eval regression score on golden set after each prompt change
Cache hit rate where caching is enabled
What good looks like: alert when p95 doubles week over week
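The week-over-week p95 check can be sketched in a few lines. This is a minimal example that assumes latencies are available as plain lists of milliseconds per week; the nearest-rank percentile method and the function names are choices made for the illustration.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def p95_doubled(last_week_ms, this_week_ms, factor=2.0):
    """True when this week's p95 is at least `factor` times last week's."""
    return percentile(this_week_ms, 95) >= factor * percentile(last_week_ms, 95)


# Hypothetical latencies in milliseconds: only the slowest 10% of requests degrade.
last_week = list(range(1, 101))
this_week = list(range(1, 91)) + [500] * 10

median_unchanged = percentile(last_week, 50) == percentile(this_week, 50)
alert = p95_doubled(last_week, this_week)
```

In this made-up data the median is identical in both weeks while the p95 alert fires, which is the situation the separate p50 and p95 tracking is meant to expose.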

Golden sets, rubrics, and quality sampling

Agree **20 to 50 golden questions** or conversations before launch, authored by subject matter experts, not only engineers. Include paraphrases, edge cases, and questions that should yield an **unknown answer** when policy is silent.

Score answers with a **rubric**: factual correctness, citation presence, tone, refusal when appropriate, and safety. Numeric rubrics enable regression detection when prompts or indexes change.

Live user traffic needs sampling, not full manual review. Review 5 to 10% of sessions weekly, prioritising low-confidence retrievals, high token usage, and escalations.
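A prioritised weekly sample might be drawn as follows. The session dictionary keys (`low_confidence`, `high_tokens`, `escalated`) and the function name are assumptions for this sketch; how each flag is set depends on thresholds the team agrees separately.

```python
import random


def weekly_review_sample(sessions, rate=0.05, seed=0):
    """Pick roughly `rate` of sessions for human review, flagged sessions first."""
    rng = random.Random(seed)
    target = max(1, round(rate * len(sessions)))

    priority, routine = [], []
    for session in sessions:
        flagged = (
            session["low_confidence"]
            or session["high_tokens"]
            or session["escalated"]
        )
        (priority if flagged else routine).append(session)

    # Flagged sessions fill the sample first; the remainder is drawn at random.
    rng.shuffle(priority)
    chosen = priority[:target]
    remaining = target - len(chosen)
    chosen += rng.sample(routine, min(remaining, len(routine)))
    return chosen


# Hypothetical week of 100 sessions, three of which are flagged.
sessions = [
    {"id": i, "low_confidence": i == 7, "high_tokens": i == 42, "escalated": i == 90}
    for i in range(100)
]
review_queue = weekly_review_sample(sessions, rate=0.05)
```

Keeping some random routine sessions in the sample matters: a sample made only of flagged sessions cannot reveal problems the flags themselves miss.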

When quality sampling disagrees with automated evals, treat the human sample as ground truth and add failing cases to the golden set.

Golden set versioned in git alongside prompts
Rubric dimensions weighted by risk (citations weigh heavily in compliance Q&A)
Include adversarial prompts in eval set quarterly
Track unknown-answer rate as a positive signal when appropriate
Workshop question: "Which wrong answer would be unacceptable on page one of the newspaper?"
Common mistake: golden set only contains easy questions
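A risk-weighted rubric and a simple regression check could look like the sketch below. The weights, the 0-to-1 scoring scale, and the 0.02 tolerance are hypothetical values picked for illustration; each team would set its own with subject matter experts.

```python
from statistics import mean

# Hypothetical weights for a compliance Q&A workflow; citations weigh heavily.
WEIGHTS = {
    "correctness": 0.35,
    "citation": 0.30,
    "refusal": 0.15,
    "safety": 0.15,
    "tone": 0.05,
}


def weighted_score(scores, weights=WEIGHTS):
    """Weighted average of per-dimension scores, each in the range 0 to 1."""
    total = sum(weights.values())
    return sum(weights[dim] * scores[dim] for dim in weights) / total


def regressed(previous_run, candidate_run, tolerance=0.02):
    """True when the mean golden-set score drops by more than `tolerance`."""
    before = mean(weighted_score(s) for s in previous_run)
    after = mean(weighted_score(s) for s in candidate_run)
    return before - after > tolerance


# Hypothetical golden-set results before and after a prompt change.
perfect = {dim: 1.0 for dim in WEIGHTS}
missing_citation = {**perfect, "citation": 0.0}
previous_run = [perfect, perfect]
candidate_run = [perfect, missing_citation]
blocked = regressed(previous_run, candidate_run)
```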

Baselines and counterfactuals

Measure a baseline week with the old process before pilot users switch. Capture the same metric definition you will use during pilot. Seasonality matters for HR and finance cycles.

Where ethical and practical, run A/B or phased rollout: one team on AI assist, matched team on control. Not every organisation can A/B test, but phased rollout beats no comparison.
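Where a matched control team exists, one common way to separate the assist effect from background change is a difference-in-differences comparison. The sketch below uses a median-based variant; the function name and the handle-time figures are assumptions for illustration, and the result is only as good as the match between teams.

```python
from statistics import median


def difference_in_differences(pilot_before, pilot_after, control_before, control_after):
    """Change in the pilot team's median minus the change in the control team's median.

    A negative result means handle time fell more for the pilot team than for control.
    """
    pilot_change = median(pilot_after) - median(pilot_before)
    control_change = median(control_after) - median(control_before)
    return pilot_change - control_change


# Hypothetical handle times in minutes.
effect = difference_in_differences(
    pilot_before=[18, 20, 22],
    pilot_after=[14, 16, 18],
    control_before=[19, 20, 21],
    control_after=[18, 19, 20],
)
```

Here the pilot team improved by four minutes while the control team improved by one, so about three minutes of the change is attributable to the assist under these assumptions.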

Document confounders: major policy release, staffing change, or system outage during pilot weeks. Executives prefer honest caveats to silent bias.

For read-only assist, compare time-with-assist vs historical baseline, not vs hypothetical perfect automation.

Baseline: median handle time over 100+ cases
Control group: matched team or prior quarter adjusted for volume
Record external events on shared pilot timeline
What good looks like: analytics sign-off on baseline methodology

Scenario: ITSM ticket triage

A global manufacturer pilots L1 triage for ServiceNow. Golden set: 500 historical tickets with supervisor-confirmed categories. Primary metric: percentage correctly routed without reopen within 24 hours.

Engineering tracks classification latency, model route used, and override rate when agents change category. Business tracks handle time from ticket open to first human action.

Week four shows 18% handle time improvement but override rate climbs to 12%. Steering pauses scale, adds rubric weight on category accuracy, and retrains prompts before week six readout.

Final recommendation: pivot to suggest-only mode until override rate falls below 5% for two weeks.

Primary: correct routing rate on spot-checked sample
Secondary: handle time, reopen rate, cost per triaged ticket
Quality sample: 50 tickets per week reviewed by team lead
Stop rule triggered: override rate above 10% for two weeks
Lesson: speed metrics alone masked accuracy regression
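The stop and resume rules in this scenario can be encoded so they are applied the same way every week. This is a minimal sketch: the function names and return labels are hypothetical, while the 10% and 5% thresholds and the two-week window come from the scenario.

```python
def consecutive_weeks(rates, predicate, weeks=2):
    """True if the most recent `weeks` entries all satisfy `predicate`."""
    recent = rates[-weeks:]
    return len(recent) == weeks and all(predicate(r) for r in recent)


def pilot_decision(weekly_override_rates, stop_above=0.10, resume_below=0.05):
    """Apply the stop rule first, then the resume rule, to weekly override rates."""
    if consecutive_weeks(weekly_override_rates, lambda r: r > stop_above):
        return "stop_rule_triggered"
    if consecutive_weeks(weekly_override_rates, lambda r: r < resume_below):
        return "eligible_to_resume"
    return "continue"


# Hypothetical weekly override rates, oldest first.
decision = pilot_decision([0.06, 0.08, 0.11, 0.12])
```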

Scenario: HR policy Q&A

A financial services firm pilots internal policy Q&A for ANZ employees. Corpus: 180 HR policies with effective dates. Primary metric: time from question to answer with valid citation, sampled in live chat.

The golden set includes questions about leave, parental policies, and code of conduct. The rubric requires citation to a passage or an explicit unknown when policy is silent.

Citation rate holds at 91% but empty-retrieval spikes after a SharePoint migration. Technical metric alerts fire before users complain widely. Index refresh restores quality within 48 hours.

Steering approves scale to all ANZ staff with monthly index refresh SLA and continued citation sampling.

Primary: median time to cited answer
Supporting: escalation to HR adviser rate, citation rate, unknown-answer rate
Risk metric: answers without retrieval above threshold
Operational metric: index freshness lag in days
What good looks like: legal accepts citation format for audit
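The operational and risk metrics in this scenario reduce to two small checks. The sketch below is illustrative: the function names, the doubling ratio, and the 2% floor are assumptions, and the dates are made up.

```python
from datetime import date


def index_freshness_lag_days(last_refresh, today):
    """Days since the search index was last refreshed."""
    return (today - last_refresh).days


def empty_retrieval_spike(baseline_rate, current_rate, ratio=2.0, floor=0.02):
    """True when the empty-retrieval rate is above a floor and has grown by `ratio`.

    The floor avoids alerting on noise when both rates are very small.
    """
    return current_rate >= floor and current_rate >= ratio * baseline_rate


# Hypothetical values for one reporting week.
lag = index_freshness_lag_days(date(2024, 3, 1), date(2024, 3, 31))
spike = empty_retrieval_spike(baseline_rate=0.03, current_rate=0.09)
```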

Scenario: procurement vendor intake

A utilities company automates vendor questionnaire extraction from PDF submissions. Primary metric: cycle time from submission to routed approval. Quality metric: field-level accuracy on golden forms.

Engineering measures extraction schema validation failures, human correction rate per field, and tokens per document. Business measures backlog reduction and approver time saved.

Pilot shows 30% cycle time improvement but high correction rate on insurance certificates. Team pivots to human review for low-confidence fields rather than full auto-approve.

Scale recommendation includes a human-in-the-loop (HITL) queue for fields below the confidence threshold, aligning with oversight patterns.

Primary: intake cycle time median and p90
Quality: field accuracy on stratified sample
Technical: schema validation pass rate, model route per doc type
Risk: auto-approve only above confidence floor
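Routing by confidence floor might be sketched as below. The field names, the `(value, confidence)` tuple shape, and the 0.9 floor are hypothetical; the actual floor should come from field-level accuracy on the golden forms.

```python
def route_fields(extracted, floor=0.9):
    """Split extracted fields into auto-accepted values and a human review queue.

    `extracted` maps a field name to a (value, confidence) pair.
    """
    auto_accepted, review_queue = {}, {}
    for name, (value, confidence) in extracted.items():
        target = auto_accepted if confidence >= floor else review_queue
        target[name] = value
    return auto_accepted, review_queue


# Hypothetical extraction result for one vendor questionnaire.
extracted = {
    "vendor_name": ("Acme Pty Ltd", 0.98),
    "tax_id": ("12 345 678 901", 0.95),
    "insurance_expiry": ("2025-06-30", 0.61),
}
auto_accepted, review_queue = route_fields(extracted)
```

Only documents with an empty review queue would be candidates for auto-approval under this scheme; everything else waits for a person.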

Dashboards and reporting rhythm

During pilots, publish a one-page weekly report: primary business metric, top three technical health indicators, quality sample summary, incidents, and cost week to date.

In production, move to monthly steering with weekly automated alerts for anomalies only. Alert on eval regression, cost spike, override rate jump, and empty-retrieval spike.

Use consistent colour and definition legends. Changing formulas mid-program destroys comparability.

Include a "so what" sentence for executives: "Handle time improved 12% but cost per task rose 8% because users ask longer questions. Next action: context limits and prompt tuning."

Weekly pilot email: metric, sample size, trend arrow, blockers
Monthly production deck: same metrics plus incident summary
Alert thresholds documented and owned by platform team
BI dashboard: business KPI layer over technical event logs
Workshop question: "Which single chart would you show the CEO?"

Cost and value in the same frame

Finance will ask whether savings exceed token, search, and labour costs. Report **cost per successful task** alongside time saved, using the same definition of success as quality rubrics.

Break down cost drivers: model tier, context length, retrieval size, safety calls, and human review time. A cheaper model that increases rework is not a win.

Forecast steady-state cost at 2x and 5x volume. Pilots often undercount embedding refresh and log storage.

Link to the AI cost controls guide for caps and routing levers when spend trends the wrong way.

Cost per successful task = total AI spend / tasks meeting rubric
Include human review minutes in fully loaded cost
Show cost trend vs adoption trend on same chart
Common mistake: reporting token spend without task denominator
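The formula above, extended with human review time for a fully loaded figure, can be sketched as follows. The parameter names and the example amounts are assumptions; the loaded hourly rate in particular should come from finance.

```python
def cost_per_successful_task(ai_spend, review_minutes, loaded_rate_per_hour, successful_tasks):
    """Fully loaded cost divided by the number of tasks that met the rubric.

    `ai_spend` covers tokens, search, and other platform charges for the period.
    """
    if successful_tasks <= 0:
        raise ValueError("no successful tasks: cost per task is undefined")
    labour_cost = review_minutes / 60 * loaded_rate_per_hour
    return (ai_spend + labour_cost) / successful_tasks


# Hypothetical month: 1,200 in AI spend, 600 review minutes at 60 per hour, 900 successful tasks.
unit_cost = cost_per_successful_task(
    ai_spend=1200.0,
    review_minutes=600,
    loaded_rate_per_hour=60.0,
    successful_tasks=900,
)
```

Raising an error when no task met the rubric is deliberate: a zero denominator should surface in the report rather than be hidden behind a default value.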

Conversation replay and production evals

Golden sets miss phrasing drift from real users. **Conversation replay evals** sample redacted production transcripts, re-score them with rubrics, and detect regressions that golden-set prompts alone cannot catch.

Replay evals monthly in production, weekly during active prompt experiments. Feed failures back into the golden set and champion workshops.

Privacy review must approve replay storage and redaction rules before enabling. Prefer stored hashes and excerpts over full transcript retention where possible.

Pair replay evals with batch triage patterns when reviewing large backlogs of historical conversations for quality audits.

Sample size: enough for statistical signal, typically 200+ turns/month
Redact PII before storage or scoring
Tag failures: retrieval, tool, safety, tone, policy gap
Track mean rubric score over time with control limits
What good looks like: replay failures block prompt promotion until reviewed
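Control limits on the mean rubric score can be approximated with a simple mean plus or minus three standard deviations over prior periods. This is a sketch of that one convention, not a full statistical process control implementation; the function names and the score history are hypothetical.

```python
from statistics import mean, stdev


def control_limits(history, sigmas=3.0):
    """Lower and upper limits from past mean rubric scores (needs two or more points)."""
    centre = mean(history)
    spread = stdev(history)
    return centre - sigmas * spread, centre + sigmas * spread


def out_of_control(history, latest, sigmas=3.0):
    """True when the latest mean rubric score falls outside the control limits."""
    low, high = control_limits(history, sigmas)
    return latest < low or latest > high


# Hypothetical monthly mean rubric scores on a 0-to-1 scale.
history = [0.80, 0.82, 0.81, 0.79, 0.83, 0.80]
needs_review = out_of_control(history, latest=0.70)
```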

Governance metrics and accountability

Council portfolios need program health metrics: pilots completed with documented scale, pivot, or stop; reuse of shared platform patterns; open risk findings; shadow-AI reports resolved.

Assign metric owners. Business owns outcome KPIs. Platform owns technical health. Risk owns incident and override reporting. Ambiguous ownership produces stale dashboards.

When metrics disagree, run a joint review rather than letting teams cherry-pick. Engineering may show green latency while business shows rising escalations. Both are true; the story is incomplete retrieval.

Document metric definitions in a living glossary linked from the dashboard. New steering members should onboard in minutes, not weeks.

Portfolio: pilots with decision documented / pilots started
Reuse: workflows on shared gateway and eval harness
Risk: incidents, overrides, empty-retrieval spikes
Adoption: active users / invited users, not logins alone
Checklist: glossary published, owners named, alert runbooks tested

Common mistakes and what good looks like

Avoid metrics theatre: impressive charts without sample sizes or definitions. Avoid engineering-only dashboards that sponsors never open.

What good looks like: executives can explain the primary metric and last month’s trend without engineering present.

What good looks like: prompt changes trigger eval regression alerts before users notice.

What good looks like: quality sampling finds issues automated scores miss, and those cases enter the golden set within one sprint.

Mistake: measuring messages sent instead of tasks completed
Mistake: no baseline, only post-launch trend
Mistake: hiding cost increases when speed improves
Mistake: stopping measurement after pilot approval
Good: joint business-platform review every month
Good: failures classified and owned, not averaged away
