Prompt A/B Evaluation with Structured Rubrics

Case study

Architecture, governance, and how to adapt this pattern in a pilot

Business use case

Problem

Teams debate prompt wording in Slack. Without structured evals, the loudest voice wins, and unsafe brevity often sounds better in a demo.

Who benefits

Risk & compliance: repeatable rubric dimensions (clarity, safety, task-fit)
Product & ops: evidence for which prompt ships to tier-1 agents
Platform engineering: eval jobs that can run in CI later

Success metrics

Rubric dimensions agreed before any prompt change merges
Side-by-side latency captured for UX-sensitive flows
Human reviewer sign-off on winner for regulated channels

Solution

Run two system prompts against the same user message, then call generateObject with a Zod rubric to pick a winner and explain the scores. This is how transformation teams operationalize prompt review without a separate eval vendor on day one.
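A minimal sketch of what the judge's structured verdict might look like, written in dependency-free TypeScript so it runs on its own. In a real pipeline, the Zod schema passed to generateObject would play the role of the hand-written validator here. The dimension names come from the rubric listed under "Who benefits"; the 1-5 scale, the Verdict shape, and parseVerdict are assumptions for illustration.

```typescript
// Rubric dimensions named in this case study; the 1-5 scale is an assumption.
const DIMENSIONS = ["clarity", "safety", "taskFit"] as const;
type Dimension = (typeof DIMENSIONS)[number];
type Scores = Record<Dimension, number>;

// Hypothetical verdict shape a judge model would be asked to return.
interface Verdict {
  scoresA: Scores;
  scoresB: Scores;
  winner: "A" | "B" | "tie";
  rationale: string;
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null;
}

function parseScores(value: unknown, label: string): Scores {
  if (!isRecord(value)) throw new Error(`${label} must be an object`);
  const scores = {} as Scores;
  for (const dim of DIMENSIONS) {
    const score = value[dim];
    if (typeof score !== "number" || !Number.isInteger(score) || score < 1 || score > 5) {
      throw new Error(`${label}.${dim} must be an integer from 1 to 5`);
    }
    scores[dim] = score;
  }
  return scores;
}

// Validate untrusted judge output before anything downstream relies on it.
function parseVerdict(raw: unknown): Verdict {
  if (!isRecord(raw)) throw new Error("verdict must be an object");
  const { winner, rationale } = raw;
  if (winner !== "A" && winner !== "B" && winner !== "tie") {
    throw new Error("winner must be 'A', 'B', or 'tie'");
  }
  if (typeof rationale !== "string" || rationale.trim() === "") {
    throw new Error("rationale must be a non-empty string");
  }
  return {
    scoresA: parseScores(raw.scoresA, "scoresA"),
    scoresB: parseScores(raw.scoresB, "scoresB"),
    winner,
    rationale,
  };
}
```

Requiring a non-empty rationale keeps the "explain scores" part of the verdict from being silently dropped, which matters when a human reviewer later audits the winner.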

Technical implementation

Architecture

Same user message, two prompts, then a structured judge: this mirrors how platform teams gate prompt changes.

How it runs

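The flow can be sketched in code: the same user message goes to both prompt variants, latency is recorded for each, and both outputs are handed to a judge. ModelCall, Judge, and runAB are hypothetical names; the model and judge are injected as functions so the sketch runs without any provider SDK.

```typescript
// Hypothetical function types: swap in real model and judge calls in a pilot.
type ModelCall = (systemPrompt: string, userMessage: string) => Promise<string>;
type Judge = (
  userMessage: string,
  outputA: string,
  outputB: string,
) => Promise<"A" | "B" | "tie">;

interface VariantRun {
  output: string;
  latencyMs: number;
}

interface ABResult {
  a: VariantRun;
  b: VariantRun;
  winner: "A" | "B" | "tie";
}

// Time a single variant; the clock is injectable so tests stay deterministic.
async function timedRun(
  call: ModelCall,
  systemPrompt: string,
  userMessage: string,
  now: () => number,
): Promise<VariantRun> {
  const start = now();
  const output = await call(systemPrompt, userMessage);
  return { output, latencyMs: now() - start };
}

async function runAB(
  call: ModelCall,
  judge: Judge,
  promptA: string,
  promptB: string,
  userMessage: string,
  now: () => number = Date.now,
): Promise<ABResult> {
  // Run sequentially so one variant's latency is not skewed by the other.
  const a = await timedRun(call, promptA, userMessage, now);
  const b = await timedRun(call, promptB, userMessage, now);
  const winner = await judge(userMessage, a.output, b.output);
  return { a, b, winner };
}
```

Running the variants one after the other is a design choice made here for cleaner latency numbers; running them in parallel would be faster but could let the two calls contend for the same rate limit.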

Outcomes and learnings

Always capture latency per variant alongside quality scores
Keep rubric dimensions stable quarter-to-quarter for trend reporting
Pair automated eval with a small human gold set before production rollout
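One way to pair the automated judge with a human gold set is to measure how often the two agree before trusting the judge as a gate. A minimal sketch; GoldItem, agreementRate, and disagreements are hypothetical names, and any pass threshold applied to the rate is left to the team.

```typescript
type Winner = "A" | "B" | "tie";

// Hypothetical record: one gold message labelled by a human and by the judge.
interface GoldItem {
  messageId: string;
  humanWinner: Winner;
  judgeWinner: Winner;
}

// Fraction of gold items where the automated judge matches the human label.
function agreementRate(items: GoldItem[]): number {
  if (items.length === 0) throw new Error("gold set is empty");
  const matches = items.filter((item) => item.humanWinner === item.judgeWinner).length;
  return matches / items.length;
}

// Message IDs to route back to reviewers when judge and human disagree.
function disagreements(items: GoldItem[]): string[] {
  return items
    .filter((item) => item.humanWinner !== item.judgeWinner)
    .map((item) => item.messageId);
}
```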

Delivery playbook

Discovery → pilot → scale

1. Discovery (2–4 wks): Define rubric dimensions with risk and product; collect 30 gold user messages from production logs (redacted).
2. Pilot (6–8 wks): Gate prompt changes on eval winner + human spot-check for regulated channels.
3. Scale (ongoing): Wire eval job into CI on prompt PRs; track dimension trends quarterly.
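The pilot-stage gate can be sketched as a pure function that a CI job could call later. GateInput, canMerge, and the rule that the candidate prompt must win outright (a tie does not pass) are assumptions for illustration.

```typescript
interface GateInput {
  // Which prompt the structured judge preferred across the eval set.
  evalWinner: "candidate" | "baseline" | "tie";
  regulatedChannel: boolean;
  humanSignOff: boolean;
}

interface GateDecision {
  allowed: boolean;
  reason: string;
}

function canMerge(input: GateInput): GateDecision {
  // Assumption: the candidate must win outright; ties keep the baseline.
  if (input.evalWinner !== "candidate") {
    return { allowed: false, reason: `eval winner was ${input.evalWinner}` };
  }
  // Regulated channels additionally need a human reviewer's sign-off.
  if (input.regulatedChannel && !input.humanSignOff) {
    return { allowed: false, reason: "regulated channel requires human sign-off" };
  }
  return { allowed: true, reason: "candidate won eval and required reviews are complete" };
}
```

Returning a reason alongside the boolean lets the CI job print why a prompt PR was blocked.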

Where else this applies

Structured prompt comparison is how you institutionalize quality before prompts diverge across squads and repos.

Brand voice tuning

Marketing and legal jointly score tone, disclaimer presence, and brevity on golden customer scenarios.

Safety regression tests

Block prompt merges that increase refusal failures or harmful completions on a fixed red-team set.
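A minimal sketch of such a merge block: compare failure counts for the baseline and candidate prompts on the same fixed red-team set and report any count that rises. SafetyCounts and safetyRegressions are hypothetical names; how a completion gets classified as a refusal failure or as harmful is outside this sketch.

```typescript
// Hypothetical summary of one prompt's results on the fixed red-team set.
interface SafetyCounts {
  refusalFailures: number;
  harmfulCompletions: number;
  totalCases: number;
}

// Returns the names of the metrics that got worse; an empty array means no regression.
function safetyRegressions(baseline: SafetyCounts, candidate: SafetyCounts): string[] {
  // Cheap sanity check that both runs cover the same number of cases.
  if (baseline.totalCases !== candidate.totalCases) {
    throw new Error("baseline and candidate must use the same red-team set");
  }
  const regressions: string[] = [];
  if (candidate.refusalFailures > baseline.refusalFailures) {
    regressions.push("refusalFailures");
  }
  if (candidate.harmfulCompletions > baseline.harmfulCompletions) {
    regressions.push("harmfulCompletions");
  }
  return regressions;
}
```

A CI step could fail the prompt PR whenever the returned array is non-empty.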

Localization pilots

Compare system prompts for multilingual support without retraining models.

Agent system prompt versions

Pick the prompt package before enabling new tools or write access in production.

Run evals in preview deployments before promoting to production; store rubric results next to deployment IDs to support rollback decisions.
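Storing rubric results next to deployment IDs could look like the sketch below. EvalRecord and EvalLog are hypothetical, and the in-memory map stands in for whatever store a team already keeps alongside its deployment metadata.

```typescript
// Hypothetical record shape: one rubric result tied to the deployment it ran against.
interface EvalRecord {
  deploymentId: string;
  promptVersion: string;
  winner: "A" | "B" | "tie";
  recordedAt: string; // ISO 8601 timestamp
}

// In-memory stand-in for a store that sits next to deployment metadata.
class EvalLog {
  private readonly byDeployment = new Map<string, EvalRecord[]>();

  add(record: EvalRecord): void {
    const existing = this.byDeployment.get(record.deploymentId) ?? [];
    this.byDeployment.set(record.deploymentId, [...existing, record]);
  }

  // Everything recorded for one deployment, in insertion order.
  forDeployment(deploymentId: string): EvalRecord[] {
    return this.byDeployment.get(deploymentId) ?? [];
  }
}
```

Keying by deployment ID means a rollback discussion can start from the rubric results of the exact build under question.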