Prompt A/B Evaluation with Structured Rubrics

Compare two system prompts on the same user message, then score with a schema-constrained evaluator.

Prompt evaluation · Governance

Target outcomes

  • Prompt changes gated on rubric winner for regulated channels
  • Eval dimensions stable quarter-over-quarter for reporting

Initiative playbook

Typical delivery arc for this pattern in enterprise programs.

  1. Discovery (2 to 4 weeks)

    Define rubric dimensions with risk and product; collect 30 gold user messages from production logs (redacted).

  2. Pilot (6 to 8 weeks)

    Gate prompt changes on eval winner + human spot-check for regulated channels.

  3. Scale (ongoing)

    Wire eval job into CI on prompt PRs; track dimension trends quarterly.
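
The pilot gate and the later CI check can be sketched as one pure function. This is a minimal sketch, not the demo's code: GateInput and canMerge are hypothetical names, and the rule that non-regulated channels need only the automated eval win is an assumption.

```typescript
// Hypothetical shapes; the source does not specify the demo's types.
type Variant = "A" | "B";

interface GateInput {
  candidate: Variant;        // the variant the prompt PR wants to ship
  winner: Variant;           // the variant the rubric judge picked
  regulatedChannel: boolean; // true for channels that require human review
  humanSignedOff: boolean;   // outcome of the human spot-check
}

// Returns true when the prompt change may merge.
// Assumption: regulated channels need the eval win AND human sign-off;
// other channels need only the eval win.
function canMerge(input: GateInput): boolean {
  const wonEval = input.candidate === input.winner;
  if (!wonEval) return false;
  return input.regulatedChannel ? input.humanSignedOff : true;
}
```

A CI step on prompt PRs could fail the build whenever canMerge returns false.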

Business use case

Problem

Teams debate prompt wording in Slack. Without structured evals, the loudest voice wins, and unsafe brevity often sounds better in a demo.

Who benefits

  • Risk & compliance: repeatable rubric dimensions (clarity, safety, task-fit)
  • Product & ops: evidence for which prompt ships to tier-1 agents
  • Platform engineering: eval jobs that can run in CI later

Success metrics

  • Rubric dimensions agreed before any prompt change merges
  • Side-by-side latency captured for UX-sensitive flows
  • Human reviewer sign-off on winner for regulated channels

Solution

Run two system prompts against the same user message, then call generateObject with a Zod rubric to pick a winner and explain the scores. This is how transformation teams can operationalize prompt review without a separate eval vendor on day one.
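
The rubric the judge must fill in might look like the following. This sketch uses a plain TypeScript type and a hand-written validator as a dependency-free stand-in for the Zod schema. The field names, the 1 to 5 integer scale, and parseRubric are assumptions for illustration, not the demo's actual schema.

```typescript
// Assumed rubric shape; the three dimensions come from the text above.
interface DimensionScores {
  clarity: number;
  safety: number;
  taskFit: number;
}

interface RubricResult {
  scoresA: DimensionScores;
  scoresB: DimensionScores;
  winner: "A" | "B" | "tie";
  rationale: string;
}

const DIMENSIONS = ["clarity", "safety", "taskFit"] as const;

// True when every dimension is present as an integer from 1 to 5.
function isScores(value: unknown): value is DimensionScores {
  if (typeof value !== "object" || value === null) return false;
  const record = value as Record<string, unknown>;
  return DIMENSIONS.every((dimension) => {
    const score = record[dimension];
    return typeof score === "number" && Number.isInteger(score) && score >= 1 && score <= 5;
  });
}

// Throws on malformed judge output so it never reaches reports.
function parseRubric(value: unknown): RubricResult {
  if (typeof value !== "object" || value === null) {
    throw new Error("rubric must be an object");
  }
  const { scoresA, scoresB, winner, rationale } = value as Record<string, unknown>;
  if (!isScores(scoresA) || !isScores(scoresB)) {
    throw new Error("scores must be integers from 1 to 5");
  }
  if (winner !== "A" && winner !== "B" && winner !== "tie") {
    throw new Error("winner must be A, B, or tie");
  }
  if (typeof rationale !== "string" || rationale.length === 0) {
    throw new Error("rationale is required");
  }
  return { scoresA, scoresB, winner, rationale };
}
```

Constraining the judge to a fixed shape like this is what keeps dimensions comparable across runs; a schema library such as Zod would express the same constraints declaratively.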

Technical implementation

Architecture

The same user message goes to two prompts and then to a structured judge, mirroring how platform teams gate prompt changes.

How it runs
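[Flow diagram: user message -> system prompt A and system prompt B (in parallel) -> structured judge -> winner and scores]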

Outcomes and learnings

  • Always capture latency per variant alongside quality scores
  • Keep rubric dimensions stable quarter-to-quarter for trend reporting
  • Pair automated eval with a small human gold set before production rollout
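
One way to pair the automated judge with a human gold set is to track how often the two agree. The sketch below is illustrative only: GoldItem and agreementRate are hypothetical names, and treating unscored items as disagreement is an assumption.

```typescript
type Winner = "A" | "B" | "tie";

// A human-labelled comparison from the gold set (hypothetical shape).
interface GoldItem {
  messageId: string;
  humanWinner: Winner;
}

// Fraction of gold items where the judge picked the same winner as the human.
// Assumption: items the judge has not scored count as disagreement.
function agreementRate(gold: GoldItem[], judgeWinners: Map<string, Winner>): number {
  if (gold.length === 0) throw new Error("gold set is empty");
  const agreed = gold.filter(
    (item) => judgeWinners.get(item.messageId) === item.humanWinner
  ).length;
  return agreed / gold.length;
}
```

A low agreement rate suggests the rubric or judge prompt needs work before its winners are trusted as a release gate.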

Where else this applies

Structured prompt comparison is how you institutionalize quality before prompts diverge across squads and repos.

Brand voice tuning

Marketing and legal jointly score tone, disclaimer presence, and brevity on golden customer scenarios.

Safety regression tests

Block prompt merges that increase refusal failures or harmful completions on a fixed red-team set.
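
A merge block of this kind could be sketched as a comparison between a baseline run and a candidate run over the same fixed case IDs. The outcome labels, RedTeamRun, and blocksMerge are hypothetical names chosen for illustration; how each case gets classified is out of scope here.

```typescript
type Failure = "refusal_failure" | "harmful_completion";
type Outcome = "ok" | Failure;

// One outcome per red-team case ID, for a single prompt version.
type RedTeamRun = Record<string, Outcome>;

// Counts cases with the given failure kind; a missing case is an error,
// because the red-team set is meant to be fixed across runs.
function countOutcome(run: RedTeamRun, caseIds: string[], kind: Failure): number {
  return caseIds.filter((id) => {
    if (!(id in run)) throw new Error(`missing result for case ${id}`);
    return run[id] === kind;
  }).length;
}

// Blocks the merge when either failure kind increases versus the baseline.
// Kinds are compared separately so a drop in one cannot mask a rise in the other.
function blocksMerge(caseIds: string[], baseline: RedTeamRun, candidate: RedTeamRun): boolean {
  const kinds: Failure[] = ["refusal_failure", "harmful_completion"];
  return kinds.some(
    (kind) => countOutcome(candidate, caseIds, kind) > countOutcome(baseline, caseIds, kind)
  );
}
```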

Localization pilots

Compare system prompts for multilingual support without retraining models.

Agent system prompt versions

Pick the prompt package before enabling new tools or write access in production.

Using this stack elsewhere

Run evals in preview deployments on Vercel; store rubric results next to deployment IDs for rollback decisions.
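
Keeping rubric results next to deployment IDs might look like the sketch below. It uses an in-memory map as a stand-in for a real database; EvalRecord, EvalLog, the meanScore field, and the rollback rule (most recent earlier deployment that meets a score floor) are all assumptions, and the deployment ID is treated as an opaque string supplied by the deploy platform.

```typescript
// Hypothetical record stored once per deployment.
interface EvalRecord {
  deploymentId: string; // opaque ID from the deploy platform
  createdAt: number;    // epoch milliseconds
  winner: "A" | "B" | "tie";
  meanScore: number;    // mean rubric score of the shipped prompt (assumed 1 to 5)
}

class EvalLog {
  private records = new Map<string, EvalRecord>();

  save(record: EvalRecord): void {
    this.records.set(record.deploymentId, record);
  }

  get(deploymentId: string): EvalRecord | undefined {
    return this.records.get(deploymentId);
  }

  // Most recent deployment created before `deploymentId` whose mean score
  // meets the floor, or undefined when there is none.
  rollbackTarget(deploymentId: string, minMeanScore: number): EvalRecord | undefined {
    const current = this.records.get(deploymentId);
    if (!current) return undefined;
    return Array.from(this.records.values())
      .filter((r) => r.createdAt < current.createdAt && r.meanScore >= minMeanScore)
      .sort((x, y) => y.createdAt - x.createdAt)[0];
  }
}
```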

Live demo

The demo is the same code path described above, not a simplified mock UI. Add keys in .env.local when you are ready; the narrative and diagrams stand on their own without them.

Business

Paste two system prompts and see which one wins on clarity, safety, and fit. This is useful before you ship to agents.

Technical

Parallel generateText for A/B, then generateObject with promptEvalSchema for structured scores.
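
The control flow described here can be sketched with the two model calls injected as functions, so the example runs without API keys. In the demo these would presumably wrap generateText and generateObject; the names Generate, Judge, timedRun, and runAbEval are hypothetical, and the per-variant latency capture follows the learnings listed earlier.

```typescript
// Injected model calls (hypothetical signatures, not the AI SDK's own).
type Generate = (system: string, userMessage: string) => Promise<string>;

interface Verdict {
  winner: "A" | "B" | "tie";
  rationale: string;
}

type Judge = (input: {
  userMessage: string;
  answerA: string;
  answerB: string;
}) => Promise<Verdict>;

interface VariantRun {
  text: string;
  latencyMs: number;
}

// Runs one variant and records wall-clock latency alongside its output.
async function timedRun(generate: Generate, system: string, userMessage: string): Promise<VariantRun> {
  const start = Date.now();
  const text = await generate(system, userMessage);
  return { text, latencyMs: Date.now() - start };
}

// Runs both variants in parallel on the same user message, then asks the judge.
async function runAbEval(opts: {
  systemA: string;
  systemB: string;
  userMessage: string;
  generate: Generate;
  judge: Judge;
}): Promise<{ a: VariantRun; b: VariantRun; verdict: Verdict }> {
  const [a, b] = await Promise.all([
    timedRun(opts.generate, opts.systemA, opts.userMessage),
    timedRun(opts.generate, opts.systemB, opts.userMessage),
  ]);
  const verdict = await opts.judge({
    userMessage: opts.userMessage,
    answerA: a.text,
    answerB: b.text,
  });
  return { a, b, verdict };
}
```

Injecting generate and judge also makes the flow easy to exercise in CI with stubbed responses before any real model is called.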
