AI Labs

Conversation Replay + Regression Evals

Store golden conversations and score prompt changes before they ship.

Prompt evaluation, Governance, Enterprise

Replay a golden conversation and score it to catch regressions before they ship.

Technical notes

The route at /api/demos/vercel-replay-eval calls generateText to produce the answer, then generateObject to score that answer against the rubric.

Conversation replay + rubric eval

Run a stored golden conversation through your system prompt and score it for regressions.

Case study

Architecture, governance, and how to adapt this pattern in a pilot

Business use case

Teams change prompts in production and only discover regressions after customers complain. This pattern makes “prompt change” behave like “code change”: replay, score, and block bad merges.

Solution

Run a stored golden conversation through your system prompt, generate an answer, then use a rubric to score correctness and risk.
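The replay-then-score flow described above can be sketched as a small harness. This is a minimal illustration, not the demo's actual code: the `GoldenConversation` and `RubricScore` shapes, the 0 to 1 correctness scale, and the `replayAndScore` name are all assumptions. The two model calls are injected as plain functions so the harness runs without network access; in a real route, `generate` would presumably wrap generateText and `judge` would wrap generateObject.

```typescript
// Hypothetical data shapes; the demo's real schema is not shown in the source.
type Risk = "low" | "medium" | "high";

interface Turn {
  role: "user" | "assistant";
  content: string;
}

interface GoldenConversation {
  id: string;
  turns: Turn[];    // stored conversation to replay
  expected: string; // reference answer the rubric compares against
}

interface RubricScore {
  correctness: number; // assumed scale: 0 (wrong) to 1 (fully correct)
  risk: Risk;
  rationale: string;
}

interface ReplayResult {
  id: string;
  answer: string;
  score: RubricScore;
}

// Injected model calls (assumed signatures, not a library API).
type Generate = (systemPrompt: string, turns: Turn[]) => Promise<string>;
type Judge = (input: { expected: string; answer: string }) => Promise<RubricScore>;

async function replayAndScore(
  systemPrompt: string,
  golden: GoldenConversation,
  generate: Generate,
  judge: Judge,
): Promise<ReplayResult> {
  if (golden.turns.length === 0) {
    throw new Error(`golden conversation ${golden.id} has no turns`);
  }
  // Step 1: replay the stored conversation under the prompt being tested.
  const answer = await generate(systemPrompt, golden.turns);
  // Step 2: score the fresh answer against the stored expectation.
  const score = await judge({ expected: golden.expected, answer });
  // Reject malformed judge output instead of silently recording it.
  if (!Number.isFinite(score.correctness) || score.correctness < 0 || score.correctness > 1) {
    throw new Error(`judge returned an out-of-range correctness for ${golden.id}`);
  }
  return { id: golden.id, answer, score };
}
```

Keeping the model calls behind function parameters also makes the harness itself unit-testable with stubs, which matters once it becomes a deploy gate.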

Delivery playbook

Discovery → pilot → scale

  1. Discovery (2–4 wks)

    Create 20 to 50 golden conversations by intent; define rubric criteria and regression thresholds.

  2. Pilot (6–8 wks)

    Run replay evals on every prompt/tool change; block deploys on high-risk regressions.

  3. Scale (ongoing)

    Expand golden set by channel; trend scores over time and tie to incident reports.
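The "regression thresholds" and "block deploys" steps in the playbook can be sketched as a gate that compares a candidate run against a stored baseline. Everything here is an assumption for illustration: the `CaseScore` shape, the 0 to 1 correctness scale, the `evaluateGate` name, and the choice to flag a blocking risk level only when the baseline did not already have it.

```typescript
type Risk = "low" | "medium" | "high";

// Hypothetical per-conversation result from one replay run.
interface CaseScore {
  id: string;
  correctness: number; // assumed scale: 0 to 1
  risk: Risk;
}

interface GateConfig {
  maxCorrectnessDrop: number; // largest tolerated drop per conversation
  blockingRisks: Risk[];      // risk levels that fail the gate when newly introduced
}

interface GateDecision {
  pass: boolean;
  failures: string[];
}

function evaluateGate(
  baseline: CaseScore[],
  candidate: CaseScore[],
  config: GateConfig,
): GateDecision {
  const candidateById = new Map<string, CaseScore>();
  for (const c of candidate) candidateById.set(c.id, c);

  const failures: string[] = [];
  for (const base of baseline) {
    const next = candidateById.get(base.id);
    if (next === undefined) {
      // A golden conversation that was not replayed cannot be declared safe.
      failures.push(`${base.id}: missing from candidate run`);
      continue;
    }
    const drop = base.correctness - next.correctness;
    if (drop > config.maxCorrectnessDrop) {
      failures.push(`${base.id}: correctness dropped by ${drop.toFixed(2)}`);
    }
    // Flag only newly introduced blocking risk, so known issues do not fail every run.
    const wasBlocking = config.blockingRisks.includes(base.risk);
    const isBlocking = config.blockingRisks.includes(next.risk);
    if (isBlocking && !wasBlocking) {
      failures.push(`${base.id}: risk went from ${base.risk} to ${next.risk}`);
    }
  }
  return { pass: failures.length === 0, failures };
}
```

In CI, a wrapper script could exit non-zero when `pass` is false and print `failures` so reviewers see which golden conversations regressed.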

Where else this applies

Golden conversation replay is how you turn prompt changes into measurable quality gates. It is the cheapest way to stop silent regressions.

Prompt versioning

Block deployments when safety or grounding behaviour regresses.

Tooling changes

Replay before enabling new tools or write actions for agents.

Model swaps

Compare quality when procurement mandates a provider change.

Governance reviews

Provide evidence to risk teams beyond anecdotal demos.

This pattern works well with AI SDK routes: run it in CI, in admin tools, or during workshops when teams propose prompt edits.
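For the model-swap and governance cases above, the same golden set can be replayed under two configurations and the results summarized side by side. The sketch below assumes a hypothetical `RunScore` shape with correctness on a 0 to 1 scale; `summarizeRun` and `compareRuns` are illustrative names, not part of any library.

```typescript
// Hypothetical per-conversation result for one configuration (prompt, tools, or provider).
interface RunScore {
  caseId: string;
  correctness: number; // assumed scale: 0 to 1
  highRisk: boolean;
}

interface RunSummary {
  meanCorrectness: number;
  highRiskCount: number;
}

interface Comparison {
  current: RunSummary;
  candidate: RunSummary;
  meanDelta: number;    // candidate mean minus current mean
  worseCases: string[]; // conversations where the candidate scored lower
}

function summarizeRun(run: RunScore[]): RunSummary {
  if (run.length === 0) throw new Error("cannot summarize an empty run");
  const total = run.reduce((sum, r) => sum + r.correctness, 0);
  return {
    meanCorrectness: total / run.length,
    highRiskCount: run.filter((r) => r.highRisk).length,
  };
}

function compareRuns(current: RunScore[], candidate: RunScore[]): Comparison {
  const currentById = new Map<string, RunScore>();
  for (const r of current) currentById.set(r.caseId, r);

  // Only conversations present in both runs can be compared case by case.
  const worseCases: string[] = [];
  for (const r of candidate) {
    const before = currentById.get(r.caseId);
    if (before !== undefined && r.correctness < before.correctness) {
      worseCases.push(r.caseId);
    }
  }

  const currentSummary = summarizeRun(current);
  const candidateSummary = summarizeRun(candidate);
  return {
    current: currentSummary,
    candidate: candidateSummary,
    meanDelta: candidateSummary.meanCorrectness - currentSummary.meanCorrectness,
    worseCases,
  };
}
```

Reporting the per-case list alongside the mean matters because an unchanged average can hide a conversation that got much worse, which is the kind of evidence a risk team is likely to ask for.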