AI Labs

Decision guide

AI cost controls

Token economics, routing, and guardrails so pilots do not become open-ended invoices.

Business sponsors · Technical leaders

Why cost control is a governance topic

Generative AI spend is variable, bursty, and easy to misattribute. A successful pilot can double API bills when adoption rises or when users paste entire documents into chat. Finance needs predictability; engineering needs controls, such as a unified model gateway, that do not require emergency outages.

Cost control is not stinginess. It is accountability: who may spend, on which workflows, with what caps, and who approves exceptions. Uncapped pilots erode trust with procurement (production readiness conversation) and encourage shadow API keys (OWASP LLM Top 10).

Treat cost like safety: controls belong in the request path, with telemetry and alerts, not as a monthly surprise in a spreadsheet.

  • Variable cost per request vs fixed SaaS licences finance understands
  • Shadow keys bypass caps and chargeback (unified model gateway) entirely
  • Workshop question: "What monthly spend triggers executive review?"
  • What good looks like: cost per successful task (unified model gateway) on the steering dashboard

What finance needs to see

Forecast **cost per successful task (unified model gateway)**, not raw token totals alone. Define "successful" the same way as quality rubrics in the measuring success guide (OpenAI evals).

Provide ranges for pilot vs steady state: low, expected, high scenarios driven by adoption and context length. Explain variance drivers using measuring success metrics.

Show who pays: central platform budget, business unit showback (unified model gateway), or project code. Ambiguous funding creates late political fights.

Include non-token costs: vector index (Azure vector search), embedding refresh, content safety (Azure Content Safety) API calls, log storage, and human review (OpenAI safety best practices) time for HITL workflows.

  • Monthly forecast with assumptions documented
  • Actuals vs forecast weekly during pilot
  • Cost ceiling in charter with named approver to exceed
  • Breakdown by workflow, team tag, and model ID (production readiness conversation)
  • Common mistake: quoting list price without volume discount or region multiplier
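The low, expected, and high ranges above can be sketched as a small calculation. This is a minimal illustration, not a forecasting model: the `Scenario` fields, the task volumes, the success rates, and the per-attempt costs are all hypothetical placeholders to be replaced with tagged pilot data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    name: str
    monthly_tasks: int        # attempted tasks per month (assumed figure)
    success_rate: float       # share of attempts that pass the quality rubric
    cost_per_attempt: float   # fully loaded cost of one attempt, in dollars


def monthly_spend(s: Scenario) -> float:
    """Forecast spend for the month, successful or not."""
    return s.monthly_tasks * s.cost_per_attempt


def cost_per_successful_task(s: Scenario) -> float:
    """Total spend divided by the number of attempts that met the rubric."""
    successes = s.monthly_tasks * s.success_rate
    if successes <= 0:
        raise ValueError(f"scenario {s.name!r} has no successful tasks")
    return monthly_spend(s) / successes


# Illustrative numbers only; document the real assumptions next to the forecast.
SCENARIOS = [
    Scenario("low", monthly_tasks=10_000, success_rate=0.80, cost_per_attempt=0.02),
    Scenario("expected", monthly_tasks=25_000, success_rate=0.80, cost_per_attempt=0.03),
    Scenario("high", monthly_tasks=60_000, success_rate=0.75, cost_per_attempt=0.05),
]

for s in SCENARIOS:
    print(f"{s.name:>8}: ${monthly_spend(s):,.0f}/month, "
          f"${cost_per_successful_task(s):.4f} per successful task")
```

Failed attempts still cost money, so a falling success rate raises cost per successful task even when cost per attempt is flat.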

Token economics fundamentals

Spend scales with model tier, input tokens, output tokens, and call count. Long retrieved contexts and chat history dominate many RAG (Azure RAG concepts) bills more than answer length.

Embedding costs accrue on index build and refresh, not only queries. A nightly re-index of a large SharePoint (Azure RAG concepts) corpus can exceed chat spend if unplanned.

Safety and classification calls add overhead per request. Budget Content Safety and Bedrock Guardrails explicitly rather than treating them as negligible.

Educate sponsors: cheaper models reduce cost but may increase rework and human review (OpenAI safety best practices) minutes. Optimise fully loaded cost, not tokens alone.

  • Input-heavy: long policies, pasted emails, multi-turn history
  • Output-heavy: JSON extraction, long summaries
  • Call-heavy: agent loops with multiple tool round trips
  • Refresh-heavy: frequent embedding jobs on volatile corpora
  • Workshop question: "Which user behaviour would 10x our bill?"
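As a rough sketch of how the usage profiles above translate into spend, the function below prices a request from input tokens, output tokens, and call count. The `ModelPrice` values are hypothetical and do not reflect any provider's price list; substitute contracted rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    input_per_million: float    # dollars per million input tokens (assumed)
    output_per_million: float   # dollars per million output tokens (assumed)


def request_cost(price: ModelPrice, input_tokens: int,
                 output_tokens: int, calls: int = 1) -> float:
    """Cost of `calls` identical model calls at the given token counts."""
    per_call = (input_tokens * price.input_per_million
                + output_tokens * price.output_per_million) / 1_000_000
    return per_call * calls


# Hypothetical large-model price, for illustration only.
LARGE = ModelPrice(input_per_million=5.0, output_per_million=15.0)

profiles = {
    "input-heavy (RAG with long history)": request_cost(LARGE, 12_000, 300),
    "output-heavy (long summary)": request_cost(LARGE, 1_500, 2_000),
    "call-heavy (agent, 6 tool round trips)": request_cost(LARGE, 3_000, 200, calls=6),
}

for name, cost in profiles.items():
    print(f"{name}: ${cost:.4f}")
```

Running the same profiles against a smaller model's price gives sponsors a concrete view of what routing could save before any rework cost is added back.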

Technical levers in the request path

**Model routing (unified model gateway)** sends classification and triage tasks to smaller models and reserves large models for generation that needs reasoning. Route by task type, not globally per app.

Context limits cap retrieved passages, chat history depth, and maximum output tokens. Enforce limits in middleware per RAG guidance, not only in prompt instructions.

Retrieval top-k and chunk size directly affect input tokens. Tune for citation quality vs cost; measure empty-result rate when trimming aggressively.

Caching helps only where answers are stable and privacy policy allows. Never cache personalised or restricted responses without explicit policy review.

  • Small model for intent, category, and safety pre-check
  • Large model for final answer or complex extraction
  • Hard cap on input tokens with graceful truncation message
  • Default max output tokens per workflow type
  • Batch non-interactive jobs off peak where providers allow
  • What good looks like: routing rules versioned in git
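A minimal sketch of two of these levers, routing by task type and a hard input cap enforced in middleware, might look like the following. The model names, the `ROUTES` table, and the whitespace-based `count_tokens` are assumptions for illustration; a real deployment would use the provider's tokenizer and its own route definitions.

```python
from typing import List, Tuple

# Hypothetical routing table; in practice this would live in a versioned file.
ROUTES = {
    "classify": "small-model",
    "triage": "small-model",
    "generate": "large-model",
    "extract": "large-model",
}


def route(task_type: str) -> str:
    """Pick a model by task type; unknown task types fail loudly."""
    try:
        return ROUTES[task_type]
    except KeyError:
        raise ValueError(f"no route configured for task type {task_type!r}") from None


def count_tokens(text: str) -> int:
    # Crude stand-in: counts whitespace-separated words, not real tokens.
    return len(text.split())


def cap_passages(passages: List[str], max_input_tokens: int) -> Tuple[List[str], bool]:
    """Keep ranked passages in order until the input budget is used.

    Returns the kept passages and a flag the caller can use to show
    a truncation message to the user.
    """
    kept: List[str] = []
    used = 0
    for passage in passages:
        n = count_tokens(passage)
        if used + n > max_input_tokens:
            break
        kept.append(passage)
        used += n
    return kept, len(kept) < len(passages)


kept, truncated = cap_passages(["a b c", "d e", "f g h i"], max_input_tokens=5)
print(route("classify"), kept, truncated)
```

Raising on an unknown task type, rather than defaulting to the large model, keeps new workflows from silently landing on the most expensive route.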

Guardrails: caps, alerts, and degradation

Implement per-session, per-user, and per-tenant caps where multi-team platforms serve many units. Tags on every request enable enforcement.

Choose hard blocks vs soft alerts by risk appetite. Hard blocks stop spend; soft alerts notify owners before breach. Production often uses soft alerts for teams and hard caps for sandbox.

Define graceful degradation when caps approach: shorter answers, disable expensive tools, queue to human, or switch to smaller model with user-visible notice.

Never fail silently with empty responses. Users retry and multiply cost. Clear messages reduce thrash.

  • Daily and monthly budget per workflow tag
  • Alert at 70%, 90%, 100% of budget thresholds
  • Automatic downgrade route when 90% daily cap hit
  • Kill switch (OWASP LLM Top 10) for runaway agent loops (max tool steps)
  • Incident runbook when cap blocks executives during demo
  • Common mistake: caps only at provider account, not per team
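The thresholds in the list above can be expressed as a small decision function, shown here as a sketch. The `Action` names, the `hard_cap` switch, and the default step limit are hypothetical; the 70%, 90%, and 100% thresholds come from the list.

```python
from enum import Enum
from typing import Callable


class Action(Enum):
    ALLOW = "allow"
    ALERT = "alert"          # notify the budget owner, keep serving
    DOWNGRADE = "downgrade"  # switch to the cheaper route with a visible notice
    BLOCK = "block"          # hard cap reached


def budget_action(spent: float, budget: float, hard_cap: bool = True) -> Action:
    """Map spend against a budget to an action at 70%, 90%, and 100%."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    ratio = spent / budget
    if ratio >= 1.0:
        # Soft-capped teams keep running but the owner is alerted.
        return Action.BLOCK if hard_cap else Action.ALERT
    if ratio >= 0.9:
        return Action.DOWNGRADE
    if ratio >= 0.7:
        return Action.ALERT
    return Action.ALLOW


def run_agent(step: Callable[[int], bool], max_tool_steps: int = 8) -> int:
    """Call step(i) until it reports completion; abort at the step limit."""
    for i in range(max_tool_steps):
        if step(i):
            return i + 1
    raise RuntimeError(f"kill switch: agent exceeded {max_tool_steps} tool steps")


print(budget_action(spent=45.0, budget=50.0))
```

The caller is expected to turn `BLOCK` and `DOWNGRADE` into a clear user-facing message rather than an empty response, in line with the guidance above.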

Gateway routing and failover economics

An AI Gateway (gateway documentation) centralises provider keys, model routes, failover, and spend attribution. Application teams call one endpoint; platform teams change routes without redeploying every app.

Failover to a secondary model avoids outage but may change cost profile. Document failover pricing and quality trade-offs in runbooks.

Use gateway analytics to compare cost and quality by route on the same golden set (OpenAI evals) before promoting a cheaper default.

Avoid per-team API keys (OWASP LLM Top 10) scattered across repos. Keys without gateway telemetry cannot be capped or charged back reliably.

  • Primary and fallback model per task type documented
  • Automatic failover when primary rate-limited or down
  • Spend dashboard by route and provider
  • What good looks like: no new production key without gateway path
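A simplified sketch of per-task failover follows. It is not any gateway product's API: `ProviderUnavailable`, the `ROUTES` table, the model names, and the `call_model` callable are hypothetical stand-ins. The point it illustrates is that the function returns which model actually served the request, so spend can be attributed to the correct route.

```python
from typing import Callable, Dict, Tuple


class ProviderUnavailable(Exception):
    """Raised by the (hypothetical) client when a model is rate-limited or down."""


# Primary and fallback model per task type; names are placeholders.
ROUTES: Dict[str, Tuple[str, str]] = {
    "generate": ("primary-large", "fallback-medium"),
    "classify": ("primary-small", "fallback-small"),
}


def call_with_failover(
    task_type: str,
    prompt: str,
    routes: Dict[str, Tuple[str, str]],
    call_model: Callable[[str, str], str],
) -> Tuple[str, str]:
    """Try the primary model, then the fallback.

    Returns (model_used, response) so telemetry records the real route.
    A failure of the fallback propagates to the caller.
    """
    primary, fallback = routes[task_type]
    try:
        return primary, call_model(primary, prompt)
    except ProviderUnavailable:
        return fallback, call_model(fallback, prompt)
```

Because the fallback may be priced differently, logging `model_used` on every call is what lets the spend dashboard separate normal traffic from failover traffic.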

Scenario: enterprise policy Q&A at scale

A retailer scales HR policy Q&A from 500 to 8,000 users. The initial spend forecast assumed 3 turns per session; real usage averages 7 turns with long pasted emails.

Platform adds history summarisation after turn 4, reduces retrieval top-k from 8 to 5 with eval monitoring, and routes clarification questions to a small model. Cost per successful task (unified model gateway) falls 35% with citation rate (Azure RAG solution guide) stable.

Finance receives weekly showback (unified model gateway) by region. ANZ overspend triggers review; root cause is a champion sharing an uncapped demo link externally. Caps and SSO (Entra conditional access) fix the leak.

Steering approves continued scale with quarterly routing review and mandatory cost line in monthly metrics.

  • Lever applied: history compression after N turns
  • Lever applied: classify-then-generate routing
  • Governance: external sharing blocked by auth
  • Metric watched: cost per cited answer, not per session
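The history compression lever in this scenario could be sketched as below. This is one possible reading of "summarisation after turn 4", assuming the most recent four turns are kept verbatim and everything older is replaced by a single summary. The `summarise` callable is hypothetical; in practice it would likely be a small-model call.

```python
from typing import Callable, List


def compress_history(
    turns: List[str],
    summarise: Callable[[List[str]], str],
    keep_last: int = 4,
) -> List[str]:
    """Replace all but the last `keep_last` turns with one summary entry."""
    if keep_last <= 0:
        raise ValueError("keep_last must be positive")
    if len(turns) <= keep_last:
        return list(turns)
    older, recent = turns[:-keep_last], turns[-keep_last:]
    return [summarise(older)] + list(recent)


# Placeholder summariser so the sketch runs without a model.
def fake_summarise(older: List[str]) -> str:
    return f"[summary of {len(older)} earlier turns]"


history = [f"turn {i}" for i in range(1, 8)]   # a 7-turn session
print(compress_history(history, fake_summarise))
```

The summary call has its own cost, so the saving should be checked on the same cost per cited answer metric the scenario tracks.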

Scenario: ITSM batch triage

An energy company runs **overnight batch triage (OpenAI function calling)** on 20,000 tickets monthly. Interactive chat caps do not apply; batch job cost dominates.

Team uses batch API where available, smaller model for category suggestion, and large model only on low-confidence cases. Embedding cache avoids re-processing unchanged ticket templates.

A job-level budget cap stops runaway loops when the upstream feed duplicates records. An alert pages the platform team before invoice shock.

Business case compares batch AI cost to contractor hours previously spent on manual sort.

  • Separate budgets for interactive vs batch workloads
  • Idempotent job design prevents duplicate spend
  • Sample manual QA on batch output weekly
  • Cost metric: cents per ticket triaged correctly
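The batch pattern in this scenario (idempotent processing, small model first, large model only on low confidence, job-level cap) can be sketched as follows. All names, the flat per-call costs, and the 0.8 confidence threshold are assumptions for illustration; `small_model` and `large_model` are hypothetical callables standing in for real model calls.

```python
import hashlib
from typing import Callable, Dict, Iterable, Set, Tuple


def ticket_key(ticket: dict) -> str:
    """Stable key so a duplicated feed record is processed only once."""
    raw = f"{ticket['id']}|{ticket['text']}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def triage_batch(
    tickets: Iterable[dict],
    small_model: Callable[[str], Tuple[str, float]],  # returns (category, confidence)
    large_model: Callable[[str], str],                # returns category
    seen: Set[str],
    budget: float,
    small_cost: float,
    large_cost: float,
    threshold: float = 0.8,
) -> Tuple[Dict[object, str], float, bool]:
    """Return (results, spent, stopped_early). Stops before exceeding budget."""
    results: Dict[object, str] = {}
    spent = 0.0
    for ticket in tickets:
        key = ticket_key(ticket)
        if key in seen:
            continue  # idempotency: duplicates cost nothing
        if spent + small_cost > budget:
            return results, spent, True
        category, confidence = small_model(ticket["text"])
        spent += small_cost
        if confidence < threshold:
            if spent + large_cost > budget:
                return results, spent, True
            category = large_model(ticket["text"])
            spent += large_cost
        seen.add(key)
        results[ticket["id"]] = category
    return results, spent, False
```

Persisting `seen` between runs is what makes a re-run after a failure safe; `stopped_early` is the signal that should page the platform team.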

Chargeback, showback, and tagging discipline

Tag logs early with team, workflow, cost centre, and environment. Retroactive tagging after finance asks is painful and inaccurate.

Showback reports actual spend by tag without internal invoicing; good for pilots. **Chargeback (unified model gateway)** assigns cost to business units; needed at scale.

Publish a simple monthly statement: tokens, safety calls, embedding refresh, index storage, by team. Champions use statements to prioritise high-value workflows.

Platform retains central negotiation with providers; business units consume via tags, not separate enterprise agreements unless approved.

  • Required tags on every API call enforced in middleware
  • Reject untagged calls in production environments
  • Monthly showback (unified model gateway) meeting with council finance rep
  • Exception process for shared corpora costs split by usage
  • Workshop question: "Which team would be surprised by their bill today?"
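Tag enforcement and a basic showback roll-up can be sketched in a few lines. The tag names follow the four listed above (team, workflow, cost centre, environment); the record shape, the `UntaggedRequest` exception, and the `production` flag are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

REQUIRED_TAGS = ("team", "workflow", "cost_centre", "environment")


class UntaggedRequest(Exception):
    """Raised when a production call arrives without the required tags."""


def check_tags(tags: Dict[str, str], production: bool) -> List[str]:
    """Return the missing tags; reject the call outright in production."""
    missing = [name for name in REQUIRED_TAGS if not tags.get(name)]
    if missing and production:
        raise UntaggedRequest(f"missing required tags: {', '.join(missing)}")
    return missing  # non-production: caller may log a warning instead


def showback(records: Iterable[dict]) -> Dict[str, float]:
    """Sum logged cost by team tag for a monthly statement."""
    totals: Dict[str, float] = defaultdict(float)
    for record in records:
        totals[record["tags"]["team"]] += record["cost"]
    return dict(totals)


records = [
    {"tags": {"team": "hr"}, "cost": 1.5},
    {"tags": {"team": "hr"}, "cost": 2.5},
    {"tags": {"team": "it"}, "cost": 1.0},
]
print(showback(records))
```

The `production` flag is passed by the middleware rather than read from the tags, so a request cannot dodge enforcement by omitting its own environment tag.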

Pilot vs production cost bars

Pilots may use soft caps and alerts with weekly finance check-ins. Production requires hard caps, degradation behaviour, and documented approvers for overrides.

Sandbox environments get lower caps and smaller models by default. Prevent accidental production keys in dev via separate accounts or gateway policies.

Before go-live, run a load test at expected peak with cost projection. The peak day after a policy announcement can easily exceed steady state.

Align the cost bar with the production readiness checklist (checklist guide): caps are a gate item, not optional.

  • Pilot: soft cap, daily email to owner
  • Production: hard cap, degradation, on-call runbook
  • Load test: 2x expected peak requests for one hour
  • Document approver chain for temporary cap increase
  • What good looks like: cap drill tested before launch

Observability and anomaly detection

Cost anomalies often indicate bugs, abuse, or **prompt injection (OpenAI mitigations) loops**, not legitimate adoption. Alert on spend rate change, tokens per request spike, and tool loop count.

Correlate cost spikes with deploy events, index refreshes, and marketing announcements driving traffic.

Dashboards should show cost alongside quality metrics. A cheap model that increases escalations may raise fully loaded cost.

Export provider invoices and reconcile to internal logs monthly. Discrepancies reveal untagged keys or wrong region routing.

  • Alert: hourly spend 3x trailing seven-day average
  • Alert: single user session exceeds token threshold
  • Dashboard: cost per task vs latency vs citation rate (Azure RAG solution guide)
  • Monthly reconciliation: invoice vs internal telemetry
  • Post-incident: add failing pattern to eval or cap rule
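The first alert and the monthly reconciliation above can be approximated as shown below. This is a sketch under stated assumptions: spend is available as a list of hourly totals, the seven-day window is 168 hourly points, and the 2% reconciliation tolerance is a hypothetical default, not a recommendation from any provider.

```python
from statistics import mean
from typing import Sequence


def is_spend_anomaly(
    hourly_history: Sequence[float],
    current_hour: float,
    multiplier: float = 3.0,
    window_hours: int = 168,   # seven days of hourly points
) -> bool:
    """True when the current hour is at least `multiplier` x the trailing mean."""
    window = list(hourly_history[-window_hours:])
    if not window:
        return False  # no baseline yet, so nothing to compare against
    baseline = mean(window)
    return baseline > 0 and current_hour >= multiplier * baseline


def reconciles(invoice_total: float, logged_total: float,
               tolerance: float = 0.02) -> bool:
    """True when internal telemetry is within `tolerance` of the invoice."""
    if invoice_total <= 0:
        raise ValueError("invoice_total must be positive")
    return abs(invoice_total - logged_total) / invoice_total <= tolerance


history = [1.0] * 168
print(is_spend_anomaly(history, current_hour=3.5))
print(reconciles(invoice_total=1000.0, logged_total=990.0))
```

A failed reconciliation is worth investigating even when total spend is on forecast, since the gap usually corresponds to traffic that bypassed tagging.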

Council and procurement conversations

Procurement (production readiness conversation) will ask for commit discounts, enterprise agreements, and spend predictability. Bring tagged usage history from pilots, not theoretical estimates.

Council should review portfolio spend monthly: which workflows earn continuation, which should be capped or stopped on cost grounds alone.

Publish a one-page cost policy: default caps, tagging rules, approvers, and prohibited patterns (unlogged keys, uncapped public demos).

Pair cost policy with measuring success guide (OpenAI evals) metrics so kill decisions use cost and quality together.

  • Standing council agenda item: spend vs forecast
  • Policy: no production workflow without cost tag owner
  • Procurement (production readiness conversation) pack: usage by month, growth scenario, provider options
  • Common mistake: negotiating EA before usage patterns known

Checklist and common mistakes

Use this checklist before scaling any pilot beyond the initial cohort.

Common mistakes include caps only on paper, ignoring embedding refresh, and optimising tokens while human rework soars.

What good looks like: finance, platform, and sponsors cite the same cost per successful task (unified model gateway) number in meetings.

  • Cost per successful task (unified model gateway) defined and dashboard live
  • Tags enforced on all production traffic
  • Routing rules documented and eval-tested
  • Hard and soft caps configured with degradation tested
  • Gateway or central key management in place
  • Monthly reconciliation scheduled
  • Mistake: unlimited agent tool loops
  • Mistake: no cap on batch jobs
  • Mistake: hiding spend in central IT budget without showback (unified model gateway)

Next step

Talk about your next pilot

Patterns, metrics, and runnable demos for architecture reviews and pilots, from The Ops Toolbox.

  • One workflow, clear metrics
  • Your cloud, your keys
  • Written handoff, not dependency