Benchmarks as guardrails for AI workloads

April 2026 · Reliability

Inference stacks change quickly: new runtimes, new quantizations, new container images. Without automated checks, “it works on my cluster” becomes the release strategy. I lean on benchmark automation to compare latency and throughput across builds and deployment shapes, including Ollama and llama.cpp-style serving paths. These are the same kind of checks that belong in CI whenever inference sits in a user-facing or product-critical path.

What “good” looks like

  • Repeatable inputs and environments so results are comparable week to week (same model revision, same prompt length class, same hardware class).
  • Thresholds tied to reality — not vanity scores — aligned with product or SLO expectations (e.g. “p95 under X ms for this prompt class”).
  • Artifacts people open when something regresses: HTML/JSON reports, or a dashboard slice, not a one-off terminal scroll lost to history.
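
One way to make the list above concrete is to write every run as a small JSON record that keeps the pinned inputs next to the result. The sketch below is illustrative only: the function name write_report and all field names (prompt_class, hardware_class, budget_ns, recorded_at) are assumptions, not part of any tool mentioned here. It also assumes the string arguments contain no quotes or backslashes; for arbitrary strings, building the object with jq -n --arg is the safer route.

```shell
#!/usr/bin/env bash
set -eu

# write_report MODEL PROMPT_CLASS HARDWARE_CLASS P95_NS BUDGET_NS
# Prints one JSON object per run. Field names are hypothetical.
# Assumes the string arguments are plain identifiers (no quotes or backslashes).
write_report() {
  printf '{"model":"%s","prompt_class":"%s","hardware_class":"%s",' "$1" "$2" "$3"
  printf '"p95_ns":%s,"budget_ns":%s,' "$4" "$5"
  printf '"recorded_at":"%s"}\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
}
```

For example, write_report llama3.2:1b short cpu-8core 950000000 1200000000 >> reports.jsonl (a hypothetical file name) appends one line per run, so week-to-week comparisons only join records whose model, prompt class, and hardware class match.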

A minimal latency probe (Ollama-shaped)

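Example: pin the model and prompt, run N iterations, keep the raw samples, and compute the mean of total_duration, which Ollama reports in nanoseconds (adapt auth, endpoint, and JSON fields to your stack):

#!/usr/bin/env bash
set -euo pipefail
ENDPOINT="${OLLAMA_HOST:-http://127.0.0.1:11434}/api/generate"
MODEL="${MODEL:-llama3.2:1b}"
RUNS="${RUNS:-30}"

# Build the body with jq so the model name is always escaped correctly.
BODY="$(jq -n --arg model "$MODEL" \
  '{model: $model, prompt: "Explain GitOps in one sentence.", stream: false}')"

# Raw samples go to durations.txt; printf avoids awk's default %.6g rounding.
for i in $(seq 1 "$RUNS"); do
  # -f makes curl exit non-zero on HTTP errors; -e makes jq fail on a missing field.
  curl -fsS "$ENDPOINT" \
    -H 'Content-Type: application/json' \
    -d "$BODY" \
    | jq -er '.total_duration'
done | tee durations.txt \
  | awk '{sum+=$1; n++} END {if (n == 0) exit 1; printf "avg_ns: %.0f\n", sum/n}'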
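
Latency is only half of the comparison; throughput can be derived from the same response. As a hedged sketch: Ollama's non-streaming /api/generate response also reports eval_count (generated tokens) and eval_duration (nanoseconds spent generating), so tokens per second is eval_count divided by eval_duration expressed in seconds. The function names below are hypothetical, and the field names are worth checking against the runtime version you actually run.

```shell
#!/usr/bin/env bash
set -eu

# tokens_per_second EVAL_COUNT EVAL_DURATION_NS
# Converts a token count and a duration in nanoseconds to tokens/s.
tokens_per_second() {
  awk -v n="$1" -v ns="$2" 'BEGIN {
    if (ns + 0 <= 0) { print "invalid duration" > "/dev/stderr"; exit 1 }
    printf "%.2f\n", n * 1e9 / ns
  }'
}

# probe_tokens_per_second: hypothetical wrapper that sends one request and
# prints generation throughput for that single run. Needs curl and jq.
probe_tokens_per_second() {
  endpoint="${OLLAMA_HOST:-http://127.0.0.1:11434}/api/generate"
  body="$(jq -n --arg model "${MODEL:-llama3.2:1b}" \
    '{model: $model, prompt: "Explain GitOps in one sentence.", stream: false}')"
  curl -fsS "$endpoint" -H 'Content-Type: application/json' -d "$body" \
    | jq -r '"\(.eval_count) \(.eval_duration)"' \
    | { read -r count ns; tokens_per_second "$count" "$ns"; }
}
```

Calling tokens_per_second 50 2000000000 prints 25.00, since 50 tokens in 2 seconds is 25 tokens per second. Keeping the arithmetic separate from the HTTP call makes the conversion easy to check without a running server.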

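To approximate p95 with the nearest-rank method from durations.txt (one duration in nanoseconds per line):

sort -n durations.txt | awk '{
  a[NR] = $1
} END {
  # Fail loudly on empty input instead of printing a blank value.
  if (NR == 0) { print "no samples" > "/dev/stderr"; exit 1 }
  # Nearest rank: ceil(0.95 * NR), written to avoid floating-point surprises.
  idx = int((95 * NR + 99) / 100)
  print "p95_ns:", a[idx]
}'

With 30 runs this picks the 29th sorted sample, so treat the figure as coarse and raise RUNS when the gate needs to be tight.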

Wire that into a pipeline step that fails when p95 crosses a budget — same spirit as performance tests for APIs, except the “endpoint” is your inference runtime behind AKS or a VM.
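
A pipeline step of that kind could look like the sketch below. It is a minimal illustration, not a definitive implementation: the function names p95_ns and check_budget are hypothetical, the budget is assumed to be expressed in milliseconds, and the samples are assumed to be nanoseconds, one per line. The step relies on the usual CI convention that a non-zero exit status fails the job.

```shell
#!/usr/bin/env bash
set -eu

# p95_ns FILE: nearest-rank p95 of one-number-per-line samples.
p95_ns() {
  sort -n "$1" | awk '{ a[NR] = $1 } END {
    if (NR == 0) exit 1
    print a[int((95 * NR + 99) / 100)]
  }'
}

# check_budget FILE BUDGET_MS
# Returns 0 when p95 is within budget, 1 when it is over, 2 without samples.
check_budget() {
  p95="$(p95_ns "$1")" || { echo "no samples in $1" >&2; return 2; }
  awk -v p95="$p95" -v budget_ms="$2" 'BEGIN {
    p95_ms = p95 / 1e6   # nanoseconds to milliseconds
    if (p95_ms > budget_ms + 0) {
      printf "FAIL p95=%.1fms budget=%sms\n", p95_ms, budget_ms
      exit 1
    }
    printf "OK p95=%.1fms budget=%sms\n", p95_ms, budget_ms
  }'
}
```

In a pipeline, a line such as check_budget durations.txt 1200 after the probe is enough: the job stops when p95 exceeds 1200 ms, and the printed line lands in the build log next to the commit that caused it.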

From numbers to merge gates

On platform work, benchmarks earn their keep when they block promotion. For every new image, quantization, or node pool, run the suite and compare against the last green baseline. If latency balloons or throughput collapses, the release does not move forward until someone acknowledges it. That is the difference between a graph in a slide deck and a guardrail your team trusts.
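
The baseline comparison can be sketched as a relative check rather than a fixed budget. Everything named here is an assumption for illustration: the function compare_to_baseline, the idea of storing the last green p95 as a single number, and the tolerance expressed as a percentage.

```shell
#!/usr/bin/env bash
set -eu

# compare_to_baseline BASELINE_NS CURRENT_NS TOLERANCE_PCT
# Exit 0 when the current value is within tolerance of the baseline,
# 1 on a regression, 2 when the baseline is unusable.
compare_to_baseline() {
  awk -v base="$1" -v cur="$2" -v tol="$3" 'BEGIN {
    if (base + 0 <= 0) { print "invalid baseline" > "/dev/stderr"; exit 2 }
    delta = (cur - base) * 100 / base   # percent change; positive means slower
    if (delta > tol + 0) {
      printf "REGRESSION baseline_ns=%.0f current_ns=%.0f delta=%+.1f%%\n", base, cur, delta
      exit 1
    }
    printf "OK baseline_ns=%.0f current_ns=%.0f delta=%+.1f%%\n", base, cur, delta
  }'
}
```

For instance, compare_to_baseline "$(cat baseline_p95_ns.txt)" "$current_p95_ns" 10 (both names hypothetical) blocks promotion when p95 is more than 10% slower than the last green run. For throughput, where lower is worse, the sign of the comparison would need to be flipped.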

Benchmarks will not replace intuition, but they make debates shorter: either the numbers moved or they did not. That is enough to ship with more confidence on Azure-backed AI platform work — and to explain why a rollout stopped without hand-waving.