Benchmarks as guardrails for AI workloads

Inference stacks change quickly: new runtimes, new quantizations, new container images. Without automated checks, “it works on my cluster” becomes the release strategy. I lean on benchmark automation to compare latency and throughput across builds and deployment shapes, including Ollama and llama.cpp-style paths. These are the same class of checks that belong in CI whenever inference sits in a user-facing or product-critical path.

What “good” looks like

Repeatable inputs and environments so results are comparable week to week (same model revision, same prompt length class, same hardware class).
Thresholds tied to reality — not vanity scores — aligned with product or SLO expectations (e.g. “p95 under X ms for this prompt class”).
Artifacts people open when something regresses: HTML/JSON reports, or a dashboard slice, not a one-off terminal scroll lost to history.
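
To make the last point concrete, each benchmark run can emit a small JSON summary that CI archives next to the build. This is a minimal sketch, assuming durations arrive on stdin in nanoseconds, one per line. The function name summarize_json, the field names, and the hardware label are all illustrative, not a standard schema.

```shell
# summarize_json MODEL HARDWARE
# Reads durations (nanoseconds, one per line) on stdin and prints a one-line
# JSON summary. Field names are illustrative; labels are assumed to contain
# no quotes or backslashes.
summarize_json() {
  sort -n | awk -v model="$1" -v hw="$2" '
    { a[NR] = $1; sum += $1 }
    END {
      if (NR == 0) { print "no samples" > "/dev/stderr"; exit 1 }
      idx = int((95 * NR + 99) / 100)   # nearest-rank p95: ceil(0.95 * NR)
      printf "{\"model\":\"%s\",\"hardware\":\"%s\",\"runs\":%d,\"avg_ms\":%.1f,\"p95_ms\":%.1f}\n",
        model, hw, NR, sum / NR / 1e6, a[idx] / 1e6
    }'
}
```

Recording the model revision and hardware class in the same file as the numbers is what keeps week-to-week comparisons honest: a report without them cannot tell you whether the inputs changed or the runtime did.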

A minimal latency probe (Ollama-shaped)

Example: pin the model and prompt, run N iterations, then average the total_duration field, which is reported in nanoseconds (adapt auth, endpoint, and JSON fields to your stack):

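#!/usr/bin/env bash
# Latency probe: send RUNS identical requests, collect one total_duration (ns) per run.
set -euo pipefail
ENDPOINT="${OLLAMA_HOST:-http://127.0.0.1:11434}/api/generate"
MODEL="${MODEL:-llama3.2:1b}"
RUNS="${RUNS:-30}"

# Build the request body with jq so the model name is always escaped correctly.
BODY="$(jq -n --arg model "$MODEL" \
  '{model: $model, prompt: "Explain GitOps in one sentence.", stream: false}')"

for i in $(seq 1 "$RUNS"); do
  # --fail turns an HTTP error into a non-zero exit instead of a "null" sample.
  curl -sS --fail "$ENDPOINT" \
    -H 'Content-Type: application/json' \
    -d "$BODY" \
    | jq -r '.total_duration'
# Keep the raw samples in durations.txt; printf avoids awk's default %.6g rounding.
done | tee durations.txt | awk '{sum+=$1; n++} END {if (n) printf "avg_ns: %.0f\n", sum/n}'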
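
Latency is only half of the comparison; throughput can come from the same responses. Ollama's non-streaming response also reports eval_count (generated tokens) and eval_duration (nanoseconds spent generating them), though those field names should be checked against your runtime and version. The helper name tokens_per_second is made up for this sketch.

```shell
# tokens_per_second
# Reads "eval_count eval_duration_ns" pairs on stdin, one pair per run, e.g. from:
#   curl ... | jq -r '"\(.eval_count) \(.eval_duration)"'
# Prints aggregate generation throughput across all runs.
tokens_per_second() {
  awk '
    $2 > 0 { tokens += $1; ns += $2 }   # skip runs with no generation time
    END {
      if (ns == 0) { print "no samples" > "/dev/stderr"; exit 1 }
      printf "tokens_per_s: %.1f\n", tokens / (ns / 1e9)
    }'
}
```

Summing tokens and time across runs, rather than averaging per-run rates, keeps short responses from dominating the figure.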

To approximate p95 from a list of durations saved to durations.txt (one number per line, nanoseconds):

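sort -n durations.txt | awk '{
  a[NR]=$1
} END {
  if (NR == 0) { print "no samples" > "/dev/stderr"; exit 1 }
  # Nearest-rank p95: smallest sample with at least 95% of samples at or below it.
  idx = int((95 * NR + 99) / 100)   # integer form of ceil(0.95 * NR)
  print "p95_ns:", a[idx]
}'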

Wire that into a pipeline step that fails when p95 crosses a budget. This is the same idea as performance tests for APIs, except the “endpoint” is your inference runtime on AKS or a VM.
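
A minimal sketch of such a gate, assuming the p95 value is already available in nanoseconds and the budget is expressed in milliseconds. The function name check_budget and the 800 ms figure in the usage comment are placeholders, not recommendations.

```shell
# check_budget P95_NS BUDGET_MS
# Returns 0 when p95 is within budget and 1 otherwise, so the CI step fails on a breach.
check_budget() {
  awk -v p95_ns="$1" -v budget_ms="$2" 'BEGIN {
    p95_ms = p95_ns / 1e6
    if (p95_ms > budget_ms) {
      printf "FAIL: p95 %.1f ms exceeds budget %.1f ms\n", p95_ms, budget_ms
      exit 1
    }
    printf "OK: p95 %.1f ms within budget %.1f ms\n", p95_ms, budget_ms
  }'
}

# Hypothetical usage as the last command of a pipeline step:
#   check_budget "$p95_ns" "${P95_BUDGET_MS:-800}"
```

Most CI systems treat a non-zero exit code as a failed step, so no extra plumbing is needed beyond making this the last command.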

From numbers to merge gates

On platform work, benchmarks earn their keep when they block promotion. For every new image, new quantization, or new node pool, run the suite and compare against the last green baseline. If latency balloons or throughput collapses, the release does not move forward until someone acknowledges it. That is the difference between a graph in a slide deck and a guardrail your team trusts.
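
The baseline comparison can be sketched the same way. This assumes the last green run stored its p95 somewhere the pipeline can read it, for example as a build artifact. The function name compare_to_baseline and the percentage tolerance are illustrative choices.

```shell
# compare_to_baseline BASELINE_NS CURRENT_NS MAX_REGRESSION_PCT
# Returns 1 when the current value is more than MAX_REGRESSION_PCT above the
# baseline, 2 when the baseline is unusable, and 0 otherwise.
compare_to_baseline() {
  awk -v base="$1" -v cur="$2" -v max_pct="$3" 'BEGIN {
    if (base + 0 <= 0) { print "invalid baseline" > "/dev/stderr"; exit 2 }
    delta_pct = (cur - base) * 100 / base
    printf "delta: %+.1f%% (limit +%.1f%%)\n", delta_pct, max_pct
    if (delta_pct > max_pct) exit 1
  }'
}
```

A relative tolerance absorbs normal run-to-run noise. How wide to make it depends on how noisy your hardware class is, which a few repeated runs of an unchanged build will tell you.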

Benchmarks will not replace intuition, but they make debates shorter: either the numbers moved or they did not. That is enough to ship with more confidence on Azure-backed AI platform work — and to explain why a rollout stopped without hand-waving.