Benchmarks as guardrails for AI workloads
Inference stacks change quickly: new runtimes, new quantizations, new container images. Without automated checks, “it works on my cluster” becomes the release strategy. I lean on benchmark automation to compare latency and throughput across builds and deployment shapes, including Ollama and llama.cpp-style paths. The same class of checks belongs in CI whenever inference sits in a user-facing or product-critical path.
What “good” looks like
- Repeatable inputs and environments so results are comparable week to week (same model revision, same prompt length class, same hardware class).
- Thresholds tied to reality — not vanity scores — aligned with product or SLO expectations (e.g. “p95 under X ms for this prompt class”).
- Artifacts people open when something regresses: HTML/JSON reports, or a dashboard slice, not a one-off terminal scroll lost to history.
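One way to combine the first and third points is to write the pinned inputs next to the measured number, so a report from this week can be compared with last week's. The following is a minimal sketch, not a fixed format: the helper name `write_report`, the field names, and the label values are all hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: store one result together with the inputs that produced it.
# Arguments: model tag, prompt class, hardware class, p95 in ns, output path.
write_report() {
  local model="$1" prompt_class="$2" hardware="$3" p95_ns="$4" out="$5"
  # Assumption: the labels are plain identifiers without quotes or backslashes.
  # Use a real JSON encoder (for example jq) if they can hold arbitrary text.
  printf '{"model":"%s","prompt_class":"%s","hardware":"%s","p95_ns":%s,"recorded_at":"%s"}\n' \
    "$model" "$prompt_class" "$hardware" "$p95_ns" \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" > "$out"
}
```

A call such as `write_report "$MODEL" short-prompt cpu-small "$p95" report.json` leaves a small file that can be uploaded as a pipeline artifact instead of living only in terminal scrollback.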
A minimal latency probe (Ollama-shaped)
Example: pin the model and prompt, run N iterations, then average the total_duration field, which Ollama reports in nanoseconds (adapt auth, endpoint, and JSON fields to your stack):
#!/usr/bin/env bash
set -euo pipefail
ENDPOINT="${OLLAMA_HOST:-http://127.0.0.1:11434}/api/generate"
MODEL="${MODEL:-llama3.2:1b}"
RUNS="${RUNS:-30}"
# Run the probe RUNS times, keep the raw samples in durations.txt, and print the mean.
for i in $(seq 1 "$RUNS"); do
  curl -sS "$ENDPOINT" \
    -H 'Content-Type: application/json' \
    -d "{\"model\":\"$MODEL\",\"prompt\":\"Explain GitOps in one sentence.\",\"stream\":false}" \
    | jq -r '.total_duration'
done | tee durations.txt | awk '{sum+=$1; n++} END {if (n) print "avg_ns:", sum/n}'
To approximate p95 from the list of durations saved to durations.txt (one number per line, nanoseconds):
sort -n durations.txt | awk '{
  a[NR]=$1
} END {
  idx = int(0.95 * NR + 0.5)
  if (idx < 1) idx = 1
  print "p95_ns:", a[idx]
}'
Wire that into a pipeline step that fails when p95 crosses a budget. It is the same spirit as performance tests for APIs, except the “endpoint” is your inference runtime behind AKS or a VM.
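A budget check of that kind might look like the sketch below. It reuses the rounding rule from the p95 snippet above and turns the result into an exit code. The function names `p95_of` and `check_budget` are hypothetical, and the budget is whatever number your SLO implies, not a recommendation.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Approximate p95 of a file of durations (one number per line).
# Exits non-zero when the file has no samples.
p95_of() {
  sort -n "$1" | awk '{ a[NR] = $1 } END {
    if (NR == 0) { print "no samples" > "/dev/stderr"; exit 1 }
    idx = int(0.95 * NR + 0.5)
    if (idx < 1) idx = 1
    print a[idx]
  }'
}

# Return 0 when p95 is within the budget, 1 when it is over,
# and 2 when p95 could not be computed.
check_budget() {
  local file="$1" budget_ns="$2" p95
  p95="$(p95_of "$file")" || return 2
  echo "p95_ns: $p95 (budget_ns: $budget_ns)"
  awk -v p="$p95" -v b="$budget_ns" 'BEGIN { exit !(p <= b) }'
}
```

Calling `check_budget durations.txt "$BUDGET_NS"` as the last command of a CI step is usually enough, since most CI systems fail a step on a non-zero exit code. `BUDGET_NS` here is an assumed variable name.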
From numbers to merge gates
On platform work, benchmarks earn their keep when they block promotion: new image, new quantization, new node pool — run the suite, compare to last green baseline. If latency balloons or throughput collapses, the release does not move forward until someone acknowledges it. That is the difference between a graph in a slide deck and a guardrail your team trusts.
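The comparison against the last green baseline can be sketched as a relative check rather than a fixed budget. This is an illustration under assumptions: the function name `compare_to_baseline`, the default tolerance of 10 percent, and the idea of storing the baseline as a single number are all hypothetical choices.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical gate: fail when the candidate p95 is more than
# max_regress_pct percent slower than the baseline p95.
# Both values are plain numbers in the same unit (for example ns).
compare_to_baseline() {
  local baseline="$1" candidate="$2" max_regress_pct="${3:-10}"
  awk -v b="$baseline" -v c="$candidate" -v m="$max_regress_pct" 'BEGIN {
    if (b <= 0) { print "invalid baseline" > "/dev/stderr"; exit 2 }
    delta = (c - b) / b * 100
    printf "baseline=%s candidate=%s delta=%.1f%%\n", b, c, delta
    exit (delta > m ? 1 : 0)
  }'
}
```

In a promotion job, the baseline could be read from an artifact of the last successful run, for instance `compare_to_baseline "$(cat baseline_p95.txt)" "$p95" 10`, where `baseline_p95.txt` is an assumed file name. Updating that file only on green runs keeps the comparison anchored to a known-good state.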
Benchmarks will not replace intuition, but they make debates shorter: either the numbers moved or they did not. That is enough to ship with more confidence on Azure-backed AI platform work — and to explain why a rollout stopped without hand-waving.