AI API Benchmark
Run the same prompt N times against an LLM endpoint to get stable p50/p95 latency and throughput — then compare providers on identical workloads.
🔒 Runs entirely in your browser · nothing is uploaded or stored
📏 Fair comparison
Use the same prompt, model size and run count for each endpoint. Run during the same time window — provider latency varies by load.
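One way to keep the time window identical is to alternate endpoints round by round instead of benchmarking them one after another. The sketch below illustrates that idea only; `benchmark_interleaved` and the `fake_*` callables are hypothetical names, and the callables stand in for real API requests.

```python
import time
from typing import Callable, Dict, List


def benchmark_interleaved(
    endpoints: Dict[str, Callable[[str], str]],
    prompt: str,
    runs: int,
) -> Dict[str, List[float]]:
    """Time each endpoint `runs` times, alternating endpoints every round.

    Returns latencies in milliseconds, keyed by endpoint name.
    """
    latencies: Dict[str, List[float]] = {name: [] for name in endpoints}
    for _ in range(runs):
        # One call per endpoint per round, so a load spike affects all of them.
        for name, call in endpoints.items():
            start = time.perf_counter()
            call(prompt)  # identical prompt for every endpoint
            latencies[name].append((time.perf_counter() - start) * 1000.0)
    return latencies


# Stand-in callables (assumptions); replace with real requests to your providers.
def fake_fast(prompt: str) -> str:
    time.sleep(0.001)
    return "ok"


def fake_slow(prompt: str) -> str:
    time.sleep(0.003)
    return "ok"


results = benchmark_interleaved(
    {"provider_a": fake_fast, "provider_b": fake_slow},
    prompt="Say hello.",
    runs=5,
)
```

Each endpoint ends up with the same number of samples, collected within the same few seconds, which is what the comparison needs.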
📊 Read the spread
A p95 far above p50 (p95 ≫ p50) means some requests are much slower than the typical one, which is bad for UX. A tight spread is often worth more than a slightly lower median.
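To make the spread concrete, here is a minimal sketch that computes p50, p95 and their ratio from a list of latencies. It assumes linear interpolation between ranks, which may differ from how this tool computes percentiles; the function names and the sample numbers are made up for illustration.

```python
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Percentile with linear interpolation between the two nearest ranks."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = (len(ordered) - 1) * pct / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def spread_ratio(samples: Sequence[float]) -> float:
    """p95 divided by p50; values near 1.0 mean consistent latency."""
    return percentile(samples, 95) / percentile(samples, 50)


# Made-up latencies in milliseconds, for illustration only.
steady = [330, 340, 350, 360, 370]
spiky = [300, 310, 320, 330, 900]

print(round(spread_ratio(steady), 2))  # 1.05
print(round(spread_ratio(spiky), 2))   # 2.46
```

In this example the spiky endpoint has the lower median (320 ms against 350 ms), yet its p95 is more than double its p50, so a noticeable share of requests would feel slow.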
Frequently Asked Questions
How many runs do I need?
Five runs give a usable p50 for a quick read, but with so few samples the p95 is effectively your slowest run. Use 10 or more runs when comparing providers you're about to commit to.
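The effect of run count on p95 can be checked with a little arithmetic. The sketch below assumes the nearest-rank definition of a percentile (the tool itself may interpolate instead), and `p95_position` is a hypothetical helper name.

```python
def p95_position(runs: int) -> int:
    """1-based position of p95 in the sorted samples (nearest-rank method)."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    # Integer form of ceil(0.95 * runs), which avoids float rounding.
    return (95 * runs + 99) // 100


for runs in (5, 10, 20, 100):
    pos = p95_position(runs)
    note = "slowest run" if pos == runs else f"{runs - pos} run(s) slower"
    print(f"{runs:>3} runs -> p95 is sample {pos} of {runs} ({note})")
```

Under this definition the p95 coincides with the slowest run until there are about 20 samples, so a single outlier can dominate the p95 of a short benchmark. More runs mainly buy a p95 that is less sensitive to one bad request.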
Does it cost money to run?
Each run is a real API call billed by your provider. Keep the prompt short and runs low to minimize cost.
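A rough cost estimate before you start can be sketched as below. It assumes your provider bills input and output tokens separately at a per-million-token rate; the function name, token counts and prices are placeholders, not real pricing.

```python
def estimate_cost_usd(
    runs: int,
    input_tokens: int,
    output_tokens: int,
    usd_per_million_input: float,
    usd_per_million_output: float,
) -> float:
    """Estimated total cost of a benchmark, assuming per-token billing."""
    per_run = (
        input_tokens * usd_per_million_input
        + output_tokens * usd_per_million_output
    ) / 1_000_000
    return runs * per_run


# Placeholder numbers; substitute your provider's actual rates.
cost = estimate_cost_usd(
    runs=10,
    input_tokens=50,
    output_tokens=200,
    usd_per_million_input=1.0,
    usd_per_million_output=2.0,
)
print(f"${cost:.4f}")  # $0.0045
```

Cost scales linearly with both run count and tokens per run, so halving the prompt and response length saves as much as halving the number of runs.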