AI API Benchmark

Run the same prompt N times against an LLM endpoint to get stable p50/p95 latency and throughput figures, then compare providers on identical workloads.
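For readers who want to see what such a measurement involves, here is a minimal sketch in Python. It is not this tool's implementation. It assumes a nearest-rank percentile and treats throughput as output tokens per second of measured latency. The names `percentile`, `benchmark`, and `fake_endpoint` are hypothetical, and `fake_endpoint` only sleeps instead of calling a real API.

```python
import math
import time
from typing import Callable


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with pct% of the data at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))  # 1-based rank
    return ordered[rank - 1]


def benchmark(call: Callable[[], int], runs: int) -> dict[str, float]:
    """Time `call` `runs` times. `call` is assumed to return its output token count."""
    latencies: list[float] = []
    tokens = 0
    for _ in range(runs):
        start = time.perf_counter()
        tokens += call()
        latencies.append(time.perf_counter() - start)
    total = sum(latencies)
    return {
        "p50_s": percentile(latencies, 50),
        "p95_s": percentile(latencies, 95),
        "tokens_per_s": tokens / total if total > 0 else 0.0,
    }


def fake_endpoint() -> int:
    """Hypothetical stand-in for a real API call: waits 10 ms, reports 20 tokens."""
    time.sleep(0.01)
    return 20


if __name__ == "__main__":
    print(benchmark(fake_endpoint, runs=5))
```

To compare providers, pass one callable per endpoint and keep the prompt and `runs` identical across them.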

🔒 Runs entirely in your browser · nothing is uploaded or stored

📏 Fair comparison

Use the same prompt, model size, and run count for each endpoint. Run the benchmarks in the same time window, because provider latency varies with load.

📊 Read the spread

A p95 far above p50 (p95 ≫ p50) means latency is inconsistent: most requests are fast, but some are much slower, which hurts UX. A tight spread is often worth more than a slightly lower median.
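One way to put a number on the spread is the ratio p95 / p50. The sketch below assumes that ratio as the metric. The function name `spread_ratio` and the two sets of figures are invented for illustration and are not measurements of any real provider.

```python
def spread_ratio(p50_s: float, p95_s: float) -> float:
    """Return p95 / p50; values near 1.0 mean consistent latency."""
    if p50_s <= 0:
        raise ValueError("p50 must be positive")
    return p95_s / p50_s


# Made-up results: A has the lower median, B has the tighter spread.
results = {
    "provider_a": {"p50_s": 0.5, "p95_s": 2.0},
    "provider_b": {"p50_s": 0.75, "p95_s": 1.0},
}

for name, r in results.items():
    ratio = spread_ratio(r["p50_s"], r["p95_s"])
    print(f"{name}: p50={r['p50_s']}s p95={r['p95_s']}s spread={ratio:.2f}x")
```

In this invented case provider A wins on median, yet its slowest requests take about four times as long as its typical ones, so provider B may feel faster to users.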

Frequently Asked Questions

How many runs do I need?

Five runs give a usable p50/p95 for a quick read, but with so few samples the p95 is effectively set by the slowest run. Use 10 or more when comparing providers you're about to commit to.
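To see how run count affects p95, the sketch below computes which sorted sample a nearest-rank p95 picks for a few run counts. The nearest-rank definition is an assumption; tools that interpolate between samples will behave somewhat differently. The helper name `p95_rank` is hypothetical.

```python
import math


def p95_rank(n: int) -> int:
    """1-based position of the nearest-rank p95 among n sorted samples."""
    return max(1, math.ceil(95 * n / 100))


for n in (5, 10, 20, 100):
    rank = p95_rank(n)
    note = "the slowest run" if rank == n else f"{n - rank} run(s) are slower"
    print(f"{n:>3} runs: p95 is sorted sample #{rank} ({note})")
```

Under this definition p95 equals the slowest run until there are at least 20 samples, so a single outlier dominates small benchmarks. More runs make the p95 less sensitive to one bad request.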

Does it cost money to run?

Each run is a real API call billed by your provider. Keep the prompt short and the run count low to minimize cost.
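A rough cost estimate can be worked out before running. The sketch below assumes per-token billing with separate input and output prices quoted per million tokens, which is common but not universal. The function name `estimate_cost` and the prices in the example are placeholders, not any provider's actual rates.

```python
def estimate_cost(
    runs: int,
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Estimated total cost, in the currency of the prices, for `runs` identical calls."""
    per_run = (
        input_tokens * input_price_per_mtok
        + output_tokens * output_price_per_mtok
    ) / 1_000_000
    return runs * per_run


# Placeholder prices: 1.00 per million input tokens, 2.00 per million output tokens.
cost = estimate_cost(
    runs=10,
    input_tokens=50,
    output_tokens=200,
    input_price_per_mtok=1.00,
    output_price_per_mtok=2.00,
)
print(f"estimated cost for 10 runs: {cost:.4f}")
```

Cost scales linearly with run count, so doubling the runs doubles the bill. Substitute your provider's published prices for the placeholders.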