RAG Performance Calculator

Estimate end-to-end latency for a retrieval-augmented generation (RAG) pipeline stage by stage, so you know whether embedding, search, reranking, or generation is your real bottleneck.

The calculator reports three results: total latency (ms), time to first token (TTFT), and the bottleneck stage.
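The three calculator outputs (total, time to first token, bottleneck) can be approximated with a few lines of arithmetic. This is a minimal sketch, not the calculator's exact formula. It assumes the stages run one after another and that generation splits into the time until the first token plus the time to stream the rest. The field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StageLatencies:
    """Per-query latencies in milliseconds (hypothetical field names)."""
    embed_ms: float
    search_ms: float
    rerank_ms: float
    first_token_ms: float  # model latency until the first generated token
    stream_ms: float       # time to stream the remaining tokens


def estimate(lat: StageLatencies) -> dict:
    # Assumption: stages are sequential, so their latencies simply add up.
    retrieval = lat.embed_ms + lat.search_ms + lat.rerank_ms
    ttft = retrieval + lat.first_token_ms
    total = ttft + lat.stream_ms
    stages = {
        "embedding": lat.embed_ms,
        "search": lat.search_ms,
        "rerank": lat.rerank_ms,
        "generation": lat.first_token_ms + lat.stream_ms,
    }
    # The bottleneck is the stage with the largest share of the total.
    bottleneck = max(stages, key=stages.get)
    return {"total_ms": total, "ttft_ms": ttft, "bottleneck": bottleneck}


# Illustrative inputs only, not measured values.
print(estimate(StageLatencies(embed_ms=15, search_ms=40, rerank_ms=120,
                              first_token_ms=300, stream_ms=900)))
# {'total_ms': 1375, 'ttft_ms': 475, 'bottleneck': 'generation'}
```

If your pipeline overlaps stages (for example, embedding and a keyword search in parallel), summing overstates the total, and you would take the maximum of the parallel branches instead.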

🔍 Cut the bottleneck

  • Rerank heavy? Use a lighter reranker, or rerank fewer candidates with a top-k filter
  • Search slow? Add an approximate nearest-neighbor index such as HNSW or IVF
  • TTFT high? Use a smaller or faster model, or serve it from a closer region
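For the rerank tip, a top-k filter means only the highest-scoring first-stage candidates reach the expensive reranker. A minimal sketch, assuming candidates arrive as `(doc_id, score)` pairs and that `rerank_score` is a hypothetical callable standing in for something like a cross-encoder. The linear cost model is also an assumption; batched GPU rerankers often scale less than linearly.

```python
import heapq


def rerank_top_k(candidates, rerank_score, k=20):
    """Rerank only the k best first-stage candidates.

    candidates: list of (doc_id, first_stage_score) pairs.
    rerank_score: hypothetical callable doc_id -> float.
    """
    # Keep the k highest-scoring candidates from the cheap first stage.
    shortlist = heapq.nlargest(k, candidates, key=lambda c: c[1])
    # Only the shortlist pays the reranker cost.
    scored = [(doc_id, rerank_score(doc_id)) for doc_id, _ in shortlist]
    return sorted(scored, key=lambda c: c[1], reverse=True)


def estimated_rerank_ms(num_candidates, ms_per_pair):
    # Assumption: latency grows linearly with the number of pairs scored.
    return num_candidates * ms_per_pair
```

Under that linear assumption, cutting the candidate list from 100 to 20 cuts reranking time by about 80%. Check answer quality after changing `k`, since too small a shortlist can drop the document the reranker would have promoted.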

⚡ Perceived speed

Users feel TTFT, not total time. Stream tokens as they are generated and show retrieval status so the wait for the rest of the answer feels shorter.
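To see the gap between TTFT and total time in your own stack, time the first token separately from the last. A minimal sketch that works with any iterator of tokens; `fake_llm_stream` is a hypothetical stand-in for a streaming LLM client, and its sleep durations are arbitrary.

```python
import time
from typing import Iterable, Tuple


def measure_stream(tokens: Iterable[str]) -> Tuple[float, float, str]:
    """Return (ttft_ms, total_ms, text) for any token iterator."""
    start = time.perf_counter()
    ttft_ms = None
    parts = []
    for tok in tokens:
        if ttft_ms is None:
            # First token arrived: this is the delay the user perceives.
            ttft_ms = (time.perf_counter() - start) * 1000
        # A real UI would render tok here immediately instead of buffering.
        parts.append(tok)
    total_ms = (time.perf_counter() - start) * 1000
    # With no tokens at all, fall back to the total time.
    return (ttft_ms if ttft_ms is not None else total_ms), total_ms, "".join(parts)


def fake_llm_stream():
    # Hypothetical stand-in for a streaming client.
    time.sleep(0.05)  # delay before the first token
    for word in ["Hello", ",", " world"]:
        yield word
        time.sleep(0.02)  # per-token decode time


ttft, total, text = measure_stream(fake_llm_stream())
print(f"TTFT {ttft:.0f} ms, total {total:.0f} ms: {text!r}")
```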

Frequently Asked Questions

Where do I get these numbers?

Measure each stage with our Embedding Speed Test and Prompt Latency Test, plus your vector database's query metrics, then enter the numbers here.
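If you would rather measure stages inside your own application, a small timing wrapper around each call gives the same per-stage numbers. A minimal sketch; `embed`, `search`, `rerank`, and `generate` are hypothetical callables representing your pipeline, not functions from any particular library.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(stage, results):
    """Record the wall-clock duration of a block, in ms, under results[stage]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[stage] = (time.perf_counter() - start) * 1000


def profile_query(query, embed, search, rerank, generate):
    # embed/search/rerank/generate are hypothetical stand-ins for your pipeline.
    timings = {}
    with timed("embedding", timings):
        vector = embed(query)
    with timed("search", timings):
        hits = search(vector)
    with timed("rerank", timings):
        hits = rerank(query, hits)
    with timed("generation", timings):
        answer = generate(query, hits)
    return answer, timings
```

A single query is noisy, so run a representative batch and enter the median (or p95, if you care about tail latency) for each stage.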

Should I always rerank?

Reranking improves answer quality but adds latency. If your first-stage retrieval is already accurate, you can skip it or apply it only to ambiguous queries.
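One way to apply reranking only to ambiguous queries is to check how clearly the first-stage winner beats the runner-up. A minimal sketch of that heuristic; the margin test and the `0.05` threshold are assumptions, and a sensible threshold depends on your score scale (cosine similarity and BM25 scores differ widely). `search` and `rerank` are hypothetical callables.

```python
import heapq


def needs_rerank(scores, margin=0.05):
    """Hypothetical heuristic: a query is ambiguous when the best
    first-stage score barely beats the second best."""
    if len(scores) < 2:
        return False
    best, runner_up = heapq.nlargest(2, scores)
    return (best - runner_up) < margin


def retrieve(query, search, rerank, margin=0.05):
    # search returns [(doc_id, score), ...]; rerank reorders that list.
    hits = search(query)
    if needs_rerank([s for _, s in hits], margin):
        return rerank(query, hits)  # ambiguous: pay the rerank latency
    # Confident first stage: skip reranking entirely.
    return sorted(hits, key=lambda h: h[1], reverse=True)
```

Logging how often `needs_rerank` fires gives you the fraction of queries that pay the rerank cost, which you can use to weight the rerank stage in this calculator.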