LLM Speed Insights 2025

| Model | Provider | Tokens/sec | TTFT | Context Length | Cost per 1M Tokens (input / output) | Notes |
|---|---|---|---|---|---|---|
| Amazon Nova Pro | Amazon | 84.2 | 0.44 s | 300K | $0.80 / $3.20 | Low latency, scalable performance. [1] |
| Amazon Nova Lite | Amazon | 143.6 | 0.39 s | 300K | $0.06 / $0.24 | Cost-effective inference, large context length. [1] |
| Amazon Nova Micro | Amazon | 190.5 | 0.37 s | 130K | $0.04 / $0.14 | Smaller variant, optimized for speed. [1] |
| OpenAI o1-mini | OpenAI | 188.6 | 11.64 s | 128K | $1.10 / $4.40 | Among the fastest token generation in this table; optimized for speed. [1] |
| OpenAI o1 | OpenAI | 143 | 12.76 s | 128K | $18.75 / $60.00 | Reasoning model; fast token generation but long TTFT. [1][2] |
| OpenAI o3-mini | OpenAI | 159.6 | 14.12 s | 200K | $1.10 / $4.40 | Fast, efficient, reasoning-focused. [1] |
| GPT-4o mini | OpenAI | 91.1 | 0.43 s | 128K | $0.15 / $0.60 | Cost-effective, capable model. [1] |
| GPT-4o | OpenAI | 120.7 | 0.45 s | 130K | $5.00 / $15.00 | Intelligent, fast, versatile. [1] |
| GPT-4 | OpenAI | 20 | 0.4 s | 8K | $30.00 / $60.00 | Low TTFT, but slow token generation. [1] |
| Gemini 1.5 Flash-8B | Google | No data | No data | 1M | $0.04 / $0.15 | Lightweight, low computational cost. [1] |
| Gemini 2.0 Flash | Google | 150 | 0.26 s | 1M | $0.10 / $0.40 | Multimodal; surprisingly high speed for its capabilities. [1] |
| Grok Beta | xAI | 66 | 0.31 s | 128K | $5.00 / $15.00 | Large model; faster than expected for its size. [1] |
| Grok 3 | xAI | No data | No data | 128K | No data | Advanced, contextual, high-speed reasoning. [1] |
| Llama 3.1 Instruct 8B | Meta | 179.7 | 0.37 s | 128K | No data | Efficient, multilingual. [1] |
| Llama 3.1 Instruct 70B | Meta | 82.4 | 0.53 s | 128K | $0.60 / $0.75 | Multilingual, instruction-tuned. [1] |
| Llama 3.2 Instruct 3B | Meta | 134.4 | 0.34 s | 128K | $0.06 / $0.06 | Computationally inexpensive; suited to mobile devices. [1] |
| Llama 3.2 Instruct 11B (Vision) | Meta | 104.8 | 0.29 s | 128K | $0.18 / $0.18 | Vision-focused, multilingual, fast inference. [1] |
| Llama 3.2 Instruct 90B (Vision) | Meta | 40 | 0.35 s | 128K | $0.80 / $0.80 | Multimodal, high-precision visual reasoning. [1] |
| Llama 3.3 Instruct 70B | Meta | 100.9 | 0.56 s | 128K | $0.59 / $0.71 | Balanced speed for a 70B-parameter model. [1] |
| Claude 3 Haiku | Anthropic | 142.5 | 0.56 s | 200K | $0.25 / $1.25 | Fast, efficient, lightweight. [1] |
| Claude 3 Opus | Anthropic | 27.6 | 1.34 s | 200K | $15.00 / $75.00 | Advanced, powerful, deep reasoning. [1] |
| Claude 3.5 Haiku | Anthropic | 65.5 | 0.65 s | 200K | $0.80 / $4.00 | Lightweight, responsive. [1] |
| Claude 3.5 Sonnet | Anthropic | 85 | 0.84 s | 200K | $3.00 / $15.00 | Large model; faster than expected for its size. [1] |
| DeepSeek-V2-Chat | DeepSeek | 17 | 1.58 s | 128K | $0.14 / $0.28 | Efficient and reliable, but slow token generation. [1] |
| DeepSeek R1 Distill Qwen 14B | DeepSeek | 76.9 | 12.21 s | 130K | $0.88 / $0.88 | Cost-effective, efficient. [1] |
| DeepSeek V3 | DeepSeek | 27.9 | 7.35 s | 130K | $0.27 / $1.10 | Efficient, open-source. [1] |
| DeepSeek R1 | DeepSeek | 25 | 60.76 s | 130K | $0.55 / $2.19 | 671B parameters; slow token generation, very long TTFT. [1] |
| Qwen2.5 Coder Instruct 32B | Alibaba | 69 | 0.35 s | 131K | $0.80 / $0.80 | Low cost, low latency. [1] |
| Qwen Turbo | Alibaba | 79 | 1.11 s | 1M | $0.05 / $0.20 | Efficient, cost-effective. [1] |
| Qwen2.5 Max | Alibaba | 36.2 | 1.26 s | 32K | $1.60 / $6.40 | Low latency, scalable performance. [1] |
| Qwen 2.5-72B | Alibaba | 58 | 1.09 s | 130K | $0.00 / $0.00 (Alibaba Cloud) | Speed range depends on inference framework (e.g., vLLM vs. Transformers). [1][2] |
| Mistral Small | Mistral AI | 108 | 0.32 s | 33K | $0.20 / $0.60 | Fast TTFT, low latency. [1] |
| Mistral Saba | Mistral AI | 98 | 0.34 s | 32K | $0.20 / $0.60 | Low latency; efficient for mobile devices. [1] |
| Mixtral 8x7B Instruct | Mistral AI | 101.9 | 0.31 s | 33K | $0.70 / $0.70 | Efficient, multilingual, high-performance. [1] |
| Ministral 3B | Mistral AI | 222.3 | 0.3 s | 128K | $0.04 / $0.04 | Compact, cost-effective; highest throughput in this table. [1] |
| Phi-4 | Microsoft | 35.7 | 0.28 s | 16K | $0.09 / $0.22 | Efficient, reasoning-focused, lightweight. [1] |
| Dolly | Databricks | 8.5–12.8 | No data | No data | No data | 12B parameters; 128 tokens in 10–15 s on an A100. [1] |
| Falcon 40B | TII | 8.61 | No data | 2K | No data | Measured on an RTX 4090; hardware-specific. [1] |
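The table's two latency metrics, TTFT (time to first token) and tokens/sec, can be reproduced with a few lines of instrumentation. A minimal sketch, assuming a generic streaming interface; `fake_stream` below is an illustrative stand-in for a real model API, not any vendor's SDK:

```python
import time

def fake_stream(n_tokens=50, delay=0.001):
    """Stand-in for a streaming model API: yields one token at a time."""
    for i in range(n_tokens):
        time.sleep(delay)
        yield f"tok{i}"

def measure(stream):
    """Return (TTFT in seconds, tokens/sec) for a token stream."""
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in stream:
        if ttft is None:
            # Time to first token: delay before any output appears
            ttft = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start
    # Throughput over the whole stream, including the TTFT delay
    return ttft, count / total

ttft, tps = measure(fake_stream())
print(f"TTFT: {ttft:.3f}s, throughput: {tps:.1f} tok/s")
```

Note that published benchmarks often report decode throughput excluding TTFT, so figures measured this way can differ slightly from the table's values.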

Methodology

Token generation speeds are sourced from verified benchmarks, including third-party analyses (e.g., Artificial Analysis, Vellum.ai), official documentation (e.g., Qwen Docs), and community reports (e.g., GitHub, DatabaseMart). Only models with published speed metrics are included. Figures reflect performance as of February 22, 2025, and were measured under varying hardware and inference conditions, so values are indicative rather than directly comparable across rows.
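As a worked example of the pricing columns: a request with 2,000 input tokens and 500 output tokens at GPT-4o's listed rates ($5.00 / $15.00 per million tokens) costs $0.0175. A sketch; `request_cost` is an illustrative helper, not a vendor API:

```python
def request_cost(in_tokens, out_tokens, in_rate, out_rate):
    """Cost in dollars, given per-million-token rates from the table."""
    return in_tokens / 1e6 * in_rate + out_tokens / 1e6 * out_rate

# GPT-4o rates from the table: $5.00 input, $15.00 output per 1M tokens
cost = request_cost(2_000, 500, 5.00, 15.00)
print(f"${cost:.4f}")  # → $0.0175
```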