Model Name | Provider | Tokens/sec | Latency (TTFT) | Context Length (tokens) | Cost/Million Tokens | Notes
---|---|---|---|---|---|---
Amazon Nova Pro | Amazon | 84.2 | 0.44s | 300K | Input: $0.80, Output: $3.20 | Low latency, scalable performance.[1]
Amazon Nova Lite | Amazon | 143.6 | 0.39s | 300K | Input: $0.06, Output: $0.24 | Cost-effective inference, large context length.[1]
Amazon Nova Micro | Amazon | 190.5 | 0.37s | 130K | Input: $0.04, Output: $0.14 | Smaller variant, optimized for speed.[1]
OpenAI o1-mini | OpenAI | 188.6 | 11.64s | 128K | Input: $1.10, Output: $4.40 | Fast token generation, but long TTFT.[1]
OpenAI o1 | OpenAI | 143 | 12.76s | 128K | Input: $18.75, Output: $60.00 | Reasoning model; fast token generation, but long TTFT.[1],[2]
OpenAI o3-mini | OpenAI | 159.6 | 14.12s | 200K | Input: $1.10, Output: $4.40 | Fast, efficient, reasoning-focused.[1]
GPT-4o mini | OpenAI | 91.1 | 0.43s | 128K | Input: $0.15, Output: $0.60 | Cost-effective, intelligent model.[1]
GPT-4o | OpenAI | 120.7 | 0.45s | 130K | Input: $5.00, Output: $15.00 | Intelligent, fast, versatile.[1]
GPT-4 | OpenAI | 20 | 0.4s | 8K | Input: $30.00, Output: $60.00 | Low TTFT, but slow token generation.[1]
Gemini 1.5 Flash-8B | Google | No data | No data | 1M | Input: $0.04, Output: $0.15 | Lightweight, low computational cost.[1]
Gemini 2.0 Flash | Google | 150 | 0.26s | 1M | Input: $0.10, Output: $0.40 | Multimodal model; high speed for its capabilities.[1]
Grok Beta | xAI | 66 | 0.31s | 128K | Input: $5.00, Output: $15.00 | Large model, faster than expected for its size.[1]
Grok 3 | xAI | N/A | No data | 128K | No data | Advanced, contextual, high-speed reasoning.[1]
Llama 3.1 Instruct 8B | Meta | 179.7 | 0.37s | 128K | No data | Efficient, multilingual.[1]
Llama 3.1 Instruct 70B | Meta | 82.4 | 0.53s | 128K | Input: $0.60, Output: $0.75 | Multilingual, instruction-tuned language model.[1]
Llama 3.2 Instruct 3B | Meta | 134.4 | 0.34s | 128K | Input: $0.06, Output: $0.06 | Computationally inexpensive; suited for mobile devices.[1]
Llama 3.2 Instruct 11B (Vision) | Meta | 104.8 | 0.29s | 128K | Input: $0.18, Output: $0.18 | Vision-focused, multilingual, fast inference.[1]
Llama 3.2 Instruct 90B (Vision) | Meta | 40 | 0.35s | 128K | Input: $0.80, Output: $0.80 | Multimodal, high-precision visual reasoning.[1]
Llama 3.3 Instruct 70B | Meta | 100.9 | 0.56s | 128K | Input: $0.59, Output: $0.71 | Balanced speed for a 70B-parameter model.[1]
Claude 3 Haiku | Anthropic | 142.5 | 0.56s | 200K | Input: $0.25, Output: $1.25 | Fast, efficient, lightweight.[1]
Claude 3 Opus | Anthropic | 27.6 | 1.34s | 200K | Input: $15.00, Output: $75.00 | Advanced, powerful, deep reasoning.[1]
Claude 3.5 Haiku | Anthropic | 65.5 | 0.65s | 200K | Input: $0.80, Output: $4.00 | Lightweight, responsive.[1]
Claude 3.5 Sonnet | Anthropic | 85 | 0.84s | 200K | Input: $3.00, Output: $15.00 | Large model, faster than expected for its size.[1]
DeepSeek-V2-Chat | DeepSeek | 17 | 1.58s | 128K | Input: $0.14, Output: $0.28 | Efficient and reliable, but slow token generation.[1]
DeepSeek R1 Distill Qwen 14B | DeepSeek | 76.9 | 12.21s | 130K | Input: $0.88, Output: $0.88 | Cost-effective, efficient.[1]
DeepSeek V3 | DeepSeek | 27.9 | 7.35s | 130K | Input: $0.27, Output: $1.10 | Efficient, open-source.[1]
DeepSeek R1 | DeepSeek | 25 | 60.76s | 130K | Input: $0.55, Output: $2.19 | 671B parameters; slow token generation.[1]
Qwen2.5 Coder Instruct 32B | Alibaba | 69 | 0.35s | 131K | Input: $0.80, Output: $0.80 | Low cost, low latency.[1]
Qwen Turbo | Alibaba | 79 | 1.11s | 1M | Input: $0.05, Output: $0.20 | Efficient, cost-effective model.[1]
Qwen2.5 Max | Alibaba | 36.2 | 1.26s | 32K | Input: $1.60, Output: $6.40 | Low latency, scalable performance.[1]
Qwen 2.5-72B | Alibaba | 58 | 1.09s | 130K | Input: $0.00, Output: $0.00 (Alibaba Cloud) | Throughput range depends on inference framework (e.g., vLLM vs. Transformers).[1],[2]
Mistral Small | Mistral AI | 108 | 0.32s | 33K | Input: $0.20, Output: $0.60 | Fast TTFT, low latency.[1]
Mistral Saba | Mistral AI | 98 | 0.34s | 32K | Input: $0.20, Output: $0.60 | Low latency; efficient for mobile devices.[1]
Mixtral 8x7B Instruct | Mistral AI | 101.9 | 0.31s | 33K | Input: $0.70, Output: $0.70 | Efficient, multilingual, high-performance.[1]
Ministral 3B | Mistral AI | 222.3 | 0.3s | 128K | Input: $0.04, Output: $0.04 | Compact, fast, cost-effective.[1]
Phi-4 | Microsoft | 35.7 | 0.28s | 16K | Input: $0.09, Output: $0.22 | Efficient, reasoning-focused, lightweight.[1]
Dolly | Databricks | 8.5-12.8 | No data | No data | No data | 12B parameters; ~128 tokens in 10-15 sec on an A100.[1]
Falcon 40B | TII | 8.61 | No data | 2K | No data | Measured on an RTX 4090; hardware-specific.[1]
Token generation speeds are sourced from verified benchmarks, including third-party analyses (e.g., Artificial Analysis, Vellum.ai), official documentation (e.g., Qwen Docs), and community reports (e.g., GitHub, DatabaseMart). Only models with published speed metrics are included, ensuring accuracy. Data reflects performance as of February 22, 2025, under varying hardware and inference conditions.
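To combine these metrics into a request-level estimate, a rough rule of thumb is: response latency ≈ TTFT + output tokens ÷ tokens/sec, and cost = token counts scaled by the per-million-token rates. A minimal sketch in Python (the helper function and its parameter names are illustrative, not from any vendor SDK; the sample figures come from the GPT-4o row above):

```python
def estimate_request(ttft_s, tokens_per_sec, input_tokens, output_tokens,
                     input_cost_per_m, output_cost_per_m):
    """Rough per-request estimate from the table's metrics (hypothetical helper).

    latency ≈ TTFT + output_tokens / throughput (ignores network overhead);
    cost scales the per-million-token rates by the actual token counts.
    """
    latency_s = ttft_s + output_tokens / tokens_per_sec
    cost_usd = (input_tokens * input_cost_per_m
                + output_tokens * output_cost_per_m) / 1_000_000
    return latency_s, cost_usd

# Example: GPT-4o row (120.7 tok/s, 0.45s TTFT, $5.00 in / $15.00 out per million)
latency, cost = estimate_request(0.45, 120.7, 1_000, 500, 5.00, 15.00)
print(f"~{latency:.1f} s, ~${cost:.4f}")  # ~4.6 s, ~$0.0125
```

This makes the trade-offs in the table concrete: for long outputs, tokens/sec dominates perceived latency, while for short outputs the reasoning models' multi-second TTFTs dominate.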