LLM Speed Insights 2025

| Model | Provider | Tokens/sec | TTFT | Context Length | Cost per 1M Tokens (input / output) | Notes |
|---|---|---|---|---|---|---|
| Amazon Nova Pro | Amazon | 84.2 | 0.44 s | 300K | $0.80 / $3.20 | Low latency, scalable performance. [1] |
| Amazon Nova Lite | Amazon | 143.6 | 0.39 s | 300K | $0.06 / $0.24 | Cost-effective inference, large context length. [1] |
| Amazon Nova Micro | Amazon | 190.5 | 0.37 s | 130K | $0.04 / $0.14 | Smaller variant, optimized for speed. [1] |
| OpenAI o1-mini | OpenAI | 188.6 | 11.64 s | 128K | $1.10 / $4.40 | Among the fastest token generation in this table; optimized for speed. [1] |
| OpenAI o1 | OpenAI | 143 | 12.76 s | 128K | $18.75 / $60.00 | Reasoning model; fast token generation but long TTFT. [1][2] |
| OpenAI o3-mini | OpenAI | 159.6 | 14.12 s | 200K | $1.10 / $4.40 | Fast, efficient, reasoning-focused. [1] |
| GPT-4o mini | OpenAI | 91.1 | 0.43 s | 128K | $0.15 / $0.60 | Cost-effective, capable model. [1] |
| GPT-4o | OpenAI | 120.7 | 0.45 s | 130K | $5.00 / $15.00 | Intelligent, fast, versatile. [1] |
| GPT-4 | OpenAI | 20 | 0.4 s | 8K | $30.00 / $60.00 | Low TTFT, but slow token generation. [1] |
| Gemini 1.5 Flash-8B | Google | No data | No data | 1M | $0.04 / $0.15 | Lightweight, low computational cost. [1] |
| Gemini 2.0 Flash | Google | 150 | 0.26 s | 1M | $0.10 / $0.40 | Multimodal; surprisingly high speed for its capabilities. [1] |
| Grok Beta | xAI | 66 | 0.31 s | 128K | $5.00 / $15.00 | Large model; faster than expected for its size. [1] |
| Grok 3 | xAI | No data | No data | 128K | No data | Advanced, contextual, high-speed reasoning. [1] |
| Llama 3.1 Instruct 8B | Meta | 179.7 | 0.37 s | 128K | No data | Efficient, multilingual. [1] |
| Llama 3.1 Instruct 70B | Meta | 82.4 | 0.53 s | 128K | $0.60 / $0.75 | Multilingual, instruction-tuned. [1] |
| Llama 3.2 Instruct 3B | Meta | 134.4 | 0.34 s | 128K | $0.06 / $0.06 | Computationally inexpensive; suited to mobile devices. [1] |
| Llama 3.2 Instruct 11B (Vision) | Meta | 104.8 | 0.29 s | 128K | $0.18 / $0.18 | Vision-focused, multilingual, fast inference. [1] |
| Llama 3.2 Instruct 90B (Vision) | Meta | 40 | 0.35 s | 128K | $0.80 / $0.80 | Multimodal, high-precision visual reasoning. [1] |
| Llama 3.3 Instruct 70B | Meta | 100.9 | 0.56 s | 128K | $0.59 / $0.71 | Balanced speed for a 70B-parameter model. [1] |
| Claude 3 Haiku | Anthropic | 142.5 | 0.56 s | 200K | $0.25 / $1.25 | Fast, efficient, lightweight. [1] |
| Claude 3 Opus | Anthropic | 27.6 | 1.34 s | 200K | $15.00 / $75.00 | Advanced, powerful, deep reasoning. [1] |
| Claude 3.5 Haiku | Anthropic | 65.5 | 0.65 s | 200K | $0.80 / $4.00 | Lightweight, responsive. [1] |
| Claude 3.5 Sonnet | Anthropic | 85 | 0.84 s | 200K | $3.00 / $15.00 | Large model; faster than expected for its size. [1] |
| DeepSeek-V2-Chat | DeepSeek | 17 | 1.58 s | 128K | $0.14 / $0.28 | Efficient and reliable, but slow token generation. [1] |
| DeepSeek R1 Distill Qwen 14B | DeepSeek | 76.9 | 12.21 s | 130K | $0.88 / $0.88 | Cost-effective, efficient. [1] |
| DeepSeek V3 | DeepSeek | 27.9 | 7.35 s | 130K | $0.27 / $1.10 | Efficient, open-source. [1] |
| DeepSeek R1 | DeepSeek | 25 | 60.76 s | 130K | $0.55 / $2.19 | 671B parameters; slow token generation, very long TTFT. [1] |
| Qwen2.5 Coder Instruct 32B | Alibaba | 69 | 0.35 s | 131K | $0.80 / $0.80 | Low cost, low latency. [1] |
| Qwen Turbo | Alibaba | 79 | 1.11 s | 1M | $0.05 / $0.20 | Efficient, cost-effective. [1] |
| Qwen2.5 Max | Alibaba | 36.2 | 1.26 s | 32K | $1.60 / $6.40 | Low latency, scalable performance. [1] |
| Qwen 2.5-72B | Alibaba | 58 | 1.09 s | 130K | $0.00 / $0.00 (Alibaba Cloud) | Speed range depends on inference framework (e.g., vLLM vs. Transformers). [1][2] |
| Mistral Small | Mistral AI | 108 | 0.32 s | 33K | $0.20 / $0.60 | Fast TTFT, low latency. [1] |
| Mistral Saba | Mistral AI | 98 | 0.34 s | 32K | $0.20 / $0.60 | Low latency; efficient for mobile devices. [1] |
| Mixtral 8x7B Instruct | Mistral AI | 101.9 | 0.31 s | 33K | $0.70 / $0.70 | Efficient, multilingual, high-performance. [1] |
| Ministral 3B | Mistral AI | 222.3 | 0.3 s | 128K | $0.04 / $0.04 | Compact, cost-effective; highest throughput in this table. [1] |
| Phi-4 | Microsoft | 35.7 | 0.28 s | 16K | $0.09 / $0.22 | Efficient, reasoning-focused, lightweight. [1] |
| Dolly | Databricks | 8.5–12.8 | No data | No data | No data | 12B parameters; 128 tokens in 10–15 s on an A100. [1] |
| Falcon 40B | TII | 8.61 | No data | 2K | No data | Measured on an RTX 4090; hardware-specific. [1] |
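The table's two latency metrics, TTFT (time to first token) and tokens/sec, can be reproduced with a few lines of instrumentation. A minimal sketch, assuming a generic streaming interface; `fake_stream` below is an illustrative stand-in for a real model API, not any vendor's SDK:

```python
import time

def fake_stream(n_tokens=50, delay=0.001):
    """Stand-in for a streaming model API: yields one token at a time."""
    for i in range(n_tokens):
        time.sleep(delay)
        yield f"tok{i}"

def measure(stream):
    """Return (TTFT in seconds, tokens/sec) for a token stream."""
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in stream:
        if ttft is None:
            # Time to first token: delay before any output appears
            ttft = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start
    # Throughput over the whole stream, including the TTFT delay
    return ttft, count / total

ttft, tps = measure(fake_stream())
print(f"TTFT: {ttft:.3f}s, throughput: {tps:.1f} tok/s")
```

Note that published benchmarks often report decode throughput excluding TTFT, so figures measured this way can differ slightly from the table's values.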

Methodology

Token generation speeds are sourced from verified benchmarks, including third-party analyses (e.g., Artificial Analysis, Vellum.ai), official documentation (e.g., Qwen Docs), and community reports (e.g., GitHub, DatabaseMart). Only models with published speed metrics are included. Figures reflect performance as of February 22, 2025, and were measured under varying hardware and inference conditions, so values are indicative rather than directly comparable across rows.
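As a worked example of the pricing columns: a request with 2,000 input tokens and 500 output tokens at GPT-4o's listed rates ($5.00 / $15.00 per million tokens) costs $0.0175. A sketch; `request_cost` is an illustrative helper, not a vendor API:

```python
def request_cost(in_tokens, out_tokens, in_rate, out_rate):
    """Cost in dollars, given per-million-token rates from the table."""
    return in_tokens / 1e6 * in_rate + out_tokens / 1e6 * out_rate

# GPT-4o rates from the table: $5.00 input, $15.00 output per 1M tokens
cost = request_cost(2_000, 500, 5.00, 15.00)
print(f"${cost:.4f}")  # → $0.0175
```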