Question 1

What is DeepInfra?

Accepted Answer

DeepInfra is an API provider that hosts large language models. Active models: 12; From (input): $0.06 / 1M tok; Avg throughput: 55 tok/s; Avg latency: 1.01 s; Max context: 1.0M.

Question 2

How many models does DeepInfra offer?

Accepted Answer

DeepInfra currently serves 12 active models out of 47 historical offerings on LLM Stats.

Question 3

What is DeepInfra's API pricing?

Accepted Answer

DeepInfra input pricing starts from $0.06 per 1M tokens, with the most expensive offering at $1.74 per 1M tokens. See the Pricing tab above for the full per-model breakdown.

Question 4

How fast is DeepInfra?

Accepted Answer

DeepInfra averages 55 output tokens per second across its catalog, with average latency of 1.01s. Per-model performance is shown in the Performance tab.

Question 5

Is DeepInfra OpenAI compatible?

Accepted Answer

Most providers expose an OpenAI-compatible /v1/chat/completions endpoint so you can switch from OpenAI to DeepInfra by changing only the base URL and API key. Check https://deepinfra.com/ for the exact endpoint format and any provider-specific parameters.

Question 6

Does DeepInfra support multimodal models?

Accepted Answer

Yes. DeepInfra's catalog includes 6 vision-capable models. See the Models and Capabilities tabs for the full per-model breakdown.

Question 7

Whose models does DeepInfra host?

Accepted Answer

DeepInfra hosts models from DeepSeek, NVIDIA, OpenAI, Alibaba Cloud / Qwen Team, Xiaomi, and Google, plus 6 more. See the Models tab for the full catalog grouped by creator.

Question 8

How do I start using DeepInfra?

Accepted Answer

Sign up at https://deepinfra.com/ to get an API key, then call DeepInfra's API directly from your application. Most clients work out of the box by pointing the OpenAI SDK at DeepInfra's base URL with your key. Use the Pricing and Performance tabs above to pick the right model for your latency, cost, and context-window requirements.

Model	Input /M	Output /M	Throughput	Context	Capabilities
Qwen3 VL 8B Thinkingfp8	$0.180	$2.09	—	262K	Vision
Qwen3 VL 8B Thinkingfp8	$0.180	$2.09	—	262K	—
Qwen3 VL 8B Instructfp8	$0.180	$0.690	—	262K	Vision
Qwen3 VL 8B Instructfp8	$0.180	$0.690	—	262K	—
Qwen3 VL 4B Thinkingfp8	$0.100	$1.00	—	262K	Vision
Qwen3 VL 4B Thinkingfp8	$0.100	$1.00	—	262K	—
Qwen3 VL 4B Instructfp8	$0.100	$0.600	—	262K	Vision
Qwen3 VL 4B Instructfp8	$0.100	$0.600	—	262K	—
Qwen3 VL 235B A22B Instructfp8	$0.300	$1.49	—	262K	Vision
Qwen3 VL 235B A22B Instructfp8	$0.300	$1.49	—	262K	—
Qwen3 32B	$0.100	$0.300	27t/s	128K	—
Qwen3 30B A3B	$0.100	$0.300	83t/s	128K	—

Model	Input /M	Output /M	Throughput	Context	Capabilities
Qwen3 VL 8B Thinkingfp8	$0.180	$2.09	—	262K	Vision
Qwen3 VL 8B Thinkingfp8	$0.180	$2.09	—	262K	—
Qwen3 VL 8B Instructfp8	$0.180	$0.690	—	262K	Vision
Qwen3 VL 8B Instructfp8	$0.180	$0.690	—	262K	—
Qwen3 VL 4B Thinkingfp8	$0.100	$1.00	—	262K	Vision
Qwen3 VL 4B Thinkingfp8	$0.100	$1.00	—	262K	—
Qwen3 VL 4B Instructfp8	$0.100	$0.600	—	262K	Vision
Qwen3 VL 4B Instructfp8	$0.100	$0.600	—	262K	—
Qwen3 VL 235B A22B Instructfp8	$0.300	$1.49	—	262K	Vision
Qwen3 VL 235B A22B Instructfp8	$0.300	$1.49	—	262K	—
Qwen3 32B	$0.100	$0.300	27t/s	128K	—
Qwen3 30B A3B	$0.100	$0.300	83t/s	128K	—

DeepInfra: API pricing, performance & models

Catalog

DeepInfrapricing, performance & catalog

Most affordable

Fastest

Largest context

FAQ