Qwen3 VL 8B Thinking

Name: Qwen: Qwen3 VL 8B Thinking API
Brand: Qwen

qwen/qwen3-vl-8b-thinking

Vision · Tools · JSON · Reasoning

by Qwen · 2025-10-14

Qwen3-VL 8B Thinking — open-weight small vision-language reasoning model, 8B params, 128k context.

Endpoints: /v1/chat/completions

Context: 131.1K tokens

Max output: 41K

Input: text + image + video

Output: text

p50 TTFT: 4.25 s

import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.orcarouter.ai/v1",
    # Read the key from the environment instead of passing a literal placeholder.
    api_key=os.environ["ORCAROUTER_API_KEY"],
)

Input: $0.18 / 1M tokens

Output: $2.10 / 1M tokens

p50 TTFT (7d): 4.25 s

p95 TTFT (7d): 8.55 s

Traffic: 146.2K tokens / 7d


What is Qwen3 VL 8B Thinking?

Qwen3 VL 8B Thinking is an 8-billion-parameter multimodal language model developed by the Qwen team at Alibaba Cloud, hosted on OrcaRouter under the provider qwen. It belongs to the Qwen3 family of vision-language models, with the "Thinking" suffix indicating enhanced reasoning capabilities for visual and textual inputs. The model supports inputs including text, images, and video, and produces text outputs. Its context window spans 131,072 tokens, and it can generate up to 40,960 tokens in a single response. The model is accessed via OrcaRouter's OpenAI-compatible API endpoint, and its identifier is qwen/qwen3-vl-8b-thinking.

Who should use this model?

This model is appropriate for developers, researchers, and enterprises that need multimodal reasoning over long contexts (up to 131K tokens) without exceeding a 40K output limit. It is especially useful for tasks that combine visual and textual information, such as summarizing video content, analyzing documents with embedded images, or answering detailed questions about high-resolution photographs. Because it is an 8B-parameter model, it balances capability with computational cost; users who require maximal accuracy on complex multimodal benchmarks might consider larger models, while those with simpler, text-only tasks may prefer cheaper alternatives.

What modalities does it support?

Qwen3 VL 8B Thinking accepts three input modalities: text, images, and video. Images can be static photographs, diagrams, screenshots, or any raster graphics. Video input is treated as a sequence of frames; the model processes video content by sampling frames over time. Output is always text, including markdown-formatted answers, lists, or code blocks. The model does not generate images or audio. When processing video, the context window limits how many frames can be reasonably included; with a 131K token limit, users should consider frame-rate and duration trade-offs. OrcaRouter does not pre-process media beyond standard API upload guidelines.

How does the 'Thinking' variant differ from standard Qwen3 VL?

The "Thinking" variant of Qwen3 VL is designed to produce chain-of-thought reasoning before arriving at a final answer. This improves performance on tasks that require multi-step logic, such as arithmetic with visual elements, spatial reasoning, or temporal understanding in video. Because the intermediate reasoning consumes tokens, outputs tend to be longer, and users should budget token consumption accordingly. A standard (non-thinking) Qwen3 VL model, where available, would answer more directly and concisely. The thinking process is not exposed as separate tokens in the API; it is internal to the model's generation and reflected in the final response text.

Code samples

Call from any SDK

OpenAI-compatible — keep the SDK you already use

OpenAI SDK: https://api.orcarouter.ai/v1

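import os

from openai import OpenAI

# The key is read from the ORCAROUTER_API_KEY environment variable.
client = OpenAI(
    base_url="https://api.orcarouter.ai/v1",
    api_key=os.environ["ORCAROUTER_API_KEY"],
)

# Send a single-turn text prompt to the model.
response = client.chat.completions.create(
    model="qwen/qwen3-vl-8b-thinking",
    messages=[{"role": "user", "content": "Hello"}],
)
# Print the assistant's reply text.
print(response.choices[0].message.content)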
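
Since stream and stream_options are listed among the supported parameters, a streamed response can be consumed piece by piece. The following is a minimal sketch, assuming the chunks follow the OpenAI SDK's streaming shape (chunk.choices[0].delta.content). The helper name collect_stream is hypothetical and not part of any SDK.

```python
from typing import Iterable


def collect_stream(chunks: Iterable) -> str:
    """Concatenate the text deltas of a streamed chat completion."""
    parts = []
    for chunk in chunks:
        # A usage-only chunk (sent when usage reporting is requested) has no choices.
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta
        # Some chunks carry only a role or a finish reason, so content may be None.
        text = getattr(delta, "content", None)
        if text:
            parts.append(text)
    return "".join(parts)


# Usage with an OpenAI client pointed at OrcaRouter (makes a network call):
# stream = client.chat.completions.create(
#     model="qwen/qwen3-vl-8b-thinking",
#     messages=[{"role": "user", "content": "Hello"}],
#     stream=True,
# )
# print(collect_stream(stream))
```

Given the multi-second time to first token reported for this model, streaming may make long reasoning-heavy answers feel more responsive, though total generation time is unchanged.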

Supported parameters

enable_search
enable_thinking
include_reasoning
logprobs
max_tokens
n
parallel_tool_calls
presence_penalty
reasoning
repetition_penalty
response_format
seed
stop
stream
stream_options
temperature
thinking_budget
tool_choice
tools
top_k
top_logprobs
top_p
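
Several of the parameters above (for example top_k, repetition_penalty, enable_thinking, and thinking_budget) are not named keyword arguments of the OpenAI Python SDK, which forwards such fields through its extra_body argument. The sketch below splits a parameter set accordingly. It assumes OrcaRouter reads these fields from the top level of the JSON body, which this page does not confirm; build_request is a hypothetical helper, and the example values (including treating thinking_budget as a token count) are illustrative only.

```python
MODEL_ID = "qwen/qwen3-vl-8b-thinking"

# Keyword arguments that the OpenAI Python SDK's chat.completions.create accepts directly.
SDK_NATIVE = {
    "logprobs", "max_tokens", "n", "parallel_tool_calls", "presence_penalty",
    "response_format", "seed", "stop", "stream", "stream_options",
    "temperature", "tool_choice", "tools", "top_logprobs", "top_p",
}


def build_request(prompt: str, **params) -> dict:
    """Split parameters into SDK-native kwargs and an extra_body payload."""
    native = {k: v for k, v in params.items() if k in SDK_NATIVE}
    extra = {k: v for k, v in params.items() if k not in SDK_NATIVE}
    request = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        **native,
    }
    if extra:
        # Assumption: provider-specific fields are accepted in the request body.
        request["extra_body"] = extra
    return request


# Usage (network call): client.chat.completions.create(**build_request(
#     "Describe the chart.", temperature=0.2, max_tokens=1024, top_k=20))
```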

Pricing

Input / 1M tokens	$0.180
Output / 1M tokens	$2.10
Currency	USD

Cost calculator

Tokens / month: 10M

70%

Estimate based on list price

Token & cost estimator

Expected output tokens

Input tokens: 20

Cost per request: $0.001054

Estimate only — actual token counts depend on the provider's tokenizer.
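
The per-request figure can be reproduced from the list prices. The sketch below assumes 500 expected output tokens, a value not shown on this page but one that matches the $0.001054 estimate for 20 input tokens; request_cost is a hypothetical helper.

```python
INPUT_USD_PER_M = 0.18   # list price per 1M input tokens
OUTPUT_USD_PER_M = 2.10  # list price per 1M output tokens


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the list-price cost in USD of a single request."""
    total = input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M
    return total / 1_000_000


# 20 input tokens plus an assumed 500 output tokens.
print(f"${request_cost(20, 500):.6f}")  # prints $0.001054
```

Output tokens dominate the cost here: at these rates one output token costs roughly as much as twelve input tokens, which matters for a model that tends to produce long reasoning-heavy responses.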

Performance

last 7 days

p50 TTFT: 4.25 s

Output speed: 71.2 tok/s

p95 TTFT: 8.55 s

Error rate
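
These figures allow a rough end-to-end latency estimate: time to first token plus generation time at the reported output speed. This is a back-of-the-envelope sketch that assumes a constant output speed and the median TTFT; real requests will vary, and estimated_seconds is a hypothetical helper.

```python
P50_TTFT_S = 4.25        # median time to first token over the last 7 days
OUTPUT_TOK_PER_S = 71.2  # reported output speed


def estimated_seconds(output_tokens: int) -> float:
    """Approximate wall-clock time for a response of the given length."""
    return P50_TTFT_S + output_tokens / OUTPUT_TOK_PER_S


# A 1,000-token response: 4.25 s + 1000 / 71.2 s, about 18.3 s in total.
print(f"{estimated_seconds(1000):.1f} s")
```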

Public benchmarks

pending


Source: Design Arena

How it compares

	Qwen3 VL 8B Thinking	qwen/qwen3-max-preview	Qwen3.5 397B A17B	qwen/qwen3.5-plus
Input $/M	$0.18	$0.86	$0.17	$0.12
Output $/M	$2.10	$3.44	$1.03	$0.69
Context	131K	262K	33K	1.0M
Quality	4/10	8/10	8/10	8/10

More from Qwen


Qwen3.6 35B A3B (Cheapest)

qwen/qwen3.6-35b-a3b

$0.25 in · $1.49 out / 1M

262.1K ctx · quality 8/10

Qwen3.6 Plus

qwen/qwen3.6-plus

$0.50 in · $3.00 out / 1M

1.05M ctx · quality 8/10

Qwen3.7 Plus

qwen/qwen3.7-plus

$0.35 in · $1.42 out / 1M

1M ctx · quality 8/10

FAQ

What is the cost per million tokens for Qwen3 VL 8B Thinking on OrcaRouter?

Input tokens: $0.18 per 1 million tokens. Output tokens: $2.10 per 1 million tokens. These are the provider's rates passed through with zero markup.

What is the context window and maximum output token count?

Context window is 131,072 tokens. Maximum output tokens is 40,960 tokens.
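
When setting max_tokens, both limits apply at once. The sketch below picks the largest value that respects them, assuming that input and output tokens share the 131,072-token window (a common convention, but not stated on this page); safe_max_tokens is a hypothetical helper.

```python
CONTEXT_WINDOW = 131_072  # total tokens
MAX_OUTPUT = 40_960       # maximum output tokens per response


def safe_max_tokens(input_tokens: int, requested: int = MAX_OUTPUT) -> int:
    """Largest max_tokens that fits both the output cap and the remaining window."""
    if input_tokens >= CONTEXT_WINDOW:
        raise ValueError("prompt alone exceeds the context window")
    remaining = CONTEXT_WINDOW - input_tokens
    # Never return less than 1 so the request stays valid.
    return max(1, min(requested, MAX_OUTPUT, remaining))
```

For example, a 100,000-token prompt leaves 31,072 tokens of headroom, so the output cap of 40,960 cannot be reached in that request under this assumption.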

What are the main strengths of this model?

It handles three input modalities (text, image, video), offers a large context and output length, and includes a thinking (chain-of-thought) capability that improves performance on complex reasoning tasks.

How does this model compare to Qwen2 VL 7B?

It has a larger context (131K vs 32K), larger max output (40K vs 2K), and adds thinking capabilities. Pricing is slightly higher but offers improved reasoning and multimodal understanding.

Does OrcaRouter cache any user data when using this model?

OrcaRouter does not report any caching mechanism that stores prompts or images; data handling follows standard API practices. Consult OrcaRouter's privacy policy for details.

How do I call this model via an OpenAI-compatible API?

Send a POST to https://api.orcarouter.ai/v1/chat/completions with model "qwen/qwen3-vl-8b-thinking" and your API key. Use a content array with text and image_url entries for multimodal input.
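
A minimal sketch of the content array described above, using the OpenAI chat format's text and image_url parts. The helper name image_message and the example URL are placeholders.

```python
def image_message(question: str, image_urls: list[str]) -> dict:
    """Build one user message combining a text part with image_url parts."""
    content = [{"type": "text", "text": question}]
    for url in image_urls:
        content.append({"type": "image_url", "image_url": {"url": url}})
    return {"role": "user", "content": content}


# Usage (network call), with a placeholder image URL:
# response = client.chat.completions.create(
#     model="qwen/qwen3-vl-8b-thinking",
#     messages=[image_message("What does this chart show?",
#                             ["https://example.com/chart.png"])],
# )
```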

Can the model process video input?

Yes, video is accepted by sending extracted frames as separate image_url entries. The model has no native video codec support; it works on static frame sets.
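
The frame-based approach can be sketched as follows: sample a bounded number of already-extracted JPEG frames evenly across the clip and attach each as an image_url entry. This assumes base64 data URLs are accepted for image_url (standard in the OpenAI format, but not confirmed on this page); frames_to_content and the default of 16 frames are hypothetical choices, and frame extraction itself is left to a tool of your choosing.

```python
import base64


def frames_to_content(prompt: str, jpeg_frames: list[bytes], max_frames: int = 16) -> list[dict]:
    """Evenly sample up to max_frames frames and encode each as a data URL part."""
    if max_frames < 1:
        raise ValueError("max_frames must be at least 1")
    total = len(jpeg_frames)
    if total > max_frames:
        # Pick frames at a fixed stride so the sample spans the whole clip.
        step = total / max_frames
        picked = [jpeg_frames[int(i * step)] for i in range(max_frames)]
    else:
        picked = list(jpeg_frames)
    content = [{"type": "text", "text": prompt}]
    for frame in picked:
        encoded = base64.b64encode(frame).decode("ascii")
        url = f"data:image/jpeg;base64,{encoded}"
        content.append({"type": "image_url", "image_url": {"url": url}})
    return content
```

Capping the frame count keeps the request within the context limit, since every frame contributes image tokens.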

Is there any restriction on the number of images per request?

No, except the total token count (including text and image tokens) must be within the 131,072 context limit. There is no separate image limit.

What programming languages or libraries can I use to access this model via OrcaRouter?

Any language that supports HTTP requests and the OpenAI API format. Python with the openai library, JavaScript with fetch, etc. Just set the base URL and model ID.
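
To illustrate that no SDK is required, here is a sketch of the same call built with Python's standard library only. It assumes Bearer-token authentication in the Authorization header, as in the OpenAI API format; build_http_request is a hypothetical helper.

```python
import json
import urllib.request

URL = "https://api.orcarouter.ai/v1/chat/completions"


def build_http_request(prompt: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a chat completion POST request."""
    body = json.dumps({
        "model": "qwen/qwen3-vl-8b-thinking",
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        URL,
        data=body,
        method="POST",
        headers={
            # Assumption: OpenAI-style Bearer authentication.
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


# To send it (network call):
# import os
# req = build_http_request("Hello", os.environ["ORCAROUTER_API_KEY"])
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```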

Does the thinking variant always return chain-of-thought reasoning?

The model internally uses step-by-step reasoning before generating the final answer, but the output you see is the complete response. You cannot extract the intermediate thinking tokens separately.

Embed this badge

Paste into your blog post

Qwen: Qwen3 VL 8B Thinking • $0.18/M in • 4250 ms p50 • via OrcaRouter

HTML <a href="https://www.orcarouter.ai/models/qwen/qwen3-vl-8b-thinking" target="_blank"> <img src="https://www.orcarouter.ai/embed/qwen/qwen3-vl-8b-thinking.svg" alt="Qwen: Qwen3 VL 8B Thinking on OrcaRouter" /> </a>

Markdown [![Qwen: Qwen3 VL 8B Thinking](https://www.orcarouter.ai/embed/qwen/qwen3-vl-8b-thinking.svg)](https://www.orcarouter.ai/models/qwen/qwen3-vl-8b-thinking)

Model card as data

GET /api/public/models/qwen/qwen3-vl-8b-thinking
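
A sketch of fetching this endpoint with the standard library. The host is an assumption (the same www.orcarouter.ai host used by the badge links), as is the JSON response format; the response schema is not documented on this page, so the function simply returns the parsed body. Both helper names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "https://www.orcarouter.ai"  # assumption: same host as the badge links


def model_card_url(model_id: str) -> str:
    """Build the public model-card URL for a model identifier."""
    return f"{BASE_URL}/api/public/models/{model_id}"


def fetch_model_card(model_id: str) -> dict:
    """Fetch and parse the model card (network call; assumes a JSON response)."""
    with urllib.request.urlopen(model_card_url(model_id)) as resp:
        return json.load(resp)


# Usage (network call):
# card = fetch_model_card("qwen/qwen3-vl-8b-thinking")
```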

Machine-readable: /llms.txt, /llms-full.txt
