Qwen3-235B-A22B

Flagship model with 235 billion parameters

Focus on performance and complexity

The Qwen3-235B-A22B model at a glance

Qwen3-235B-A22B is the flagship model of the Qwen3 series – developed by Alibaba Cloud for highly complex, performance-intensive AI scenarios. With 235 billion total parameters, of which 22 billion are activated per token (the "A22B" in its name), in a Mixture-of-Experts architecture, it is one of the largest and most capable openly available language models to date.

The model combines exceptional language processing capabilities with deep context understanding, precise tool use and strong multilingual ability. Trained with modern methods including RLHF and DPO, Qwen3-235B-A22B has been specifically optimized for helpfulness, safety and scalability – and is available for commercial use under the Apache 2.0 license.

Name:

Qwen3-235B-A22B (part of the Qwen3 model family)

Developer:

Qwen Team (Alibaba Group)

Publication:

April 29, 2025

License:

Apache 2.0 License (Open Source, commercial use permitted)

Model type:

Mixture-of-Experts (MoE) Causal Language Model based on Transformer.

Parameters:

Total: 235 billion; activated per token: 22 billion; non-embedding: 234 billion

Tokenizer:

Qwen2 Tokenizer (Tiktoken-based), vocabulary size: 151,936. Compatible with the current Hugging Face transformers library (chat template available for Instruct/Chat variants).

Layers:

94 Transformer layers

Attention heads:

64 query heads, 4 key/value heads (Grouped-Query Attention, GQA).

Experts (MoE):

Total number of experts: 128, activated experts per token: 8
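The sparse activation described above can be sketched with a toy top-k router. This is purely illustrative: the real model uses a learned gating network, and all names below are made up for the example.

```python
import random

NUM_EXPERTS = 128  # total experts per MoE layer (per the spec above)
TOP_K = 8          # experts activated per token

def route_token(rng):
    # A learned gating network would score each expert for the current
    # token; random scores stand in for those logits here.
    scores = [rng.random() for _ in range(NUM_EXPERTS)]
    # Select the TOP_K highest-scoring experts; only their weights
    # participate in this token's forward pass.
    return sorted(range(NUM_EXPERTS), key=scores.__getitem__, reverse=True)[:TOP_K]

chosen = route_token(random.Random(0))  # 8 expert indices out of 128
```

This is why only 22 of the 235 billion parameters are active per token: each token touches just 8 of the 128 experts in every MoE layer.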

Context length:

Native: 32,768 tokens (32K), with YaRN scaling: up to 131,072 tokens

Variations of the Qwen3 series

The Qwen3 series includes various model sizes, both dense and MoE models:

  • Dense models: Qwen3-0.6B, Qwen3-1.7B, Qwen3-4B, Qwen3-8B, Qwen3-14B, Qwen3-32B
  • MoE models: Qwen3-30B-A3B, Qwen3-235B-A22B

Available variants include basic models (“Base”), instruction-fine-tuned models (“Instruct”) and chat models (“Chat”).

Special features of the Qwen3-235B-A22B model

"Thinking Mode" and "Non-Thinking Mode"

The model supports switching between a "Thinking Mode" and a "Non-Thinking Mode" (e.g. via the /think and /no_think soft switches, or the enable_thinking parameter in Instruct models). In Thinking Mode the model reasons step by step before responding, which can improve performance on complex tasks such as tool use and function calling. The Instruct/Chat variants are additionally fine-tuned for instruction following and conversation (SFT and RLHF/DPO).
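As a sketch, the mode switch can be driven either via the Hugging Face chat template's enable_thinking argument or via the soft switches in the prompt. The helper function below is hypothetical, not part of any library, and the commented-out call assumes a loaded Qwen3 Instruct tokenizer.

```python
messages = [{"role": "user", "content": "Summarize grouped-query attention."}]

# With a loaded Qwen3 tokenizer (transformers >= 4.51.0), the hard switch
# is a chat-template argument:
#   text = tokenizer.apply_chat_template(
#       messages, tokenize=False, add_generation_prompt=True,
#       enable_thinking=True,  # False selects Non-Thinking Mode
#   )

def with_soft_switch(msgs, think):
    """Append the /think or /no_think soft switch to the last user turn."""
    msgs = [dict(m) for m in msgs]  # copy so the original stays untouched
    msgs[-1]["content"] += " /think" if think else " /no_think"
    return msgs

prompt = with_soft_switch(messages, think=False)
```

The soft switches let a single deployment toggle the mode per turn, without changing the template argument between requests.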

Multilingual support

Good support for over 100 languages and dialects, with strong ability to follow multilingual instructions and translation.

Agentic/Tools capability

Optimized for integration into agents and for tool calling, especially in the Instruct variants (e.g. with Qwen-Agent).

Compatible inference frameworks

Hugging Face Transformers (>=4.51.0 for MoE models), SGLang (>=0.4.6.post1), vLLM (>=0.8.5), Ollama, LMStudio, MLX-LM, llama.cpp, KTransformers and others.
Individual AI consulting

Is Qwen3-235B-A22B the right model for you?

We would be happy to advise you individually on which AI model suits your requirements. Arrange a no-obligation initial consultation with our AI experts and exploit the full potential of AI for your project!

The post-training pipeline for Qwen3-235B-A22B

Training data & training process

Qwen3-235B-A22B – like all models in the Qwen3 series – has been pre-trained on a comprehensive, carefully curated dataset of approximately 36 trillion tokens. The data comes from a diverse mix of publicly available web content, program code, technical literature, books and scientific papers. A multi-stage preparation process ensured that only high-quality, relevant and safe content was included in training – with the aim of creating a model with maximum linguistic competence, robustness and scalability.

Following the pre-training, the model was further refined for instruct and chat applications: first through supervised fine-tuning (SFT) on extensive instruction data sets, then through reinforcement learning from human feedback (RLHF). Among other things, Direct Preference Optimization (DPO) was used to specifically align the model with human preferences, helpful behaviour and controllable output quality.
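The DPO step mentioned above optimizes the model directly on preference pairs instead of training a separate reward model. A minimal sketch of the per-pair loss follows; beta and the log-probabilities are illustrative numbers, not values from any real run.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * margin), where the margin is
    how much more the policy (vs. the frozen reference model) prefers the
    chosen response over the rejected one, in log-probability terms."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Illustrative sequence log-probs: the policy has shifted toward the
# chosen response relative to the reference, so the loss drops below
# the zero-margin baseline of log(2).
loss = dpo_loss(pi_chosen=-12.0, pi_rejected=-15.0,
                ref_chosen=-13.0, ref_rejected=-14.0)
```

Minimizing this loss pushes the policy to assign relatively higher probability to preferred responses while the reference term keeps it anchored to the SFT model.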

Hardware requirements (inference)

  • GPU: Requires high-end GPU accelerators with significant VRAM.
    • The exact requirements depend on the quantization, batch size and context length.
    • FP16 weights require several hundred GB of VRAM (e.g. 8x NVIDIA H100 80GB or equivalent).
    • Tensor Parallelism (TP) is recommended for inference (e.g. tp=8 for SGLang).
  • RAM: Very high RAM requirement if the model is not fully loaded onto GPUs. Quantized versions (e.g. GGUF) running on CPU still need considerable memory (at least 128GB RAM, depending on quantization).
  • Note: This is a very large model that requires considerable computing resources to operate.
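The figures above follow from simple weight-size arithmetic. This is a rough estimate of the weights alone; KV cache, activations and framework overhead come on top.

```python
TOTAL_PARAMS = 235e9  # total parameter count from the spec above

def weight_memory_gib(bytes_per_param):
    """Memory needed just to hold the weights, in GiB."""
    return TOTAL_PARAMS * bytes_per_param / 1024**3

fp16 = weight_memory_gib(2.0)   # ~438 GiB: hence multi-GPU setups such as 8x 80GB
int4 = weight_memory_gib(0.5)   # ~109 GiB: still beyond any single consumer GPU
```

This also shows why tensor parallelism (e.g. tp=8) is the natural serving configuration: the FP16 weights alone exceed the VRAM of any single current accelerator.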
Versatile & powerful

Recommended applications for Qwen3-235B-A22B

Is Qwen3-235B-A22B the right AI model for your individual application? We will be happy to advise you comprehensively and personally.

  • Sophisticated multilingual assistants and dialog systems with deep understanding
  • Complex reasoning, mathematics, code generation and problem solving
  • Sophisticated agentic use cases with complex tool integration and function calling
  • Processing and analyzing long texts (with YaRN scaling)
  • Creative writing, role-playing and complex multi-turn dialogs
  • Research and development in the area of state-of-the-art LLMs and MoE architectures
Qwen3-235B-A22B

Strengths & weaknesses of the Qwen3-235B-A22B model

Strengths

Outstanding performance in reasoning, math and code generation.

Excellent adaptation to human preferences for natural conversations.

Leading skills in agentic use and tool calling.

Very strong multilingual support (over 100 languages).

Ability to process long contexts with YaRN (up to 131K tokens).

“Thinking Mode” for improved performance in complex tasks.

Fully open source under Apache 2.0 license (both code and model weights), allowing commercial use.

Part of a comprehensive family of models (Qwen3).

Weaknesses & limitations

Extremely high hardware requirements for inference, which are typically only available in professional environments or cloud infrastructures.

Complexity of the MoE architecture can make inference optimization more difficult.

Standard disadvantages of LLMs: potential for hallucinations, bias and lack of transparency.

Performance on shorter texts can potentially be affected if static YaRN is enabled for long contexts.

Energy consumption is considerable due to the size of the model and the hardware required.

Qwen3-235B-A22B: Top performance for your AI projects

Ready for open source AI in its strongest form?

Qwen3-235B-A22B provides you with one of the most powerful open-source language models in the world – ideal for complex applications, advanced assistance systems or large-scale research projects. We support you with selection, integration and hosting – whether locally, in your cloud or on our secure GPU infrastructure in Germany. Take advantage of our expert knowledge to implement your AI strategy – efficiently, securely and sustainably.

FAQ - Frequently asked questions

Worth knowing about Qwen3-235B-A22B

Can Qwen3-235B-A22B be run on a CPU?

Theoretically yes, with extreme quantization levels (e.g. via llama.cpp GGUF) and a lot of RAM (well over 128GB). However, performance would likely be insufficient for interactive use. This model is primarily designed for GPU-accelerated operation.

How much VRAM does the model require?

For FP16 inference, several GPUs with a total of hundreds of GB of VRAM are required (e.g. 8x NVIDIA H100 80GB, i.e. 640GB of VRAM). Exact numbers depend on the configuration and quantization. Even with 4-bit quantization, the requirement remains very high.

Is commercial use permitted?

Yes, both the code and the model weights of Qwen3-235B-A22B are published under the Apache 2.0 license, which allows commercial use.

How long a context does the model support?

The model natively supports 32K tokens. For longer contexts (up to 131K), the YaRN scaling method can be enabled in compatible frameworks (such as transformers, vLLM, SGLang, llama.cpp). Note the potential performance degradation on shorter texts when static YaRN is enabled.
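Enabling YaRN typically means adding a rope_scaling entry to the model configuration. The fragment below is a sketch of the values implied by the numbers above (32K native window scaled by a factor of 4); the field names follow the Hugging Face transformers convention and should be checked against your framework's documentation.

```python
# Hypothetical rope_scaling fragment for long-context inference.
rope_scaling = {
    "rope_type": "yarn",
    "factor": 4.0,  # 32,768 * 4 = 131,072 tokens
    "original_max_position_embeddings": 32768,
}

# The resulting maximum context length:
max_context = int(rope_scaling["factor"]
                  * rope_scaling["original_max_position_embeddings"])
```

Because static YaRN applies the same scaling regardless of input length, it is usually enabled only when long inputs are actually expected.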

Would you like individual advice?

Our AI experts are here for you!