Qwen3-30B-A3B

MoE model with high performance and low resource consumption

Focus on complex applications

The Qwen3-30B-A3B model at a glance

Qwen3-30B-A3B is the compact Mixture-of-Experts model in Alibaba Cloud’s Qwen3 series – designed for strong language understanding, high inference quality and complex enterprise-level applications at comparatively low compute cost. With 30.5 billion total parameters, of which only 3.3 billion are activated per token (hence “A3B”), the model combines state-of-the-art training methods, advanced Instruct capabilities and a very large context window.

Qwen3-30B-A3B is fully open-source, published under the Apache 2.0 license and is ideal for production-ready AI applications with the highest demands on quality, scalability and control.

Name:

Qwen3-30B-A3B (part of the Qwen3 model family)

Developer:

Qwen Team (Alibaba Group)

Publication:

April 29, 2025

License:

Apache 2.0 License (Open Source, commercial use permitted)

Model type:

Mixture-of-Experts (MoE) causal language model based on the Transformer architecture.

Parameters:

Total: 30.5 billion; activated per token: 3.3 billion; non-embedding: 29.9 billion

Tokenizer:

Qwen2 Tokenizer (tiktoken-based), vocabulary size: 151,936. Compatible with the current Hugging Face transformers library (a chat template is available for the Instruct/Chat variants).

Layers:

48 Transformer layers

Attention heads:

32 query heads, 4 key/value heads (Grouped-Query Attention, GQA).
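With GQA, the 32 query heads share only 4 key/value heads, so each KV head serves a group of 8 query heads. The toy NumPy sketch below (illustrative shapes only, not the model's actual implementation) shows the idea of expanding the KV heads by repetition so the attention scores can be computed per query head:

```python
import numpy as np

# Toy GQA shapes mirroring Qwen3-30B-A3B: 32 query heads, 4 KV heads.
n_q_heads, n_kv_heads, head_dim, seq = 32, 4, 8, 5
group_size = n_q_heads // n_kv_heads  # 8 query heads share each KV head

rng = np.random.default_rng(0)
q = rng.standard_normal((n_q_heads, seq, head_dim))
k = rng.standard_normal((n_kv_heads, seq, head_dim))

# Repeat each KV head so every query head can attend against its
# shared key head (real implementations may broadcast instead).
k_expanded = np.repeat(k, group_size, axis=0)  # (32, seq, head_dim)

scores = q @ k_expanded.transpose(0, 2, 1) / np.sqrt(head_dim)
```

The payoff is a KV cache that is 8x smaller than with full multi-head attention, which matters most at long context lengths.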

Experts (MoE):

Total number of experts: 128, activated experts per token: 8
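The sparse activation works through a learned router: for each token, a gating layer scores all 128 experts and only the top 8 are run. The following toy sketch (random weights, simplified softmax gating — not the model's actual routing code) illustrates the mechanism:

```python
import numpy as np

n_experts, top_k, d_model = 128, 8, 16
rng = np.random.default_rng(0)

router_w = rng.standard_normal((d_model, n_experts))
token = rng.standard_normal(d_model)

logits = token @ router_w                  # one routing score per expert
top_idx = np.argsort(logits)[-top_k:]      # indices of the 8 activated experts
gate = np.exp(logits[top_idx] - logits[top_idx].max())
gate /= gate.sum()                         # normalized gating weights

# Only these 8 expert FFNs would be evaluated; their outputs are
# combined as a weighted sum using `gate`.
```

Because only 8 of 128 experts run per token, the active parameter count (3.3B) stays far below the total (30.5B), which is where the inference-cost advantage comes from.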

Context length:

Native: 32,768 tokens (32K), with YaRN scaling: up to 131,072 tokens
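Extending from the native 32K to 131K context corresponds to a YaRN scaling factor of 4. A minimal sketch of the arithmetic, plus a config fragment in the shape of the Hugging Face `rope_scaling` convention (field names assumed from that convention, check your framework's documentation):

```python
native_ctx = 32_768
target_ctx = 131_072

# YaRN stretches the usable RoPE position range by this factor:
factor = target_ctx / native_ctx
print(factor)  # 4.0

# Config fragment following the Hugging Face rope_scaling convention:
rope_scaling = {
    "rope_type": "yarn",
    "factor": factor,
    "original_max_position_embeddings": native_ctx,
}
```

Note that static YaRN applies the same scaling to all requests, which is why short-text quality can degrade when it is enabled permanently (see the weaknesses section below).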

Variations of the Qwen3 series

The Qwen3 series includes various model sizes, both dense and MoE models:

  • Dense models: Qwen3-0.6B, Qwen3-1.7B, Qwen3-4B, Qwen3-8B, Qwen3-14B, Qwen3-32B
  • MoE models: Qwen3-30B-A3B, Qwen3-235B-A22B

Available variants include base models (“Base”), instruction-tuned models (“Instruct”) and chat models (“Chat”).

Special features of the Qwen3-30B-A3B model

"Thinking Mode" and "Non-Thinking Mode"

Qwen3-30B-A3B supports switching between a “Thinking Mode” and a “Non-Thinking Mode” (e.g. via the /think and /no_think soft switches or the enable_thinking parameter of the chat template). In Thinking Mode, the model reasons step by step before responding, which can improve performance on complex tasks such as tool usage and function calling. The Instruct/Chat variants are additionally fine-tuned for instruction following and conversation (SFT and RLHF/DPO).
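In Thinking Mode, the model emits its reasoning wrapped in &lt;think&gt;…&lt;/think&gt; tags before the visible answer. A small helper like the following can separate the two parts when post-processing raw generations (the sample string is illustrative, not a real model response):

```python
def split_thinking(text: str) -> tuple[str, str]:
    """Separate the <think>...</think> block from the final answer."""
    start, end = text.find("<think>"), text.find("</think>")
    if start == -1 or end == -1:
        return "", text.strip()  # no thinking block present
    thinking = text[start + len("<think>"):end].strip()
    answer = text[end + len("</think>"):].strip()
    return thinking, answer

# Illustrative output string (not an actual model response):
raw = "<think>The user asks for 2+2. That is 4.</think>2 + 2 = 4."
thought, answer = split_thinking(raw)
print(answer)  # 2 + 2 = 4.
```

Many inference frameworks already offer a reasoning parser that does this server-side; the sketch is useful when you consume raw completions.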

Multilingual support

Good support for over 100 languages and dialects, with strong multilingual instruction-following and translation abilities.

Agentic/Tools capability

Optimized for integration into agents and tool calling, especially in the Instruct variants (e.g. with Qwen-Agent).

Compatible inference frameworks

Hugging Face Transformers (>=4.51.0 for MoE models), SGLang (>=0.4.6.post1), vLLM (>=0.8.5), Ollama, LMStudio, MLX-LM, llama.cpp, KTransformers and others.
Individual AI consulting

Is Qwen3-30B-A3B the right model for you?

We would be happy to advise you individually on which AI model suits your requirements. Arrange a no-obligation initial consultation with our AI experts and exploit the full potential of AI for your project!

The post-training pipeline for Qwen3-30B-A3B

Training data & training process

Qwen3-30B-A3B, the smaller of the two MoE models in the Qwen3 series, was trained on a comprehensive, curated dataset of roughly 36 trillion tokens. The training data comes from a high-quality mix of publicly available sources, including web texts, program code, books and scientific papers. For maximum robustness, safety and model quality, a multi-stage data-preparation process was used that systematically filtered out irrelevant or risky content and merged the remaining data in a targeted, weighted manner.

In post-training, the model was first fine-tuned on a wide range of instruction data using supervised fine-tuning (SFT). This was followed by targeted refinement using reinforcement learning from human feedback (RLHF) – including direct preference optimization (DPO) – in order to align the model even more closely with human preferences, comprehensibility and usability in real-world applications.
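The DPO step mentioned above optimizes the policy to widen the log-probability margin of preferred answers relative to a frozen reference model. A minimal sketch of the per-pair loss (standard DPO formulation, not Qwen's training code):

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for a single preference pair.

    The policy is rewarded for increasing the log-probability margin of
    the chosen answer relative to the frozen reference model.
    """
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))  # -log sigmoid

# Policy prefers the chosen answer more than the reference does -> low loss.
low = dpo_loss(-1.0, -5.0, ref_chosen=-2.0, ref_rejected=-4.0)
# Policy prefers the rejected answer -> high loss.
high = dpo_loss(-5.0, -1.0, ref_chosen=-4.0, ref_rejected=-2.0)
print(low < high)  # True
```

The beta value here is an arbitrary illustration; in practice it is a tuned hyperparameter controlling how far the policy may drift from the reference.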

Hardware requirements (inference)

  • GPU: Requires powerful GPU accelerators.
    • For FP16 weights, approx. 60-70 GB VRAM is required (e.g. 1-2x NVIDIA A100/H100 80GB, or multiple consumer GPUs such as the RTX 4090 working in tandem, where supported).
    • Quantized versions (e.g. 4-bit via llama.cpp/GGUF) can significantly reduce VRAM requirements and enable operation on a single high-end consumer GPU (e.g. RTX 3090/4090 with 24GB VRAM) or powerful workstation GPUs, depending on the context length.
  • RAM: High RAM requirement if not fully loaded on GPUs or if CPU offloading is used. For CPU inference with quantization, at least 32-64 GB RAM is recommended, more for longer contexts.
  • Note: Although smaller than the 235B model, this model also requires significant computing resources.
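The VRAM figures above follow from simple weight-size arithmetic. A back-of-envelope estimator (the 4.5 bits/parameter figure is an assumption that includes typical quantization metadata overhead; KV cache and activations come on top):

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Rough memory for the weights alone (excludes KV cache, activations)."""
    return n_params * bits_per_param / 8 / 1e9

total_params = 30.5e9  # Qwen3-30B-A3B total parameter count

fp16 = weight_memory_gb(total_params, 16)   # ~61 GB
int4 = weight_memory_gb(total_params, 4.5)  # ~17 GB incl. quant overhead
```

Note that MoE sparsity reduces compute per token, not weight memory: all 30.5B parameters must still be resident (or offloaded), which is why the VRAM requirement tracks the total rather than the 3.3B active parameters.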
Versatile & powerful

Recommended applications for Qwen3-30B-A3B

Is Qwen3-30B-A3B the right AI model for your individual application? We will be happy to advise you comprehensively and personally.

  • Powerful multilingual assistants and dialog systems
  • Reasoning, mathematics and code generation
  • Advanced agentic use cases with tool integration
  • Processing and analyzing long texts (with YaRN scaling)
  • Creative writing and multi-turn dialogs
  • Research and development in the area of efficient MoE architectures
Qwen3-30B-A3B

Strengths & weaknesses of the Qwen3-30B-A3B model

Strengths

Significantly improved reasoning skills.

Excellent adaptation to human preferences for natural conversations.

Strong skills in agentic use and tool calling.

Very good multilingual support (over 100 languages).

Ability to process long contexts with YaRN (up to 131K tokens).

“Thinking Mode” for improved performance in complex tasks.

More efficient inference compared to dense models with a similar total number of parameters due to the MoE architecture (only 3.3B parameters active).

Fully open source under Apache 2.0 license (both code and model weights), allowing commercial use.

Part of a comprehensive family of models (Qwen3).

Weaknesses & limitations

Still requires significant hardware resources, although it is more efficient than a dense 30B model.

Complexity of MoE architecture can complicate inference optimization in some frameworks.

Standard disadvantages of LLMs: potential for hallucinations, bias and lack of transparency.

Performance on shorter texts can potentially be affected if static YaRN is enabled for long contexts.

Qwen3-30B-A3B: Efficient open-source performance

Ready for top-class AI?

Qwen3-30B-A3B combines strong language understanding with full control through open-source licensing. Whether for assistance systems, business-critical AI solutions or specialized research – we support you in the selection, integration and hosting of this powerful model. Fully managed in our German GPU cloud on request.

FAQ - Frequently asked questions

Worth knowing about Qwen3-30B-A3B

Can Qwen3-30B-A3B run on CPU only?

Yes, with strong quantization (e.g. via llama.cpp GGUF) and sufficient RAM (at least 32-64 GB recommended), CPU inference is possible, but speed will likely be limited for interactive applications. GPU acceleration is recommended for better performance.

How much VRAM does the model require?

For FP16 inference, approx. 60-70 GB. With 4-bit quantization, the requirement can be reduced to approx. 15-20 GB VRAM, which enables operation on high-end consumer GPUs. Exact figures depend on the configuration.

Is commercial use permitted?

Yes, both the code and the model weights of Qwen3-30B-A3B are published under the Apache 2.0 license, which allows commercial use.

How long can the context be?

The model natively supports 32K tokens. For longer contexts (up to 131K), the YaRN scaling method can be enabled in compatible frameworks. Please note the information on potential performance degradation for shorter texts when using static YaRN.

Would you like individual advice?

Our AI experts are here for you!