Meta’s Answer to the Closed-Model Lead
When OpenAI, Google and Anthropic sprinted ahead with GPT-5, Gemini 3.1 Pro, and Claude Opus 4.6, the open-weight community was left with strong mid-tier options but no true frontier peer. Meta’s April 2025 release of the Llama 4 herd — Scout, Maverick, and the previewed 2-trillion-parameter Behemoth — was designed to close that gap. A year later, Llama 4 Maverick remains the single most capable open-weight model an enterprise can legally deploy on its own infrastructure.
Maverick ships with 400 billion total parameters, of which only 17 billion are active per token thanks to a native Mixture-of-Experts (MoE) architecture with 128 experts. That design is the key economic unlock: the model has the knowledge capacity of a 400B dense model while costing roughly as much to run as a 17B one. It was pretrained on approximately 22 trillion tokens of multimodal data spanning text, images and video.
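The economics of MoE come from the router: for each token, a gating network scores all experts but only a handful are actually evaluated. The sketch below is a generic top-k gate in plain Python, purely illustrative — Llama 4's actual routing scheme differs in its details, and the expert count and k here are just stand-in numbers.

```python
import math

def route(logits, k=2):
    """Pick the top-k experts for one token and renormalize their
    softmax weights. Illustrative only: with 128 experts and k=2,
    a token touches under 2% of expert weights, which is why active
    parameters stay far below the total count."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(logits)), key=lambda i: -probs[i])[:k]
    z = sum(probs[i] for i in top)
    return [(i, probs[i] / z) for i in top]

# 128 hypothetical router logits; only the 2 best experts fire.
weights = route([0.1 * i for i in range(128)], k=2)
```

The returned `(expert_index, weight)` pairs are what the forward pass uses to combine the chosen experts' outputs; the remaining 126 experts are never computed for that token.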
Context Windows: Maverick at 1M, Scout at 10M
A common point of confusion: Llama 4 Maverick supports a 1 million token context window — wide enough for entire codebases and long document analysis — while its smaller sibling Llama 4 Scout (109B total / 17B active / 16 experts) pushes to a 10 million token context window, the largest of any publicly available model. Scout fits on a single H100 GPU and is the practical choice when the workload is long-context rather than heavy reasoning.
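A quick way to reason about which sibling a workload needs is to estimate whether the corpus fits in the window at all. The helper below uses the common ~4-characters-per-token heuristic for English text (an assumption — real tokenizer counts vary by content) and reserves 10% of the window for the prompt and the model's response.

```python
def fits_in_context(num_chars: int, window_tokens: int,
                    chars_per_token: float = 4.0,
                    reserve: float = 0.10) -> bool:
    """Rough check that a text corpus fits in a context window.
    chars_per_token ~4 is a heuristic for English prose; code and
    non-Latin scripts tokenize differently."""
    est_tokens = num_chars / chars_per_token
    return est_tokens <= window_tokens * (1 - reserve)

MAVERICK_WINDOW = 1_000_000   # 1M tokens
SCOUT_WINDOW = 10_000_000     # 10M tokens

# A ~12M-character document set (~3M tokens) overflows Maverick's
# window but fits comfortably inside Scout's.
corpus_chars = 12_000_000
```

When the corpus clears Maverick's window too, the choice reverts to the reasoning-vs-cost trade-off rather than context length.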
The division of labor is intentional. Maverick is Meta’s reasoning and coding heavyweight, designed to compete with GPT-4o and Claude 3.7 Sonnet. Scout is the long-context workhorse, a drop-in choice for retrieval-heavy pipelines, enterprise knowledge bases, and multi-day conversation threads. The upcoming Behemoth (2T total, 288B active) is the teacher model distilled into both — still in preview at the time of writing.
Benchmark Performance vs. Closed Peers
Maverick debuted at an Elo of 1,417 on Chatbot Arena, outperforming GPT-4o and trading blows with Claude 3.7 Sonnet and Gemini 2.0 Pro on STEM benchmarks including MATH-500 and GPQA Diamond. Independent evaluation on Artificial Analysis confirms it is the strongest open-weight model on reasoning and the best open-weight choice for multimodal tasks with visual inputs.
Where it lags: coding benchmarks. Despite being roughly 13x larger in total parameter count than rivals like Gemma 4 31B, Maverick underperforms on agentic coding and tool-use evaluations, which has pushed many developer-focused buyers to dual-deploy with a smaller specialist model.
For enterprises, the comparison worth running is Maverick against Google’s Gemma 4 family and Qwen 3.5 — the other two serious open-weight options in 2026. Gemma 4 31B ranks #3 on LMArena overall, scores 85.2% on MMLU Pro, and activates only 3.8B parameters per token. For most development workloads the smaller Gemma 4 or Qwen 3.5 will be faster and cheaper to host.
The License Question — and Why It Matters
This is where the open-weight landscape gets complicated. Llama 4 Maverick is not Apache 2.0 licensed — it ships under Meta’s Llama 4 Community License, which carries two consequential restrictions:
- 700M MAU clause. Any service with more than 700 million monthly active users must obtain separate written permission from Meta before using the model commercially. Effectively, this carves Amazon, Microsoft, Google, ByteDance and a handful of others out of the license by default.
- Distillation prohibition. Outputs from Llama 4 models cannot be used to train or improve models that would compete with Meta’s. This is the clause that matters to foundation model startups and enterprises considering their own distilled variants.
By contrast, Gemma 4 uses Apache 2.0 — no MAU cap, no distillation restriction. GLM-5.1 uses the even more permissive MIT license. For regulated enterprises and government buyers in Europe, the Middle East and North Africa, Gemma 4 and GLM-5.1 have become the preferred open-weight choices specifically because the license terms are auditor-friendly.
Hardware and Deployment Realities
Running Maverick in production is a different exercise from running Scout. The 400B total parameter count means the weights alone occupy roughly 800 GB in FP16, putting inference squarely in multi-GPU territory — typically 8x H100 or 4x H200 nodes for production throughput. NVIDIA has published optimization work specifically targeting Llama 4 Scout and Maverick with TensorRT-LLM kernels that materially improve throughput, and the Hugging Face release ships with vLLM support.
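The memory math is simple enough to check on the back of an envelope. The sketch below assumes 2 bytes per parameter for FP16/BF16 and an even split of weights under tensor parallelism; it deliberately ignores KV cache and activation memory, which add substantially on top, and the FP8 figure assumes a 1-byte-per-parameter quantized checkpoint.

```python
def weight_footprint_gb(total_params: float, bytes_per_param: float) -> float:
    """Raw weight memory in GB. Ignores KV cache and activations,
    which consume significant additional GPU memory at long context."""
    return total_params * bytes_per_param / 1e9

MAVERICK_PARAMS = 400e9

fp16_gb = weight_footprint_gb(MAVERICK_PARAMS, 2.0)  # ~800 GB
fp8_gb = weight_footprint_gb(MAVERICK_PARAMS, 1.0)   # ~400 GB

# Per-GPU weight share under 8-way tensor parallelism: an FP8
# checkpoint leaves KV-cache headroom on 80 GB H100s; FP16 does not.
fp8_per_gpu = fp8_gb / 8   # ~50 GB per GPU
```

This is why quantized serving is the practical default for Maverick on 8x H100 nodes, while Scout's 109B total parameters fit on a single card.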
Cost-to-serve on self-hosted hardware lands near $0.50 per million input tokens at steady-state utilization on an 8xH100 node, which is competitive with GPT-4o-mini API pricing but considerably more than Gemma 4 27B self-hosted. For organizations with existing GPU capacity and compliance requirements that demand on-premise inference, Maverick pays off. For purely economic deployments, cheaper options win.
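The cost-to-serve figure falls out of two inputs: what the node costs per hour and how many tokens it processes in that hour. The node price and throughput below are illustrative assumptions, not figures from any vendor, chosen only to show how a number in the $0.50-per-million range arises.

```python
def cost_per_million_tokens(node_usd_per_hour: float,
                            throughput_tokens_per_sec: float) -> float:
    """Steady-state $ per 1M tokens from node price and aggregate
    throughput. Assumes sustained high utilization; idle capacity
    raises the effective rate."""
    tokens_per_hour = throughput_tokens_per_sec * 3600
    return node_usd_per_hour / tokens_per_hour * 1e6

# Assumed: an 8x H100 node amortized at ~$16/hour pushing ~9,000
# aggregate input tokens/sec at high utilization.
cost = cost_per_million_tokens(16.0, 9000)   # roughly $0.49 / 1M tokens
```

Plugging in your own node pricing and measured throughput is the fastest way to sanity-check whether self-hosting beats API pricing for your traffic profile.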
Enterprise Implications
- Sovereign AI just got real. Countries and regulated sectors that demand model weights inside national borders now have a 400B-class reasoning model and a 10M-context long-document model they can legally self-host. Expect procurement activity from defense, healthcare and finance.
- RAG pipelines get a rewrite. Scout’s 10M context eliminates much of the need for complex retrieval for mid-sized corpora. A 10M window holds roughly 7,500 pages of text — enough for most companies’ legal, policy, or product knowledge base to fit in a single prompt.
- Watch the licensing fine print. The 700M MAU clause is a landmine for high-traffic consumer applications. If your product has any path to meaningful scale, Gemma 4 or Qwen 3.5 may be the safer long-term bet.
- The Behemoth is coming. Meta’s previewed 2T-parameter teacher model, if released openly, would upend the balance of power between closed and open model labs. Its license terms will be the most-watched announcement of the second half of 2026.
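The RAG-vs-long-context call can be reduced to a corpus-size check. The helper below uses a tokens-per-page figure of about 1,300 (an assumption consistent with the rough 7,500-pages-in-10M-tokens estimate above; dense legal or policy pages run higher than typical prose) and leaves 10% of the window as headroom for the prompt and answer.

```python
SCOUT_WINDOW = 10_000_000  # Scout's 10M-token context window

def retrieval_strategy(corpus_pages: int,
                       tokens_per_page: int = 1_300,
                       window: int = SCOUT_WINDOW) -> str:
    """Return 'stuff-context' when the whole corpus fits in the
    window with ~10% headroom, else 'rag'. tokens_per_page is an
    assumed average; measure it on your own documents."""
    est_tokens = corpus_pages * tokens_per_page
    return "stuff-context" if est_tokens <= window * 0.9 else "rag"

print(retrieval_strategy(5_000))    # mid-sized knowledge base
print(retrieval_strategy(200_000))  # large enterprise corpus
```

Freshness requirements can still override the size check: a corpus that fits but changes hourly may be cheaper to serve through retrieval than through repeated 10M-token prompts.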
The Bigger Picture
For the first time since Llama 2 arrived in 2023, the open-weight ecosystem has a genuine frontier-class reasoning model and a record-setting long-context model, published together. That gives enterprises a real second source against every closed vendor — a negotiating lever that was missing from the 2024-2025 procurement cycle.
The irony is that Meta’s own restrictive license ensures the ecosystem’s center of gravity is shifting toward truly-open alternatives. Gemma 4 under Apache 2.0, Qwen 3.5 under Apache 2.0, and GLM-5.1 under MIT are absorbing the demand Meta’s terms exclude. Llama 4 Maverick may be the single most capable open-weight model of 2026. But it is increasingly not the one most developers actually deploy.
Frequently Asked Questions
What is the difference between Llama 4 Maverick and Scout?
Maverick is the reasoning and coding heavyweight — 400B total parameters, 17B active, 128 experts, 1M-token context, designed to compete with GPT-4o and Claude Sonnet. Scout is the long-context workhorse — 109B total parameters, 17B active, 16 experts, and a record 10M-token context window. Scout fits on a single H100 GPU; Maverick needs 8x H100 or 4x H200 nodes for production throughput.
Can I use Llama 4 Maverick commercially without paying Meta?
Yes, but with restrictions. The Llama 4 Community License allows commercial use below 700 million monthly active users and prohibits using model outputs to train competing models. For a typical enterprise, neither clause is blocking. For a startup whose product could scale past 700M MAU or for a foundation model lab, the clauses matter — and Gemma 4 (Apache 2.0) or GLM-5.1 (MIT) are safer long-term bets.
Does the 10M-token context window replace my RAG pipeline?
For mid-sized corpora, often yes. A 10M window holds roughly 7,500 pages of text — enough for most companies’ complete legal, policy, or product knowledge base to fit in a single prompt. For larger enterprise document sets (hundreds of thousands of pages) or workloads with strict freshness requirements, RAG still wins on cost and latency. Scout’s 10M context is best used as a “drop-in simplifier” for medium-complexity retrieval problems.