A Release Aimed at Agents, Not Chat
Anthropic shipped Claude Opus 4.7 on April 16, 2026, roughly two months after Opus 4.6. The headline framing was explicit: this is a model optimized for long-running agent workflows, not chat. The company’s positioning language — “work that previously needed close supervision can now be handed off with confidence” — is aimed squarely at the enterprise agent market that OpenAI, Google, and Anthropic are all now fighting over.
Pricing stays at $5 per million input tokens and $25 per million output tokens, unchanged from Opus 4.6. That stability matters: enterprise procurement teams care about pricing predictability, and holding the line while shipping measurable capability gains is the kind of move that keeps large contracts from slipping.
The Benchmark Picture
On the benchmarks that matter most for agent workflows, Opus 4.7 narrowly retakes the top spot for generally available frontier models.
- SWE-bench Verified: 87.6% — a jump from Opus 4.6’s 80.8% and ahead of Gemini 3.1 Pro at 80.6%
- SWE-bench Pro (the harder multi-language variant): 64.3% — leading GPT-5.4 at 57.7% and Gemini 3.1 Pro at 54.2%
- OSWorld-Verified (computer-use agent benchmark): 78.0%, up from 72.7% in Opus 4.6 and ahead of GPT-5.4 at 75.0%
- GPQA Diamond (graduate-level reasoning): 94.2%, effectively tied with Gemini 3.1 Pro (94.3%) and GPT-5.4 Pro (94.4%) — this benchmark is approaching saturation at the frontier
- Multi-step agentic reasoning: a reported 14% improvement over Opus 4.6, with roughly one-third the tool-use error rate
The one area where Opus 4.7 visibly trails: BrowseComp (open-web research) dropped from 83.7% on Opus 4.6 to 79.3%, behind Gemini 3.1 Pro at 85.9% and GPT-5.4 Pro at 89.3%. For agent workflows that lean heavily on open-web research (deep research, competitive monitoring), Gemini or GPT may still be the stronger pick.
What “Long-Running” Actually Means
Anthropic’s long-running-agent pitch rests on three capability claims, each of which maps to a measurable product outcome.
Loop resistance. Older agent models often degenerate into repetitive actions when they encounter ambiguity or a tool error. Opus 4.7 reportedly reduces this failure mode, which is what lets an agent continue a multi-hour task instead of stalling and burning tokens in a loop.
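Model-side loop resistance aside, this failure mode is cheap to instrument in the harness itself. A minimal caller-side sketch (the function, window size, and threshold are illustrative, not from Anthropic) that flags a run when the same tool call recurs within a short window:

```python
from collections import deque

def is_looping(action_history, window=6, threshold=3):
    """Flag a run when the same (tool, args) pair recurs `threshold`
    times within the last `window` actions."""
    recent = deque(action_history, maxlen=window)
    counts = {}
    for action in recent:
        counts[action] = counts.get(action, 0) + 1
    return any(c >= threshold for c in counts.values())

# A healthy trace varies its actions; a stalled one repeats the same call.
healthy = [("search", "q1"), ("read", "f1"), ("edit", "f1"), ("test", "")]
stalled = [("search", "q1")] * 3 + [("read", "f1")]
```

A check like this costs nothing per step and lets an operator kill or redirect a run long before it burns a multi-hour token budget.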
Error recovery. When a tool call fails or returns an unexpected output, the model’s behavior determines whether the task fails outright or re-routes around the obstacle. The reported two-thirds reduction in tool-use errors directly raises the probability that a long sequence completes.
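Anthropic’s claim is about model behavior, but the same principle applies to the harness around the model. A sketch of a caller-side recovery wrapper (names hypothetical) that retries a flaky tool call with backoff, then surfaces the final error as an observation the agent can route around instead of aborting the run:

```python
import time

def call_with_recovery(tool, *args, retries=2, backoff=0.5):
    """Retry a flaky tool call; on final failure, return the error as
    an observation for the agent instead of raising."""
    for attempt in range(retries + 1):
        try:
            return {"ok": True, "result": tool(*args)}
        except Exception as exc:
            if attempt == retries:
                return {"ok": False, "error": str(exc)}
            time.sleep(backoff * (2 ** attempt))  # exponential backoff
```

Returning the failure as structured context, rather than crashing, is what gives a capable model the chance to re-plan mid-task.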
Vision at higher resolution. Opus 4.7 supports images up to 2,576 pixels on the long edge — more than triple the previous limit. For computer-use agents that parse full screen captures, this translates into better UI element detection and fewer transcription errors, and it helps explain the jump on OSWorld-Verified from 72.7% to 78.0%.
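The 2,576-pixel limit matters in practice when feeding screenshots: anything larger must be downscaled before upload, and the less you downscale, the more UI detail survives. The limit is from the release above; the helper itself is illustrative:

```python
MAX_LONG_EDGE = 2576  # Opus 4.7 long-edge limit per the release

def fit_to_limit(width, height, max_edge=MAX_LONG_EDGE):
    """Return (w, h) scaled so the long edge fits the model's limit,
    preserving aspect ratio; unchanged if already within bounds."""
    long_edge = max(width, height)
    if long_edge <= max_edge:
        return width, height
    scale = max_edge / long_edge
    return round(width * scale), round(height * scale)
```

A 4K capture (3840×2160) now only needs to shrink to 2576×1449, rather than being crushed far below the point where small UI text is legible.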
The combination is why Anthropic describes Opus 4.7 as a model that can “work coherently for hours” — not because any single capability is transformative, but because the compound error rate across a long agent chain is now noticeably lower.
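The compound-error point is worth making concrete: small per-step gains multiply across a long chain. Under a simplifying assumption of independent steps, an agent with per-step success probability p completes an n-step task with probability p**n (the numbers below are illustrative, not Anthropic’s):

```python
def chain_success(p_step, n_steps):
    """P(all n steps succeed), assuming independent steps."""
    return p_step ** n_steps

# Cutting the per-step error rate from 2% to 1% roughly triples
# the odds of finishing a 100-step task.
old = chain_success(0.98, 100)  # ~0.13
new = chain_success(0.99, 100)  # ~0.37
```

This is why modest per-step reliability gains read as qualitative jumps at agent-workflow length.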
New Controls: xhigh, Task Budgets, Code Review
Three operational features shipped alongside the model and matter for enterprise buyers.
First, Anthropic introduced an “xhigh” effort level that sits between the existing “high” and “max” settings — a finer-grained lever on the cost-vs-accuracy trade-off for hard problems. Teams that previously had to choose between a “high” setting that sometimes fell short and a “max” setting that blew through budgets now have a middle option.
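Anthropic’s exact API shape for effort levels isn’t covered above, so treat this as a hypothetical sketch of how a harness might pick a rung on the ladder per task; the thresholds and function are illustrative, not Anthropic’s:

```python
# Hypothetical effort ladder; "xhigh" sits between "high" and "max"
# per the release. Thresholds are illustrative only.
def pick_effort(difficulty: float, max_affordable: bool) -> str:
    """Escalate effort with task difficulty (0-1), but stop at
    'xhigh' when the budget rules out 'max'."""
    if difficulty < 0.3:
        return "medium"
    if difficulty < 0.7:
        return "high"
    return "max" if max_affordable else "xhigh"
```

The point of the middle rung is exactly this branch: hard tasks no longer force a binary choice between underpowered and over-budget.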
Second, task budgets let operators cap the reasoning and tool-call spend per agent run. This is a direct response to a common failure mode in production agents: a single runaway task silently consumes thousands of dollars in tokens before anyone notices.
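Whether or not a team adopts the native controls, the same guardrail can live in the harness. A caller-side sketch (class and method names hypothetical; the per-token rates are the Opus pricing quoted above) that hard-stops a run once cumulative spend crosses a cap:

```python
class BudgetExceeded(Exception):
    pass

class TaskBudget:
    """Track cumulative spend per agent run and hard-stop at the cap."""
    def __init__(self, max_usd):
        self.max_usd = max_usd
        self.spent = 0.0

    def charge(self, input_tokens, output_tokens,
               in_rate=5.0, out_rate=25.0):  # $/M tokens, per the article
        self.spent += (input_tokens * in_rate
                       + output_tokens * out_rate) / 1e6
        if self.spent > self.max_usd:
            raise BudgetExceeded(
                f"spent ${self.spent:.2f} of ${self.max_usd:.2f}")
```

Calling `charge()` after every model turn turns the silent-runaway failure mode into a loud, immediate stop.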
Third, Anthropic bundled new Claude Code review tools aimed at reviewing pull requests generated by AI agents — a workflow that has become central to engineering teams using Claude Code in production.
The Competitive Frame
The timing of Opus 4.7 is not accidental. OpenAI’s Frontier enterprise platform (launched February 2026) and Google’s A2A protocol plus Workspace Studio (announced at Google Cloud Next 2026) both arrived in the same quarter. All three providers are now pitching the same thesis: AI’s next revenue phase is long-horizon, multi-tool, multi-agent workflows — not chat turns.
Anthropic’s advantage in this frame is credibility on agent reliability. Opus 4.6 had already established Claude as the default model for coding agents and computer-use workflows in many enterprise stacks, and 4.7 extends that lead on the benchmarks that map most directly to those use cases. Its disadvantage is distribution at scale: OpenAI and Google have larger enterprise sales motions and tighter integration with existing productivity suites, and Anthropic’s enterprise growth still depends heavily on partner channels like AWS Bedrock, Google Vertex AI, and Microsoft Foundry — all of which carry Opus 4.7 from day one.
For enterprise architects mapping a 2026 model strategy, the practical implication is that “which model is best” is increasingly workflow-specific. Long-horizon coding, computer-use automation, and agentic SaaS back-office tasks now favor Opus 4.7. Open-web research and very large context windows may still favor Gemini 3.1 Pro. High-concurrency consumer-facing deployments with tight latency budgets may favor GPT-5.4. The single-vendor bet is harder to defend than it was a year ago.
Frequently Asked Questions
What is Claude Opus 4.7 optimized for?
Long-running agent workflows — multi-hour, multi-tool, multi-step tasks such as software engineering agents and computer-use automation. Anthropic’s claim is that Opus 4.7 resists looping, recovers from tool errors more reliably, and can “work coherently for hours” on sustained problems.
How does Opus 4.7 compare to GPT-5.4 and Gemini 3.1 Pro?
On SWE-bench Pro, Opus 4.7 scores 64.3% vs GPT-5.4 at 57.7% and Gemini 3.1 Pro at 54.2%. On OSWorld-Verified (computer use), Opus 4.7 reaches 78.0% vs GPT-5.4’s 75.0%. Reasoning benchmarks like GPQA Diamond are effectively tied across all three. On open-web research (BrowseComp), Opus 4.7 trails both competitors.
What should enterprise teams do next?
Run side-by-side evaluations on the specific agent workflows that drive the most cost or reliability pain, use the new task-budget controls to cap runaway spend, and treat “best model” as workflow-specific rather than vendor-specific. Opus 4.7 is available today via the Anthropic API, AWS Bedrock, Google Vertex AI, and Microsoft Foundry.
Sources & Further Reading
- Introducing Claude Opus 4.7 — Anthropic
- Anthropic releases Claude Opus 4.7, narrowly retaking lead for most powerful generally available LLM — VentureBeat
- Claude Opus 4.7 leads on SWE-bench and agentic reasoning — The Next Web
- Claude Opus 4.7 Benchmarks Explained — Vellum AI
- Anthropic releases Claude Opus 4.7, a less risky model than Mythos — CNBC