The Model That Wants Your Mouse
On March 5, 2026, OpenAI released GPT-5.4 — and the headline feature was not about tokens, parameters, or training data. It was about a mouse cursor.
For the first time in commercial AI, a general-purpose language model ships with native computer-use capabilities. GPT-5.4 can see your screen, move your mouse, click buttons, type into text fields, navigate between applications, and chain together multi-step workflows across your operating system — all without custom scripting, browser extensions, or specialized wrappers.
GPT-5.4’s computer-use capabilities are available through the API and Codex, with developers passing a computer_use tool type to enable screen interaction. The model handles mouse movement, keyboard input, screenshot parsing, and application switching as first-class capabilities alongside text generation and reasoning.
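The exact request shape is not documented here, so the following is a minimal sketch assuming the OpenAI Python SDK’s Responses-style interface. Only the computer_use tool type comes from the announcement; the model string and display parameters are illustrative assumptions.

```python
# Minimal sketch of enabling screen interaction via the API.
# Only the "computer_use" tool type is sourced from the announcement;
# the model string and display parameters are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-5.4",                # assumed model identifier
    tools=[{
        "type": "computer_use",     # tool type described above
        "display_width": 1920,      # assumed screen-geometry parameters
        "display_height": 1080,
    }],
    input="Open last quarter's report and export it as a PDF.",
)

# The output is expected to carry structured actions (clicks, keystrokes)
# rather than plain text; executing them is sketched further below.
print(response.output)
```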
When a foundation model can operate a computer the way a human does, every piece of software that has a graphical interface becomes programmable through natural language. No API required. No integration work. Just tell the AI what you want done, and watch it navigate the screen.
What GPT-5.4 Actually Does
Architecture and Computer Use
GPT-5.4 follows GPT-5.0 (August 2025) and GPT-5.2 (December 2025) in the GPT-5 model family. The model supports a standard 272,000-token context window, with an experimental 1 million token context available through Codex and API configuration. OpenAI reports that individual claims are 33% less likely to be false compared to GPT-5.2, based on evaluation of de-identified user prompts.
The defining feature is integrated computer use. Rather than bolting screen-interaction capabilities onto existing models through external tools, GPT-5.4 processes screenshots as input and returns structured actions — mouse clicks, drags, scrolls, and keystrokes — as native outputs. Earlier computer-use systems relied on a pipeline approach: take a screenshot, feed it to a vision model, get a textual description, reason about the next step, and translate reasoning into action via an external controller. Each handoff introduced latency and error propagation. GPT-5.4 collapses much of that pipeline into a more integrated workflow.
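To make the contrast concrete, here is a minimal sketch of the collapsed loop: one model call per step takes a screenshot in and returns a structured action out. The helper functions are stubs standing in for real OS integration, and the Action shape is an assumption, not a documented schema.

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    kind: str                 # "click", "type", "scroll", "done", ...
    payload: dict = field(default_factory=dict)

def capture_screen() -> bytes:
    """Stub: return a PNG screenshot of the current display."""
    raise NotImplementedError

def ask_model(task: str, screenshot: bytes) -> Action:
    """Stub: send the task plus screenshot to the model, get one action."""
    raise NotImplementedError

def execute(action: Action) -> None:
    """Stub: dispatch the action to the OS (mouse, keyboard, windowing)."""
    raise NotImplementedError

def run_task(task: str, max_steps: int = 40) -> None:
    # One model call per step replaces the old screenshot -> vision model
    # -> planner -> controller pipeline described above.
    for _ in range(max_steps):
        action = ask_model(task, capture_screen())
        if action.kind == "done":
            return
        execute(action)
    raise TimeoutError("step budget exhausted before the task finished")
```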
Thinking and Pro Variants
GPT-5.4 ships in multiple tiers. The base model handles standard computer-use tasks. GPT-5.4 Thinking introduces an extended reasoning mode that plans multi-step sequences before executing them, trading latency for accuracy on complex workflows. GPT-5.4 Pro, available to ChatGPT Pro subscribers, unlocks additional capabilities for sustained sessions.
The Thinking variant is particularly relevant for enterprise deployments. When faced with a task like “find last quarter’s revenue figures in our Salesforce dashboard, compare them with the projections in Google Sheets, and draft a summary email,” GPT-5.4 Thinking constructs a step-by-step execution plan, verifies it against the current screen state, and executes with explicit checkpoints. If an application loads differently than expected, the model re-plans from the current state rather than blindly continuing.
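One plausible shape for that plan/verify/execute behavior, reusing the stubs from the loop sketch above. The checkpoint and re-planning logic here illustrates the pattern described, not OpenAI’s implementation.

```python
def make_plan(task: str, screenshot: bytes) -> list[Action]:
    """Stub: ask the Thinking variant for an ordered list of steps."""
    raise NotImplementedError

def matches_expectation(step: Action, screenshot: bytes) -> bool:
    """Stub: verify the screen still looks like the plan assumed."""
    raise NotImplementedError

def run_with_checkpoints(task: str) -> None:
    plan = make_plan(task, capture_screen())
    while plan:
        step = plan[0]
        if not matches_expectation(step, capture_screen()):
            # The application loaded differently than expected:
            # re-plan from the current screen state rather than press on.
            plan = make_plan(task, capture_screen())
            continue
        execute(step)
        plan = plan[1:]
```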
The Million-Token Context
The experimental 1 million token context window enables the model to maintain awareness of everything it has seen and done during extended sessions. Open multiple browser tabs, switch between applications, and scroll through long documents — GPT-5.4 retains the context. This is what makes sustained, multi-application workflows possible rather than isolated one-shot actions. The extended context counts against usage limits at 2x the normal rate for requests exceeding the standard 272K window.
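As a worked example of that accounting rule (the 272K threshold and 2x multiplier are stated above; the exact formula is an assumption): a 500,000-token request would count as 272,000 tokens at the normal rate plus 2 × 228,000 = 456,000 for the overage, or 728,000 tokens against the limit.

```python
STANDARD_WINDOW = 272_000   # tokens counted at the normal rate
OVERAGE_MULTIPLIER = 2      # stated 2x rate beyond the standard window

def tokens_against_limit(request_tokens: int) -> int:
    """One plausible reading of the usage rule; only the threshold and
    multiplier are sourced, the formula itself is an assumption."""
    overage = max(0, request_tokens - STANDARD_WINDOW)
    return min(request_tokens, STANDARD_WINDOW) + OVERAGE_MULTIPLIER * overage

# A 500K-token request: 272K + 2 * 228K = 728K counted against usage limits.
assert tokens_against_limit(500_000) == 728_000
```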
The Benchmarks That Changed the Conversation
OSWorld: 75% and Above Human Baseline
The benchmark that landed hardest was OSWorld, developed by researchers at Carnegie Mellon University and the University of Hong Kong. OSWorld tests AI systems on real computer tasks across multiple operating systems, with 369 tasks spanning file management, web browsing, document editing, email, spreadsheets, and multi-application coordination.
GPT-5.4 scored 75% on OSWorld — surpassing the human baseline of 72.36% established by the benchmark’s creators. This means GPT-5.4 successfully completed three out of every four real-world computer tasks presented to it, outperforming average non-expert human participants on the same task set.
To be precise about what this means: GPT-5.4 is more reliable at operating a computer through its graphical interface than the average person who participated in the benchmark study. Not better than expert power users or IT professionals, but more reliable than a typical office worker navigating unfamiliar software.
GDPval: 83% Professional Match Across 44 Occupations
The second major benchmark was GDPval — Generalized Digital Proficiency Validation — OpenAI’s evaluation framework that measures how well AI systems perform real-world knowledge work tasks. GDPval spans 44 occupations across 9 sectors, with tasks requesting real work products such as sales presentations, accounting spreadsheets, urgent care schedules, and manufacturing diagrams.
GPT-5.4 matched or exceeded industry professionals in 83% of comparisons across these 44 occupations — up from 70.9% for GPT-5.2. This does not mean GPT-5.4 can replace 44 professions. It means it can handle the routine, screen-based portions of those jobs — the parts involving established workflows, form-filling, data transfer between applications, and documented procedures. The creative, interpersonal, and deeply analytical components remain beyond current capabilities.
What Changes Now
For Software and Enterprise
Every SaaS company with a graphical interface just gained — or lost — an integration layer they did not build. GPT-5.4’s computer-use capability means any application with a screen can be automated through natural language, regardless of whether it offers an API.
This cuts both ways. Companies that invested in robust APIs now face competition from a model that can simply click through their UI. Conversely, legacy applications that never built APIs — ancient ERP systems, government portals, industry-specific tools — become automatable overnight.
Enterprise IT departments face a new category of access control challenge. When an AI agent can see your screen and operate your mouse, it inherits whatever access the logged-in user has across every visible application. OpenAI addresses this through a configurable permissions framework in the API, where developers can adjust the model’s safety profile and confirmation policies to match their application’s risk tolerance.
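The article does not specify the framework’s schema, but a hypothetical policy configuration might look like the following. Every field name and value here is an illustrative assumption; only the existence of adjustable safety profiles and confirmation policies is sourced.

```python
# Hypothetical policy object for an agent deployment; all field names
# and values below are assumptions, not a documented OpenAI schema.
AGENT_POLICY = {
    "safety_profile": "strict",                  # how conservatively to act
    "confirmation": {
        "destructive_actions": "always_ask",     # deletes, sends, payments
        "credential_fields": "never_autofill",   # password and 2FA inputs
        "unlisted_applications": "block",        # deny apps not allowlisted
    },
    "allowed_apps": ["salesforce", "google-sheets", "outlook"],
}
```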
For Workers
The GDPval results quantify something many knowledge workers have felt building: AI approaching the capability level needed to handle routine digital tasks. The 83% match across 44 occupations measures task completion on well-defined workflows. Real jobs involve ambiguity, context switching, interpersonal negotiation, and judgment calls that GPT-5.4 cannot replicate. But the “busy work” portion of many roles — hours spent navigating between applications, copying data, filling forms, following procedures — is now automatable in a qualitatively different way.
The Competitive Landscape
Claude Catches Up
Anthropic launched Claude Computer Use in beta in October 2024, accumulating roughly 17 months of real-world data before GPT-5.4’s release. That head start matters — Anthropic has built robust error-recovery systems from extensive production testing. Claude Opus 4.6 now scores 72.7% on OSWorld, just below the human baseline but trailing GPT-5.4’s 75%.
Notably, Anthropic expanded Claude’s computer-use capabilities to general availability in late March 2026, allowing users to message Claude a task from their phone and have the agent complete it on their computer — signaling that computer use is becoming a standard feature, not a differentiator.
Google’s Browser-First Approach
Google DeepMind has taken a different path with Project Mariner and the Gemini 2.5 Computer Use model. Rather than general-purpose desktop control, Google has focused on browser-based automation with deep Chrome and Google Workspace integration. The approach is more constrained but reliable within its scope — Project Mariner achieves 83.5% on the WebVoyager benchmark for web-specific tasks.
Open-Source Agents Narrowing the Gap
Open-source computer-use agents have made significant progress. OS-Symphony achieved 65.8% on OSWorld, while commercial agents built on open foundations have pushed even higher. The gap between open-source and frontier models has narrowed substantially from where it stood a year ago.
The Safety Question
Screen Data as Attack Surface
Computer-use agents introduce a novel security surface: everything visible on screen becomes potential input. This includes information users may not consciously register — notification pop-ups, background browser tabs, desktop file names. OpenAI has implemented configurable safety behaviors and confirmation policies, and GPT-5.4 is treated as “High cyber capability” under OpenAI’s Preparedness Framework with corresponding monitoring systems and access controls.
The Action Hallucination Problem
Text hallucinations are problematic. Action hallucinations are dangerous. When GPT-5.4 misidentifies a button, clicks the wrong element, or misreads screen text, the consequences are physical changes to real systems. A 25% failure rate on OSWorld means one in four tasks ends with an incorrect outcome. In high-stakes environments — financial systems, medical records, legal documents — that error rate necessitates human-in-the-loop supervision.
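In practice, that supervision can be as simple as a veto gate in front of the action dispatcher. The sketch below reuses the Action type and execute stub from the loop sketch earlier; the risk categories are illustrative assumptions.

```python
# Minimal human-in-the-loop gate: risky action kinds pause for explicit
# operator approval before reaching the OS. The category list is an
# illustrative assumption, not a vendor feature.
RISKY_KINDS = {"submit", "send", "delete", "purchase", "sign"}

def gated_execute(action: Action) -> None:
    if action.kind in RISKY_KINDS:
        answer = input(
            f"Agent wants to {action.kind} with {action.payload}. Proceed? [y/N] "
        )
        if answer.strip().lower() != "y":
            raise PermissionError("action vetoed by human operator")
    execute(action)
```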
What Comes Next
GPT-5.4 establishes computer use as a standard capability for frontier AI models rather than an experimental add-on. Companies like Induced AI, MultiOn, and Perplexity Computer are building vertical solutions on top of these capabilities, while Anthropic’s acquisition of computer-use startup Vercept signals how seriously the industry takes this direction.
The longer-term trajectory points toward AI agents that orchestrate entire digital workflows across dozens of tools, maintaining context and making judgment calls over sustained autonomous operation. GPT-5.4 is the first production-grade step on that path.
The mouse has changed hands.
Frequently Asked Questions
Can GPT-5.4 really use a computer better than a human?
On the OSWorld benchmark, GPT-5.4 scored 75%, which exceeds the human baseline of 72.36% on the same 369-task set. However, this benchmark measures performance on well-defined, isolated desktop tasks. Expert power users and IT professionals still outperform the model on complex, ambiguous tasks requiring judgment and improvisation. GPT-5.4 is more reliable than the average office worker at navigating unfamiliar software, but not better than someone who deeply knows their tools.
How does GPT-5.4 compare to Claude’s computer use?
Claude Computer Use launched in beta 17 months earlier (October 2024) and Claude Opus 4.6 scores 72.7% on OSWorld — close to the human baseline but below GPT-5.4’s 75%. Anthropic expanded to general availability in late March 2026. GPT-5.4 leads on raw benchmarks, while Claude benefits from longer production deployment experience and more mature error-recovery. Google takes a different approach with Project Mariner, focusing on browser-based automation with 83.5% on WebVoyager.
What jobs are most affected by GPT-5.4’s capabilities?
The GDPval benchmark shows GPT-5.4 matching professional performance in 83% of comparisons across 44 occupations spanning 9 sectors — from software developers and lawyers to nurses and mechanical engineers. Roles dominated by routine screen-based workflows face the most automation potential. However, GDPval measures isolated digital tasks, not full job roles. The creative, interpersonal, and judgment-heavy components of knowledge work remain beyond current AI capabilities.
Sources & Further Reading
- Introducing GPT-5.4 — OpenAI
- OpenAI launches GPT-5.4 with Pro and Thinking versions — TechCrunch
- OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks — arXiv
- GDPval: Measuring AI Performance on Real-World Tasks — OpenAI
- GPT-5.4 vs Claude Opus 4.6 for Agentic Tasks — DataCamp
- OpenAI’s GPT-5.4 Doubles Down on Safety — Help Net Security
- Introducing the Gemini 2.5 Computer Use Model — Google