Overview
OpenAI has launched ChatGPT Agent, an AI tool that automates multi-step tasks, integrates with external apps, and interacts with websites. It marks a notable step in agentic AI: it completes real-world tasks, ships with a layered set of safety measures, and is being rolled out gradually.
Key Features and Capabilities
- ChatGPT Agent automates complex tasks using its own virtual computer, separate from the user's device.
- It can browse the web, call APIs, edit spreadsheets, and generate editable slide decks.
- The agent combines OpenAI's earlier Operator and Deep Research agents and adds new tools such as terminal access and connectors to apps.
- Users can prompt the agent for multi-step tasks, such as competitor analysis or online shopping automation.
- The system is interactive; users can pause, adjust, or cancel tasks and request progress updates.
- ChatGPT Agent connects to services like Gmail, Google Calendar, and GitHub after secure authentication.
- The mobile app sends a notification when a task completes.
Benchmark and Performance Results
- Achieved 41.6% on the “Humanity’s Last Exam” benchmark, nearly double the score of previous OpenAI models.
- Scored 44.4% with parallel attempts on the same exam.
- Outperformed prior models on the FrontierMath benchmark, scoring 27.4% with tool access versus 6.3% for the previous best.
- Matched or exceeded human outputs in about half of real-world business task evaluations.
- Scored 45.54% on spreadsheet tasks when editing .xls files directly, more than double the score of Copilot in Excel, though humans scored 71.33%.
- Outperformed human baselines on DSBench data science tasks.
- Set a new record on the BrowseComp web navigation benchmark (68.9%) and outperformed previous agents on WebArena.
Technical Architecture
- Runs several environments (a visual browser, a text-based browser, and a terminal) on one virtual machine, so context persists across them.
- Can run scripts, manipulate files, and display outputs without losing track of progress.
- Employs connectors for secure API access across user apps and services.
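The idea of several tools sharing one persistent context can be sketched roughly as follows. This is an illustration only, not OpenAI's implementation: the `Workspace` class and the `text_browser` and `terminal` functions are hypothetical names, and the network fetch and shell command are replaced with stand-ins.

```python
from dataclasses import dataclass, field


@dataclass
class Workspace:
    """Hypothetical shared state that every tool reads from and writes to."""
    files: dict[str, str] = field(default_factory=dict)
    history: list[str] = field(default_factory=list)


def text_browser(ws: Workspace, url: str) -> None:
    # Stand-in for a real fetch: store a placeholder page body, no network access.
    ws.files["page.txt"] = f"contents of {url}"
    ws.history.append(f"browser: fetched {url}")


def terminal(ws: Workspace, src: str, dst: str) -> None:
    # Stand-in for a shell command: transform a file left behind by an earlier step.
    ws.files[dst] = ws.files[src].upper()
    ws.history.append(f"terminal: wrote {dst}")


def run_demo() -> Workspace:
    ws = Workspace()
    text_browser(ws, "https://example.com")
    terminal(ws, "page.txt", "report.txt")
    return ws
```

Because both steps operate on the same `Workspace`, the terminal step can pick up the file the browser step produced, and `history` records progress across tools.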
Safety, Privacy, and Limitations
- OpenAI classifies the agent as high-risk for biological and chemical misuse and has implemented a comprehensive safety stack in response.
- All prompts are classified and filtered for harmful content, and sessions are monitored in real time. The agent's memory feature is disabled.
- Designed to resist prompt injection threats from malicious web content.
- Requires explicit user permissions for consequential actions and offers watch mode for supervised tasks.
- Provides privacy features: data deletion, active session logout, cookie management, and private browser logins.
- Known limitations include beta-level slide deck creation and lack of custom slide uploads.
- Next version aims for improved formatting and broader capabilities.
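The permission requirement for consequential actions can be pictured as a confirmation gate in front of each action. The sketch below is a minimal illustration under stated assumptions: the `CONSEQUENTIAL` set, the action names, and the `run_action` function are hypothetical and do not describe how ChatGPT Agent is actually built.

```python
from typing import Callable

# Hypothetical set of action names treated as consequential.
CONSEQUENTIAL = {"send_email", "make_purchase", "delete_file"}


def run_action(
    name: str,
    action: Callable[[], str],
    confirm: Callable[[str], bool],
) -> str:
    """Run `action`, asking `confirm` first when the action is consequential."""
    if name in CONSEQUENTIAL and not confirm(name):
        # The user declined, so the action is never executed.
        return f"{name}: skipped (user declined)"
    return action()
```

In an interactive setting, `confirm` might prompt the user and return their answer; low-risk actions such as reading a page bypass the prompt entirely.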
Rollout and Access
- Pro users have access now with 400 messages/month; Team users get 40 messages/month, with additional usage available through paid credits.
- Enterprise and education rollout in coming weeks; European Economic Area and Switzerland access delayed.
- Web developers are advised to use structured content to facilitate AI agent interactions.
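The summary does not say which form of structured content is meant. One common option is schema.org markup embedded as JSON-LD, sketched below as an assumption; the `product_json_ld` helper and the sample values are hypothetical.

```python
import json


def product_json_ld(name: str, price: str, currency: str) -> str:
    """Build a JSON-LD <script> tag describing a product with schema.org terms."""
    data = {
        "@context": "https://schema.org",
        "@type": "Product",
        "name": name,
        "offers": {
            "@type": "Offer",
            "price": price,
            "priceCurrency": currency,
        },
    }
    # The tag is meant to be placed in the page's HTML so software can read it.
    return '<script type="application/ld+json">' + json.dumps(data) + "</script>"


if __name__ == "__main__":
    print(product_json_ld("Trail Shoe", "89.00", "USD"))
```

Markup like this gives an agent the product name and price in a machine-readable form, so it does not have to infer them from page layout.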
Impact and Future Outlook
- ChatGPT Agent is positioned to change online productivity for tasks like trip planning, business analysis, and spreadsheet editing.
- Structured website content is increasingly important for AI agent efficiency.
- AI is already enabling new income streams and workflows for regular users.