Agents are crossing from demos to real work
For the first two years of the modern AI era, AI was something you talked to: a chatbot answering questions, a copilot offering suggestions. The story of 2025 and 2026 is AI starting to do the work itself: writing code, handling support tickets, processing legal documents, and running multi-step tasks end-to-end without a human in the loop. Whether "agents" actually deliver economic value at scale, or remain expensive demos, is the central question of the next two years.
Timeline
- November 6, 2023
OpenAI launches the Assistants API and "GPTs" — the first serious attempt at packaging an AI as a stateful task-doer rather than a chatbot. Adoption is mostly enthusiast experimentation.
- March 12, 2024
Cognition Labs launches Devin, marketed as the first "AI software engineer." The launch video shows Devin completing real GitHub tasks unattended; reactions split between impressed and skeptical.
- October 22, 2024
Anthropic ships "computer use" — Claude can now control a desktop, click buttons, and fill forms. The capability is rough but signals where the frontier is heading.
- February 11, 2025
Klarna publicly reports that an AI agent now handles 70% of its customer service tickets, work equivalent to that of 700 full-time human agents. It is the first widely cited number for agents replacing human work at scale.
- September 8, 2025
Salesforce, Microsoft, and Google all reposition their AI products around "agents" rather than "copilots." The category goes from frontier-lab experiment to enterprise default in twelve months.
- February 19, 2026
Anthropic brings its agent-as-product offering to enterprise general availability (GA). The shift from "talk to AI" to "give AI a goal and let it work" becomes the dominant deployment pattern at the frontier.
- May 9, 2026
Reports of agent failures in production begin to surface: botched ticket triage, stale customer records, missed escalations. The story shifts from "can agents do real work" to "can they do it reliably enough."
Where things stand right now
Agents now handle real volume in customer support, coding, and operations at major enterprises, and the demo-versus-production debate is over. The new debate is reliability: as agents take on more autonomous decisions, the frequency and cost of their failures have become the question that decides how fast the rollout continues.