OpenAI to Acquire Ona for Long-Running Agent Runtimes as Google DeepMind Flags Systemic Multi-Agent Risk
OpenAI announced plans to acquire Ona, a startup building secure, persistent cloud development environments, explicitly to let Codex run long-duration autonomous agent workflows inside enterprise-grade sandboxes. The acquisition is OpenAI's clearest signal that Codex is being positioned as a runtime for persistent, task-oriented agents that can run for hours or days, maintain state, and interact with enterprise systems. On the same day, OpenAI also published case studies showing Codex being used by an astrophysicist to simulate black holes and by BBVA across 100,000 employees.
Google DeepMind went public with concerns about what happens when millions of autonomous agents start interacting online. Rohin Shah, who directs DeepMind's agent safety research, told MIT Technology Review that emergent behaviors could arise from multi-agent interactions — bidding in auctions, negotiating, sharing information, competing for resources — that no single agent designer can predict. This is one of the first major institutional acknowledgments that agent safety at scale is a multi-agent systems problem, not just an individual alignment problem. The parallel to algorithmic trading and market flash crashes is explicit in the framing.
AWS shipped Agent-EvalKit, an open-source toolkit (Apache 2.0) for systematically evaluating AI agent performance across standardized benchmarks, with direct integrations for Claude Code, Kiro CLI, and Kilo Code. Separately, AWS published data showing frontier engineering teams are seeing 4.5x productivity gains (some exceeding 10x) from restructuring their entire development lifecycle around AI agents — not just writing code faster, but redesigning how code is designed, reviewed, tested, and deployed with agents as first-class participants.
DoorDash launched Ask DoorDash, an AI chatbot agent that lets users search and order food using natural language prompts and photos, a sign of how agent-based interaction patterns are becoming embedded in everyday commerce. Anthropic's Fable 5 controversy continued as the company admitted to a "wrong tradeoff" after being caught throttling rival AI researchers, while independent benchmarks from Endor Labs showed Fable 5 delivering "mid-tier results on coding tasks." The open-r1 project on GitHub is systematically reproducing DeepSeek-R1's architecture in public, and a simulation study found LLMs used tactical nuclear weapons in 95% of wargame scenarios, raising new questions about AI decision-making in high-stakes contexts.
Source-linked headlines
OpenAI to acquire Ona
OpenAI Blog · June 11
OpenAI plans to acquire Ona to expand Codex with secure, persistent cloud environments, enabling long-running AI agents across enterprise workflows.
Why it matters: This is the missing runtime layer for Codex agents. With Ona, OpenAI moves Codex from a coding assistant toward a full agent platform with persistent state, sandboxed execution, and enterprise-grade isolation. It's the most important agent infrastructure move since the Codex CLI launch.
Google DeepMind is worried about what happens when millions of agents start to interact
MIT Technology Review · June 11
Google DeepMind is funding research into the systemic dangers of large-scale multi-agent interaction: behaviors that emerge when millions of autonomous agents interact online.
Why it matters: Agent safety has been treated as an individual alignment problem. DeepMind's framing shifts it to a systems problem, the kind that led to financial regulations after algorithmic flash crashes. DeepMind is one of the first major institutions putting money behind that shift.
Evaluate AI agents systematically with Agent-EvalKit
AWS Machine Learning Blog · June 11
AWS released Agent-EvalKit, an open-source toolkit (Apache 2.0) for benchmarking AI agent performance, integrating with Claude Code, Kiro CLI, and Kilo Code.
Why it matters: There's no standard way to evaluate agents today. AWS just made a credible offer to fill that gap — and gave it away. Whoever owns agent evaluation owns a lot of the downstream tooling decisions.
How frontier teams are reinventing AI-native development
AWS Machine Learning Blog · June 11
Frontier engineering teams report 4.5x productivity gains (some exceeding 10x) from restructuring development workflows around AI agents as first-class participants.
Why it matters: This is the most concrete large-scale productivity data yet published on AI-native development. The 4.5x figure will be cited in boardrooms — and it changes the ROI calculation for enterprise agent adoption.
DoorDash's new AI chatbot lets you order with prompts and photos
TechCrunch AI · June 11
DoorDash launched Ask DoorDash, an AI chatbot that interprets open-ended natural language and photos to place food orders autonomously.
Why it matters: This is the agent pattern hitting the most mundane high-frequency consumer transaction. If ordering dinner via agent becomes normal, the consumer onboarding for agent-native commerce is complete.
Source: Best General AI Agents