AI
The bill behind 73% success: agent evaluation moves beyond models
IBM Research and Hugging Face’s Open Agent Leaderboard evaluates AI agents as systems, including harnesses, costs, and failure modes.
AI
OverEager-Bench measures whether coding agents cross the user’s authorized scope during benign tasks, using 500 scenarios and roughly 7,500 runs.
AI
Cohere Command A+ is an Apache 2.0 open model aimed at enterprise agents, private deployment, and the practical cost of sovereign AI.
AI
Alibaba Qwen3.7-Max is not just a model launch. It packages agents, custom chips, 128-accelerator racks, and cloud runtime into one stack.
AI
Google added Street View grounding to Project Genie. The world-model race is moving from prompts toward real spatial data and responsibility boundaries.
AI
Nature published Google DeepMind Co-Scientist and FutureHouse Robin together. Research automation is moving from model demos to verified agent loops.
AI
Cohere’s Reliant AI acquisition shows enterprise AI shifting from general chatbots toward regulated industry agents, evidence tracking, and data sovereignty.
AI
Honeycomb Agent Observability tries to reconstruct LLM calls, tool use, agent handoffs, and downstream systems as one traceable production event.
AI
xAI's Grok Build early beta enters the coding-agent space with a terminal UI, headless execution, ACP, and Claude Code compatibility behind a $300 tier.
AI
Anthropic’s Stainless acquisition shows the agent race moving from model quality into SDKs, MCP servers, and the API plumbing agents need to act.
AI
AWS Security Agent's full-repository code review targets trust boundaries and data flows that traditional SAST often misses.
AI
Vercel Labs Zero is less about new syntax than JSON diagnostics, stable error codes, and typed repair metadata for coding agents.