Devlery - AI news for builders
The bill behind 73% success: agent evaluation moves beyond models
IBM Research and Hugging Face’s Open Agent Leaderboard evaluates AI agents as systems, including harnesses, costs, and failure modes.
Overeager Coding Agents Put Permission Boundaries on the Benchmark
OverEager-Bench measures whether coding agents exceed the user’s authorized scope during benign tasks, using 500 scenarios and roughly 7,500 runs.
Command A+ on two H100s, and the cost threshold for sovereign AI
Cohere Command A+ is an Apache 2.0 open model aimed at enterprise agents, private deployment, and the practical cost of sovereign AI.
Why Qwen3.7 is pairing 35-hour agents with custom chips
Alibaba Qwen3.7-Max is not just a model launch. It packages agents, custom chips, 128-accelerator racks, and cloud runtime into one stack.
Genie swallowed Street View, and maps are the world-model bottleneck
Google added Street View grounding to Project Genie. The world-model race is moving from prompts toward real spatial data and responsibility boundaries.
Two AI Scientist Papers in Nature, and the Lab Bottleneck Is Still Human
Nature published papers on Google DeepMind’s Co-Scientist and FutureHouse’s Robin together. Research automation is moving from model demos to verified agent loops.
Cohere buys Reliant AI as sovereign AI moves into pharma literature
Cohere’s Reliant AI acquisition shows enterprise AI shifting from general chatbots toward regulated industry agents, evidence tracking, and data sovereignty.
Agent Timeline Turns Agent Failures Into Traceable Evidence
Honeycomb Agent Observability tries to reconstruct LLM calls, tool use, agent handoffs, and downstream systems as one traceable production event.
Grok Build Beta Puts xAI Into the Coding Agent War
xAI’s Grok Build early beta enters the coding-agent race with a terminal UI, headless execution, ACP, and Claude Code compatibility behind a $300 tier.
Why Anthropic bought the SDK plumbing behind Claude
Anthropic’s Stainless acquisition shows the agent race moving from model quality into SDKs, MCP servers, and the API plumbing agents need to act.
Full repo scanning, the SAST gap AWS Security Agent is targeting
AWS Security Agent’s full-repository code review targets the trust boundaries and data flows that traditional SAST often misses.
Zero 0.1.3 Turns Compiler Diagnostics Into an Agent API
Vercel Labs Zero is less about new syntax than about JSON diagnostics, stable error codes, and typed repair metadata for coding agents.