Devlery

Devlery - AI news for builders

Devlery blog

The bill behind 73% success, agent evaluation moves beyond models

IBM Research and Hugging Face’s Open Agent Leaderboard evaluates AI agents as systems, including harnesses, costs, and failure modes.

Overeager Coding Agents Put Permission Boundaries on the Benchmark

OverEager-Bench measures whether coding agents cross the user’s authorized scope during benign tasks, using 500 scenarios and roughly 7,500 runs.

Command A+ on two H100s, and the cost threshold for sovereign AI

Cohere Command A+ is an Apache 2.0 open model aimed at enterprise agents, private deployment, and the practical cost of sovereign AI.

Why Qwen3.7 is pairing 35-hour agents with custom chips

Alibaba Qwen3.7-Max is not just a model launch. It packages agents, custom chips, 128-accelerator racks, and cloud runtime into one stack.

Genie swallowed Street View, and maps are the world-model bottleneck

Google added Street View grounding to Project Genie. The world-model race is moving from prompts toward real spatial data and responsibility boundaries.

Two AI Scientist Papers in Nature, and the Lab Bottleneck Is Still Human

Nature published Google DeepMind Co-Scientist and FutureHouse Robin together. Research automation is moving from model demos to verified agent loops.

Cohere buys Reliant AI as sovereign AI moves into pharma literature

Cohere’s Reliant AI acquisition shows enterprise AI shifting from general chatbots toward regulated industry agents, evidence tracking, and data sovereignty.

Agent Timeline Turns Agent Failures Into Traceable Evidence

Honeycomb Agent Observability tries to reconstruct LLM calls, tool use, agent handoffs, and downstream systems as one traceable production event.

Grok Build Beta Puts xAI Into the Coding Agent War

xAI's Grok Build early beta enters the coding-agent market with a terminal UI, headless execution, ACP, and Claude Code compatibility, behind a $300 tier.

Why Anthropic bought the SDK plumbing behind Claude

Anthropic’s Stainless acquisition shows the agent race moving from model quality into SDKs, MCP servers, and the API plumbing agents need to act.

Full repo scanning, the SAST gap AWS Security Agent is targeting

AWS Security Agent's full-repository code review targets the trust boundaries and data flows that traditional SAST often misses.

Zero 0.1.3 Turns Compiler Diagnostics Into an Agent API

Vercel Labs Zero is less about new syntax than about JSON diagnostics, stable error codes, and typed repair metadata for coding agents.