Devlery

Agent memory moved into files, and AMP shows a different path from OpenAI

OpenAI Agents SDK memory and the AMP v0.1 draft turn long-term agent memory into files, Git history, MCP resources, and auditable state.

AI summary
  • What happened: OpenAI documented beta sandbox agent memory in the Agents SDK, while the Agent Memory Protocol (AMP) project published a v0.1 draft for Markdown- and Git-based agent memory.
    • Both treat memory as an operational artifact: MEMORY.md, raw memory files, indexes, MCP resources, and filesystem conventions rather than a hidden model feature.
  • Why it matters: The agent memory race is starting with ownership, portability, provenance, and auditability, not just recall quality.
  • Watch: AMP is still a draft and its reference implementation is planned, so this is not a settled standard yet.
    • Recent research also warns that aggressive memory consolidation can hurt performance. Preserving raw episodes and gating summaries may matter more than simply remembering more.

AI agents' "memory" is moving out of product copy and into concrete developer infrastructure: files, manifests, indexes, retention rules, and protocol surfaces. OpenAI now describes agent memory as a beta capability for sandbox agents in the Agents SDK, where lessons from one run can be distilled into workspace artifacts and reused by future runs. At the same time, Agent Memory Protocol, or AMP, has published a v0.1 draft built around Markdown-first, Git-friendly memory with MCP and filesystem integration.

These are not the same kind of product. OpenAI's memory is an SDK feature meant to make sandbox agent runs cheaper and more effective inside its own execution model. AMP is an independent format draft, with its reference implementation still described as planned. Yet reading them side by side shows a useful shift. Agent memory is no longer only about putting more chat history into a context window or trusting a model to remember on its own. It is becoming an operations-layer question: who can read and write the memory, which record is the original evidence, when should a memory expire, and can the state move to another runtime?


Memory becomes operating state, not just session history

The interesting part of OpenAI's documentation is that memory is separated from an ordinary conversation session. The SDK's conversational Session handles message history, while sandbox agent memory is described as a way for future runs to learn from previous executions by distilling lessons into workspace files. The default layout includes memory_summary.md, MEMORY.md, raw_memories.md, raw_memories/, and rollout_summaries/. After a run, the system performs conversation extraction and layout consolidation to create raw memories and summary memories.
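Because the layout is just files and directories, an operator can inspect it with ordinary tooling. The sketch below is a minimal audit helper, not part of the Agents SDK: the entry names come from the layout described above, while the function name `audit_memory_layout` and the assumption that all entries sit directly under one memory directory are hypothetical.

```python
from pathlib import Path

# Entry names are taken from the layout described in the article.
# Treating them as direct children of a single directory is an assumption.
EXPECTED_ENTRIES = {
    "memory_summary.md": "file",
    "MEMORY.md": "file",
    "raw_memories.md": "file",
    "raw_memories": "dir",
    "rollout_summaries": "dir",
}


def audit_memory_layout(memory_dir: Path) -> dict[str, bool]:
    """Report which expected memory artifacts exist with the expected kind."""
    report = {}
    for name, kind in EXPECTED_ENTRIES.items():
        path = memory_dir / name
        report[name] = path.is_file() if kind == "file" else path.is_dir()
    return report
```

A check like this is only a starting point for retention work: it tells you which artifacts exist, not what sensitive content they hold.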

That is a small but important change in framing. When an agent "remembers," the state is not merely some hidden profile inside a vendor service. At least from the developer tooling angle, memory becomes a file, a summary, a raw evidence trail, and a consolidated lesson. OpenAI's documentation also warns that memory can contain sensitive information and should be governed by a retention policy similar to the workspace itself. In other words, memory is not just prompt optimization. It is data governance.

AMP pushes the same issue more directly. Its README describes agent knowledge, preferences, decisions, and procedures as invisible, hard to prune or audit, locked to providers, and incompatible across frameworks. Its proposed answer is a .amp/ directory: an amp.yaml manifest, Markdown memory nodes under nodes/, ephemeral daily notes, and a regenerable index/. Each node carries frontmatter and is classified into types such as fact, preference, episode, procedure, and reflection.

.amp/
├── amp.yaml
├── nodes/
│   ├── facts/
│   ├── preferences/
│   ├── episodes/
│   └── procedures/
├── daily/
└── index/

The key point is that the store is not a special database. It is meant to be readable as Markdown, connected through wiki links, versioned in Git, and portable by copying a folder. This is less a claim about a breakthrough long-term memory algorithm than an argument that AI agents should leave their operating state in a document store humans can inspect.
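To make the "Markdown node with frontmatter" idea concrete, here is a minimal sketch of writing and reading one node using only the standard library. The five node types come from the draft as described above. Everything else is an assumption for illustration: the frontmatter keys (`id`, `type`, `source`), the flat `key: value` syntax, the helper names, and the rule that a node of type `fact` lands in `nodes/facts/`.

```python
from pathlib import Path

# Node types named in the AMP README as described in the article.
NODE_TYPES = {"fact", "preference", "episode", "procedure", "reflection"}


def render_node(meta: dict[str, str], body: str) -> str:
    """Render a node as frontmatter plus Markdown body (hypothetical keys)."""
    if meta.get("type") not in NODE_TYPES:
        raise ValueError(f"unknown node type: {meta.get('type')!r}")
    lines = ["---"]
    lines += [f"{key}: {value}" for key, value in meta.items()]
    lines += ["---", "", body.strip(), ""]
    return "\n".join(lines)


def parse_node(text: str) -> tuple[dict[str, str], str]:
    """Split a node back into its frontmatter mapping and Markdown body."""
    if not text.startswith("---\n"):
        raise ValueError("node has no frontmatter block")
    header, _, body = text[4:].partition("\n---\n")
    meta = {}
    for line in header.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return meta, body.strip()


def write_node(store: Path, meta: dict[str, str], body: str) -> Path:
    """Write a node under nodes/<type>s/<id>.md (assumed path convention)."""
    path = store / "nodes" / f"{meta['type']}s" / f"{meta['id']}.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(render_node(meta, body), encoding="utf-8")
    return path
```

Because each node is a plain file, `git diff` shows exactly what an agent changed, which is the property the draft is arguing for.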

OpenAI and AMP answer different questions

OpenAI's path is product-integrated memory. When an agent works in a sandbox, reads files, runs shell commands, applies patches, and repeats workflows across runs, memory can reuse corrections and lessons from earlier executions. The cost argument is explicit: reduce agent cost by avoiding repeated exploration, reduce user cost by not asking people to restate preferences, and reduce context cost by not forcing teams to hunt down and paste old threads.

AMP's path is portable memory. If an agent framework can read the .amp/ structure, the memory should be able to move with the project. The agent integration spec proposes three surfaces. An MCP server can expose tools such as amp_store, amp_recall, amp_search, amp_forget, and amp_link. A resource protocol can expose memory slices through URIs such as amp://store_id/recall?context=.... A filesystem convention gives agents such as Claude Code, Codex, Cursor, or OpenClaw a simpler route: read and write Markdown files in the workspace.
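The resource URI shape above can be taken apart with Python's standard URL parser. This is a sketch under assumptions: it reads the host position as the store identifier and the first path segment as the operation, which matches the example form `amp://store_id/recall?context=...` but is not confirmed against the draft's grammar. The function name `parse_amp_uri` is hypothetical.

```python
from urllib.parse import parse_qs, urlsplit


def parse_amp_uri(uri: str) -> dict:
    """Split an amp:// resource URI into store, operation, and parameters."""
    parts = urlsplit(uri)
    if parts.scheme != "amp":
        raise ValueError(f"not an amp:// URI: {uri!r}")
    return {
        # Assumption: the authority component carries the store identifier.
        "store_id": parts.netloc,
        # Assumption: the path names the operation, e.g. "recall".
        "operation": parts.path.strip("/"),
        # Keep the first value of each query parameter, percent-decoded.
        "params": {key: values[0] for key, values in parse_qs(parts.query).items()},
    }
```

A server that accepts such URIs would still need to validate the store identifier and authorize the caller before returning any memory slice.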

| Item | OpenAI Agents SDK memory | AMP v0.1 draft |
| --- | --- | --- |
| Main purpose | Let sandbox agents reuse lessons from previous runs | Move agent memory outside one provider or framework |
| Storage shape | MEMORY.md, raw memories, rollout summaries | Markdown nodes, frontmatter, wiki links, indexes |
| Integration surface | Sandbox capability inside the Agents SDK | MCP tools, resource URIs, direct filesystem access |
| Status | Beta documentation | v0.1 draft, reference implementation planned |
| Core risk | The memory lifecycle is controlled inside the vendor runtime | Before adoption, it may remain one more format rather than a standard |

That distinction changes the developer question. If the only thing we ask is which vector database recalls better, memory looks like a search problem. If we ask whether a fact that changed a tool call can be traced back to its source, memory becomes a provenance problem. If we ask whether preferences and procedures follow an agent from Claude Code to Codex or an internal runtime, memory becomes a portability problem. If we ask who supersedes a wrong memory and where that audit log lives, memory becomes an operations policy problem.

Why memory may be the next MCP battlefield

MCP gave tools a common surface. What remains less standardized is the state an agent consulted before deciding to call a tool. Two agents can call the same search_docs tool and produce very different behavior depending on whether one remembered "this customer requires the EU region" while another remembered an older "US region is the default" note.

That is why AMP emphasizes MCP integration. Tools standardize the moment an agent touches the outside world. Memory is the internal evidence layer that shapes why the agent wanted to act in the first place. It is not yet clear whether MCP tool verbs are the right final abstraction for memory. The Korean research note tracked early Reddit discussion in r/mcp and r/openclaw, where the reaction was still small and mixed. Some commenters questioned whether memory should be standardized as tool-like verbs at all, while others argued that conformance tests need to cover recall determinism, duplicate and conflict consolidation, and provenance preservation.

That criticism matters. If a memory API only exposes store and recall, it is just another RAG wrapper from a production developer's point of view. What teams need is stricter guarantees. Does the same query and context return a stable-enough memory set? When old and new memories conflict, which one is active? If a memory pushed an agent toward a tool call, did that memory originally come from a user message, a file, a web page, or a tool result? Without answers to those questions, long-term memory becomes a hard-to-reproduce side effect, not a capability.
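Those three questions (stable recall, which memory is active, where it came from) can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not AMP's or OpenAI's data model: the `MemoryRecord` fields, the `supersedes` link, and the keyword-match recall are all hypothetical choices made to keep the example short.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MemoryRecord:
    id: str
    claim: str
    source_kind: str  # e.g. "user_message", "file", "web_page", "tool_result"
    source_ref: str   # pointer back to the original evidence
    supersedes: Optional[str] = None  # id of the record this one replaces


def active_records(records: list[MemoryRecord]) -> list[MemoryRecord]:
    """Return records that no other record supersedes, in a stable order."""
    superseded = {r.supersedes for r in records if r.supersedes}
    return sorted((r for r in records if r.id not in superseded), key=lambda r: r.id)


def recall(records: list[MemoryRecord], keyword: str) -> list[MemoryRecord]:
    """Deterministic keyword recall over active records only."""
    needle = keyword.lower()
    return [r for r in active_records(records) if needle in r.claim.lower()]
```

With the region example from earlier, a newer "requires the EU region" record that supersedes the older "US region is the default" note is the only one recalled, and its `source_kind` and `source_ref` say where the claim came from. Sorting by id is one simple way to make the same query return the same set in the same order.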

Git-managed memory has advantages and traps

AMP's versioning spec treats Git as the default versioning mechanism. Memory mutations can become commits; branches and merges can be used for divergent memory states; conflicts can surface through familiar workflows. For developers, that idea is intuitive. When an agent creates a new preference, a commit can record it. When an old node is archived, a changelog can show it. When a semantic conflict is detected, the system can mark it as disputed or create a reflection node.

The benefits are real. First, memory becomes human-inspectable. A memory store that an operator can read with cat and audit with git log is easier to inspect than a profile hidden inside a vendor UI. Second, it fits team collaboration. If agent memory lives near the repository, project conventions, deployment procedures, verification checklists, and recurring failure notes can be reviewed next to code. Third, migration becomes cheaper. Instead of waiting for one vendor's export feature, a team has a directory it can move.

The traps are just as real. Git history is hard to erase. AMP's extension spec discusses redaction while warning that original content may remain in Git history, so full deletion requires separate work. If personal data, customer data, or secret-like information enters memory, Markdown's transparency quickly turns into easy replication and leakage. Visibility levels such as team, workspace, or public also cannot be treated as more than advisory if enforcement only happens through files. A real MCP server or API layer has to enforce those boundaries.

Research throws cold water on automatic consolidation

The recent arXiv paper "Useful Memories Become Faulty When Continuously Updated by LLMs" adds an important warning to agent memory optimism. The paper studies systems where LLMs continuously rewrite past trajectories into a textual memory bank. It reports that consolidated memory can improve utility at first, then degrade with repeated consolidation and even fall below a no-memory baseline. The authors locate the problem not in the original experiences themselves, but in the consolidation step. The same trajectories can produce different memory quality depending on the update schedule, and a control that preserves raw episodes can remain competitive.

The paper is not an evaluation of OpenAI's design or AMP's draft. Still, it maps directly onto the design questions. A memory system should be careful about throwing away raw episodes and keeping only clean lessons. Summaries reduce cost and speed up recall, but if they replace the original evidence, a bad generalization can be injected into future runs. The developer question is less "how much does the agent remember" and more "when do we allow consolidation, which raw sources stay as evidence, and how do we revert a faulty memory?"

1. raw episode / rollout summary
2. gated consolidation
3. memory summary / Markdown node / index
4. recall with provenance and stale-memory checks

The weakest link in that flow is usually the second step. If an agent automatically rewrites lessons after every interaction, the memory can appear alive while quietly accumulating drift. If it only stores raw episodes, the system is safer but context costs rise and recall gets slower. That is why the next memory race may not be about "remembering better" in the abstract. It may be about separating raw evidence from abstraction and defining when an abstraction deserves trust.
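One way to picture the separation of raw evidence from abstraction is a memory bank where episodes are append-only and a summary is accepted only if it passes a gate. The sketch below is illustrative and does not reproduce the paper's method or any product's behavior: the class name `MemoryBank`, the gate rule (every cited episode must exist, and at least `min_evidence` distinct episodes must be cited), and the default threshold of 2 are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryBank:
    # Raw episodes are append-only; consolidation never rewrites or deletes them.
    episodes: dict[str, str] = field(default_factory=dict)
    # Accepted summaries, each carrying the episode ids it was derived from.
    summaries: list[dict] = field(default_factory=list)

    def add_episode(self, episode_id: str, text: str) -> None:
        if episode_id in self.episodes:
            raise ValueError(f"episode {episode_id!r} already recorded")
        self.episodes[episode_id] = text

    def propose_summary(self, text: str, evidence_ids: list[str], min_evidence: int = 2) -> bool:
        """Accept a summary only if all cited evidence exists and is sufficient."""
        all_known = all(eid in self.episodes for eid in evidence_ids)
        if not all_known or len(set(evidence_ids)) < min_evidence:
            return False  # gate closed: keep relying on raw episodes
        self.summaries.append({"text": text, "evidence": list(evidence_ids)})
        return True

    def revert_last_summary(self):
        """Drop the most recent summary; the raw episodes remain untouched."""
        return self.summaries.pop() if self.summaries else None
```

The design choice worth noting is that reverting a faulty abstraction is cheap precisely because the raw episodes were never replaced by it.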

What changes for developer teams

The first practical change is that files around the repository become more important. Many coding agents already read AGENTS.md, CLAUDE.md, .cursor/rules, and workflow documents. If a memory store joins that layer, the repository becomes not only a place for code but also a home for agent operating context. Project-specific verification commands, banned deployment paths, repeated customer-environment incidents, and review preferences can become memory nodes.

The second change is that review scope expands. Teams may need to review agent memory diffs in the same spirit as code diffs. A bad procedure such as "always skip this test" can change the next run. An outdated fact such as "customer A allows this security exception" can push an agent toward a risky recommendation. Once memory changes tool behavior, the memory diff is no longer a prompt artifact. It is change management.

The third change is the return of vendor lock-in as a memory question. Models can be swapped more easily than accumulated agent experience. If memory lives inside a vendor, the agent's learned preferences and procedures may not move. AMP's "copy the directory equals migration" premise is attractive for that reason. But an attractive format is not the same as a standard. Real standardization needs implementations, a conformance suite, a security model, import and export adapters, and adoption. At the time of the Korean article, AMP's own README still marked amp-cli, amp-mcp, and amp-python as planned.

The real signal is the question, not a settled standard

It would be overstated to say AMP has become the standard for agent memory. The repository is early, the implementation is planned, and OpenAI's memory documentation is also beta. The more important signal is that multiple paths are now asking the same question: where should agent memory live, who can inspect it, how does it move, and how is it tied back to evidence?

If MCP pulled tool integration into the open, memory may be the next internal agent state to surface. The pressure will rise as long-running agents enter real work. A wrong memory lasts longer than a wrong tool call. A tool call can end as one log line, but memory changes the next run's judgment. Good agent platforms will not only brag about longer context windows. They will need to show which memories are active, which are disputed, and which summaries no longer have raw evidence behind them.

For builders, the right posture today is careful experimentation. If you use product-integrated memory such as OpenAI Agents SDK memory, define retention policies and sensitive-data boundaries first. If you evaluate a portable draft such as AMP, treat it as a draft and start with a narrow scope that does not conflict with existing AGENTS.md files or runbooks. Most importantly, do not treat memory as only a performance cache. Once an agent remembers, that memory becomes operating knowledge and an audit object.

The core of this news is not that remembering agents suddenly became smart. It is that memory is becoming a file, a Git diff, an MCP resource, and a policy problem for deletion and redaction. The next bottleneck in AI agents may not be what the model knows. It may be whether we can verify, move, and erase what an agent claims to remember.