Sierra localized Agent Studio in four months with AI coding agents
Sierra published a concrete Agent Studio localization case study covering 900+ frontend files, batch scripts, lint loops, and context-window failures.
- What happened: Sierra engineer Stephen Burgess described how Agent Studio was prepared for localization with AI coding agents.
- A similar Slack effort had taken a 10-person team 9-12 months; Sierra says one engineer orchestrated this project in under four months.
- The scale: User-facing strings were spread across 900+ frontend files, with review batches of roughly 30 files.
- The lesson: The bottlenecks moved from model output to PR queues, human review, oversized skills docs, and missed context-window instructions.
- Cursor and cloud agents gave way to a Claude API batch script plus a localization linter feedback loop.
Sierra engineer Stephen Burgess published "AI-native product localization" on May 28, 2026, describing how Sierra prepared Agent Studio for multiple languages. The comparison that opens the post is intentionally sharp: roughly a decade earlier, a similar localization project at Slack took a 10-person team 9-12 months. At Sierra, Burgess says one engineer orchestrated AI coding agents and finished the initial work in under four months. The useful part for engineering teams is not the headline productivity ratio. Sierra also documented where the agents failed, where review became the bottleneck, and why a repeatable batch pipeline beat agent chat for a large code migration.
This was not a story about machine translation alone. Localization readiness starts earlier than translated strings. Sierra had English text embedded in React components, API strings, pluralization helpers, and concatenated strings. Those strings had to move into ICU MessageFormat, translation files had to be produced, and longer translated text had to be tested against the product UI. Sierra also needed linting and CI so new English-only strings would not keep entering the codebase. Burgess wrote that user-facing strings were scattered across more than 900 frontend files.
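To make the concatenation problem concrete, here is a minimal sketch of the difference between an English-only concatenated label and a per-locale message with plural categories. It is not Sierra's code and not a full ICU MessageFormat implementation: the message table, the function names, the sample Japanese string, and the simplified plural rule are all assumptions for illustration.

```python
# Hypothetical illustration only. A real project would use an ICU MessageFormat
# library; the equivalent ICU source string for English would look like:
#   {count, plural, one {You have # agent} other {You have # agents}}

def concat_label(count: int) -> str:
    # English-only pattern: word order and the plural "s" are baked into code,
    # so a translator cannot reorder or re-inflect the sentence.
    return "You have " + str(count) + " agent" + ("" if count == 1 else "s")


# One complete message per locale and plural category (assumed sample data).
MESSAGES = {
    "en": {"one": "You have {count} agent", "other": "You have {count} agents"},
    "ja": {"other": "エージェントが{count}件あります"},
}


def plural_category(locale: str, count: int) -> str:
    # Simplified rule: English distinguishes "one"; Japanese uses only "other".
    if locale == "en" and count == 1:
        return "one"
    return "other"


def format_label(locale: str, count: int) -> str:
    forms = MESSAGES[locale]
    template = forms.get(plural_category(locale, count), forms["other"])
    return template.format(count=count)
```

The point of the second form is that the whole sentence lives in one translatable unit, so each locale controls its own word order and plural handling.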
The Slack comparison needs that context. Burgess had prior localization experience from Slack, and he explicitly notes that the two products had different surface areas and maturity levels. Sierra initially supported Spanish and Japanese, with four more locales planned. Native speakers, internal dogfooding, and human review remained part of the process. The accurate reading is not "one person replaced a 10-person localization team." It is that AI agents and batch tooling reduced coordination overhead while architecture decisions, quality review, and language judgment stayed human.
Sierra's first approach was the familiar one: ask an IDE agent to convert individual files. Cursor produced good file-level results, but the workflow stayed blocking and sequential. A person still had to sit next to the agent, feed it one file at a time, inspect the result, and move to the next file. At 900+ frontend files, the question stopped being prompt quality and became queue design. How many files can be processed before review loses track of failure patterns? How does the engineer keep a migration moving without turning into a human task scheduler?
The second approach was cloud agents. Running several agents against several files immediately improved throughput, but Sierra found a different bottleneck: tracking mistakes at scale. The agents usually wrapped strings correctly, yet small errors remained. Concatenated strings could be mishandled, ICU syntax could be subtly wrong, and possessive phrasing around variables could become awkward. Each agent also opened its own pull request, so parallel execution created a parallel review queue. For one engineer, concurrency came back as PR triage.
The third approach became the operating unit: a custom batch script that called the Claude API directly. The script accepted a file list or glob pattern, sent each file with Sierra's localization skills documentation, used configurable concurrency, and wrote the transformed output back to disk. That removed the agent UI from the critical path. Instead of manually opening chats and selecting files, Sierra treated localization as a repeatable transformation pipeline. Burgess says he processed roughly 30 files per batch and manually reviewed every changed file in each batch.
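The shape of such a script can be sketched roughly as follows. This is an assumption-laden outline, not Sierra's implementation: `run_batch` and its parameters are hypothetical names, and `transform` is an injected stand-in for the model call so the sketch does not guess at any particular API client.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Dict, Iterable


def run_batch(
    paths: Iterable[Path],
    instructions: str,
    transform: Callable[[str, str], str],
    concurrency: int = 4,
) -> Dict[Path, str]:
    """Transform each file and write the result back to disk.

    `transform(instructions, source)` is a hypothetical placeholder for the
    call that sends the skills docs plus one file to the model.
    Returns a mapping of path -> "ok" or a failure message.
    """

    def process(path: Path):
        try:
            source = path.read_text(encoding="utf-8")
            updated = transform(instructions, source)
            path.write_text(updated, encoding="utf-8")
            return path, "ok"
        except Exception as exc:  # keep one bad file from sinking the batch
            return path, f"failed: {exc}"

    # Threads are adequate here because the work is dominated by network I/O.
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return dict(pool.map(process, paths))
```

A caller would pass something like `sorted(Path("src").glob("**/*.tsx"))` as `paths`; that glob is only an example. Returning a per-file status keeps failures visible for the manual review step instead of aborting the whole run.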
The batch size matters more than it first appears. Sierra reviewed every changed file after each batch because agent errors are rarely isolated when the prompt or instruction file has a systematic flaw. If an agent mishandles a pluralization pattern once, it may repeat that mistake across hundreds of files before anyone notices. In localization, the mechanical act of wrapping a string in i18n.t() is not the hard part. Sentence order, variable interpolation, plural categories, translator context, and UI layout all determine whether a migrated string is usable. A 30-file batch is a practical compromise between throughput and failure containment.
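The containment idea can be expressed as a small review gate. The sketch below assumes hypothetical `process` and `approve` callbacks (the transformation step and the human review decision); only the batch size of roughly 30 comes from Sierra's account.

```python
from typing import Callable, Iterator, Sequence


def review_batches(files: Sequence[str], size: int = 30) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` files."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(files), size):
        yield files[start:start + size]


def run_with_review_gate(
    files: Sequence[str],
    process: Callable[[Sequence[str]], None],
    approve: Callable[[Sequence[str]], bool],
    size: int = 30,
) -> int:
    """Process batches in order and stop at the first rejected batch.

    Returns the number of batches that passed review, so a systematic
    instruction flaw touches at most one unreviewed batch.
    """
    approved = 0
    for batch in review_batches(files, size):
        process(batch)
        if not approve(batch):
            break
        approved += 1
    return approved
```

At 900 files and a batch size of 30, this yields 30 review checkpoints, and a bad instruction is caught after at most 30 affected files instead of all 900.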
Sierra also treated review findings as inputs for the next run. After each batch, the engineer collected mistakes, asked the agent to explain why they happened, and added explicit guidance to the skills documentation. That turns the coding agent from a one-shot executor into one component of a feedback loop. The human classifies failures, the agent helps identify causes, the instructions change, and the next batch reads the narrower playbook. The model weights do not change, but the project-specific operating procedure becomes more precise.
The linter is the second half of that loop. Sierra built a rule that flags user-facing strings that are not wrapped in the translation helper. AI generated and refined much of that rule. Burgess describes an iteration cycle where the engineer explains the desired behavior, the agent edits the rule, the linter runs again, and the false positives or misses become the next prompt. The steady-state workflow became: run the linter, batch-process flagged files, run the linter again, and investigate remaining warnings.
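As a rough illustration of what such a rule looks for, the sketch below flags bare text between JSX tags. It is a deliberately crude regex heuristic, not Sierra's rule: a real lint rule would inspect the syntax tree, and `find_unwrapped_strings` is a hypothetical name.

```python
import re
from typing import List, Tuple

# Heuristic: text that sits between a ">" and the next "<" on one line,
# contains at least one letter, and has no braces (so `{t("key")}` passes).
JSX_TEXT = re.compile(r">([^<>{}]*[A-Za-z][^<>{}]*)<")


def find_unwrapped_strings(source: str) -> List[Tuple[int, str]]:
    """Return (line number, text) pairs for likely untranslated JSX text."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in JSX_TEXT.finditer(line):
            findings.append((lineno, match.group(1).strip()))
    return findings
```

A heuristic this simple will misfire, for example on comparison operators that happen to sit before a tag. Those false positives are exactly the kind of signal the iteration cycle feeds back into the next revision of the rule.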
This is where static analysis and agent output become mutually useful. The linter exposed migration misses, while strange warnings from migrated files exposed false positives in the linter. Neither system was perfect on its own. Together, they gave the engineer a stronger signal about whether a file was localization-ready. The transferable lesson is not simply "let AI write lint rules." It is that codemods, static analysis, CI, and agent transformations work best when they are connected into the same reviewable loop.
The most operationally important failure came from the context window. Sierra returned to Cursor to refine linter edge cases and localization skills docs. Error rates rose again, and the agent started repeating patterns that were already prohibited in the skills file. The cause was not that the instruction was missing. The documentation had grown too large. Each time the team found a failure, verbose explanations and large examples were added. Eventually, the interactive Cursor session did not reliably use the full file: it consumed the beginning and silently missed later instructions.
That detail lands directly on teams maintaining AGENTS.md, Cursor rules, Claude Code skills, or Copilot instructions. Longer instruction files do not monotonically improve agent behavior. They may be easier for humans to audit, but they also increase retrieval and attention load for the model. Sierra's fix was to compress the docs, remove bulky examples, increase signal per line, and split one broad document into focused files such as panels-and-typing and what-not-to-translate. The agent did not need every rule at all times; it needed the right small rule set for the current transformation.
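One way to picture "the right small rule set" is a selector that attaches a focused playbook only when the file shows signs of needing it, and that refuses to exceed a size budget. The playbook names `panels-and-typing` and `what-not-to-translate` come from Sierra's post; the playbook contents, the `core-wrapping` entry, the trigger strings, and the character budget below are all invented for illustration.

```python
from typing import List, Tuple

# Placeholder contents: the real documents are not public in this article.
PLAYBOOKS = {
    "core-wrapping": "Wrap user-facing strings in the translation helper.",
    "panels-and-typing": "(rules for panel components and typed props)",
    "what-not-to-translate": "(identifiers, class names, and test ids stay as-is)",
}

# Hypothetical triggers: substrings that suggest a playbook is relevant.
TRIGGERS = {
    "panels-and-typing": ("Panel", "interface "),
    "what-not-to-translate": ("className=", "data-testid="),
}


def select_playbooks(source: str, max_chars: int = 4000) -> Tuple[List[str], str]:
    """Pick the always-on core rules plus any playbook whose trigger appears."""
    names = ["core-wrapping"]
    names += [
        name
        for name, needles in TRIGGERS.items()
        if any(needle in source for needle in needles)
    ]
    text = "\n\n".join(PLAYBOOKS[name] for name in names)
    if len(text) > max_chars:
        # Fail loudly instead of letting later rules be silently ignored.
        raise ValueError(f"instructions are {len(text)} chars, budget is {max_chars}")
    return names, text
```

The explicit budget check is the design choice that matters: it turns "the docs grew too large" from a silent failure into an error the engineer sees before the run starts.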
The contrast between the batch script and an interactive IDE session is also useful. A batch script can make mostly stateless API calls: compact instructions plus one current file in, transformed file out. An IDE session accumulates conversation history, tool results, file reads, and agent state across turns. Adding a large skills file to that context can leave important rules technically present but unreliable in practice. Sierra's report is a concrete warning against treating maximum context as maximum performance.
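The arithmetic behind that contrast can be sketched with a toy model. It uses character counts as a crude proxy for tokens and assumes the session keeps every prior file in context; real IDE agents may compact or truncate history, and they also add tool output that this model ignores. Both function names are hypothetical.

```python
from typing import List


def stateless_context_sizes(instructions: str, files: List[str]) -> List[int]:
    # Each API call carries only the instructions plus its own file.
    return [len(instructions) + len(source) for source in files]


def session_context_sizes(instructions: str, files: List[str]) -> List[int]:
    # Each turn carries the instructions plus every file seen so far.
    sizes = []
    running = len(instructions)
    for source in files:
        running += len(source)
        sizes.append(running)
    return sizes
```

With 100 characters of instructions and three 50-character files, the stateless calls stay at 150 each while the session grows from 150 to 200 to 250. The instructions become a shrinking share of what the model has to attend to.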
String descriptions created a separate source-code design problem. Translators need to know whether "close" means closing a dialog, ending a session, or describing distance. At Slack, Burgess had used @i18n comments above source strings, which extraction tooling then carried into translation files. Sierra initially followed that path, and AI generated the comments quickly. The comments became too verbose, though, and started making UI component files harder for humans to read. Metadata that few engineers needed daily was crowding the application code.
Sierra moved that metadata out of the source code. During extraction, the system records each string's file location and source position. A later enrichment step sends a surrounding code window to Claude and asks for contextual descriptions. Those descriptions are stored in translation files, not beside the React component code. If an AI system generates the description and translation tools consume it, the source file no longer has to carry that explanation permanently. It is a small example of AI-first tooling changing code layout, not just code generation speed.
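A minimal sketch of that two-step flow, under stated assumptions: the `ExtractedString` record, the window radius, and the `describe` callback (a stand-in for the model call) are hypothetical, and the output dictionary is only one plausible shape for a translation-file entry.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class ExtractedString:
    key: str
    text: str
    file: str
    line: int  # 1-based line number recorded at extraction time


def code_window(source: str, line: int, radius: int = 3) -> str:
    """Return the lines within `radius` of the 1-based `line`, clamped to the file."""
    lines = source.splitlines()
    start = max(0, line - 1 - radius)
    end = min(len(lines), line + radius)
    return "\n".join(lines[start:end])


def enrich(
    entry: ExtractedString,
    source: str,
    describe: Callable[[str, str], str],
    radius: int = 3,
) -> Dict[str, str]:
    """Build a translation-file entry whose description lives outside the source."""
    window = code_window(source, entry.line, radius)
    return {
        "key": entry.key,
        "text": entry.text,
        "description": describe(entry.text, window),
    }
```

Because the description is generated from the recorded position at enrichment time, the component file stays free of translator-facing comments.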
The broader AI coding-agent market keeps emphasizing autonomous PRs, cloud execution, sandboxed workers, and review-comment repair. Sierra's case study asks the next operational questions. When agents create many PRs, who reduces the review queue? Where are failure patterns recorded? Who compresses instruction files after they become too verbose? Which linter or CI signal tells the engineer that a batch is safe enough to merge? Which language and UX issues remain outside the agent's authority?
The economics are also wider than token price. IDE agents cost attention because the engineer remains in the loop for each file. Cloud agents increase parallel execution but can multiply review overhead. API batch scripts reduce UI friction but require concurrency, retries, and failure containment. A flawed instruction can spread the same error faster than a human could. Sierra's decision to review every file in every batch is the cost that makes the productivity claim credible. Hide that human review cost and the case study becomes misleading; design the review unit well and one engineer can cover a much larger migration surface.
For teams planning similar migrations, the practical checklist is narrow. Define the migration as a batchable transformation, not as an open-ended chat. Pick a batch size and stop before pattern failures can spread. Manage instructions as small, selectable playbooks instead of a single file that grows after every incident. Pair agent output with linting, CI, and human review. Sierra's result came from those pieces working together, not from one model call being unusually good.
The limits remain clear in Sierra's own account. Native speakers and human reviewers were still needed. Less-visited product screens can still hide untranslated strings. Language nuance, overall UX quality, and locale-specific expectations cannot be closed by a code agent alone. Localization sits between code migration and product quality, so wrapping strings faster does not guarantee a better language experience for users.
Sierra's case study is not evidence that AI agents remove senior engineers from large migrations. It makes the senior engineering role more visible. Someone still chooses the batch boundary, identifies a recurring failure, decides which instruction to delete, moves metadata out of source code, and judges whether a linter warning is real. AI lowered the cost of repetitive edits. The main job shifted toward designing the migration system, verifying it, and extracting durable rules from its failures.
The strongest conclusion is therefore more specific than "AI makes localization faster." Large migrations with coding agents are won or lost in the batch loop: review gates, compact instructions, static analysis, and failure containment. Sierra published that pattern through a real Agent Studio localization project, and the same pressure will show up in many agent-assisted refactors. The hardest moment is not when the model edits a file. It is when an engineer has 30 edited files and needs evidence that the next batch should proceed.