Hugging Face TITO Warns Agentic RL Teams About Token Drift

Hugging Face explains how retokenizing tool-using agent rollouts can break gradients, and proposes TITO as a safer training-loop rule.

AI 요약

What happened: Hugging Face authors published TITO, a Token-In, Token-Out rule for agentic RL, on May 29, 2026.
- The rule is direct: do not decode sampled tokens, rerender them through a chat template, and encode them again for the RL loss.
The bug: Tool-call rollouts often rebuild a message list after every step, which can put gradients on tokens the model never actually sampled.
The fix: Keep sampled token IDs as the source of truth, then append only prefix-preserving tool-response delta tokens with a zero loss mask.
- The post says 18 of 19 checked open-weight families preserve the tool-message prefix; Qwen3 needed a one-line template fix.
Watch: Context compaction, history rewriting, and sub-agent summaries change the sampled trajectory, so they need explicit loss-mask policy.

Hugging Face published Agentic RL: Token-In, Token-Out Done Right on May 29, 2026. The authors, Quentin Gallouédec and Kashif Rasul, are not announcing a new model or a benchmark win. They are pointing at a lower layer of the agent stack: when a tool-using LLM agent is trained with reinforcement learning, the loop can silently update on tokens the model did not generate.

The article is newsworthy because agentic RL is no longer a single completion followed by a reward. A rollout can include an assistant tool call, an external tool result, another assistant generation, and more tool invocations. In that loop, token boundaries, chat-template rendering, tool-result masking, and history compaction become part of the training objective. If those mechanics drift, the reward model can be correct while the policy gradient is attached to the wrong sequence.

Hugging Face frames the rule simply: RL should update only the exact tokens sampled by the model. Tool responses are not policy outputs, so they should not receive loss. In a single-turn completion, this sounds like a definition. In a multi-turn agent rollout, common implementation patterns violate it. The risky loop decodes assistant output so a tool call can be parsed, appends the tool response to a message list, rerenders the whole conversation through a chat template, and tokenizes the conversation again before computing loss.

Token round-trip drift in agentic RL

The first failure mode is tokenization itself. Decoding and encoding are not perfect inverses. Different token sequences can decode to the same visible string, and the tokenizer can then choose a different greedy segmentation when that string is encoded again. JSON spacing, argument ordering, boolean casing, and special-token rendering add more ways for the displayed text to look equivalent while the token IDs differ from the sampled IDs.

The second failure mode is the loss mask. A natural agent implementation keeps a conversation as structured messages, tokenizes the full rendered chat at the end, then tries to reconstruct which spans came from the assistant and which came from tools. Hugging Face calls this pattern MITO. The loop has to recover per-turn boundaries after losing them, and the recovery depends on role markers and chat-template details. If that reconstruction is wrong, tool-response tokens can receive loss, or assistant tokens can be skipped.

TITO avoids both problems by keeping a running token buffer. The model's sampled token IDs are appended directly to that buffer and remain the source of truth. The decoded text can still be inspected to route a tool call, but the parsed dictionary is used only for dispatch. The decoded content is not encoded again and substituted back into the trajectory.

# Conceptual example: decoded content is used for routing, not retokenized for loss.
buffer.extend(sampled_ids)
loss_mask.extend([1] * len(sampled_ids))

tool_call = parse_for_routing_only(sampled_ids)
if tool_call:
    tool_result_delta = compute_tool_response_delta(tokenizer, tool_call.result)
    buffer.extend(tool_result_delta)
    loss_mask.extend([0] * len(tool_result_delta))

The remaining question is how to append the tool response. The next model step expects role markers, tool-response wrappers, or other chat-template tokens around the tool output. TITO does not solve that by rerendering the entire conversation. Instead, it renders a dummy sequence with the tool message and another without it, then takes only the suffix delta after the shared prefix. In the Qwen2.5 example from the Hugging Face article, that delta represents a user role marker, a <tool_response> wrapper containing the result, and the next assistant prompt.

That approach depends on one property: the chat template must be prefix-preserving for tool messages. If an assistant tool call has already been rendered, appending the tool result should produce a new render whose beginning is token-for-token identical to the old render. Hugging Face says teams can test this property with dummy user, assistant-tool-call, and tool messages, then compare the rendered token prefixes.

The concrete number in the post is 18 of 19. The authors report checking open-weight model families including Qwen2.5, Qwen2.5-Coder, Qwen3 variants, DeepSeek, Llama, Gemma, gpt-oss, GLM, and MiniMax-M2.1. Eighteen preserved the tool-message prefix. The exception was Qwen3.

Qwen3 is useful because its failure is a small template detail rather than a grand theory failure. According to the post, the Qwen3 template inserts an empty <think>...</think> block before a tool call when an assistant turn is the final turn. Once a tool result is appended, that assistant turn is no longer final, so the block disappears and the prefix breaks. Hugging Face says changing the Jinja conditional to {%- if true %} restores prefix preservation without changing inference cost.

That example is the practical warning for agent teams. A model family can support tool calling, and inference can look fine, while the RL training loop is still wrong. A chat template can look like a formatting file for user experience, but in RL it decides where gradients land. Qwen3's one-line fix makes the boundary visible: template conditionals are part of the learning system.

The post also compares TITO with a renderer-based approach. Hugging Face's renderers library provides model-family-specific renderers for Qwen3, GLM, DeepSeek-V3, Kimi, gpt-oss, and others. Those renderers keep message-to-token boundaries as Python objects, expose message_indices and loss_mask, and fail loudly when an unsafe bridge would hide token provenance. If a team only has message-level access, or if it needs vendor-style API semantics, a renderer can still be the right abstraction.

TITO's claim is narrower than "throw away renderers." If an RL pipeline already owns the sampled tokens, the sampled buffer can eliminate the decode-rerender-retokenize drift directly. Instead of maintaining a bespoke renderer for every new model family, the team can run a prefix-preservation property test and compute only the tool-response delta. For teams trying to attach agentic RL infrastructure to many open-weight models, that difference is operational work, not just code style.

The edge cases are where the article becomes especially relevant to coding agents. Long-running agents often rewrite history near the context limit. Claude Code, aider, Codex-style systems, and other coding workflows can replace prior turns with summaries. In a sub-agent setup, a child agent's long trace may return to the parent as a short summary. These rewrites mean the prompt after compaction is not simply the original sampled trajectory plus more tokens.

Hugging Face's position is that the objective changes when history is rewritten. PPO or GRPO importance ratios do not have a clean meaning for a synthetic prompt made from policy outputs, tool outputs, and summaries that the policy never sampled as a single sequence. The suggested workaround is to freeze everything before the last rewrite point as prompt context and set its loss mask to zero. Only genuine sampled tokens after that point receive gradient.

That workaround has a clear cost. If rewrites happen late or frequently, most of a long trajectory may become prompt-only context, leaving only the final few hundred tokens as training signal. For coding agents, this turns context-management policy into an RL-data policy. Compaction is not just a memory optimization; it decides how much of a rollout remains trainable.

Truncation is simpler. If a rollout hits max_seq_len in the middle of a turn, a renderer may be unable to prove a safe extension because closing tokens are missing. In TITO, the buffer ends where generation ended and the loss mask remains aligned. If a reasoning block is unclosed or a tool call is incomplete, the parser does not dispatch a tool. With no remaining budget, there is no need to append a tool result.

Community reaction was still small when the Korean source was written. The Hugging Face page showed two upvotes and a single comment saying it was being saved for later. That muted response does not make the issue minor. Few teams implement agentic RL directly, and this class of bug often appears only after training curves become noisy, shape mismatches appear, or policy updates fail to reproduce expected behavior. The Hugging Face post narrows that failure down to token accounting.

For AI engineering teams, the checklist is concrete. First, inspect whether tool-using rollouts preserve sampled IDs in a buffer. Second, make sure assistant-token and tool-token loss masks are created at append time instead of reconstructed from a later full render. Third, add a property test that chat templates remain prefix-preserving when tool messages are appended. Fourth, treat compaction, history rewriting, and sub-agent summaries as separate RL-objective events rather than ordinary message formatting.

TITO is not a model-performance announcement. It does not claim a larger parameter count or a new benchmark lead. Its point is that agent training infrastructure now needs token provenance as carefully as it needs rewards. As tool-using agents become longer-running systems, the difference between "what the model sampled" and "what the trainer later saw as tokens" grows. Hugging Face's TITO rule is a small implementation constraint with a large consequence: agentic RL teams need to know exactly which tokens earned the gradient.