LFM2.5 8B Brings Tool Calling to Local Agents
Liquid AI released LFM2.5-8B-A1B, a 1.5B active MoE model with 128K context for local tool calling, structured outputs, and agent workflows.
- What happened: Liquid AI released
LFM2.5-8B-A1Bon May 28, 2026.- It is an 8.3B total, 1.5B active MoE model with 128K context and first-day support for local inference runtimes.
- Builder impact: The release targets tool calling, structured output, and low-latency local agent loops rather than broad frontier-model replacement.
- Watch: The model card says it is not optimized for heavy programming or knowledge-heavy QA without retrieval.
- The
LFM Open License v1.0also keeps commercial use by entities with more than $10 million in annual revenue outside the default grant.
- The
Liquid AI released LFM2.5-8B-A1B on May 28, 2026. The company's own framing is narrow and useful: this is an edge model built for fast, reliable tool calling on consumer hardware. It has 8.3B total parameters, activates 1.5B parameters per token through a mixture-of-experts design, and ships on Hugging Face as base, reasoning-tuned, GGUF, ONNX, and MLX variants.
The wrong reading is that a small model now replaces larger models everywhere. Liquid AI's model card lists agentic workflows, tool use, structured outputs, multilingual assistants, and on-device personal-assistant applications as recommended uses. The same card says the model is not optimized for heavy programming or knowledge-intensive question answering without retrieval. The useful category is not a general chatbot. It is a local execution layer that can select tools, fill schemas, and keep short-latency loops moving.
The target surface is local tool calling
The model's headline numbers are aimed at agent builders. According to the Hugging Face card, LFM2.5-8B-A1B has 24 layers, 18 double-gated LIV convolution blocks, 6 GQA layers, 38T training tokens, 131,072 context length, and a 128,000-token vocabulary. Compared with the earlier LFM2-8B-A1B, Liquid AI expanded the context window from 32,768 tokens to 128K and doubled the vocabulary from 65,536 to 128,000.
The runtime story is just as important as the parameter count. Liquid AI lists first-day support for llama.cpp, MLX, vLLM, and SGLang, while the Hugging Face card points developers to Transformers, vLLM, llama.cpp, MLX, and LM Studio documentation. This is not a cloud-API-only announcement. It gives teams deployment paths across Macs, local GPUs, CPU offload, and server-side batch inference.
That distinction matters for agent products. Tool calling is rarely one model call. A single session may open files, search, run OCR, read a calendar, query a local database, scan code, or generate a diff. If every routing decision goes to a frontier model over the network, round-trip latency and API cost accumulate quickly. LFM2.5-8B-A1B is positioned as the smaller local layer that chooses tools and emits structured outputs, not as the system that should make every hard judgment alone.
128K context and a larger vocabulary change the cost curve
Liquid AI says it increased pretraining from 12T to 38T tokens and expanded the tokenizer vocabulary from 65,536 to 128,000. The tokenizer table in the official blog says Korean chars per token improved from 1.652 to 1.943, a 17.6% gain. Hindi moved from 0.961 to 2.118, Thai from 0.671 to 2.269, and Vietnamese from 1.519 to 3.311.
For global builders, those numbers are not just multilingual branding. Local agents often place filenames, calendar titles, chat messages, internal terms, and retrieved snippets directly into prompts. Tokenization efficiency turns into context budget and latency. If the same non-English text fits into fewer tokens, a 128K context window can carry more tool descriptions, state, audit history, and search results before compression or retrieval becomes necessary.
The context expansion path is also specific. Liquid AI says it first used 2T tokens of midtraining focused on reasoning, math, tool use, and longer documents to reach 32K context. It then increased the RoPE base theta and used another 400B tokens of midtraining focused on long documents and long-trajectory data to reach 128K. That reads less like a model optimized only for reading long files and more like one designed to hold longer tool trajectories and multi-step state.
The benchmarks should be read through tool use
Liquid AI's official benchmarks show gains over LFM2-8B-A1B. IFEval rises from 79.44 to 91.84, IFBench from 26.00 to 56.47, Multi-IF from 58.54 to 79.93, and BFCLv4 from 25.52 to 48.50. The agentic workflow benchmark Tau2 Telecom moves from 13.60 to 88.07.

That chart is a starting point, not a product decision. In the r/LocalLLaMA discussion, some users questioned the comparison set and noted that older models appeared in the benchmarks. Other users saw a fit for narrower tasks such as title generation, summarization, tagging, and categorization. The practical move is to carry the model into each team's own tool schemas and failure modes.
| Metric | LFM2-8B-A1B | LFM2.5-8B-A1B | Delta |
|---|---|---|---|
| IFEval | 79.44 | 91.84 | +12.40 |
| IFBench | 26.00 | 56.47 | +30.47 |
| BFCLv4 | 25.52 | 48.50 | +22.98 |
| Tau2 Telecom | 13.60 | 88.07 | +74.47 |
The right evaluation questions are more concrete than "is the answer smart?" Did the model select the right tool? Did it omit required arguments? Did it ask a clarifying question when the request was ambiguous? Did it avoid dangerous write operations without approval? Does it stay stable when the tool description changes slightly? Does accuracy survive when the tool list grows from 10 tools to 70? Local agent quality depends on those questions more than on a single benchmark row.
LocalCowork shows the product shape around the model
Liquid AI's cookbook includes a LocalCowork demo. The README describes a local desktop agent built with Tauri 2.0, a Rust agent core, a React/TypeScript UI, MCP servers, and an OpenAI-compatible localhost inference API. It says the broader tool set spans 14 MCP servers and 75 tools, while the demo uses 6 servers and 20 curated tools.
In that structure, the model is not the whole product. A local agent needs MCP server discovery, a tool registry, audit logs, permission storage, and confirmation dialogs. The LocalCowork README says the confirmation system exists, but the agent loop does not yet use it; currently, tools chosen by the model execute immediately. The README then notes that write actions need preview and confirmation.
That is the more practical reading of the release. A faster local model does not remove the need for a policy engine or a visible approval boundary. File deletion, email sending, payments, permission changes, and external API calls should not run only because a model's confidence is high. A 1.5B active-parameter router can reduce latency, but it cannot replace identity, authorization, and product-level safety controls.
The license separates open weights from open source
Liquid AI publishes the weights and provides several runtime formats. The license, however, is not Apache 2.0 or MIT. The Hugging Face metadata lists license: other and license_name: lfm1.0. The LFM Open License v1.0 includes a commercial-use limitation: legal entities with more than $10 million in annual revenue are not granted commercial use under the agreement.
That may be fine for researchers, individual developers, and small startups. Product teams still need legal review before bundling the model into a local agent that touches customer data. On-device models create an easy mental shortcut: download the weights, ship them with the app, and run locally. App distribution, fine-tuned derivatives, enterprise customer bundles, and managed services may all be read as commercial use.
The Reddit thread raised the same distinction quickly. Some users accepted "open weight" as the better description and pushed back against any language that implied unrestricted open source use. In local AI, availability of weights and availability of commercial rights are separate facts. LFM2.5-8B-A1B opens the runtime path, but larger-company commercial use remains a licensing conversation.
A reasoning-only model creates integration questions
Liquid AI's blog describes LFM2.5-8B-A1B as a reasoning-only model that produces explicit chain of thought before the final answer. The Hugging Face card also says assistant turns include explicit chain of thought. That design may help quality, but it creates separate product-integration work.
If a local agent shows internal reasoning directly in the UI, users may see long intermediate thinking that is not meant to be part of the final answer. If the app logs that reasoning, sensitive filenames, user requests, tool arguments, and incorrect intermediate claims can end up in the audit trail. If the app drops it completely, debugging and evaluation become harder. A serious integration should separate final answers, tool calls, internal reasoning, and audit events into different data classes with different retention policies.
Early compatibility reports also deserve attention. Comments in r/LocalLLaMA mentioned <think> tags appearing in llama.cpp, tool calling not working as expected, tokenizer support pull requests, and GGUF metadata issues. First-day support is not the same as production stability. Local inference depends on the model card, tokenizer, chat template, quantization, and tool-call format all lining up.
What builders should test first
Teams evaluating LFM2.5-8B-A1B should split the workload before choosing a deployment path. Simple tool routing, long reasoning, retrieval-grounded answers, code editing, and risky write actions need different model paths and approval policies. The model is a plausible candidate for structured output and local tool selection, but the model card itself argues against using it as the default model for heavy programming or retrieval-free knowledge QA.
The next step is an internal eval. Official benchmark charts matter less than accuracy against 20, 50, or 100 of a team's own tools. Test whether the model asks follow-up questions on ambiguous requests, refuses to invent tools that do not exist, sets a confirmation flag before dangerous calls, and handles non-English filenames or internal vocabulary reliably. The 17.6% Korean tokenization improvement is useful, but product prompts and logs need direct measurement.
Fallback design is the third requirement. A small active-parameter model can be the fast first router, but failed or high-risk cases need an escalation path to a larger model or to a human. Confidence thresholds, schema validation, dry-run mode, human approval, and audit logs let a team keep the latency advantage while reducing execution risk.
The conclusion: local speed still needs permission design
LFM2.5-8B-A1B is a useful test case for on-device agents. The combination of 8.3B total parameters, 1.5B active parameters, 128K context, 128K vocabulary, GGUF, and MLX gives local developers something concrete to evaluate on laptops and workstations. The release is strongest when read as a tool-calling and structured-output model, not as a universal replacement for larger systems.
The boundaries are equally concrete. The license does not automatically authorize commercial use by entities above the $10 million annual-revenue threshold. The model card does not position the model for heavy programming or retrieval-free knowledge QA. The reasoning-only output requires UI and log handling. The early community reports show that llama.cpp and tool-calling paths need compatibility testing before production use.
The next local-agent competition will not be decided by model size alone. The product decision is which judgments stay on a local MoE model, which judgments escalate to a frontier model, and which actions require human approval. LFM2.5-8B-A1B makes that question more concrete by putting a model, runtime formats, numbers, and license terms on the table.