
Needle brings tool calling down to a 26M on-device model

Cactus Compute's Needle is a 26M-parameter local model for tool calling, a small experiment that reframes how agents should be designed for latency, cost, and privacy.

AI Summary
  • What happened: Cactus Compute released Needle, a tool-calling model distilled from Gemini 3.1.
    • The package pairs a 26M-parameter model with MIT-licensed code, Hugging Face weights, and a public GitHub repository.
  • Why it matters: Tool calls can move from a large general model into a small on-device router.
  • Builder impact: Personal assistants, command palettes, and wearable agents could lower latency and cost by routing simple actions locally.
  • Watch: The official README says small models can be finicky, and Needle is not positioned as a general conversational model.

Cactus Compute's Needle is small by design. The question behind it is not small at all. In an AI agent, does every decision need to pass through a frontier-scale LLM? Or can a narrow job like "turn this user request into the right structured tool call" be split out and handled by a much smaller model?

According to the official repository, Needle is a 26M-parameter Simple Attention Network distilled from Gemini 3.1. Its target is not open-ended conversation. It is function calling: converting a natural-language request into structured tool-call JSON. Cactus Compute says Needle runs in the Cactus runtime at 6000 tokens/s prefill and 1200 tokens/s decode. The weights and data-generation path are public, and the repository uses the MIT license.

That makes this more interesting than another "small models are surprisingly good" story. Needle pushes on the division of labor inside agent systems. Many products today route intent detection, planning, tool selection, argument generation, and result interpretation through one large model. That is easy to build, but expensive. It also creates friction on phones, watches, glasses, cars, and local desktops, where network latency, battery, and privacy constraints matter.

Needle looks at the problem differently. If the user's sentence already contains the needed fields, and the tool schema is supplied as input, the model may not need broad world knowledge. It may only need to transform input into a specific JSON shape. In that framing, a 26M model can be useful as the first router in an agent pipeline.

  • 26M parameters
  • 6,000 prefill tok/s
  • 200B pretraining tokens
  • MIT repository license

Needle targets calls, not answers

The README example is intentionally simple. The user asks about the weather in San Francisco. The tool list contains get_weather. The model returns a tool call:

result = generate(
    model,
    params,
    tokenizer,
    query="What's the weather in San Francisco?",
    tools='[{"name":"get_weather","parameters":{"location":"string"}}]',
    stream=False,
)

print(result)
# [{"name":"get_weather","arguments":{"location":"San Francisco"}}]

The important detail is that the output is not a natural-language answer. Needle does not describe the weather. It extracts intent and arguments in a form an API can use. General LLMs can answer the user directly and call tools along the way. Needle isolates just the tool-call conversion step.

That separation can be practical in product design. Consider a command palette inside a mobile app. A user says, "Put yesterday's receipt in the expenses folder." The app may expose search_mail, save_file, create_expense, and open_folder as tools. A large model could handle the whole interaction. But a small first-pass router could quickly narrow the tool candidates, then hand the harder judgment to a larger model or deterministic app logic.
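A first-pass router like that can be sketched in a few lines. This is an illustrative candidate-narrowing step, not anything from the Needle codebase; the tool registry, the `narrow_tools` helper, and the word-overlap scoring are all assumptions standing in for whatever ranking a real product would use.

```python
# Hypothetical tool registry for the command-palette example.
TOOLS = {
    "search_mail":    "search email messages by sender, date, or keyword",
    "save_file":      "save an attachment or document to a folder",
    "create_expense": "create an expense entry from a receipt",
    "open_folder":    "open a folder in the file browser",
}

def narrow_tools(request: str, tools: dict[str, str], top_k: int = 2) -> list[str]:
    """Score each tool by word overlap with the request; keep the top candidates."""
    request_words = set(request.lower().split())
    scored = [
        (len(request_words & set(desc.lower().split())), name)
        for name, desc in tools.items()
    ]
    scored.sort(reverse=True)
    return [name for score, name in scored[:top_k] if score > 0]

# The narrowed list, not the full registry, is what gets handed to a
# larger model or to deterministic app logic for the final decision.
candidates = narrow_tools("put yesterday's receipt in the expenses folder", TOOLS)
```

Even a crude scorer like this shrinks the decision the downstream model has to make; a real system would use the small model itself, or embeddings, in place of word overlap.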

For wearables, the distinction becomes sharper. On a watch or glasses interface, users rarely write long prompts. They say short, context-heavy commands: "seven-minute timer," "reply to this person after the meeting," or "save the product I am looking at." Sub-second response, battery, connectivity, and privacy all matter. If the device can form local tool-call candidates before going to a cloud LLM, the product architecture becomes more flexible.

This does not mean Needle replaces large models. It does not claim to interpret results, run long plans, negotiate ambiguous tasks, or perform multi-step reasoning. Its value, if it holds up in practice, comes from limiting the job precisely.

The architecture is deliberately narrow

The public README describes Needle with a model dimension of 512, 8 attention heads with 4 KV heads, and an 8,192-token BPE vocabulary. It uses a 12-layer encoder and an 8-layer decoder. The unusual part is the absence of any FFN. In standard Transformer blocks, the feed-forward network after attention holds a large share of the parameters and contributes heavily to internal knowledge and nonlinear transformation.

That choice drew immediate technical interest on Hacker News. One reaction was effectively that the project feels like "attention is actually all you need." That is an interesting observation, but it should not be generalized too far. Removing FFNs does not mean every LLM task can be solved by attention alone.

Needle's context is narrower. The input includes both the user request and the tool schema. The model does not need to remember the capital of France or solve a physics problem. It needs to look at the candidates in the prompt and assemble the right tool name and arguments. If the core task is input transformation rather than knowledge storage, an attention-heavy design may work with far fewer parameters.
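What "attention without an FFN" means can be sketched in plain NumPy. This is an illustrative single-head residual block, not the actual Needle architecture, which the README describes only at a high level; the shapes, initialization, and omitted normalization are all simplifications.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_only_block(x, Wq, Wk, Wv, Wo):
    """One residual block: self-attention, then a residual add. No FFN sublayer."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])   # scaled dot-product attention
    out = softmax(scores) @ v @ Wo
    return x + out                            # residual; normalization omitted

rng = np.random.default_rng(0)
d = 512                                       # matches the README's model dim
x = rng.normal(size=(4, d))                   # 4 tokens
Ws = [rng.normal(size=(d, d)) * 0.01 for _ in range(4)]
y = attention_only_block(x, *Ws)
```

In a standard block, a two-layer FFN (often 4x the model dimension) would follow the attention sublayer and hold most of the parameters; dropping it is where much of the 26M budget savings comes from.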

The training numbers also show that "small" only describes the model size. The README says pretraining used 200B tokens on 16 TPU v6e chips for 27 hours. Post-training used a 2B-token single-shot function-calling dataset for 45 minutes. A 26M model can be cheap to run while still depending on a nontrivial distillation and data pipeline.

User natural-language request → Needle input with tool schema → Structured tool-call JSON → App logic, local tool, or larger model

Why small tool-calling models matter now

The first wave of AI agents focused on smarter base models. That made sense. Complex tasks require planning, code understanding, browser control, and recovery from errors. But once agents become products, another bottleneck appears. Tool calls are frequent, repetitive, and not always difficult reasoning problems.

In a calendar app, "schedule 30 minutes with Minsoo next Tuesday afternoon" becomes a handful of structured fields. In a messaging app, "share this in the team channel" splits into target, body, and action. In a developer tool, "rerun only the tests that just failed" is mostly command and option mapping. Calling a frontier model for every one of those small conversions accumulates latency and cost.
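The calendar example above reduces to a handful of fields. Here is what that conversion might look like, using the same list-of-calls shape as the README's get_weather example; the `create_event` schema and its field names are hypothetical, not part of Needle.

```python
import json

# Hypothetical tool schema a calendar app might supply as input.
schema = {
    "name": "create_event",
    "parameters": {
        "attendee": "string",
        "duration_minutes": "integer",
        "start": "string",
    },
}

# The structured call a router would emit for
# "schedule 30 minutes with Minsoo next Tuesday afternoon".
call = {
    "name": "create_event",
    "arguments": {
        "attendee": "Minsoo",
        "duration_minutes": 30,
        "start": "next Tuesday afternoon",
    },
}

payload = json.dumps([call])  # list-of-calls shape, as in the README output
```

Note that "next Tuesday afternoon" is passed through as text; resolving it to a timestamp is deterministic app logic, not something the routing model needs to own.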

Privacy is just as important. If an on-device model can create tool-call candidates locally, some intent routing can finish before sensitive contact names, file names, calendar titles, or location details leave the device. Tool execution and external service access still need permissions and audit logs. But moving the first intent-routing step onto the device changes the security surface.

This fits a broader infrastructure pattern. Large models such as GPT, Claude, Gemini, and Grok are competing on longer context, stronger reasoning, and richer tool use. At the same time, local models, speculative decoding, multi-token prediction, KV-cache compression, and edge inference are trying to reduce cost and latency. Needle belongs to the second current, but with a tighter target: agentic tool calling rather than general text generation.

Hacker News reaction shows both promise and doubt

The Hacker News discussion helps frame the project. The Show HN post drew strong interest on May 13, 2026. Developers reacted to the 26M parameter count, Gemini distillation, no-FFN architecture, and the possibility of on-device execution.

The practical questions were just as important. A single get_weather example is easy. The real problem is what happens when an assistant has dozens of tools, user intent is ambiguous, and some tools have irreversible side effects. "Tell my boss I will be late" could mean an email, a text message, a calendar update, or a reminder, depending on app context and available tools.

One Hacker News commenter said a Hugging Face example mapped a late-to-boss request to a timer tool. Another commenter replied that the example did not include an email tool, and that Needle produced an email call when send_email was provided. That exchange captures the core lesson. Tool-calling quality is not only a model property. It depends on which tools are supplied, how clearly they are described, how candidates are narrowed, and what the system does when the request is ambiguous.

| Question | Favorable for Needle | Still hard |
| --- | --- | --- |
| Tool count | A few clear tools | Dozens of overlapping candidates |
| Request shape | Fields appear directly in the input | Implicit, omitted, or habit-dependent intent |
| Failure cost | Reversible local actions | Payments, deletion, outbound messages, permission changes |
| Product role | First router or candidate generator | Final judge and long-horizon planner |

The builder takeaway is task separation

The first response to Needle should not be "put this in every app tomorrow." It should be a classification exercise. Which natural-language features in your product are simple routing problems, and which require real reasoning?

In developer tools, "open this file," "run tests," "show recent logs," and "create a PR" are plausible small-model routing candidates. "Analyze why this architecture is slow and fix it" is a different category. It requires reading code, forming hypotheses, running experiments, editing files, and verifying results. Sending both through the same model path makes simple commands too expensive and complex work too brittle.

The same split applies to personal assistants. "Set a timer for seven minutes" is local-router territory. "Before next week's investor meeting, read my email and Notion notes and prepare a briefing" is not. Needle-like models are more likely to reduce latency for the first class and narrow tool candidates for the second, not replace the reasoning model behind the second.

Evaluation also has to change. A single benchmark score is not enough for tool calling. The relevant questions are more operational. Did the model choose the right tool? Did it fill required arguments? Did it ask for clarification instead of executing an ambiguous command? Did it require confirmation before risky actions? Does it stay stable when the tool description changes slightly? Does performance hold as the candidate tool set grows?
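Those operational questions translate directly into checks. A minimal sketch, in which the case format, the `check_route` helper, and the risky-tool set are all assumptions about how a product team might structure its own routing suite:

```python
# Tools whose side effects should require explicit confirmation.
RISKY_TOOLS = {"send_email", "delete_file", "make_payment"}

def check_route(case: dict, predicted: dict) -> list[str]:
    """Return a list of failures for one routing decision (empty list = pass)."""
    failures = []
    if predicted.get("name") != case["expected_tool"]:
        failures.append("wrong tool")
    missing = set(case["required_args"]) - set(predicted.get("arguments", {}))
    if missing:
        failures.append(f"missing args: {sorted(missing)}")
    if predicted.get("name") in RISKY_TOOLS and not predicted.get("needs_confirmation"):
        failures.append("risky tool without confirmation")
    return failures

# One case from a hypothetical domain suite.
case = {"expected_tool": "get_weather", "required_args": ["location"]}
good = {"name": "get_weather", "arguments": {"location": "San Francisco"}}
```

Running a suite like this against perturbed tool descriptions and growing candidate sets answers the stability questions above far more directly than a single benchmark score.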

Needle is useful because it makes those questions smaller and more reproducible. A 26M model can be fine-tuned, tested against a product-specific tool schema, and evaluated with a domain-specific routing suite at a lower cost than repeatedly probing a frontier API as a black box.

Small routers can change agent cost structure

Agent systems become expensive because they repeat many small decisions. A single session can include tool selection, retry decisions, state summarization, next-action selection, and result interpretation. If twenty small routing steps happen inside a task, sending all of them to a large model is wasteful when many are straightforward.

A small-router strategy aims for three effects. Clear requests are handled locally. Ambiguous requests are escalated to a larger model. When a larger model is needed, the router can pass a narrowed tool set and candidate arguments, reducing context and work. If the pattern works, the large model becomes the place for hard judgment rather than the place every minor transformation must visit.

The tradeoff is added system complexity. Two models mean two places for errors. If the small model routes incorrectly, the larger model may start from a bad premise. A practical design needs routing confidence, fallback thresholds, user confirmation, and audit logs. For tools with side effects, especially payments, deletion, outbound communication, and permission changes, a small model should not be the sole authority.
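The confidence-and-fallback logic above can be made concrete in a few lines. This is a sketch of one plausible policy; the threshold value, the risky-tool set, and the `dispatch` function are assumptions, not anything Cactus Compute ships.

```python
# Tools with side effects never auto-execute from a small-model decision alone.
RISKY = {"make_payment", "delete_file", "send_email", "change_permissions"}
CONF_THRESHOLD = 0.85  # tuning this is a product decision, not a model property

def dispatch(call: dict, confidence: float) -> str:
    """Decide where a routed tool call goes: confirm, execute locally, or escalate."""
    if call["name"] in RISKY:
        return "ask_user_confirmation"     # irreversible actions need a human
    if confidence >= CONF_THRESHOLD:
        return "execute_locally"           # clear request, cheap on-device path
    return "escalate_to_large_model"       # ambiguous: pass narrowed candidates up
```

The point of the policy layer is that the small model's mistakes are bounded: a wrong route either gets confirmed by the user or re-judged by the larger model, never silently executed.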

That is why Needle is best read as an architecture signal, not a finished product category. Agent products are likely to move away from one large model doing everything. They will combine small models, policy engines, search indexes, tool runtimes, verification models, and human approval paths. Needle asks whether a small open model can occupy the tool-router layer.

Tool calling may become its own layer

The exaggerated reading is that a 26M model replaces frontier LLMs. The official README does not make that claim. Cactus Compute frames Needle as an experiment focused on single-shot function calling, and larger conversational models still have broader scope and capacity.

The important part is that tool calling can be treated as a separately optimized infrastructure layer. It does not have to remain an incidental feature inside a general model response. If a small, fast, local model can handle part of that layer, AI products can gain lower latency, lower cost, and better privacy boundaries.

The practical lesson for builders is clear: do not send every natural-language request down the same model path. Split the work first. Simple routing, risky execution, long reasoning, and result verification have different requirements. Needle shows that at least one of those pieces may be made much smaller.

The next competition may not be only about model size. It may be about deciding which judgments belong to a frontier model, and which can move into smaller models and local runtimes. Agent architecture will spend more time on that boundary.
