feat(tools): progressive tool disclosure for MCP and plugin tools

Adds Tool Search, a structured-tools progressive-disclosure layer that
replaces MCP and non-core plugin tools in the model-visible tools array
with three bridge tools (tool_search / tool_describe / tool_call) when
the deferrable surface would consume more than a configurable percentage
of the active model's context window. Core Hermes tools are never
deferred.

Default mode is 'auto' with a 10% context threshold, so small toolsets
pay no overhead. Set tools.tool_search.enabled to 'on' to force or 'off'
to disable.

Design carefully reflects the OpenClaw production failure modes
documented in the openclaw-tool-search-report:

- Core tools never defer (toolsets._HERMES_CORE_TOOLS). Addresses the
  'tools silently missing from isolated cron turns' regression class
  (openclaw#84141) by construction: there is no code path that can drop
  a core tool.
- Catalog is stateless across turns: rebuilt from the live tool-defs
  list on every assembly. No session-keyed Map that can drift out of
  sync with the registry.
- tool_call unwraps the bridge call before any hook fires, so plugin
  pre/post hooks, guardrails, approval flows, and the activity feed all
  see the underlying tool name, not the bridge (addresses openclaw#85588
  and the verbose-mode complaint on openclaw#79823).
- The unwrap happens in both the parallel and sequential paths of
  agent/tool_executor.py and also in handle_function_call, so direct
  callers (sandboxed code, eval harnesses) are covered too.
- Bridge tools cannot invoke each other (recursion guard) and cannot
  invoke core tools (those must be called directly).
- Tools mode only: no JS-sandbox code-mode. Keeps the surface small.
- Token estimation via cheap char/4 heuristic; precision isn't needed
  for the threshold decision.

Files:

- tools/tool_search.py: new module (BM25 retrieval, classification,
  threshold gate, bridge dispatch, unwrap helper).
- tests/tools/test_tool_search.py: 35 tests including the OpenClaw
  #84141 regression guard.
- model_tools.py: wires assembly into _compute_tool_definitions as the
  final step, adds skip_tool_search_assembly kwarg so the bridge can see
  the real catalog, dispatches the three bridge tools.
- agent/tool_executor.py: unwraps tool_call in both parallel and
  sequential parsing loops so checkpointing, guardrails, plugin hooks,
  and tool-progress callbacks all observe the underlying tool name.
- hermes_cli/config.py: DEFAULT_CONFIG['tools']['tool_search'] block.
- website/docs/user-guide/features/tool-search.md: user docs.

Validation:

- 35/35 new tests pass.
- Existing tool/registry/model_tools/config/coercion/executor tests
  (82 + 74 + small adjacents) green.
- Live E2E: 20 fake MCP tools registered, get_tool_definitions returns
  3 bridges, tool_search returns top 3 hits, tool_describe returns full
  schema, tool_call dispatches to the real underlying handler and the
  underlying result is what the model sees.
- Reserved-name recursion guard verified live.
- Core-tool refusal via tool_call verified live.
2026-05-23 15:22:01 -07:00 · 2026-05-23 15:22:01 -07:00 · 369075dc95
commit 369075dc95
parent 73d73f1f0d
6 changed files with 1453 additions and 1 deletions
--- a/website/docs/user-guide/features/tool-search.md
+++ b/website/docs/user-guide/features/tool-search.md
@@ -0,0 +1,152 @@
+---
+title: Tool Search
+sidebar_position: 95
+---
+
+# Tool Search
+
+When you have many MCP servers or non-core plugin tools attached to a
+session, their JSON schemas can consume a substantial fraction of the
+context window on every turn — even when only a few of them are relevant
+to what the user actually asked for.
+
+**Tool Search** is Hermes' progressive-disclosure layer for that
+problem. When it activates, MCP and non-core plugin tools are replaced in
+the model-visible tools array by three bridge tools, and the model loads
+each specific tool's schema on demand.
+
+:::info Built-in Hermes tools never defer
+The tools that make up Hermes' core capability set (`terminal`,
+`read_file`, `write_file`, `patch`, `search_files`, `todo`, `memory`,
+`browser_*`, `web_search`, `web_extract`, `clarify`, `execute_code`,
+`delegate_task`, `session_search`, `send_message`, and the rest of
+`_HERMES_CORE_TOOLS`) are *always* loaded directly. Only MCP tools and
+non-core plugin tools are eligible for deferral.
+:::
+
+## How it works
+
+When Tool Search activates for a turn, the model sees three new tools in
+place of the deferred ones:
+
+```
+tool_search(query, limit?)     — search the deferred-tool catalog
+tool_describe(name)            — load the full schema for one tool
+tool_call(name, arguments)     — invoke a deferred tool
+```
+
+A typical interaction looks like:
+
+```
+Model: tool_search("create a github issue")
+  → { matches: [{ name: "mcp_github_create_issue", ... }, ...] }
+Model: tool_describe("mcp_github_create_issue")
+  → { parameters: { type: "object", properties: { ... } } }
+Model: tool_call("mcp_github_create_issue", { title: "...", body: "..." })
+  → { ok: true, issue_number: 42 }
+```
+
+When the model invokes `tool_call`, Hermes **unwraps the bridge** and
+dispatches the underlying tool exactly as if the model had called it
+directly. Pre-tool-call hooks, guardrails, approval prompts, and
+post-tool-call hooks all run against the real tool name — not against
+`tool_call`. The activity feed in the CLI and gateway also unwraps so you
+see the underlying tool, not the bridge.
+
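The unwrap step described above can be sketched roughly as follows. This is an illustration, not the code in `tools/tool_search.py`: the function name `unwrap_bridge_call`, the `BRIDGE_TOOLS` and `CORE_TOOLS` sets, and the error messages are all hypothetical, and the core-tool set is a three-entry stand-in for `_HERMES_CORE_TOOLS`. The two refusals mirror the rules stated in the commit message: bridge tools cannot invoke each other, and core tools must be called directly.

```python
from typing import Any

# Hypothetical stand-ins; the real sets live in the Hermes codebase.
BRIDGE_TOOLS = frozenset({"tool_search", "tool_describe", "tool_call"})
CORE_TOOLS = frozenset({"terminal", "read_file", "write_file"})


def unwrap_bridge_call(name: str, arguments: dict[str, Any]) -> tuple[str, dict[str, Any]]:
    """Return the (tool name, arguments) pair that hooks should observe."""
    if name != "tool_call":
        # Direct call: nothing to unwrap.
        return name, arguments

    target = arguments.get("name")
    if not isinstance(target, str) or not target:
        raise ValueError("tool_call requires a non-empty 'name'")
    if target in BRIDGE_TOOLS:
        # Recursion guard: a bridge may not dispatch to another bridge.
        raise ValueError(f"bridge tools cannot invoke each other: {target}")
    if target in CORE_TOOLS:
        raise ValueError(f"core tools must be called directly: {target}")

    inner = arguments.get("arguments") or {}
    if not isinstance(inner, dict):
        raise ValueError("tool_call 'arguments' must be an object")
    return target, inner
```

Running this before any hook fires is what lets guardrails and approval prompts key on `mcp_github_create_issue` rather than on `tool_call`.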
+## When does it activate?
+
+By default Tool Search runs in `auto` mode: it activates only when the
+deferrable tool schemas would consume at least 10% of the active model's
+context window. Below that, the tools-array assembly is a pure
+pass-through and you pay no overhead.
+
+This decision is re-evaluated every time the tools array is built, so:
+
+- A session with just a few MCP tools and a long context model never
+  activates Tool Search.
+- A session with many MCP servers attached (15+ tools typically) starts
+  activating it.
+- Removing MCP servers mid-session correctly returns to direct exposure
+  on the next assembly.
+
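A minimal sketch of the activation decision, assuming the char/4 token estimate mentioned in the commit message and the "at least 10%" boundary from the text above. The function names `estimate_tokens` and `should_activate` are hypothetical, and serializing each schema with `json.dumps` before counting characters is an assumption about how the estimate might be taken.

```python
import json


def estimate_tokens(tool_defs: list[dict]) -> int:
    """Cheap estimate: about one token per four characters of serialized schema."""
    return sum(len(json.dumps(d)) for d in tool_defs) // 4


def should_activate(mode: str, deferrable: list[dict], context_length: int,
                    threshold_pct: float = 10.0) -> bool:
    """Decide whether the bridge tools should replace the deferrable tools."""
    if mode not in {"auto", "on", "off"}:
        raise ValueError(f"unknown mode: {mode!r}")
    if mode == "off" or not deferrable:
        return False  # nothing to defer, or the feature is disabled
    if mode == "on":
        return True  # forced on whenever at least one tool is deferrable
    # auto: compare the estimated schema cost against the context budget
    budget = context_length * threshold_pct / 100
    return estimate_tokens(deferrable) >= budget


if __name__ == "__main__":
    tools = [{"name": f"mcp_tool_{i}", "description": "d" * 400} for i in range(20)]
    print(should_activate("auto", tools, context_length=8_000))
    print(should_activate("auto", tools, context_length=1_000_000))
```

Because the function is pure and takes the current tool list as input, calling it on every assembly gives the re-evaluation behaviour listed above without any stored state.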
+## Configuration
+
+```yaml
+tools:
+  tool_search:
+    enabled: auto       # auto (default), on, or off
+    threshold_pct: 10   # percentage of context — only used in auto mode
+    search_default_limit: 5
+    max_search_limit: 20
+```
+
+| Key | Default | Meaning |
+| --- | --- | --- |
+| `enabled` | `auto` | `auto` activates above threshold; `on` always activates if there's at least one deferrable tool; `off` disables entirely. |
+| `threshold_pct` | `10` | Percentage of context length at which `auto` mode kicks in. Range 0–100. |
+| `search_default_limit` | `5` | Hits returned when the model calls `tool_search` without a `limit`. |
+| `max_search_limit` | `20` | Hard upper bound the model can request via `limit`. Range 1–50. |
+
+The legacy boolean shape is also accepted:
+
+```yaml
+tools:
+  tool_search: true   # equivalent to {enabled: auto}
+```
+
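One way both config shapes could be normalized into a single settings object is sketched below. Everything here is hypothetical (`ToolSearchSettings`, `normalize`, `_clamp`), and several behaviours are assumptions rather than documented facts: that `false` maps to `off`, that out-of-range numbers are clamped to the documented ranges rather than rejected, and that the default limit is capped by `max_search_limit`. The boolean handling for `enabled` is there because a YAML 1.1 loader such as PyYAML parses unquoted `on` and `off` as booleans.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class ToolSearchSettings:
    enabled: str = "auto"
    threshold_pct: float = 10.0
    search_default_limit: int = 5
    max_search_limit: int = 20


def _clamp(value, low, high):
    return max(low, min(high, value))


def normalize(raw: Any) -> ToolSearchSettings:
    """Accept None, the legacy boolean, or the mapping form."""
    if raw is None:
        return ToolSearchSettings()
    if isinstance(raw, bool):
        # Documented: true == {enabled: auto}. Assumption: false == off.
        return ToolSearchSettings(enabled="auto" if raw else "off")
    if not isinstance(raw, dict):
        raise TypeError("tool_search must be a boolean or a mapping")

    enabled = raw.get("enabled", "auto")
    if isinstance(enabled, bool):
        # YAML 1.1 loaders turn unquoted on/off into True/False.
        enabled = "on" if enabled else "off"
    enabled = str(enabled).strip().lower()
    if enabled not in {"auto", "on", "off"}:
        raise ValueError(f"invalid tool_search.enabled: {enabled!r}")

    # Assumption: clamp to the documented ranges instead of raising.
    max_limit = _clamp(int(raw.get("max_search_limit", 20)), 1, 50)
    default_limit = _clamp(int(raw.get("search_default_limit", 5)), 1, max_limit)
    threshold = _clamp(float(raw.get("threshold_pct", 10)), 0.0, 100.0)
    return ToolSearchSettings(enabled, threshold, default_limit, max_limit)
```

Quoting the value (`enabled: "on"`) sidesteps the boolean coercion entirely if the loader in use follows YAML 1.1.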
+## When NOT to use it
+
+Tool Search trades a fixed per-turn token cost (the three bridge tool
+schemas, ~300 tokens) and at least one extra round trip (search →
+describe → call) for the savings on the deferred schemas. It's a clear
+win when you have many tools and use few per turn; it's overhead when
+you have few tools total.
+
+The `auto` default handles this for you. If you set `enabled: on`
+unconditionally, expect a slight per-turn cost on small toolsets.
+
+## Trade-offs that don't go away
+
+These come from the prompt-cache integrity invariant — they are inherent
+to any progressive-disclosure design, not specific to this implementation:
+
+- **Extra round trips on cold tools.** The first time the model needs
+  a deferred tool, it spends one or two extra model calls to find and
+  load the schema. The token savings on the static side are real, but a
+  portion is paid back at runtime.
+- **No cache benefit on deferred schemas.** A loaded `tool_describe`
+  result enters the conversation history (so it does get cached on
+  subsequent turns) but it never benefits from the system-prompt cache
+  prefix.
+- **Model-quality dependence.** Tool Search assumes the model can write a
+  reasonable search query for the tool it wants. Smaller models do this
+  less well; the published Anthropic numbers (49% → 74% on Opus 4
+  without vs. with tool search) show the upside, but also that ~26 points
+  of accuracy are still lost, at least partly to retrieval failure.
+- **Toolset edits invalidate cache.** Adding or removing a tool
+  mid-session changes the bridge tools' descriptions (which include the
+  count of deferred tools) and the catalog, so the prompt cache is
+  invalidated. This is the same trade-off as any toolset edit.
+
+## Implementation details
+
+- **Retrieval:** BM25 over tokenized tool name + description + parameter
+  names. Falls back to a literal substring match on the tool name when
+  BM25 returns no positive-score hits, which protects against
+  zero-IDF degenerate cases (e.g. searching `"github"` against a
+  catalog where every tool name contains "github").
+- **Catalog is stateless across turns.** It is rebuilt from the current
+  tool-defs list on every assembly, with no session-keyed `Map`. This avoids
+  the class of bug where a stored catalog drifts out of sync with the
+  live tool registry.
+- **No JS sandbox.** Hermes uses the simpler "structured tools" mode
+  (search / describe / call as plain functions). The JS-sandbox "code
+  mode" some other implementations offer is a large surface area; we
+  skip it.
+
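The retrieval bullet above can be illustrated with a small self-contained BM25 scorer plus the substring fallback. This is a sketch under stated assumptions, not the module's actual code: `tokenize` and `search` are hypothetical names, the catalog is a plain mapping from tool name to searchable text, the `k1` and `b` values are common textbook defaults, and the IDF is the classic Robertson form clamped at zero so that a term present in every document contributes nothing.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split on non-alphanumerics, so snake_case names split too."""
    return re.findall(r"[a-z0-9]+", text.lower())


def search(query: str, catalog: dict[str, str], limit: int = 5,
           k1: float = 1.5, b: float = 0.75) -> list[str]:
    """Rank tool names by BM25; fall back to a substring match on the name."""
    docs = {name: tokenize(f"{name} {text}") for name, text in catalog.items()}
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = (sum(len(t) for t in docs.values()) / n_docs) or 1.0
    doc_freq = Counter(term for tokens in docs.values() for term in set(tokens))
    query_terms = set(tokenize(query))

    scores: dict[str, float] = {}
    for name, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            df = doc_freq[term]
            # Assumed variant: classic IDF, clamped so it never goes negative.
            idf = max(0.0, math.log((n_docs - df + 0.5) / (df + 0.5)))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[name] = score

    hits = sorted((n for n, s in scores.items() if s > 0),
                  key=lambda n: (-scores[n], n))
    if not hits:
        # No positive-score hit: literal substring match on the tool name.
        needle = query.strip().lower()
        hits = sorted(n for n in docs if needle and needle in n.lower())
    return hits[:limit]
```

With a catalog in which every tool name contains "github", the query `"github"` scores zero everywhere under this IDF, so the fallback is what returns results.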
+## See also
+
+- `tools/tool_search.py` — the implementation
+- `tests/tools/test_tool_search.py` — the regression suite
+- The `openclaw-tool-search-report` PDF in the original implementation
+  PR for the research that shaped the design