_maybe_wrap_untrusted() only wrapped str-typed tool outputs. When a
high-risk tool (web_extract, browser_*) returns a multimodal content
list ([{type:text},{type:image_url}]) — which _tool_result_content_for
_active_model() produces by unwrapping the _multimodal envelope for
vision-capable providers — the text part reached the model completely
unguarded. An attacker page that ships one image bypassed the entire
untrusted-data wrapper.
Extend the wrapper to handle list content: each {type:text} part is run
through the same string-wrapping path (min-char threshold, delimiter
neutralization, one well-formed block), image/video parts pass through
untouched so the list stays valid for vision adapters. Recursing into
the existing string branch means the list path inherits the delimiter
defang and the no-forgeable-fast-path hardening from #56172 for free.
The outer list is rebuilt (not returned by identity), so callers compare
by value.
`_maybe_wrap_untrusted` is the architectural defense against indirect
prompt injection. It wraps attacker-controllable tool output
(web_extract, web_search, browser_*, mcp_*) in
`<untrusted_tool_result>...</untrusted_tool_result>` so the model treats
it as data. The content was interpolated verbatim, so the boundary was
forgeable.
Two holes. A poisoned page that embeds `</untrusted_tool_result>` closes
the block early — everything after it reads as trusted instructions. And
the `startswith("<untrusted_tool_result")` re-entrancy guard returned
content that merely started with the opening tag completely unwrapped, so
an attacker just prefixed the tag to drop all data framing.
Fix neutralizes any embedded delimiter token (case-insensitive) before
interpolation and drops the forgeable fast-path, so content is always
sealed in exactly one well-formed block. Re-wrapping an already-wrapped
forward is harmless — it stays framed as data.
## What does this PR do?
Closes an indirect prompt-injection bypass in the untrusted-tool-result
wrapper. Attacker content can no longer break out of, or forge, the
trust boundary.
## Related Issue
N/A
## Type of Change
- [x] 🔒 Security fix
## Changes Made
- `agent/tool_dispatch_helpers.py`: add `_neutralize_delimiters` (case-insensitive defang of the `untrusted_tool_result` token); `_maybe_wrap_untrusted` now always neutralizes then wraps, and the forgeable `startswith` re-entrancy guard is removed.
- `tests/agent/test_tool_dispatch_helpers.py`: replace the double-wrap test (it encoded the bypass) with regression tests for embedded closing tag, leading opening tag, and a cased closing tag.
## How to Test
1. `scripts/run_tests.sh tests/agent/test_tool_dispatch_helpers.py` — 29 pass.
2. Embedded `</untrusted_tool_result>` mid-content: real closing delimiter appears once, at the end; payload trapped inside.
3. Content starting with the opening tag: data framing is applied, not skipped.
## Checklist
### Code
- [x] I've read the Contributing Guide
- [x] My commit messages follow Conventional Commits
- [x] I searched for existing PRs to make sure this isn't a duplicate
- [x] My PR contains only changes related to this fix
- [x] I've run the affected tests and they pass
- [x] I've added tests for my changes
- [x] I've tested on my platform: macOS 15 (Darwin 25.5)
### Documentation & Housekeeping
- [x] I've updated relevant documentation (docstrings) — or N/A
- [x] cli-config.yaml.example — N/A
- [x] CONTRIBUTING.md / AGENTS.md — N/A
- [x] Cross-platform impact — N/A (pure-Python, stdlib `re`)
- [x] Tool descriptions/schemas — N/A
Salvaged from PR #35130 (the safe subset of jnibarger01's security pass):
- threat_patterns.py: replace unbounded (?:\w+\s+)* filler with bounded
{0,8} + cap scan input at MAX_SCAN_CHARS (64KiB), and bound the .*
runs in the exfil/config-mod patterns. Kills catastrophic backtracking
on adversarial near-misses.
- hermes_state.py: cap FTS5 query length (MAX_FTS5_QUERY_CHARS) and
extract quoted phrases with a linear scan instead of a regex so
pathological quote runs can't induce backtracking.
- acp_adapter/edit_approval.py + agent/tool_dispatch_helpers.py: recognize
'*** Move File: src -> dst' V4A headers so patch-mode edits are
permissioned/traversal-checked (previously only Update/Add/Delete), and
surface a proposal for mode=patch V4A calls (previously replace-only).
Tests: +ReDoS-bound + FTS5-cap + Move-File-target + V4A-approval cases.
Make verification closure the default coding behavior after landed file edits while keeping bounded retries and config/env switches for users who need to disable it.
Hardens the context window against Brainworm-class promptware attacks
(see #496). Three changes:
1. tools/threat_patterns.py — single source of truth for injection/promptware
patterns. Replaces the duplicated pattern lists in prompt_builder.py and
memory_tool.py. Adds ~15 new Brainworm/C2 patterns (node registration,
heartbeat/beacon, pull tasking, anti-forensic disk avoidance, identity
override, known framework names). Three scopes — 'all' (narrow, classic
injection), 'context' (adds promptware/role-play, broader detection),
'strict' (adds persistence/SSH-backdoor patterns for user-mediated writes).
2. MemoryStore.load_from_disk() now scans entries at snapshot-build time.
Poisoned entries are replaced with [BLOCKED: ...] placeholders in the
frozen system-prompt snapshot. Live state keeps the original so the
user can still inspect + remove via memory(action=read/remove). Scan is
deterministic from disk bytes — prefix-cache invariant holds.
3. make_tool_result_message() wraps results from high-risk tools
(web_extract, web_search, browser_*, mcp_*) in
<untrusted_tool_result source="...">...</untrusted_tool_result>
delimiters with framing prose telling the model the content is data,
not instructions. Architectural defense against indirect injection
from poisoned web pages, GitHub issues, MCP responses — does NOT
regex-scan tool results (pattern arms race + per-iteration latency).
Multimodal content lists pass through unwrapped to preserve adapter
compatibility.
Pattern philosophy: anchor on C2-specific vocabulary or unambiguous attack
behavior, NOT on bossy English. Dropped patterns suggested in #496 that
would have tripped legitimate content: standalone 'you are obligated to',
'do not respond immediately', 'you must X' without a C2-verb anchor.
Validation:
- 257/257 targeted tests pass (test_threat_patterns + test_memory_tool +
test_tool_dispatch_helpers + test_prompt_builder)
- E2E run with real Brainworm payload: blocked from AGENTS.md context-file
path, blocked from MEMORY.md snapshot, wrapped in delimiters when
arriving via web_extract. Legitimate 'you must follow conventions'
phrasing not flagged.
Explicitly NOT in this PR (per #496 discussion):
- Per-tool-result regex scanning (pattern arms race)
- SessionBehaviorMonitor / polling-loop detection (wrong layer)
- Outbound network gating (Docker backend already covers this)
- security.context_scanning warn|block knob (current behavior is always
block-with-placeholder — there's no warn mode that makes sense)
Closes#496 for Phase 1 + the architectural delimiter piece of Phase 2.
Phase 3 stays in tracking issue territory.
Introduces make_tool_result_message() in tool_dispatch_helpers.py as the
single place where tool-result message dicts are built. All six construction
sites in tool_executor.py, agent_runtime_helpers.py, and mini_swe_runner.py
now use it, so tool_name is set in memory from the moment a message is
created rather than relying on fallback logic in the flush paths.
Fixes blank tool_name in both state.db and JSON session logs.
Adds tests.
Original commit 395e9dd9e by Teknium targeted module-level _is_mcp_tool_parallel_safe
and _should_parallelize_tool_batch helpers in pre-refactor run_agent.py. Both
helpers now live in agent/tool_dispatch_helpers.py — re-applied to that
module.
The tools/mcp_tool.py portion (the public is_mcp_tool_parallel_safe API
+ _parallel_safe_servers tracking) merged cleanly from main via the prior
merge commit.
Co-authored-by: Teknium <127238744+teknium1@users.noreply.github.com>