fix(vision): cap vision_analyze fan-out concurrency process-wide

A single agent turn can fan out N vision_analyze calls at once — the
classic trigger is "analyze every frame of this video", where ffmpeg
explodes a clip into dozens of frames and the model calls vision_analyze
on each. Every call does a CPU-heavy base64-encode/resize burst AND holds
a long-lived LLM stream open. The tool executor runs concurrent tool calls
on a per-session ThreadPoolExecutor (_MAX_TOOL_WORKERS=8), and multiple
agent sessions share one process (the dashboard runs the agent in-process),
so there was no global ceiling. In prod (June 2026) a video-frame fan-out
pinned a worker thread at ~100% CPU and starved the shared asyncio event
loop that also serves the dashboard's /api/status liveness probe, flapping
the instance to UNHEALTHY even though nothing had crashed.

Add a process-global threading.BoundedSemaphore that bounds how many vision
analyses run concurrently across the whole process, held across the entire
analysis (image load + encode + LLM call) in the single _handle_vision_analyze
chokepoint (covers both the native fast path and the legacy aux-LLM path).

It is a threading semaphore, NOT asyncio: each vision call is dispatched
through model_tools._run_async on a per-thread event loop, so an asyncio
primitive bound to one loop cannot coordinate across them. The acquire is
offloaded via run_in_executor so waiting for a slot never blocks the calling
loop.

Default: min(host CPUs, 4), floored at 1 — respect the host's concurrency,
or lower. Override via auxiliary.vision.max_concurrency (config.yaml) or
HERMES_VISION_MAX_CONCURRENCY (env). Values < 1 are ignored so the cap can
never be disabled into an unbounded fan-out.

Tests: bounded-fan-out regression guard + a control proving it would fail
without the cap; resolver tests for host-cpu default, ceiling clamp, low-cpu
host, env override, and sub-1 rejection. Pre-existing handler tests updated
for the now-async _handle_vision_analyze. Verified via the real
registry.dispatch -> _run_async per-thread-loop path (16 concurrent calls,
peak bounded to cap).
This commit is contained in:
Ben Barclay 2026-06-29 15:18:01 +10:00 committed by Teknium
parent 115e78c377
commit eddfecd2ce
5 changed files with 346 additions and 37 deletions

View file

@ -7,7 +7,9 @@ Also tests the vision_tools and browser_tool model override env vars.
import os
import sys
from pathlib import Path
from unittest.mock import patch, MagicMock
from unittest.mock import patch, MagicMock, AsyncMock
import pytest
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", ".."))
@ -238,22 +240,30 @@ class TestGatewayBridgeCodeParity:
class TestVisionModelOverride:
"""Test that AUXILIARY_VISION_MODEL env var overrides the default model in the handler."""
def test_env_var_overrides_default(self, monkeypatch):
@pytest.mark.asyncio
async def test_env_var_overrides_default(self, monkeypatch):
monkeypatch.setenv("AUXILIARY_VISION_MODEL", "openai/gpt-4o")
from tools.vision_tools import _handle_vision_analyze
with patch("tools.vision_tools.vision_analyze_tool", new_callable=MagicMock) as mock_tool:
with (
patch("tools.vision_tools.vision_analyze_tool", new_callable=AsyncMock) as mock_tool,
patch("tools.vision_tools._should_use_native_vision_fast_path", return_value=False),
):
mock_tool.return_value = '{"success": true}'
_handle_vision_analyze({"image_url": "http://test.jpg", "question": "test"})
await _handle_vision_analyze({"image_url": "http://test.jpg", "question": "test"})
call_args = mock_tool.call_args
# 3rd positional arg = model
assert call_args[0][2] == "openai/gpt-4o"
def test_default_model_when_no_override(self, monkeypatch):
@pytest.mark.asyncio
async def test_default_model_when_no_override(self, monkeypatch):
monkeypatch.delenv("AUXILIARY_VISION_MODEL", raising=False)
from tools.vision_tools import _handle_vision_analyze
with patch("tools.vision_tools.vision_analyze_tool", new_callable=MagicMock) as mock_tool:
with (
patch("tools.vision_tools.vision_analyze_tool", new_callable=AsyncMock) as mock_tool,
patch("tools.vision_tools._should_use_native_vision_fast_path", return_value=False),
):
mock_tool.return_value = '{"success": true}'
_handle_vision_analyze({"image_url": "http://test.jpg", "question": "test"})
await _handle_vision_analyze({"image_url": "http://test.jpg", "question": "test"})
call_args = mock_tool.call_args
# With no AUXILIARY_VISION_MODEL env var, model should be None
# (the centralized call_llm router picks the provider default)