hermes-agent/gateway/restart_loop_guard.py
teknium1 b48cacb97b fix(gateway,cron): guard cron model-tool path + add auto-resume loop breaker (#30719)
Completes the #30719 restart-loop defenses. Defenses 1-2 (the
_HERMES_GATEWAY guard on `hermes gateway stop|restart` + terminal_tool,
and the cron-creation lifecycle filter) already landed on main, but two
gaps remained:

- The agent's `cronjob` model tool calls cron.jobs.create_job directly,
  bypassing the hermes_cli.cron.cron_create CLI filter, so lifecycle
  commands scheduled via the model tool were only blocked at execution
  time (terminal_tool), not at creation. Moved the filter to a shared
  cron/lifecycle_guard.py enforced at create_job — the single chokepoint
  every job-creation path hits (CLI + model tool). Re-exported
  _contains_gateway_lifecycle_command from hermes_cli.cron so
  terminal_tool's import keeps working.
- No breaker for the auto-resume loop itself. Defenses 1-2 cover the
  cron/CLI/terminal paths, but any other SIGTERM source (e.g. a raw
  terminal("launchctl kickstart ai.hermes.gateway")) still triggers the
  boot->auto-resume->re-run cycle. Added gateway/restart_loop_guard.py:
  counts restart-interrupted boots in a rolling window (config
  gateway.restart_loop_guard, default 3 boots / 60s) and skips
  auto-resume for that boot once tripped. The gateway still comes up and
  serves real inbound messages; it just stops replaying the session that
  keeps killing it, putting a human back in the loop.
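
To make the default threshold concrete, here is a minimal in-memory sketch of the rolling-window check described in the second bullet. The names `record_and_check` and `simulate` are hypothetical and exist only for this illustration; the real module persists the boot list to disk between processes, which this sketch skips.

```python
from typing import List, Tuple


def record_and_check(
    boots: List[float],
    now: float,
    max_restarts: int = 3,
    window_seconds: int = 60,
) -> Tuple[List[float], bool]:
    """Prune boots older than the window, append `now`, report if tripped."""
    cutoff = now - max(1, window_seconds)
    recent = [t for t in boots if t >= cutoff]
    recent.append(now)
    tripped = max_restarts > 0 and len(recent) >= max_restarts
    return recent, tripped


def simulate(boot_times: List[float]) -> List[bool]:
    """Feed boot timestamps through the breaker, one call per boot."""
    boots: List[float] = []
    results: List[bool] = []
    for ts in boot_times:
        boots, tripped = record_and_check(boots, ts)
        results.append(tripped)
    return results


# A ~10s respawn loop: the third boot inside 60s reaches the threshold.
loop_result = simulate([0.0, 10.0, 20.0, 30.0])

# Two operator restarts five minutes apart stay below it.
operator_result = simulate([0.0, 300.0])
```

With the defaults, the loop case trips on the third boot and stays tripped while boots keep landing inside the window; the operator case never trips because the first boot has aged out by the time the second is recorded.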

Also tightened the lifecycle regex over main's version: dropped
`hermes gateway start` (benign), required the gateway identifier on the
launchctl/systemctl branches (so `launchctl unload
ai.hermes.update-checker.plist` and `systemctl restart
hermes-meta.service` no longer false-positive), added the inverse
pkill token order, and fixed the binary-script bypass (decode with
errors='replace' instead of swallowing UnicodeDecodeError). The
create_job guard resolves relative script paths under HERMES_HOME/scripts
the same way the scheduler does, so a bare script name is scanned as the
file that actually runs.
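
The binary-script bypass can be illustrated with a small sketch. `LIFECYCLE_RE` below is a deliberately simplified, hypothetical stand-in for the real lifecycle regex (which covers more commands), and both scan functions are assumptions written for this example rather than the code in cron/lifecycle_guard.py.

```python
import re

# Hypothetical stand-in pattern; the real lifecycle regex is broader.
LIFECYCLE_RE = re.compile(r"hermes\s+gateway\s+(?:stop|restart)")


def scan_swallowing_errors(data: bytes) -> bool:
    """Old behavior: an undecodable script is treated as clean."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return False  # the bypass: one invalid byte hides the whole script
    return LIFECYCLE_RE.search(text) is not None


def scan_with_replace(data: bytes) -> bool:
    """New behavior: invalid bytes become U+FFFD and the rest is scanned."""
    text = data.decode("utf-8", errors="replace")
    return LIFECYCLE_RE.search(text) is not None


# A script with two invalid UTF-8 bytes ahead of a lifecycle command.
script = b"#!/bin/sh\n\xff\xfe\nhermes gateway restart\n"

bypassed = scan_swallowing_errors(script)
caught = scan_with_replace(script)
```

Decoding with `errors="replace"` keeps every valid byte sequence intact, so a lifecycle command that sits next to binary junk is still visible to the pattern.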

Design and much of defense-2 originate from PR #33395 (@kshitijk4poor),
which itself salvaged #30728 (@SimoKiihamaki). Rebuilt against current
main since defenses 1-2 had already landed under different names.

Closes #30719.

Co-authored-by: SimoKiihamaki <simo.kiihamaki@gmail.com>
Co-authored-by: kshitijk4poor <82637225+kshitijk4poor@users.noreply.github.com>
2026-07-01 02:48:36 -07:00

150 lines
5.5 KiB
Python

"""Auto-resume restart-loop breaker (#30719, defense-3).

Defenses 1 and 2 (the ``_HERMES_GATEWAY`` guard on ``hermes gateway
stop|restart`` + ``terminal_tool``, and the cron-creation lifecycle
filter) stop the agent from scheduling its own restart via the cron and
CLI paths. They do NOT cover every SIGTERM source: an agent running a
raw ``terminal("launchctl kickstart -k gui/<uid>/ai.hermes.gateway")``,
an external monitor with a bad trigger, or any other repeated crash can
still drive the supervisor (launchd ``KeepAlive`` / systemd ``Restart=``)
into a tight respawn loop. On each boot the gateway auto-resumes the
restart-interrupted session, whose next turn re-runs the offending
logic: a SIGTERM every ~10 seconds until manually broken.

This module is the last-resort circuit breaker: it records a timestamp
each time the gateway boots with restart-interrupted sessions pending,
keeps a rolling window of recent boots persisted across processes (each
boot is a fresh process, so in-memory state is useless), and reports the
loop as "tripped" once too many such boots happen inside a short window.
When tripped, the caller SKIPS auto-resume for that boot. The gateway
still starts and serves real inbound messages; it just stops replaying
the session that keeps killing it, which breaks the cycle and puts a
human back in the loop.

State lives in ``<HERMES_HOME>/gateway/restart_loop.json`` so it is
profile-scoped and survives process death. It is intentionally tiny and
best-effort: any read/write failure fails OPEN (no false trip) because a
broken breaker must never wedge a healthy gateway.
"""

from __future__ import annotations

import json
import logging
import time
from typing import List, Optional

from hermes_constants import get_hermes_home

logger = logging.getLogger("gateway.run")

# Defaults chosen so a legitimate operator restart (or two) never trips the
# breaker, but the documented ~10s respawn loop does within a few cycles.
DEFAULT_MAX_RESTARTS = 3
DEFAULT_WINDOW_SECONDS = 60


def _state_path():
    return get_hermes_home() / "gateway" / "restart_loop.json"
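

def _load_boots() -> List[float]:
    try:
        raw = _state_path().read_text(encoding="utf-8")
        data = json.loads(raw)
        # A non-dict payload (list, string, null) must fail open too, rather
        # than raising AttributeError from ``.get``.
        if not isinstance(data, dict):
            return []
        boots = data.get("boots", [])
        return [float(t) for t in boots if isinstance(t, (int, float))]
    except (OSError, ValueError, TypeError):
        return []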


def _save_boots(boots: List[float]) -> None:
    try:
        path = _state_path()
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps({"boots": boots}), encoding="utf-8")
    except OSError:
        pass


def record_restart_interrupted_boot(
    window_seconds: int = DEFAULT_WINDOW_SECONDS,
    *,
    now: Optional[float] = None,
) -> List[float]:
    """Record that the gateway just booted with restart-interrupted sessions.

    Prunes boots older than ``window_seconds`` and appends the current time.
    Returns the pruned+appended list (most recent last). Best-effort: a
    persistence failure returns the in-memory list without raising.
    """
    ts = time.time() if now is None else now
    cutoff = ts - max(1, window_seconds)
    boots = [t for t in _load_boots() if t >= cutoff]
    boots.append(ts)
    _save_boots(boots)
    return boots


def is_restart_loop_tripped(
    max_restarts: int = DEFAULT_MAX_RESTARTS,
    window_seconds: int = DEFAULT_WINDOW_SECONDS,
    *,
    now: Optional[float] = None,
) -> bool:
    """Return True if the gateway has restarted ``>= max_restarts`` times with
    restart-interrupted sessions inside the last ``window_seconds``.

    Reads the persisted boot log written by
    ``record_restart_interrupted_boot`` and counts boots within the window.
    Fails OPEN (returns False) on any error, because a broken breaker must
    never wedge a healthy gateway.
    """
    if max_restarts <= 0:
        return False
    ts = time.time() if now is None else now
    cutoff = ts - max(1, window_seconds)
    try:
        recent = [t for t in _load_boots() if t >= cutoff]
    except Exception:  # pragma: no cover - _load_boots already guards
        return False
    return len(recent) >= max_restarts


def clear() -> None:
    """Remove the persisted boot log (used on clean shutdown / by tests)."""
    try:
        _state_path().unlink(missing_ok=True)
    except OSError:
        pass


def check_and_record(
    max_restarts: int = DEFAULT_MAX_RESTARTS,
    window_seconds: int = DEFAULT_WINDOW_SECONDS,
    *,
    now: Optional[float] = None,
) -> bool:
    """Record this restart-interrupted boot and report whether the loop is now
    tripped.

    This is the single entry point the gateway calls: it appends the current
    boot, then checks whether the (now-updated) window has reached the
    threshold. Returns True when auto-resume should be SKIPPED to break the
    loop.
    """
    boots = record_restart_interrupted_boot(window_seconds, now=now)
    tripped = len(boots) >= max_restarts if max_restarts > 0 else False
    if tripped:
        logger.warning(
            "Restart-loop breaker TRIPPED: %d restart-interrupted gateway "
            "boots within %ds (threshold %d). Skipping auto-resume to break "
            "a suspected SIGTERM-respawn loop (#30719). Restart-interrupted "
            "sessions stay resume-pending and will continue on the next real "
            "user message. If this is a false positive, delete %s.",
            len(boots),
            window_seconds,
            max_restarts,
            _state_path(),
        )
    return tripped
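
The caller side is not part of this file, so the following is only a sketch of how a gateway boot path might consult the breaker, based on the module docstring. `resume_pending_sessions`, its parameters, and the session-id strings are all hypothetical; the breaker is passed in as a callable so the example runs without `hermes_constants`.

```python
from typing import Callable, Iterable, List


def resume_pending_sessions(
    pending: Iterable[str],
    check_and_record: Callable[[], bool],
    resume: Callable[[str], None],
) -> List[str]:
    """Resume restart-interrupted sessions unless the breaker has tripped.

    Returns the session ids that were actually resumed.
    """
    sessions = list(pending)
    if not sessions:
        # Only boots with restart-interrupted sessions pending are recorded.
        return []
    if check_and_record():
        # Tripped: skip auto-resume for this boot; the sessions stay pending.
        return []
    for session_id in sessions:
        resume(session_id)
    return sessions


# Stubbed usage: a breaker that has not tripped, then one that has.
resumed: List[str] = []
ok = resume_pending_sessions(["s1", "s2"], lambda: False, resumed.append)
skipped = resume_pending_sessions(["s3"], lambda: True, resumed.append)
```

Checking for pending sessions before calling the breaker matters in this sketch: a clean boot with nothing to resume should not count toward the window.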