Completes the #30719 restart-loop defenses. Defenses 1-2 (the _HERMES_GATEWAY guard on `hermes gateway stop|restart` + terminal_tool, and the cron-creation lifecycle filter) already landed on main, but two gaps remained:

- The agent's `cronjob` model tool calls cron.jobs.create_job directly, bypassing the hermes_cli.cron.cron_create CLI filter, so lifecycle commands scheduled via the model tool were only blocked at execution time (terminal_tool), not at creation. Moved the filter to a shared cron/lifecycle_guard.py enforced at create_job, the single chokepoint every job-creation path hits (CLI + model tool). Re-exported _contains_gateway_lifecycle_command from hermes_cli.cron so terminal_tool's import keeps working.
- No breaker for the auto-resume loop itself. Defenses 1-2 cover the cron/CLI/terminal paths, but any other SIGTERM source (e.g. a raw terminal("launchctl kickstart ai.hermes.gateway")) still triggers the boot->auto-resume->re-run cycle. Added gateway/restart_loop_guard.py: counts restart-interrupted boots in a rolling window (config gateway.restart_loop_guard, default 3 boots / 60s) and skips auto-resume for that boot once tripped. The gateway still comes up and serves real inbound messages; it just stops replaying the session that keeps killing it, putting a human back in the loop.

Also tightened the lifecycle regex over main's version: dropped `hermes gateway start` (benign), required the gateway identifier on the launchctl/systemctl branches (so `launchctl unload ai.hermes.update-checker.plist` and `systemctl restart hermes-meta.service` no longer false-positive), added the inverse pkill token order, and fixed the binary-script bypass (decode with errors='replace' instead of swallowing UnicodeDecodeError). The create_job guard resolves relative script paths under HERMES_HOME/scripts the same way the scheduler does, so a bare script name is scanned as the file that actually runs.

Design and much of defense-2 originate from PR #33395 (@kshitijk4poor), which itself salvaged #30728 (@SimoKiihamaki). Rebuilt against current main since defenses 1-2 had already landed under different names.

Closes #30719.

Co-authored-by: SimoKiihamaki <simo.kiihamaki@gmail.com>
Co-authored-by: kshitijk4poor <82637225+kshitijk4poor@users.noreply.github.com>
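
The binary-script bypass fix described in the commit message can be illustrated with a minimal sketch. The pattern below is a hypothetical stand-in, not the actual lifecycle regex from cron/lifecycle_guard.py, and both function names are invented for this example. It only shows why decoding with errors='replace' keeps a script scannable when it contains invalid UTF-8 bytes, whereas swallowing UnicodeDecodeError skips the scan entirely.

```python
import re

# Hypothetical stand-in pattern, for illustration only. The real lifecycle
# regex is broader (launchctl/systemctl/pkill branches) and is not shown here.
_ILLUSTRATIVE_PATTERN = re.compile(r"hermes\s+gateway\s+(?:stop|restart)\b")


def strict_scan(data: bytes) -> bool:
    """Assumed old behaviour: a decode failure is swallowed, so nothing is scanned."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return False  # the script is never inspected: this is the bypass
    return bool(_ILLUSTRATIVE_PATTERN.search(text))


def lenient_scan(data: bytes) -> bool:
    """Assumed new behaviour: undecodable bytes become U+FFFD and the scan proceeds."""
    text = data.decode("utf-8", errors="replace")
    return bool(_ILLUSTRATIVE_PATTERN.search(text))


# Two invalid UTF-8 bytes prepended to an otherwise plain lifecycle command.
payload = b"\xff\xfe\nhermes gateway restart\n"
bypassed = strict_scan(payload)   # decode fails, scan is skipped
caught = lenient_scan(payload)    # command is still found
benign = lenient_scan(b"hermes gateway start\n")  # `start` is not matched
```

With this stand-in pattern, the strict variant misses the command in `payload` while the lenient variant finds it, and `hermes gateway start` is left alone in line with the commit's note that it is benign.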
150 lines
5.5 KiB
Python
"""Auto-resume restart-loop breaker (#30719, defense-3).

Defenses 1 and 2 (the ``_HERMES_GATEWAY`` guard on ``hermes gateway
stop|restart`` + ``terminal_tool``, and the cron-creation lifecycle
filter) stop the agent from scheduling its own restart via the cron and
CLI paths. They do NOT cover every SIGTERM source: an agent running a
raw ``terminal("launchctl kickstart -k gui/<uid>/ai.hermes.gateway")``,
an external monitor with a bad trigger, or any other repeated crash can
still drive the supervisor (launchd ``KeepAlive`` / systemd ``Restart=``)
into a tight respawn loop. On each boot the gateway auto-resumes the
restart-interrupted session, whose next turn re-runs the offending
logic: SIGTERM every ~10 seconds until manually broken.

This module is the last-resort circuit breaker: it records a timestamp
each time the gateway boots with restart-interrupted sessions pending,
keeps a rolling window of recent boots persisted across processes (each
boot is a fresh process, so in-memory state is useless), and reports the
loop as "tripped" once too many such boots happen inside a short window.
When tripped, the caller SKIPS auto-resume for that boot. The gateway
still starts and serves real inbound messages, it just stops replaying
the session that keeps killing it, which breaks the cycle and puts a
human back in the loop.

State lives in ``<HERMES_HOME>/gateway/restart_loop.json`` so it is
profile-scoped and survives process death. It is intentionally tiny and
best-effort: any read/write failure fails OPEN (no false trip) because a
broken breaker must never wedge a healthy gateway.
"""

from __future__ import annotations

import json
import logging
import time
from typing import List, Optional

from hermes_constants import get_hermes_home

logger = logging.getLogger("gateway.run")

# Defaults chosen so a legitimate operator restart (or two) never trips the
# breaker, but the documented ~10s respawn loop does within a few cycles.
DEFAULT_MAX_RESTARTS = 3
DEFAULT_WINDOW_SECONDS = 60


def _state_path():
    return get_hermes_home() / "gateway" / "restart_loop.json"


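def _load_boots() -> List[float]:
    try:
        raw = _state_path().read_text(encoding="utf-8")
        data = json.loads(raw)
        # Valid JSON that is not an object (e.g. a bare list) has no .get();
        # the AttributeError is caught below so the breaker still fails open.
        boots = data.get("boots", [])
        return [float(t) for t in boots if isinstance(t, (int, float))]
    except (OSError, ValueError, TypeError, AttributeError):
        # Missing, unreadable, or malformed state counts as "no boots recorded".
        return []

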
def _save_boots(boots: List[float]) -> None:
    try:
        path = _state_path()
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps({"boots": boots}), encoding="utf-8")
    except OSError:
        pass


def record_restart_interrupted_boot(
    window_seconds: int = DEFAULT_WINDOW_SECONDS,
    *,
    now: Optional[float] = None,
) -> List[float]:
    """Record that the gateway just booted with restart-interrupted sessions.

    Prunes boots older than ``window_seconds`` and appends the current time.
    Returns the pruned+appended list (most recent last). Best-effort: a
    persistence failure returns the in-memory list without raising.
    """
    ts = time.time() if now is None else now
    cutoff = ts - max(1, window_seconds)
    boots = [t for t in _load_boots() if t >= cutoff]
    boots.append(ts)
    _save_boots(boots)
    return boots


def is_restart_loop_tripped(
    max_restarts: int = DEFAULT_MAX_RESTARTS,
    window_seconds: int = DEFAULT_WINDOW_SECONDS,
    *,
    now: Optional[float] = None,
) -> bool:
    """Return True if the gateway has restarted ``>= max_restarts`` times with
    restart-interrupted sessions inside the last ``window_seconds``.

    Reads the persisted boot log written by
    ``record_restart_interrupted_boot`` and counts boots within the window.
    Fails OPEN (returns False) on any error: a broken breaker must never
    wedge a healthy gateway.
    """
    if max_restarts <= 0:
        return False
    ts = time.time() if now is None else now
    cutoff = ts - max(1, window_seconds)
    try:
        recent = [t for t in _load_boots() if t >= cutoff]
    except Exception:  # pragma: no cover - _load_boots already guards
        return False
    return len(recent) >= max_restarts


def clear() -> None:
    """Remove the persisted boot log (used on clean shutdown / by tests)."""
    try:
        _state_path().unlink(missing_ok=True)
    except OSError:
        pass


def check_and_record(
    max_restarts: int = DEFAULT_MAX_RESTARTS,
    window_seconds: int = DEFAULT_WINDOW_SECONDS,
    *,
    now: Optional[float] = None,
) -> bool:
    """Record this restart-interrupted boot and report whether the loop is now
    tripped.

    This is the single entry point the gateway calls: it appends the current
    boot, then checks whether the (now-updated) window has reached the
    threshold. Returns True when auto-resume should be SKIPPED to break the
    loop.
    """
    boots = record_restart_interrupted_boot(window_seconds, now=now)
    tripped = len(boots) >= max_restarts if max_restarts > 0 else False
    if tripped:
        logger.warning(
            "Restart-loop breaker TRIPPED: %d restart-interrupted gateway "
            "boots within %ds (threshold %d). Skipping auto-resume to break "
            "a suspected SIGTERM-respawn loop (#30719). Restart-interrupted "
            "sessions stay resume-pending and will continue on the next real "
            "user message. If this is a false positive, delete %s.",
            len(boots),
            window_seconds,
            max_restarts,
            _state_path(),
        )
    return tripped
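
To see how the rolling window behaves with the default 3 boots / 60s, here is a self-contained sketch that mirrors the prune-append-compare logic of `check_and_record` using an in-memory list instead of the persisted JSON file. The helper names `prune_and_append` and `is_tripped` are hypothetical and exist only for this illustration; timestamps are passed in explicitly, much as the `now` keyword allows in the module.

```python
from typing import List


def prune_and_append(boots: List[float], now: float, window_seconds: int = 60) -> List[float]:
    """Drop boots older than the window, then record the current boot."""
    cutoff = now - max(1, window_seconds)
    kept = [t for t in boots if t >= cutoff]
    kept.append(now)
    return kept


def is_tripped(boots: List[float], max_restarts: int = 3) -> bool:
    """A non-positive threshold disables the breaker."""
    return max_restarts > 0 and len(boots) >= max_restarts


# Simulated respawn loop: a restart-interrupted boot every 10 seconds.
boots: List[float] = []
loop_results = []
for now in (0.0, 10.0, 20.0):
    boots = prune_and_append(boots, now)
    loop_results.append(is_tripped(boots))

# Simulated operator restarts spaced 90 seconds apart.
boots = []
spaced_results = []
for now in (0.0, 90.0, 180.0):
    boots = prune_and_append(boots, now)
    spaced_results.append(is_tripped(boots))
```

In this sketch the third boot of the 10-second loop trips the breaker, while restarts spaced 90 seconds apart never accumulate more than one entry in the window and so never trip it.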