Once profile isolation is active (multiple gateway profiles or room->profile
multiplexing), get_secret() fails closed outside an installed scope. The cron
ticker fires jobs from a thread with no per-turn scope, so run_job() died in
resolve_runtime_provider() with UnscopedSecretError (e.g. for
OPENROUTER_BASE_URL / CUSTOM_BASE_URL) before model selection - every cron
job failed while interactive turns worked fine.
Wrap run_job() in set_secret_scope(build_profile_secret_scope(...)) with a
finally-reset, mirroring the proven per-turn pattern in gateway/run.py
(_profile_runtime_scope). Single-profile installs are unaffected (the scope
is just the profile's own .env).
tests/cron: 611 passed, 1 pre-existing unrelated failure
(TestRoutingIntents::test_all_token_case_insensitive fails identically on
unmodified main in a full-suite run and passes in isolation).
A Z.ai desktop user reported thinking reverting to medium after one turn,
burning ~200% of a week's credits in 4 days despite reasoning_effort: false
in config.yaml. Four compounding bugs:
- _session_info reported reasoning_effort "" for disabled reasoning,
indistinguishable from unset — the desktop adopted it after the first
turn, wiping its sticky "thinking off" pick so every later chat
reverted to the default effort.
- config.set key=reasoning always wrote agent.reasoning_effort to global
config.yaml, so every desktop model-menu selection (preset.effort ??
'medium') clobbered the user's configured value. Now session-scoped
like the messaging gateway's /reasoning, landing on
create_reasoning_override so lazily-built sessions keep it too.
- YAML `reasoning_effort: false`/`off`/`no` (boolean False) was coerced
to "" by every loader's `str(x or "")`, silently re-enabling thinking.
parse_reasoning_effort now treats False/"false"/"disabled" as
{"enabled": False}; loaders (tui gateway, gateway, cli, cron,
delegate) pass the raw value through. The desktop config reader also
crashed on the boolean (false.trim()), aborting voice/STT settings.
- The zai provider profile never sent thinking on the wire, and GLM-4.5+
defaults to thinking ON server-side — so disabling reasoning was a
silent no-op on direct Z.ai, the actual token burner. The profile now
emits extra_body.thinking {"type": "enabled"|"disabled"} for
thinking-capable GLM models, mirroring the DeepSeek profile.
Also: /new (session reset) now carries reasoning_config across the
rebuild like model_override; config.get reasoning prefers the session's
live value and maps a config False to "none"; Settings shows "Off"
instead of a blank select for hand-written false.
The standalone thread-pool fallback in _deliver_result() runs inside the
`except RuntimeError:` block (taken when asyncio.run() sees a running loop).
When future.result() raised there (SMTP ConnectionError, timeout, etc.), the
exception was NOT caught by the sibling `except Exception:` — it escaped
_deliver_result() and crashed the whole delivery loop, silently skipping every
remaining target. Multi-target delivery (e.g. deliver: 'email:a,email:b') is a
documented feature, so this broke a promised contract.
Wrap the fallback in its own try/except so a per-target failure is logged with
exc_info and the loop continues to the next target.
Fixes#47163
The in_channel surface DOES add a seed: _seed_cron_channel_session CREATES
the flat (platform, chat_id, None) session and mirrors the brief into it,
because mirror_to_session only APPENDS to an existing session and the flat
channel row is otherwise absent for a chat_postMessage delivery. Correct
the scheduler thread-skip comment and the test class docstring, which still
described the earlier 'let the existing mirror seed it' design.
Live testing exposed a real bug: an in_channel continuable cron delivered
flat to the channel (✅) but the reply did NOT continue the job — the bot
had no brief in context and confabulated the answer.
Root cause: mirror_to_session only APPENDS to a session that already
exists (_find_session_id → no-op when none matches); it never CREATEs one.
A flat (slack, chat_id, None) row is only created when a human posts a
top-level message the bot processes — a cron chat_postMessage delivery
never goes through the inbound handler, so the row is absent and the brief
is silently dropped. The prior impl relied on the bare mirror (F5/OQ-1
concluded "deletion only" — wrong).
Fix: _seed_cron_channel_session mirrors _seed_cron_thread_session —
get_or_create_session FIRST (chat_type = "dm" if is_dm else "group",
thread_id=None), keyed to the ORIGIN USER'S id, then mirror. The channel
session key embeds user_id (…:group:<chat>:<user>), so a system:cron id
would key the seed away from the reply; the origin user's id makes seed
key == inbound reply key. DM key ignores user_id but needs chat_type=dm
to match the prefix. Wired into the in_channel branch after delivery;
suppresses the generic mirror to avoid double-write.
DM validated (per request): the seeded key equals the inbound DM reply key
for a 1:1 DM; continuation works there too.
Tests:
- Rewrote the in_channel tests to use a real _session_store and the origin
user_id; assert get_or_create_session is called with the flat, correctly-
keyed source. Prove-fail: (a) reverting the create step and (b) seeding
with system:cron each turn a targeted test RED; restore → GREEN.
- +2 direct _seed_cron_channel_session unit tests asserting the KEY-MATCH
invariant (seed key == inbound reply key) via build_session_key, for both
channel and DM.
- Rewrote tests/manual/cron_inchannel_e2e.py to drive a REAL SessionStore +
real mirror_to_session + real _find_session_id + real build_session_key
(no session-layer mocks — the old mocked E2E is exactly why the bug
shipped). Asserts the brief lands in the transcript and the reply resolves
to the same session, for BOTH channel and 1:1 DM.
Full relevant sweep: 283 passed.
Add a per-platform `cron_continuable_surface` extra key
(`thread` default | `in_channel`) so a continuable cron job can deliver
FLAT into a Slack channel — no dedicated thread — and still be
replied-to. In `in_channel` mode the scheduler skips the thread-open
branch (leaves `thread_id=None`); the shipped origin-mirror then seeds
the `(slack, chat_id, None)` shared-channel session — the same bucket
`reply_in_thread: false` routes inbound channel replies to — so a plain
channel reply continues the job in context.
Design: specs/cron-inchannel-continuable (D1–D7, F5). Model B
(shared-channel session), NOT anchoring to the delivery `ts` — on Slack
replying to a specific message IS threading, so a `ts` anchor would only
relocate the thread, never deliver true threadless continuable.
- gateway/platforms/base.py: `supports_inchannel_continuable` capability
flag (default False → unsupported platforms fail SAFE to `thread`).
- plugins/platforms/slack/adapter.py: flag=True; `_cron_continuable_surface()`
resolver (coerces to the two-value enum); `_warn_if_inchannel_without_flat_reply`
connect-time warning (D5: warn, not hard-require — the misconfig fails safe).
- gateway/config.py: shared-key bridge line (top-level OR nested config).
- cron/scheduler.py: read the key generically from platform config, gate
the `in_channel` branch on the adapter capability flag, skip thread-open.
No new seed function (reuses the existing mirror — G6).
Pairing (docs): `in_channel` + `reply_in_thread: false` +
`require_mention: false` (or a free-response channel). Missing
`reply_in_thread: false` fails safe to a threaded continuation.
Gateway-side config flag — `/restart` to apply; NO Slack app reinstall.
Tests (from inside the worktree, PYTHONPATH=$PWD):
- +6 cron scheduler tests (in_channel skips thread-open; seeds flat
channel session with thread_id=None; thread-mode regression;
fail-safe on unsupported platform; value coercion). Prove-fail:
removing the `and not in_channel_surface` guard turns the two
load-bearing tests RED; restore → GREEN.
- +10 slack resolver/capability/warning tests; +2 config-bridge tests.
- tests/manual/cron_inchannel_e2e.py: offline E2E driving BOTH real
legs (delivery seed + inbound reply keying) → both converge on
(slack, C, None).
- No regressions: test_slack.py 216 passed alone; broader sweep green
(4 pre-existing cross-file-ordering failures reproduce identically on
pristine origin/main).
Docs: cron.md + slack.md + zh-Hans mirrors of both.
Rework follow-up on the per-job TERMINAL_CWD readers-writer lock.
The lock was acquired BEFORE the try: whose finally: is the only release
site, with the env-override statements (os.environ[TERMINAL_CWD] = workdir;
logger.info) sitting in the unprotected window between acquire and try. Any
exception there — a raising log handler, an os.environ error, a thread
interrupt — propagated out of run_job WITHOUT running the finally, leaking
the lock. A leaked writer permanently deadlocks the whole scheduler (every
future cron job blocks on acquire_*); a leaked reader blocks all writers.
- Snapshot _prior_terminal_cwd before the acquire (so the finally can always
restore env even if the body raises before the override).
- Open the try: immediately after acquire and move the env-override lines
inside it, so the existing finally always releases the lock.
- Add a mutation-verified regression test: a workdir job whose in-window
logger.info raises must still release the writer lock (a subsequent
acquire_write must not block).
A cron job with a per-job `workdir` overrides the process-global
`os.environ["TERMINAL_CWD"]` for the entire duration of its agent run and
restores it afterwards. The scheduler dispatches workdir jobs on a
single-thread sequential pool and workdir-less jobs on a separate parallel
pool, and the in-code comments claimed this made the override safe.
That only prevents two workdir jobs from overlapping each other. The two
pools run concurrently in the same process and share `os.environ`, so while
a workdir job has `TERMINAL_CWD` pointed at its project directory, any
workdir-less job firing in the same window reads that same global through the
terminal, file, and code-exec tools and runs its commands in the wrong
directory. The corruption window spans the whole workdir-job run, and a file
write or delete can land in another job's tree.
This serializes the override with a writer-preferring readers-writer lock.
Workdir jobs acquire it as writers (exclusive for their whole run); workdir-
less jobs acquire it as readers, so they still run in parallel with each
other but never alongside a workdir job's override. The guarantee is based on
run overlap rather than tick boundaries, so it also holds when a workdir job
spans ticks.
## What does this PR do?
Fixes a directory-isolation bug in the cron scheduler: a workdir cron job's
process-global `TERMINAL_CWD` override could be observed by a concurrently
running workdir-less cron job, causing that job's shell/file/code-exec
commands to execute in the wrong directory.
## Related Issue
N/A
## Type of Change
- [x] 🐛 Bug fix (non-breaking change that fixes an issue)
- [ ] ✨ New feature (non-breaking change that adds functionality)
- [ ] 🔒 Security fix
- [ ] 📝 Documentation update
- [ ] ✅ Tests (adding or improving test coverage)
- [ ] ♻️ Refactor (no behavior change)
- [ ] 🎯 New skill (bundled or hub)
## Changes Made
- `cron/scheduler.py`: add `_ReadWriteLock` (writer-preferring) and the
module-global `_terminal_cwd_lock`.
- `cron/scheduler.py`: in `run_job`, acquire the lock as a writer for workdir
jobs and as a reader for workdir-less jobs, spanning the `TERMINAL_CWD`
override and its restore in the `finally` block.
- `cron/scheduler.py`: correct the stale comments in `run_job` and `tick` that
claimed the sequential pool alone made the override safe.
- `tests/cron/test_terminal_cwd_lock.py`: new tests for reader concurrency,
writer exclusion, and the no-cross-observation regression.
## How to Test
1. `python -m pytest tests/cron/test_terminal_cwd_lock.py -q` — the regression
test `test_reader_never_observes_writer_override` fails without the lock and
passes with it.
2. `python -m pytest tests/cron/test_cron_workdir.py tests/cron/test_parallel_pool.py -q`
— confirms the existing `TERMINAL_CWD` set/restore and pool behaviour are
unchanged.
## Checklist
### Code
- [x] I've read the Contributing Guide
- [x] My commit messages follow Conventional Commits (`fix(scope):`, etc.)
- [x] I searched for existing PRs to make sure this isn't a duplicate
- [x] My PR contains only changes related to this fix
- [x] I've run the affected `tests/cron/` suites and all tests pass
- [x] I've added tests for my changes (required for bug fixes)
- [x] I've tested on my platform: macOS 15 (Darwin 25.5)
### Documentation & Housekeeping
- [x] I've updated relevant documentation (docstrings/comments) — or N/A
- [x] I've updated `cli-config.yaml.example` if I added/changed config keys — N/A
- [x] I've updated `CONTRIBUTING.md` or `AGENTS.md` if I changed architecture — N/A
- [x] I've considered cross-platform impact (Windows, macOS) — uses stdlib `threading` only
- [x] I've updated tool descriptions/schemas if I changed tool behavior — N/A
Completes the #30719 restart-loop defenses. Defenses 1-2 (the
_HERMES_GATEWAY guard on `hermes gateway stop|restart` + terminal_tool,
and the cron-creation lifecycle filter) already landed on main, but two
gaps remained:
- The agent's `cronjob` model tool calls cron.jobs.create_job directly,
bypassing the hermes_cli.cron.cron_create CLI filter, so lifecycle
commands scheduled via the model tool were only blocked at execution
time (terminal_tool), not at creation. Moved the filter to a shared
cron/lifecycle_guard.py enforced at create_job — the single chokepoint
every job-creation path hits (CLI + model tool). Re-exported
_contains_gateway_lifecycle_command from hermes_cli.cron so
terminal_tool's import keeps working.
- No breaker for the auto-resume loop itself. Defenses 1-2 cover the
cron/CLI/terminal paths, but any other SIGTERM source (e.g. a raw
terminal("launchctl kickstart ai.hermes.gateway")) still triggers the
boot->auto-resume->re-run cycle. Added gateway/restart_loop_guard.py:
counts restart-interrupted boots in a rolling window (config
gateway.restart_loop_guard, default 3 boots / 60s) and skips
auto-resume for that boot once tripped. The gateway still comes up and
serves real inbound messages; it just stops replaying the session that
keeps killing it, putting a human back in the loop.
Also tightened the lifecycle regex over main's version: dropped
`hermes gateway start` (benign), required the gateway identifier on the
launchctl/systemctl branches (so `launchctl unload
ai.hermes.update-checker.plist` and `systemctl restart
hermes-meta.service` no longer false-positive), added the inverse
pkill token order, and fixed the binary-script bypass (decode with
errors='replace' instead of swallowing UnicodeDecodeError). The
create_job guard resolves relative script paths under HERMES_HOME/scripts
the same way the scheduler does, so a bare script name is scanned as the
file that actually runs.
Design and much of defense-2 originate from PR #33395 (@kshitijk4poor),
which itself salvaged #30728 (@SimoKiihamaki). Rebuilt against current
main since defenses 1-2 had already landed under different names.
Closes#30719.
Co-authored-by: SimoKiihamaki <simo.kiihamaki@gmail.com>
Co-authored-by: kshitijk4poor <82637225+kshitijk4poor@users.noreply.github.com>
Addresses egilewski (Codex) CR on PR #52351: the run_job() credential-exfil
backstop caught every exception around _validate_cron_base_url() and set
err = None, so an unexpected validator/import error let an unvetted stored
provider/base_url pair reach resolve_runtime_provider() — the very sink this
checkpoint exists to guard. A synthetic validator-exception probe with a
legacy custom:legit + off-host base_url job slipped through (validator_exception
ALLOW).
Now fail closed: if the validator raises and the job carries a base_url
override (the exfil precondition), refuse the run. A job with no base_url
override can't exfiltrate via this path — the validator would return None — so
it still runs, keeping the common no-override jobs from wedging on an unrelated
error. Operator fallback providers come from config, not the job, so they are
unaffected.
Adds two regressions: validator-exception + base_url -> blocked;
validator-exception without base_url -> still allowed.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The model-facing cronjob tool accepts free-form provider + base_url. On fire,
the scheduler pairs the named provider's stored credential with the job's
base_url, so a prompt-injected job (e.g. provider=anthropic,
base_url=https://attacker/v1) sends the real API key to an attacker endpoint. A
base_url with no provider inherits the default provider's key for the same
effect.
Add a fail-closed guard at the tool boundary: a base_url override is allowed
only for the custom/BYOK sentinel, a configured custom_providers entry, or when
the override host matches the named provider's own endpoint; an override without
an explicit provider is rejected. The trust boundary is the caller, so
operator-configured base_urls for named providers are unaffected.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A finite one-shot cron job whose side effect kills the tick (gateway
suicide, OOM, segfault, hard-timeout) re-fired forever: mark_job_run —
which increments repeat.completed and removes the job — runs AFTER the
job, so an abrupt tick death never records completion and every
supervisor relaunch re-dispatches the job (#38758).
Commit the dispatch BEFORE the side effect:
- claim_dispatch() increments repeat.completed under the cross-process
jobs lock and persists it before run_job(), converting finite
one-shots from at-least-once to at-most-times.
- Called from run_one_job (the shared body used by BOTH the built-in
ticker and the external Chronos fire_due path) before run_job.
- mark_job_run skips the increment for pre-claimed one-shots (no
double-count) and still removes at the limit.
- get_due_jobs drops a stale one-shot already at its dispatch limit so
a job claimed-but-not-cleaned-up after a crash stops appearing as due.
- No-op for recurring jobs (advance_next_run) and infinite/no-repeat
one-shots; a handed-in job dict absent from the store proceeds.
Closes#38758
Cron's job runner was the last entry point still reading
fallback_providers/fallback_model as an either/or, silently dropping the
legacy fallback_model when fallback_providers was set. Every other entry
point (cli, gateway, oneshot, fallback_cmd, tui_gateway, auxiliary_client)
already merges both keys via get_fallback_chain(). This aligns cron with
them at both call sites: the auth-fallback resolution loop and the
AIAgent(fallback_model=...) argument.
Co-authored-by: xxxigm <tuancanhnguyen706@gmail.com>
Two live cron bugs, both surfaced by @banditburai in #35616 (whose larger
watchdog/supervisor work is already superseded by the CronScheduler provider
refactor on main):
- #32896: `cron list` crashed on a present-but-null `deliver` field —
`job.get("deliver", ["local"])` returns None for an explicit null, which
then hit `", ".join(None)`. Coalesce with `or ["local"]` (same pitfall
the sibling `repeat` line already guards against).
- #33465: cron jobs 401'd on Bitwarden/BSM-backed secrets. The per-run env
reload used a bare `load_dotenv(override=True)`, which re-applied only the
.env placeholder — startup had already recorded this HERMES_HOME in
env_loader._APPLIED_HOMES, so the external-secret re-pull no-oped. Route the
reload through load_hermes_dotenv() and call reset_secret_source_cache()
first to force the re-pull (Bitwarden's 300s value-cache keeps it off the
network; override honours secrets.bitwarden.override_existing, mirroring
startup).
Tests: null-deliver regression guard in test_cron.py; reset-before-reload
ordering guard in test_scheduler.py. Migrated 31 scheduler-reload test seams
from patching dotenv.load_dotenv to the new load_hermes_dotenv /
reset_secret_source_cache seam.
The handoff seed path inlined its own int(chat_id) > 0 private-chat
check; delivery.py already had the identical heuristic. Promote it to
a public name and reuse it from both sites instead of duplicating.
If redact_sensitive_text() raises or fails to import, stdout/stderr
were silently left unredacted and could leak API keys or tokens into
cron job delivery messages and logs.
Replace bare with a warning log and replace
both outputs with '[REDACTED - redaction failed]' to prevent leaks.
Root cause: silent exception swallow in _run_job_script()
Impact: potential secrets leak in cron job output delivery
Cron pre-run scripts were capped at 120s by default, which surprised
users running long data-collection scripts on crons (the whole point of
crons being to offload long work). Raise _DEFAULT_SCRIPT_TIMEOUT to 3600s
(1 hour).
This bounds the script only — skill/agent jobs already run on a separate
inactivity budget (HERMES_CRON_TIMEOUT, default 600s idle, 0=unlimited),
not a wall-clock cap. Scripts dispatch to a persistent thread pool and do
not hold the tick lock, so a long script doesn't starve other due jobs.
Docs clarified to make the script-vs-agent timeout distinction explicit.
env/config overrides (HERMES_CRON_SCRIPT_TIMEOUT,
cron.script_timeout_seconds) unchanged and still take precedence.
The curator's inactivity prune archived any non-pinned agent-created
skill whose activity was older than archive_after_days (90d). A skill
loaded only by a cron job had its usage bumped solely when the job
fired, so paused jobs, infrequent (quarterly/annual) schedules, and
far-future one-shots aged their skills out from under them — the next
run then failed to load the now-archived skill.
- cron/jobs.py: add referenced_skill_names() returning skills used by
ANY job (incl. paused/disabled).
- curator.apply_automatic_transitions(): skip cron-referenced skills
like pinned; add a use=0 grace floor so a never-used skill is not
marked stale/archived until it is at least stale_after_days old.
- LLM review pass: candidate list marks cron=yes; prompt forbids
pruning cron-referenced skills and never-used skills under 30 days.
Tested E2E against a real cron job + real usage records and with 4 new
unit tests.
* fix(windows): stop terminal-window popups from background spawns
Native-Windows desktop/gateway users saw cmd/conhost windows flash on
gateway restart, image paste, the dashboard Projects tree, voice notes,
and ~5 min after closing the app (detached cron). Two root causes:
- Console-subsystem exes (taskkill, schtasks, wmic, netstat, tasklist,
agent-browser, git, ffmpeg, powershell, git-bash) spawned via raw
subprocess allocate a fresh console when the launching process has
none (pythonw desktop backend / detached gateway) - even with output
captured.
- uv venv pythonw shims re-exec console python.exe, so Python children
get a console regardless of how they're launched.
Fixes:
- Single hidden-spawn primitive (_subprocess_compat.run/.popen) that ORs
CREATE_NO_WINDOW on Windows, no-op on POSIX. Route every Hermes-owned
console-exe spawn through it.
- FreeConsole() catch-all in hermes_bootstrap: any Python child that
exclusively owns an auto-allocated console detaches it at startup
(GetConsoleProcessList()==1 gate leaves shared interactive consoles
untouched).
- Replace PowerShell/wmic gateway PID scans with in-process psutil.
- Skip schtasks queries on non-interactive desktop restarts.
- Prefer native agent-browser .exe over .cmd shims.
- Guard test bans raw subprocess spawns of the Windows-only console
tools repo-wide so the popup class can't regress.
* fix(windows): scope FreeConsole to background entry points; fix merge fallout
Console detach review (per #53810 feedback): GetConsoleProcessList()==1 can't
tell a uv pythonw->python phantom console apart from a user opening the
interactive CLI/TUI in its own fresh console (double-click, shortcut, ConPTY) —
both report a single attached process with a tty. Running FreeConsole() in the
import-time bootstrap therefore risked detaching a legitimately-interactive
terminal.
- Extract FreeConsole into explicit hermes_bootstrap.detach_orphan_console();
remove it from apply_windows_utf8_bootstrap() (import side effect).
- Call it only from known background mains: gateway run, dashboard backend
(start_server, what the desktop spawns), cron standalone, tui_gateway entry,
slash worker. Interactive CLI/TUI never calls it.
- Behavior-contract tests: frees only when solo owner, leaves shared console,
no-op without console / on POSIX, and asserts it's not an import side effect.
Merge fallout from origin/main (#53791):
- local.py: 3-way merge left a dangling **_popen_kwargs (NameError crashing
every terminal init). _subprocess_compat.popen already hides the window, so
drop it.
- discord adapter: merge stacked an undefined windows_hide_flags() onto the
primitive call; drop the redundant arg.
- test_gateway: scan now goes psutil-first (zero spawn); rewrite the
case-variant test to drive that production path.
* test(claw): mock _subprocess_compat.run seam for Windows process scan
claw.py's Windows tasklist/powershell scan routes through the hidden-spawn
primitive; the tests still patched claw_mod.subprocess, so on win32 the mock
was never hit and real spawns returned nothing. Patch the actual seam.
A profile's cron jobs now provably live in AND execute under that profile's
HERMES_HOME. A job authored under profile `coder` is stored at
`~/.hermes/profiles/coder/cron/jobs.json` and runs with coder's .env,
config.yaml, scripts and skills — never the default root's.
This was the de-facto behavior on main but only by accident: PR #50112 had
re-anchored cron storage at the shared default root, and a later stale-branch
squash merge (#52147) silently reverted it back to the profile home. Neither
direction was guarded by a test, so it could flip again on the next stale merge.
Changes:
- cron/jobs.py: document the per-profile storage anchor (get_hermes_home, NOT
get_default_hermes_root) and why anchoring at the root leaks
config/credentials/skills across profiles — the #4707 security boundary.
- cron/scheduler.py, cron/suggestions.py: same intent documented at the
dynamic resolution helper and the suggestions store.
- tests/cron/test_cron_profile_isolation.py: pin storage, lock-path, and
execution-home resolution to the active profile so a re-anchor can't regress.
Verified E2E: jobs created under two profiles land in separate per-profile
stores with zero cross-profile leakage and no shared-root store; scheduler
execution-home follows the active profile. Full cron suite: 576/576.
* fix(cron): add default retention to per-run job output to bound disk usage (#52383)
Per-run cron output (cron/output/<job>/<timestamp>.md) is written once
per execution and was never pruned, so a frequently-scheduled job on
a long-running deploy accumulates one file per run indefinitely and
can fill the volume ('no space left on device').
save_job_output() now keeps the most recent N output files per job and
removes older ones. N defaults to 50 and is configurable via
cron.output_retention; a non-positive value disables pruning for
operators who manage cleanup externally.
Salvaged from #52402 by @0xDevNinja.
Closes#52383
* fix(config): add cron.output_retention to DEFAULT_CONFIG
Follow-up to #52383: the retention config key was functional via
get()-with-default but missing from DEFAULT_CONFIG, so the deep-merge
wouldn't auto-populate it for new installs. Add it explicitly.
---------
Co-authored-by: 0xDevNinja <manmit0x@gmail.com>
Scheduled jobs delivering to Telegram/etc. started posting a literal
'⚠️ No reply: the model returned empty content…' message instead of
staying silent. Two interacting causes:
1. The turn-completion explainer (#34452) replaces an empty model turn
with a user-facing '⚠️ No reply…' string. In a cron context that is
not a silence marker, so the scheduler delivered it — a regression
from the previously-silent empty turn. run_job now detects the
explainer text deterministically (via the same formatter that
produced it) for abnormal-empty turn_exit_reasons and strips it to
empty, so the existing empty-response suppression + soft-fail guard
apply. The explainer is unchanged on CLI/gateway.
2. The cron suppression used a loose 'SILENT_MARKER in ...upper()'
substring check. It leaked bracketless near-markers the model emits
('SILENT', 'NO_REPLY', 'NO REPLY' — #51438, #46917) and wrongly
swallowed a real report that merely quoted '[SILENT]' mid-sentence.
Replaced with _is_cron_silence_response(): suppresses a canonical
token as the whole response, its own first/last line, or the
documented bracketed '[SILENT] <note>' prefix — while a token buried
mid-sentence in a genuine report is delivered. Preserves the
intentional cron trailing/prefix tolerance (existing tests unchanged).
Tests: bracketless-variant suppression, mid-sentence-quote delivery,
direct matcher contract, and explainer-strip + defensive real-report
delivery.
Addresses review on #51077 (kxee). The continuable-cron mirror reused
gateway.mirror.mirror_to_session, which writes role=assistant — re-
introducing the exact alternation violation #2313 (37a997945)
deliberately removed: a cron brief landing as assistant after the
agent's last turn yields assistant->assistant, which breaks strict-
alternation providers (OpenAI/OpenRouter) per issue #2221. The mirror/
mirror_source metadata is also dropped at the SQLite boundary, so the
[Delivered from cron] label is lost on replay.
This is an intentional, opt-in (default OFF) reversal of #2313's
'cron output does not belong in interactive history' for the reply-to-
cron use case — gated behind cron.mirror_delivery / attach_to_session.
Fixes:
- mirror_to_session gains a role param (default 'assistant' — interactive
send_message mirror unchanged, it IS the agent speaking). Cron paths
pass role='user' with a '[Cron delivery: <task>]' prefix so the brief
collapses via repair_message_sequence's consecutive-user merge on every
provider, and stays distinguishable on replay despite the metadata drop.
- thread_seeded: defer seeding + the flag until delivery into the new
thread actually succeeds. Previously set pre-delivery, so an open-
succeeds / deliver-fails case both stranded a seeded-but-unseen brief
AND suppressed the DM-fallback mirror.
- seed mirror now passes user_id='system:cron' to resolve the exact
thread-keyed session row it just created.
- dedupe the duplicate BasePlatformAdapter import in _deliver_result.
- trim oversized docstrings to non-obvious WHY (AGENTS.md).
- docs: document cron.mirror_delivery / attach_to_session in
website/docs/user-guide/features/cron.md.
- test: assert the cron mirror writes role='user' with the label prefix.
204 cron+mirror tests pass.
Continuable cron jobs (attach_to_session / cron.mirror_delivery, default
OFF) now prefer a dedicated thread on thread-capable platforms, falling
back to origin-DM mirroring where threads don't exist.
- Thread-capable (Telegram topics, Discord/Slack threads): open a fresh
thread for the job via the shipped adapter.create_handoff_thread,
route the brief into it, and seed the thread-keyed session so the
user's in-thread reply continues with full context. This is the
'continuable cron opens its own thread' interface.
- DM-only (WhatsApp/Signal/SMS): create_handoff_thread returns None ->
fall back to mirroring into the origin DM session (existing behaviour).
Reuses existing infrastructure end-to-end — no new adapter surface, no
provider-chain signature change:
- adapter.create_handoff_thread (already implemented per-platform,
returns None on unsupported platforms = the fallback signal)
- the live SessionStore via adapter._session_store (already set on every
adapter), reached without threading a new param through the frozen
CronScheduler.start() contract
- gateway.mirror.mirror_to_session for the seed/append
- existing per-target delivery routing carries the new thread_id for free
Mirrors GatewayRunner._process_handoff's open-thread-or-fallback +
seed pattern, standalone for the cron delivery path. thread_seeded
guards against a double-mirror after seeding. Scoped to the origin
target only; fan-out/broadcast targets are never threaded or mirrored.
Config docs updated (cron.mirror_delivery) + cronjob tool
attach_to_session description reframed around continuable/thread-preferred.
Tests: +5 (thread id returned on thread platform; None on DM platform;
None without capability/loop; seed creates thread session + mirrors;
seed no-op on empty). 22/22 in TestCronDeliveryMirror; 532 cron tests
pass (4 failures pre-existing: croniter-not-installed + TZ).
Multi-participant parity with interactive send_message, which passes
HERMES_SESSION_USER_ID to gateway.mirror.mirror_to_session so the mirror
lands in the exact participant's session.
- cronjob_tools._origin_from_env now captures user_id from the session
context at job-create time (alongside platform/chat_id/thread_id).
- _maybe_mirror_cron_delivery forwards user_id to mirror_to_session.
- _deliver_result threads origin.user_id through for the origin target.
Effect: in a per-user-isolated group chat (group_sessions_per_user=True,
the default), the mirror resolves to the member who scheduled the job
instead of conservatively no-op'ing on ambiguous candidates. DMs and
shared group/thread sessions are unaffected (single candidate). Default
still OFF.
Tests: helper forwards user_id; E2E _deliver_result forwards origin
user_id. 17/17 in TestCronDeliveryMirror; 527 cron tests pass (4 failures
pre-existing: croniter-not-installed + TZ, identical on baseline).
The cron->session mirror now fires ONLY for the delivery target that
equals the job's origin (platform+chat_id[+thread_id]). A job created
from a live gateway chat stamps that chat as origin, and that session is
guaranteed to exist (it is the conversation the user scheduled the job
in). Fan-out / broadcast / home-channel-fallback targets are never
mirrored: they are not a continuation of a conversation and may have no
session at all.
This makes the prior 'cold-start session seeding' concern a non-case by
construction: when the mirror semantically applies the session exists;
when none exists the target was never the origin, so we no-op.
Adds _target_matches_origin() + origin-scoping tests (exact match,
other-chat/other-platform/no-origin rejection, thread scoping, fan-out
mirrors only the origin target).
Adds an opt-in path so a cron job's delivered output is also appended to
the TARGET chat's gateway session transcript (as an assistant turn), so a
user reply to a recurring delivery (daily brief, reminder) is answered with
the delivery in context instead of 'what is that?' amnesia.
- Reuses the shipped gateway.mirror.mirror_to_session — the same primitive
interactive send_message mirroring already uses. No messaging-toolset
change (cron still can't call send_message; this rides delivery).
- Gated: per-job attach_to_session overrides global cron.mirror_delivery
(config.yaml). Default OFF — historical isolation preserved byte-for-byte.
- Mirrors the CLEAN agent output, not the cron header/footer wrapper.
- Alternation/cache-safe: append lands at a turn boundary, never mid-loop,
never mutates the cached system prompt. Cold-start (no target session)
is a silent no-op; mirror errors never fail a successful delivery.
- Surfaced on the cronjob tool (attach_to_session) + config schema.
Driven by enterprise cron-as-control-plane use case. 10 new tests; full
cron + cronjob-tool suites pass (600).
A naive ISO timestamp (e.g. 2026-06-22T20:07:00) was anchored to the
server's local timezone via dt.astimezone(), but the due-check
(get_due_jobs -> _hermes_now()) runs in the CONFIGURED Hermes timezone.
When the two diverge (cloud host on UTC with a different timezone: set,
or vice-versa) the stored instant lands hours off the user's wall-clock
intent, so one-shots never become due and recurring jobs fire at the
wrong time. The ticker stays healthy (heartbeat + success markers fresh)
because every tick finds nothing due, matching the silent no-fire in #51021.
Anchor naive timestamps to _hermes_now().tzinfo so '20:07' means 20:07 on
the same clock the scheduler checks against. The legacy _ensure_aware path
still treats already-stored naive values as server-local for back-compat.
Fixes#51021
* Revert "fix(cron): scope job execution to its owning profile (#32091 follow-up) (#50993)"
This reverts commit 660e36f097.
* Revert "fix(cron): anchor cron storage at the default root home (not the active profile)"
This reverts commit a5c09fd176.
The #32091 fix moved every profile's cron jobs into one shared root store,
but never wired the execution-scoping half it recommended: a job still ran
under whichever profile's ticker picked it up, not its owning profile. So a
job created under `hermes -p donna` could execute with the root profile's
.env / config.yaml / credentials.
- jobs.py: create_job auto-captures the active profile (explicit profile=
override available) and stores it on the job; resolve_profile_home() maps a
profile name to its HERMES_HOME; legacy jobs backfill to 'default'.
- scheduler.py: run_job applies the job's profile via a scoped HERMES_HOME
override (env var + in-process ContextVar) before any .env/config/script
load, restored in finally. tick() routes profile-mismatched jobs to the
single-worker sequential pool so the env mutation can't race.
- cronjob tool threads profile through (NOT exposed in the model schema, to
avoid cross-profile privilege escalation); hermes cron add gains --profile.
E2E verified against a temp HERMES_HOME with a real profile dir: a root-profile
ticker runs a profile='donna' job with HERMES_HOME=donna during execution and
restores the ticker env afterward.
An unpinned cron job follows the global default provider (config.yaml
model.default + resolve_runtime_provider). If that global state is changed
after the job is created — e.g. a temporary switch to a paid provider like
nous/claude-fable-5 — the job silently inherits it on its next tick and spends
real money. This is the reported $7.73 incident: a job created under a
free/default provider later inherited a temporary paid switch.
Fix (ask #1 only) preserves the legitimate "unpinned job should follow
model.default" use case by detecting *drift* rather than freezing the model:
- create_job (cron/jobs.py): for UNPINNED, agent-backed jobs (no explicit
provider, not no_agent), snapshot the provider that resolution WOULD pick
right now into a new optional `provider_snapshot` field, resolved via the
same resolve_runtime_provider() path the ticker uses. Fail-open to None on
any resolution error so job creation never breaks.
- run_job (cron/scheduler.py): right after runtime resolution, if the job has
a provider_snapshot AND is unpinned AND the currently-resolved provider
DIFFERS from the snapshot, fail closed for that run — make no paid call and
deliver a loud, actionable alert naming both providers and telling the user
to pin explicitly (`cronjob action=update job_id=.. provider=..`).
Back-compat: jobs with no snapshot (pre-existing jobs, no_agent jobs, or any
job whose creation-time resolution failed) behave exactly as before — the
guard only engages when a snapshot exists. Explicitly-pinned jobs (job.provider
set) are unaffected since they don't drift with global state.
Tests: tests/cron/test_cron_provider_pin.py covers snapshot-matches (runs),
snapshot-differs (fail closed, no agent constructed), no-snapshot back-compat,
None-snapshot back-compat, explicitly-pinned (runs regardless), plus create_job
snapshot capture/skip/fail-open. The fail-closed case is load-bearing (fails
without the guard).
Issue #44585 asks #2-4 (hard-stop a running job, gateway-stop containment,
fail-closed on provider mutation) are out of scope for this change.
A cron job that sets `enabled_toolsets` to a list of *native* toolsets (e.g.
`["web", "terminal"]`) silently got ZERO MCP tools, while a job with no
per-job list got every globally-enabled MCP server. `_resolve_cron_enabled_
toolsets` returned the per-job list verbatim, bypassing the MCP-merge that the
platform-fallback branch performs via `_get_platform_tools`. So
`discover_mcp_tools()` registered the MCP tools into the registry, but
`get_tool_definitions(enabled_toolsets=...)` kept only the named native
toolsets — the agent then rejected every `mcp_*` call as "Unknown tool". (R2
of #23997.)
Fix: `_merge_mcp_into_per_job_toolsets` layers MCP membership onto a per-job
allowlist with the SAME semantics as `_get_platform_tools`:
* `no_mcp` sentinel present -> no MCP servers (sentinel stripped)
* one or more MCP server names already listed -> treat as an allowlist
* otherwise -> union in every globally-enabled MCP server
To avoid duplicating the "which MCP servers are enabled" computation (it
already existed inline in `_get_platform_tools`), this extracts a shared
`enabled_mcp_server_names(config)` helper in `hermes_cli.tools_config` and has
BOTH the gateway/CLI platform resolver and the cron per-job resolver call it —
so every path agrees on MCP membership (extend, don't duplicate).
Note: the issue's *headline* — bare MCP server names rejected, registry never
includes them — was already fixed on main (commits c10fea8d2 + 04918345e,
both before the issue was filed). This PR closes the remaining cron-specific
gap (R2). The `server:*` / `mcp:server` alias-notation rejection (R1) and the
quiet-mode silent-drop (R3) are tracked separately.
Salvaged from #32788 by sherman-yang (credited below). Reworked to reuse the
shared `enabled_mcp_server_names` helper instead of re-implementing the MCP
membership set in cron/scheduler.py.
Fixes#23997
Co-authored-by: sherman-yang <58446328+sherman-yang@users.noreply.github.com>
`cron/jobs.py` resolved `HERMES_DIR`/`JOBS_FILE` from `get_hermes_home()`,
which follows the active profile override. So a job created from a
profile-scoped agent session (`hermes -p myprofile chat`, where the in-process
`cronjob` tool calls `create_job`) was written to
`~/.hermes/profiles/myprofile/cron/jobs.json`, while the profile-less gateway
(`hermes gateway run`) reads only `~/.hermes/cron/jobs.json`. The job was
silently orphaned: `cronjob action=list` from the same profile reported it
healthy (same file), but the gateway ticker never saw it and it never fired.
`last_run_at` stayed null forever. (#32091)
Fix: resolve the cron store from `get_default_hermes_root()` — the
purpose-built "profile-level operations" root that returns `<root>` even when
`HERMES_HOME` is `<root>/profiles/<name>` (and handles Docker/custom layouts).
Now the creator, the gateway scheduler, and the dashboard all agree on a
single jobs.json at the root, so a job created under any profile is visible to
the gateway.
Scope: this is the storage-location half of the fix. Making a job *execute*
under its originating profile's config/skills (a per-job `profile` field +
runtime context scoping, the #48649 sibling) is a separate, riskier change and
will follow as its own PR — keeping this layer minimal and safe.
Salvaged from #32117 by @mohamedorigami-jpg (authorship preserved). The
comprehensive #33839 (@sweetcornna) takes the same Option-A storage approach
and additionally adds the per-job profile execution scoping; this PR lands the
safe storage layer first.
Tests: `tests/cron/test_cron_profile_storage.py` — asserts the store anchors
at `<root>/cron` under a profile HERMES_HOME (not `<profile>/cron`), and is
unchanged when no profile is active. Full `tests/cron/` suite: 511 passed.
Fixes#32091
Co-authored-by: mohamedorigami-jpg <mohamed.origami@gmail.com>
When a recurring job's execution time exceeds `interval + grace`, the
scheduler entered a perpetual "missed → fast-forward → skip" loop and the
job effectively never ran again. A real job (`hermes-upstream-contribution`)
logged 42 consecutive "missed" events over 9 hours without executing once.
Timeline (5-min interval, 150s grace, ~15-min execution):
14:00 due → advance next_run_at→14:05 → run (blocks 15 min)
14:15 finishes
14:16 tick: next_run_at=14:05, elapsed 660s > grace 150s → "missed!"
→ fast-forward to 14:21 → continue (SKIP) → does NOT run
... repeats forever for any job whose runtime > interval+grace.
The `continue` (skip execution) in `_get_due_jobs_locked` was designed to
prevent burst-catchup after *gateway downtime* — don't run 6 missed
instances of a 30-min job on restart. But it wrongly applied to a job that
missed its slot because it was *still running*, not because the gateway was
down.
Fix: keep the fast-forward (so accumulated missed slots are still collapsed
to a single next slot — no burst) but fall through to `due.append(job)` so
the job runs ONCE now. The log message is updated to be honest about the new
behavior ("Running now; next run fast-forwarded to: ...").
Behavior note: a recurring job missed during gateway downtime now also fires
once immediately on restart (rather than waiting for its next natural slot).
This is the intended trade-off — the same "run once, don't burst" rule now
applies uniformly to both downtime-misses and long-execution-misses.
Salvaged from #33318 by @liuhao1024 (authorship preserved). Also addresses
the diagnosis in #33361 (@agent-trivi), which proposed the same one-line fix.
Tests: updates `test_stale_past_due_skipped` →
`test_stale_past_due_runs_once_and_fast_forwards` (the old test encoded the
skip behavior); adds `test_long_execution_does_not_perpetually_defer` as a
direct regression for the production loop; updates the F2e timezone test that
relied on the old skip path. Full tests/cron/ suite: 510 passed.
Fixes#33315
Co-authored-by: liuhao1024 <sunsky.lau@gmail.com>
PR #22410 added three-mode Telegram topic routing to the live message path
(TelegramAdapter.send via the gateway DeliveryRouter), but the cron delivery
path never got it. cron/scheduler.py::_deliver_result sent through the live
adapter with a bare ``{"thread_id": ...}`` and fell back to the standalone
_send_telegram, neither of which addresses Bot API Direct Messages topics
correctly. After Bot API 10.0 (2026-05-08), sending to a private chat with a
bare ``message_thread_id`` is rejected/mis-routed, so cron deliveries to a
private DM topic landed in the General topic instead of the requested lane.
Fix: the cron live-adapter branch now routes the text send through the
gateway's ``DeliveryRouter._deliver_to_platform`` — the same canonical path
live messages use — so it inherits all three Telegram routing modes:
1. Forum/supergroup (negative chat_id) -> message_thread_id
2. Bot API DM topics (private chat_id + numeric topic id) ->
direct_messages_topic_id (the case #22773 reported)
3. Hermes-created named private DM-topic lanes -> ensure_dm_topic +
reply anchor
For mode 2, a private-chat target with a numeric topic id is passed as
``direct_messages_topic_id`` metadata (verified end-to-end:
TelegramAdapter._thread_kwargs_for_send turns it into
``{message_thread_id: None, direct_messages_topic_id: <int>}``), instead of a
bare message_thread_id. Forum/supergroup and home-channel deliveries are
unchanged. The standalone fallback (gateway down) is preserved.
No new config knob and no duplicated routing logic — this reuses the existing
DeliveryRouter rather than reimplementing topic routing in the cron path.
Salvaged from #42051 (stepanov1975) and #23249 (devsart95), which both
diagnosed the missing three-mode routing in the cron/standalone path;
reimplemented onto the canonical DeliveryRouter that landed since those PRs
were opened.
Co-authored-by: Alex <9785479+stepanov1975@users.noreply.github.com>
Co-authored-by: devsart95 <devsart95@gmail.com>
A recurring cron job persists `next_run_at` as an absolute timestamp with a
UTC offset (e.g. `2026-05-19T21:00:00+10:00`). Cron expressions, however,
describe *local wall-clock* intent ("run at 21:00"). When Hermes/system
timezone changes after the timestamp was persisted, the stored instant is
re-interpreted in the new zone: `21:00+10:00` is the instant `13:00+02:00`,
which is `<= now` (13:02+02:00) — so the job fires HOURS EARLY, then
`compute_next_run` advances it via croniter to `21:00+02:00` the same day,
producing a SECOND fire. (#28934, recurrence of #24289.)
`_get_due_jobs_locked` now detects this precise migration case before the
due check: for a `cron` job whose converted instant looks due, whose stored
UTC offset differs from the current zone's, AND whose stored *wall-clock*
time is still in the future (distinguishing a migrated offset from a
genuinely missed run), it recomputes `next_run_at` from the schedule and
skips the early fire — preserving the local wall-clock intent.
Verified against the issue's reproducer: stored `21:00+10` under runtime
`+02:00` at wall-clock `13:02` is rescheduled to `21:00+02` instead of
firing early + again.
Salvaged from #28941 by @Tranquil-Flow (authorship preserved). Chosen over
the alternative approaches (#28951 normalize-to-UTC, #28985 rebase-and-match)
because UTC-normalization does not change the absolute-instant comparison and
so does not fix the early fire, and this guard is the tightest: it only acts
when all four conditions hold and reuses the existing `compute_next_run`.
Fixes#28934
The in-process cron ticker (cron/scheduler_provider.py) caught only
`Exception` and logged at DEBUG, so a `SystemExit`/`KeyboardInterrupt`
raised from a misbehaving provider SDK or agent retry path killed the
ticker thread silently. The gateway PROCESS stayed up, so `hermes cron
status` — which only checks `find_gateway_pids()` — kept reporting
"✓ jobs will fire automatically" while no jobs ever fired (#32612,
#32895).
This makes ticker death survivable and detectable:
- The ticker loop now catches `BaseException` and logs at ERROR with a
traceback, so a single bad tick no longer tears the thread down and
the failure is visible in the gateway log.
- The loop records a heartbeat (`cron/ticker_heartbeat`, epoch seconds)
on startup and after every tick — best-effort, never raised into the
loop. Both ticker entry points (the gateway and the desktop fallback
in web_server.py) funnel through `InProcessCronScheduler.start`, so one
heartbeat site covers both.
- `hermes cron status` now reads the heartbeat age: if the gateway is
running but the heartbeat is stale (> 200s, i.e. several missed ~60s
ticks), it reports the ticker as STALLED and suggests a restart instead
of falsely claiming jobs will fire. A missing heartbeat (older build /
never ran) is treated as "unknown", not "dead".
Adds tests for BaseException survival, per-iteration heartbeat recording,
heartbeat round-trip/age, staleness detection, and silent-write-failure.
Salvaged from #49660 (BaseException survival on current structure),
extended with the heartbeat + honest-status reporting that the earlier
(pre-refactor) watchdog PRs #35616 and #33849 proposed.
Fixes#32612Fixes#32895
Co-authored-by: banditburai <promptsiren@gmail.com>
Co-authored-by: sweetcornna <96944678+sweetcornna@users.noreply.github.com>
Consolidates three cron-delivery defects in cron/scheduler.py::_deliver_result
that all stem from how the live-adapter send result is interpreted.
#38922 — duplicate message on confirmation timeout.
future.result(timeout=60) raising TimeoutError bubbled to the outer
except handler, which left delivered=False, so `if not delivered:` re-sent
the identical message via the standalone path. future.cancel() cannot
un-send a request already in flight on the wire, so a slow confirmation
deterministically produced a duplicate. The send was already dispatched onto
the gateway loop, so a bare timeout is now treated as delivered
(assume-delivered is safer than guaranteed-duplicate) and the standalone
fallback is skipped. The live-adapter media attempt is also skipped on
timeout since the contended loop would re-block each 30s media budget.
#47056 — silent drop when the gateway has an active session.
The old check `if send_result is None or not getattr(send_result,
"success", True)` let a result object missing a `success` attribute default
to True = counted as a successful delivery, so the scheduler logged
"delivered via live adapter" while the gateway never processed the message.
Delivery is now confirmed via _confirm_adapter_delivery(): only an explicit,
truthy `success` attribute counts; None or a `success`-less object falls
through to the standalone path so the message actually arrives.
A genuine send Exception (not a slow confirmation) still falls through to
the standalone path, and is caught by run_job's outer handler — it is
recorded as the job's last_error and never crashes the cron ticker.
#43014 — deliver=origin fails to resolve in CLI sessions.
A CLI-created job has no {platform, chat_id} origin, so deliver=origin (and
auto-detect / deliver=None) was unresolvable and emitted "no delivery target
resolved" on every run. An unresolvable origin with no configured home
channel is now treated as local (output stays in last_output), matching the
documented auto-deliver contract; a concrete unresolvable platform target
still reports a real error.
Salvaged from #41007 (timeout discriminator), folding in #47127's
_confirm_adapter_delivery hardening and #38937 / #43063's origin→local
fallback. Tests rewritten as behavior contracts (timeout => no duplicate;
None / success-less result => standalone fallback; confirmed success => no
fallback; CLI origin => local, explicit platform => still errors).
Co-authored-by: Evi Nova <66773372+Tranquil-Flow@users.noreply.github.com>
Co-authored-by: kyssta-exe <kyssta-exe@users.noreply.github.com>
Cron jobs created without an explicit `model` are stored as `model: null`.
At fire time `run_job` resolved `model = job.get("model") or os.getenv(
"HERMES_MODEL") or ""` and then `_model_cfg.get("default", model)`, so when
config.yaml had no `model.default` (or `model: {default: null}`) an empty
string flowed straight to the provider and surfaced as an opaque HTTP 400
("Model parameter is required" / "model: String should have at least 1
character"). The operator had to inspect jobs.json to discover the job was
stored with a null model.
This change makes cron model resolution robust and symmetric with the CLI:
- Coerce `model: null`/missing config to `{}` so a falsy default never
overwrites an already-resolved env value with `None`.
- Only overwrite `model` from `model.default` when the resolved value is
truthy; accept a `model.model` alias key, mirroring the sibling resolvers
in hermes_cli/oneshot.py, fallback_cmd.py and prompt_size.py.
- Resolve AFTER the managed-scope overlay so an administrator-pinned model
still wins.
- Fail fast with an actionable error (caught by run_job's outer handler and
recorded as the job's last_error — the cron ticker is unaffected) instead
of letting an empty model reach the API.
- The per-job model is re-read every tick, so a `cronjob action=update
model=...` after a failed run takes effect on the next tick (no cache).
Adds tests/cron/conftest.py pinning a default HERMES_MODEL so existing
run_job tests don't trip the new guard, plus regression tests covering env
fallback, config.default fallback, string-form config, the model alias key,
null-default-no-clobber, corrupt-config graceful degradation, fail-fast,
and the no-cache re-read property.
Salvaged from #24005, rebased onto current main, with additional test
coverage folded in from #45550 and the alias-key behavior from #43952.
Fixes#43899Fixes#23979Fixes#22761
Co-authored-by: szzhoujiarui-sketch <szzhoujiarui@gmail.com>
Co-authored-by: rayjun <rayjun0412@gmail.com>