Commit graph

14068 commits

Author SHA1 Message Date
kshitijk4poor
9cf47fef54 fix(auxiliary_client): demote the 2 sibling routing fall-throughs too (review)
Phase 2c review flagged that only 2 of the 4 structurally-identical
resolve_provider_client routing dead-ends were demoted. Complete the bug-class:
also demote+dedup the external-process ('not directly supported') and OAuth
('not directly supported, try auto') fall-throughs, keyed by provider name, so
none of the four dead-ends spam WARNING on a retry loop.

Add direct tests for the unhandled-auth_type and OAuth dedup paths via a
monkeypatched PROVIDER_REGISTRY (the review noted these were unverified).
Mutation-checked: reverting either sibling demotion fails its test.
2026-07-01 17:00:30 +05:30
kshitijk4poor
c0d3ceb17e fix(auxiliary_client): dedup resolve_provider_client fall-through warnings
The two fall-through branches in resolve_provider_client (unknown provider,
unhandled auth_type) logged at WARNING on every retry of a misconfigured
provider, spamming logs during retry loops. Demote both to logger.debug with
per-process dedup: the first occurrence still surfaces (a provider-name typo or
PROVIDER_REGISTRY/auth_type-drift bug is worth seeing once), while identical
repeats are suppressed for the process lifetime.

Salvaged from #56283 (extracting only the stated auxiliary_client fix; the
original PR also bundled ~2800 lines of unrelated changes across 10 other
files, which are dropped).
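
A minimal sketch of the per-process dedup described above, not the actual resolve_provider_client code. The names log_once and _seen_fallthroughs, the logger name, and the key shape are hypothetical. It assumes the first occurrence is logged at DEBUG and identical repeats are dropped for the process lifetime.

```python
import logging

logger = logging.getLogger("auxiliary_client_sketch")  # hypothetical logger name

# Keys already reported in this process. Never cleared, so the dedup
# lasts for the process lifetime.
_seen_fallthroughs = set()


def log_once(kind, provider):
    """Log a routing dead-end at DEBUG the first time (kind, provider) is seen.

    Returns True if the message was emitted, False if it was suppressed.
    """
    key = (kind, provider)
    if key in _seen_fallthroughs:
        return False
    _seen_fallthroughs.add(key)
    logger.debug("resolve_provider_client: %s for provider %r", kind, provider)
    return True
```

Keying on both the dead-end kind and the provider name means a typo in one provider does not hide a different misconfiguration in another.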
2026-07-01 17:00:30 +05:30
kshitijk4poor
fb7a38ad21 fix(macos): compose launchd reload retry with _launchctl_bootstrap + drain-aware window
Reworks @valenteff's #53277 fix per review (Teknium's 3 findings):
- Route refresh_launchd_plist_if_needed's bootstrap through the existing
  _launchctl_bootstrap() EIO-recovery helper (canonical since #56256),
  wrapped in a wall-clock retry loop, instead of an ad-hoc 5x2s loop.
- Window sized to agent.restart_drain_timeout (default 180s), not a fixed
  ~10s: the failure happens while the old gateway is still draining (finding 1).
- Retry on subprocess.TimeoutExpired too, not just CalledProcessError — a
  bootstrap timeout after bootout otherwise escapes and leaves the service
  unloaded (finding 2).
- Confirm success with launchctl list, not a bare bootstrap exit 0 (finding 3);
  mirror verify+drain-window in the detached-helper bash path.
- Shared helpers _launchd_reload_log_path / _append_launchd_reload_log /
  _launchctl_label_registered / _retry_launchctl_bootstrap_until_registered.

3 new tests cover retry-until-listed, TimeoutExpired-retried, deadline-exhaust.
E2E: real reload log + mocked launchctl — retries CalledProcessError+TimeoutExpired,
verifies via launchctl list, logs failures.
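
A rough sketch of the retry-until-registered loop described above, under stated assumptions: the function name, its parameters, and the injected bootstrap / is_registered callables are hypothetical stand-ins for the real helpers, and the 2s interval is illustrative. Only the shape is taken from the message: a wall-clock window, retry on both CalledProcessError and TimeoutExpired, and success confirmed by a registration check rather than a bare exit 0.

```python
import subprocess
import time


def retry_bootstrap_until_registered(bootstrap, is_registered, window_s=180.0,
                                     interval_s=2.0, clock=time.monotonic,
                                     sleep=time.sleep):
    """Call bootstrap() until is_registered() is true or the window expires.

    bootstrap and is_registered are injected callables (hypothetical); the
    real code would shell out to launchctl. Returns True on confirmed
    registration, False when the deadline is exhausted.
    """
    deadline = clock() + window_s
    while True:
        try:
            bootstrap()
        except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
            # Both failure modes are retried; a timeout after bootout would
            # otherwise leave the service unloaded.
            pass
        # Success is decided by the registration check, not the exit code.
        if is_registered():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Injecting the clock and sleep functions keeps the loop testable without real waiting.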
2026-07-01 16:56:14 +05:30
Fabio Fernandes Valente
7a7d19e73b fix(macos): retry launchd reload on transient bootstrap failure
refresh_launchd_plist_if_needed ran `launchctl bootout` then
`launchctl bootstrap` with errors silenced (`2>/dev/null` in the
detached helper, `check=False` in the direct subprocess path).
Under high load or a launchd race, the bootout succeeds — removing
the service from launchd — but the follow-up bootstrap fails
silently. The service stays unregistered; KeepAlive can't revive
a service launchd no longer knows about, so the gateway stays dark
until a manual `launchctl bootstrap`.

Observed incident (2026-06-26): `/restart` in chat triggered a
planned drain; during the drain a separate call re-triggered the
plist refresh, which bootout'd the live service. Under loadavg
9.48 the bootstrap failed silently — 2h35min offline until manual
recovery.

Fix: retry the bootstrap up to 5 times with 2s back-off, verify
with `launchctl list <label>` afterwards, and log failures to
~/.hermes/logs/launchd-reload.log so the health watchdog can
detect a persistent orphan. Mirrors the contract across both
the detached helper (refresh inside gateway tree) and the direct
subprocess path (refresh from external CLI).

Existing tests pass:
- test_refresh_defers_reload_when_running_inside_gateway_tree
- test_refresh_uses_direct_reload_when_not_inside_gateway_tree

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-07-01 16:56:14 +05:30
kshitij
d4e8c358c0 Merge pull request #56330 from kshitijk4poor/chore/authormap-valenteff
chore: add AUTHOR_MAP entry for valenteff (#53277 salvage)
2026-07-01 16:49:16 +05:30
shawchanshek
3b739b990b fix(title_generator): strip think blocks from LLM output before extracting title
Think-enabled models (MiniMax M2.7, DeepSeek, etc.) emit inline
<think>...</think> reasoning even for simple prompts like title
generation, and the raw XML was leaking into session titles. Route the
title-model response through the canonical strip_think_blocks scrubber
before cleanup so every tag variant — closed pairs, unterminated blocks,
orphan closes, mixed case — is handled, not just a single literal
<think> pair.

- 2 regression tests: closed <think> pair stripped, unterminated block
  at start yields no title.

Salvaged from PR #44126 by @shawchanshek.
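
For illustration only, a small regex-based sketch of the tag variants listed above (closed pairs, unterminated blocks, orphan closes, mixed case). This is not the canonical strip_think_blocks scrubber; the function name strip_think_sketch and the three patterns are assumptions, and only the literal think tag is handled.

```python
import re

# Closed <think>...</think> pairs, any letter case, spanning newlines.
_CLOSED = re.compile(r"<think>.*?</think>", re.IGNORECASE | re.DOTALL)
# An opening tag that is never closed: drop everything from it to the end.
_UNTERMINATED = re.compile(r"<think>.*\Z", re.IGNORECASE | re.DOTALL)
# A closing tag left behind without a matching opener.
_ORPHAN_CLOSE = re.compile(r"</think>", re.IGNORECASE)


def strip_think_sketch(text):
    """Remove think-tag reasoning from text and trim surrounding whitespace."""
    text = _CLOSED.sub("", text)
    text = _UNTERMINATED.sub("", text)
    text = _ORPHAN_CLOSE.sub("", text)
    return text.strip()
```

An unterminated block at the start leaves an empty string, which matches the "yields no title" regression case.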
2026-07-01 04:18:48 -07:00
kshitij
037e389c4f Merge pull request #56325 from kshitijk4poor/chore/authormap-session-persist
chore: AUTHOR_MAP entries for session-persistence salvage batch
2026-07-01 16:47:13 +05:30
kshitijk4poor
314cf43d50 test(matrix): assert real device_id in query_keys, not just guard-skip
Hardens the salvaged #53997 tests per review: the positive-resolution and
reconnect-recovery tests now assert that query_keys is awaited with the REAL
resolved device id ({mxid: [<id>]}) and never with [None], the [null] body the
homeserver rejects (the actual bug). They also assert await_count==2 to prove
that verification genuinely re-runs after resolution, rather than just the
flag looking right.
2026-07-01 16:46:40 +05:30
Gary Walker
09dbe76955 fix(matrix): reset _device_id_unverified at start of connect()
Per review feedback on #53997 from @teknium1: the flag was set True
on failed device_id resolution but never reset, so a same-adapter
reconnect that successfully resolves a real device_id would keep
skipping server-side key verification indefinitely.

Reset now happens at the top of connect(), before resolution runs,
so every connect() attempt starts clean. A repeat failure re-sets
the flag (unchanged behavior); a recovery correctly clears it.

Adds TestDeviceIdRecoveryOnReconnect to cover the transition.
2026-07-01 16:46:40 +05:30
Gary Walker
9048457eab fix(matrix): device_id fallback prevents E2EE init failure on fresh bot accounts
- Resolve device_id via query_keys({mxid: []}) when whoami() returns None
- Guard _verify_device_keys_on_server and _reverify_keys_after_upload
  against None/unverified device_id to prevent 'device_keys values must
  be a list of strings' serialization failure
- Disconnect existing client before reconnect to prevent dual OlmMachine
  instances on the same crypto store

Re-targeted from #39779 (legacy gateway/platforms/matrix.py) onto the
migrated plugins/platforms/matrix/adapter.py path following the
2026-06-20 adapter migration. Logic unchanged from original fix.

242 tests passing (233 upstream + 9 new).
2026-07-01 16:46:40 +05:30
SahilRakhaiya05
2d8d08cae6 fix(api-server): require auth for /health/detailed and fail closed on weak keys
/health/detailed leaked runtime state (gateway state, connected
platforms, active-agent counts, PID, exit reason) with no auth. Gate it
behind the same Bearer auth as other API routes; plain /health stays
open for liveness probes.

Also refuse to start on a placeholder/too-short (<16 char) API_SERVER_KEY
regardless of bind address — a guessable key on a terminal-capable
endpoint is RCE-adjacent even on loopback, since any local process can
reach it. The required-key check was already unconditional; this extends
the strength floor to loopback binds too. Startup guards are hoisted
above app/background-task creation so a rejected start leaves no partial
state.

Salvaged from #44073 (external-surface hardening), split into a focused
PR per maintainer request.

Co-authored-by: Hermes Agent <agent@nousresearch.com>
2026-07-01 04:14:33 -07:00
kshitijk4poor
9c870548e3 chore: add AUTHOR_MAP entry for valenteff (#53277 salvage)
2026-07-01 16:44:07 +05:30
shandian64
5126902f1d fix(title): honor configured auxiliary timeout
2026-07-01 16:41:43 +05:30
kshitijk4poor
b3f55c2037 chore: add AUTHOR_MAP entries for session-persistence salvage batch
Maps the two plain-email contributors whose PRs are being salvaged so
contributor_audit.py passes:
- info@djimit.nl -> djimit (PR #48034)
- lubos@komfi.health -> lubosxyz (PR #49225)

The other two PRs in the batch (#50405 sasquatch9818, #48764 srojk34)
use users.noreply.github.com emails, which check-attribution auto-skips.
2026-07-01 16:38:56 +05:30
SahilRakhaiya05
5178b3f056 fix(code-exec): bind execute_code tool socket to a per-session RPC token
The execute_code sandbox exposed its tool-call RPC (AF_UNIX socket and
remote file-poll transports) without any caller check, so any local
process that could reach the socket / rpc dir could dispatch
terminal-capable tool calls through the parent. Mint a per-session
HERMES_RPC_TOKEN, pass it to the sandboxed child, and require a
timing-safe match on every request in both _rpc_server_loop and
_rpc_poll_loop. Empty/missing/wrong token fails closed.

Salvaged from #44073 (per-session RPC token). Added timing-safe
secrets.compare_digest comparison and fail-closed regression tests.
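
A minimal sketch of the fail-closed token check described above. The helper names mint_rpc_token and rpc_token_ok and the 32-byte token size are assumptions for illustration; only the use of secrets.compare_digest and the empty/missing/wrong-token behavior come from the message.

```python
import secrets


def mint_rpc_token():
    """Create a fresh per-session token (32 random bytes, URL-safe text)."""
    return secrets.token_urlsafe(32)


def rpc_token_ok(expected, presented):
    """Return True only when presented matches expected.

    Empty or missing values on either side fail closed before any comparison.
    """
    if not isinstance(expected, str) or not isinstance(presented, str):
        return False
    if not expected or not presented:
        return False
    # Compare as bytes: compare_digest is timing-safe and, for str inputs,
    # only accepts ASCII, so encoding first avoids a TypeError.
    return secrets.compare_digest(expected.encode("utf-8"),
                                  presented.encode("utf-8"))
```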

Co-authored-by: Hermes Agent <agent@nousresearch.com>
2026-07-01 04:08:37 -07:00
Teknium
5de65624d1 fix(moa): capture streamed aggregator output into full-turn traces (#56312)
MoA full-turn traces (moa.save_traces) recorded the aggregator's acting
output only on the non-streaming path, where it's captured inline at
call time. On the streaming path — which every hermes chat --query run
and every live gateway/CLI turn takes — the aggregator's raw token
stream is handed to the live consumer, so the trace left output=null and
only pointed at the session-db assistant row. An offline audit of a
benchmark run (HermesBench drives --query) then couldn't see what the
aggregator produced without hand-joining to state.db.

Capture the resolved streamed acting text at trace-flush time (the agent
already holds it in _current_streamed_assistant_text) and fold it into
the trace, so the record is self-contained in both modes. New
output_location value inline_from_stream marks a streamed turn whose text
was captured this way; a genuinely empty acting turn (pure tool call)
still points at the session db, matching state.db exactly.

Touches only the trace side-channel — no change to the acting path,
message history, role alternation, or prompt cache.

- agent/moa_loop.py: consume_and_save_trace(..., aggregator_output_fallback)
  on both the facade and the MoAClient wrapper; prefer inline capture,
  fall back to the resolved streamed text.
- agent/moa_trace.py: embed the fallback; add inline_from_stream location.
- agent/conversation_loop.py: pass _current_streamed_assistant_text at flush.
- tests: 5 cases across streaming / non-streaming / empty-fallback / no-double-write.
2026-07-01 04:07:46 -07:00
Teknium
81595cd588 fix(dashboard): run plugin gate after auth + enable example fixture
Follow-up on the salvaged #47491 commits:

- Register _plugin_api_runtime_gate BEFORE the auth middlewares so it
  executes AFTER them, and add an explicit auth check: unauthenticated
  requests to /api/plugins/<name>/ fall through to auth's 401 instead of
  this gate's 404. Prevents the gate from becoming a plugin-name oracle
  (an unauthenticated caller could otherwise fingerprint installed/enabled
  plugins by status code). Keeps test_non_kanban_plugin_route_requires_auth
  green.
- Enable the 'example' user plugin in the _install_example_plugin test
  fixture so the auth / static-asset-allowlist tests still reach the real
  serving paths now that user plugins are gated on plugins.enabled.
- Mark the runtime-gate unit-test scopes as authenticated so they exercise
  the enabled/disabled policy under the new auth-first ordering.
2026-07-01 04:05:15 -07:00
manusjs
b2e0086f1b fix(dashboard): enforce plugin disabled gate at request time and for bundled assets
Address two residual bypasses identified in review:

1. Add _plugin_api_runtime_gate middleware that checks plugins.enabled/
   plugins.disabled on every request to /api/plugins/{name}/... routes.
   Previously, disabling a plugin at runtime had no effect on its already-
   mounted API routes until a restart.

2. Extend serve_plugin_asset to check plugins.disabled for bundled plugins.
   Previously, only user plugins were gated — a bundled plugin in
   plugins.disabled would still serve assets from the unauthenticated
   /dashboard-plugins/{name}/... endpoint.

Both fixes ensure the enabled/disabled policy is evaluated live at request
time, not just at startup.

Adds regression tests covering:
- Middleware blocks disabled user plugin API routes (404)
- Middleware blocks user plugin removed from enabled set (404)
- Middleware passes enabled user plugin API routes
- Middleware blocks disabled bundled plugin API routes (404)
- Bundled plugin assets return 404 when disabled
- Bundled plugin assets served normally when not disabled
- User plugin asset gating still works correctly
2026-07-01 04:05:15 -07:00
manusjs
7cff95644d fix(dashboard): gate plugin asset serving and API mount on plugins.enabled
User-installed dashboard plugins had their assets served and Python
backend code imported without checking the plugins.enabled allowlist.
This meant a plugin installed in the plugins directory but not enabled
could still execute code at dashboard startup and serve arbitrary files.

Changes:
- get_dashboard_plugins API: filter out user plugins not in enabled set
- serve_plugin_asset: reject requests for disabled/non-enabled user plugins
- _mount_plugin_api_routes: skip Python import for non-enabled user plugins
- Bundled plugins still load by default but respect explicit disables

Fixes #46435
2026-07-01 04:05:15 -07:00
kshitij
8415c4703a Merge pull request #56317 from kshitijk4poor/chore/authormap-bitcryptic
chore: add AUTHOR_MAP entry for bitcryptic-gw (#53997 salvage)
2026-07-01 16:33:47 +05:30
Tao Chen
d3c8667462 fix(slack): authorize bot/workflow senders before the no-user-id guard
Slack Workflow Builder posts (and other app/bot messages) arrive as
subtype=bot_message with user=None. _is_user_authorized rejected them at
the `if not user_id: return False` guard, which runs *before* the #4466
{PLATFORM}_ALLOW_BOTS bypass — so @mentioning the bot from a Slack
workflow silently did nothing, even with SLACK_ALLOW_BOTS (or
SLACK_ALLOW_ALL_USERS) set. The chat-scoped allowlist for Telegram/QQ
already runs before that guard for the same reason (channel broadcasts
with no from_user); Slack was both missing from the bot-bypass map and
had the bypass running too late.

- gateway/authz_mixin: move the {PLATFORM}_ALLOW_BOTS bypass ahead of the
  no-user-id guard and add Platform.SLACK -> SLACK_ALLOW_BOTS.
- plugins/platforms/slack/adapter: set is_bot=True on inbound
  bot_message events so the gateway can identify workflow/app senders
  (they carry no user_id to match against the allowlist).

Tested: new tests/gateway/test_slack_bot_auth_bypass.py plus the existing
Discord/Feishu bot-auth and gateway authz/gating suites all pass.
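
The ordering change can be sketched as follows. This is a simplified stand-in, not the gateway's _is_user_authorized: the function name and its four parameters are hypothetical, and the real check has more branches (allow-all, chat-scoped allowlists).

```python
def is_sender_authorized(user_id, is_bot, allow_bots, allowed_user_ids):
    """Decide whether an inbound sender may reach the agent.

    The bot bypass runs BEFORE the no-user-id guard, because bot and
    workflow posts arrive with no user_id to match against the allowlist.
    """
    if is_bot and allow_bots:
        return True
    if not user_id:
        return False
    return user_id in allowed_user_ids
```

With the two checks in the opposite order, a workflow post (is_bot=True, user_id=None) would be rejected even when bots are allowed.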
2026-07-01 16:32:32 +05:30
kshitijk4poor
fcbf850f33 chore: add AUTHOR_MAP entry for bitcryptic-gw (#53997 salvage)
2026-07-01 16:28:15 +05:30
teknium1
27347b2239 fix(gateway): align resume safety-net note with canonical recovery wording
Follow-up on the salvaged resume_pending fix: the empty-turn safety net
now emits the same reason-aware recovery note as the _is_resume_pending
branch (reason phrase + 'session restored' guidance + no-re-execute
instruction) instead of a second, differently-worded note. Also adds the
AUTHOR_MAP entry for the salvaged commit.
2026-07-01 03:57:44 -07:00
Adam Chiaravalle
c2db3ed7d8 fix(gateway): recover resume_pending sessions instead of sending a blank turn
A session interrupted by a gateway restart is flagged resume_pending and
auto-continued on startup via _schedule_resume_pending_sessions(), which
dispatches an empty-text internal MessageEvent. The recovery system note
that should fill that empty turn is gated, in _run_agent(), on
_interruption_is_fresh — the age of the LAST PERSISTED TRANSCRIPT ROW.

For an active thread returned to after >1h of silence, that transcript
clock is stale even though the interruption (last_resume_marked_at) is
seconds old. The gate evaluates False, the note is not prepended, and the
model receives a genuinely blank user turn — replying with confused
'that message came through blank' noise.

Fix (two parts, both default-on, behavior unchanged for healthy turns):

1. resume_pending freshness now also considers last_resume_marked_at (the
   restart watchdog's own stamp). The branch fires when EITHER the
   transcript clock OR the resume mark is fresh, so the startup scheduler's
   freshness decision and the per-turn injection agree.

2. Empty-turn safety net: if the user turn is still blank after all
   injections AND the session is resume_pending, backfill a recovery note
   so a blank turn can never reach the model. Scoped to resume_pending so
   ordinary empty turns (e.g. uncaptioned image) are untouched.

Adds 3 regression tests; the two core ones fail on the pre-fix logic.
2026-07-01 03:57:44 -07:00
teknium1
d1d1d81900 fix(gateway): repair sibling tests + harden _adapter_for_source after fail-closed flip
Follow-up to the salvaged fail-closed defaults. The own-policy default flip
(open -> pairing) and the email dispatch-level deny broke sibling tests
across the suite that relied on the old fail-open behavior:

- test_email.py: dispatch-mechanics tests now opt into EMAIL_ALLOW_ALL_USERS
  (they test formatting/attachments/threading, not authz); the two auth
  contract tests are rewritten to assert the new fail-closed behavior
  (no allowlist + no allow-all => sender dropped at the adapter).
- test_whatsapp_cloud.py / test_whatsapp_formatting.py / test_whatsapp_from_owner.py:
  autouse fixture opts into WHATSAPP_ALLOW_ALL_USERS so dm_policy: open
  dispatch-mechanics tests still flow (open now requires an explicit
  allow-all opt-in, SECURITY.md 2.6).
- _adapter_for_source: use getattr for source.platform/profile so bare
  SimpleNamespace test fixtures without .profile don't crash the busy/queue
  ingress path (AGENTS.md pitfall #17).

Full tests/gateway/ + yuanbao pipeline: 8555 passed, 0 failed.
2026-07-01 03:56:28 -07:00
teknium1
49a87bcd1e chore(release): map SahilRakhaiya05 contributor email for #44073 salvage
2026-07-01 03:56:28 -07:00
SahilRakhaiya05
bb304b4914 fix(gateway): fail-closed external-surface defaults + profile-aware multiplex authz
Aligns runtime behaviour with SECURITY.md 2.6: externally reachable
messaging adapters must fail closed unless access is explicitly
configured. Closes the confirmed multiplex authorization bypass: a
secondary profile's open dm/group policy no longer inherits the default
profile's allowlist trust.

- Own-policy adapters (WhatsApp, WeCom, Weixin, QQBot, Yuanbao) default
  dm_policy/group_policy to pairing/allowlist instead of open; open now
  requires an explicit GATEWAY_ALLOW_ALL_USERS or per-platform allow-all.
- Startup guard (_own_policy_open_startup_violation) refuses to boot when
  an enabled adapter is open without the allow-all opt-in; the guard now
  runs for every secondary profile in multiplex mode too.
- Profile-aware own-policy authorization: _authorization_adapter /
  _adapter_for_source resolve the live adapter via SessionSource.profile,
  so _is_user_authorized and the ingress/pairing/busy/queue paths read the
  originating profile's adapter policy, not the default profile's.
- Fail-closed intake for Email, Feishu P2P, and Discord (blank-principal
  denial, empty-allowlist deny, missing-interaction.user deny).

Salvaged from #44073 (external-surface hardening), split into a focused
gateway-authz PR per maintainer request. Follow-up fix by Hermes Agent:
the Discord slash-auth channel bypass now matches DISCORD_ALLOWED_CHANNELS
by the same name-inclusive keys (id + name + #name + parent) the on_message
scope gate uses, so a name-form channel allowlist authorizes slash
interactions consistently (was id-only, breaking #name matching).

Co-authored-by: Hermes Agent <agent@nousresearch.com>
2026-07-01 03:56:28 -07:00
srojk34
8e94e8f882 fix(discord): tag unverified channel-context senders like Slack threads
Discord's _fetch_channel_context backfills recent channel/thread activity
(from any member who can post there, not just the allowlisted user) into
the agent's context with no sender-trust distinction. Slack's equivalent
_fetch_thread_context was fixed to prefix non-allowlisted senders with
[unverified] and add LLM guidance not to act on their content, mitigating
indirect prompt injection from third parties in shared channels/threads.
Port the same mechanism to Discord using the already-wired
_is_sender_authorized/set_authorization_check plumbing.
2026-07-01 16:25:16 +05:30
kshitijk4poor
23518a5e02 test(review): add integration guards for the two isolation wirings (review)
Phase 2c mutation-check found the salvaged tests covered only the pure helpers
(_is_background_review_harness_message / _strip_background_review_harness) — the
two integration WIRINGS had zero coverage: removing the _persist_disabled guard
in _flush_messages_to_session_db, or the _strip call in
get_messages_as_conversation, left all 13 tests green.

Add:
- TestPersistDisabledHardStop: a _persist_disabled agent's flush writes nothing
  to a live SessionDB (guards the run_agent hard-stop).
- TestGetMessagesAsConversationStripsHarness: a session with stray harness rows
  resumes clean end-to-end through get_messages_as_conversation (guards the
  hermes_state load-time wiring).
Mutation-checked: each new test fails when its wiring is reverted.
2026-07-01 16:21:39 +05:30
arminanton
e2fa509bf3 fix(review): isolate the background-review fork from the canonical session
The forked skill/memory review agent shares the parent's session_id for
prompt-cache warmth. Without isolation it wrote its harness turn ('Review the
conversation above and update the skill library…') plus its curator-mode reply
straight into the user's REAL session in state.db; the next live turn re-read
that injected user message as a standing instruction and the agent 'became' the
curator, refusing the actual task.

Root fix: a _persist_disabled flag on the fork that hard-stops every DB write
and lazy-open path (_flush_messages_to_session_db, _ensure_db_session,
_get_session_db_for_recall) — the review writes only to the skill/memory stores
via its tools. Defense-in-depth: _strip_background_review_harness drops any
stray harness message (and the assistant reply that followed) at load time in
get_messages_as_conversation, so an already-polluted session resumes clean.

Salvaged from #50296.

Co-authored-by: arminanton <29869547+arminanton@users.noreply.github.com>
2026-07-01 16:21:39 +05:30
Swissly
242c9639a8 fix(cron): prevent multi-target delivery loop crash on per-target failure
The standalone thread-pool fallback in _deliver_result() runs inside the
`except RuntimeError:` block (taken when asyncio.run() sees a running loop).
When future.result() raised there (SMTP ConnectionError, timeout, etc.), the
exception was NOT caught by the sibling `except Exception:` — it escaped
_deliver_result() and crashed the whole delivery loop, silently skipping every
remaining target. Multi-target delivery (e.g. deliver: 'email:a,email:b') is a
documented feature, so this broke a promised contract.

Wrap the fallback in its own try/except so a per-target failure is logged with
exc_info and the loop continues to the next target.
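
A simplified sketch of the per-target isolation, assuming a hypothetical deliver_to_all helper and an injected send callable. The real _deliver_result has the asyncio / thread-pool fallback structure described above, which is omitted here.

```python
import logging

logger = logging.getLogger("cron_delivery_sketch")  # hypothetical logger name


def deliver_to_all(targets, send):
    """Call send(target) for every target; one failure never stops the loop.

    Returns the list of targets whose delivery raised.
    """
    failed = []
    for target in targets:
        try:
            send(target)
        except Exception:
            # Log with the traceback and move on to the next target.
            logger.error("delivery to %s failed", target, exc_info=True)
            failed.append(target)
    return failed
```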

Fixes #47163
2026-07-01 03:48:37 -07:00
kshitijk4poor
d3010b74db test(agent): strengthen id-reuse regression + refresh flush docstring (review)
Phase 2c review follow-up on the id()-reuse persistence fix:

- test_recycled_id_in_dedup_set_still_persists_new_message seeded an EMPTY
  dedup set, so it never injected a collision and passed under id-based dedup
  too (couldn't distinguish the designs). Replace with
  test_stale_seed_id_from_prior_flush_cannot_suppress_new_message, which asserts
  the durable invariant: the seed is empty after every flush (mutation-checked:
  removing the post-flush reset now fails BOTH id-reuse tests).
- Refresh the _flush_messages_to_session_db docstring: it still described the
  old per-session identity tracking; document the intrinsic-marker mechanism,
  that _flushed_db_message_ids is now a one-shot seed, and the shared-dict
  mutation safety note.
2026-07-01 16:17:46 +05:30
rrevenanttt
e4c6d1b22b fix(agent): persist messages by intrinsic marker to stop id() reuse data loss
_flush_messages_to_session_db deduped persisted messages with a retained
{id(msg)} set (_flushed_db_message_ids) kept across turns. Once a flushed dict
is dropped from the live list (scaffolding rewind / in-place compaction) and
GC'd, CPython recycles its address onto a new assistant/tool dict whose id()
collides with the stale entry — so the real turn is silently never written to
state.db.

Replace the retained id-set with an intrinsic _DB_PERSISTED_MARKER stamped on
each dict. The id-set is demoted to a one-shot seed (valid only while the
caller's objects are alive) that is translated to markers and cleared after
every flush, so no id() outlives a flush to alias a future message. The marker
is _-prefixed so the wire sanitizers strip it before any request leaves.

Preserves the existing _is_ephemeral_scaffolding skip. Salvaged from #50372.
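
To illustrate why an intrinsic marker avoids the aliasing problem, here is a minimal sketch. The marker key "_db_persisted", the function flush_sketch, and the write callable are hypothetical; the real flush also handles the one-shot id seed and the ephemeral-scaffolding skip, both left out here.

```python
_MARKER = "_db_persisted"  # hypothetical key; "_" prefix marks it as internal


def flush_sketch(messages, write):
    """Write each not-yet-persisted message once and stamp it in place.

    The persisted state lives on the dict itself, so it cannot be confused
    with a different dict that later reuses the same memory address.
    Returns the number of messages written.
    """
    written = 0
    for msg in messages:
        if msg.get(_MARKER):
            continue
        # Strip "_"-prefixed internal keys from what gets stored.
        write({k: v for k, v in msg.items() if not k.startswith("_")})
        msg[_MARKER] = True
        written += 1
    return written
```

A retained set of id() values, by contrast, can match a brand-new dict once the original has been garbage-collected and its address recycled.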

Co-authored-by: rrevenanttt <290873280+rrevenanttt@users.noreply.github.com>
2026-07-01 16:17:46 +05:30
kshitij
1d6645b17f Merge pull request #56296 from kshitijk4poor/fix/gateway-force-exit-pidlock-release
fix(gateway): release PID file + runtime lock in the force-exit backstop
2026-07-01 16:14:26 +05:30
kshitijk4poor
b7adad1a72 test(error-classifier): parametrize 5xx overflow test over 500/502/503/529
Review nit (helix4u): the fix covers 500/502/503/529 but the positive tests
only asserted 500 and 503. Parametrize over all four so 502/529 are covered
too; keep the plain-5xx negatives.
2026-07-01 16:14:16 +05:30
pefontana
a04b7024ff fix(error-classifier): route 5xx context-overflow into compression
Local inference servers (llama.cpp/llama-server, vLLM/Ollama behind a
Cloudflare/Tailscale hop) report context overflow with HTTP 500/502/503/529
instead of 400/413. _classify_by_status returned server_error/overloaded and
retried blindly, then dropped the turn with no compaction. Route explicit
_CONTEXT_OVERFLOW_PATTERNS matches on those 5xx codes to context_overflow
(should_compress=True); plain 500 stays server_error, plain 503 overloaded.
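
A hedged sketch of the routing rule. The function classify_5xx and the two hint strings are hypothetical (the real _CONTEXT_OVERFLOW_PATTERNS are not shown in this message), and the plain mapping for 502/529 is not stated above, so the server_error fallback for those codes is a placeholder.

```python
# Hypothetical overflow hints; the real pattern list is not shown here.
_OVERFLOW_HINTS = ("context length", "exceeds the available context size")

_OVERFLOW_CAPABLE = (500, 502, 503, 529)


def classify_5xx(status, body):
    """Classify a 5xx response body as overflow, overloaded, or server error."""
    text = body.lower()
    if status in _OVERFLOW_CAPABLE and any(h in text for h in _OVERFLOW_HINTS):
        return "context_overflow"  # caller should compress, not retry blindly
    if status == 503:
        return "overloaded"
    # Plain 500 per the message; placeholder for other codes.
    return "server_error"
```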
2026-07-01 16:14:16 +05:30
Teknium
74809b4e94 fix(cli): reap dead-locked worktrees so .worktrees/ can't grow unbounded (#56288)
hermes -w locks each worktree (reason 'hermes pid=<pid>'). git worktree
remove --force (single -f) refuses a locked tree, so a crashed session's
lock was never released and its worktree accumulated forever — a real
contributor to .worktrees/ bloat.

_prune_stale_worktrees now classifies each lock via _worktree_lock_is_live:
a live-owner pid is skipped at any age; a dead-owner (or foreign) lock is
unlocked first so the aggressive age-based cleanup can actually reap it.
The >72h reap tier is kept (that cleanup is intentional) but now guarded so
dirty/unpushed work is preserved, and branch deletion is gated on
git worktree remove succeeding. New fail-safe helpers _worktree_is_dirty
and _worktree_lock_is_live (pid liveness via gateway.status._pid_exists,
Windows-safe).
2026-07-01 03:43:20 -07:00
teknium1
5c2dccd06f chore(release): map kangsoo-bit author for PR #47508 salvage
2026-07-01 03:42:32 -07:00
kangsoo-bit
7a2369718a fix(telegram): keep polling alive during transient bootstrap outages
A transient Bot API network error during gateway bootstrap (deleteWebhook
or the initial start_polling) currently raises out of connect() and marks
the Telegram adapter fatal, restart-looping the whole gateway even though
the right behavior is to degrade the Telegram channel and let the existing
reconnect ladder recover in the background.

- _delete_webhook_best_effort(): swallow only transient network errors and
  continue to polling; non-network errors (e.g. auth failures) still raise.
- _start_polling_resilient(): on a transient conflict/network error at
  bootstrap, schedule background recovery and return degraded instead of
  raising; non-transient errors still propagate.
- Track the polling error-callback recovery tasks in _background_tasks so
  they can't be garbage-collected mid-flight.
- Add a second Telegram Bot API seed fallback IP (149.154.166.110).

Reconnect keeps its existing 10-retry -> supervisor-restart semantics; this
change only fixes the bootstrap raise, it does not alter the retry ladder.
2026-07-01 03:42:32 -07:00
teknium1
9dd6451c80 chore(release): add WXBR to AUTHOR_MAP for #46183 salvage
2026-07-01 03:34:49 -07:00
WXBR
59e7e9d007 fix(agent): persist recovered final responses
Close a recovery/fallback final_response with an assistant transcript entry before session persistence so durable history cannot end at a tool/user message after the caller receives a final answer.

Adds a regression test for a tool-tail transcript with a non-empty final_response. Related to #46071 / #46053, but covers the adjacent case where the assistant message was never appended before persistence.
2026-07-01 03:34:49 -07:00
kshitijk4poor
df27267ed7 fix(gateway): release PID file + runtime lock in the force-exit backstop
Follow-up to #54111. That PR routed the early SystemExit exit paths
(clean-fatal-config #51228, startup-aborted-before-running) through
_exit_after_graceful_shutdown / os._exit. Those paths raise right after
runner.start() without going through _stop_impl, so they relied on atexit
to release the PID file + runtime lock — and os._exit bypasses atexit,
leaking both.

Release them explicitly in the backstop (the single guaranteed cleanup
chokepoint). Both calls are idempotent: no-op on the normal _stop_impl
path, actual cleanup on the early-exit paths. Corrects the now-inaccurate
docstring claim that teardown always ran first. Adds a guard test plus the
missing str-code->1 coverage.

E2E: real PID file written + lock acquired, _exit_after_graceful_shutdown(78)
exits code 78 AND removes the PID file (leak confirmed closed).
2026-07-01 15:59:37 +05:30
YLChen-007
e23f723389 fix: make streaming reasoning-tag filter case-insensitive
The streaming think-tag suppressors in cli.py (_stream_delta) and
gateway/stream_consumer.py (_filter_and_accumulate) matched tag names
with case-sensitive str.find(), so only the exact-case literals in the
tag tuples were caught. Mixed-case variants a model may emit — <Think>,
<ThInK>, <REASONING>, <Thought> — slipped through and leaked raw
reasoning into the user-visible stream.

Match against a lowercased view of the buffer with lowercased tag names
at all three sites (open-tag boundary search, partial-tag hold-back,
close-tag search) in both paths. Only KNOWN tag names are matched — no
substring matching — and the block-boundary gating that protects prose
mentions of <think> is preserved.

- 6 parametrized case-insensitive regression tests in each of
  tests/gateway/test_stream_consumer.py and
  tests/cli/test_stream_delta_think_tag.py.

Salvaged from PR #27289 by @YLChen-007.
2026-07-01 03:25:02 -07:00
pprism13
f049227f31 fix(state): order conversation replay by id, not timestamp
get_messages_as_conversation ordered rows by (timestamp, id). append_message
stamps each row with time.time(), which is not monotonic — on WSL2, after an
NTP step, or when a VM/laptop resumes from sleep the clock can jump backwards
mid-conversation. A later row then carries an earlier timestamp than its
predecessor, so ORDER BY timestamp sorts an assistant tool_calls row after its
tool response, orphaning the tool call and triggering an HTTP 400 on the next
completion. Order by the AUTOINCREMENT id (true insertion order) instead.
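
The reordering can be reproduced with an in-memory SQLite table. The schema and role strings below are simplified assumptions, not the project's actual table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, role TEXT, timestamp REAL)"
)
# The second row carries an EARLIER timestamp, as if the wall clock had
# stepped backwards between the two inserts.
conn.executemany(
    "INSERT INTO messages (role, timestamp) VALUES (?, ?)",
    [("assistant_tool_call", 1000.0), ("tool_response", 999.5)],
)

by_timestamp = [
    row[0]
    for row in conn.execute("SELECT role FROM messages ORDER BY timestamp, id")
]
by_id = [row[0] for row in conn.execute("SELECT role FROM messages ORDER BY id")]
conn.close()
```

Here by_timestamp puts the tool response ahead of the call that produced it, while by_id keeps insertion order.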

This is the sibling path to c03acca50, which already fixed get_messages but
missed get_messages_as_conversation.

Salvaged from #50356.

Co-authored-by: pprism13 <290877921+pprism13@users.noreply.github.com>
2026-07-01 15:52:37 +05:30
kshitijk4poor
cde3ca4ebf fix(gateway): widen force-exit to SystemExit paths + os._exit regression tests (#53107)
Builds on the salvaged force-exit fix:
- Route the start_gateway() SystemExit paths (clean-fatal-config #51228,
  planned-restart, service-restart) through the same os._exit backstop. Those
  paths previously fell through to normal interpreter finalization, leaving
  them vulnerable to the SAME wedged-non-daemon-thread hang the boolean-return
  paths now avoid. main() catches SystemExit, maps its code to an integer
  (None->0, int->code, str->1), and passes that to os._exit. Every exit path
  is now wedge-proof.
- Document in the helper why bypassing atexit is safe (remove_pid_file +
  release_gateway_runtime_lock are performed explicitly in start_gateway
  teardown) and why logging is not flushed (synchronous RotatingFileHandlers).
- Tests: assert termination via os._exit not SystemExit (adapted from
  @AgenticSpark's PR #53122, a duplicate of #53121), plus SystemExit(78) is
  routed through os._exit(78) and SystemExit(None) maps to os._exit(0).
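
A sketch of the SystemExit code mapping described above. The helper names and the injectable hard_exit parameter are assumptions added for testability, not the gateway's actual code:

```python
import os
from typing import Callable, Union


def exit_code_from_system_exit(code: Union[int, str, None]) -> int:
    """Map SystemExit.code to a process exit status: None->0, int->code, str->1."""
    if code is None:
        return 0
    if isinstance(code, int):
        return int(code)
    return 1


def run_with_hard_exit(
    entry: Callable[[], None], hard_exit: Callable[[int], None] = os._exit
) -> None:
    """Run entry(); route any SystemExit through hard_exit instead of re-raising.

    hard_exit defaults to os._exit, which skips atexit handlers and normal
    interpreter finalization, so cleanup must already have happened.
    """
    try:
        entry()
    except SystemExit as exc:
        hard_exit(exit_code_from_system_exit(exc.code))
```

Passing a recording function as hard_exit lets a test assert on the code without terminating the test process.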
2026-07-01 15:51:57 +05:30
teknium1
1c350728ec chore(release): map Lazymonter into AUTHOR_MAP for PR #42914 salvage 2026-07-01 03:21:20 -07:00
HiaHia
8feeb0ccb8 fix(gateway): retry launchd bootstrap after bootout on EIO for install/start
On macOS, `launchctl bootstrap` of a label still registered in the domain
fails with 5: Input/output error (EIO). That is the *already loaded* case — a
stale registration from an interrupted restart or a bootout that didn't settle
— recoverable by booting the leftover out and bootstrapping again, and distinct
from the domain being genuinely unmanageable.

launchd_install and launchd_start (both bootstrap paths) treated exit 5 as
'launchd cannot manage this macOS version' and silently degraded to a detached
process, losing auto-start at login and crash-restart. Centralize bootstrap in
_launchctl_bootstrap(), which on EIO boots the stale label out and retries once;
only if the retry also fails does the error propagate so callers apply their
existing _launchctl_domain_unsupported fallback for a genuinely broken domain.
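
The bootstrap, bootout, retry-once flow might look roughly like the sketch below. The function name, the injectable run parameter, and ignoring a failed bootout are all assumptions; the real _launchctl_bootstrap() may differ on each point:

```python
import subprocess
from typing import Callable, Sequence

EIO_EXIT = 5  # launchctl's "5: Input/output error"


def _run(cmd: Sequence[str]) -> None:
    # Raises subprocess.CalledProcessError on a non-zero exit status.
    subprocess.run(list(cmd), check=True)


def bootstrap_with_eio_retry(
    domain: str,
    plist_path: str,
    label: str,
    run: Callable[[Sequence[str]], None] = _run,
) -> None:
    """Bootstrap a launchd job; on EIO, boot the stale label out and retry once."""
    bootstrap_cmd = ["launchctl", "bootstrap", domain, plist_path]
    try:
        run(bootstrap_cmd)
        return
    except subprocess.CalledProcessError as exc:
        if exc.returncode != EIO_EXIT:
            raise  # not the "already loaded" case: no bootout, just propagate
    try:
        run(["launchctl", "bootout", f"{domain}/{label}"])
    except subprocess.CalledProcessError:
        pass  # assumption: bootout is best-effort, the retry decides the outcome
    run(bootstrap_cmd)  # a second failure propagates to the caller
```

Injecting run keeps the sketch testable on machines without launchctl.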

launchd_restart already boots out before bootstrapping (its drained job is
almost always still registered, so a plain bootstrap would hit EIO on the common
path), so it keeps its explicit pre-bootout rather than routing through the
bootstrap-first helper. Corrected the stale exit-5 comment that claimed it
always meant an unmanageable domain.

Adds TestLaunchctlBootstrapEioRetry covering clean bootstrap (no bootout),
EIO -> bootout -> retry success, persistent EIO re-raise, and non-EIO re-raise
without a spurious bootout.
2026-07-01 03:21:20 -07:00
teknium1
69f08c2eb5 fix(telegram): guard _post_connect_task access for object.__new__ test pattern
disconnect() reads self._post_connect_task, but several tests build a bare
TelegramAdapter via object.__new__() without calling __init__ (which sets the
attr). Use getattr(..., None) so disconnect() works on those instances too
(pitfall #17).
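
A small illustration of why the getattr guard matters, using a hypothetical stand-in class rather than the real TelegramAdapter:

```python
class BareAdapter:
    """Hypothetical stand-in for an adapter whose __init__ sets the attribute."""

    def __init__(self) -> None:
        self._post_connect_task = None

    def disconnect(self) -> bool:
        # getattr with a default tolerates instances built without __init__.
        task = getattr(self, "_post_connect_task", None)
        if task is not None:
            task.cancel()
            return True
        return False


# Tests sometimes construct instances this way, skipping __init__ entirely.
bare = object.__new__(BareAdapter)
has_attr = hasattr(bare, "_post_connect_task")
cancelled_something = bare.disconnect()
```

A plain self._post_connect_task read in disconnect() would raise AttributeError on the bare instance.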
2026-07-01 03:18:57 -07:00
LeonSGP43
3362bdb4e5 fix(telegram): defer post-connect housekeeping off the connect path
Command-menu registration (set_my_commands), status-indicator setup, and
DM-topic setup make Bot API calls that can stall for certain bot tokens.
They ran inside connect(), before or after _mark_connected() but still within
the coroutine the gateway wraps in a connect timeout, so one slow call timed
out the whole connect and the adapter never came up, even though
polling/webhook was already live (getMe works via curl). Fixes #46298.

- mark connected as soon as polling/webhook startup succeeds
- move command-menu, status-indicator, and DM-topic setup into a cancellable
  background housekeeping task (_run_post_connect_housekeeping)
- cancel that task during disconnect so it can't fire into a torn-down client
- harden scope-name lookup with getattr fallback
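
The pattern in the list above can be sketched with asyncio. The class and the one-hour sleep standing in for a stalled Bot API call are illustrative assumptions, not the adapter's real code:

```python
import asyncio


class SketchAdapter:
    """Hypothetical adapter showing deferred, cancellable housekeeping."""

    def __init__(self) -> None:
        self.connected = False
        self.housekeeping_done = False
        self._post_connect_task = None

    async def _run_post_connect_housekeeping(self) -> None:
        await asyncio.sleep(3600)  # stands in for a stalled Bot API call
        self.housekeeping_done = True

    async def connect(self) -> None:
        # Mark connected first, then push slow setup into a background task.
        self.connected = True
        self._post_connect_task = asyncio.create_task(
            self._run_post_connect_housekeeping()
        )

    async def disconnect(self) -> None:
        task = getattr(self, "_post_connect_task", None)
        if task is not None and not task.done():
            task.cancel()
            try:
                await task
            except asyncio.CancelledError:
                pass  # expected: housekeeping was cancelled on purpose
        self.connected = False


async def demo():
    adapter = SketchAdapter()
    # The connect timeout no longer covers the slow housekeeping work.
    await asyncio.wait_for(adapter.connect(), timeout=1.0)
    was_connected = adapter.connected
    await adapter.disconnect()
    return was_connected, adapter.housekeeping_done, adapter._post_connect_task.cancelled()


result = asyncio.run(demo())
```

connect() returns well inside the timeout even though housekeeping would take an hour, and disconnect() cancels the pending task before the client is torn down.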

Salvaged onto the relocated plugin adapter (plugins/platforms/telegram/
adapter.py) since the original PR #46404 targeted the pre-migration
gateway/platforms/telegram.py path.

Co-authored-by: Hermes Agent <teknium@nousresearch.com>
2026-07-01 03:18:57 -07:00
Tranquil-Flow
122e5bc037 fix(agent): retry 413 after stripping vision payloads (#47339)
When text compression can't reduce a 413 request further, evict base64
image parts from tool messages and retry once instead of dead-ending
with 'Payload too large and cannot compress further.'

A 413 is a request-body byte-size limit, not a token limit. browser_vision
screenshots (2-5MB base64 each) keep the HTTP body oversized even after
aggressive summarization. The strip step passes remember_model=False so a
413 does not poison _no_list_tool_content_models — that set is for providers
that reject list-type tool content, a distinct failure mode.
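
A rough sketch of the image-eviction step, assuming OpenAI-style list content with "image_url" parts carrying data: URLs. The function names and placeholder text are hypothetical, and the real strip pass may use a different message shape:

```python
from typing import Any, Dict, List, Tuple

PLACEHOLDER = "[image removed to fit request size limit]"  # hypothetical wording


def _is_base64_image(part: Any) -> bool:
    if not isinstance(part, dict) or part.get("type") != "image_url":
        return False
    image_url = part.get("image_url")
    url = image_url.get("url", "") if isinstance(image_url, dict) else ""
    return isinstance(url, str) and url.startswith("data:")


def strip_tool_images(
    messages: List[Dict[str, Any]],
) -> Tuple[List[Dict[str, Any]], int]:
    """Return (new_messages, stripped_count); the input list is not mutated."""
    stripped = 0
    out: List[Dict[str, Any]] = []
    for msg in messages:
        content = msg.get("content")
        if msg.get("role") == "tool" and isinstance(content, list):
            kept = []
            for part in content:
                if _is_base64_image(part):
                    stripped += 1
                    kept.append({"type": "text", "text": PLACEHOLDER})
                else:
                    kept.append(part)
            msg = {**msg, "content": kept}
        out.append(msg)
    return out, stripped
```

A caller would retry the request once only when stripped_count is greater than zero, since otherwise the body size has not changed.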

Cherry-picked from #47397 by Tranquil-Flow; placed onto main's current
token-aware 413 recovery else branch.
2026-07-01 03:18:41 -07:00