Commit graph

1646 commits

Author SHA1 Message Date
Jacky Zeng
25aa626cb4 fix(vision): forward custom-endpoint credentials in vision auto-detect
A custom:<name> main provider resolves at runtime to the bare provider id
"custom". In the vision auto-detect chain, the main-provider branch called
resolve_provider_client("custom", ...) WITHOUT explicit_base_url/api_key,
so it returned (None, None) ("no endpoint credentials found") and the whole
chain fell through to OpenRouter/Nous. A user on a custom endpoint with no
aggregator configured then got "No LLM provider configured for task=vision
provider=auto" on every image, even though their main model fully supports
vision.

Recover the live endpoint that set_runtime_main() records each turn
(_RUNTIME_MAIN_BASE_URL/_API_KEY/_API_MODE) and forward it to Step 1, with
a fallback to _resolve_custom_runtime() for non-gateway callers. Mirrors the
existing explicit-base_url branch directly above.

Adds TestResolveVisionCustomProvider covering custom, custom:<name>, and the
no-runtime fallback path.
2026-07-03 03:54:01 -07:00
Jiahui-Gu
8bf797f1c2 fix(agent): prefer native vision over auxiliary fallback in auto mode (#29135) 2026-07-03 03:43:35 -07:00
liuhao1024
5e11628546 fix(image_routing): check stripped custom:<name> provider key for vision override
When model.provider is set to custom:<name>, _supports_vision_override()
previously tried only the runtime provider key ('custom') and the raw
config value ('custom:my-proxy'). It did not try the stripped name
('my-proxy'), which is the actual key under providers: in config.yaml.

This caused native image routing to fall back to text mode even when the
user explicitly declared supports_vision: true on the named provider's
model entry.

Fixes #39963
2026-07-03 03:33:06 -07:00
Teknium
1c4cc00f73
fix(moa): user_turn fanout — synthetic advisory marker must not count as a user turn (#57598)
The advisory view appends a synthetic user marker when it ends on an
assistant turn (Anthropic end-on-user rule) — i.e. on every tool iteration
after the first. The user_turn prefix hash treated that marker as the last
user message, so the hashed prefix included the grown mid-turn context and
the signature changed every iteration: advisors re-ran per iteration,
silently defeating the once-per-turn cadence (live smoke test: 2 fan-outs
for a 2-iteration task; expected 1). Hoist the marker to a module constant
and skip it when locating the last REAL user message. Verified: iteration-2
signature now equals iteration-1 (cache HIT); a new real user message still
re-triggers the fan-out.
2026-07-03 01:24:58 -07:00
Teknium
9e044cf795
feat(moa): per-preset fanout cadence — user_turn runs advisors once per user turn (#57591)
New preset key 'fanout': 'per_iteration' (default, unchanged behavior)
re-runs the reference fan-out whenever the advisory view changes — every
tool iteration. 'user_turn' runs the advisors ONCE per user turn and lets
the aggregator act alone for the rest of the tool loop — the original MoA
shape (upfront multi-model synthesis, then a single acting model), and the
obvious lever on MoA's wall/cost multiplier (advisor generation dominates
per-turn latency).

Implementation reuses the existing turn-scoped reference cache: in
user_turn mode the cache signature hashes only the prefix up to the LAST
user message, so mid-turn advisory-view growth doesn't change the key and
iteration 2+ is a cache HIT (advice reused, zero advisor spend, no
re-trace). A new user message changes the prefix and re-triggers the
fan-out. Unknown fanout values normalize to per_iteration.
2026-07-03 01:02:44 -07:00
Teknium
372f8195c7
fix(moa): default temperatures to unset — provider default, like single-model agents (#57440)
A single-model Hermes agent never sends temperature; the provider default
applies. MoA hardcoded reference_temperature=0.6 / aggregator_temperature=0.4,
and the coercion float(preset.get(key, 0.6) or 0.6) made unset IMPOSSIBLE to
express: absent, null, empty, and even an explicit 0 all collapsed to the
baked-in default. Every MoA advisor and aggregator therefore ran at 0.6/0.4
while the same model running solo used the provider default — silently
skewing solo-vs-MoA comparisons and overriding provider-tuned defaults.

- moa_config normalization: temperatures coerce to None when absent/blank/
  invalid (new _coerce_float_or_none); explicit values incl. 0 honored.
- moa_loop: _preset_temperature() resolves preset values; None flows to
  call_llm, which already omits the parameter when None (same contract as
  max_tokens). Aggregator still inherits the acting agent's own configured
  temperature when the preset doesn't pin one.
- conversation_loop (context-mode MoA): same resolution, no more hardcoded
  0.6/0.4 at the call site.
- DEFAULT_CONFIG preset + web_server payload models + docs updated: unset
  is the default, pinning stays available.
2026-07-03 00:22:49 -07:00
kshitijk4poor
e1a1dac848 fix(agent): enforce marker-strip invariant with a single terminal sweep (#57491)
Follow-up to the per-site strips from the review gate. The two copy-site
strips are correct but positional — a copy site added after the assembly
loops would re-leak _db_persisted into the child-session flush. Add a single
terminal sweep (_strip_persistence_markers) run once on the fully-assembled
compressed list so the invariant 'no compacted message leaves compress()
carrying a persistence marker' is structural, not dependent on copy-site order.

- agent/context_compressor.py: _strip_persistence_markers() called before
  compress() returns; helper docstring notes the sweep is the authoritative guard
- tests/agent/test_context_compressor.py: structural regression — neuter the
  per-site helper to a leaking copy, assert the terminal sweep still strips
- tests/run_agent/test_compression_persistence.py: pin the fixture assumption
  behind the exact-equality row-count assertion
2026-07-03 12:51:12 +05:30
nankingjing
3e204bd771 fix(agent): strip _db_persisted when assembling rotation compression transcript (#57491)
Shallow messages[i].copy() during context compression propagated the
_db_persisted marker from cached gateway incremental flushes into the
post-rotation compressed list. _flush_messages_to_session_db then skipped
every row when writing to the new child session, so gateway restarts
lost the compacted transcript (severe amnesia).

Strip the marker in _fresh_compaction_message_copy() and add regression
tests for rotation flush + compressor assembly.

Fixes #57491
2026-07-03 12:51:12 +05:30
kshitijk4poor
0950dae2fa Merge remote-tracking branch 'upstream/main' into HEAD
# Conflicts:
#	scripts/release.py
2026-07-03 03:52:15 +05:30
kshitijk4poor
1c93799b49 fix(agent): self-review follow-ups on vLLM local-context salvage
Self-review (ruff+ty lint diff = 0 net-new; 2-agent deep review) surfaced one
Warning + comment-accuracy nits; no Critical:

- W1: the local-probe TTL cache memoized None (probe failure) for 30s, so a
  probe that failed during a startup race would suppress a legit retry once
  the server came up. Cache only positive results — still fully bounds the
  hot-path probe rate (reachable servers cache their value) while an
  unreachable one re-probes on the next call. Add a regression test asserting
  a None result is NOT cached (retry re-probes); mutation-verified.

- Tighten the platform-guard comment: gateway/TUI/cron already construct with
  quiet_mode=True (gated by `not agent.quiet_mode`), so the guard's active job
  is CLI dedup vs show_banner, not "filling the gateway/TUI gap" as originally
  worded.

Verified not-issues (per review): positive-value 30s cache does not break the
reconcile-after-restart freshness contract (restart = fresh process, empty
cache); cache key is collision-safe; platform guard is correct in both
directions (no runtime path leaves platform None on a non-CLI surface).

Tests: 149 passed. ruff clean; ty 0 net-new vs base.
2026-07-03 03:36:22 +05:30
kshitijk4poor
b9a197ec59 fix(agent): resolve review findings on vLLM local-context salvage
Salvage review of #56431 surfaced one Critical + two Warning issues; fix
them on top of the contributor's cherry-picked commits:

1. Critical — duplicate non-agentic warning on the interactive CLI. The new
   agent_init warning fires on every platform, but cli.py show_banner()
   already warns on CLI (richer output + /model hint), so a CLI user saw the
   warning twice per startup. Guard the agent_init emit to skip platform=="cli"
   — it now fills exactly the gateway/TUI gap the PR intended, no duplication.

2. Warning — vLLM error-parse regex under-matched. The patterns required a
   literal space before the number, so "max_model_len: 32768", "=32768",
   "(32768)", and "... is 32768" all returned None. Broaden both patterns to
   accept :/=/(/ 'is' delimiters. Add a parametrized test over all delimiter
   variants.

3. Warning — per-call live probe latency on local endpoints. The new
   reconcile-on-hit + pre-defaults step-7 probe made every local resolution
   fire a synchronous network probe (banner + /model switch + compressor
   update_model each within one startup). Add a 30s in-process TTL cache
   keyed by (model, base_url) around _query_local_context_length so back-to-
   back resolutions reuse one round-trip; not persisted to disk, so the
   reconcile freshness contract (re-probe after restart) is preserved. Add an
   autouse fixture clearing the cache between tests + TTL coverage.

Tests: 148 passed (was 138). ruff clean.
2026-07-03 03:27:13 +05:30
infinitycrew39
cecedcddf3 fix(agent): honor live vLLM context limits on local endpoints
Reconcile stale local disk cache against live vLLM/Ollama max_model_len
probes, probe local servers before the llama hardcoded default, parse
vLLM max_model_len overflow errors, and surface the non-agentic Hermes 3/4
warning at agent init on gateway/TUI.

Sub-64K live probes are returned for startup rejection but are not
persisted to the context cache — preserving the 64K minimum-context
contract instead of normalizing undersized windows as valid config.

(cherry picked from commit c3a02db4fd9d57b7b0eb2732de91f8334d311aa5)
2026-07-03 03:22:51 +05:30
Teknium
3a122ba4ac
fix(usage): capture reasoning_tokens from completion_tokens_details on chat_completions (#57340)
normalize_usage only read output_tokens_details.reasoning_tokens (the
Responses API shape). Chat Completions providers — OpenAI, OpenRouter,
DeepSeek, and every OpenAI-compatible proxy — report it under
completion_tokens_details.reasoning_tokens, so reasoning_tokens was 0 for
every chat_completions reasoning model: hidden thinking was invisible in
session accounting, MoA traces, and the eval's per-task token columns.

Measured impact (HermesBench MoA run on deepseek-v4-flash, 4,828 advisor
calls): reasoning_tokens showed 0 everywhere while individual calls burned
up to 21.5K hidden thinking tokens to emit ~500 visible tokens. Verified
live against OpenRouter: deepseek-v4-flash returns
completion_tokens_details.reasoning_tokens=61 for a 74-completion-token
call; the field was simply never read.

Responses-shape reads are unchanged; the new read only fires when the
Responses shape yielded nothing.
2026-07-02 13:52:42 -07:00
teknium1
254328bf56 fix(auth): remove stale loopback_pkce reference in xAI quarantine removal list
The terminal-refresh quarantine filtered in-memory entries on
source == "device_code" but built removed_ids from the deleted
"loopback_pkce" source name, so the revoked device-code entry was
never pruned from the persisted pool in auth.json. Also restores the
_print_loopback_ssh_hint test suite scoped to Spotify (the helper's
remaining caller) instead of deleting it wholesale.
2026-07-02 13:17:41 -07:00
Jaaneek
5ef0b8acb0 feat(auth): make xAI Grok OAuth device-code-only, drop loopback login
Replace the loopback/PKCE-callback server and manual-paste fallback with
the RFC 8628 device-code flow as the only xAI Grok OAuth login path. The
flow works in headless/SSH/container sessions with no 127.0.0.1 listener,
shrinking the local attack surface.

- Poll the token endpoint with server-provided interval, honoring
  slow_down and expires_in; store tokens with auth_mode
  oauth_device_code.
- Adaptive proactive refresh skew for short-lived device-code JWTs;
  rotated tokens sync back to auth.json, the global root store, and the
  credential pool (no refresh-token replay).
- Clear source suppression on successful re-login (CLI + dashboard) and
  drop the duplicate dashboard pool entry so exactly one seeded
  device_code entry exists.
- Use the shared device_code source name for consistency with the
  nous/codex device-code providers.
- Desktop: remove the loopback OAuth flow states and dead type variants;
  pkce providers' sign-in URL selection is unchanged.
- Docs (EN + zh-Hans) rewritten for device-code login; drop the deleted
  --manual-paste flag from documented commands.
2026-07-02 13:17:41 -07:00
Jneeee
b98baa3039 feat(config): extra HTTP headers for LLM API calls (#3526 salvage)
Named providers / custom_providers entries in config.yaml now accept an
extra_headers dict scoped to that endpoint — for reverse proxies, API
gateways, and custom auth schemes (e.g. Cloudflare Access service tokens).

- hermes_cli/config.py: normalize extra_headers on provider entries
  (_normalize_custom_provider_entry + providers-dict translation), add
  get_custom_provider_extra_headers /
  apply_custom_provider_extra_headers_to_client_kwargs helpers keyed on
  base_url (case/trailing-slash insensitive, no substring bypass —
  mirrors the TLS helpers)
- hermes_cli/runtime_provider.py: surface extra_headers in the resolved
  runtime for named custom providers (providers dict, legacy
  custom_providers list, and the credential-pool path)
- run_agent.py / agent/agent_init.py: merge per-provider extra_headers
  onto the OpenAI client default_headers at construction and on every
  _apply_client_headers_for_base_url re-application (credential swaps,
  rebuilds), most-specific level wins; OpenAI-wire only (native
  Anthropic/Bedrock scoped out)
- agent/auxiliary_client.py: accept model.extra_headers as an alias of
  model.default_headers for the global variant
- cli-config.yaml.example: documented commented example
- Header values are treated as secrets and never logged

Salvaged from PR #3526 by @jneeee, reimplemented against current main.

Co-authored-by: Teknium <127238744+teknium1@users.noreply.github.com>
2026-07-02 05:33:25 -07:00
kshitijk4poor
0a2d4a6eea docs(codex): clarify stale-floor docstring reflects the 10k gate
The helper docstring described the typical ~15-25k gateway payload but
read as if that were the trigger range; the floor actually engages above
10k tokens. Clarify the prose to match the gate.
2026-07-02 17:05:05 +05:30
HexLab98
cb1ccc57e6 fix(codex): extend stale timeout for gateway-scale tool payloads
Lower the openai-codex stale-timeout floor from 25k to 10k estimated
tokens so Telegram/gateway sessions (~20k tools+instructions) are not
aborted at the generic 90s cutoff while Codex is still prefilling.
2026-07-02 17:05:05 +05:30
Teknium
3f2a56d1a4
fix(cli): reliable interrupts, bounded exit, and exit feedback (#57000)
Three CLI reliability fixes:

1. Interrupt reliability: chat() only re-queued the user's interrupt
   message when the turn result carried interrupted=True. When the agent
   thread raced past its last interrupt check (or finished) before the
   interrupt landed, the message was silently dropped — and the stale
   _interrupt_requested flag left on the agent instantly aborted the
   NEXT turn. Un-acknowledged interrupt messages are now re-queued as
   the next turn and the stale flag is cleared (only when the agent
   thread actually exited). The clarify-race path also parks the message
   in _pending_input instead of dropping it.

2. Slow exit (5+ min): stdlib ThreadPoolExecutor workers are non-daemon
   and joined unconditionally by concurrent.futures' atexit hook — even
   after shutdown(wait=False). One wedged tool worker (abandoned after
   interrupt/timeout) held the process open forever. Promoted
   async_delegation's daemon executor to a shared tools/daemon_pool
   module and adopted it in tool_executor (concurrent tool batches),
   memory_manager (background sync), delegate_tool (child timeout wrapper
   + batch fan-out), and skills_hub (source fan-out). Added a 30s exit
   watchdog (HERMES_EXIT_WATCHDOG_S) armed at _run_cleanup start as a
   backstop for wedged cleanup steps.

3. Exit jank: after prompt_toolkit tears down the input/status bars the
   terminal sat silent for the whole cleanup window, looking hung. Print
   'Shutting down… (finalizing session)' immediately at exit start.

E2E: live PTY interrupt of a foreground 'sleep 120' terminal tool now
aborts in ~1s and the typed message runs as the next turn; wedged-worker
+ wedged-cleanup subprocess exits in 5.8s (watchdog) instead of hanging.
2026-07-02 04:20:43 -07:00
kshitijk4poor
b837f07dcd fix(agent): route restore custom-pool match through canonical helper
Follow-up on the salvaged #56392 guard. The cherry-picked change matched
custom:<name> pool entries against the primary by raw base_url string
equality, which (a) can't disambiguate two named custom providers sharing
one gateway base_url and (b) left a latent bare-"custom" entry bypass.

Route the match through get_custom_provider_pool_key(rt[base_url]) compared
against the entry's custom:<name> key, mirroring the sibling guard in
recover_with_credential_pool. Use CUSTOM_POOL_PREFIX instead of the literal.

Add regression tests for the custom same-endpoint (swap) and cross-endpoint
(skip) branches, plus the plain-provider fallback-pool case from #56885.
2026-07-02 13:41:53 +05:30
openhands
820a052575 fix(agent): keep primary runtime restore on matching credential pool (#56374) 2026-07-02 13:41:53 +05:30
Teknium
fb403a3a73
fix(auxiliary): retry transient blips harder + isolate client cache per model (#56889)
Two related hardening fixes for auxiliary calls (which include MoA reference
advisors — a pinned-model path where provider fallback is not a meaningful
recovery):

1. Transient-transport retries: the same-provider retry on a connection reset /
   timeout / 5xx / 408 was a single attempt, then fallback. For a pinned aux
   call a second blip silently loses the call (root of the run2 double-advisor
   'Connection error' collapse — a genuine upstream blip). Now retries N times
   with exponential backoff, N = auxiliary.transient_retries (default 2 -> 3
   total attempts, clamped [0,6]). Compression-on-timeout fast-fail carve-out
   preserved.

2. Per-model client-cache isolation: _client_cache_key excluded the model, so
   two concurrent auxiliary calls to the same provider/base_url/key but
   different models (e.g. an opus + gpt-5.5 MoA fan-out) shared one cache entry
   and could race each other's client lifecycle. Model now participates in the
   key -> distinct clients, no cross-call races. Same-model reuse unchanged.

- agent/auxiliary_client.py: _transient_retry_count() + backoff loop; model in
  _client_cache_key and both call sites.
- hermes_cli/config.py: auxiliary.transient_retries default (2).
- tests: new retry/isolation tests; updated 2 stale-expectation tests to the
  corrected behavior (per-model resolve; N-retry escalation).

Backoff base is overridable (_TRANSIENT_RETRY_BACKOFF_BASE) so tests don't sleep.
2026-07-02 01:09:37 -07:00
Teknium
543d305bbb
feat(moa): add reference_max_tokens to cap advisor output and cut turn latency (#56756)
MoA per-turn latency is dominated by advisor GENERATION: turn wall time
correlates ~0.88 with output tokens and ~-0.03 with input tokens (measured over
52 turns). Each turn waits for the slowest advisor to finish writing, and
advisors were uncapped — writing multi-thousand-token essays the aggregator
only needs the gist of.

Add an opt-in per-preset reference_max_tokens knob (mirrors reference_temperature)
that caps ADVISOR output only; the acting aggregator is never capped. Default
None = uncapped, so existing presets are byte-for-byte unchanged (no regression).
Wired through both MoA execution paths (MoAChatCompletions.create and
aggregate_moa_context).

E2E: same task, closed preset uncapped vs reference_max_tokens=600 -> 59s to 33s
(~44% faster), final answer identical/correct.

- hermes_cli/moa_config.py: _coerce_int_or_none helper + reference_max_tokens
  in _normalize_preset/_default_preset/flattened view
- agent/moa_loop.py: read preset.reference_max_tokens, pass to reference fan-out
- agent/conversation_loop.py: pass reference_max_tokens on the per-turn path
- tests + docs
2026-07-02 00:16:35 -07:00
kshitij
88d1d6206f
fix(streaming): handle completed responses with empty/None choices (#55933) (#56713)
* fix(streaming): handle completed responses with empty/None choices

The streaming fallback guard added in #55932 recognized a completed
response object only when its `choices` was a non-empty list. But an
adapter can return a completed response whose `choices` is `None` or an
empty list (an error / content-filter / terminal frame) — still a whole,
non-iterable response, not a token stream. Those shapes fell through to
`for chunk in stream` and crashed with

    'types.SimpleNamespace' object is not iterable

which is exactly issue #55933 (MoA `openai-codex` aggregator on
TUI/Desktop, where a stream consumer forces the streaming path).

Broaden the guard to discriminate on the PRESENCE of a `choices`
attribute (a genuine provider Stream object exposes none), disable
streaming for the session, and return the completed object so the outer
loop's normal invalid-response validation handles empty/None choices via
its retry path instead of iterating.

Based on the diagnosis in #56525 by @spiky02plateau (that PR normalized
the MoA aggregator return with a one-shot chunk iterator; the common
text/tool-call crash was already fixed at this seam by #55932, so this
extends the existing guard to cover only the remaining empty/None-choices
gap).

Fixes #55933

* refactor(streaming): simplify empty-choices guard body and parametrize tests

Post-review cleanup (no behavior change):
- Inline the single-use `response_choices` local and drop the redundant
  `if first_choice is not None else None` guard (getattr(None, ...) already
  returns the default safely).
- Collapse the two near-identical empty/None-choices regression tests into
  one `@pytest.mark.parametrize` case.

Mutation-verified: reverting the guard to the old non-empty-list condition
still makes both parametrized cases fail with the historical
'types.SimpleNamespace' object is not iterable.

---------

Co-authored-by: spiky02plateau <155588579+spiky02plateau@users.noreply.github.com>
2026-07-02 06:36:20 +05:30
helix4u
7951250947 fix(moa): lift hidden Anthropic aux output cap 2026-07-02 06:31:18 +05:30
srojk34
7f64cce96d security(vertex): route credential/project/region resolution through the profile secret scope
agent/vertex_adapter.py resolved VERTEX_CREDENTIALS_PATH,
GOOGLE_APPLICATION_CREDENTIALS, VERTEX_PROJECT_ID, and VERTEX_REGION via raw
os.environ.get() instead of the profile-scoped get_secret() every other
credential lookup in hermes_cli/runtime_provider.py uses. In a multiplex
gateway serving several profiles from one process, os.environ still holds
whichever profile's .env python-dotenv loaded at boot — so a raw read here
let one profile's turn silently mint a Vertex OAuth2 token from, and get
billed against, a different profile's GCP service account. No error, no
fail-closed guard: the multiplex UnscopedSecretError protection was bypassed
entirely because these reads never went through get_secret().

- _resolve_credentials_path/_resolve_project_override/_resolve_region now
  call agent.secret_scope.get_secret(), matching the _getenv() pattern
  already used for every other provider's credentials.
- get_vertex_credentials()'s ADC fallback (google.auth.default()) reads
  GOOGLE_APPLICATION_CREDENTIALS from os.environ internally, bypassing
  get_secret() entirely — closed with a narrow guard: when multiplexing is
  active and this profile's scope has no Vertex credentials of its own, but
  os.environ still carries a value (left by a different profile's boot-time
  dotenv load), refuse ADC rather than silently authenticate as a stranger.
- Zero behavior change for single-profile installs: get_secret() falls
  through to os.environ transparently whenever multiplexing is off.

Same bug class as the already-fixed _HERMES_OAUTH_FILE/_AUTH_JSON_PATH/
HOOKS_DIR cross-profile leaks, now closed for Vertex's OAuth2 credential
path.
2026-07-02 06:07:56 +05:30
kshitijk4poor
676236bb1d fix(agent): honor custom CA certs on aux client + harden TLS resolution
The salvaged fix wired per-provider ssl_ca_cert / ssl_verify (and
HERMES_CA_BUNDLE) into the MAIN OpenAI client. This follow-up:

- Auxiliary client parity: process_bootstrap.build_keepalive_http_client
  accepts and forwards verify; auxiliary_client._resolve_aux_verify mirrors
  the main-client TLS resolution (via load_config_readonly, the read-only
  fast path) so compression/vision/web_extract/title-gen/session_search
  honor the same per-provider CA. Without this, chat worked against a
  private-CA endpoint but every auxiliary call still failed APIConnectionError.
- switch_model now reads custom_providers from live config (load_config_readonly)
  instead of the init-time agent._custom_providers snapshot, so ssl_ca_cert /
  ssl_verify edits are honored on mid-session model switch — matching the
  context-length reload (#15779).
- Drop the dead client-level verify= where a custom httpx transport is used
  (httpx ignores it there); verify lives on the transport. Fix docstrings.
  Applies to both run_agent._build_keepalive_http_client and process_bootstrap.
- resolve_httpx_verify: add CURL_CA_BUNDLE to the env chain (consistency with
  agent/ssl_guard._CA_BUNDLE_ENV_VARS) and emit a loud logger.warning naming
  the endpoint whenever ssl_verify:false disables verification.
- get_custom_provider_tls_settings: case-insensitive base_url match (config
  dedup already lowercases; scheme/host are case-insensitive) so a mixed-case
  entry doesn't silently drop its CA. Exact match preserved — no prefix bypass.
- Demote best-effort except Exception: pass in agent_init/switch_model to
  logger.debug(exc_info=True).
- Tests for aux verify forwarding, _resolve_aux_verify, case-insensitive
  match, and prefix-bypass rejection.
2026-07-02 04:51:56 +05:30
HexLab98
3a2ba959ce fix(agent): honor custom CA certs for custom_providers HTTPS endpoints
Wire ssl_ca_cert and ssl_verify through custom_providers config and env
vars into the keepalive httpx client, fixing APIConnectionError against
mkcert/self-signed Ollama proxies behind HTTPS.
2026-07-02 04:51:56 +05:30
HexLab98
7e957cbd0b feat(agent): add resolve_httpx_verify for custom CA bundle TLS
Introduce a shared helper that maps HERMES_CA_BUNDLE, SSL_CERT_FILE, and
per-provider ssl_ca_cert settings to httpx verify contexts.
2026-07-02 04:51:56 +05:30
Brooklyn Nicholson
ec319e4e3e fix(learning_graph): guard non-dict metadata so /journey can't crash
parse_frontmatter's malformed-YAML fallback stores every value as a string,
so a skill's `metadata` can be a str. `_category`/`_related` chained
`.get("metadata", {}).get("hermes", {})` and blew up with `'str' object has
no attribute 'get'`, taking down `build_learning_graph()` (and thus /journey
and `hermes journey`) whenever any installed skill had bad frontmatter.

Extract a `_hermes_meta()` helper that returns the nested dict only when it
really is one. Fixes the whole class, not just the two call sites.
2026-07-01 16:25:48 -05:00
kshitijk4poor
b23e1c3077 refactor(approval): extract is_approval_bypass_active(); use frozen-env bypass in codex routing
Self-review follow-up on the salvaged approval-routing fix.

The initial adaptation re-read os.getenv("HERMES_YOLO_MODE") at session-build
time. That diverges from the repo's security invariant: HERMES_YOLO_MODE is
frozen into tools.approval._YOLO_MODE_FROZEN at import time precisely so a skill
running mid-process cannot set the env var and instantly flip the approval
bypass (a prompt-injection escalation path). A live re-read re-opened that hole
for the codex routing path.

- Add tools.approval.is_approval_bypass_active() — the canonical three-source
  bypass check (frozen --yolo/HERMES_YOLO_MODE + session /yolo + approvals.mode
  off) in one place. This is the 4th inline copy of that OR-chain (the three
  sites in approval.py and tui_gateway/server.py:3121 all use the same idiom);
  the helper is the shared chokepoint they can collapse onto.
- codex_runtime.py now calls is_approval_bypass_active() instead of the
  hand-rolled mode-or-session check plus a runtime env re-read.
- Update the env-yolo test to patch _YOLO_MODE_FROZEN (the canonical test
  pattern, e.g. tests/tools/test_yolo_mode.py) rather than setenv, which is
  dead-on-arrival against the frozen constant.

Fail-closed default preserved on every branch; 28 integration + 77 session/yolo
tests pass; E2E confirms the real exec decision flips decline->accept only when
bypass is active.
2026-07-01 22:58:37 +05:30
snav
0b8e81996f fix(codex-app-server): honor approvals.mode/yolo for gateway-context approval routing
On gateway/cron/non-CLI contexts the codex app-server runtime has no UI to
surface codex's exec/apply_patch approval requests, so they fail closed
(silently decline) — the bot appears responsive but cannot write files, with
no approval prompt anywhere ("patch rejected by user").

When the user has explicitly opted out of Hermes approvals (approvals.mode: off,
the /yolo session toggle, or HERMES_YOLO_MODE=1), collapse to codex's own
sandbox permission profile (~/.codex/config.toml) as the policy gate by passing
_ServerRequestRouting(auto_approve_exec=True, auto_approve_apply_patch=True) to
the session. Defaults (manual/smart/unset) preserve the current fail-closed
behavior — a no-op for users who have not opted out.

Reads the mode via the canonical tools.approval._get_approval_mode() (which
already normalizes the YAML-1.1 bare-'off'->False case) at session-build time,
so a mid-session /yolo toggle is honored too.

5 integration tests: each opt-out mechanism (config off, YAML False, env var,
session yolo) plus the default fail-closed regression guard.

Closes #26530

Co-authored-by: snav <jake@nousresearch.com>
2026-07-01 22:58:37 +05:30
Teknium
eae3700b16
fix(moa): raise aux timeouts to 900s and give the Codex aux path a stable prompt_cache_key (#56395)
Two independent MoA auxiliary-call fixes:

#53866 — auxiliary.moa_reference.timeout and auxiliary.moa_aggregator.timeout
were 600s while moa_agent was 120s. Raise both to 900s so a genuinely long
reference/aggregator turn (mixed providers, deep reasoning, long tool chains)
has headroom instead of being cut mid-generation.

#53735 — _CodexCompletionsAdapter (the Codex/Responses auxiliary path used by
the MoA acting-aggregator, compression, web_extract, session_search, etc.)
never set prompt_cache_key, so it stayed cache-cold while the MAIN Responses
transport (agent/transports/codex.py) was warm. Derive the same
content-addressed key via the shared _content_cache_key(instructions, tools)
helper and set it on the aux Responses request, with the same host guards the
main transport uses (xAI carries the key in extra_body; GitHub/Copilot opts out
of cache-key routing).

Tests: 5 new prompt_cache_key cases (set+prefixed, stable across identical
prefix, differs on different instructions, skipped for xai/github hosts).
tests/agent/test_auxiliary_client.py 279 pass; tests/hermes_cli/test_config.py
130 pass.
2026-07-01 06:02:40 -07:00
Teknium
aa605b66c8
fix(moa): price aggregator turn at its real model so session cost isn't advisor-only (#56394)
On the MoA path agent.model/provider are the virtual preset name (e.g.
"closed") and "moa", which have no pricing entry. estimate_usage_cost()
returned None for the aggregator turn, so the `if amount_usd is not None`
guard skipped it and the session's estimated_cost_usd reflected only the
advisor fan-out — a ~50% undercount when the aggregator does the full acting
loop (verified: $0.91 advisor-only vs $1.96 true, aggregator = 54%).

MoAChatCompletions.create() now stashes the resolved aggregator slot as
last_aggregator_slot (exposed via MoAClient); conversation_loop reads it to
price the aggregator turn at its real model/provider. cost_source flips from
'none' to 'provider_models_api'.
2026-07-01 06:02:33 -07:00
kshitijk4poor
b795a45b8d fix(compaction): detect and strip merge-into-tail summaries past the delimiter
Follow-up to the END-MARKER reorder: moving the summary prefix after the
[PRIOR CONTEXT] wrapper meant _is_context_summary_content (prefix-at-start)
no longer recognized a merged-tail summary. That silently broke three
consumers — the last-real-user anchor (would pick the merged summary as a
real user turn, causing active-task loss), the carry-forward summary find,
and the auto-focus skip. _strip_summary_prefix would also carry the wrapper
+ stale tail content forward as the next summary body.

Extract the two delimiter strings into _MERGED_PRIOR_CONTEXT_HEADER /
_MERGED_SUMMARY_DELIMITER constants (writer + detector stay in sync), teach
_is_context_summary_content and _strip_summary_prefix to look past the
delimiter, and add a regression test. Standalone summaries unchanged.
2026-07-01 18:23:01 +05:30
Gromykoss
a1a8a967e1 fix(compaction): place END MARKER last in merge-into-tail summaries
When the compression summary is merged into the first tail message
(the alternation corner case where a standalone summary role would
collide with both head and tail), the old format was
SUMMARY + END_MARKER + OLD_TAIL_CONTENT — so the preserved tail content
appeared AFTER the end marker and the model could read it as a fresh
message to respond to.

Reorder so the END MARKER is always last: old tail content is wrapped in
[PRIOR CONTEXT ...][END OF PRIOR CONTEXT — COMPACTION SUMMARY BELOW]
delimiters, then the summary, then the END MARKER. _append_text_to_content
handles both string and multimodal-list content.

Salvaged from #56372 by @Gromykoss. Only the END-MARKER reorder half is
carried over. The PR's second change (a post-compaction pass that strips
user-role messages before the first summary marker on compression_count>=2)
was dropped: on 2nd+ compactions the protected head decays to system-only
(_effective_protect_first_n -> 0, #11996) so the targeted 'ghost head user'
does not occur, and where the strip does fire it deletes legitimate recent
tail user turns (data loss) and can leave consecutive assistant messages
(role-alternation violation).
2026-07-01 18:23:01 +05:30
Steve Lawton
c73e74386b feat(vertex): add Google Vertex AI provider for Gemini (OAuth2)
Adds Vertex AI as a first-class provider for Gemini models via Vertex's
OpenAI-compatible endpoint. Vertex authenticates with short-lived OAuth2
access tokens (service-account JSON or ADC), not a static API key — the
missing piece behind the recurring requests (#13484, #12639, #56259).

- agent/vertex_adapter.py: OAuth2 token minting + refresh-on-expiry
  (5-min margin), ADC->service-account fallback, global vs regional
  endpoint URLs. Config precedence: env var > config.yaml > default.
- plugins/model-providers/vertex/: provider profile (auth_type=vertex),
  reuses Gemini's extra_body.google.thinking_config translation.
- runtime_provider: vertex short-circuit BEFORE the credential pool so a
  credentials-file path is never mistaken for a static API key; mints a
  fresh token + computes base_url per resolve.
- run_agent + conversation_loop: _try_refresh_vertex_client_credentials()
  re-mints the token and rebuilds the client on a mid-session 401, so a
  long-lived gateway agent survives token expiry (~1h).
- auxiliary_client: vertex auth_type branch for side-LLM tasks.
- config.yaml: vertex.project_id / vertex.region (non-secret, bridged to
  env); credential path stays in .env (VERTEX_CREDENTIALS_PATH).
- setup wizard + model picker: dedicated _model_flow_vertex; curated
  google/gemini-* model list; --provider choices.
- pricing/metadata: Vertex prices off the gemini docs snapshot; endpoint
  host auto-maps to the vertex provider (no probe spam).
- lazy_deps + pyproject [vertex] extra: google-auth, opt-in only.
- docs: guides/google-vertex.md + providers page; tests for adapter +
  runtime resolution.

Salvages and modernizes #8427 by @slawt onto current main: rewired from
the legacy PROVIDER_REGISTRY path to the provider-profile architecture,
moved non-secret config out of .env into config.yaml, and added the
per-turn 401 token-refresh the original lacked.
2026-07-01 05:25:33 -07:00
HODLCLONE
6ed2f5d76f fix: make Nous Portal access token resolution resilient
- Track auth store source path on Nous state reads and write rotated
  OAuth refresh tokens back to the same store, preventing stale-token
  replays when Hermes falls back to a global/root auth.json.
- Skip Nous fallback entries locally when no access/refresh token is
  present, suppressing repeated failed resolution attempts within a
  session.
- Sync session model metadata after fallback switches so the gateway
  DB reflects the backend that actually served the latest turn.
2026-07-01 05:06:00 -07:00
ud
c126a99fc1 fix(subdirectory_hints): catch RuntimeError from Path.expanduser()
`pathlib.Path('~user').expanduser()` raises RuntimeError when the
tilde-expansion can't resolve the user (e.g. `~500-700` where the LLM
meant "approximately 500-700" rather than a path). The hint walker's
existing `except (OSError, ValueError):` clauses do not catch
RuntimeError, so it escapes through the tool dispatcher and surfaces
in the conversation loop as a misleading

    Error during OpenAI-compatible API call #N:
    Could not determine home directory.

Reproduced across three unrelated models (openai/gpt-5-mini,
openai/gpt-5.1-codex, deepseek/deepseek-v4-flash) on terminal-tool
commands containing literal tildes in non-path contexts — common in
LLM output ("~500 agencies", "~45,000 CVEs", "~80/hr blended rate").

Reproduction (one-liner):
    >>> from pathlib import Path
    >>> Path("~500-700").expanduser()
    RuntimeError: Could not determine home directory.

Fix: extend the three `except` clauses in
agent/subdirectory_hints.py to also catch RuntimeError:

  line 138 (_add_path_candidate's outer catch around the Path().expanduser() call)
  lines 198+202 (_load_hints_for_directory's nested catches around hint_path.relative_to(Path.home()))

Tests: tests/agent/test_subdirectory_hints_tilde.py adds three cases
covering: tilde-as-approximately in heredoc commands, ~unknown_user paths,
and a regression guard that legitimate ~/path expansion still works.
2026-07-01 04:55:15 -07:00
JabberELF
18a9467fca fix(tui): prevent killpg suicide during MCP shutdown
Root cause: gateway spawns LSP servers (jdtls/pyright/yaml-ls) and
slash_worker without start_new_session=True, so they inherit the
gateway process group (= TUI parent PID). When mcp_tool
_snapshot_child_pids() races with these spawns during stdio MCP
server startup, non-MCP children leak into _stdio_pgids with the
TUI parent PGID. shutdown_mcp_servers() then killpg(tui_parent_pid,
SIGTERM), killing the TUI itself.

Evidence: tui_gateway_crash.log shows recurring SIGTERM stacks:
  shutdown_mcp_servers -> _kill_orphaned_mcp_children ->
  _send_signal -> killpg(pgid, sig) -> SIGTERM received

Fix (3 layers):
1. agent/lsp/client.py: add start_new_session=True to LSP server
   spawn so each LSP server gets its own process group/session.
2. tui_gateway/server.py: same fix for slash_worker spawn, the
   symmetric root-cause patch so no gateway direct child shares
   the TUI parent pgid.
3. tools/mcp_tool.py: add _filter_mcp_children() defense-in-depth
   that drops non-MCP children (slash_worker, jdtls/eclipse LSP)
   from the PID delta before they can poison _stdio_pgids.
2026-07-01 04:54:46 -07:00
kshitijk4poor
dc1ea005d9 fix+test(codex): self-persist projected turns; keep agent_persisted=True
Follow-up correcting the salvaged fix's persistence approach to avoid a
duplicate user-message write (verified via E2E — the #860/#42039 bug class
the original diff aimed to avoid).

Root cause: in gateway mode the AIAgent is built WITH a session_db, so the
inbound user turn is already flushed at turn start (turn_context.
_persist_session). The original fix returned agent_persisted=False, making the
gateway re-write the whole new-message slice via append_to_transcript ->
append_message (a raw INSERT with no dedup), duplicating the already-flushed
user turn.

Corrected approach (single writer): run_codex_app_server_turn now flushes its
OWN projected assistant/tool messages via _flush_messages_to_session_db (which
dedups the already-persisted user turn through _DB_PERSISTED_MARKER) and
returns agent_persisted=True so the gateway skips its write. Net result:
session_search/distill see the full codex conversation, each message persisted
exactly once.

Adds regression coverage asserting exactly-once persistence on a real
SessionDB, agent_persisted=True, FTS visibility, and standard-runtime skip-db
behaviour preserved.

Co-authored-by: Lubos Buracinsky <lubos@komfi.health>
2026-07-01 17:08:59 +05:30
Lubos Buracinsky
5558382457 fix(codex): persist app-server turns to session DB (fixes starved recall)
The codex_app_server runtime path (run_codex_app_server_turn in
agent/codex_runtime.py) is an early-return that bypasses
conversation_loop and never calls _flush_messages_to_session_db().

Meanwhile, gateway/run.py sets:

  agent_persisted = self._session_db is not None   # always True

and passes skip_db=agent_persisted to every append_to_transcript call,
assuming the agent self-persisted (correct for the standard runtime,
wrong for codex). The result: codex turn messages are persisted nowhere.
state.db accumulates only session_meta rows; session_search (full-text
search over state.db) and conversation-distill are blind to real gateway
conversations, causing 'the agent has no memory of what we discussed'.

Fix (three-part, all backward-compatible):

1. agent/codex_runtime.py — run_codex_app_server_turn success return
   now includes 'agent_persisted': False, signalling that the codex path
   did NOT self-persist its turn.

2. gateway/run.py — the agent_persisted assignment now reads:

     agent_result.get('agent_persisted', self._session_db is not None)

   For the standard runtime (which does not set the key) the default
   (self._session_db is not None) preserves the existing skip-db
   behaviour so no duplicate-write regression (#860 / #42039) occurs.
   For the codex runtime the flag is False, so the gateway writes the
   new turn's messages to state.db and FTS index.

3. gateway/run.py — the rebuilt result dict (run_agent return, which
   becomes agent_result upstream) now includes agent_persisted passed
   through from result_holder[0], with a safe True default.  Without
   this passthrough the flag set in step 1 was discarded when the result
   was reconstructed, causing agent_result.get('agent_persisted', ...)
   to always see the default True and never write codex turns.
2026-07-01 17:08:59 +05:30
Dutch Dim
154c382d65 fix(gateway): recover from truncated responses 2026-07-01 17:08:50 +05:30
kshitijk4poor
9cf47fef54 fix(auxiliary_client): demote the 2 sibling routing fall-throughs too (review)
Phase 2c review flagged that only 2 of the 4 structurally-identical
resolve_provider_client routing dead-ends were demoted. Complete the bug-class:
also demote+dedup the external-process ('not directly supported') and OAuth
('not directly supported, try auto') fall-throughs, keyed by provider name, so
none of the four dead-ends spam WARNING on a retry loop.

Add direct tests for the unhandled-auth_type and OAuth dedup paths via a
monkeypatched PROVIDER_REGISTRY (the review noted these were unverified).
Mutation-checked: reverting either sibling demotion fails its test.
2026-07-01 17:00:30 +05:30
kshitijk4poor
c0d3ceb17e fix(auxiliary_client): dedup resolve_provider_client fall-through warnings
The two fall-through branches in resolve_provider_client (unknown provider,
unhandled auth_type) logged at WARNING on every retry of a misconfigured
provider, spamming logs during retry loops. Demote both to logger.debug with
per-process dedup: the first occurrence still surfaces (a provider-name typo or
PROVIDER_REGISTRY/auth_type-drift bug is worth seeing once), while identical
repeats are suppressed for the process lifetime.

Salvaged from #56283 (extracting only the stated auxiliary_client fix; the
original PR also bundled ~2800 lines of unrelated changes across 10 other
files, which are dropped).
2026-07-01 17:00:30 +05:30
shawchanshek
3b739b990b fix(title_generator): strip think blocks from LLM output before extracting title
Think-enabled models (MiniMax M2.7, DeepSeek, etc.) emit inline
<think>...</think> reasoning even for simple prompts like title
generation, and the raw XML was leaking into session titles. Route the
title-model response through the canonical strip_think_blocks scrubber
before cleanup so every tag variant — closed pairs, unterminated blocks,
orphan closes, mixed case — is handled, not just a single literal
<think> pair.

- 2 regression tests: closed <think> pair stripped, unterminated block
  at start yields no title.

Salvaged from PR #44126 by @shawchanshek.
2026-07-01 04:18:48 -07:00
shandian64
5126902f1d fix(title): honor configured auxiliary timeout 2026-07-01 16:41:43 +05:30
Teknium
5de65624d1
fix(moa): capture streamed aggregator output into full-turn traces (#56312)
MoA full-turn traces (moa.save_traces) recorded the aggregator's acting
output only on the non-streaming path, where it's captured inline at
call time. On the streaming path — which every hermes chat --query run
and every live gateway/CLI turn takes — the aggregator's raw token
stream is handed to the live consumer, so the trace left output=null and
only pointed at the session-db assistant row. An offline audit of a
benchmark run (HermesBench drives --query) then couldn't see what the
aggregator produced without hand-joining to state.db.

Capture the resolved streamed acting text at trace-flush time (the agent
already holds it in _current_streamed_assistant_text) and fold it into
the trace, so the record is self-contained in both modes. New
output_location value inline_from_stream marks a streamed turn whose text
was captured this way; a genuinely empty acting turn (pure tool call)
still points at the session db, matching state.db exactly.

Touches only the trace side-channel — no change to the acting path,
message history, role alternation, or prompt cache.

- agent/moa_loop.py: consume_and_save_trace(..., aggregator_output_fallback)
  on both the facade and the MoAClient wrapper; prefer inline capture,
  fall back to the resolved streamed text.
- agent/moa_trace.py: embed the fallback; add inline_from_stream location.
- agent/conversation_loop.py: pass _current_streamed_assistant_text at flush.
- tests: 5 cases across streaming / non-streaming / empty-fallback / no-double-write.
2026-07-01 04:07:46 -07:00
arminanton
e2fa509bf3 fix(review): isolate the background-review fork from the canonical session
The forked skill/memory review agent shares the parent's session_id for
prompt-cache warmth. Without isolation it wrote its harness turn ('Review the
conversation above and update the skill library…') plus its curator-mode reply
straight into the user's REAL session in state.db; the next live turn re-read
that injected user message as a standing instruction and the agent 'became' the
curator, refusing the actual task.

Root fix: a _persist_disabled flag on the fork that hard-stops every DB write
and lazy-open path (_flush_messages_to_session_db, _ensure_db_session,
_get_session_db_for_recall) — the review writes only to the skill/memory stores
via its tools. Defense-in-depth: _strip_background_review_harness drops any
stray harness message (and the assistant reply that followed) at load time in
get_messages_as_conversation, so an already-polluted session resumes clean.

Salvaged from #50296.

Co-authored-by: arminanton <29869547+arminanton@users.noreply.github.com>
2026-07-01 16:21:39 +05:30
pefontana
a04b7024ff fix(error-classifier): route 5xx context-overflow into compression
Local inference servers (llama.cpp/llama-server, vLLM/Ollama behind a
Cloudflare/Tailscale hop) report context overflow with HTTP 500/502/503/529
instead of 400/413. _classify_by_status returned server_error/overloaded and
retried blindly, then dropped the turn with no compaction. Route explicit
_CONTEXT_OVERFLOW_PATTERNS matches on those 5xx codes to context_overflow
(should_compress=True); plain 500 stays server_error, plain 503 overloaded.
2026-07-01 16:14:16 +05:30