hermes-agent

Author	SHA1	Message	Date
Teknium	22c5048d9c	fix(moa): restore prompt caching for the aggregator and advisors (#57675 ) Two caching holes made MoA re-bill essentially its entire input stream: 1. AGGREGATOR: anthropic_prompt_cache_policy() judged the agent's own model/provider — on the MoA path those are the virtual preset name and 'moa', which match no caching branch, so _use_prompt_caching was False and the acting aggregator (Claude on OpenRouter) ran with ZERO cache_control breakpoints. Measured on identical opus-4.8 sessions: 85% cache share solo vs 2% via MoA — ~30M re-billed input tokens on one 132-task benchmark run. Fix: when provider == 'moa', resolve the policy from the preset's real aggregator slot (provider/model/base_url/api_mode via resolve_runtime_provider). 2. ADVISORS: _run_reference never applied cache_control at all, and Anthropic caching is opt-in per request — Claude advisors served 0 cache reads across 1,227 benchmark calls (11.5M re-billed input tokens) even though the advisory view is append-only across iterations (stable prefix; the synthetic end marker is last so it never pollutes it). Fix: _maybe_apply_advisor_cache_control() reuses the SAME policy function and SAME system_and_3 layout as the main loop, judged on the advisor slot's own resolved runtime — advisor requests are now decorated exactly like an acting agent on that provider. Auto-caching routes (OpenAI-family) are left untouched by policy. Live-verified on the wire (per-iteration opus+gpt5.5 preset, 4 fan-outs): claude advisor fan-out 2-3 cache_write=2161/2344, fan-out 4 cache_read=2206 / fresh_in=2; aggregator session cache share 84%/77% (vs 2%/0% before). Sub-1024-token prompts correctly stay uncached (Anthropic minimum).	2026-07-03 04:08:48 -07:00
Teknium	1c4cc00f73	fix(moa): user_turn fanout — synthetic advisory marker must not count as a user turn (#57598 ) The advisory view appends a synthetic user marker when it ends on an assistant turn (Anthropic end-on-user rule) — i.e. on every tool iteration after the first. The user_turn prefix hash treated that marker as the last user message, so the hashed prefix included the grown mid-turn context and the signature changed every iteration: advisors re-ran per iteration, silently defeating the once-per-turn cadence (live smoke test: 2 fan-outs for a 2-iteration task; expected 1). Hoist the marker to a module constant and skip it when locating the last REAL user message. Verified: iteration-2 signature now equals iteration-1 (cache HIT); a new real user message still re-triggers the fan-out.	2026-07-03 01:24:58 -07:00
Teknium	9e044cf795	feat(moa): per-preset fanout cadence — user_turn runs advisors once per user turn (#57591 ) New preset key 'fanout': 'per_iteration' (default, unchanged behavior) re-runs the reference fan-out whenever the advisory view changes — every tool iteration. 'user_turn' runs the advisors ONCE per user turn and lets the aggregator act alone for the rest of the tool loop — the original MoA shape (upfront multi-model synthesis, then a single acting model), and the obvious lever on MoA's wall/cost multiplier (advisor generation dominates per-turn latency). Implementation reuses the existing turn-scoped reference cache: in user_turn mode the cache signature hashes only the prefix up to the LAST user message, so mid-turn advisory-view growth doesn't change the key and iteration 2+ is a cache HIT (advice reused, zero advisor spend, no re-trace). A new user message changes the prefix and re-triggers the fan-out. Unknown fanout values normalize to per_iteration.	2026-07-03 01:02:44 -07:00
Teknium	372f8195c7	fix(moa): default temperatures to unset — provider default, like single-model agents (#57440 ) A single-model Hermes agent never sends temperature; the provider default applies. MoA hardcoded reference_temperature=0.6 / aggregator_temperature=0.4, and the coercion float(preset.get(key, 0.6) or 0.6) made unset IMPOSSIBLE to express: absent, null, empty, and even an explicit 0 all collapsed to the baked-in default. Every MoA advisor and aggregator therefore ran at 0.6/0.4 while the same model running solo used the provider default — silently skewing solo-vs-MoA comparisons and overriding provider-tuned defaults. - moa_config normalization: temperatures coerce to None when absent/blank/ invalid (new _coerce_float_or_none); explicit values incl. 0 honored. - moa_loop: _preset_temperature() resolves preset values; None flows to call_llm, which already omits the parameter when None (same contract as max_tokens). Aggregator still inherits the acting agent's own configured temperature when the preset doesn't pin one. - conversation_loop (context-mode MoA): same resolution, no more hardcoded 0.6/0.4 at the call site. - DEFAULT_CONFIG preset + web_server payload models + docs updated: unset is the default, pinning stays available.	2026-07-03 00:22:49 -07:00
Teknium	543d305bbb	feat(moa): add reference_max_tokens to cap advisor output and cut turn latency (#56756 ) MoA per-turn latency is dominated by advisor GENERATION: turn wall time correlates ~0.88 with output tokens and ~-0.03 with input tokens (measured over 52 turns). Each turn waits for the slowest advisor to finish writing, and advisors were uncapped — writing multi-thousand-token essays the aggregator only needs the gist of. Add an opt-in per-preset reference_max_tokens knob (mirrors reference_temperature) that caps ADVISOR output only; the acting aggregator is never capped. Default None = uncapped, so existing presets are byte-for-byte unchanged (no regression). Wired through both MoA execution paths (MoAChatCompletions.create and aggregate_moa_context). E2E: same task, closed preset uncapped vs reference_max_tokens=600 -> 59s to 33s (~44% faster), final answer identical/correct. - hermes_cli/moa_config.py: _coerce_int_or_none helper + reference_max_tokens in _normalize_preset/_default_preset/flattened view - agent/moa_loop.py: read preset.reference_max_tokens, pass to reference fan-out - agent/conversation_loop.py: pass reference_max_tokens on the per-turn path - tests + docs	2026-07-02 00:16:35 -07:00
Teknium	aa605b66c8	fix(moa): price aggregator turn at its real model so session cost isn't advisor-only (#56394 ) On the MoA path agent.model/provider are the virtual preset name (e.g. "closed") and "moa", which have no pricing entry. estimate_usage_cost() returned None for the aggregator turn, so the `if amount_usd is not None` guard skipped it and the session's estimated_cost_usd reflected only the advisor fan-out — a ~50% undercount when the aggregator does the full acting loop (verified: $0.91 advisor-only vs $1.96 true, aggregator = 54%). MoAChatCompletions.create() now stashes the resolved aggregator slot as last_aggregator_slot (exposed via MoAClient); conversation_loop reads it to price the aggregator turn at its real model/provider. cost_source flips from 'none' to 'provider_models_api'.	2026-07-01 06:02:33 -07:00
Teknium	5de65624d1	fix(moa): capture streamed aggregator output into full-turn traces (#56312 ) MoA full-turn traces (moa.save_traces) recorded the aggregator's acting output only on the non-streaming path, where it's captured inline at call time. On the streaming path — which every hermes chat --query run and every live gateway/CLI turn takes — the aggregator's raw token stream is handed to the live consumer, so the trace left output=null and only pointed at the session-db assistant row. An offline audit of a benchmark run (HermesBench drives --query) then couldn't see what the aggregator produced without hand-joining to state.db. Capture the resolved streamed acting text at trace-flush time (the agent already holds it in _current_streamed_assistant_text) and fold it into the trace, so the record is self-contained in both modes. New output_location value inline_from_stream marks a streamed turn whose text was captured this way; a genuinely empty acting turn (pure tool call) still points at the session db, matching state.db exactly. Touches only the trace side-channel — no change to the acting path, message history, role alternation, or prompt cache. - agent/moa_loop.py: consume_and_save_trace(..., aggregator_output_fallback) on both the facade and the MoAClient wrapper; prefer inline capture, fall back to the resolved streamed text. - agent/moa_trace.py: embed the fallback; add inline_from_stream location. - agent/conversation_loop.py: pass _current_streamed_assistant_text at flush. - tests: 5 cases across streaming / non-streaming / empty-fallback / no-double-write.	2026-07-01 04:07:46 -07:00
Jeff Watts	a2d6f05d1b	fix(moa): append reference block at end of aggregator prompt for KV-cache reuse The MoA aggregator received the per-turn reference block merged into the most recent `user` message. In an agentic tool loop that message is the original task near the top of the context (everything after it is assistant/tool turns), so injecting text that changes every iteration diverges the prompt prefix early. The server's KV cache then cannot be reused and the entire conversation re-prefills on every tool-loop step — full prefill each step, which dominates latency on long contexts. Append the reference block at the end of the prompt instead (merging into the last message only when it is already a trailing user turn, i.e. plain chat). This keeps the [system][task][tool-history] prefix stable and cache-reusable so only the new block re-prefills, and gives the aggregator the references with recency. Extracted as `_attach_reference_guidance` with unit tests. Measured on a local llama.cpp aggregator over a long agentic task: KV-cache reuse on follow-up steps went from ~0.3% to ~93-95% and per-step prefill on an ~80k-token context dropped from ~44s to <1s, with no change to output. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-07-01 01:59:00 -07:00
Teknium	2e8748ed22	feat(moa): opt-in full-turn trace persistence to JSONL (#56101 ) Adds moa.save_traces (default off). When on, every MoA turn that runs the reference fan-out appends one JSON line to <hermes_home>/moa-traces/<session_id>.jsonl capturing the TRUE FULL turn: each reference model's exact input messages (system advisory prompt + full advisory view, not the truncated display preview) + full output + usage + per-advisor cost, and the aggregator's exact input (including the injected reference-context guidance block) + output. Lets MoA runs be audited and improved offline — what every model saw, said, and cost. - agent/moa_trace.py: config-gated JSONL writer, profile-aware path via get_hermes_home(), best-effort (never breaks a turn), moa.trace_dir override. - agent/moa_loop.py: _RefAccounting now carries full input/output/model/ provider/temperature; create() stashes the full turn on a cache MISS (once per turn, never on the cache-HIT repeat iterations); non-streaming aggregator output captured inline, streaming marked + pointed at the session assistant message. consume_and_save_trace(session_id) flushes it. - agent/conversation_loop.py: flushes the trace with the live session_id right after MoA usage consumption. No-op for non-MoA clients. - hermes_cli/config.py: moa.save_traces + moa.trace_dir defaults. Traces are a side channel — NOT the messages table, never in replay, safe to delete. Off by default; only overhead when off is one config read on a MoA cache-MISS turn. Tests: full-trace-when-enabled (per-ref input+output+cost, aggregator input-with-guidance + output), nothing-when-disabled. Live E2E through run_conversation confirmed the loop wiring writes the file.	2026-07-01 00:09:42 -07:00
Teknium	3bdb23de10	fix(moa): count reference (advisor) fan-out token usage + cost (#56087 ) MoA ran the reference models before the aggregator but returned only the aggregator's usage to the loop — _run_reference discarded each advisor response's .usage entirely. Session accounting (state.db, /insights, cost) therefore undercounted every MoA turn by the whole reference fan-out, which is usually the bulk of the spend and scales with advisor count. - _run_reference normalizes each advisor's usage with ITS OWN resolved provider/api_mode and prices it at ITS OWN model rate (correct cache-read/ cache-write split), returning a _RefAccounting(usage, cost). - create() sums advisor usage + cost once per turn (cache MISS only, so a repeat tool-iteration reusing cached advice does not double-charge) and exposes it via MoAClient.consume_reference_usage(). - conversation_loop folds advisor tokens into the reported/persisted token counts and adds advisor cost (priced per-advisor) on top of the aggregator cost, in both the in-memory session totals and the state.db per-call delta. Aggregator cost is still priced on aggregator-only usage so advisor tokens are never repriced at the aggregator rate. - CanonicalUsage gains __add__ for per-bucket summing. Tests: advisor usage/cost capture, per-turn sum + consume-clears + cache-hit no-double-charge, CanonicalUsage.__add__.	2026-06-30 23:08:37 -07:00
Teknium	a653bb0cbe	refactor(moa): unify slot provider-identity on the single call_llm chokepoint (#55991 ) _slot_runtime maintained a hand-listed name-preservation set ({nous, anthropic, openai-codex, xai-oauth, bedrock}) that returned bare provider+model to avoid call_llm collapsing an explicit base_url to the generic 'custom' route. That duplicated _resolve_task_provider_model's _preserve_provider_with_base_url guard (a provider-catalog capability check) and had to be extended by hand for every provider with custom auth/signing — the exact drift that produced the anthropic (#54609) and bedrock (#54912) 429/ empty-response bugs. Removes the whitelist: _slot_runtime now forwards the resolved base_url/api_key/ api_mode for every slot, and the single chokepoint (_resolve_task_provider_model -> _preserve_provider_with_base_url) decides identity preservation. Behavior is unchanged for the five providers — their provider branches (codex Responses+Cloudflare, xai-oauth, bedrock SigV4, anthropic OAuth Bearer+anthropic-beta, nous Portal tags) re-resolve their own credentials by name and ignore a forwarded base_url/api_key, so forwarding is safe even for bedrock's placeholder 'aws-sdk' key. Verified via real-import E2E: _slot_runtime -> _resolve_task_provider_model preserves openai-codex/xai-oauth/bedrock/anthropic/nous (+openrouter control) — none collapse to custom. Tests updated to assert the pipeline invariant against the real resolver instead of the removed whitelist's bare-return shape.	2026-06-30 18:59:45 -07:00
iizotov	6eca917631	fix(moa): route bedrock MoA slots through signed bedrock branch _slot_runtime() resolved a bedrock slot to its bedrock-runtime base_url plus the placeholder api_key "aws-sdk" and forwarded both to call_llm. call_llm then treated it as a plain OpenAI-compatible endpoint and issued an UNSIGNED bearer POST (no AWS SigV4 / IAM signing), so Bedrock returned an empty/malformed ChatCompletion (choices=None) and the MoA aggregator turn failed validation. Add 'bedrock' to the name-preserve set alongside nous/openai-codex/ xai-oauth so bedrock slots are passed by provider name only, routing through call_llm's dedicated SigV4-signed bedrock branch. Affects any MoA preset using a bedrock aggregator or bedrock reference.	2026-06-30 17:45:45 -07:00
Chufeng Fan	4d43669921	fix(moa): route native anthropic OAuth references through provider branch MoA's _slot_runtime() whitelists providers that must keep their provider identity (so call_llm runs their provider branch) instead of being treated as a plain custom endpoint via forwarded base_url/api_key. Native anthropic was missing from this set. Native anthropic subscription OAuth setup-tokens (sk-ant-oat) require Bearer auth plus the 'anthropic-beta: oauth-' header, which only the anthropic provider branch adds. Without the whitelist entry, the slot's base_url/api_key were forwarded and call_llm sent the OAuth token as x-api-key, which Anthropic rejects with a bare 429 (rate_limit_error with no quota details). This made anthropic references in MoA presets fail every time. Add 'anthropic' to the whitelist so native anthropic reference/aggregator slots route through the provider branch. Extends upstream `9229d0db1` which added 'nous' for the same reason.	2026-06-30 17:45:45 -07:00
Jeff Watts	4d2351a528	feat(moa): stream the aggregator response to the user MoA sessions could not stream: the gateway streaming toggle was a no-op for provider "moa", so users saw nothing until the entire response finished — minutes of silence on long turns. The aggregator's reply was always fetched whole. Root cause was twofold: 1. conversation_loop hard-disabled streaming for provider in {"copilot-acp", "moa"} (MoA grouped with the ACP client, whose facade isn't a stream). 2. MoAChatCompletions.create() fetched the aggregator response whole via call_llm(), which had no streaming mode. For provider "moa", _create_request_openai_client() returns the MoAClient facade itself, so the existing streaming consumer already calls MoAChatCompletions.create(stream=True). We reuse that battle-tested consumer (text-delta delivery, tool_call reassembly, stale-stream detection, non-streaming fallback) instead of adding a parallel streaming path. Changes: - call_llm() gains stream/stream_options. When streaming it returns the raw SDK stream iterator directly, bypassing _validate_llm_response and the temperature/max_tokens/payment fallback chain (which assume a complete response). The caller owns reassembly and fallback. - MoAChatCompletions.create() runs the references first (unchanged), then when stream=True returns the aggregator's raw stream, forwarding stream_options and the consumer's per-request read timeout. stream=False is byte-identical to before (no stream/stream_options/timeout forwarded). - conversation_loop streams MoA only when a display/TTS consumer is present; quiet/subagent/health-check paths keep the complete-response path. Tests: tests/run_agent/test_moa_streaming.py — create() stream/non-stream branches, stream_options + timeout forwarding, call_llm raw-stream return vs validated non-stream. Existing MoA tests unchanged (20 passed). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-30 12:07:01 -07:00
teknium1	fe355d0a27	fix(moa): handle dict/str message shape in MoA response extraction Sibling of #15795's context_compressor fix. agent/moa_loop.py used the same response.choices[0].message.content access; while wrapped in try/except (so no crash), a dict/str-shaped message silently returned empty. Coerce defensively so the content is actually extracted.	2026-06-30 04:38:43 -07:00
liuhao1024	d76ca3a7f2	fix(moa): propagate api_mode from slot runtime to call_llm Slot_runtime resolved the provider's real API surface (including api_mode) but only forwarded base_url and api_key to call_llm, dropping api_mode. This caused Copilot GPT-5.x reference slots to hit /chat/completions instead of the Responses API, returning 400 unsupported_api_for_model. - _slot_runtime: forward api_mode from resolve_runtime_provider - call_llm: accept explicit api_mode param, override task config - 4 regression tests for propagation, omission, and signature	2026-06-30 03:39:50 -07:00
Gille	9229d0db17	fix(moa): preserve Nous provider identity for references	2026-06-28 00:47:15 -07:00
Teknium	7c38249c79	feat(moa): references see full tool state + fire on every user/tool response (#54016 ) The advisory reference view stripped all tool calls and tool results, so reference models judged a task whose actions and results they never saw — and references only fired once per user turn, never re-running as the agent's state advanced through the tool loop. Two fixes: - _reference_messages() now PRESERVES the agent's tool calls and tool results, rendering them inline as text ([called tool: ...] / [tool result: ...]) so a reference gives an informed judgement on the real current state. Still emits zero tool-role messages and zero tool_calls arrays (strict providers reject those), and large tool results are previewed head+tail (4000-char budget). The required end-on-user shape is met by APPENDING a synthetic advisory user turn — not by deleting the agent's latest context (which the prior fix did). - References now re-run on every state change — each new user message AND each new tool result — instead of once per user turn. The state-sensitive advisory signature drives the cache: new tool result = miss (re-run), identical-state re-call = hit (no re-run, no re-emit). The acting aggregator still receives the full, untrimmed transcript.	2026-06-28 00:30:11 -07:00
Teknium	1fa44180b0	fix(moa): advisory references end on a user turn + get a reference-role system prompt (#54007 ) * fix(moa): reference advisory view must end with a user turn MoA reference calls failed with Anthropic models that don't support assistant prefill (e.g. Claude Opus 4.8): '400 ... must end with a user message'. The advisory view built by _reference_messages() kept the last assistant turn's text while dropping the following tool result, leaving a trailing assistant turn — which Anthropic (and OpenRouter->Anthropic) interpret as an assistant prefill to continue. References are advisory and must end on the user turn they answer. Strip trailing assistant turns from the advisory view (preserving intervening ones). Update the existing test that encoded the buggy shape and add a mid-tool-loop regression test. * feat(moa): give reference models an advisory-role system prompt Reference models received the bare trimmed conversation with no role framing, so they assumed they were the acting agent and refused ("I can't access repositories/URLs from here") or tried to call tools they don't have. Prepend a dedicated advisory system prompt to every reference call: the model is an analyst, not the actor — it cannot execute, should not apologize for lacking tools, and should reason about the presented state to advise the aggregator/orchestrator on approach, next steps, tool-use strategy, risks, and anything the acting agent missed. Its output is private guidance for the aggregator, not a user-facing answer.	2026-06-27 22:52:25 -07:00
Gille	e7bb67332d	fix(moa): preserve Codex slot routing	2026-06-27 14:20:51 -07:00
Teknium	3b44a3c8bb	feat(moa): show each reference model's output as a labelled block before the aggregator (#53793 ) When a MoA preset is selected, each reference model's answer now renders in the CLI as a thinking-style block labelled with its source model, BEFORE the aggregator responds — so the mixture-of-agents process is visible instead of a silent pause. The aggregator's response (and its tool actions) follow as normal. Mechanism (shared seam, all surfaces): - MoAChatCompletions/MoAClient take an optional reference_callback and emit 'moa.reference' (index/count/label/text) per reference, then 'moa.aggregating' (aggregator label) once. agent_init wires this to the agent's tool_progress_callback, which every surface already consumes — so the events reach CLI/TUI/desktop/gateway with no new plumbing. - CLI _on_tool_progress renders 'moa.reference' as a labelled '┊ ◇ Reference i/n — <model>' header + a thinking-style preview (reusing _emit_reasoning_ preview), and 'moa.aggregating' as a spinner transition. Display-only; never touches message history (cache-safe). Turn-scoped reference cache: the agent loop calls the facade once per tool-loop iteration, but the advisory message view is identical across iterations within a turn, so references are now run AND displayed once per user turn (keyed by the advisory view's signature) instead of re-running/re-spamming on every iteration. This also cuts reference API cost from O(iterations) back to O(turns). Verified live via interactive PTY on the opus-gpt preset (gpt-5.5 + opus refs): reference blocks render once per turn, labelled by model, before the aggregator; fresh blocks on each new turn; aggregator tool actions still execute. Follow-up: TUI/desktop rich rendering + gateway batched-summary already receive the events via tool_progress_callback; their surface-specific renderers are a separate change.	2026-06-27 12:45:23 -07:00
Teknium	02b32e2d7c	fix(moa): call reference + aggregator models through their provider's real route (#53580 ) MoA was calling reference and aggregator models through a bare call_llm(provider=slot["provider"], model=slot["model"]) with a forced temperature and a forced max_tokens (the preset's hardcoded 4096). That left base_url/api_key/api_mode unresolved — so the auxiliary auto-detector guessed the API surface instead of using the provider's real runtime, and the 4096 cap truncated long aggregator syntheses. A MoA slot is just a model selection and must be called the same way any model is called elsewhere. Each slot is now resolved through resolve_runtime_provider (the canonical provider→api_mode/base_url/api_key resolver the CLI, gateway, and delegate_task all use) via a new _slot_runtime() helper, and the resolved endpoint is passed into call_llm. So a reference/aggregator gets its provider's actual API surface — MiniMax → anthropic_messages, GPT-5/o-series → max_completion_tokens, custom endpoints → their base_url — identical to how that model is handled as the acting model. MoA also no longer imposes its own output cap: max_tokens defaults to None (omitted → the model's real maximum) for references and is passed through from the caller for the aggregator. The preset's hardcoded 4096 is gone. The max_tokens preset config field is left in place (config/web/desktop unchanged); it is simply no longer applied as a forced cap. Tests: slots route through resolve_runtime_provider with resolved base_url/ api_key; resolution errors fall back to bare provider/model; neither call carries an output cap even when the preset config still contains max_tokens.	2026-06-27 04:39:42 -07:00
Teknium	c6575df927	feat(moa): expose MoA presets as selectable virtual models (#46081 ) * feat(moa): expose MoA presets as selectable virtual models Reconstructed onto current main (PR #46081's base had diverged with no common ancestor, marking the PR dirty so CI never dispatched). MoA is now a virtual provider: each named preset is a selectable model under provider 'moa', and the preset's aggregator is the acting model that answers and calls tools. Reference models fan out in parallel via a bounded ThreadPoolExecutor (the same batch pattern delegate_task uses) — all references dispatched at once, collected when every one finishes, then handed to the aggregator. Output order is preserved, failures and the MoA-recursion guard stay isolated per reference. - Removed the old mixture_of_agents model tool and moa toolset. - Added moa as a virtual provider in the provider/model inventory. - /moa is shortcut behavior over model selection (default preset / named preset / one-shot prompt). - Dashboard + Desktop manage named presets; presets appear in model pickers. - Parallel reference fan-out in agent/moa_loop.py with regression test. * fix(moa): thread moa_config through _run_agent to _run_agent_inner The reconstructed gateway MoA wiring declared moa_config on _run_agent (the profile-scoping wrapper) and used it inside _run_agent_inner, but the wrapper never forwarded it — _run_agent_inner had no such parameter, so the runtime hit NameError: name 'moa_config' is not defined on the compression-failure session sync path. Add moa_config to _run_agent_inner's signature and forward it from both wrapper call sites (multiplex and non-multiplex). Caught by tests/gateway/test_compression_failure_session_sync.py on CI shard test(4). * fix(moa): classify moa as a virtual provider in the catalog The moa virtual provider has no PROVIDER_REGISTRY/ProviderProfile entry, so provider_catalog() fell through to the default auth_type="api_key" with no env vars — tripping two catalog invariants: - test_provider_catalog: api_key providers must expose a credential env var - test_provider_parity: every hermes-model provider must be desktop-configurable moa already declares auth_type="virtual" in HERMES_OVERLAYS; consult that overlay as an auth_type fallback so the catalog reports moa as virtual (no real credential, no network endpoint). Exempt virtual providers from the desktop parity union check the same way 'custom' is exempt — derived from the catalog, not a hardcoded slug, so future virtual providers are covered too.	2026-06-25 13:52:06 -07:00

23 commits