Commit graph

23 commits

Author SHA1 Message Date
Teknium
22c5048d9c
fix(moa): restore prompt caching for the aggregator and advisors (#57675)
Two caching holes made MoA re-bill essentially its entire input stream:

1. AGGREGATOR: anthropic_prompt_cache_policy() judged the agent's own
   model/provider — on the MoA path those are the virtual preset name and
   'moa', which match no caching branch, so _use_prompt_caching was False
   and the acting aggregator (Claude on OpenRouter) ran with ZERO
   cache_control breakpoints. Measured on identical opus-4.8 sessions:
   85% cache share solo vs 2% via MoA — ~30M re-billed input tokens on one
   132-task benchmark run. Fix: when provider == 'moa', resolve the policy
   from the preset's real aggregator slot (provider/model/base_url/api_mode
   via resolve_runtime_provider).

2. ADVISORS: _run_reference never applied cache_control at all, and
   Anthropic caching is opt-in per request — Claude advisors served 0
   cache reads across 1,227 benchmark calls (11.5M re-billed input tokens)
   even though the advisory view is append-only across iterations (stable
   prefix; the synthetic end marker is last so it never pollutes it). Fix:
   _maybe_apply_advisor_cache_control() reuses the SAME policy function and
   SAME system_and_3 layout as the main loop, judged on the advisor slot's
   own resolved runtime — advisor requests are now decorated exactly like
   an acting agent on that provider. Auto-caching routes (OpenAI-family)
   are left untouched by policy.

Live-verified on the wire (per-iteration opus+gpt5.5 preset, 4 fan-outs):
claude advisor fan-out 2-3 cache_write=2161/2344, fan-out 4
cache_read=2206 / fresh_in=2; aggregator session cache share 84%/77%
(vs 2%/0% before). Sub-1024-token prompts correctly stay uncached
(Anthropic minimum).
2026-07-03 04:08:48 -07:00
Teknium
1c4cc00f73
fix(moa): user_turn fanout — synthetic advisory marker must not count as a user turn (#57598)
The advisory view appends a synthetic user marker when it ends on an
assistant turn (Anthropic end-on-user rule) — i.e. on every tool iteration
after the first. The user_turn prefix hash treated that marker as the last
user message, so the hashed prefix included the grown mid-turn context and
the signature changed every iteration: advisors re-ran per iteration,
silently defeating the once-per-turn cadence (live smoke test: 2 fan-outs
for a 2-iteration task; expected 1). Hoist the marker to a module constant
and skip it when locating the last REAL user message. Verified: iteration-2
signature now equals iteration-1 (cache HIT); a new real user message still
re-triggers the fan-out.
2026-07-03 01:24:58 -07:00
Teknium
9e044cf795
feat(moa): per-preset fanout cadence — user_turn runs advisors once per user turn (#57591)
New preset key 'fanout': 'per_iteration' (default, unchanged behavior)
re-runs the reference fan-out whenever the advisory view changes — every
tool iteration. 'user_turn' runs the advisors ONCE per user turn and lets
the aggregator act alone for the rest of the tool loop — the original MoA
shape (upfront multi-model synthesis, then a single acting model), and the
obvious lever on MoA's wall/cost multiplier (advisor generation dominates
per-turn latency).

Implementation reuses the existing turn-scoped reference cache: in
user_turn mode the cache signature hashes only the prefix up to the LAST
user message, so mid-turn advisory-view growth doesn't change the key and
iteration 2+ is a cache HIT (advice reused, zero advisor spend, no
re-trace). A new user message changes the prefix and re-triggers the
fan-out. Unknown fanout values normalize to per_iteration.
2026-07-03 01:02:44 -07:00
Teknium
372f8195c7
fix(moa): default temperatures to unset — provider default, like single-model agents (#57440)
A single-model Hermes agent never sends temperature; the provider default
applies. MoA hardcoded reference_temperature=0.6 / aggregator_temperature=0.4,
and the coercion float(preset.get(key, 0.6) or 0.6) made unset IMPOSSIBLE to
express: absent, null, empty, and even an explicit 0 all collapsed to the
baked-in default. Every MoA advisor and aggregator therefore ran at 0.6/0.4
while the same model running solo used the provider default — silently
skewing solo-vs-MoA comparisons and overriding provider-tuned defaults.

- moa_config normalization: temperatures coerce to None when absent/blank/
  invalid (new _coerce_float_or_none); explicit values incl. 0 honored.
- moa_loop: _preset_temperature() resolves preset values; None flows to
  call_llm, which already omits the parameter when None (same contract as
  max_tokens). Aggregator still inherits the acting agent's own configured
  temperature when the preset doesn't pin one.
- conversation_loop (context-mode MoA): same resolution, no more hardcoded
  0.6/0.4 at the call site.
- DEFAULT_CONFIG preset + web_server payload models + docs updated: unset
  is the default, pinning stays available.
2026-07-03 00:22:49 -07:00
Teknium
543d305bbb
feat(moa): add reference_max_tokens to cap advisor output and cut turn latency (#56756)
MoA per-turn latency is dominated by advisor GENERATION: turn wall time
correlates ~0.88 with output tokens and ~-0.03 with input tokens (measured over
52 turns). Each turn waits for the slowest advisor to finish writing, and
advisors were uncapped — writing multi-thousand-token essays the aggregator
only needs the gist of.

Add an opt-in per-preset reference_max_tokens knob (mirrors reference_temperature)
that caps ADVISOR output only; the acting aggregator is never capped. Default
None = uncapped, so existing presets are byte-for-byte unchanged (no regression).
Wired through both MoA execution paths (MoAChatCompletions.create and
aggregate_moa_context).

E2E: same task, closed preset uncapped vs reference_max_tokens=600 -> 59s to 33s
(~44% faster), final answer identical/correct.

- hermes_cli/moa_config.py: _coerce_int_or_none helper + reference_max_tokens
  in _normalize_preset/_default_preset/flattened view
- agent/moa_loop.py: read preset.reference_max_tokens, pass to reference fan-out
- agent/conversation_loop.py: pass reference_max_tokens on the per-turn path
- tests + docs
2026-07-02 00:16:35 -07:00
Teknium
aa605b66c8
fix(moa): price aggregator turn at its real model so session cost isn't advisor-only (#56394)
On the MoA path agent.model/provider are the virtual preset name (e.g.
"closed") and "moa", which have no pricing entry. estimate_usage_cost()
returned None for the aggregator turn, so the `if amount_usd is not None`
guard skipped it and the session's estimated_cost_usd reflected only the
advisor fan-out — a ~50% undercount when the aggregator does the full acting
loop (verified: $0.91 advisor-only vs $1.96 true, aggregator = 54%).

MoAChatCompletions.create() now stashes the resolved aggregator slot as
last_aggregator_slot (exposed via MoAClient); conversation_loop reads it to
price the aggregator turn at its real model/provider. cost_source flips from
'none' to 'provider_models_api'.
2026-07-01 06:02:33 -07:00
Teknium
5de65624d1
fix(moa): capture streamed aggregator output into full-turn traces (#56312)
MoA full-turn traces (moa.save_traces) recorded the aggregator's acting
output only on the non-streaming path, where it's captured inline at
call time. On the streaming path — which every hermes chat --query run
and every live gateway/CLI turn takes — the aggregator's raw token
stream is handed to the live consumer, so the trace left output=null and
only pointed at the session-db assistant row. An offline audit of a
benchmark run (HermesBench drives --query) then couldn't see what the
aggregator produced without hand-joining to state.db.

Capture the resolved streamed acting text at trace-flush time (the agent
already holds it in _current_streamed_assistant_text) and fold it into
the trace, so the record is self-contained in both modes. New
output_location value inline_from_stream marks a streamed turn whose text
was captured this way; a genuinely empty acting turn (pure tool call)
still points at the session db, matching state.db exactly.

Touches only the trace side-channel — no change to the acting path,
message history, role alternation, or prompt cache.

- agent/moa_loop.py: consume_and_save_trace(..., aggregator_output_fallback)
  on both the facade and the MoAClient wrapper; prefer inline capture,
  fall back to the resolved streamed text.
- agent/moa_trace.py: embed the fallback; add inline_from_stream location.
- agent/conversation_loop.py: pass _current_streamed_assistant_text at flush.
- tests: 5 cases across streaming / non-streaming / empty-fallback / no-double-write.
2026-07-01 04:07:46 -07:00
Jeff Watts
a2d6f05d1b fix(moa): append reference block at end of aggregator prompt for KV-cache reuse
The MoA aggregator received the per-turn reference block merged into the most
recent `user` message. In an agentic tool loop that message is the original
task near the top of the context (everything after it is assistant/tool turns),
so injecting text that changes every iteration diverges the prompt prefix early.
The server's KV cache then cannot be reused and the entire conversation
re-prefills on every tool-loop step — full prefill each step, which dominates
latency on long contexts.

Append the reference block at the end of the prompt instead (merging into the
last message only when it is already a trailing user turn, i.e. plain chat).
This keeps the [system][task][tool-history] prefix stable and cache-reusable so
only the new block re-prefills, and gives the aggregator the references with
recency. Extracted as `_attach_reference_guidance` with unit tests.

Measured on a local llama.cpp aggregator over a long agentic task: KV-cache
reuse on follow-up steps went from ~0.3% to ~93-95% and per-step prefill on an
~80k-token context dropped from ~44s to <1s, with no change to output.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-01 01:59:00 -07:00
Teknium
2e8748ed22
feat(moa): opt-in full-turn trace persistence to JSONL (#56101)
Adds moa.save_traces (default off). When on, every MoA turn that runs the
reference fan-out appends one JSON line to
<hermes_home>/moa-traces/<session_id>.jsonl capturing the TRUE FULL turn:
each reference model's exact input messages (system advisory prompt + full
advisory view, not the truncated display preview) + full output + usage +
per-advisor cost, and the aggregator's exact input (including the injected
reference-context guidance block) + output. Lets MoA runs be audited and
improved offline — what every model saw, said, and cost.

- agent/moa_trace.py: config-gated JSONL writer, profile-aware path via
  get_hermes_home(), best-effort (never breaks a turn), moa.trace_dir override.
- agent/moa_loop.py: _RefAccounting now carries full input/output/model/
  provider/temperature; create() stashes the full turn on a cache MISS
  (once per turn, never on the cache-HIT repeat iterations); non-streaming
  aggregator output captured inline, streaming marked + pointed at the
  session assistant message. consume_and_save_trace(session_id) flushes it.
- agent/conversation_loop.py: flushes the trace with the live session_id
  right after MoA usage consumption. No-op for non-MoA clients.
- hermes_cli/config.py: moa.save_traces + moa.trace_dir defaults.

Traces are a side channel — NOT the messages table, never in replay, safe
to delete. Off by default; only overhead when off is one config read on a
MoA cache-MISS turn.

Tests: full-trace-when-enabled (per-ref input+output+cost, aggregator
input-with-guidance + output), nothing-when-disabled. Live E2E through
run_conversation confirmed the loop wiring writes the file.
2026-07-01 00:09:42 -07:00
Teknium
3bdb23de10
fix(moa): count reference (advisor) fan-out token usage + cost (#56087)
MoA ran the reference models before the aggregator but returned only the
aggregator's usage to the loop — _run_reference discarded each advisor
response's .usage entirely. Session accounting (state.db, /insights, cost)
therefore undercounted every MoA turn by the whole reference fan-out, which
is usually the bulk of the spend and scales with advisor count.

- _run_reference normalizes each advisor's usage with ITS OWN resolved
  provider/api_mode and prices it at ITS OWN model rate (correct cache-read/
  cache-write split), returning a _RefAccounting(usage, cost).
- create() sums advisor usage + cost once per turn (cache MISS only, so a
  repeat tool-iteration reusing cached advice does not double-charge) and
  exposes it via MoAClient.consume_reference_usage().
- conversation_loop folds advisor tokens into the reported/persisted token
  counts and adds advisor cost (priced per-advisor) on top of the
  aggregator cost, in both the in-memory session totals and the state.db
  per-call delta. Aggregator cost is still priced on aggregator-only usage
  so advisor tokens are never repriced at the aggregator rate.
- CanonicalUsage gains __add__ for per-bucket summing.

Tests: advisor usage/cost capture, per-turn sum + consume-clears +
cache-hit no-double-charge, CanonicalUsage.__add__.
2026-06-30 23:08:37 -07:00
Teknium
a653bb0cbe
refactor(moa): unify slot provider-identity on the single call_llm chokepoint (#55991)
_slot_runtime maintained a hand-listed name-preservation set
({nous, anthropic, openai-codex, xai-oauth, bedrock}) that returned bare
provider+model to avoid call_llm collapsing an explicit base_url to the generic
'custom' route. That duplicated _resolve_task_provider_model's
_preserve_provider_with_base_url guard (a provider-catalog capability check)
and had to be extended by hand for every provider with custom auth/signing —
the exact drift that produced the anthropic (#54609) and bedrock (#54912) 429/
empty-response bugs.

Removes the whitelist: _slot_runtime now forwards the resolved base_url/api_key/
api_mode for every slot, and the single chokepoint
(_resolve_task_provider_model -> _preserve_provider_with_base_url) decides
identity preservation. Behavior is unchanged for the five providers — their
provider branches (codex Responses+Cloudflare, xai-oauth, bedrock SigV4,
anthropic OAuth Bearer+anthropic-beta, nous Portal tags) re-resolve their own
credentials by name and ignore a forwarded base_url/api_key, so forwarding is
safe even for bedrock's placeholder 'aws-sdk' key.

Verified via real-import E2E: _slot_runtime -> _resolve_task_provider_model
preserves openai-codex/xai-oauth/bedrock/anthropic/nous (+openrouter control) —
none collapse to custom. Tests updated to assert the pipeline invariant against
the real resolver instead of the removed whitelist's bare-return shape.
2026-06-30 18:59:45 -07:00
iizotov
6eca917631 fix(moa): route bedrock MoA slots through signed bedrock branch
_slot_runtime() resolved a bedrock slot to its bedrock-runtime base_url
plus the placeholder api_key "aws-sdk" and forwarded both to call_llm.
call_llm then treated it as a plain OpenAI-compatible endpoint and issued
an UNSIGNED bearer POST (no AWS SigV4 / IAM signing), so Bedrock returned
an empty/malformed ChatCompletion (choices=None) and the MoA aggregator
turn failed validation.

Add 'bedrock' to the name-preserve set alongside nous/openai-codex/
xai-oauth so bedrock slots are passed by provider name only, routing
through call_llm's dedicated SigV4-signed bedrock branch.

Affects any MoA preset using a bedrock aggregator or bedrock reference.
2026-06-30 17:45:45 -07:00
Chufeng Fan
4d43669921 fix(moa): route native anthropic OAuth references through provider branch
MoA's _slot_runtime() whitelists providers that must keep their provider
identity (so call_llm runs their provider branch) instead of being treated
as a plain custom endpoint via forwarded base_url/api_key. Native anthropic
was missing from this set.

Native anthropic subscription OAuth setup-tokens (sk-ant-oat*) require Bearer
auth plus the 'anthropic-beta: oauth-*' header, which only the anthropic
provider branch adds. Without the whitelist entry, the slot's base_url/api_key
were forwarded and call_llm sent the OAuth token as x-api-key, which Anthropic
rejects with a bare 429 (rate_limit_error with no quota details). This made
anthropic references in MoA presets fail every time.

Add 'anthropic' to the whitelist so native anthropic reference/aggregator
slots route through the provider branch. Extends upstream 9229d0db1 which
added 'nous' for the same reason.
2026-06-30 17:45:45 -07:00
Jeff Watts
4d2351a528 feat(moa): stream the aggregator response to the user
MoA sessions could not stream: the gateway streaming toggle was a no-op for
provider "moa", so users saw nothing until the entire response finished — minutes
of silence on long turns. The aggregator's reply was always fetched whole.

Root cause was twofold:
  1. conversation_loop hard-disabled streaming for provider in {"copilot-acp",
     "moa"} (MoA grouped with the ACP client, whose facade isn't a stream).
  2. MoAChatCompletions.create() fetched the aggregator response whole via
     call_llm(), which had no streaming mode.

For provider "moa", _create_request_openai_client() returns the MoAClient facade
itself, so the existing streaming consumer already calls
MoAChatCompletions.create(stream=True). We reuse that battle-tested consumer
(text-delta delivery, tool_call reassembly, stale-stream detection, non-streaming
fallback) instead of adding a parallel streaming path.

Changes:
  - call_llm() gains stream/stream_options. When streaming it returns the raw SDK
    stream iterator directly, bypassing _validate_llm_response and the
    temperature/max_tokens/payment fallback chain (which assume a complete
    response). The caller owns reassembly and fallback.
  - MoAChatCompletions.create() runs the references first (unchanged), then when
    stream=True returns the aggregator's raw stream, forwarding stream_options and
    the consumer's per-request read timeout. stream=False is byte-identical to
    before (no stream/stream_options/timeout forwarded).
  - conversation_loop streams MoA only when a display/TTS consumer is present;
    quiet/subagent/health-check paths keep the complete-response path.

Tests: tests/run_agent/test_moa_streaming.py — create() stream/non-stream
branches, stream_options + timeout forwarding, call_llm raw-stream return vs
validated non-stream. Existing MoA tests unchanged (20 passed).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 12:07:01 -07:00
teknium1
fe355d0a27 fix(moa): handle dict/str message shape in MoA response extraction
Sibling of #15795's context_compressor fix. agent/moa_loop.py used the
same response.choices[0].message.content access; while wrapped in
try/except (so no crash), a dict/str-shaped message silently returned
empty. Coerce defensively so the content is actually extracted.
2026-06-30 04:38:43 -07:00
liuhao1024
d76ca3a7f2 fix(moa): propagate api_mode from slot runtime to call_llm
Slot_runtime resolved the provider's real API surface (including api_mode)
but only forwarded base_url and api_key to call_llm, dropping api_mode.
This caused Copilot GPT-5.x reference slots to hit /chat/completions
instead of the Responses API, returning 400 unsupported_api_for_model.

- _slot_runtime: forward api_mode from resolve_runtime_provider
- call_llm: accept explicit api_mode param, override task config
- 4 regression tests for propagation, omission, and signature
2026-06-30 03:39:50 -07:00
Gille
9229d0db17 fix(moa): preserve Nous provider identity for references 2026-06-28 00:47:15 -07:00
Teknium
7c38249c79
feat(moa): references see full tool state + fire on every user/tool response (#54016)
The advisory reference view stripped all tool calls and tool results, so
reference models judged a task whose actions and results they never saw — and
references only fired once per user turn, never re-running as the agent's
state advanced through the tool loop.

Two fixes:
- _reference_messages() now PRESERVES the agent's tool calls and tool results,
  rendering them inline as text ([called tool: ...] / [tool result: ...]) so a
  reference gives an informed judgement on the real current state. Still emits
  zero tool-role messages and zero tool_calls arrays (strict providers reject
  those), and large tool results are previewed head+tail (4000-char budget).
  The required end-on-user shape is met by APPENDING a synthetic advisory user
  turn — not by deleting the agent's latest context (which the prior fix did).
- References now re-run on every state change — each new user message AND each
  new tool result — instead of once per user turn. The state-sensitive advisory
  signature drives the cache: new tool result = miss (re-run), identical-state
  re-call = hit (no re-run, no re-emit).

The acting aggregator still receives the full, untrimmed transcript.
2026-06-28 00:30:11 -07:00
Teknium
1fa44180b0
fix(moa): advisory references end on a user turn + get a reference-role system prompt (#54007)
* fix(moa): reference advisory view must end with a user turn

MoA reference calls failed with Anthropic models that don't support
assistant prefill (e.g. Claude Opus 4.8): '400 ... must end with a user
message'. The advisory view built by _reference_messages() kept the last
assistant turn's text while dropping the following tool result, leaving a
trailing assistant turn — which Anthropic (and OpenRouter->Anthropic)
interpret as an assistant prefill to continue. References are advisory and
must end on the user turn they answer.

Strip trailing assistant turns from the advisory view (preserving
intervening ones). Update the existing test that encoded the buggy shape
and add a mid-tool-loop regression test.

* feat(moa): give reference models an advisory-role system prompt

Reference models received the bare trimmed conversation with no role
framing, so they assumed they were the acting agent and refused ("I can't
access repositories/URLs from here") or tried to call tools they don't have.

Prepend a dedicated advisory system prompt to every reference call: the
model is an analyst, not the actor — it cannot execute, should not
apologize for lacking tools, and should reason about the presented state to
advise the aggregator/orchestrator on approach, next steps, tool-use
strategy, risks, and anything the acting agent missed. Its output is private
guidance for the aggregator, not a user-facing answer.
2026-06-27 22:52:25 -07:00
Gille
e7bb67332d fix(moa): preserve Codex slot routing 2026-06-27 14:20:51 -07:00
Teknium
3b44a3c8bb
feat(moa): show each reference model's output as a labelled block before the aggregator (#53793)
When a MoA preset is selected, each reference model's answer now renders in the
CLI as a thinking-style block labelled with its source model, BEFORE the
aggregator responds — so the mixture-of-agents process is visible instead of a
silent pause. The aggregator's response (and its tool actions) follow as normal.

Mechanism (shared seam, all surfaces):
- MoAChatCompletions/MoAClient take an optional reference_callback and emit
  'moa.reference' (index/count/label/text) per reference, then 'moa.aggregating'
  (aggregator label) once. agent_init wires this to the agent's
  tool_progress_callback, which every surface already consumes — so the events
  reach CLI/TUI/desktop/gateway with no new plumbing.
- CLI _on_tool_progress renders 'moa.reference' as a labelled '┊ ◇ Reference
  i/n — <model>' header + a thinking-style preview (reusing _emit_reasoning_
  preview), and 'moa.aggregating' as a spinner transition. Display-only; never
  touches message history (cache-safe).

Turn-scoped reference cache: the agent loop calls the facade once per tool-loop
iteration, but the advisory message view is identical across iterations within a
turn, so references are now run AND displayed once per user turn (keyed by the
advisory view's signature) instead of re-running/re-spamming on every iteration.
This also cuts reference API cost from O(iterations) back to O(turns).

Verified live via interactive PTY on the opus-gpt preset (gpt-5.5 + opus refs):
reference blocks render once per turn, labelled by model, before the aggregator;
fresh blocks on each new turn; aggregator tool actions still execute.

Follow-up: TUI/desktop rich rendering + gateway batched-summary already receive
the events via tool_progress_callback; their surface-specific renderers are a
separate change.
2026-06-27 12:45:23 -07:00
Teknium
02b32e2d7c
fix(moa): call reference + aggregator models through their provider's real route (#53580)
MoA was calling reference and aggregator models through a bare
call_llm(provider=slot["provider"], model=slot["model"]) with a forced
temperature and a forced max_tokens (the preset's hardcoded 4096). That left
base_url/api_key/api_mode unresolved — so the auxiliary auto-detector guessed
the API surface instead of using the provider's real runtime, and the 4096 cap
truncated long aggregator syntheses.

A MoA slot is just a model selection and must be called the same way any model
is called elsewhere. Each slot is now resolved through resolve_runtime_provider
(the canonical provider→api_mode/base_url/api_key resolver the CLI, gateway, and
delegate_task all use) via a new _slot_runtime() helper, and the resolved
endpoint is passed into call_llm. So a reference/aggregator gets its provider's
actual API surface — MiniMax → anthropic_messages, GPT-5/o-series →
max_completion_tokens, custom endpoints → their base_url — identical to how that
model is handled as the acting model.

MoA also no longer imposes its own output cap: max_tokens defaults to None
(omitted → the model's real maximum) for references and is passed through from
the caller for the aggregator. The preset's hardcoded 4096 is gone. The
max_tokens preset config field is left in place (config/web/desktop unchanged);
it is simply no longer applied as a forced cap.

Tests: slots route through resolve_runtime_provider with resolved base_url/
api_key; resolution errors fall back to bare provider/model; neither call
carries an output cap even when the preset config still contains max_tokens.
2026-06-27 04:39:42 -07:00
Teknium
c6575df927
feat(moa): expose MoA presets as selectable virtual models (#46081)
* feat(moa): expose MoA presets as selectable virtual models

Reconstructed onto current main (PR #46081's base had diverged with no common
ancestor, marking the PR dirty so CI never dispatched). MoA is now a virtual
provider: each named preset is a selectable model under provider 'moa', and the
preset's aggregator is the acting model that answers and calls tools.

Reference models fan out in parallel via a bounded ThreadPoolExecutor (the same
batch pattern delegate_task uses) — all references dispatched at once, collected
when every one finishes, then handed to the aggregator. Output order is
preserved, failures and the MoA-recursion guard stay isolated per reference.

- Removed the old mixture_of_agents model tool and moa toolset.
- Added moa as a virtual provider in the provider/model inventory.
- /moa is shortcut behavior over model selection (default preset / named preset
  / one-shot prompt).
- Dashboard + Desktop manage named presets; presets appear in model pickers.
- Parallel reference fan-out in agent/moa_loop.py with regression test.

* fix(moa): thread moa_config through _run_agent to _run_agent_inner

The reconstructed gateway MoA wiring declared moa_config on _run_agent (the
profile-scoping wrapper) and used it inside _run_agent_inner, but the wrapper
never forwarded it — _run_agent_inner had no such parameter, so the runtime hit
NameError: name 'moa_config' is not defined on the compression-failure session
sync path. Add moa_config to _run_agent_inner's signature and forward it from
both wrapper call sites (multiplex and non-multiplex). Caught by
tests/gateway/test_compression_failure_session_sync.py on CI shard test(4).

* fix(moa): classify moa as a virtual provider in the catalog

The moa virtual provider has no PROVIDER_REGISTRY/ProviderProfile entry, so
provider_catalog() fell through to the default auth_type="api_key" with no
env vars — tripping two catalog invariants:
  - test_provider_catalog: api_key providers must expose a credential env var
  - test_provider_parity: every hermes-model provider must be desktop-configurable

moa already declares auth_type="virtual" in HERMES_OVERLAYS; consult that
overlay as an auth_type fallback so the catalog reports moa as virtual (no real
credential, no network endpoint). Exempt virtual providers from the desktop
parity union check the same way 'custom' is exempt — derived from the catalog,
not a hardcoded slug, so future virtual providers are covered too.
2026-06-25 13:52:06 -07:00