fix(auxiliary): retry transient blips harder + isolate client cache per model (#56889)
Two related hardening fixes for auxiliary calls (which include MoA reference advisors — a pinned-model path where provider fallback is not a meaningful recovery): 1. Transient-transport retries: the same-provider retry on a connection reset / timeout / 5xx / 408 was a single attempt, then fallback. For a pinned aux call a second blip silently loses the call (root of the run2 double-advisor 'Connection error' collapse — a genuine upstream blip). Now retries N times with exponential backoff, N = auxiliary.transient_retries (default 2 -> 3 total attempts, clamped [0,6]). Compression-on-timeout fast-fail carve-out preserved. 2. Per-model client-cache isolation: _client_cache_key excluded the model, so two concurrent auxiliary calls to the same provider/base_url/key but different models (e.g. an opus + gpt-5.5 MoA fan-out) shared one cache entry and could race each other's client lifecycle. Model now participates in the key -> distinct clients, no cross-call races. Same-model reuse unchanged. - agent/auxiliary_client.py: _transient_retry_count() + backoff loop; model in _client_cache_key and both call sites. - hermes_cli/config.py: auxiliary.transient_retries default (2). - tests: new retry/isolation tests; updated 2 stale-expectation tests to the corrected behavior (per-model resolve; N-retry escalation). Backoff base is overridable (_TRANSIENT_RETRY_BACKOFF_BASE) so tests don't sleep.
This commit is contained in:
parent
71c0622122
commit
fb403a3a73
4 changed files with 178 additions and 20 deletions
|
|
@ -1465,6 +1465,13 @@ DEFAULT_CONFIG = {
|
|||
# Each aux task is independent — main-agent provider_routing and
|
||||
# openrouter.min_coding_score do NOT propagate to aux calls by design.
|
||||
"auxiliary": {
|
||||
# Same-provider retries for a transient transport blip (connection
|
||||
# reset / timeout / 5xx / 408) on ANY auxiliary call before falling
|
||||
# back. Default 2 (→ 3 total attempts), clamped [0,6]. Matters most for
|
||||
# pinned calls like MoA reference advisors, where provider fallback is
|
||||
# not a meaningful recovery, so an unretried blip silently loses the
|
||||
# call.
|
||||
"transient_retries": 2,
|
||||
"vision": {
|
||||
"provider": "auto", # auto | openrouter | nous | codex | custom
|
||||
"model": "", # e.g. "google/gemini-2.5-flash", "gpt-4o"
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue