hermes-agent/agent
adam91holt 8601c4d44c fix(codex): add time-to-first-byte watchdog for stalled Codex streams
The chatgpt.com/backend-api/codex endpoint has an intermittent failure mode
where it accepts the connection but never emits a single stream event — the
socket just hangs. Direct sequential probing reproduces it (0 events, no HTTP
status), and a fresh reconnect then succeeds in ~2s. Today the only guard is
the wall-clock stale timeout in interruptible_api_call, so a dead-on-arrival
connection is held for the full stale window (90-900s depending on context /
config) before the retry loop can reconnect — minutes of wasted wall time per
stall, at a rate of ~20% of calls during affected windows.

Add a TTFB watchdog scoped to the codex_responses path:

- codex_runtime.run_codex_stream stamps agent._codex_stream_last_event_ts on
  *every* stream event (not just output-text deltas), so reasoning-only and
  tool-call-only turns are not mistaken for a stall.
- interruptible_api_call resets that marker before the worker starts and, while
  it is still None, kills the connection once elapsed time exceeds the TTFB
  cutoff (default 45s, tunable via HERMES_CODEX_TTFB_TIMEOUT_SECONDS; 0 disables
  it). The raised TimeoutError flows through the existing retry path unchanged.
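
The cutoff configuration described above could be resolved roughly as follows. This is a minimal sketch, not the code in the commit: the helper name resolve_ttfb_timeout and the fallback on unparsable input are assumptions; only the environment variable name, the 45s default, and the "0 disables" rule come from the message.

```python
import os

DEFAULT_TTFB_SECONDS = 45.0  # default stated in the commit message


def resolve_ttfb_timeout(env=None):
    """Return the TTFB cutoff in seconds, or None when the watchdog is disabled.

    Hypothetical helper: the real parsing logic is not shown in the commit message.
    """
    env = os.environ if env is None else env
    raw = env.get("HERMES_CODEX_TTFB_TIMEOUT_SECONDS")
    if raw is None or not raw.strip():
        return DEFAULT_TTFB_SECONDS
    try:
        value = float(raw)
    except ValueError:
        # Assumption: an unparsable value falls back to the default.
        return DEFAULT_TTFB_SECONDS
    # 0 (or any non-positive value, by assumption) disables the watchdog.
    return value if value > 0 else None
```

Passing a plain dict as env keeps the helper easy to exercise in tests without touching the process environment.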

Once any event has arrived the stream is healthy and only the existing
wall-clock stale timeout applies, so legitimate long generations are never
interrupted. Gated to codex_responses; the chat_completions non-stream,
anthropic and bedrock branches have no first-event signal and are untouched.
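
The interaction between the first-event marker and the two timeouts can be sketched as below. This is an illustration under stated assumptions, not the actual interruptible_api_call: FakeAgent, call_with_ttfb_watchdog, and the polling interval are hypothetical, and the sketch only raises TimeoutError where the real code would also tear down the connection. Only the attribute name _codex_stream_last_event_ts is taken from the message.

```python
import threading
import time


class FakeAgent:
    """Hypothetical stand-in for the agent; only the attribute name is from the commit."""

    def __init__(self):
        self._codex_stream_last_event_ts = None


def call_with_ttfb_watchdog(agent, worker, ttfb_timeout, stale_timeout, poll=0.01):
    """Run worker(agent) in a thread and enforce TTFB and wall-clock cutoffs."""
    agent._codex_stream_last_event_ts = None  # reset before the worker starts
    result = {}

    def run():
        try:
            result["value"] = worker(agent)
        except Exception as exc:  # hand worker failures back to the caller
            result["error"] = exc

    thread = threading.Thread(target=run, daemon=True)
    start = time.monotonic()
    thread.start()
    while True:
        thread.join(poll)
        if not thread.is_alive():
            break
        elapsed = time.monotonic() - start
        # TTFB check applies only while no event has been stamped yet.
        if (
            ttfb_timeout
            and agent._codex_stream_last_event_ts is None
            and elapsed > ttfb_timeout
        ):
            raise TimeoutError("no stream event before the TTFB cutoff")
        # Once an event has arrived, only the wall-clock cutoff remains.
        if elapsed > stale_timeout:
            raise TimeoutError("wall-clock stale timeout exceeded")
    if "error" in result:
        raise result["error"]
    return result["value"]


def healthy_worker(agent):
    """Stamps the marker on its first 'event', then keeps generating for a while."""
    agent._codex_stream_last_event_ts = time.monotonic()
    time.sleep(0.4)
    return "ok"
```

With a TTFB cutoff of 0.2s, healthy_worker runs to completion even though it takes longer than the cutoff, because the marker is set as soon as it starts; a worker that only sleeps and never stamps the marker is cut off with TimeoutError.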

Adds tests/agent/test_codex_ttfb_watchdog.py covering the stall kill, the
events-flowing pass-through, and the env-disable path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 05:34:42 -07:00
lsp
secret_sources perf(cli): cut hermes startup 63% — flip head-to-head vs codex (#31968) 2026-05-25 03:06:39 -07:00
transports fix(codex): size and propagate timeouts for Responses-API requests; lower stale defaults 2026-05-25 01:47:55 -07:00
__init__.py
account_usage.py
agent_init.py fix(cli): synchronize HERMES_SESSION_ID across environment and contextvar during session switches 2026-05-23 17:46:55 -07:00
agent_runtime_helpers.py fix(compressor): propagate api_mode and fix root logger calls 2026-05-23 17:38:19 -07:00
anthropic_adapter.py fix(security): close TOCTOU window when saving Claude Code OAuth credentials (#21152) 2026-05-24 17:45:12 -07:00
async_utils.py
auxiliary_client.py fix(model): include Premium+ in xAI OAuth label 2026-05-24 18:12:16 -07:00
azure_identity_adapter.py
background_review.py fix(background-review): allow pinned skills to be improved 2026-05-23 22:57:42 -07:00
bedrock_adapter.py
browser_provider.py
browser_registry.py
chat_completion_helpers.py fix(codex): add time-to-first-byte watchdog for stalled Codex streams 2026-05-25 05:34:42 -07:00
codex_responses_adapter.py fix(codex): size and propagate timeouts for Responses-API requests; lower stale defaults 2026-05-25 01:47:55 -07:00
codex_runtime.py fix(codex): add time-to-first-byte watchdog for stalled Codex streams 2026-05-25 05:34:42 -07:00
context_compressor.py fix(compressor): ABC compliance — total_tokens, api_mode, logger consistency 2026-05-23 17:38:19 -07:00
context_engine.py fix(compressor): ABC compliance — total_tokens, api_mode, logger consistency 2026-05-23 17:38:19 -07:00
context_references.py
conversation_compression.py fix(cli): synchronize HERMES_SESSION_ID across environment and contextvar during session switches 2026-05-23 17:46:55 -07:00
conversation_loop.py fix(streaming): route mid-tool-call partial-stream-stub through length continuation (#31998) (#32012) 2026-05-25 17:43:10 +05:30
copilot_acp_client.py
credential_persistence.py fix: avoid persisting borrowed credential secrets (#31416) 2026-05-25 00:32:08 -07:00
credential_pool.py fix: avoid persisting borrowed credential secrets (#31416) 2026-05-25 00:32:08 -07:00
credential_sources.py fix(model): include Premium+ in xAI OAuth label 2026-05-24 18:12:16 -07:00
curator.py
curator_backup.py fix(skills): prune dependency/venv dirs from all skill scanners (#30042) 2026-05-21 14:18:02 -07:00
display.py feat(cli): show todo progress as done/total fraction 2026-05-23 21:03:51 -07:00
error_classifier.py fix(error-classifier): treat 5xx request-validation errors as non-retryable 2026-05-24 15:15:37 -07:00
file_safety.py fix(security): block read_file on project-local .env files 2026-05-25 03:40:47 -07:00
gemini_cloudcode_adapter.py
gemini_native_adapter.py
gemini_schema.py
google_code_assist.py
google_oauth.py fix(security): guard os.chmod(parent) against / and top-level dirs 2026-05-20 22:56:55 -07:00
i18n.py
image_gen_provider.py fix(image_gen): cache xAI ephemeral URL responses to disk (#26942) (#31759) 2026-05-24 18:10:47 -07:00
image_gen_registry.py
image_routing.py fix(agent): consult supports_vision override in auto-mode routing 2026-05-20 23:27:10 -07:00
insights.py
iteration_budget.py
lmstudio_reasoning.py
manual_compression_feedback.py
markdown_tables.py
memory_manager.py
memory_provider.py
message_sanitization.py
model_metadata.py fix(compressor): propagate api_mode and fix root logger calls 2026-05-23 17:38:19 -07:00
models_dev.py fix(xai): resolve Grok Build context for OAuth 2026-05-22 13:05:36 -07:00
moonshot_schema.py
nous_rate_guard.py
onboarding.py
plugin_llm.py
portal_tags.py
process_bootstrap.py
prompt_builder.py refactor(ntfy): convert built-in adapter to platform plugin 2026-05-23 16:13:01 -07:00
prompt_caching.py
rate_limit_tracker.py
redact.py fix(debug): redact BlueBubbles webhook secrets 2026-05-24 15:43:48 -07:00
retry_utils.py
shell_hooks.py
skill_bundles.py
skill_commands.py
skill_preprocessing.py
skill_utils.py fix(skills): load Linux-tagged skills on Termux (android sys.platform) 2026-05-21 19:08:38 -07:00
stream_diag.py
subdirectory_hints.py
system_prompt.py fix(profiles): cross-profile soft guard on file-write tools + system-prompt hint (#31290) 2026-05-24 00:38:17 -07:00
think_scrubber.py
title_generator.py
tool_dispatch_helpers.py fix(agent): set tool_name on tool-result messages at construction time 2026-05-19 20:49:11 +01:00
tool_executor.py fix(cli): surface tool failures with specific error messages 2026-05-23 21:03:51 -07:00
tool_guardrails.py
tool_result_classification.py
trajectory.py
transcription_provider.py feat(stt): add register_transcription_provider() plugin hook 2026-05-25 01:41:19 -07:00
transcription_registry.py feat(stt): add register_transcription_provider() plugin hook 2026-05-25 01:41:19 -07:00
tts_provider.py feat(tts): add register_tts_provider() plugin hook (closes #30398) 2026-05-24 18:04:54 -07:00
tts_registry.py feat(tts): add register_tts_provider() plugin hook (closes #30398) 2026-05-24 18:04:54 -07:00
usage_pricing.py
video_gen_provider.py
video_gen_registry.py
web_search_provider.py
web_search_registry.py