From a726e8a81194a23c7ce3ba5773bff5bb08ba133f Mon Sep 17 00:00:00 2001 From: brooklyn! Date: Sun, 31 May 2026 10:36:57 -0500 Subject: [PATCH] fix(tui): auto-recover session on unexpected gateway death (+ persist lifecycle breadcrumbs) (#35893) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * fix(tui): persist gateway lifecycle breadcrumbs to crash log A backend SIGTERM (`=== SIGTERM received ===` in tui_gateway_crash.log) is always a parent action — `gw.kill()` (graceful-exit on a signal to Node, or an explicit /quit) or `start()` replacing a live child. #31051 added parent-side lifecycle breadcrumbs but left them in an in-memory CircularBuffer that dies with the process, so SIGTERM crash reports arrive with no parent context and no way to tell a signal-driven kill from a memory-critical `process.exit(137)` (which closes the child's stdin → clean EOF, not SIGTERM). Persist the death-explaining breadcrumbs (spawn / transport-exit / child-exit / replace-live-child / kill-reason / startup-timeout) plus the graceful-exit signal name and the memory-critical exit into the same crash log the Python side writes, so they interleave by timestamp next to the child's panic entry — making these recurring reports diagnosable. Gated off under VITEST so unit tests stay hermetic. * feat(tui): auto-recover the session when the gateway dies unexpectedly When a still-owned gateway child dies while the TUI is alive (a crash, OOM process.exit, or a SIGTERM/SIGHUP forwarded to it), the app currently nulls the session and drops to an inert "gateway exited" state — the user loses a long session and has to restart + re-run everything. That single behavior is most of the "TUI doesn't survive heavy work" complaint, independent of what does the killing. The 'exit' event only reaches this handler on an *unexpected* death: a user /quit calls process.exit before it fires, and a replaced child is identity- skipped in GatewayClient. So on exit we now respawn the gateway and resume the session that was live (history is persisted in SQLite) via a one-shot recoverSidRef the next gateway.ready consults before forging a new session. The in-flight reply is lost (it died with the process) but the session survives. Bounded to GATEWAY_RECOVERY_LIMIT (3) attempts per GATEWAY_RECOVERY_WINDOW_MS (60s) so a gateway that crash-loops on startup can't spawn-storm; past the budget we fall back to the inert state. * fix(tui): sanitize newlines + soften SIGTERM-cause claim in parentLog Address PR review: - recordParentLifecycle collapses embedded \r\n so a multi-line value (e.g. an error message) stays a single breadcrumb and can't masquerade as a separate entry or as the child's panic output sharing the crash log. - Reword the header: a backend SIGTERM is *usually* a parent action but can come straight from an external supervisor (s6, cgroup OOM, stray kill); the presence/absence of a [tui-parent] line before the child's panic is precisely what disambiguates the two. * fix(tui): clear sid during recovery + extract/test the recovery budget Address PR review: - Null `sid` immediately in the gateway exit handler. While the gateway is down (busy=false) the old sid would otherwise let sid-guarded effects (the 1.5s session.active_list poll, queue drain) fire RPCs at a dead/respawning gateway. recoverSidRef carries the session forward; resumeById restores sid on ready. - Extract the respawn budget into a pure evalRecovery() (gatewayRecovery.ts) and unit-test the bound: allows GATEWAY_RECOVERY_LIMIT within the window, blocks past it, and prunes attempts older than the window so recovery re-arms. * fix(tui): cap parent-log breadcrumb length (PR review) Truncate a single persisted breadcrumb to 4096 chars (matching GatewayClient's in-memory log-line cap) so a pathological value — e.g. a giant error string — can't bloat the shared crash log or add noticeable blocking on the synchronous append during a failure path. Covered by a test. * fix(tui): keep "recovering session…" status visible during resume (PR review) resumeById() synchronously sets status to 'resuming…' on entry, so the recovery branch now applies its 'recovering session…' label *after* calling resumeById — the distinct label sticks for the duration of the resume RPC (which later flips to 'ready') instead of being immediately clobbered. Test updated to assert the ordering. * fix(tui): keep recovery budget alive across a startup crash-loop (PR review) deadSid was read from getUiState().sid, which the first exit nulls — so if the respawned gateway crash-looped before gateway.ready (resumeById never restored sid), later exits saw null and abandoned the session after a single attempt, defeating the bounded retry budget. Lift the whole decision into a pure planGatewayRecovery() that falls back to the pending recoverSidRef target when the live sid is already cleared, and unit-test the crash-loop sequence (keeps retrying the same session up to the limit, then falls back to inert). Supersedes evalRecovery. * chore(tui): drop non-null assertion + clarify breadcrumb cap comment (PR review) - Recovery branch guards on `recoverSidRef && recoverSid` so the ref write needs no `!` assertion (avoids a future unsafe refactor). - Reword the parentLog cap comment: it slices the value to 4096 chars and appends a short truncation marker (so the written line is slightly longer), rather than implying a strict 4096-byte limit. * chore(tui): soften "absence ⇒ external signal" + "any in-flight reply" (PR review) - parentLog header: a missing [tui-parent] line only *suggests* an external signal (the logger is best-effort: VITEST-disabled, failed append swallowed), not a definitive conclusion. - Recovery notice says "any in-flight reply was lost" since the gateway can also exit while idle. --- .../createGatewayEventHandler.test.ts | 25 +++++++ ui-tui/src/__tests__/gatewayRecovery.test.ts | 47 ++++++++++++ ui-tui/src/__tests__/parentLog.test.ts | 75 +++++++++++++++++++ ui-tui/src/app/createGatewayEventHandler.ts | 19 ++++- ui-tui/src/app/gatewayRecovery.ts | 35 +++++++++ ui-tui/src/app/interfaces.ts | 4 + ui-tui/src/app/useMainApp.ts | 31 ++++++++ ui-tui/src/entry.tsx | 10 +++ ui-tui/src/gatewayClient.ts | 23 ++++-- ui-tui/src/lib/parentLog.ts | 57 ++++++++++++++ 10 files changed, 318 insertions(+), 8 deletions(-) create mode 100644 ui-tui/src/__tests__/gatewayRecovery.test.ts create mode 100644 ui-tui/src/__tests__/parentLog.test.ts create mode 100644 ui-tui/src/app/gatewayRecovery.ts create mode 100644 ui-tui/src/lib/parentLog.ts diff --git a/ui-tui/src/__tests__/createGatewayEventHandler.test.ts b/ui-tui/src/__tests__/createGatewayEventHandler.test.ts index 897875b2c..afebc4d10 100644 --- a/ui-tui/src/__tests__/createGatewayEventHandler.test.ts +++ b/ui-tui/src/__tests__/createGatewayEventHandler.test.ts @@ -681,6 +681,31 @@ describe('createGatewayEventHandler', () => { expect(resumeById).not.toHaveBeenCalled() }) + it('on gateway.ready after a crash, resumes the recovered session once and skips forge', async () => { + const appended: Msg[] = [] + const newSession = vi.fn() + const resumeById = vi.fn() + const ctx = buildCtx(appended) + + ctx.session.newSession = newSession + // Mimic resumeById's synchronous status write so the test proves the + // "recovering session…" label is applied *after* (and survives) it. + ctx.session.resumeById = resumeById.mockImplementation(() => patchUiState({ status: 'resuming…' })) + ctx.session.STARTUP_RESUME_ID = '' + ctx.session.recoverSidRef = ref('sess-crashed') + + const onEvent = createGatewayEventHandler(ctx) + + onEvent({ payload: {}, type: 'gateway.ready' } as any) + + await vi.waitFor(() => expect(resumeById).toHaveBeenCalledWith('sess-crashed')) + expect(newSession).not.toHaveBeenCalled() + // One-shot: the ref is consumed so a later ordinary restart forges/resumes + // per config instead of re-resuming the recovered session. + expect(ctx.session.recoverSidRef.current).toBeNull() + expect(getUiState().status).toBe('recovering session…') + }) + it('on gateway.ready with auto_resume on and a recent session, resumes it', async () => { const appended: Msg[] = [] const newSession = vi.fn() diff --git a/ui-tui/src/__tests__/gatewayRecovery.test.ts b/ui-tui/src/__tests__/gatewayRecovery.test.ts new file mode 100644 index 000000000..5ede200a9 --- /dev/null +++ b/ui-tui/src/__tests__/gatewayRecovery.test.ts @@ -0,0 +1,47 @@ +import { describe, expect, it } from 'vitest' + +import { GATEWAY_RECOVERY_LIMIT, GATEWAY_RECOVERY_WINDOW_MS, planGatewayRecovery } from '../app/gatewayRecovery.js' + +describe('planGatewayRecovery', () => { + it('recovers the live session and records the attempt', () => { + const plan = planGatewayRecovery('sess-1', null, [], 1000) + + expect(plan).toEqual({ attempts: [1000], recover: true, sid: 'sess-1' }) + }) + + it('does not recover when there is no session to resume', () => { + expect(planGatewayRecovery(null, null, [], 1000)).toEqual({ attempts: [], recover: false, sid: null }) + }) + + it('keeps retrying the recovery target through a startup crash-loop, bounded by the budget', () => { + // First exit: live sid present. + let attempts: number[] = [] + let plan = planGatewayRecovery('sess-1', null, attempts, 0) + + expect(plan.recover).toBe(true) + expect(plan.sid).toBe('sess-1') + attempts = plan.attempts + + // Respawn crash-loops before gateway.ready: live sid is now null, but the + // recovery target carries it forward so we keep trying up to the budget. + for (let i = 1; i < GATEWAY_RECOVERY_LIMIT; i++) { + plan = planGatewayRecovery(null, 'sess-1', attempts, i) + expect(plan.recover).toBe(true) + expect(plan.sid).toBe('sess-1') + attempts = plan.attempts + } + + // Budget exhausted: fall back to the inert state instead of spawn-storming. + plan = planGatewayRecovery(null, 'sess-1', attempts, GATEWAY_RECOVERY_LIMIT) + expect(plan.recover).toBe(false) + expect(plan.sid).toBe('sess-1') + }) + + it('prunes attempts older than the window so recovery re-arms', () => { + const old = Array.from({ length: GATEWAY_RECOVERY_LIMIT }, (_, i) => i) + const plan = planGatewayRecovery('sess-1', null, old, GATEWAY_RECOVERY_WINDOW_MS + 100) + + expect(plan.attempts).toEqual([GATEWAY_RECOVERY_WINDOW_MS + 100]) + expect(plan.recover).toBe(true) + }) +}) diff --git a/ui-tui/src/__tests__/parentLog.test.ts b/ui-tui/src/__tests__/parentLog.test.ts new file mode 100644 index 000000000..2a910c7cf --- /dev/null +++ b/ui-tui/src/__tests__/parentLog.test.ts @@ -0,0 +1,75 @@ +import { mkdtempSync, readFileSync, rmSync } from 'node:fs' +import { tmpdir } from 'node:os' +import { join } from 'node:path' + +import { afterEach, beforeEach, describe, expect, it, vi } from 'vitest' + +// parentLog gates itself off under VITEST so unit tests can't pollute a real +// ~/.hermes. To exercise the real persistence path we clear that gate, point +// HERMES_HOME at a temp dir, and re-import the module fresh (path + enabled +// flag are captured at module load). +const loadFresh = async (home: string) => { + vi.resetModules() + vi.stubEnv('VITEST', '') + vi.stubEnv('HERMES_HOME', home) + + return import('../lib/parentLog.js') +} + +describe('recordParentLifecycle', () => { + let home: string + + beforeEach(() => { + home = mkdtempSync(join(tmpdir(), 'hermes-parentlog-')) + }) + + afterEach(() => { + vi.unstubAllEnvs() + rmSync(home, { force: true, recursive: true }) + }) + + it('appends a timestamped breadcrumb to logs/tui_gateway_crash.log', async () => { + const { recordParentLifecycle } = await loadFresh(home) + + recordParentLifecycle('graceful-exit received signal=SIGHUP → killing gateway') + + const contents = readFileSync(join(home, 'logs', 'tui_gateway_crash.log'), 'utf8') + + expect(contents).toContain('[tui-parent]') + expect(contents).toContain('graceful-exit received signal=SIGHUP → killing gateway') + expect(contents).toMatch(/\d{4}-\d{2}-\d{2}T/) + }) + + it('collapses embedded newlines so a value stays one breadcrumb', async () => { + const { recordParentLifecycle } = await loadFresh(home) + + recordParentLifecycle('uncaughtException: boom\n at foo()\r\n at bar()') + + const lines = readFileSync(join(home, 'logs', 'tui_gateway_crash.log'), 'utf8').trimEnd().split('\n') + + expect(lines).toHaveLength(1) + expect(lines[0]).toContain('boom ↵ at foo() ↵ at bar()') + }) + + it('caps an oversized breadcrumb so it cannot bloat the shared crash log', async () => { + const { recordParentLifecycle } = await loadFresh(home) + + recordParentLifecycle('x'.repeat(10_000)) + + const line = readFileSync(join(home, 'logs', 'tui_gateway_crash.log'), 'utf8') + + expect(line).toContain('[truncated 10000 chars]') + expect(line.length).toBeLessThan(4_500) + }) + + it('is a no-op under VITEST so tests stay hermetic', async () => { + vi.resetModules() + vi.stubEnv('VITEST', 'true') + vi.stubEnv('HERMES_HOME', home) + + const { recordParentLifecycle } = await import('../lib/parentLog.js') + + expect(() => recordParentLifecycle('should not be written')).not.toThrow() + expect(() => readFileSync(join(home, 'logs', 'tui_gateway_crash.log'), 'utf8')).toThrow() + }) +}) diff --git a/ui-tui/src/app/createGatewayEventHandler.ts b/ui-tui/src/app/createGatewayEventHandler.ts index 70264b0c7..987518a44 100644 --- a/ui-tui/src/app/createGatewayEventHandler.ts +++ b/ui-tui/src/app/createGatewayEventHandler.ts @@ -76,7 +76,7 @@ const normalizeSubagentStatus = (status: unknown, fallback: SubagentStatus): Sub export function createGatewayEventHandler(ctx: GatewayEventHandlerContext): (ev: GatewayEvent) => void { const { rpc } = ctx.gateway - const { STARTUP_RESUME_ID, newSession, resumeById, setCatalog } = ctx.session + const { STARTUP_RESUME_ID, newSession, recoverSidRef, resumeById, setCatalog } = ctx.session const { bellOnComplete, stdout, sys } = ctx.system const { appendMessage, panel, setHistoryItems } = ctx.transcript const { setInput } = ctx.composer @@ -303,6 +303,23 @@ export function createGatewayEventHandler(ctx: GatewayEventHandlerContext): (ev: }) .catch((e: unknown) => turnController.pushActivity(`command catalog unavailable: ${rpcErrorMessage(e)}`, 'info')) + // Crash recovery: a respawn triggered by an unexpected gateway death + // resumes the session that was live, not a brand-new one. One-shot — the + // ref is cleared so an ordinary later restart still forges/resumes per + // config. No startup prompt here (this is mid-session, not a cold boot). + const recoverSid = recoverSidRef?.current + + if (recoverSidRef && recoverSid) { + recoverSidRef.current = null + resumeById(recoverSid) + // After resumeById: it synchronously sets status to 'resuming…' on entry, + // so override it here to keep the distinct "recovering" label visible for + // the duration of the resume RPC (which later flips status to 'ready'). + patchUiState({ status: 'recovering session…' }) + + return + } + if (STARTUP_RESUME_ID) { patchUiState({ status: 'resuming…' }) resumeById(STARTUP_RESUME_ID) diff --git a/ui-tui/src/app/gatewayRecovery.ts b/ui-tui/src/app/gatewayRecovery.ts new file mode 100644 index 000000000..f68cc5acf --- /dev/null +++ b/ui-tui/src/app/gatewayRecovery.ts @@ -0,0 +1,35 @@ +// Crash-recovery budget for the gateway exit handler. A gateway that +// crash-loops on startup must not let the TUI spawn-storm, so respawn+resume +// attempts are capped to GATEWAY_RECOVERY_LIMIT within a sliding +// GATEWAY_RECOVERY_WINDOW_MS; past the budget the app falls back to the inert +// "gateway exited" state. Kept pure (no refs/UI) so the bound — including the +// crash-loop case — is unit-testable. +export const GATEWAY_RECOVERY_LIMIT = 3 +export const GATEWAY_RECOVERY_WINDOW_MS = 60_000 + +export interface RecoveryPlan { + // Attempt timestamps to persist (the pruned window, plus `now` iff recovering). + attempts: number[] + recover: boolean + // Session to resume — the live sid, or the not-yet-consumed recovery target + // when the live sid was already cleared by a prior exit. + sid: null | string +} + +// Decide whether to respawn+resume after a gateway death. `liveSid` is the +// current session (nulled on the first exit); `recoverSid` is a pending +// recovery target carried across a respawn that died before gateway.ready — +// so a startup crash-loop keeps retrying the same session up to the budget +// instead of stranding it after one attempt. +export function planGatewayRecovery( + liveSid: null | string, + recoverSid: null | string, + attempts: number[], + now: number +): RecoveryPlan { + const sid = liveSid ?? recoverSid + const recent = attempts.filter(t => now - t < GATEWAY_RECOVERY_WINDOW_MS) + const recover = Boolean(sid) && recent.length < GATEWAY_RECOVERY_LIMIT + + return { attempts: recover ? [...recent, now] : recent, recover, sid } +} diff --git a/ui-tui/src/app/interfaces.ts b/ui-tui/src/app/interfaces.ts index 991b69fab..cbedac59c 100644 --- a/ui-tui/src/app/interfaces.ts +++ b/ui-tui/src/app/interfaces.ts @@ -239,6 +239,10 @@ export interface GatewayEventHandlerContext { STARTUP_RESUME_ID: string colsRef: MutableRefObject newSession: (msg?: string, title?: string) => void + // Set by useMainApp's exit handler to the session that was live when the + // gateway died unexpectedly; consumed once by the next `gateway.ready` so a + // respawn resumes that session instead of forging a fresh one. + recoverSidRef?: MutableRefObject resetSession: () => void resumeById: (id: string) => void setCatalog: StateSetter diff --git a/ui-tui/src/app/useMainApp.ts b/ui-tui/src/app/useMainApp.ts index cfa454383..43e8a2ed6 100644 --- a/ui-tui/src/app/useMainApp.ts +++ b/ui-tui/src/app/useMainApp.ts @@ -30,6 +30,7 @@ import type { Msg, PanelSection, SlashCatalog } from '../types.js' import { createGatewayEventHandler } from './createGatewayEventHandler.js' import { createSlashHandler } from './createSlashHandler.js' +import { planGatewayRecovery } from './gatewayRecovery.js' import { getInputSelection } from './inputSelectionStore.js' import { type GatewayRpc, type TranscriptRow } from './interfaces.js' import { $overlayState, patchOverlayState } from './overlayStore.js' @@ -200,6 +201,8 @@ export function useMainApp(gw: GatewayClient) { const terminalHintsShownRef = useRef(new Set()) const historyItemsRef = useRef(historyItems) const lastUserMsgRef = useRef(lastUserMsg) + const recoverSidRef = useRef(null) + const recoveryAtRef = useRef([]) const msgIdsRef = useRef(new WeakMap()) const msgIdSeqRef = useRef(0) const heightCachesRef = useRef(new Map>()) @@ -694,6 +697,7 @@ export function useMainApp(gw: GatewayClient) { STARTUP_RESUME_ID, colsRef, newSession: session.newSession, + recoverSidRef, resetSession: session.resetSession, resumeById: session.resumeById, setCatalog @@ -734,7 +738,34 @@ export function useMainApp(gw: GatewayClient) { const exitHandler = () => { turnController.reset() + + // A still-owned child dying while the TUI is alive is an *unexpected* + // death — a user /quit exits Node before this fires, and a replaced child + // is identity-skipped in GatewayClient. Rather than stranding a long + // session (the user's complaint), respawn the gateway and resume the + // persisted session via the next gateway.ready, so a single crash / OOM / + // signal doesn't lose their work. planGatewayRecovery bounds the attempts + // so a gateway that crash-loops on startup can't spawn-storm, and falls + // back to recoverSidRef when sid was already cleared by a prior exit. + const plan = planGatewayRecovery(getUiState().sid, recoverSidRef.current, recoveryAtRef.current, Date.now()) + + // Clear sid immediately: while the gateway is down, sid-guarded effects + // (session.active_list poll, queue drain) would otherwise fire RPCs at a + // dead/respawning gateway. recoverSidRef carries the session forward, and + // resumeById restores sid once the fresh gateway is ready. + recoveryAtRef.current = plan.attempts patchUiState({ busy: false, sid: null, status: 'gateway exited' }) + + if (plan.recover && plan.sid) { + recoverSidRef.current = plan.sid + turnController.pushActivity('gateway exited · recovering session…', 'warn') + sys('gateway exited — recovering your session (any in-flight reply was lost)') + gw.start() + + return + } + + recoverSidRef.current = null turnController.pushActivity('gateway exited · /logs to inspect', 'error') sys('error: gateway exited') } diff --git a/ui-tui/src/entry.tsx b/ui-tui/src/entry.tsx index 787f738f9..45b9f564c 100644 --- a/ui-tui/src/entry.tsx +++ b/ui-tui/src/entry.tsx @@ -11,6 +11,7 @@ import { setupGracefulExit } from './lib/gracefulExit.js' import { formatBytes, type HeapDumpResult, performHeapDump } from './lib/memory.js' import { type MemorySnapshot, startMemoryMonitor } from './lib/memoryMonitor.js' import { openExternalUrl } from './lib/openExternalUrl.js' +import { recordParentLifecycle } from './lib/parentLog.js' import { clampStdoutDimensions } from './lib/terminalDimensions.js' import { resetTerminalModes } from './lib/terminalModes.js' @@ -56,9 +57,14 @@ setupGracefulExit({ onError: (scope, err) => { const message = err instanceof Error ? `${err.name}: ${err.message}\n${err.stack ?? ''}` : String(err) + recordParentLifecycle(`${scope}: ${message.split('\n')[0]?.slice(0, 400) ?? ''}`) process.stderr.write(`hermes-tui lifecycle ${scope}: ${message.slice(0, 2000)}\n`) }, onSignal: signal => { + // The next line in the crash log is the child's `=== SIGTERM received ===` + // (gw.kill forwards SIGTERM regardless of which signal hit us) — this is + // what tells SIGHUP (terminal/SSH dropped) apart from a real SIGTERM. + recordParentLifecycle(`graceful-exit received signal=${signal} → killing gateway`) resetTerminalModes() process.stderr.write(`hermes-tui lifecycle: received ${signal}\n`) } @@ -66,6 +72,10 @@ setupGracefulExit({ const stopMemoryMonitor = startMemoryMonitor({ onCritical: (snap, dump) => { + // process.exit(137) closes the child's stdin → the gateway logs a clean + // EOF, NOT SIGTERM. Recording it here is the only way a crash report can + // attribute a death to Node OOM rather than a signal-driven kill. + recordParentLifecycle(`memory-critical process.exit(137) heap=${formatBytes(snap.heapUsed)} rss=${formatBytes(snap.rss)} dump=${dump?.heapPath ?? 'failed'}`) resetTerminalModes() process.stderr.write(`hermes-tui lifecycle: memory critical exit heap=${formatBytes(snap.heapUsed)} rss=${formatBytes(snap.rss)}\n`) process.stderr.write(dumpNotice(snap, dump)) diff --git a/ui-tui/src/gatewayClient.ts b/ui-tui/src/gatewayClient.ts index f3121152c..37bfa881a 100644 --- a/ui-tui/src/gatewayClient.ts +++ b/ui-tui/src/gatewayClient.ts @@ -6,6 +6,7 @@ import { createInterface } from 'node:readline' import type { GatewayEvent } from './gatewayTypes.js' import { CircularBuffer } from './lib/circularBuffer.js' +import { recordParentLifecycle } from './lib/parentLog.js' const MAX_GATEWAY_LOG_LINES = 200 const MAX_LOG_LINE_BYTES = 4096 @@ -237,7 +238,7 @@ export class GatewayClient extends EventEmitter { // readable on slow boots. const stderrTail = this.getLogTail(20) - this.pushLog(`[startup] timed out waiting for gateway.ready (python=${python}, cwd=${cwd})`) + this.lifecycle(`[startup] timed out waiting for gateway.ready (python=${python}, cwd=${cwd})`) this.publish({ type: 'gateway.start_timeout', payload: { cwd, python, stderr_tail: stderrTail } @@ -248,7 +249,7 @@ export class GatewayClient extends EventEmitter { private handleTransportExit(code: null | number, reason?: string) { this.clearReadyTimer() this.closeSidecarSocket() - this.pushLog(`[lifecycle] transport exit code=${code ?? 'null'} reason=${reason ?? 'none'}`) + this.lifecycle(`[lifecycle] transport exit code=${code ?? 'null'} reason=${reason ?? 'none'}`) this.rejectPending(new Error(reason || `gateway exited${code === null ? '' : ` (${code})`}`)) if (this.subscribed) { @@ -335,7 +336,7 @@ export class GatewayClient extends EventEmitter { env.PYTHONPATH = pyPath ? `${root}${delimiter}${pyPath}` : root this.startReadyTimer(python, cwd) this.proc = spawn(python, ['-m', 'tui_gateway.entry'], { cwd, env, stdio: ['pipe', 'pipe', 'pipe'] }) - this.pushLog(`[lifecycle] spawned gateway child ${describeChild(this.proc)} python=${python} cwd=${cwd}`) + this.lifecycle(`[lifecycle] spawned gateway child ${describeChild(this.proc)} python=${python} cwd=${cwd}`) this.stdoutRl = createInterface({ input: this.proc.stdout! }) this.stdoutRl.on('line', raw => { @@ -372,7 +373,7 @@ export class GatewayClient extends EventEmitter { const line = `[spawn] ${err.message}` - this.pushLog(`[lifecycle] child error ${describeChild(ownedProc)} message=${err.message}`) + this.lifecycle(`[lifecycle] child error ${describeChild(ownedProc)} message=${err.message}`) this.pushLog(line) this.publish({ type: 'gateway.stderr', payload: { line } }) // Detach the reference up front so the late `exit` event for @@ -396,7 +397,7 @@ export class GatewayClient extends EventEmitter { return } - this.pushLog(`[lifecycle] child exit ${describeChild(ownedProc)} code=${code ?? 'null'} signal=${signal ?? 'null'}`) + this.lifecycle(`[lifecycle] child exit ${describeChild(ownedProc)} code=${code ?? 'null'} signal=${signal ?? 'null'}`) this.handleTransportExit(code) }) } @@ -507,7 +508,7 @@ export class GatewayClient extends EventEmitter { this.resetStartupState() if (this.proc && !this.proc.killed && this.proc.exitCode === null) { - this.pushLog(`[lifecycle] replacing live gateway child ${describeChild(this.proc)}`) + this.lifecycle(`[lifecycle] replacing live gateway child ${describeChild(this.proc)}`) this.proc.kill() } @@ -564,6 +565,14 @@ export class GatewayClient extends EventEmitter { this.logs.push(truncateLine(line)) } + // Death-explaining breadcrumbs (spawn / exit / kill / replace) — kept in the + // in-memory tail for /logs AND persisted to the gateway crash log so the + // reason survives a parent exit and lands next to the child's SIGTERM panic. + private lifecycle(line: string) { + this.pushLog(line) + recordParentLifecycle(line) + } + private rejectPending(err: Error) { for (const p of this.pending.values()) { clearTimeout(p.timeout) @@ -717,7 +726,7 @@ export class GatewayClient extends EventEmitter { const proc = this.proc const killed = proc?.kill() - this.pushLog(`[lifecycle] GatewayClient.kill reason=${reason} ${describeChild(proc)} killResult=${killed ?? 'none'}`) + this.lifecycle(`[lifecycle] GatewayClient.kill reason=${reason} ${describeChild(proc)} killResult=${killed ?? 'none'}`) this.closeGatewaySocket() this.closeSidecarSocket() this.clearReadyTimer() diff --git a/ui-tui/src/lib/parentLog.ts b/ui-tui/src/lib/parentLog.ts new file mode 100644 index 000000000..24f458552 --- /dev/null +++ b/ui-tui/src/lib/parentLog.ts @@ -0,0 +1,57 @@ +import { appendFileSync, mkdirSync } from 'node:fs' +import { homedir } from 'node:os' +import { join } from 'node:path' + +// Mirror the Python gateway's panic log (tui_gateway/server.py::_CRASH_LOG) from +// the Node parent so lifecycle breadcrumbs interleave, by timestamp, with the +// child's `=== SIGTERM received ===` / `=== gateway exit ===` entries. +// +// A backend SIGTERM is *usually* a parent action — `gw.kill()` (graceful-exit on +// a signal to Node, or an explicit /quit) or `start()` replacing a live child — +// but it can also come straight from an external supervisor (s6, a cgroup OOM +// reaper, a stray `kill`) signalling the child directly. Telling those apart is +// exactly the point: #31051 left these breadcrumbs in an in-memory CircularBuffer +// that dies with the process, so SIGTERM crash reports arrived with no parent +// context. A `[tui-parent]` line immediately before the child's panic means a +// parent kill; its absence *suggests* an external signal — not definitive, +// since this logger is best-effort (disabled under VITEST, and a failed append +// is swallowed). Persisting the death-explaining events here is what makes that +// distinction (and a memory-critical `process.exit(137)`, which closes stdin → +// clean EOF, not SIGTERM) diagnosable after the fact. +const logDir = join(process.env.HERMES_HOME?.trim() || join(homedir(), '.hermes'), 'logs') +const CRASH_LOG = join(logDir, 'tui_gateway_crash.log') + +// Skipped under vitest so unit tests exercising start()/kill() can't write into +// a real ~/.hermes (tests must stay hermetic — see AGENTS.md). +const enabled = !process.env.VITEST +// Slice a single breadcrumb's value to MAX_BREADCRUMB chars (a short +// "[truncated …]" marker is appended, so the written line is slightly longer) +// so a pathological value (e.g. a giant error) can't bloat the shared crash log +// or add noticeable blocking on the synchronous append. Mirrors the spirit of +// GatewayClient's in-memory log-line cap. +const MAX_BREADCRUMB = 4096 +let warned = false + +export function recordParentLifecycle(line: string): void { + if (!enabled) { + return + } + + try { + // Collapse embedded newlines so a multi-line value (e.g. an error message) + // stays one breadcrumb and can't masquerade as a separate log entry or as + // the child's panic output sharing this file. + const oneLine = line.replace(/[\r\n]+/g, ' ↵ ') + + const capped = + oneLine.length > MAX_BREADCRUMB ? `${oneLine.slice(0, MAX_BREADCRUMB)}… [truncated ${oneLine.length} chars]` : oneLine + + mkdirSync(logDir, { recursive: true }) + appendFileSync(CRASH_LOG, `[tui-parent] ${new Date().toISOString()} ${capped}\n`) + } catch { + if (!warned) { + warned = true + process.stderr.write('hermes-tui: parent lifecycle log unavailable\n') + } + } +}