fix(tui): auto-recover session on unexpected gateway death (+ persist lifecycle breadcrumbs) (#35893)

* fix(tui): persist gateway lifecycle breadcrumbs to crash log

A backend SIGTERM (`=== SIGTERM received ===` in tui_gateway_crash.log) is
always a parent action — `gw.kill()` (graceful-exit on a signal to Node, or an
explicit /quit) or `start()` replacing a live child. #31051 added parent-side
lifecycle breadcrumbs but left them in an in-memory CircularBuffer that dies
with the process, so SIGTERM crash reports arrive with no parent context and no
way to tell a signal-driven kill from a memory-critical `process.exit(137)`
(which closes the child's stdin → clean EOF, not SIGTERM).

Persist the death-explaining breadcrumbs (spawn / transport-exit / child-exit /
replace-live-child / kill-reason / startup-timeout) plus the graceful-exit
signal name and the memory-critical exit into the same crash log the Python
side writes, so they interleave by timestamp next to the child's panic entry —
making these recurring reports diagnosable.

Gated off under VITEST so unit tests stay hermetic.

* feat(tui): auto-recover the session when the gateway dies unexpectedly

When a still-owned gateway child dies while the TUI is alive (a crash, OOM
process.exit, or a SIGTERM/SIGHUP forwarded to it), the app currently nulls the
session and drops to an inert "gateway exited" state — the user loses a long
session and has to restart + re-run everything. That single behavior is most of
the "TUI doesn't survive heavy work" complaint, independent of what does the
killing.

The 'exit' event only reaches this handler on an *unexpected* death: a user
/quit calls process.exit before it fires, and a replaced child is identity-
skipped in GatewayClient. So on exit we now respawn the gateway and resume the
session that was live (history is persisted in SQLite) via a one-shot
recoverSidRef the next gateway.ready consults before forging a new session. The
in-flight reply is lost (it died with the process) but the session survives.

Bounded to GATEWAY_RECOVERY_LIMIT (3) attempts per GATEWAY_RECOVERY_WINDOW_MS
(60s) so a gateway that crash-loops on startup can't spawn-storm; past the
budget we fall back to the inert state.

* fix(tui): sanitize newlines + soften SIGTERM-cause claim in parentLog

Address PR review:
- recordParentLifecycle collapses embedded \r\n so a multi-line value (e.g. an
  error message) stays a single breadcrumb and can't masquerade as a separate
  entry or as the child's panic output sharing the crash log.
- Reword the header: a backend SIGTERM is *usually* a parent action but can come
  straight from an external supervisor (s6, cgroup OOM, stray kill); the
  presence/absence of a [tui-parent] line before the child's panic is precisely
  what disambiguates the two.

* fix(tui): clear sid during recovery + extract/test the recovery budget

Address PR review:
- Null `sid` immediately in the gateway exit handler. While the gateway is down
  (busy=false) the old sid would otherwise let sid-guarded effects (the 1.5s
  session.active_list poll, queue drain) fire RPCs at a dead/respawning gateway.
  recoverSidRef carries the session forward; resumeById restores sid on ready.
- Extract the respawn budget into a pure evalRecovery() (gatewayRecovery.ts) and
  unit-test the bound: allows GATEWAY_RECOVERY_LIMIT within the window, blocks
  past it, and prunes attempts older than the window so recovery re-arms.

* fix(tui): cap parent-log breadcrumb length (PR review)

Truncate a single persisted breadcrumb to 4096 chars (matching GatewayClient's
in-memory log-line cap) so a pathological value — e.g. a giant error string —
can't bloat the shared crash log or add noticeable blocking on the synchronous
append during a failure path. Covered by a test.

* fix(tui): keep "recovering session…" status visible during resume (PR review)

resumeById() synchronously sets status to 'resuming…' on entry, so the
recovery branch now applies its 'recovering session…' label *after* calling
resumeById — the distinct label sticks for the duration of the resume RPC
(which later flips to 'ready') instead of being immediately clobbered. Test
updated to assert the ordering.

* fix(tui): keep recovery budget alive across a startup crash-loop (PR review)

deadSid was read from getUiState().sid, which the first exit nulls — so if the
respawned gateway crash-looped before gateway.ready (resumeById never restored
sid), later exits saw null and abandoned the session after a single attempt,
defeating the bounded retry budget.

Lift the whole decision into a pure planGatewayRecovery() that falls back to the
pending recoverSidRef target when the live sid is already cleared, and unit-test
the crash-loop sequence (keeps retrying the same session up to the limit, then
falls back to inert). Supersedes evalRecovery.

* chore(tui): drop non-null assertion + clarify breadcrumb cap comment (PR review)

- Recovery branch guards on `recoverSidRef && recoverSid` so the ref write needs
  no `!` assertion (avoids a future unsafe refactor).
- Reword the parentLog cap comment: it slices the value to 4096 chars and
  appends a short truncation marker (so the written line is slightly longer),
  rather than implying a strict 4096-byte limit.

* chore(tui): soften "absence ⇒ external signal" + "any in-flight reply" (PR review)

- parentLog header: a missing [tui-parent] line only *suggests* an external
  signal (the logger is best-effort: VITEST-disabled, failed append swallowed),
  not a definitive conclusion.
- Recovery notice says "any in-flight reply was lost" since the gateway can also
  exit while idle.
This commit is contained in:
brooklyn! 2026-05-31 10:36:57 -05:00 committed by GitHub
parent 04bb74c58e
commit a726e8a811
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
10 changed files with 318 additions and 8 deletions

View file

@ -681,6 +681,31 @@ describe('createGatewayEventHandler', () => {
expect(resumeById).not.toHaveBeenCalled()
})
it('on gateway.ready after a crash, resumes the recovered session once and skips forge', async () => {
const appended: Msg[] = []
const newSession = vi.fn()
const resumeById = vi.fn()
const ctx = buildCtx(appended)
ctx.session.newSession = newSession
// Mimic resumeById's synchronous status write so the test proves the
// "recovering session…" label is applied *after* (and survives) it.
ctx.session.resumeById = resumeById.mockImplementation(() => patchUiState({ status: 'resuming…' }))
ctx.session.STARTUP_RESUME_ID = ''
ctx.session.recoverSidRef = ref<null | string>('sess-crashed')
const onEvent = createGatewayEventHandler(ctx)
onEvent({ payload: {}, type: 'gateway.ready' } as any)
await vi.waitFor(() => expect(resumeById).toHaveBeenCalledWith('sess-crashed'))
expect(newSession).not.toHaveBeenCalled()
// One-shot: the ref is consumed so a later ordinary restart forges/resumes
// per config instead of re-resuming the recovered session.
expect(ctx.session.recoverSidRef.current).toBeNull()
expect(getUiState().status).toBe('recovering session…')
})
it('on gateway.ready with auto_resume on and a recent session, resumes it', async () => {
const appended: Msg[] = []
const newSession = vi.fn()

View file

@ -0,0 +1,47 @@
import { describe, expect, it } from 'vitest'
import { GATEWAY_RECOVERY_LIMIT, GATEWAY_RECOVERY_WINDOW_MS, planGatewayRecovery } from '../app/gatewayRecovery.js'
describe('planGatewayRecovery', () => {
it('recovers the live session and records the attempt', () => {
const plan = planGatewayRecovery('sess-1', null, [], 1000)
expect(plan).toEqual({ attempts: [1000], recover: true, sid: 'sess-1' })
})
it('does not recover when there is no session to resume', () => {
expect(planGatewayRecovery(null, null, [], 1000)).toEqual({ attempts: [], recover: false, sid: null })
})
it('keeps retrying the recovery target through a startup crash-loop, bounded by the budget', () => {
// First exit: live sid present.
let attempts: number[] = []
let plan = planGatewayRecovery('sess-1', null, attempts, 0)
expect(plan.recover).toBe(true)
expect(plan.sid).toBe('sess-1')
attempts = plan.attempts
// Respawn crash-loops before gateway.ready: live sid is now null, but the
// recovery target carries it forward so we keep trying up to the budget.
for (let i = 1; i < GATEWAY_RECOVERY_LIMIT; i++) {
plan = planGatewayRecovery(null, 'sess-1', attempts, i)
expect(plan.recover).toBe(true)
expect(plan.sid).toBe('sess-1')
attempts = plan.attempts
}
// Budget exhausted: fall back to the inert state instead of spawn-storming.
plan = planGatewayRecovery(null, 'sess-1', attempts, GATEWAY_RECOVERY_LIMIT)
expect(plan.recover).toBe(false)
expect(plan.sid).toBe('sess-1')
})
it('prunes attempts older than the window so recovery re-arms', () => {
const old = Array.from({ length: GATEWAY_RECOVERY_LIMIT }, (_, i) => i)
const plan = planGatewayRecovery('sess-1', null, old, GATEWAY_RECOVERY_WINDOW_MS + 100)
expect(plan.attempts).toEqual([GATEWAY_RECOVERY_WINDOW_MS + 100])
expect(plan.recover).toBe(true)
})
})

View file

@ -0,0 +1,75 @@
import { mkdtempSync, readFileSync, rmSync } from 'node:fs'
import { tmpdir } from 'node:os'
import { join } from 'node:path'
import { afterEach, beforeEach, describe, expect, it, vi } from 'vitest'
// parentLog gates itself off under VITEST so unit tests can't pollute a real
// ~/.hermes. To exercise the real persistence path we clear that gate, point
// HERMES_HOME at a temp dir, and re-import the module fresh (path + enabled
// flag are captured at module load).
const loadFresh = async (home: string) => {
vi.resetModules()
vi.stubEnv('VITEST', '')
vi.stubEnv('HERMES_HOME', home)
return import('../lib/parentLog.js')
}
describe('recordParentLifecycle', () => {
let home: string
beforeEach(() => {
home = mkdtempSync(join(tmpdir(), 'hermes-parentlog-'))
})
afterEach(() => {
vi.unstubAllEnvs()
rmSync(home, { force: true, recursive: true })
})
it('appends a timestamped breadcrumb to logs/tui_gateway_crash.log', async () => {
const { recordParentLifecycle } = await loadFresh(home)
recordParentLifecycle('graceful-exit received signal=SIGHUP → killing gateway')
const contents = readFileSync(join(home, 'logs', 'tui_gateway_crash.log'), 'utf8')
expect(contents).toContain('[tui-parent]')
expect(contents).toContain('graceful-exit received signal=SIGHUP → killing gateway')
expect(contents).toMatch(/\d{4}-\d{2}-\d{2}T/)
})
it('collapses embedded newlines so a value stays one breadcrumb', async () => {
const { recordParentLifecycle } = await loadFresh(home)
recordParentLifecycle('uncaughtException: boom\n at foo()\r\n at bar()')
const lines = readFileSync(join(home, 'logs', 'tui_gateway_crash.log'), 'utf8').trimEnd().split('\n')
expect(lines).toHaveLength(1)
expect(lines[0]).toContain('boom ↵ at foo() ↵ at bar()')
})
it('caps an oversized breadcrumb so it cannot bloat the shared crash log', async () => {
const { recordParentLifecycle } = await loadFresh(home)
recordParentLifecycle('x'.repeat(10_000))
const line = readFileSync(join(home, 'logs', 'tui_gateway_crash.log'), 'utf8')
expect(line).toContain('[truncated 10000 chars]')
expect(line.length).toBeLessThan(4_500)
})
it('is a no-op under VITEST so tests stay hermetic', async () => {
vi.resetModules()
vi.stubEnv('VITEST', 'true')
vi.stubEnv('HERMES_HOME', home)
const { recordParentLifecycle } = await import('../lib/parentLog.js')
expect(() => recordParentLifecycle('should not be written')).not.toThrow()
expect(() => readFileSync(join(home, 'logs', 'tui_gateway_crash.log'), 'utf8')).toThrow()
})
})

View file

@ -76,7 +76,7 @@ const normalizeSubagentStatus = (status: unknown, fallback: SubagentStatus): Sub
export function createGatewayEventHandler(ctx: GatewayEventHandlerContext): (ev: GatewayEvent) => void {
const { rpc } = ctx.gateway
const { STARTUP_RESUME_ID, newSession, resumeById, setCatalog } = ctx.session
const { STARTUP_RESUME_ID, newSession, recoverSidRef, resumeById, setCatalog } = ctx.session
const { bellOnComplete, stdout, sys } = ctx.system
const { appendMessage, panel, setHistoryItems } = ctx.transcript
const { setInput } = ctx.composer
@ -303,6 +303,23 @@ export function createGatewayEventHandler(ctx: GatewayEventHandlerContext): (ev:
})
.catch((e: unknown) => turnController.pushActivity(`command catalog unavailable: ${rpcErrorMessage(e)}`, 'info'))
// Crash recovery: a respawn triggered by an unexpected gateway death
// resumes the session that was live, not a brand-new one. One-shot — the
// ref is cleared so an ordinary later restart still forges/resumes per
// config. No startup prompt here (this is mid-session, not a cold boot).
const recoverSid = recoverSidRef?.current
if (recoverSidRef && recoverSid) {
recoverSidRef.current = null
resumeById(recoverSid)
// After resumeById: it synchronously sets status to 'resuming…' on entry,
// so override it here to keep the distinct "recovering" label visible for
// the duration of the resume RPC (which later flips status to 'ready').
patchUiState({ status: 'recovering session…' })
return
}
if (STARTUP_RESUME_ID) {
patchUiState({ status: 'resuming…' })
resumeById(STARTUP_RESUME_ID)

View file

@ -0,0 +1,35 @@
// Crash-recovery budget for the gateway exit handler. A gateway that
// crash-loops on startup must not let the TUI spawn-storm, so respawn+resume
// attempts are capped to GATEWAY_RECOVERY_LIMIT within a sliding
// GATEWAY_RECOVERY_WINDOW_MS; past the budget the app falls back to the inert
// "gateway exited" state. Kept pure (no refs/UI) so the bound — including the
// crash-loop case — is unit-testable.
export const GATEWAY_RECOVERY_LIMIT = 3
export const GATEWAY_RECOVERY_WINDOW_MS = 60_000
export interface RecoveryPlan {
// Attempt timestamps to persist (the pruned window, plus `now` iff recovering).
attempts: number[]
recover: boolean
// Session to resume — the live sid, or the not-yet-consumed recovery target
// when the live sid was already cleared by a prior exit.
sid: null | string
}
// Decide whether to respawn+resume after a gateway death. `liveSid` is the
// current session (nulled on the first exit); `recoverSid` is a pending
// recovery target carried across a respawn that died before gateway.ready —
// so a startup crash-loop keeps retrying the same session up to the budget
// instead of stranding it after one attempt.
export function planGatewayRecovery(
liveSid: null | string,
recoverSid: null | string,
attempts: number[],
now: number
): RecoveryPlan {
const sid = liveSid ?? recoverSid
const recent = attempts.filter(t => now - t < GATEWAY_RECOVERY_WINDOW_MS)
const recover = Boolean(sid) && recent.length < GATEWAY_RECOVERY_LIMIT
return { attempts: recover ? [...recent, now] : recent, recover, sid }
}

View file

@ -239,6 +239,10 @@ export interface GatewayEventHandlerContext {
STARTUP_RESUME_ID: string
colsRef: MutableRefObject<number>
newSession: (msg?: string, title?: string) => void
// Set by useMainApp's exit handler to the session that was live when the
// gateway died unexpectedly; consumed once by the next `gateway.ready` so a
// respawn resumes that session instead of forging a fresh one.
recoverSidRef?: MutableRefObject<null | string>
resetSession: () => void
resumeById: (id: string) => void
setCatalog: StateSetter<null | SlashCatalog>

View file

@ -30,6 +30,7 @@ import type { Msg, PanelSection, SlashCatalog } from '../types.js'
import { createGatewayEventHandler } from './createGatewayEventHandler.js'
import { createSlashHandler } from './createSlashHandler.js'
import { planGatewayRecovery } from './gatewayRecovery.js'
import { getInputSelection } from './inputSelectionStore.js'
import { type GatewayRpc, type TranscriptRow } from './interfaces.js'
import { $overlayState, patchOverlayState } from './overlayStore.js'
@ -200,6 +201,8 @@ export function useMainApp(gw: GatewayClient) {
const terminalHintsShownRef = useRef(new Set<string>())
const historyItemsRef = useRef(historyItems)
const lastUserMsgRef = useRef(lastUserMsg)
const recoverSidRef = useRef<null | string>(null)
const recoveryAtRef = useRef<number[]>([])
const msgIdsRef = useRef(new WeakMap<Msg, string>())
const msgIdSeqRef = useRef(0)
const heightCachesRef = useRef(new Map<string, Map<string, number>>())
@ -694,6 +697,7 @@ export function useMainApp(gw: GatewayClient) {
STARTUP_RESUME_ID,
colsRef,
newSession: session.newSession,
recoverSidRef,
resetSession: session.resetSession,
resumeById: session.resumeById,
setCatalog
@ -734,7 +738,34 @@ export function useMainApp(gw: GatewayClient) {
const exitHandler = () => {
turnController.reset()
// A still-owned child dying while the TUI is alive is an *unexpected*
// death — a user /quit exits Node before this fires, and a replaced child
// is identity-skipped in GatewayClient. Rather than stranding a long
// session (the user's complaint), respawn the gateway and resume the
// persisted session via the next gateway.ready, so a single crash / OOM /
// signal doesn't lose their work. planGatewayRecovery bounds the attempts
// so a gateway that crash-loops on startup can't spawn-storm, and falls
// back to recoverSidRef when sid was already cleared by a prior exit.
const plan = planGatewayRecovery(getUiState().sid, recoverSidRef.current, recoveryAtRef.current, Date.now())
// Clear sid immediately: while the gateway is down, sid-guarded effects
// (session.active_list poll, queue drain) would otherwise fire RPCs at a
// dead/respawning gateway. recoverSidRef carries the session forward, and
// resumeById restores sid once the fresh gateway is ready.
recoveryAtRef.current = plan.attempts
patchUiState({ busy: false, sid: null, status: 'gateway exited' })
if (plan.recover && plan.sid) {
recoverSidRef.current = plan.sid
turnController.pushActivity('gateway exited · recovering session…', 'warn')
sys('gateway exited — recovering your session (any in-flight reply was lost)')
gw.start()
return
}
recoverSidRef.current = null
turnController.pushActivity('gateway exited · /logs to inspect', 'error')
sys('error: gateway exited')
}

View file

@ -11,6 +11,7 @@ import { setupGracefulExit } from './lib/gracefulExit.js'
import { formatBytes, type HeapDumpResult, performHeapDump } from './lib/memory.js'
import { type MemorySnapshot, startMemoryMonitor } from './lib/memoryMonitor.js'
import { openExternalUrl } from './lib/openExternalUrl.js'
import { recordParentLifecycle } from './lib/parentLog.js'
import { clampStdoutDimensions } from './lib/terminalDimensions.js'
import { resetTerminalModes } from './lib/terminalModes.js'
@ -56,9 +57,14 @@ setupGracefulExit({
onError: (scope, err) => {
const message = err instanceof Error ? `${err.name}: ${err.message}\n${err.stack ?? ''}` : String(err)
recordParentLifecycle(`${scope}: ${message.split('\n')[0]?.slice(0, 400) ?? ''}`)
process.stderr.write(`hermes-tui lifecycle ${scope}: ${message.slice(0, 2000)}\n`)
},
onSignal: signal => {
// The next line in the crash log is the child's `=== SIGTERM received ===`
// (gw.kill forwards SIGTERM regardless of which signal hit us) — this is
// what tells SIGHUP (terminal/SSH dropped) apart from a real SIGTERM.
recordParentLifecycle(`graceful-exit received signal=${signal} → killing gateway`)
resetTerminalModes()
process.stderr.write(`hermes-tui lifecycle: received ${signal}\n`)
}
@ -66,6 +72,10 @@ setupGracefulExit({
const stopMemoryMonitor = startMemoryMonitor({
onCritical: (snap, dump) => {
// process.exit(137) closes the child's stdin → the gateway logs a clean
// EOF, NOT SIGTERM. Recording it here is the only way a crash report can
// attribute a death to Node OOM rather than a signal-driven kill.
recordParentLifecycle(`memory-critical process.exit(137) heap=${formatBytes(snap.heapUsed)} rss=${formatBytes(snap.rss)} dump=${dump?.heapPath ?? 'failed'}`)
resetTerminalModes()
process.stderr.write(`hermes-tui lifecycle: memory critical exit heap=${formatBytes(snap.heapUsed)} rss=${formatBytes(snap.rss)}\n`)
process.stderr.write(dumpNotice(snap, dump))

View file

@ -6,6 +6,7 @@ import { createInterface } from 'node:readline'
import type { GatewayEvent } from './gatewayTypes.js'
import { CircularBuffer } from './lib/circularBuffer.js'
import { recordParentLifecycle } from './lib/parentLog.js'
const MAX_GATEWAY_LOG_LINES = 200
const MAX_LOG_LINE_BYTES = 4096
@ -237,7 +238,7 @@ export class GatewayClient extends EventEmitter {
// readable on slow boots.
const stderrTail = this.getLogTail(20)
this.pushLog(`[startup] timed out waiting for gateway.ready (python=${python}, cwd=${cwd})`)
this.lifecycle(`[startup] timed out waiting for gateway.ready (python=${python}, cwd=${cwd})`)
this.publish({
type: 'gateway.start_timeout',
payload: { cwd, python, stderr_tail: stderrTail }
@ -248,7 +249,7 @@ export class GatewayClient extends EventEmitter {
private handleTransportExit(code: null | number, reason?: string) {
this.clearReadyTimer()
this.closeSidecarSocket()
this.pushLog(`[lifecycle] transport exit code=${code ?? 'null'} reason=${reason ?? 'none'}`)
this.lifecycle(`[lifecycle] transport exit code=${code ?? 'null'} reason=${reason ?? 'none'}`)
this.rejectPending(new Error(reason || `gateway exited${code === null ? '' : ` (${code})`}`))
if (this.subscribed) {
@ -335,7 +336,7 @@ export class GatewayClient extends EventEmitter {
env.PYTHONPATH = pyPath ? `${root}${delimiter}${pyPath}` : root
this.startReadyTimer(python, cwd)
this.proc = spawn(python, ['-m', 'tui_gateway.entry'], { cwd, env, stdio: ['pipe', 'pipe', 'pipe'] })
this.pushLog(`[lifecycle] spawned gateway child ${describeChild(this.proc)} python=${python} cwd=${cwd}`)
this.lifecycle(`[lifecycle] spawned gateway child ${describeChild(this.proc)} python=${python} cwd=${cwd}`)
this.stdoutRl = createInterface({ input: this.proc.stdout! })
this.stdoutRl.on('line', raw => {
@ -372,7 +373,7 @@ export class GatewayClient extends EventEmitter {
const line = `[spawn] ${err.message}`
this.pushLog(`[lifecycle] child error ${describeChild(ownedProc)} message=${err.message}`)
this.lifecycle(`[lifecycle] child error ${describeChild(ownedProc)} message=${err.message}`)
this.pushLog(line)
this.publish({ type: 'gateway.stderr', payload: { line } })
// Detach the reference up front so the late `exit` event for
@ -396,7 +397,7 @@ export class GatewayClient extends EventEmitter {
return
}
this.pushLog(`[lifecycle] child exit ${describeChild(ownedProc)} code=${code ?? 'null'} signal=${signal ?? 'null'}`)
this.lifecycle(`[lifecycle] child exit ${describeChild(ownedProc)} code=${code ?? 'null'} signal=${signal ?? 'null'}`)
this.handleTransportExit(code)
})
}
@ -507,7 +508,7 @@ export class GatewayClient extends EventEmitter {
this.resetStartupState()
if (this.proc && !this.proc.killed && this.proc.exitCode === null) {
this.pushLog(`[lifecycle] replacing live gateway child ${describeChild(this.proc)}`)
this.lifecycle(`[lifecycle] replacing live gateway child ${describeChild(this.proc)}`)
this.proc.kill()
}
@ -564,6 +565,14 @@ export class GatewayClient extends EventEmitter {
this.logs.push(truncateLine(line))
}
// Death-explaining breadcrumbs (spawn / exit / kill / replace) — kept in the
// in-memory tail for /logs AND persisted to the gateway crash log so the
// reason survives a parent exit and lands next to the child's SIGTERM panic.
private lifecycle(line: string) {
this.pushLog(line)
recordParentLifecycle(line)
}
private rejectPending(err: Error) {
for (const p of this.pending.values()) {
clearTimeout(p.timeout)
@ -717,7 +726,7 @@ export class GatewayClient extends EventEmitter {
const proc = this.proc
const killed = proc?.kill()
this.pushLog(`[lifecycle] GatewayClient.kill reason=${reason} ${describeChild(proc)} killResult=${killed ?? 'none'}`)
this.lifecycle(`[lifecycle] GatewayClient.kill reason=${reason} ${describeChild(proc)} killResult=${killed ?? 'none'}`)
this.closeGatewaySocket()
this.closeSidecarSocket()
this.clearReadyTimer()

View file

@ -0,0 +1,57 @@
import { appendFileSync, mkdirSync } from 'node:fs'
import { homedir } from 'node:os'
import { join } from 'node:path'
// Mirror the Python gateway's panic log (tui_gateway/server.py::_CRASH_LOG) from
// the Node parent so lifecycle breadcrumbs interleave, by timestamp, with the
// child's `=== SIGTERM received ===` / `=== gateway exit ===` entries.
//
// A backend SIGTERM is *usually* a parent action — `gw.kill()` (graceful-exit on
// a signal to Node, or an explicit /quit) or `start()` replacing a live child —
// but it can also come straight from an external supervisor (s6, a cgroup OOM
// reaper, a stray `kill`) signalling the child directly. Telling those apart is
// exactly the point: #31051 left these breadcrumbs in an in-memory CircularBuffer
// that dies with the process, so SIGTERM crash reports arrived with no parent
// context. A `[tui-parent]` line immediately before the child's panic means a
// parent kill; its absence *suggests* an external signal — not definitive,
// since this logger is best-effort (disabled under VITEST, and a failed append
// is swallowed). Persisting the death-explaining events here is what makes that
// distinction (and a memory-critical `process.exit(137)`, which closes stdin →
// clean EOF, not SIGTERM) diagnosable after the fact.
const logDir = join(process.env.HERMES_HOME?.trim() || join(homedir(), '.hermes'), 'logs')
const CRASH_LOG = join(logDir, 'tui_gateway_crash.log')
// Skipped under vitest so unit tests exercising start()/kill() can't write into
// a real ~/.hermes (tests must stay hermetic — see AGENTS.md).
const enabled = !process.env.VITEST
// Slice a single breadcrumb's value to MAX_BREADCRUMB chars (a short
// "[truncated …]" marker is appended, so the written line is slightly longer)
// so a pathological value (e.g. a giant error) can't bloat the shared crash log
// or add noticeable blocking on the synchronous append. Mirrors the spirit of
// GatewayClient's in-memory log-line cap.
const MAX_BREADCRUMB = 4096
let warned = false
export function recordParentLifecycle(line: string): void {
if (!enabled) {
return
}
try {
// Collapse embedded newlines so a multi-line value (e.g. an error message)
// stays one breadcrumb and can't masquerade as a separate log entry or as
// the child's panic output sharing this file.
const oneLine = line.replace(/[\r\n]+/g, ' ↵ ')
const capped =
oneLine.length > MAX_BREADCRUMB ? `${oneLine.slice(0, MAX_BREADCRUMB)}… [truncated ${oneLine.length} chars]` : oneLine
mkdirSync(logDir, { recursive: true })
appendFileSync(CRASH_LOG, `[tui-parent] ${new Date().toISOString()} ${capped}\n`)
} catch {
if (!warned) {
warned = true
process.stderr.write('hermes-tui: parent lifecycle log unavailable\n')
}
}
}