fix(tui): auto-recover session on unexpected gateway death (+ persist lifecycle breadcrumbs) (#35893)
* fix(tui): persist gateway lifecycle breadcrumbs to crash log A backend SIGTERM (`=== SIGTERM received ===` in tui_gateway_crash.log) is always a parent action — `gw.kill()` (graceful-exit on a signal to Node, or an explicit /quit) or `start()` replacing a live child. #31051 added parent-side lifecycle breadcrumbs but left them in an in-memory CircularBuffer that dies with the process, so SIGTERM crash reports arrive with no parent context and no way to tell a signal-driven kill from a memory-critical `process.exit(137)` (which closes the child's stdin → clean EOF, not SIGTERM). Persist the death-explaining breadcrumbs (spawn / transport-exit / child-exit / replace-live-child / kill-reason / startup-timeout) plus the graceful-exit signal name and the memory-critical exit into the same crash log the Python side writes, so they interleave by timestamp next to the child's panic entry — making these recurring reports diagnosable. Gated off under VITEST so unit tests stay hermetic. * feat(tui): auto-recover the session when the gateway dies unexpectedly When a still-owned gateway child dies while the TUI is alive (a crash, OOM process.exit, or a SIGTERM/SIGHUP forwarded to it), the app currently nulls the session and drops to an inert "gateway exited" state — the user loses a long session and has to restart + re-run everything. That single behavior is most of the "TUI doesn't survive heavy work" complaint, independent of what does the killing. The 'exit' event only reaches this handler on an *unexpected* death: a user /quit calls process.exit before it fires, and a replaced child is identity- skipped in GatewayClient. So on exit we now respawn the gateway and resume the session that was live (history is persisted in SQLite) via a one-shot recoverSidRef the next gateway.ready consults before forging a new session. The in-flight reply is lost (it died with the process) but the session survives. Bounded to GATEWAY_RECOVERY_LIMIT (3) attempts per GATEWAY_RECOVERY_WINDOW_MS (60s) so a gateway that crash-loops on startup can't spawn-storm; past the budget we fall back to the inert state. * fix(tui): sanitize newlines + soften SIGTERM-cause claim in parentLog Address PR review: - recordParentLifecycle collapses embedded \r\n so a multi-line value (e.g. an error message) stays a single breadcrumb and can't masquerade as a separate entry or as the child's panic output sharing the crash log. - Reword the header: a backend SIGTERM is *usually* a parent action but can come straight from an external supervisor (s6, cgroup OOM, stray kill); the presence/absence of a [tui-parent] line before the child's panic is precisely what disambiguates the two. * fix(tui): clear sid during recovery + extract/test the recovery budget Address PR review: - Null `sid` immediately in the gateway exit handler. While the gateway is down (busy=false) the old sid would otherwise let sid-guarded effects (the 1.5s session.active_list poll, queue drain) fire RPCs at a dead/respawning gateway. recoverSidRef carries the session forward; resumeById restores sid on ready. - Extract the respawn budget into a pure evalRecovery() (gatewayRecovery.ts) and unit-test the bound: allows GATEWAY_RECOVERY_LIMIT within the window, blocks past it, and prunes attempts older than the window so recovery re-arms. * fix(tui): cap parent-log breadcrumb length (PR review) Truncate a single persisted breadcrumb to 4096 chars (matching GatewayClient's in-memory log-line cap) so a pathological value — e.g. a giant error string — can't bloat the shared crash log or add noticeable blocking on the synchronous append during a failure path. Covered by a test. * fix(tui): keep "recovering session…" status visible during resume (PR review) resumeById() synchronously sets status to 'resuming…' on entry, so the recovery branch now applies its 'recovering session…' label *after* calling resumeById — the distinct label sticks for the duration of the resume RPC (which later flips to 'ready') instead of being immediately clobbered. Test updated to assert the ordering. * fix(tui): keep recovery budget alive across a startup crash-loop (PR review) deadSid was read from getUiState().sid, which the first exit nulls — so if the respawned gateway crash-looped before gateway.ready (resumeById never restored sid), later exits saw null and abandoned the session after a single attempt, defeating the bounded retry budget. Lift the whole decision into a pure planGatewayRecovery() that falls back to the pending recoverSidRef target when the live sid is already cleared, and unit-test the crash-loop sequence (keeps retrying the same session up to the limit, then falls back to inert). Supersedes evalRecovery. * chore(tui): drop non-null assertion + clarify breadcrumb cap comment (PR review) - Recovery branch guards on `recoverSidRef && recoverSid` so the ref write needs no `!` assertion (avoids a future unsafe refactor). - Reword the parentLog cap comment: it slices the value to 4096 chars and appends a short truncation marker (so the written line is slightly longer), rather than implying a strict 4096-byte limit. * chore(tui): soften "absence ⇒ external signal" + "any in-flight reply" (PR review) - parentLog header: a missing [tui-parent] line only *suggests* an external signal (the logger is best-effort: VITEST-disabled, failed append swallowed), not a definitive conclusion. - Recovery notice says "any in-flight reply was lost" since the gateway can also exit while idle.
This commit is contained in:
parent
04bb74c58e
commit
a726e8a811
10 changed files with 318 additions and 8 deletions
|
|
@ -681,6 +681,31 @@ describe('createGatewayEventHandler', () => {
|
|||
expect(resumeById).not.toHaveBeenCalled()
|
||||
})
|
||||
|
||||
it('on gateway.ready after a crash, resumes the recovered session once and skips forge', async () => {
|
||||
const appended: Msg[] = []
|
||||
const newSession = vi.fn()
|
||||
const resumeById = vi.fn()
|
||||
const ctx = buildCtx(appended)
|
||||
|
||||
ctx.session.newSession = newSession
|
||||
// Mimic resumeById's synchronous status write so the test proves the
|
||||
// "recovering session…" label is applied *after* (and survives) it.
|
||||
ctx.session.resumeById = resumeById.mockImplementation(() => patchUiState({ status: 'resuming…' }))
|
||||
ctx.session.STARTUP_RESUME_ID = ''
|
||||
ctx.session.recoverSidRef = ref<null | string>('sess-crashed')
|
||||
|
||||
const onEvent = createGatewayEventHandler(ctx)
|
||||
|
||||
onEvent({ payload: {}, type: 'gateway.ready' } as any)
|
||||
|
||||
await vi.waitFor(() => expect(resumeById).toHaveBeenCalledWith('sess-crashed'))
|
||||
expect(newSession).not.toHaveBeenCalled()
|
||||
// One-shot: the ref is consumed so a later ordinary restart forges/resumes
|
||||
// per config instead of re-resuming the recovered session.
|
||||
expect(ctx.session.recoverSidRef.current).toBeNull()
|
||||
expect(getUiState().status).toBe('recovering session…')
|
||||
})
|
||||
|
||||
it('on gateway.ready with auto_resume on and a recent session, resumes it', async () => {
|
||||
const appended: Msg[] = []
|
||||
const newSession = vi.fn()
|
||||
|
|
|
|||
47
ui-tui/src/__tests__/gatewayRecovery.test.ts
Normal file
47
ui-tui/src/__tests__/gatewayRecovery.test.ts
Normal file
|
|
@ -0,0 +1,47 @@
|
|||
import { describe, expect, it } from 'vitest'
|
||||
|
||||
import { GATEWAY_RECOVERY_LIMIT, GATEWAY_RECOVERY_WINDOW_MS, planGatewayRecovery } from '../app/gatewayRecovery.js'
|
||||
|
||||
describe('planGatewayRecovery', () => {
|
||||
it('recovers the live session and records the attempt', () => {
|
||||
const plan = planGatewayRecovery('sess-1', null, [], 1000)
|
||||
|
||||
expect(plan).toEqual({ attempts: [1000], recover: true, sid: 'sess-1' })
|
||||
})
|
||||
|
||||
it('does not recover when there is no session to resume', () => {
|
||||
expect(planGatewayRecovery(null, null, [], 1000)).toEqual({ attempts: [], recover: false, sid: null })
|
||||
})
|
||||
|
||||
it('keeps retrying the recovery target through a startup crash-loop, bounded by the budget', () => {
|
||||
// First exit: live sid present.
|
||||
let attempts: number[] = []
|
||||
let plan = planGatewayRecovery('sess-1', null, attempts, 0)
|
||||
|
||||
expect(plan.recover).toBe(true)
|
||||
expect(plan.sid).toBe('sess-1')
|
||||
attempts = plan.attempts
|
||||
|
||||
// Respawn crash-loops before gateway.ready: live sid is now null, but the
|
||||
// recovery target carries it forward so we keep trying up to the budget.
|
||||
for (let i = 1; i < GATEWAY_RECOVERY_LIMIT; i++) {
|
||||
plan = planGatewayRecovery(null, 'sess-1', attempts, i)
|
||||
expect(plan.recover).toBe(true)
|
||||
expect(plan.sid).toBe('sess-1')
|
||||
attempts = plan.attempts
|
||||
}
|
||||
|
||||
// Budget exhausted: fall back to the inert state instead of spawn-storming.
|
||||
plan = planGatewayRecovery(null, 'sess-1', attempts, GATEWAY_RECOVERY_LIMIT)
|
||||
expect(plan.recover).toBe(false)
|
||||
expect(plan.sid).toBe('sess-1')
|
||||
})
|
||||
|
||||
it('prunes attempts older than the window so recovery re-arms', () => {
|
||||
const old = Array.from({ length: GATEWAY_RECOVERY_LIMIT }, (_, i) => i)
|
||||
const plan = planGatewayRecovery('sess-1', null, old, GATEWAY_RECOVERY_WINDOW_MS + 100)
|
||||
|
||||
expect(plan.attempts).toEqual([GATEWAY_RECOVERY_WINDOW_MS + 100])
|
||||
expect(plan.recover).toBe(true)
|
||||
})
|
||||
})
|
||||
75
ui-tui/src/__tests__/parentLog.test.ts
Normal file
75
ui-tui/src/__tests__/parentLog.test.ts
Normal file
|
|
@ -0,0 +1,75 @@
|
|||
import { mkdtempSync, readFileSync, rmSync } from 'node:fs'
|
||||
import { tmpdir } from 'node:os'
|
||||
import { join } from 'node:path'
|
||||
|
||||
import { afterEach, beforeEach, describe, expect, it, vi } from 'vitest'
|
||||
|
||||
// parentLog gates itself off under VITEST so unit tests can't pollute a real
|
||||
// ~/.hermes. To exercise the real persistence path we clear that gate, point
|
||||
// HERMES_HOME at a temp dir, and re-import the module fresh (path + enabled
|
||||
// flag are captured at module load).
|
||||
const loadFresh = async (home: string) => {
|
||||
vi.resetModules()
|
||||
vi.stubEnv('VITEST', '')
|
||||
vi.stubEnv('HERMES_HOME', home)
|
||||
|
||||
return import('../lib/parentLog.js')
|
||||
}
|
||||
|
||||
describe('recordParentLifecycle', () => {
|
||||
let home: string
|
||||
|
||||
beforeEach(() => {
|
||||
home = mkdtempSync(join(tmpdir(), 'hermes-parentlog-'))
|
||||
})
|
||||
|
||||
afterEach(() => {
|
||||
vi.unstubAllEnvs()
|
||||
rmSync(home, { force: true, recursive: true })
|
||||
})
|
||||
|
||||
it('appends a timestamped breadcrumb to logs/tui_gateway_crash.log', async () => {
|
||||
const { recordParentLifecycle } = await loadFresh(home)
|
||||
|
||||
recordParentLifecycle('graceful-exit received signal=SIGHUP → killing gateway')
|
||||
|
||||
const contents = readFileSync(join(home, 'logs', 'tui_gateway_crash.log'), 'utf8')
|
||||
|
||||
expect(contents).toContain('[tui-parent]')
|
||||
expect(contents).toContain('graceful-exit received signal=SIGHUP → killing gateway')
|
||||
expect(contents).toMatch(/\d{4}-\d{2}-\d{2}T/)
|
||||
})
|
||||
|
||||
it('collapses embedded newlines so a value stays one breadcrumb', async () => {
|
||||
const { recordParentLifecycle } = await loadFresh(home)
|
||||
|
||||
recordParentLifecycle('uncaughtException: boom\n at foo()\r\n at bar()')
|
||||
|
||||
const lines = readFileSync(join(home, 'logs', 'tui_gateway_crash.log'), 'utf8').trimEnd().split('\n')
|
||||
|
||||
expect(lines).toHaveLength(1)
|
||||
expect(lines[0]).toContain('boom ↵ at foo() ↵ at bar()')
|
||||
})
|
||||
|
||||
it('caps an oversized breadcrumb so it cannot bloat the shared crash log', async () => {
|
||||
const { recordParentLifecycle } = await loadFresh(home)
|
||||
|
||||
recordParentLifecycle('x'.repeat(10_000))
|
||||
|
||||
const line = readFileSync(join(home, 'logs', 'tui_gateway_crash.log'), 'utf8')
|
||||
|
||||
expect(line).toContain('[truncated 10000 chars]')
|
||||
expect(line.length).toBeLessThan(4_500)
|
||||
})
|
||||
|
||||
it('is a no-op under VITEST so tests stay hermetic', async () => {
|
||||
vi.resetModules()
|
||||
vi.stubEnv('VITEST', 'true')
|
||||
vi.stubEnv('HERMES_HOME', home)
|
||||
|
||||
const { recordParentLifecycle } = await import('../lib/parentLog.js')
|
||||
|
||||
expect(() => recordParentLifecycle('should not be written')).not.toThrow()
|
||||
expect(() => readFileSync(join(home, 'logs', 'tui_gateway_crash.log'), 'utf8')).toThrow()
|
||||
})
|
||||
})
|
||||
|
|
@ -76,7 +76,7 @@ const normalizeSubagentStatus = (status: unknown, fallback: SubagentStatus): Sub
|
|||
|
||||
export function createGatewayEventHandler(ctx: GatewayEventHandlerContext): (ev: GatewayEvent) => void {
|
||||
const { rpc } = ctx.gateway
|
||||
const { STARTUP_RESUME_ID, newSession, resumeById, setCatalog } = ctx.session
|
||||
const { STARTUP_RESUME_ID, newSession, recoverSidRef, resumeById, setCatalog } = ctx.session
|
||||
const { bellOnComplete, stdout, sys } = ctx.system
|
||||
const { appendMessage, panel, setHistoryItems } = ctx.transcript
|
||||
const { setInput } = ctx.composer
|
||||
|
|
@ -303,6 +303,23 @@ export function createGatewayEventHandler(ctx: GatewayEventHandlerContext): (ev:
|
|||
})
|
||||
.catch((e: unknown) => turnController.pushActivity(`command catalog unavailable: ${rpcErrorMessage(e)}`, 'info'))
|
||||
|
||||
// Crash recovery: a respawn triggered by an unexpected gateway death
|
||||
// resumes the session that was live, not a brand-new one. One-shot — the
|
||||
// ref is cleared so an ordinary later restart still forges/resumes per
|
||||
// config. No startup prompt here (this is mid-session, not a cold boot).
|
||||
const recoverSid = recoverSidRef?.current
|
||||
|
||||
if (recoverSidRef && recoverSid) {
|
||||
recoverSidRef.current = null
|
||||
resumeById(recoverSid)
|
||||
// After resumeById: it synchronously sets status to 'resuming…' on entry,
|
||||
// so override it here to keep the distinct "recovering" label visible for
|
||||
// the duration of the resume RPC (which later flips status to 'ready').
|
||||
patchUiState({ status: 'recovering session…' })
|
||||
|
||||
return
|
||||
}
|
||||
|
||||
if (STARTUP_RESUME_ID) {
|
||||
patchUiState({ status: 'resuming…' })
|
||||
resumeById(STARTUP_RESUME_ID)
|
||||
|
|
|
|||
35
ui-tui/src/app/gatewayRecovery.ts
Normal file
35
ui-tui/src/app/gatewayRecovery.ts
Normal file
|
|
@ -0,0 +1,35 @@
|
|||
// Crash-recovery budget for the gateway exit handler. A gateway that
|
||||
// crash-loops on startup must not let the TUI spawn-storm, so respawn+resume
|
||||
// attempts are capped to GATEWAY_RECOVERY_LIMIT within a sliding
|
||||
// GATEWAY_RECOVERY_WINDOW_MS; past the budget the app falls back to the inert
|
||||
// "gateway exited" state. Kept pure (no refs/UI) so the bound — including the
|
||||
// crash-loop case — is unit-testable.
|
||||
export const GATEWAY_RECOVERY_LIMIT = 3
|
||||
export const GATEWAY_RECOVERY_WINDOW_MS = 60_000
|
||||
|
||||
export interface RecoveryPlan {
|
||||
// Attempt timestamps to persist (the pruned window, plus `now` iff recovering).
|
||||
attempts: number[]
|
||||
recover: boolean
|
||||
// Session to resume — the live sid, or the not-yet-consumed recovery target
|
||||
// when the live sid was already cleared by a prior exit.
|
||||
sid: null | string
|
||||
}
|
||||
|
||||
// Decide whether to respawn+resume after a gateway death. `liveSid` is the
|
||||
// current session (nulled on the first exit); `recoverSid` is a pending
|
||||
// recovery target carried across a respawn that died before gateway.ready —
|
||||
// so a startup crash-loop keeps retrying the same session up to the budget
|
||||
// instead of stranding it after one attempt.
|
||||
export function planGatewayRecovery(
|
||||
liveSid: null | string,
|
||||
recoverSid: null | string,
|
||||
attempts: number[],
|
||||
now: number
|
||||
): RecoveryPlan {
|
||||
const sid = liveSid ?? recoverSid
|
||||
const recent = attempts.filter(t => now - t < GATEWAY_RECOVERY_WINDOW_MS)
|
||||
const recover = Boolean(sid) && recent.length < GATEWAY_RECOVERY_LIMIT
|
||||
|
||||
return { attempts: recover ? [...recent, now] : recent, recover, sid }
|
||||
}
|
||||
|
|
@ -239,6 +239,10 @@ export interface GatewayEventHandlerContext {
|
|||
STARTUP_RESUME_ID: string
|
||||
colsRef: MutableRefObject<number>
|
||||
newSession: (msg?: string, title?: string) => void
|
||||
// Set by useMainApp's exit handler to the session that was live when the
|
||||
// gateway died unexpectedly; consumed once by the next `gateway.ready` so a
|
||||
// respawn resumes that session instead of forging a fresh one.
|
||||
recoverSidRef?: MutableRefObject<null | string>
|
||||
resetSession: () => void
|
||||
resumeById: (id: string) => void
|
||||
setCatalog: StateSetter<null | SlashCatalog>
|
||||
|
|
|
|||
|
|
@ -30,6 +30,7 @@ import type { Msg, PanelSection, SlashCatalog } from '../types.js'
|
|||
|
||||
import { createGatewayEventHandler } from './createGatewayEventHandler.js'
|
||||
import { createSlashHandler } from './createSlashHandler.js'
|
||||
import { planGatewayRecovery } from './gatewayRecovery.js'
|
||||
import { getInputSelection } from './inputSelectionStore.js'
|
||||
import { type GatewayRpc, type TranscriptRow } from './interfaces.js'
|
||||
import { $overlayState, patchOverlayState } from './overlayStore.js'
|
||||
|
|
@ -200,6 +201,8 @@ export function useMainApp(gw: GatewayClient) {
|
|||
const terminalHintsShownRef = useRef(new Set<string>())
|
||||
const historyItemsRef = useRef(historyItems)
|
||||
const lastUserMsgRef = useRef(lastUserMsg)
|
||||
const recoverSidRef = useRef<null | string>(null)
|
||||
const recoveryAtRef = useRef<number[]>([])
|
||||
const msgIdsRef = useRef(new WeakMap<Msg, string>())
|
||||
const msgIdSeqRef = useRef(0)
|
||||
const heightCachesRef = useRef(new Map<string, Map<string, number>>())
|
||||
|
|
@ -694,6 +697,7 @@ export function useMainApp(gw: GatewayClient) {
|
|||
STARTUP_RESUME_ID,
|
||||
colsRef,
|
||||
newSession: session.newSession,
|
||||
recoverSidRef,
|
||||
resetSession: session.resetSession,
|
||||
resumeById: session.resumeById,
|
||||
setCatalog
|
||||
|
|
@ -734,7 +738,34 @@ export function useMainApp(gw: GatewayClient) {
|
|||
|
||||
const exitHandler = () => {
|
||||
turnController.reset()
|
||||
|
||||
// A still-owned child dying while the TUI is alive is an *unexpected*
|
||||
// death — a user /quit exits Node before this fires, and a replaced child
|
||||
// is identity-skipped in GatewayClient. Rather than stranding a long
|
||||
// session (the user's complaint), respawn the gateway and resume the
|
||||
// persisted session via the next gateway.ready, so a single crash / OOM /
|
||||
// signal doesn't lose their work. planGatewayRecovery bounds the attempts
|
||||
// so a gateway that crash-loops on startup can't spawn-storm, and falls
|
||||
// back to recoverSidRef when sid was already cleared by a prior exit.
|
||||
const plan = planGatewayRecovery(getUiState().sid, recoverSidRef.current, recoveryAtRef.current, Date.now())
|
||||
|
||||
// Clear sid immediately: while the gateway is down, sid-guarded effects
|
||||
// (session.active_list poll, queue drain) would otherwise fire RPCs at a
|
||||
// dead/respawning gateway. recoverSidRef carries the session forward, and
|
||||
// resumeById restores sid once the fresh gateway is ready.
|
||||
recoveryAtRef.current = plan.attempts
|
||||
patchUiState({ busy: false, sid: null, status: 'gateway exited' })
|
||||
|
||||
if (plan.recover && plan.sid) {
|
||||
recoverSidRef.current = plan.sid
|
||||
turnController.pushActivity('gateway exited · recovering session…', 'warn')
|
||||
sys('gateway exited — recovering your session (any in-flight reply was lost)')
|
||||
gw.start()
|
||||
|
||||
return
|
||||
}
|
||||
|
||||
recoverSidRef.current = null
|
||||
turnController.pushActivity('gateway exited · /logs to inspect', 'error')
|
||||
sys('error: gateway exited')
|
||||
}
|
||||
|
|
|
|||
|
|
@ -11,6 +11,7 @@ import { setupGracefulExit } from './lib/gracefulExit.js'
|
|||
import { formatBytes, type HeapDumpResult, performHeapDump } from './lib/memory.js'
|
||||
import { type MemorySnapshot, startMemoryMonitor } from './lib/memoryMonitor.js'
|
||||
import { openExternalUrl } from './lib/openExternalUrl.js'
|
||||
import { recordParentLifecycle } from './lib/parentLog.js'
|
||||
import { clampStdoutDimensions } from './lib/terminalDimensions.js'
|
||||
import { resetTerminalModes } from './lib/terminalModes.js'
|
||||
|
||||
|
|
@ -56,9 +57,14 @@ setupGracefulExit({
|
|||
onError: (scope, err) => {
|
||||
const message = err instanceof Error ? `${err.name}: ${err.message}\n${err.stack ?? ''}` : String(err)
|
||||
|
||||
recordParentLifecycle(`${scope}: ${message.split('\n')[0]?.slice(0, 400) ?? ''}`)
|
||||
process.stderr.write(`hermes-tui lifecycle ${scope}: ${message.slice(0, 2000)}\n`)
|
||||
},
|
||||
onSignal: signal => {
|
||||
// The next line in the crash log is the child's `=== SIGTERM received ===`
|
||||
// (gw.kill forwards SIGTERM regardless of which signal hit us) — this is
|
||||
// what tells SIGHUP (terminal/SSH dropped) apart from a real SIGTERM.
|
||||
recordParentLifecycle(`graceful-exit received signal=${signal} → killing gateway`)
|
||||
resetTerminalModes()
|
||||
process.stderr.write(`hermes-tui lifecycle: received ${signal}\n`)
|
||||
}
|
||||
|
|
@ -66,6 +72,10 @@ setupGracefulExit({
|
|||
|
||||
const stopMemoryMonitor = startMemoryMonitor({
|
||||
onCritical: (snap, dump) => {
|
||||
// process.exit(137) closes the child's stdin → the gateway logs a clean
|
||||
// EOF, NOT SIGTERM. Recording it here is the only way a crash report can
|
||||
// attribute a death to Node OOM rather than a signal-driven kill.
|
||||
recordParentLifecycle(`memory-critical process.exit(137) heap=${formatBytes(snap.heapUsed)} rss=${formatBytes(snap.rss)} dump=${dump?.heapPath ?? 'failed'}`)
|
||||
resetTerminalModes()
|
||||
process.stderr.write(`hermes-tui lifecycle: memory critical exit heap=${formatBytes(snap.heapUsed)} rss=${formatBytes(snap.rss)}\n`)
|
||||
process.stderr.write(dumpNotice(snap, dump))
|
||||
|
|
|
|||
|
|
@ -6,6 +6,7 @@ import { createInterface } from 'node:readline'
|
|||
|
||||
import type { GatewayEvent } from './gatewayTypes.js'
|
||||
import { CircularBuffer } from './lib/circularBuffer.js'
|
||||
import { recordParentLifecycle } from './lib/parentLog.js'
|
||||
|
||||
const MAX_GATEWAY_LOG_LINES = 200
|
||||
const MAX_LOG_LINE_BYTES = 4096
|
||||
|
|
@ -237,7 +238,7 @@ export class GatewayClient extends EventEmitter {
|
|||
// readable on slow boots.
|
||||
const stderrTail = this.getLogTail(20)
|
||||
|
||||
this.pushLog(`[startup] timed out waiting for gateway.ready (python=${python}, cwd=${cwd})`)
|
||||
this.lifecycle(`[startup] timed out waiting for gateway.ready (python=${python}, cwd=${cwd})`)
|
||||
this.publish({
|
||||
type: 'gateway.start_timeout',
|
||||
payload: { cwd, python, stderr_tail: stderrTail }
|
||||
|
|
@ -248,7 +249,7 @@ export class GatewayClient extends EventEmitter {
|
|||
private handleTransportExit(code: null | number, reason?: string) {
|
||||
this.clearReadyTimer()
|
||||
this.closeSidecarSocket()
|
||||
this.pushLog(`[lifecycle] transport exit code=${code ?? 'null'} reason=${reason ?? 'none'}`)
|
||||
this.lifecycle(`[lifecycle] transport exit code=${code ?? 'null'} reason=${reason ?? 'none'}`)
|
||||
this.rejectPending(new Error(reason || `gateway exited${code === null ? '' : ` (${code})`}`))
|
||||
|
||||
if (this.subscribed) {
|
||||
|
|
@ -335,7 +336,7 @@ export class GatewayClient extends EventEmitter {
|
|||
env.PYTHONPATH = pyPath ? `${root}${delimiter}${pyPath}` : root
|
||||
this.startReadyTimer(python, cwd)
|
||||
this.proc = spawn(python, ['-m', 'tui_gateway.entry'], { cwd, env, stdio: ['pipe', 'pipe', 'pipe'] })
|
||||
this.pushLog(`[lifecycle] spawned gateway child ${describeChild(this.proc)} python=${python} cwd=${cwd}`)
|
||||
this.lifecycle(`[lifecycle] spawned gateway child ${describeChild(this.proc)} python=${python} cwd=${cwd}`)
|
||||
|
||||
this.stdoutRl = createInterface({ input: this.proc.stdout! })
|
||||
this.stdoutRl.on('line', raw => {
|
||||
|
|
@ -372,7 +373,7 @@ export class GatewayClient extends EventEmitter {
|
|||
|
||||
const line = `[spawn] ${err.message}`
|
||||
|
||||
this.pushLog(`[lifecycle] child error ${describeChild(ownedProc)} message=${err.message}`)
|
||||
this.lifecycle(`[lifecycle] child error ${describeChild(ownedProc)} message=${err.message}`)
|
||||
this.pushLog(line)
|
||||
this.publish({ type: 'gateway.stderr', payload: { line } })
|
||||
// Detach the reference up front so the late `exit` event for
|
||||
|
|
@ -396,7 +397,7 @@ export class GatewayClient extends EventEmitter {
|
|||
return
|
||||
}
|
||||
|
||||
this.pushLog(`[lifecycle] child exit ${describeChild(ownedProc)} code=${code ?? 'null'} signal=${signal ?? 'null'}`)
|
||||
this.lifecycle(`[lifecycle] child exit ${describeChild(ownedProc)} code=${code ?? 'null'} signal=${signal ?? 'null'}`)
|
||||
this.handleTransportExit(code)
|
||||
})
|
||||
}
|
||||
|
|
@ -507,7 +508,7 @@ export class GatewayClient extends EventEmitter {
|
|||
this.resetStartupState()
|
||||
|
||||
if (this.proc && !this.proc.killed && this.proc.exitCode === null) {
|
||||
this.pushLog(`[lifecycle] replacing live gateway child ${describeChild(this.proc)}`)
|
||||
this.lifecycle(`[lifecycle] replacing live gateway child ${describeChild(this.proc)}`)
|
||||
this.proc.kill()
|
||||
}
|
||||
|
||||
|
|
@ -564,6 +565,14 @@ export class GatewayClient extends EventEmitter {
|
|||
this.logs.push(truncateLine(line))
|
||||
}
|
||||
|
||||
// Death-explaining breadcrumbs (spawn / exit / kill / replace) — kept in the
|
||||
// in-memory tail for /logs AND persisted to the gateway crash log so the
|
||||
// reason survives a parent exit and lands next to the child's SIGTERM panic.
|
||||
private lifecycle(line: string) {
|
||||
this.pushLog(line)
|
||||
recordParentLifecycle(line)
|
||||
}
|
||||
|
||||
private rejectPending(err: Error) {
|
||||
for (const p of this.pending.values()) {
|
||||
clearTimeout(p.timeout)
|
||||
|
|
@ -717,7 +726,7 @@ export class GatewayClient extends EventEmitter {
|
|||
const proc = this.proc
|
||||
const killed = proc?.kill()
|
||||
|
||||
this.pushLog(`[lifecycle] GatewayClient.kill reason=${reason} ${describeChild(proc)} killResult=${killed ?? 'none'}`)
|
||||
this.lifecycle(`[lifecycle] GatewayClient.kill reason=${reason} ${describeChild(proc)} killResult=${killed ?? 'none'}`)
|
||||
this.closeGatewaySocket()
|
||||
this.closeSidecarSocket()
|
||||
this.clearReadyTimer()
|
||||
|
|
|
|||
57
ui-tui/src/lib/parentLog.ts
Normal file
57
ui-tui/src/lib/parentLog.ts
Normal file
|
|
@ -0,0 +1,57 @@
|
|||
import { appendFileSync, mkdirSync } from 'node:fs'
|
||||
import { homedir } from 'node:os'
|
||||
import { join } from 'node:path'
|
||||
|
||||
// Mirror the Python gateway's panic log (tui_gateway/server.py::_CRASH_LOG) from
|
||||
// the Node parent so lifecycle breadcrumbs interleave, by timestamp, with the
|
||||
// child's `=== SIGTERM received ===` / `=== gateway exit ===` entries.
|
||||
//
|
||||
// A backend SIGTERM is *usually* a parent action — `gw.kill()` (graceful-exit on
|
||||
// a signal to Node, or an explicit /quit) or `start()` replacing a live child —
|
||||
// but it can also come straight from an external supervisor (s6, a cgroup OOM
|
||||
// reaper, a stray `kill`) signalling the child directly. Telling those apart is
|
||||
// exactly the point: #31051 left these breadcrumbs in an in-memory CircularBuffer
|
||||
// that dies with the process, so SIGTERM crash reports arrived with no parent
|
||||
// context. A `[tui-parent]` line immediately before the child's panic means a
|
||||
// parent kill; its absence *suggests* an external signal — not definitive,
|
||||
// since this logger is best-effort (disabled under VITEST, and a failed append
|
||||
// is swallowed). Persisting the death-explaining events here is what makes that
|
||||
// distinction (and a memory-critical `process.exit(137)`, which closes stdin →
|
||||
// clean EOF, not SIGTERM) diagnosable after the fact.
|
||||
const logDir = join(process.env.HERMES_HOME?.trim() || join(homedir(), '.hermes'), 'logs')
|
||||
const CRASH_LOG = join(logDir, 'tui_gateway_crash.log')
|
||||
|
||||
// Skipped under vitest so unit tests exercising start()/kill() can't write into
|
||||
// a real ~/.hermes (tests must stay hermetic — see AGENTS.md).
|
||||
const enabled = !process.env.VITEST
|
||||
// Slice a single breadcrumb's value to MAX_BREADCRUMB chars (a short
|
||||
// "[truncated …]" marker is appended, so the written line is slightly longer)
|
||||
// so a pathological value (e.g. a giant error) can't bloat the shared crash log
|
||||
// or add noticeable blocking on the synchronous append. Mirrors the spirit of
|
||||
// GatewayClient's in-memory log-line cap.
|
||||
const MAX_BREADCRUMB = 4096
|
||||
let warned = false
|
||||
|
||||
export function recordParentLifecycle(line: string): void {
|
||||
if (!enabled) {
|
||||
return
|
||||
}
|
||||
|
||||
try {
|
||||
// Collapse embedded newlines so a multi-line value (e.g. an error message)
|
||||
// stays one breadcrumb and can't masquerade as a separate log entry or as
|
||||
// the child's panic output sharing this file.
|
||||
const oneLine = line.replace(/[\r\n]+/g, ' ↵ ')
|
||||
|
||||
const capped =
|
||||
oneLine.length > MAX_BREADCRUMB ? `${oneLine.slice(0, MAX_BREADCRUMB)}… [truncated ${oneLine.length} chars]` : oneLine
|
||||
|
||||
mkdirSync(logDir, { recursive: true })
|
||||
appendFileSync(CRASH_LOG, `[tui-parent] ${new Date().toISOString()} ${capped}\n`)
|
||||
} catch {
|
||||
if (!warned) {
|
||||
warned = true
|
||||
process.stderr.write('hermes-tui: parent lifecycle log unavailable\n')
|
||||
}
|
||||
}
|
||||
}
|
||||
Loading…
Add table
Add a link
Reference in a new issue