Heartbeat File Liveness Probe

Problem

The current liveness probe runs exec kill -0 1, which only checks that PID 1 is alive. It cannot distinguish a busy executor from a hung or deadlocked one: a stuck asyncio loop with a live PID passes the probe indefinitely.

Impact

Hung executor pods remain in the pool, occupying a consumer group slot but never completing work.

Proposed Solution

The executor writes a timestamp to /tmp/executor-heartbeat every 30s during active sessions. The probe checks that the PID is alive AND that the heartbeat is less than 120s old. A stale heartbeat marks the pod unhealthy.
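A minimal sketch of both halves of the proposal, to make the behaviour concrete. The path, the 30s interval, and the 120s threshold come from the text above; the function names, the atomic temp-file rename, and treating a missing or unreadable file as unhealthy are assumptions for illustration (the proposal does not say how an executor with no active session should be handled). The writer is meant to run as a task on the executor's own asyncio loop, so a stuck loop stops the writes.

```python
import asyncio
import os
import time
from typing import Optional

HEARTBEAT_PATH = "/tmp/executor-heartbeat"  # path from the proposal
HEARTBEAT_INTERVAL_S = 30                   # write cadence from the proposal
MAX_HEARTBEAT_AGE_S = 120                   # staleness threshold from the proposal


def write_heartbeat(path: str = HEARTBEAT_PATH) -> None:
    """Write the current Unix timestamp to the heartbeat file."""
    tmp_path = f"{path}.tmp"
    with open(tmp_path, "w") as f:
        f.write(f"{time.time():.3f}\n")
    # Assumption: rename into place so the probe never reads a partial file.
    os.replace(tmp_path, path)


async def heartbeat_loop(path: str = HEARTBEAT_PATH,
                         interval_s: float = HEARTBEAT_INTERVAL_S) -> None:
    """Run on the executor's event loop; if the loop is stuck, writes stop."""
    while True:
        write_heartbeat(path)
        await asyncio.sleep(interval_s)


def pid_alive(pid: int) -> bool:
    """Same check as `kill -0 <pid>`: signal 0 tests existence only."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


def heartbeat_age_s(path: str = HEARTBEAT_PATH,
                    now: Optional[float] = None) -> float:
    """Seconds since the recorded timestamp; infinity if missing or unreadable."""
    now = time.time() if now is None else now
    try:
        with open(path) as f:
            return now - float(f.read().strip())
    except (OSError, ValueError):
        return float("inf")


def is_healthy(pid: int = 1,
               path: str = HEARTBEAT_PATH,
               max_age_s: float = MAX_HEARTBEAT_AGE_S) -> bool:
    """Probe condition: PID alive AND heartbeat younger than the threshold."""
    return pid_alive(pid) and heartbeat_age_s(path) < max_age_s
```

An exec probe could call is_healthy() from a small script and exit non-zero when it returns False. The executor would start heartbeat_loop() with asyncio.create_task() when a session begins and cancel it when the session ends.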

Acceptance Criteria

A hung executor is killed within ~3 minutes
Normal long-running sessions (1-5 min) do not trigger false kills
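The ~3 minute target depends on the probe schedule as well as the 120s threshold. A rough worked calculation follows, assuming a Kubernetes-style probe with a period and a failure threshold. The values 30s and 2 below are illustrative assumptions, not settings taken from this proposal, and the figure excludes any termination grace period.

```python
def worst_case_detection_s(max_heartbeat_age_s: float,
                           probe_period_s: float,
                           failure_threshold: int) -> float:
    """Upper bound on time from a hang to the probe declaring the pod unhealthy.

    Worst case: the loop hangs right after a heartbeat write, so the file
    takes the full threshold to go stale, and the probe then needs
    `failure_threshold` consecutive failed checks, each up to one period apart.
    """
    return max_heartbeat_age_s + probe_period_s * failure_threshold


# 120s threshold from the proposal; period and failure threshold are assumed.
print(worst_case_detection_s(120, probe_period_s=30, failure_threshold=2))  # 180
```

Under these assumed settings the bound is 180s, which matches the ~3 minute criterion. A longer probe period or a higher failure threshold pushes detection past it.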