Heartbeat File Liveness Probe
Problem
The current liveness probe is an exec probe that runs kill -0 1, which only checks that PID 1 exists. It cannot distinguish a busy executor from a hung or deadlocked one: a stuck asyncio event loop with a live PID passes the probe indefinitely.
Impact
Hung executor pods remain in the pool, each occupying a consumer group slot without ever completing work.
Proposed Solution
The executor writes a timestamp to /tmp/executor-heartbeat every 30s during active sessions. The probe passes only if the PID is alive AND the heartbeat is less than 120s old. A stale heartbeat marks the pod unhealthy.
Acceptance Criteria
- A hung executor is killed within ~3 minutes of hanging
- Normal long-running sessions (1-5 min) do not trigger false kills
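Whether the ~3 minute target holds depends on the probe schedule, which is not specified above. The sketch below estimates the detection window under stated assumptions: a probe that runs every probe_period_s seconds and restarts the container after failure_threshold consecutive failures (the Kubernetes liveness probe model). The values 20s and 3 are illustrative, not taken from the proposal, and probe timeout and termination grace period are ignored.

```python
def detection_window_s(max_age_s: float, heartbeat_interval_s: float,
                       probe_period_s: float, failure_threshold: int) -> tuple[float, float]:
    """Return (earliest, latest) seconds between a hang and the kill decision."""
    # Earliest: the hang happens just before the next heartbeat write, so the
    # file is already heartbeat_interval_s old, and a probe fires the instant
    # the file goes stale.
    earliest = (max_age_s - heartbeat_interval_s) + (failure_threshold - 1) * probe_period_s
    # Latest: the hang happens right after a write, and the first failing
    # probe lands almost a full period after the file goes stale.
    latest = max_age_s + failure_threshold * probe_period_s
    return earliest, latest


# 120s threshold and 30s writes from the proposal; 20s period and
# 3 failures are assumed probe settings.
print(detection_window_s(120, 30, 20, 3))  # (130, 180)
```

With these assumed settings the worst case is 180s, which matches the ~3 minute criterion. A longer probe period or a higher failure threshold would exceed it.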