Heartbeat File Liveness Probe

Problem

The current liveness probe runs kill -0 1 via exec, which only checks that PID 1 is alive. It cannot distinguish a busy executor from a hung or deadlocked one: a stuck asyncio loop with a live PID passes the probe indefinitely.

Impact

Hung executor pods remain in the pool, each holding a consumer group slot but never completing work.

Proposed Solution

The executor writes a timestamp to /tmp/executor-heartbeat every 30s during active sessions. The probe passes only if the PID is alive AND the heartbeat is less than 120s old; a stale heartbeat marks the pod unhealthy.
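
A minimal sketch of both halves of the proposal, assuming the executor is a Python asyncio process. The function names (write_heartbeat, heartbeat_loop, pid_alive, is_healthy) are hypothetical, and so are two choices the proposal does not specify: the file holds a Unix timestamp as text, and a missing or unreadable file counts as stale. The proposal also does not say how the probe should treat an idle executor with no active session, so this sketch leaves that case open.

```python
import asyncio
import os
import time

HEARTBEAT_PATH = "/tmp/executor-heartbeat"  # path from the proposal
HEARTBEAT_INTERVAL_S = 30                   # write interval from the proposal
MAX_HEARTBEAT_AGE_S = 120                   # staleness threshold from the proposal


def write_heartbeat(path=HEARTBEAT_PATH, now=None):
    """Write the current Unix timestamp to the heartbeat file."""
    ts = time.time() if now is None else now
    tmp = f"{path}.tmp"
    with open(tmp, "w") as f:
        f.write(f"{ts:.0f}\n")
    os.replace(tmp, path)  # rename so the probe never reads a partial file


async def heartbeat_loop(path=HEARTBEAT_PATH, interval=HEARTBEAT_INTERVAL_S):
    """Run as a task on the executor's event loop; if the loop hangs, writes stop."""
    while True:
        write_heartbeat(path)
        await asyncio.sleep(interval)


def pid_alive(pid):
    """Same check as `kill -0 <pid>`: signal 0 tests existence only."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but is owned by another user
    return True


def is_healthy(pid=1, path=HEARTBEAT_PATH, max_age=MAX_HEARTBEAT_AGE_S, now=None):
    """Probe check: PID alive AND heartbeat younger than max_age seconds."""
    if not pid_alive(pid):
        return False
    try:
        with open(path) as f:
            last = float(f.read().strip())
    except (OSError, ValueError):
        return False  # assumption: missing or unreadable heartbeat is stale
    current = time.time() if now is None else now
    return (current - last) < max_age
```

The writer is scheduled on the event loop itself, for example with asyncio.create_task(heartbeat_loop()) when a session starts, rather than on a separate thread. A thread would keep writing while the loop is deadlocked, which is exactly the failure the probe needs to see. The probe command would exit non-zero when is_healthy() returns False.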

Acceptance Criteria

  • Hung executor killed within ~3 minutes
  • Normal long-running sessions (1-5min) don't trigger false kills
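
Both criteria can be sanity-checked with a little arithmetic. The sketch below assumes a Kubernetes-style probe with a period and a failure threshold; the values 30s and 2 are hypothetical and not part of the proposal, and the function names are illustrative. It ignores probe timeout and termination grace period, which would add to the total.

```python
import math


def worst_case_detection_s(max_age_s, probe_period_s, failure_threshold):
    """Upper bound on time from hang to probe failure.

    The hang starts right after a heartbeat write, so the file turns stale
    max_age_s later. The first probe to see it may come up to one period
    after that, and failure_threshold consecutive failures are needed.
    """
    return max_age_s + probe_period_s * failure_threshold


def tolerated_missed_writes(max_age_s, interval_s):
    """Largest k such that k consecutive missed writes keep age < max_age_s."""
    # After k missed writes the heartbeat age peaks at interval_s * (k + 1).
    return math.ceil(max_age_s / interval_s) - 2


# Proposal values: 120s threshold, 30s write interval.
# Hypothetical probe settings: 30s period, 2 consecutive failures.
print(worst_case_detection_s(120, 30, 2))
print(tolerated_missed_writes(120, 30))
```

Under these assumed probe settings the worst case is 180s, which matches the ~3 minute target. A healthy session refreshes the file every 30s regardless of session length, so a 1-5 min session stays well under the 120s threshold and can miss two consecutive writes without tripping the probe.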