ZOMBIE APOCALYPSE
My jobs are not blocking and I have plenty of CPU / MEM headroom, yet my jobs seem to get killed with:
Job timed out after no ping from job since 2025-03-03 23:28:53.888358 UTC (ZOMBIE_JOB_TIMEOUT: 60, reason: "RestartLimit (3)
I'm running Deno scripts, and each worker has 1 CPU and 512 request / 768 limit mem.
hard to tell, the worker logs when they ping, so were those jobs not pinged?
let me check the worker logs one sec
It's really strange because the logs look like the pinging is happening but still getting zombied out.
Look:

that's not a ping from the worker, that's your server logs
last job is still running ping:
INFO 2025-03-04T17:23:17.613084808Z [resource.labels.containerName: windmill-worker] job 01956200-82a9-0326-f919-71ee1d38ad87 on wk-chromium-lzj4p-qglY8 in voja still running. mem: 1995592kB, peak mem: 1995592kB
Is that the log I'm looking for?
yes
well, that was still 4 seconds after the last ping time the error states
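A quick way to check a gap like this is to compare the "no ping since" time in the error against the timestamp of a worker log line. Below is a minimal TypeScript sketch, assuming the two timestamp formats that appear in this thread. The helper names (`toDate`, `parseErrorTimestamp`, `parseLogTimestamp`, `secondsBetween`) are hypothetical and not part of Windmill, and the log time in the example is made up for illustration.

```typescript
// Hypothetical helpers for comparing ping timestamps; not part of Windmill.

// Build a Date from date, time, and fractional-second parts.
// Date only holds milliseconds, so extra fractional digits are truncated.
function toDate(date: string, time: string, fraction: string | undefined): Date {
  const millis = (fraction ?? "").padEnd(3, "0").slice(0, 3);
  return new Date(`${date}T${time}.${millis}Z`);
}

// Parses "2025-03-03 23:28:53.888358 UTC" (the form in the error message).
function parseErrorTimestamp(s: string): Date {
  const m = s.match(/^(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2})(?:\.(\d+))? UTC$/);
  if (!m) throw new Error(`unrecognized error timestamp: ${s}`);
  return toDate(m[1], m[2], m[3]);
}

// Parses "2025-03-04T17:23:17.613084808Z" (the form in the worker log).
function parseLogTimestamp(s: string): Date {
  const m = s.match(/^(\d{4}-\d{2}-\d{2})T(\d{2}:\d{2}:\d{2})(?:\.(\d+))?Z$/);
  if (!m) throw new Error(`unrecognized log timestamp: ${s}`);
  return toDate(m[1], m[2], m[3]);
}

// Seconds elapsed from `earlier` to `later`.
function secondsBetween(earlier: Date, later: Date): number {
  return (later.getTime() - earlier.getTime()) / 1000;
}

// Value shown in the error message from this thread.
const ZOMBIE_JOB_TIMEOUT_SECONDS = 60;

const lastPing = parseErrorTimestamp("2025-03-03 23:28:53.888358 UTC");
// Made-up log time, for illustration only.
const laterLog = parseLogTimestamp("2025-03-03T23:28:57.900000000Z");
const gap = secondsBetween(lastPing, laterLog);
console.log(`gap: ${gap}s, exceeds timeout: ${gap > ZOMBIE_JOB_TIMEOUT_SECONDS}`);
```

A gap well under the timeout, followed by no further "still running" lines for that job, would point at the worker stopping rather than at a slow ping.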
But the worker is simply not pinging, so I'll re-check the resource constraints
does the worker die afterwards?
the job is restarted on a new worker, so I assume the worker dies
why do you have to assume, do you not have access to your workers exit time?
I don't see when the pod is restarted, or I'm not sure where to find out 😄
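On Kubernetes, each container's status records a `restartCount` and, under `lastState.terminated`, the `reason`, `exitCode`, and `finishedAt` of its previous run. Below is a minimal TypeScript sketch, assuming the workers run as Kubernetes pods and that the JSON comes from `kubectl get pod <pod-name> -o json`. `summarizeRestarts` is a hypothetical helper name, and the sample object is made up for illustration rather than taken from this cluster.

```typescript
// Subset of the Kubernetes Pod status fields relevant to restarts.
interface TerminatedState {
  reason?: string;
  exitCode: number;
  finishedAt?: string;
}

interface ContainerStatus {
  name: string;
  restartCount: number;
  lastState?: { terminated?: TerminatedState };
}

interface PodJson {
  status?: { containerStatuses?: ContainerStatus[] };
}

// Hypothetical helper: one summary line per container in the pod.
function summarizeRestarts(pod: PodJson): string[] {
  const statuses = pod.status?.containerStatuses ?? [];
  return statuses.map((c) => {
    const t = c.lastState?.terminated;
    if (!t) {
      return `${c.name}: restarts=${c.restartCount}, no previous termination recorded`;
    }
    return (
      `${c.name}: restarts=${c.restartCount}, reason=${t.reason ?? "unknown"}, ` +
      `exitCode=${t.exitCode}, finishedAt=${t.finishedAt ?? "unknown"}`
    );
  });
}

// Made-up sample, shaped like `kubectl get pod <pod-name> -o json` output.
const samplePod: PodJson = {
  status: {
    containerStatuses: [
      {
        name: "windmill-worker",
        restartCount: 3,
        lastState: {
          terminated: {
            reason: "OOMKilled",
            exitCode: 137,
            finishedAt: "2025-03-03T23:29:10Z",
          },
        },
      },
    ],
  },
};

console.log(summarizeRestarts(samplePod).join("\n"));
```

A `reason` of `OOMKilled` would mean the container was killed for exceeding its memory limit. If the pod itself was replaced rather than its container restarted, the new pod starts with a fresh status, so the old pod's events or the controller's history would be the place to look.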

I'm not hitting limits, slightly above requested here and there
anyway I'll keep looking, I'm sure it's some resource or skill issue 😄