pixeleet•5d ago

ZOMBIE APOCALYPSE

My jobs are not blocking and I have plenty of CPU and memory headroom, yet my jobs seem to get killed with:
Job timed out after no ping from job since 2025-03-03 23:28:53.888358 UTC (ZOMBIE_JOB_TIMEOUT: 60, reason: "RestartLimit (3)
I'm running Deno scripts, and each worker has 1 CPU and a 512 request / 768 limit for memory.
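For readers following along, the error above suggests a job is treated as a zombie once no ping has been seen for longer than ZOMBIE_JOB_TIMEOUT (60, presumably seconds). A minimal sketch of that kind of check follows. It is not Windmill's actual code: `isZombie` and its parameters are hypothetical names, and the only values taken from the thread are the timestamp and the timeout of 60.

```typescript
// Sketch of a "no ping since X" timeout check.
// NOT Windmill's implementation; isZombie is a hypothetical helper.
function isZombie(lastPing: Date, now: Date, timeoutSeconds: number): boolean {
  const silentMs = now.getTime() - lastPing.getTime();
  // A job counts as a zombie only after it has been silent for longer than the timeout.
  return silentMs > timeoutSeconds * 1000;
}

// Last ping taken from the error message (microseconds truncated to milliseconds).
const lastPing = new Date("2025-03-03T23:28:53.888Z");
const at30s = new Date(lastPing.getTime() + 30_000);
const at61s = new Date(lastPing.getTime() + 61_000);

console.log(isZombie(lastPing, at30s, 60)); // false
console.log(isZombie(lastPing, at61s, 60)); // true
```

Under this reading, the thing to look for in the worker logs is a gap of more than 60 seconds between two pings for the same job.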
13 Replies
rubenf•5d ago
Hard to tell. The workers log when they ping, so were those jobs not pinged?
pixeleetOP•5d ago
Let me check the worker logs, one sec.
pixeleetOP•5d ago
It's really strange, because the logs look like the pinging is happening, but the jobs are still getting zombied out. Look:
[image attachment, no description]
rubenf•5d ago
That's not a ping from the worker, those are your server logs.
pixeleetOP•5d ago
The last job's "still running" ping:
INFO 2025-03-04T17:23:17.613084808Z [resource.labels.containerName: windmill-worker] job 01956200-82a9-0326-f919-71ee1d38ad87 on wk-chromium-lzj4p-qglY8 in voja still running. mem: 1995592kB, peak mem: 1995592kB
Is that the log I'm looking for?
rubenf•5d ago
yes
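Since that log line also reports memory, it can be worth pulling the numbers out and converting them. The sketch below parses the line quoted above; `parseStillRunning` and `kbToMib` are hypothetical helpers, and the regular expression is fitted to this one line only. It assumes `kB` means 1024 bytes. Whether the figure is directly comparable to the container memory limit, and whether the 512 / 768 values are in MiB, is not stated in the thread, so treat any comparison as a hint rather than a diagnosis.

```typescript
// Parse the memory figures from a worker "still running" log line.
// The field layout is inferred from the single line quoted in this thread.
interface MemSample {
  jobId: string;
  worker: string;
  memKb: number;
  peakKb: number;
}

function parseStillRunning(line: string): MemSample | null {
  const m = line.match(
    /job (\S+) on (\S+) in \S+ still running\.\s+mem: (\d+)kB, peak mem: (\d+)kB/,
  );
  if (!m) return null;
  return { jobId: m[1], worker: m[2], memKb: Number(m[3]), peakKb: Number(m[4]) };
}

// Assumption: kB here means 1024 bytes, so dividing by 1024 gives MiB.
const kbToMib = (kb: number): number => kb / 1024;

const line =
  "job 01956200-82a9-0326-f919-71ee1d38ad87 on wk-chromium-lzj4p-qglY8 in voja " +
  "still running. mem: 1995592kB, peak mem: 1995592kB";

const sample = parseStillRunning(line);
if (sample) {
  console.log(kbToMib(sample.peakKb).toFixed(1)); // 1948.8
}
```

If the limit really is 768 MiB and this figure measures the same thing the limit does, roughly 1949 MiB would be well above it, which would be worth ruling out before looking elsewhere.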
pixeleetOP•5d ago
Well, that was still 4 seconds after the last ping the error states. But the worker is simply not pinging, so I'll re-check the resource constraints.
rubenf•5d ago
Does the worker die afterwards?
pixeleetOP•5d ago
The job is restarted on a new worker, so I assume the worker dies.
rubenf•5d ago
Why do you have to assume? Do you not have access to your workers' exit times?
pixeleetOP•5d ago
I don't see when the pod is restarted, or I'm not sure where to find out 😄
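On Kubernetes, one place to look is the pod's container status: `kubectl get pod <pod-name> -o json` includes `restartCount` and, if a container was restarted, `lastState.terminated` with `reason`, `exitCode` and `finishedAt`. Below is a rough sketch of summarizing those fields. `lastExits` is a hypothetical helper, the interfaces cover only the fields used here, and the sample pod object is made up for illustration, not taken from this thread.

```typescript
// Minimal subset of the Kubernetes Pod status fields used in this sketch.
interface Terminated {
  exitCode: number;
  reason?: string;
  finishedAt?: string;
}
interface ContainerStatus {
  name: string;
  restartCount: number;
  lastState?: { terminated?: Terminated };
}
interface Pod {
  status?: { containerStatuses?: ContainerStatus[] };
}

// Summarize, per container, how often it restarted and why it last exited.
function lastExits(pod: Pod): string[] {
  const statuses = pod.status?.containerStatuses ?? [];
  return statuses.map((c) => {
    const t = c.lastState?.terminated;
    return t
      ? `${c.name}: restarts=${c.restartCount}, last exit ${t.reason ?? "unknown"} ` +
        `(code ${t.exitCode}) at ${t.finishedAt ?? "unknown"}`
      : `${c.name}: restarts=${c.restartCount}, no previous termination recorded`;
  });
}

// Hypothetical sample data for illustration only.
const pod: Pod = {
  status: {
    containerStatuses: [
      {
        name: "windmill-worker",
        restartCount: 2,
        lastState: {
          terminated: { exitCode: 137, reason: "OOMKilled", finishedAt: "2025-03-04T17:30:00Z" },
        },
      },
    ],
  },
};

console.log(lastExits(pod).join("\n"));
```

A `reason` of `OOMKilled` would point at the memory limit, while a `restartCount` of 0 with no recorded termination would suggest the container itself did not restart and the cause lies elsewhere.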
pixeleetOP•5d ago
[image attachment, no description]
pixeleetOP•5d ago
I'm not hitting limits, just slightly above requested here and there. Anyway, I'll keep looking. I'm sure it's some resource or skill issue 😄
