MrOsdev
MrOsdev5d ago

Scheduled script stops scheduling after wait for executor

Hi! We are experiencing some issues on one of our Windmill instances. It runs a few scheduled jobs and for a still unknown reason sometimes the scheduled jobs wait a very long time for an executor. Also sometimes it seems the run time is displayed as a negative number (which i'm guessing may be some sort of overflow due to crazy long wait times). The jobs do seem to eventually run just fine, even if the run time is broken like that. The problem is that after such a "broken" job runs, the schedule stops working and no further jobs are scheduled. We are trying to figure out what the cause behind the large executor wait time may be (as the instance is not under heavy load at all and the jobs don't take very long to execute). I guess the problem consists of two parts - The long executor wait time for whatever reason (despite the actual worker load being low) - And the second reason which i want to ask about here: Why does the schedule seem to stop working after a job waits a while for an executor? Is this a known issue or possibly even expected behavior? Sadly i have not managed to reproduce this issue on another Windmill instance..
No description
No description
24 Replies
rubenf
rubenf5d ago
what version is that instance on? and what are the workers logs showing, anything suspicious?
MrOsdev
MrOsdevOP5d ago
I can't tell you the exact version for sure right now but it was last updated march 20th around 15:00UTC so I'm quite sure it should be windmill-full:1.478.1 This is to be investigated, I'll check on that
MrOsdev
MrOsdevOP4d ago
hey @rubenf , i have checked the version and it is indeed 1.478.1 I started another schedule on the same instance today again, hourly this time with the same issue occouring at the 4th run I investigated the logs a bit and it appears the script gets executed far too early? on an hourly schedule it got executed 4 minutes after the previous run.... But i see no errors in the logs around the time this happened Lets call the jobs: 14:03 -> "Job 4" 12:00 -> "Job 3" 11:00 -> "Job 2" 10:00 -> "Job 1" Job 1 from frontend requests:
"created_at": "2025-03-27T08:06:51.965920Z",
"started_at": "2025-03-27T09:00:00.007401Z",
"created_at": "2025-03-27T08:06:51.965920Z",
"started_at": "2025-03-27T09:00:00.007401Z",
actually fetched and ran on a worker at 2025-03-27T09:00:00.008204Z Job 2 from frontend requests:
"created_at": "2025-03-27T09:00:25.776197Z",
"started_at": "2025-03-27T10:00:00.005832Z",
"created_at": "2025-03-27T09:00:25.776197Z",
"started_at": "2025-03-27T10:00:00.005832Z",
actually fetched and ran on a worker at 2025-03-27T10:00:00.007376Z Job 3 from frontend requests:
"created_at": "2025-03-27T10:00:21.357364Z",
"started_at": "2025-03-27T11:00:00.003525Z",
"created_at": "2025-03-27T10:00:21.357364Z",
"started_at": "2025-03-27T11:00:00.003525Z",
actually fetched and ran on a worker at 2025-03-27T11:00:00.004517Z Job 4 from frontend requests:
"created_at": "2025-03-27T11:00:24.435919Z",
"started_at": "2025-03-27T13:03:37.678Z",
"self_wait_time_ms": 3817678
"created_at": "2025-03-27T11:00:24.435919Z",
"started_at": "2025-03-27T13:03:37.678Z",
"self_wait_time_ms": 3817678
actually fetched and ran on a worker at 2025-03-27T11:03:41.741176Z unusual parts of the worker logs:
2025-03-27T11:03:41.741121Z WARN [...] pull took more than 0.1s (0.113271933) this is a sign that the database is undersized for this load. empty: false, err: true
2025-03-27T11:03:41.741176Z INFO [...] fetched job [...]
2025-03-27T11:03:41.742277Z WARN [...] job ... waited for an executor for a significant amount of time. Recording value wait_time=3817678ms
2025-03-27T11:04:04.617075Z INFO [...] Schedule ... scheduling next job for ... in [...]
2025-03-27T11:04:04.620214Z INFO [...] Job for schedule ... at 2025-03-27 12:00:00 UTC already exists [...]
2025-03-27T11:03:41.741121Z WARN [...] pull took more than 0.1s (0.113271933) this is a sign that the database is undersized for this load. empty: false, err: true
2025-03-27T11:03:41.741176Z INFO [...] fetched job [...]
2025-03-27T11:03:41.742277Z WARN [...] job ... waited for an executor for a significant amount of time. Recording value wait_time=3817678ms
2025-03-27T11:04:04.617075Z INFO [...] Schedule ... scheduling next job for ... in [...]
2025-03-27T11:04:04.620214Z INFO [...] Job for schedule ... at 2025-03-27 12:00:00 UTC already exists [...]
No description
MrOsdev
MrOsdevOP4d ago
So it seems the job is actually executed far too early instead of too late?
rubenf
rubenf4d ago
do you have the scheduled_for of those jobs? it's the most important part
MrOsdev
MrOsdevOP4d ago
the column scheduled_for in v2_job_queue? The v2_job_queue table is empty
rubenf
rubenf4d ago
it's in v2_job actually it's not ...
MrOsdev
MrOsdevOP4d ago
Yeah it's not..
rubenf
rubenf4d ago
however, you can find it in the logs of the previous tick grep the previous:
2025-03-27T11:04:04.617075Z INFO [...] Schedule ... scheduling next job for ... in [...]
2025-03-27T11:04:04.620214Z INFO [...] Job for schedule ... at 2025-03-27 12:00:00 UTC already exists [...]
2025-03-27T11:04:04.617075Z INFO [...] Schedule ... scheduling next job for ... in [...]
2025-03-27T11:04:04.620214Z INFO [...] Job for schedule ... at 2025-03-27 12:00:00 UTC already exists [...]
actually it won't be there
MrOsdev
MrOsdevOP4d ago
Yeah it isn't there either..
rubenf
rubenf4d ago
at what time was the previous: 2025-03-27T11:04:04.617075Z INFO [...] Schedule ... scheduling next job for ... in [...] the only explanation I have is that the clock of your db changed at 11am, it started to believe it was 1pm
MrOsdev
MrOsdevOP4d ago
I searched for 27 11:00:00 and 27T11:00:00 in the whole database and the logs and could not seem to find anything relevant...
rubenf
rubenf4d ago
Can you search for: scheduling next job for also are you on a single vps or do you have distributed workers?
MrOsdev
MrOsdevOP4d ago
2025-03-27T11:00:24.435531Z This is a single VPS, the server, workers and the database run there
rubenf
rubenf4d ago
so yeah given the above, I think your postgresql somehow changed time can you run select now() in your db ?
MrOsdev
MrOsdevOP4d ago
I cannot run anything in the database right now sadly.. The next opinion for that would be monday
rubenf
rubenf4d ago
but here is why I believe that is the case the started_at shows 13:XX the started_at corresponds to now() for when it was pulled (in windmill every time is relative to the database)
MrOsdev
MrOsdevOP4d ago
I have just now checked the postgres logs, time seems to be progressing normally in the logs at least around the time it happened there as one would expect at least
rubenf
rubenf4d ago
but that's not the postgres time that's the localtime at which the logs were emitted if at that point in time, the database believed it was 13:XX, then it would indeed pull the job, then schedule the next one (at which point it seems to have believed again it was 11:XX) and then realized the job already existed (itself)
MrOsdev
MrOsdevOP4d ago
I see! This is very helpful insight already, I'll look into any weird DB behaviour then
rubenf
rubenf4d ago
I have no other explanation for started_at suddenly being very shifted compared to your worker time as compared to the other jobs I can also tell you that scheduled_for was 12:00 because of the way waiting_time is computed
MrOsdev
MrOsdevOP4d ago
What's definitely interesting is that the shift seems to be very close to two hours, hinting it may be some timezone issue somewhere Thank you so much for your help! I'll be looking into this and see if i can find the problem
rubenf
rubenf4d ago
adding scheduled_for in the logs for next time
MrOsdev
MrOsdevOP4d ago
Thank you!!! I'll report back when the issue is figured out^^

Did you find this page helpful?