MrOsdev•3mo ago

Scheduled script stops scheduling after wait for executor

Hi! We are experiencing some issues on one of our Windmill instances. It runs a few scheduled jobs and for a still unknown reason sometimes the scheduled jobs wait a very long time for an executor. Also sometimes it seems the run time is displayed as a negative number (which i'm guessing may be some sort of overflow due to crazy long wait times). The jobs do seem to eventually run just fine, even if the run time is broken like that. The problem is that after such a "broken" job runs, the schedule stops working and no further jobs are scheduled. We are trying to figure out what the cause behind the large executor wait time may be (as the instance is not under heavy load at all and the jobs don't take very long to execute). I guess the problem consists of two parts - The long executor wait time for whatever reason (despite the actual worker load being low) - And the second reason which i want to ask about here: Why does the schedule seem to stop working after a job waits a while for an executor? Is this a known issue or possibly even expected behavior? Sadly i have not managed to reproduce this issue on another Windmill instance..

24 Replies

rubenf•3mo ago

what version is that instance on? and what are the workers logs showing, anything suspicious?

MrOsdevOP•3mo ago

I can't tell you the exact version for sure right now but it was last updated march 20th around 15:00UTC so I'm quite sure it should be windmill-full:1.478.1 This is to be investigated, I'll check on that

MrOsdevOP•3mo ago

hey @rubenf , i have checked the version and it is indeed 1.478.1 I started another schedule on the same instance today again, hourly this time with the same issue occouring at the 4th run I investigated the logs a bit and it appears the script gets executed far too early? on an hourly schedule it got executed 4 minutes after the previous run.... But i see no errors in the logs around the time this happened Lets call the jobs: 14:03 -> "Job 4" 12:00 -> "Job 3" 11:00 -> "Job 2" 10:00 -> "Job 1" Job 1 from frontend requests:

"created_at": "2025-03-27T08:06:51.965920Z",
"started_at": "2025-03-27T09:00:00.007401Z",

"created_at": "2025-03-27T08:06:51.965920Z",
"started_at": "2025-03-27T09:00:00.007401Z",

actually fetched and ran on a worker at 2025-03-27T09:00:00.008204Z Job 2 from frontend requests:

"created_at": "2025-03-27T09:00:25.776197Z",
"started_at": "2025-03-27T10:00:00.005832Z",

"created_at": "2025-03-27T09:00:25.776197Z",
"started_at": "2025-03-27T10:00:00.005832Z",

actually fetched and ran on a worker at 2025-03-27T10:00:00.007376Z Job 3 from frontend requests:

"created_at": "2025-03-27T10:00:21.357364Z",
"started_at": "2025-03-27T11:00:00.003525Z",

"created_at": "2025-03-27T10:00:21.357364Z",
"started_at": "2025-03-27T11:00:00.003525Z",

actually fetched and ran on a worker at 2025-03-27T11:00:00.004517Z Job 4 from frontend requests:

"created_at": "2025-03-27T11:00:24.435919Z",
"started_at": "2025-03-27T13:03:37.678Z",
"self_wait_time_ms": 3817678

"created_at": "2025-03-27T11:00:24.435919Z",
"started_at": "2025-03-27T13:03:37.678Z",
"self_wait_time_ms": 3817678

actually fetched and ran on a worker at 2025-03-27T11:03:41.741176Z unusual parts of the worker logs:

2025-03-27T11:03:41.741121Z WARN [...] pull took more than 0.1s (0.113271933) this is a sign that the database is undersized for this load. empty: false, err: true
2025-03-27T11:03:41.741176Z INFO [...] fetched job [...]
2025-03-27T11:03:41.742277Z WARN [...] job ... waited for an executor for a significant amount of time. Recording value wait_time=3817678ms
2025-03-27T11:04:04.617075Z INFO [...] Schedule ... scheduling next job for ... in [...]
2025-03-27T11:04:04.620214Z INFO [...] Job for schedule ... at 2025-03-27 12:00:00 UTC already exists [...]

2025-03-27T11:03:41.741121Z WARN [...] pull took more than 0.1s (0.113271933) this is a sign that the database is undersized for this load. empty: false, err: true
2025-03-27T11:03:41.741176Z INFO [...] fetched job [...]
2025-03-27T11:03:41.742277Z WARN [...] job ... waited for an executor for a significant amount of time. Recording value wait_time=3817678ms
2025-03-27T11:04:04.617075Z INFO [...] Schedule ... scheduling next job for ... in [...]
2025-03-27T11:04:04.620214Z INFO [...] Job for schedule ... at 2025-03-27 12:00:00 UTC already exists [...]

MrOsdevOP•3mo ago

So it seems the job is actually executed far too early instead of too late?

rubenf•3mo ago

do you have the scheduled_for of those jobs? it's the most important part

MrOsdevOP•3mo ago

the column scheduled_for in v2_job_queue? The v2_job_queue table is empty

rubenf•3mo ago

it's in v2_job actually it's not ...

MrOsdevOP•3mo ago

Yeah it's not..

rubenf•3mo ago

however, you can find it in the logs of the previous tick grep the previous:

2025-03-27T11:04:04.617075Z INFO [...] Schedule ... scheduling next job for ... in [...]
2025-03-27T11:04:04.620214Z INFO [...] Job for schedule ... at 2025-03-27 12:00:00 UTC already exists [...]

2025-03-27T11:04:04.617075Z INFO [...] Schedule ... scheduling next job for ... in [...]
2025-03-27T11:04:04.620214Z INFO [...] Job for schedule ... at 2025-03-27 12:00:00 UTC already exists [...]

actually it won't be there

MrOsdevOP•3mo ago

Yeah it isn't there either..

rubenf•3mo ago

at what time was the previous: 2025-03-27T11:04:04.617075Z INFO [...] Schedule ... scheduling next job for ... in [...] the only explanation I have is that the clock of your db changed at 11am, it started to believe it was 1pm

MrOsdevOP•3mo ago

I searched for 27 11:00:00 and 27T11:00:00 in the whole database and the logs and could not seem to find anything relevant...

rubenf•3mo ago

Can you search for: scheduling next job for also are you on a single vps or do you have distributed workers?

MrOsdevOP•3mo ago

2025-03-27T11:00:24.435531Z This is a single VPS, the server, workers and the database run there

rubenf•3mo ago

so yeah given the above, I think your postgresql somehow changed time can you run select now() in your db ?

MrOsdevOP•3mo ago

I cannot run anything in the database right now sadly.. The next opinion for that would be monday

rubenf•3mo ago

but here is why I believe that is the case the started_at shows 13:XX the started_at corresponds to now() for when it was pulled (in windmill every time is relative to the database)

MrOsdevOP•3mo ago

I have just now checked the postgres logs, time seems to be progressing normally in the logs at least around the time it happened there as one would expect at least

rubenf•3mo ago

but that's not the postgres time that's the localtime at which the logs were emitted if at that point in time, the database believed it was 13:XX, then it would indeed pull the job, then schedule the next one (at which point it seems to have believed again it was 11:XX) and then realized the job already existed (itself)

MrOsdevOP•3mo ago

I see! This is very helpful insight already, I'll look into any weird DB behaviour then

rubenf•3mo ago

I have no other explanation for started_at suddenly being very shifted compared to your worker time as compared to the other jobs I can also tell you that scheduled_for was 12:00 because of the way waiting_time is computed

MrOsdevOP•3mo ago

What's definitely interesting is that the shift seems to be very close to two hours, hinting it may be some timezone issue somewhere Thank you so much for your help! I'll be looking into this and see if i can find the problem

rubenf•3mo ago

adding scheduled_for in the logs for next time

MrOsdevOP•3mo ago

Thank you!!! I'll report back when the issue is figured out^^

Scheduled script stops scheduling after wait for executor

Did you find this page helpful?