Sindre
Sindre9mo ago

Force cancel them one by one, or delete them all by removing them from the jobs table.
Alper
Alper9mo ago
Cancel one by one works, but I have around 200 of those; that would take quite some time. I'll try the Postgres approach. Which table do I need to clean?
Sindre
Sindre9mo ago
You need to clean out the rows that are stalled from the job or jobs table. Not all rows, just the 200 you want.
Alper
Alper9mo ago
Strange, I don't have a job or jobs table.
Sindre
Sindre9mo ago
Sorry, I think it's the queue table. Could be it's the completed_job table. Not sure on the internals, but I did do this one time. It's the table where you find your jobs, at least 🙂 So search for an id that is stale and see where it is 🙂 Not sure if the internals have changed or if I'm just so tired that I'm making stuff up!
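A minimal sketch of that kind of cleanup, assuming direct psql access to the Windmill database and the queue table mentioned above; the connection string and the filter are illustrative, so inspect the rows before deleting anything:
# Hypothetical connection string; point it at the Windmill database.
DB_URL="postgres://windmill:password@localhost:5432/windmill"
# Look at the table first to confirm it holds the stalled jobs.
psql "$DB_URL" -c '\d queue'
psql "$DB_URL" -c "SELECT * FROM queue LIMIT 5;"
# Then delete only the specific stalled rows, e.g. by job id:
# psql "$DB_URL" -c "DELETE FROM queue WHERE id IN ('<job-uuid-1>', '<job-uuid-2>');"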
Alper
Alper9mo ago
Thanks, I removed the pending jobs from the queue table, but I still get the same issue:
not allowed to overlap, scheduling next iteration
@rubenf do you have any idea how to fix this?
rubenf
rubenf9mo ago
Are you on latest? We did a small fix on it a few commits ago.
Alper
Alper9mo ago
My version is a week old or so. Will update and let you know. It took some time because I wrote an Ansible script to properly turn off all workers on the different hosts, update them, restart the main Windmill instance, wait for it to be online, and then restart the workers. It now runs on a recent version and looks to work fine.
rubenf
rubenf9mo ago
👍 You don't need to wait for it to be online; the shared lock will take care of it for you. Scale down the workers, then you can start rolling out all the new versions, servers and workers. On EE we will start adding more observability.
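A rough sketch of that rollout order for a docker compose deployment; the service names follow the default Windmill compose file and may differ in your setup:
# Stop the workers first so nothing is picked up mid-upgrade.
docker compose stop windmill_worker
# Pull the new images and roll out the server; per the note above, the shared
# lock handles the migration ordering, so there is no need to wait for health.
docker compose pull
docker compose up -d windmill_server
# Bring the workers back on the new version.
docker compose up -d windmill_worker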
Alper
Alper9mo ago
Nice, good to know.
Alper
Alper9mo ago
Still got a problem though: schedules just stop being triggered after a couple (dozen) of minutes. They still show themselves as enabled, but don't seem to do anything anymore. See the screenshot, where the regularly busy schedule on the left gets gradually thinner towards the right. They all complete successfully in a reasonable time (up to 6 minutes).
(screenshot attached)
Alper
Alper9mo ago
This is about 6 different schedules running different flows, btw.
rubenf
rubenf9mo ago
@Alper that is really strange and would need a thorough investigation. Do you have info about the last runs of each of those schedules? One possible explanation: you have no more workers available?
Alper
Alper9mo ago
Every single one of them has normal output in their last run. I have enough workers:
Jobs waiting for a worker 0
rubenf
rubenf9mo ago
Each of them has had their last run complete? And each of them is set to no flow overlap?
Alper
Alper9mo ago
I have one schedule that is completely unaffected by this and still continues to work. It is the newest one, so I suspect that recreating the older schedules could fix it, but for the sake of troubleshooting I wanted to wait.
Each of them has had their last run complete?
Yes.
And each of them is set to no flow overlap?
Yes, but I also tried with the overlap toggle disabled; the same thing happens.
rubenf
rubenf9mo ago
This is extremely strange. On the runs page you do not see any next runs scheduled for any of the schedules?
Alper
Alper9mo ago
No, nothing. I mean, not nothing: I see some scheduled ones, but not the frequent schedules that stop working after a while.
rubenf
rubenf9mo ago
Do you have the logs of the workers? Can you check if you have any error logs around the time of the last run? It might be worth grepping for "ERROR", but also more specifically:
Error during handle_maybe_scheduled_job
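A sketch of that log search, assuming the workers run under docker compose ("windmill_worker" is the default service name and may differ in your setup):
# Grep the worker logs for general errors and the schedule-specific one.
docker compose logs windmill_worker | grep -E "ERROR|Error during handle_maybe_scheduled_job"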
Alper
Alper9mo ago
I found this:
2024-02-11T21:39:58.025668Z ERROR windmill::monitor: Error deleting token: error returned from database: deadlock detected
rubenf
rubenf9mo ago
No other errors?
Alper
Alper9mo ago
In another worker:
2024-02-11T18:58:57.861979Z ERROR windmill_worker::worker: failed to vacuum queue: error returned from database: could not resize shared memory segment "/PostgreSQL.411090232" to 67145088 bytes: No space left on device worker=wk-remote-abio-891678f7b982-ELOkZ
Which is strange, because that device still has 48% available space (around 36 GB).
rubenf
rubenf9mo ago
That device is the database? Because that error message would mean your DB is full, which would somewhat explain your issue.
Alper
Alper9mo ago
The device with the Windmill DB has enough free space atm; maybe there was a spike at some point filling it with lots of data. I guess I'll need to enable one schedule after the other and observe what happens, monitoring the disks and worker logs.
rubenf
rubenf9mo ago
(shared Stack Overflow link: "pq: could not resize shared memory segment. No space left on device")
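That error is usually about the container's /dev/shm rather than the disk: Docker gives containers a 64 MB /dev/shm by default, and 67145088 bytes is just over 64 MiB. A rough way to check and raise it, assuming the Postgres service is called db in the compose file:
# Check shared-memory usage inside the Postgres container.
docker compose exec db df -h /dev/shm
# If it shows the 64 MB default, add e.g. `shm_size: 1gb` to the db service
# in docker-compose.yml and recreate it:
docker compose up -d db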
rubenf
rubenf9mo ago
That still doesn't explain your issue, and it happened before your last schedule run. It should be impossible for this to happen without error logs. The next run is scheduled as soon as the flow starts; this is most surprising.
Alper
Alper9mo ago
If you want to, we can have a call at some point to check it together. I am not in a rush to fix it.
rubenf
rubenf9mo ago
If you can reproduce this with a single flow, and that flow is fully sharable, that would be very helpful. It shouldn't be flow dependent, but yes, if you can reproduce this, let's have a call.
Alper
Alper9mo ago
It happens every time I re-enable all schedules. Let me investigate a bit more thoroughly first and try to narrow down the issue. I'll come back once I have more insight.
rubenf
rubenf9mo ago
Thanks! On the very latest version there is a bit more logging around schedules; you can grep for:
tracing::info!("Schedule {schedule_path} scheduling next job for {script_path} in {w_id}",);
Alper
Alper9mo ago
Thanks, will check that too.
rubenf
rubenf9mo ago
Then you can look at the logs around it; there will be either an error or an audit log mentioning having pushed the next job. But yeah, schedules are designed to be invincible (if the database ACID guarantees hold true), which is why this is most surprising. @Alper were you able to find anything yet?
Alper
Alper9mo ago
Nope. I recreated all the schedules that were affected and they just stop working after some time (between 10 minutes and 3 hours).
Alper
Alper9mo ago
On the schedules page you can see that all are green, but then they just stop scheduling the next one.
(screenshot attached)
rubenf
rubenf9mo ago
Does that happen when you create just one? Do you have the logs showing anything relevant? As in, do you see anything next to:
scheduling next job for
Alper
Alper9mo ago
(screenshot attached)
Alper
Alper9mo ago
# docker compose logs | grep "scheduling next job for"
returns an empty result
rubenf
rubenf9mo ago
Are you on latest?
Alper
Alper9mo ago
I was yesterday.
rubenf
rubenf9mo ago
Hmm, that's pretty strange.
Alper
Alper9mo ago
CE v1.268.0-1-gd487a773f. I didn't try activating only one yet; just did that now, let's see what happens.
rubenf
rubenf9mo ago
I think you're just one version before the one with more tracing. Could you bump Windmill and try that?
Alper
Alper9mo ago
Sure, let's see if my Ansible playbook works 😄
rubenf
rubenf9mo ago
thanks
Alper
Alper9mo ago
I'm now on CE v1.268.0-7-g0e7de63c4
rubenf
rubenf9mo ago
🎉
Alper
Alper9mo ago
The plan is to activate one more schedule every 12 hours, so I should have all six running in 3 days.
rubenf
rubenf9mo ago
Just so you know, this is the most important issue that we still need to root cause, so it is top priority for us as soon as we have the debug logs. Feel free to send lots of info our way. It's not acceptable that a schedule would just stop working out of the blue, and we have customers with thousands of schedules in prod.
Alper
Alper8mo ago
Got it. I just activated schedule number 3; no interruptions so far. I can also give you access to my Windmill instances if I can't figure it out on my own. Activated schedule #4: no problems so far. Activated #5: still works. Finally started schedule number 6: looks good so far. @rubenf Sorry for not updating for a long time. Current state: I had to deactivate schedule number 6 for now because it was faulty. I didn't continue fixing it because I haven't had time for the project so far. In the meantime the 5 other schedules run fine. The schedules are set to run every minute and to not overlap. Will write again once I've fixed #6 and activated it as well.
rubenf
rubenf8mo ago
Thanks a lot!
Alper
Alper8mo ago
Fixed the script for schedule #6 and activated it. Will report tomorrow if everything still runs. I suspect that I won't be able to reproduce the problem.
rubenf
rubenf8mo ago
What was the script doing that was so bad it could cause this?
Alper
Alper8mo ago
Basically crawling pages and retrying if rate limited, with multiple workers on different hosts for cycling IPs. Sometimes scripts got stuck forever, no idea why. It happened again: with a few hours' difference, schedule after schedule stopped working, and only one remained working. At this point I think I have at least identified the schedule that breaks all of them, but I have no idea how to troubleshoot this. Not sure if you want to, but we could schedule some time to meet and take a look together.
rubenf
rubenf8mo ago
Sure, what version are you on?
Alper
Alper8mo ago
v1.268.0-7-g0e7de63c4
rubenf
rubenf8mo ago
I'm available now if you'd like. Update: this was caused by an overused DB plus the lack of a progress monitor, which is now available in the latest version.
Alper
Alper8mo ago
Can confirm it works now after increasing the max number of DB connections and upgrading to the latest version.
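For reference, a sketch of checking and raising the Postgres connection limit described above; the connection string and the value 200 are illustrative, and how max_connections is set depends on how the database is deployed:
# Hypothetical connection string; adjust to your database.
DB_URL="postgres://windmill:password@localhost:5432/windmill"
# Compare the configured limit with actual usage.
psql "$DB_URL" -c "SHOW max_connections;"
psql "$DB_URL" -c "SELECT count(*) AS open_connections FROM pg_stat_activity;"
# One way to raise the limit under docker compose is to run the db service with
# `postgres -c max_connections=200` (or edit postgresql.conf) and restart it.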