Force cancel them one by one, or delete them all directly from the jobs table
cancel one by one works, but i have around 200 of those, that would take quite some time
i'll try the postgres approach
which table do i need to clean?
you need to clean out the stalled rows from the job or jobs table
not all rows, just the 200 you want.
strange, i don't have a job or jobs table
sorry, I think it's the queue table. Could be it's the completed_job table. Not sure on the internals but I did do this one time.
It's the table where you find your jobs at least 🙂 so search for an id that is stale and see where it is 🙂
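e.g. something like this (a sketch only; table name per the guesses above, the id and column names are assumptions, so check the schema first):
```bash
# inspect the schema first - table and column names may differ between windmill versions
psql "$DATABASE_URL" -c '\d queue'

# find the stale row (the id here is made up)
psql "$DATABASE_URL" -c "SELECT id, running, scheduled_for FROM queue WHERE id = 'abc123';"

# then delete only the stalled rows, not everything
psql "$DATABASE_URL" -c "DELETE FROM queue WHERE id = 'abc123';"
```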
Not sure if the internals have changed or if I'm just so tired that I'm making stuff up!
Thanks, i removed the pending jobs from the queue table, but i still get the same issue:
"not allowed to overlap, scheduling next iteration"
@rubenf do you have any idea how to fix this?
Are you on latest ?
We did a small fix a few commits ago on it
My version is a week old or so. Will update and let you know
took some time because i wrote an ansible script to properly turn off all workers on the different hosts, update them, restart the main windmill instance, wait for it to be online, and then restart the workers
runs now on recent version and looks to work fine
👍
You don't need to wait for it to be online
The shared lock will take care of it for you
Scale down the workers, then you can start rolling out all the new versions, servers and workers
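roughly, the whole rollout can be as simple as this (a plain-shell sketch; hostnames, compose paths and service names are all hypothetical):
```bash
#!/usr/bin/env bash
set -euo pipefail

WORKER_HOSTS=(worker1 worker2 worker3)                         # hypothetical hostnames
COMPOSE='docker compose -f /opt/windmill/docker-compose.yml'   # hypothetical path

# 1. scale the workers down so no jobs are mid-flight during the upgrade
for h in "${WORKER_HOSTS[@]}"; do
  ssh "$h" "$COMPOSE stop worker"
done

# 2. roll out the new server version; no need to wait for it to be online,
#    the shared lock takes care of the migration
$COMPOSE pull server && $COMPOSE up -d server

# 3. roll out the new worker version everywhere
for h in "${WORKER_HOSTS[@]}"; do
  ssh "$h" "$COMPOSE pull worker && $COMPOSE up -d worker"
done
```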
On ee we will start adding more observability
nice, good to know
still got a problem though: schedules just stop being triggered after a couple dozen minutes. they still show as enabled, but don't seem to do anything anymore.
see the screenshot: the regularly busy schedule on the left gets gradually thinner towards the right. they all complete successfully in a reasonable time (up to 6 minutes)
this is about 6 different schedules running different flows btw
@Alper that is really strange and would need a thorough investigation
Do you have info about the last runs of each of those schedules ?
one possible explanation, you have no more workers available ?
every single one of them has normal output in their last run
i have enough workers
"Jobs waiting for a worker" is 0
Each of them has had their last run complete ?
And each of them is set to no flow overlap ?
i have one schedule that is completely unaffected by this and still continues to work. it's the newest one, so i suspect that recreating the older schedules could fix it. but for the sake of troubleshooting i wanted to wait
> Each of them has had their last run complete ?
yes
> And each of them is set to no flow overlap ?
yes. but i also tried with the overlap toggle disabled, same thing happens
This is extremely strange, on the runs page you do not see any next runs scheduled for any of the schedules ?
no, nothing
i mean, not nothing. i see some scheduled ones, but not the frequent schedules that stop working after a while
Do you have the logs of the workers ?
Can you check if you have any error logs around the time of the last run ?
might be worth trying grepping for "ERROR" but also more specifically:
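for illustration, on a dockerised worker that could look like this (container name and time window are hypothetical):
```bash
# container name is hypothetical - adjust to your setup
docker logs windmill_worker_1 2>&1 | grep -i 'error'

# narrow it to the window around the last successful run
docker logs --since '2024-01-01T12:00:00' windmill_worker_1 2>&1 | grep -iE 'error|schedule'
```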
i found this:
no other errors ?
in another worker:
which is strange because that device still has 48% of its space available (around 36 GB)
that device is the database ?
because that error message would be that your db is full
which would somewhat explain your issue
the device with the windmill db has enough free space atm - maybe there was a spike at some point filling it with lots of data
i guess i'll need to enable one schedule after the other and observe what happens
monitor disks and worker logs
Looks like it could be an issue with the shared memory: https://stackoverflow.com/questions/56751565/pq-could-not-resize-shared-memory-segment-no-space-left-on-device
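if it is that, the usual fix for a dockerised postgres seems to be raising the container's shared memory size (a sketch; the 1g value is just illustrative):
```bash
# docker run variant
docker run --shm-size=1g -d postgres:16

# or the docker-compose equivalent:
#   services:
#     db:
#       image: postgres:16
#       shm_size: 1g
```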
that still doesn't explain your issue
and it happened before your last schedule run
It should be impossible for this to happen without error logs. The next run is scheduled as soon as the flow starts, so this is most surprising.
if you want to, we can have a call at some point to check it together. i am not in a rush to fix it fast
If you can reproduce this with a single flow, and that flow is fully sharable, that would be very helpful
but it shouldn't be flow dependent
but yes if you can reproduce this, let's have a call
it happens every time I re-enable all schedules. let me investigate a bit more thoroughly first and try to narrow down the issue. i'll come back once i got more insight
thanks!
on the very latest, there is a bit more logging around schedules
you can grep for:
thanks, will check that too
then you can look at the logs around it, there will be either an error or an audit log mentioning having pushed the next job
but yeah schedules are designed to be invincible (if the database ACID guarantees hold true) which is why this is most surprising
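to illustrate the idea (not windmill's actual code, just the general pattern, with made-up table and column names): starting the run and pushing the next iteration sit in one transaction, so under ACID you can't get one without the other:
```bash
# illustrative only - demo_queue and its columns are made up, not windmill's schema
psql "$DATABASE_URL" <<'SQL'
BEGIN;
-- start the current run
UPDATE demo_queue SET running = true
WHERE id = 'job-123' AND running = false;
-- and push the next iteration in the same transaction,
-- so either both happen or neither does
INSERT INTO demo_queue (id, schedule_path, scheduled_for, running)
VALUES ('job-124', 'f/examples/every_minute', now() + interval '1 minute', false);
COMMIT;
SQL
```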
@Alper were you able to find anything yet?
nope. i recreated all schedules that were affected and they just stop working after some time (between 10 minutes to 3 hours)
on the schedules page you can see that all are green, but then they just stop scheduling the next one
does that happen when you create just one ? Do you have the logs showing anything relevant?
as in, do you see anything next to:
returns an empty result
are you on latest?
i was yesterday
hmm that's pretty strange
CE v1.268.0-1-gd487a773f
i didn't try activating only one yet. just did that now, let's see what happens
I think you're just one version before the one with more tracing
could you bump windmill and try that?
sure, let's see if my ansible playbook works 😄
thanks
i'm now on CE v1.268.0-7-g0e7de63c4
🎉
plan is to activate one more schedule every 12 hours
so that i should have all six running in 3 days
just so you know, this is the most important issue that we still need to root cause, so this is top priority for us as soon as we have the debug logs
so feel free to send lots of info our way
it's not acceptable that schedules would just stop working out of the blue, and we have customers with thousands of schedules in prod
got it
i just activated schedule number 3
no interruptions so far
i can also give you access to my windmill instances if i can't figure it out on my own
activated schedule #4 - no problems so far
activated #5 - still works
finally started schedule number 6 - looks good so far
@rubenf Sorry for not updating for a long time. Current state: I had to deactivate schedule number 6 for now because it was faulty. I didn't continue fixing it because I haven't had time for the project so far. In the meantime the 5 other schedules run fine. The schedules are set to run every minute and to not overlap.
Will write again once I fixed #6 and activated that as well.
Thanks a lot!
Fixed the script for schedule #6 and activated it. Will report tomorrow if everything still runs. I suspect that I won't be able to reproduce the problem.
What was the script doing that was so bad it could cause this ?
basically crawling pages and retrying if rate limited
multiple workers on different hosts for cycling IPs
sometimes scripts got stuck forever
no idea why
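in hindsight, hard per-attempt timeouts plus a bounded retry would have kept it from hanging forever, something like this (a sketch; the url is made up):
```bash
#!/usr/bin/env bash
set -euo pipefail

URL='https://example.com/page'   # hypothetical target

# --max-time caps each attempt so a stuck connection can't hang the job,
# and the loop caps total attempts so rate limiting can't retry forever
for attempt in 1 2 3 4 5; do
  if curl --fail --silent --max-time 30 "$URL"; then
    exit 0
  fi
  sleep $((attempt * 10))   # simple linear backoff between attempts
done
echo "giving up after 5 attempts" >&2
exit 1
```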
it happened again. a few hours apart, schedule after schedule stopped working. only one remained working at this point
at least i think i identified the schedule that breaks all of them
but i have no idea how to troubleshoot this
not sure if you want to but we could schedule some time to meet and take a look together
Sure
what version are you on ?
v1.268.0-7-g0e7de63c4
I'm available now if you'd like
update: this was caused by an overloaded DB plus the lack of a progress monitor, which is now available in the latest version
can confirm it works now after increasing the maximum number of db connections and upgrading to the latest version
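for anyone landing here later: raising max_connections looks roughly like this (the 200 is illustrative, and postgres needs a restart for it to take effect):
```bash
# requires superuser; only takes effect after a postgres restart
psql "$DATABASE_URL" -c "ALTER SYSTEM SET max_connections = 200;"

# then restart postgres, e.g. for a dockerised instance:
docker restart windmill_db   # container name hypothetical
```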