Force cancel them one by one, or delete them all directly from the jobs table
cancel one by one works, but i have around 200 of those, that would take quite some time
i'll try the postgres approach
which table do i need to clean?
you need to clean out the stalled rows from the job or jobs table
not all rows, just the 200 you want.
strange, i don't have a job or jobs table
sorry, I think it's the queue table. Could be it's the completed_job table. Not sure on the internals but I did do this one time.
It's the table where you find your jobs at least 🙂 so search for an id that is stale and see where it is 🙂
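e.g. something like this (a sketch only; table name per the guesses above, the id and column names are assumptions, so check the schema first):
```bash
# inspect the schema first - table and column names may differ between windmill versions
psql "$DATABASE_URL" -c '\d queue'

# find the stale row (the id here is made up)
psql "$DATABASE_URL" -c "SELECT id, running, scheduled_for FROM queue WHERE id = 'abc123';"

# then delete only the stalled rows, not everything
psql "$DATABASE_URL" -c "DELETE FROM queue WHERE id = 'abc123';"
```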
Not sure if the internals have changed or if I'm just so tired that I'm making stuff up!
Thanks, i removed the pending jobs from the queue table, but i still get the same issue:
"not allowed to overlap, scheduling next iteration"
@rubenf do you have any idea how to fix this?
Are you on latest ?
We did a small fix a few commits ago on it
My version is a week old or so. Will update and let you know
took some time because i wrote an ansible script to properly turn off all workers on the different hosts, update them, restart the main windmill instance, wait for it to be online, and then restart the workers
runs now on recent version and looks to work fine
👍
You don't need to wait for it to be online
The shared lock will take care of it for you
Scale down the workers, then you can start rolling out all the new versions, servers and workers
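roughly, the whole rollout can be as simple as this (a plain-shell sketch; hostnames, compose paths and service names are all hypothetical):
```bash
#!/usr/bin/env bash
set -euo pipefail

WORKER_HOSTS=(worker1 worker2 worker3)                         # hypothetical hostnames
COMPOSE='docker compose -f /opt/windmill/docker-compose.yml'   # hypothetical path

# 1. scale the workers down so no jobs are mid-flight during the upgrade
for h in "${WORKER_HOSTS[@]}"; do
  ssh "$h" "$COMPOSE stop worker"
done

# 2. roll out the new server version; no need to wait for it to be online,
#    the shared lock takes care of the migration
$COMPOSE pull server && $COMPOSE up -d server

# 3. roll out the new worker version everywhere
for h in "${WORKER_HOSTS[@]}"; do
  ssh "$h" "$COMPOSE pull worker && $COMPOSE up -d worker"
done
```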
On ee we will start adding more observability
nice, good to know
still got a problem though: schedules just stop being triggered after a couple dozen minutes. they still show as enabled, but don't seem to do anything anymore.
see the screenshot: the regularly busy schedule on the left gets gradually thinner towards the right. they all complete successfully in a reasonable time (up to 6 minutes)
this is about 6 different schedules running different flows btw
@Alper that is really strange and would need a thorough investigation
Do you have info about the last runs of each of those schedules ?
one possible explanation, you have no more workers available ?
every single one of them has normal output in their last run
i have enough workers
"Jobs waiting for a worker" is 0
Each of them has had their last run complete ?
And each of them is set to no flow overlap ?
i have one schedule that is completely unaffected by this and still continues to work. it's the newest one, so i suspect that recreating the older schedules could fix it. but for the sake of troubleshooting i wanted to wait
> Each of them has had their last run complete ?
yes
> And each of them is set to no flow overlap ?
yes. but i also tried with the overlap toggle disabled, same thing happens
This is extremely strange, on the runs page you do not see any next runs scheduled for any of the schedules ?
no, nothing
i mean, not nothing. i see some scheduled ones, but not the frequent schedules that stop working after a while
Do you have the logs of the workers ?
Can you check if you have any error logs around the time of the last run ?
might be worth trying grepping for "ERROR" but also more specifically:
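for illustration, on a dockerised worker that could look like this (container name and time window are hypothetical):
```bash
# container name is hypothetical - adjust to your setup
docker logs windmill_worker_1 2>&1 | grep -i 'error'

# narrow it to the window around the last successful run
docker logs --since '2024-01-01T12:00:00' windmill_worker_1 2>&1 | grep -iE 'error|schedule'
```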
i found this:
no other errors ?
in another worker:
which is strange because that device still has 48% of its space available (around 36 GB)
that device is the database ?
because that error message would be that your db is full
which would somewhat explain your issue
the device with the windmill db has enough free space atm - maybe there was a spike at some point filling it with lots of data
i guess i'll need to enable one schedule after the other and observe what happens
monitor disks and worker logs
Looks like it could be an issue with the shared memory: https://stackoverflow.com/questions/56751565/pq-could-not-resize-shared-memory-segment-no-space-left-on-device
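if it is that, the usual fix for a dockerised postgres seems to be raising the container's shared memory size (a sketch; the 1g value is just illustrative):
```bash
# docker run variant
docker run --shm-size=1g -d postgres:16

# or the docker-compose equivalent:
#   services:
#     db:
#       image: postgres:16
#       shm_size: 1g
```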
that still doesn't explain your issue
and it happened before your last schedule run
It should be impossible for this to happen without error logs. The next run is scheduled as soon as the flow starts, so this is most surprising.
if you want to, we can have a call at some point to check it together. i am not in a rush to fix it fast
If you can reproduce this with a single flow, and that flow is fully sharable, that would be very helpful
but it shouldn't be flow dependent
but yes if you can reproduce this, let's have a call
it happens every time I re-enable all schedules. let me investigate a bit more thoroughly first and try to narrow down the issue. i'll come back once i got more insight
thanks!
on the very latest, there is a bit more logging around schedules
you can grep for:
thanks, will check that too
then you can look at the logs around it, there will be either an error or an audit log mentioning having pushed the next job
but yeah schedules are designed to be invincible (if the database ACID guarantees hold true) which is why this is most surprising
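to illustrate the idea (not windmill's actual code, just the general pattern, with made-up table and column names): starting the run and pushing the next iteration sit in one transaction, so under ACID you can't get one without the other:
```bash
# illustrative only - demo_queue and its columns are made up, not windmill's schema
psql "$DATABASE_URL" <<'SQL'
BEGIN;
-- start the current run
UPDATE demo_queue SET running = true
WHERE id = 'job-123' AND running = false;
-- and push the next iteration in the same transaction,
-- so either both happen or neither does
INSERT INTO demo_queue (id, schedule_path, scheduled_for, running)
VALUES ('job-124', 'f/examples/every_minute', now() + interval '1 minute', false);
COMMIT;
SQL
```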
@Alper were you able to find anything yet?
nope. i recreated all schedules that were affected and they just stop working after some time (between 10 minutes to 3 hours)
on the schedules page you can see that all are green, but then they just stop scheduling the next one
does that happen when you create just one ? Do you have the logs showing anything relevant?
as in, do you see anything next to:
returns an empty result
are you on latest?
i was yesterday
hmm that's pretty strange
CE v1.268.0-1-gd487a773f
i didn't try activating only one yet. just did that now, let's see what happens
I think you're just one version before the one with more tracing
could you bump windmill and try that?
sure, let's see if my ansible playbook works 😄
thanks
i'm now on CE v1.268.0-7-g0e7de63c4
🎉
plan is to activate one more schedule every 12 hours
so that i should have all six running in 3 days
just so you know, this is the most important issue that we still need to root cause, so this is top priority for us as soon as we have the debug logs
so feel free to send lots of info our way
it's not acceptable that schedules would just stop working out of the blue, and we have customers with thousands of schedules in prod
got it
i just activated schedule number 3
no interruptions so far
i can also give you access to my windmill instances if i can't figure it out on my own
activated schedule #4 - no problems so far
activated #5 - still works
finally started schedule number 6 - looks good so far
@rubenf Sorry for not updating for a long time. Current state: I had to deactivate schedule number 6 for now because it was faulty. I didn't continue fixing it because I haven't had time for the project so far. In the meantime the 5 other schedules run fine. The schedules are set to run every minute and to not overlap.
Will write again once I fixed #6 and activated that as well.
Thanks a lot!
Fixed the script for schedule #6 and activated it. Will report tomorrow if everything still runs. I suspect that I won't be able to reproduce the problem.
What was the script doing that was so bad it could cause this ?
basically crawling pages and retrying if rate limited
multiple workers on different hosts for cycling IPs
sometimes scripts got stuck forever
no idea why
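in hindsight, hard per-attempt timeouts plus a bounded retry would have kept it from hanging forever, something like this (a sketch; the url is made up):
```bash
#!/usr/bin/env bash
set -euo pipefail

URL='https://example.com/page'   # hypothetical target

# --max-time caps each attempt so a stuck connection can't hang the job,
# and the loop caps total attempts so rate limiting can't retry forever
for attempt in 1 2 3 4 5; do
  if curl --fail --silent --max-time 30 "$URL"; then
    exit 0
  fi
  sleep $((attempt * 10))   # simple linear backoff between attempts
done
echo "giving up after 5 attempts" >&2
exit 1
```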
it happened again. a few hours apart, schedule after schedule stopped working. only one remained working at this point
at least i think i identified the schedule that breaks all of them
but i have no idea how to troubleshoot this
not sure if you want to but we could schedule some time to meet and take a look together
Sure
what version are you on ?
v1.268.0-7-g0e7de63c4
I'm available now if you'd like
update: this was caused by an overloaded DB plus the lack of a progress monitor, which is now available in the latest version
can confirm it works now after increasing the maximum number of db connections and upgrading to the latest version
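for anyone landing here later: raising max_connections looks roughly like this (the 200 is illustrative, and postgres needs a restart for it to take effect):
```bash
# requires superuser; only takes effect after a postgres restart
psql "$DATABASE_URL" -c "ALTER SYSTEM SET max_connections = 200;"

# then restart postgres, e.g. for a dockerised instance:
docker restart windmill_db   # container name hypothetical
```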