crash
windmilldev-db-1 | 2023-11-29 19:07:23.202 UTC [1] LOG: server process (PID 439) was terminated by signal 9: Killed
windmilldev-db-1 | 2023-11-29 19:07:23.202 UTC [1] DETAIL: Failed process was running: UPDATE queue
windmilldev-db-1 | SET running = true
windmilldev-db-1 | , started_at = coalesce(started_at, now())
windmilldev-db-1 | , last_ping = now()
windmilldev-db-1 | , suspend_until = null
windmilldev-db-1 | WHERE id = (
windmilldev-db-1 | SELECT id
windmilldev-db-1 | FROM queue
windmilldev-db-1 | WHERE running = false AND scheduled_for <= now() AND tag = ANY($1)
windmilldev-db-1 | ORDER BY priority DESC NULLS LAST, scheduled_for, created_at
windmilldev-db-1 | FOR UPDATE SKIP LOCKED
windmilldev-db-1 | LIMIT 1
windmilldev-db-1 | )
windmilldev-db-1 | RETURNING *
Any idea what is causing this? Self-hosted.
Signal 9 (SIGKILL) means that the process received a signal to shut down/be killed.
A quick search for signal 9 and Docker suggests that you do not have enough memory: https://duckduckgo.com/?q=signal+9+docker&t=ffab&ia=web
Hmm, it's been working fine; I wonder why I all of a sudden don't have enough memory.
So I've got 8 GB of RAM, and when I do docker compose up the RAM usage skyrockets to the limit.
Have you set any resource limits on the pods? See the docker-compose in Windmill as an example:
GitHub: windmill/docker-compose.yml at main · windmill-labs/windmill
I think it has to do with a migration to a new Windmill version.
I changed the memory for workers to 1024, and now it seems that my workers are constantly restarting.
Can you try to stop all workers, have only the DB and 1 instance of the server running, and see if you get the same error?
No errors, and hardly any ram use
And in the server logs, do you see any info about the migration?
Could you please start one worker and show us the logs?
windmilldev-windmill_server-1 | 2023-11-29T20:04:55.629048Z INFO windmill: Last migration version: Some(20231128105015). Starting potential migration of the db if first connection on a new windmill version (can take a while depending on the migration) ...
windmilldev-windmill_server-1 | 2023-11-29T20:04:55.635642Z INFO windmill: Completed potential migration of the db. Last migration version: Some(20231128105015)
windmilldev-windmill_server-1 | 2023-11-29T20:04:55.635695Z INFO windmill:
windmilldev-windmill_server-1 | ##############################
windmilldev-windmill_server-1 | Windmill Community Edition v1.216.0-31-g72bb15f6a
windmilldev-windmill_server-1 | ##############################
Seems like the migration is OK. Let's hope the worker logs have an error in them.
Looks like it's just repeating this
The DB keeps saying connection to client lost
I just told docker compose to use image 1.216.0 and it seems to be working fine
Went back to main and it's doing it again
Strange. Try again tomorrow then. Maybe Ruben can help if it still fails tomorrow on master.
I keep getting this output over and over again: windmilldev-db-1 | 2023-11-30 15:43:29.785 UTC [188] LOG: could not send data to client: Connection reset by peer
windmilldev-db-1 | 2023-11-30 15:43:29.785 UTC [188] STATEMENT: UPDATE queue
windmilldev-db-1 | SET running = true
windmilldev-db-1 | , started_at = coalesce(started_at, now())
windmilldev-db-1 | , last_ping = now()
windmilldev-db-1 | , suspend_until = null
windmilldev-db-1 | WHERE id = (
windmilldev-db-1 | SELECT id
windmilldev-db-1 | FROM queue
windmilldev-db-1 | WHERE running = false AND scheduled_for <= now() AND tag = ANY($1)
windmilldev-db-1 | ORDER BY priority DESC NULLS LAST, scheduled_for, created_at
windmilldev-db-1 | FOR UPDATE SKIP LOCKED
windmilldev-db-1 | LIMIT 1
windmilldev-db-1 | )
windmilldev-db-1 | RETURNING *
windmilldev-db-1 | 2023-11-30 15:43:30.055 UTC [188] FATAL: connection to client lost
Just upgraded to 1.218 and still the same issue; it seems the worker is using up memory, the OS is killing it, and it's restarting. I get the above message from the DB over and over.
worker exits with code 137
Try to ping ruben for help
@rubenf
How big is your VM?
8 gig 2 cpu
Do you have anything other than Windmill on it?
no, just windmill
is the worker crashing after executing a job?
i only have 1 job that runs on the hour
that uses like < 3mb
hmm ok so not that
htop is showing I'm using 633 MB of 7.75 GB right now
I've limited it to only 1 worker and 1 native worker at the moment
If you look at the log output, it seems to be occurring with a listen event from the DB
Does the crash happen at start or after the job gets executed?
so it seems to crash upon deserializing the job row
And based on the fact that it's an OOM, for some reason that job is huge?
It's possible, I was parsing some very large CSV files.
Can you psql into the DB and look up the jobs in the queue?
ah yeah don't do that as an input
pull them from within the job
Well, it wasn't set up as a job
nor was it an input, the file was the input
or the filename, but it's worked for days so I'm not sure why it would cause an issue now
Yes, that's what I'm saying, don't take a big file as input directly.
Is the file bigger than it was before?
Anyway, the only way to debug this is to psql in and inspect your queue table.
ok
So I was using JavaScript in Bun to process a large CSV file; it was using streams and got to about 230 MB.
you can also look up the jobs through the server
Was that file taken as an input or output as a result?
It was read through a stream, modified, then written back to a stream.
So the file came from a shared folder on the VM, then was written back out to a shared folder on the VM.
yeah that wouldn't impact the queue table
So I would just suggest looking at what's inside the queue table.
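Something like this should surface any oversized rows (just a sketch; the logs column is an assumption about the schema, the other columns are the ones visible in the UPDATE from your Postgres log):
    -- List queued jobs by how much space their log text takes on disk.
    SELECT id,
           running,
           scheduled_for,
           tag,
           pg_size_pretty(pg_column_size(logs)::bigint) AS stored_log_size
    FROM   queue
    ORDER  BY pg_column_size(logs) DESC NULLS LAST
    LIMIT  10;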
OK, I'll take a look. Would a large log output cause an issue when testing a script during editing?
so like, console.log() many many times
depends how many is many
more than 1 billion yes
around a million, probably not
should be less than a million
I'll look at the table and see if I see anything
Well, using pgAdmin I can't view any rows in that table, I get a bad request error.
Can I truncate that table?
I would recommend playing around with that table in the SQL playground of pgAdmin or using psql.
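If the client chokes on the full rows, a query along these lines keeps the result small by truncating the log text instead of pulling the whole column (again a sketch; logs and script_path are assumptions about the schema):
    -- Preview queued jobs without fetching the full (possibly huge) log text.
    SELECT id,
           script_path,
           created_at,
           length(logs)    AS log_chars,
           left(logs, 200) AS log_preview
    FROM   queue
    ORDER  BY length(logs) DESC NULLS LAST;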
I think I got it fixed. There were 4 records in the table, two of which were from the script we talked about. The logs between the two were about 100 MB. I just deleted those two records and all seems well now.
if that's the reason, then I probably have a fix for this to fail in less spectacular ways
I can't verify it, but I think what happened is I had run the script without limiting the number of rows processed. So it ran for 30 seconds or so and generated huge output since I was debugging. I hit the cancel button and the browser crashed for a bit, then came back. I'm guessing the output was too big. The table size was 1.94 MB and the TOAST size was 904 MB, so something large was in there. When I looked at the queue table and saw the script in there, I just deleted the two rows, since that script was just a test and not scheduled or anything.
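That size breakdown can be pulled with the standard Postgres size functions, something like this (nothing Windmill-specific here):
    -- Main heap vs. TOAST (plus free-space/visibility maps) vs. total size of queue.
    SELECT pg_size_pretty(pg_relation_size('queue'))        AS main_table,
           pg_size_pretty(pg_table_size('queue')
                          - pg_relation_size('queue'))      AS toast_and_maps,
           pg_size_pretty(pg_total_relation_size('queue'))  AS total_with_indexes;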
Thx for the help
Yeah, so the fix will be to not try to pull the full log but to truncate it at a certain length.
This is a special code path where the job has to be retried,
and pulling the full logs when the logs are huge doesn't seem right.
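Conceptually it's just a matter of not returning the untruncated log column on that pull query, something like this (a sketch only; the real fix lives in the worker code, and the logs column name is an assumption):
    -- Same job-pull UPDATE as in the Postgres log above, but instead of
    -- RETURNING * (which drags the whole TOASTed log text along), cap the logs.
    UPDATE queue
    SET running = true
      , started_at = coalesce(started_at, now())
      , last_ping = now()
      , suspend_until = null
    WHERE id = (
        SELECT id
        FROM queue
        WHERE running = false AND scheduled_for <= now() AND tag = ANY($1)
        ORDER BY priority DESC NULLS LAST, scheduled_for, created_at
        FOR UPDATE SKIP LOCKED
        LIMIT 1
    )
    RETURNING id, tag, scheduled_for, left(logs, 10000) AS logs;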
That's why I needed the investigation; it's really hard to guess what the issues are 🙂