crash
windmilldev-db-1 | 2023-11-29 19:07:23.202 UTC [1] LOG: server process (PID 439) was terminated by signal 9: Killed
windmilldev-db-1 | 2023-11-29 19:07:23.202 UTC [1] DETAIL: Failed process was running: UPDATE queue
windmilldev-db-1 | SET running = true
windmilldev-db-1 | , started_at = coalesce(started_at, now())
windmilldev-db-1 | , last_ping = now()
windmilldev-db-1 | , suspend_until = null
windmilldev-db-1 | WHERE id = (
windmilldev-db-1 | SELECT id
windmilldev-db-1 | FROM queue
windmilldev-db-1 | WHERE running = false AND scheduled_for <= now() AND tag = ANY($1)
windmilldev-db-1 | ORDER BY priority DESC NULLS LAST, scheduled_for, created_at
windmilldev-db-1 | FOR UPDATE SKIP LOCKED
windmilldev-db-1 | LIMIT 1
windmilldev-db-1 | )
windmilldev-db-1 | RETURNING *
Any idea what is causing this? Self-hosted.
Signal 9 (SIGKILL) means that the process received a signal to shut down/be killed.
A quick search for signal 9 and Docker suggests that you do not have enough memory: https://duckduckgo.com/?q=signal+9+docker&t=ffab&ia=web
Hmm, it's been working fine; I wonder why I all of a sudden don't have enough memory.
So I've got 8 GB of RAM, and when I do docker compose up the RAM usage skyrockets to the limit.
Have you set any resource limits on the pods? See the docker-compose in Windmill as an example:
GitHub: windmill/docker-compose.yml at main · windmill-labs/windmill
I think it has to do with a migration to a new Windmill version.
I changed the memory for workers to 1024, and now it seems that my workers are constantly restarting.
Can you try to stop all workers, have only the DB and 1 instance of the server running, and see if you get the same error?
No errors, and hardly any ram use
And in the server logs, do you see any info about the migration?
Could you please start one worker and show us the logs?
windmilldev-windmill_server-1 | 2023-11-29T20:04:55.629048Z INFO windmill: Last migration version: Some(20231128105015). Starting potential migration of the db if first connection on a new windmill version (can take a while depending on the migration) ...
windmilldev-windmill_server-1 | 2023-11-29T20:04:55.635642Z INFO windmill: Completed potential migration of the db. Last migration version: Some(20231128105015)
windmilldev-windmill_server-1 | 2023-11-29T20:04:55.635695Z INFO windmill:
windmilldev-windmill_server-1 | ##############################
windmilldev-windmill_server-1 | Windmill Community Edition v1.216.0-31-g72bb15f6a
windmilldev-windmill_server-1 | ##############################
Seems like the migration is OK. Let's hope the worker logs have an error in them.
Looks like it's just repeating this
The DB keeps saying connection to client lost
I just told docker compose to use image 1.216.0 and it seems to be working fine
Went back to main and it's doing it again
Strange. Try again tomorrow then. Maybe Ruben can help if it still fails tomorrow on master.
I keep getting this output over and over again: windmilldev-db-1 | 2023-11-30 15:43:29.785 UTC [188] LOG: could not send data to client: Connection reset by peer
windmilldev-db-1 | 2023-11-30 15:43:29.785 UTC [188] STATEMENT: UPDATE queue
windmilldev-db-1 | SET running = true
windmilldev-db-1 | , started_at = coalesce(started_at, now())
windmilldev-db-1 | , last_ping = now()
windmilldev-db-1 | , suspend_until = null
windmilldev-db-1 | WHERE id = (
windmilldev-db-1 | SELECT id
windmilldev-db-1 | FROM queue
windmilldev-db-1 | WHERE running = false AND scheduled_for <= now() AND tag = ANY($1)
windmilldev-db-1 | ORDER BY priority DESC NULLS LAST, scheduled_for, created_at
windmilldev-db-1 | FOR UPDATE SKIP LOCKED
windmilldev-db-1 | LIMIT 1
windmilldev-db-1 | )
windmilldev-db-1 | RETURNING *
windmilldev-db-1 | 2023-11-30 15:43:30.055 UTC [188] FATAL: connection to client lost
Just upgraded to 1.218 and still the same issue; it seems the worker is using up memory, the OS is killing it, and it's restarting. I get the above message from the DB over and over.
worker exits with code 137
Try to ping ruben for help
@rubenf
How big is your VM?
8 gig 2 cpu
Do you have anything other than Windmill on it?
no, just windmill
is the worker crashing after executing a job?
i only have 1 job that runs on the hour
that uses like < 3mb
hmm ok so not that
htop is showing I'm using 633 MB of 7.75 GB right now
I've limited it to only 1 worker and 1 native worker at the moment
If you look at the log output, it seems to be occurring with a listen event from the DB
Does the crash happen at start or after the job gets executed?
so it seems to crash upon deserializing the job row
And based on the fact that it's an OOM, for some reason that job is huge?
It's possible, I was parsing some very large CSV files.
Can you psql into the DB and look up the jobs in the queue?
ah yeah don't do that as an input
pull them from within the job
Well, it wasn't set up as a job
nor was it an input, the file was the input
or the filename, but it's worked for days so I'm not sure why it would cause an issue now
Yes, that's what I'm saying, don't take a big file as input directly.
Is the file bigger than it was before?
Anyway, the only way to debug this is to psql in and inspect your queue table.
ok
So I was using JavaScript in Bun to process a large CSV file; it was using streams and got to about 230 MB.
you can also look up the jobs through the server
Was that file taken as an input or output as a result?
It was read through a stream, modified, then written back to a stream.
So the file came from a shared folder on the VM, then was written back out to a shared folder on the VM.
yeah that wouldn't impact the queue table
So I would just suggest looking at what's inside the queue table.
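Something like this should surface any oversized rows (just a sketch; the logs column is an assumption about the schema, the other columns are the ones visible in the UPDATE from your Postgres log):
    -- List queued jobs by how much space their log text takes on disk.
    SELECT id,
           running,
           scheduled_for,
           tag,
           pg_size_pretty(pg_column_size(logs)::bigint) AS stored_log_size
    FROM   queue
    ORDER  BY pg_column_size(logs) DESC NULLS LAST
    LIMIT  10;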
OK, I'll take a look. Would a large log output cause an issue when testing a script during editing?
so like, console.log() many many times
depends how many is many
more than 1 billion yes
around a million, probably not
should be less than a million
I'll look at the table and see if I see anything
Well, using pgAdmin I can't view any rows in that table, I get a bad request error.
Can I truncate that table?
I would recommend playing around with that table in the SQL playground of pgAdmin or using psql.
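If the client chokes on the full rows, a query along these lines keeps the result small by truncating the log text instead of pulling the whole column (again a sketch; logs and script_path are assumptions about the schema):
    -- Preview queued jobs without fetching the full (possibly huge) log text.
    SELECT id,
           script_path,
           created_at,
           length(logs)    AS log_chars,
           left(logs, 200) AS log_preview
    FROM   queue
    ORDER  BY length(logs) DESC NULLS LAST;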
I think I got it fixed. There were 4 records in the table, two of which were from the script we talked about. The logs between the two were about 100 MB. I just deleted those two records and all seems well now.
if that's the reason, then I probably have a fix for this to fail in less spectacular ways
I can't verify it, but I think what happened is I had run the script without limiting the number of rows processed. So it ran for 30 seconds or so and generated huge output since I was debugging. I hit the cancel button and the browser crashed for a bit, then came back. I'm guessing the output was too big. The table size was 1.94 MB and the TOAST size was 904 MB, so something large was in there. When I looked at the queue table and saw the script in there, I just deleted the two rows, since that script was just a test and not scheduled or anything.
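That size breakdown can be pulled with the standard Postgres size functions, something like this (nothing Windmill-specific here):
    -- Main heap vs. TOAST (plus free-space/visibility maps) vs. total size of queue.
    SELECT pg_size_pretty(pg_relation_size('queue'))        AS main_table,
           pg_size_pretty(pg_table_size('queue')
                          - pg_relation_size('queue'))      AS toast_and_maps,
           pg_size_pretty(pg_total_relation_size('queue'))  AS total_with_indexes;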
Thx for the help
Yeah, so the fix will be to not try to pull the full log but to truncate it at a certain length.
This is a special code path where the job has to be retried,
and pulling the full logs when the logs are huge doesn't seem right.
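Conceptually it's just a matter of not returning the untruncated log column on that pull query, something like this (a sketch only; the real fix lives in the worker code, and the logs column name is an assumption):
    -- Same job-pull UPDATE as in the Postgres log above, but instead of
    -- RETURNING * (which drags the whole TOASTed log text along), cap the logs.
    UPDATE queue
    SET running = true
      , started_at = coalesce(started_at, now())
      , last_ping = now()
      , suspend_until = null
    WHERE id = (
        SELECT id
        FROM queue
        WHERE running = false AND scheduled_for <= now() AND tag = ANY($1)
        ORDER BY priority DESC NULLS LAST, scheduled_for, created_at
        FOR UPDATE SKIP LOCKED
        LIMIT 1
    )
    RETURNING id, tag, scheduled_for, left(logs, 10000) AS logs;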
That's why I needed the investigation; it's really hard to guess what the issues are 🙂