BertP•3y ago

I seem to have some zombie runs

I seem to have some zombie runs triggered by a schedule. This concerns a flow that I was editing without having shut down the schedule first. The common symptoms: the status shows "flow is running" with step 1 of 2 completed. Step one produces an array of identifiers; step two is a for loop. The job details for step 2 say "Job loading..". There are no errors in the worker logs and, I suppose, no worker starvation either. The for loop contains 2 steps with retries enabled, but the first run that hangs shows no indication of being retried.

13 Replies

BertPOP•3y ago

2023-08-16T12:45:00.908654Z  INFO windmill_worker::worker: fetched job 0189fe5c-2caa-1bfb-be33-4dc2b7c0f50d, root job: none worker=wk-b660936c0a6c-MY9lW workspace_id=e-power-nieuwerkerken id=0189fe5c-2caa-1bfb-be33-4dc2b7c0f50d root_id=none

thread 'tokio-runtime-worker' panicked at 'could not create job dir: Os { code: 17, kind: AlreadyExists, message: "File exists" }', /windmill/windmill-worker/src/worker.rs:689:26

note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Error: task 50 panicked

2023-08-16T12:45:01.692046Z  INFO windmill: Connecting to database...

2023-08-16T12:45:01.737229Z  INFO windmill: Database connected

2023-08-16T12:45:01.745235Z  INFO windmill: 

##############################

Windmill Community Edition v1.148.0-7-g0af264f6

##############################

Every 5 minutes a new zombie spawns, and this seems to be the related message. The flow has a shared directory activated. This flow has been ported through 2 months of upgrades, and I see schedules now also have a path. Maybe that's related.

rubenf•3y ago

Schedules always had a path. I just did a fix where the error you're seeing would not appear. That error would have happened prior to the fix if the same job was re-run on the same worker, which can happen in some cases, but I'd like to make sure we root-cause this completely. Could you give me all the logs related to job 0189fe5c-2caa-1bfb-be33-4dc2b7c0f50d? You can grep your logs with that ID.
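For readers following along, the panic in the log above (`Os { code: 17, kind: AlreadyExists }`) is what Rust's `std::fs::create_dir` returns when the target directory is already present, which matches the "same job re-run on the same worker" scenario. The sketch below only illustrates that failure mode and one tolerant way to handle it; it is not the actual Windmill fix, and the `ensure_job_dir` helper and the temp directory name are hypothetical.

```rust
use std::fs;
use std::io::{self, ErrorKind};
use std::path::Path;

/// Hypothetical helper: create a job directory, treating an existing
/// directory as success instead of an error.
fn ensure_job_dir(path: &Path) -> io::Result<()> {
    match fs::create_dir(path) {
        Ok(()) => Ok(()),
        // The directory is left over from an earlier run of the same job.
        Err(e) if e.kind() == ErrorKind::AlreadyExists && path.is_dir() => Ok(()),
        Err(e) => Err(e),
    }
}

fn main() -> io::Result<()> {
    // Illustrative path only; a real worker would use its own job directory layout.
    let dir = std::env::temp_dir().join(format!("job_dir_sketch_{}", std::process::id()));
    let _ = fs::remove_dir_all(&dir);

    // First creation succeeds.
    fs::create_dir(&dir)?;

    // A second plain create_dir on the same path fails with AlreadyExists,
    // the same error kind that appears in the panic message.
    let second = fs::create_dir(&dir);
    assert_eq!(second.unwrap_err().kind(), ErrorKind::AlreadyExists);

    // The tolerant helper accepts the existing directory.
    ensure_job_dir(&dir)?;

    fs::remove_dir_all(&dir)?;
    Ok(())
}
```

Whether to reuse, clean, or reject a leftover directory is a design choice that depends on why the job was fetched twice, which is the root cause being investigated in this thread.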

BertPOP•3y ago

1 of 3 workers had these lines in their log

windmill_0189fe5c-2c...

BertPOP•3y ago

Script/flow data is not sensitive; the names are just random words thrown together.

rubenf•3y ago

do you use any sleep/suspend in that flow ?

BertPOP•3y ago

No, this started happening after I added a shared directory + exponential backoffs on two steps

BertPOP•3y ago

To clarify: no sleeps or suspends, and these symptoms appeared after I added a shared directory + exponential backoff on the two scripts inside the for loop. The "fetch updated invoices" step returns an empty array. At a high level, the for loop should "just" complete because there is nothing to do. The done indicator shows 100% completed, but the step breakdown says the for loop component hasn't run yet. I assume you can deduce what this means at a low level from the logs?

rubenf•3y ago

I'm still confused why that flow is being fetched as a job twice. I'm on it, but I have some clues about what's happening; the fix I did earlier should work.

BertPOP•3y ago

I suppose that fix wasn't pushed into the CE Docker image before 10:30 today? I'll pull the newest images once they become available.

rubenf•3y ago

I just did the fix

BertPOP•3y ago

oki

rubenf•3y ago

OK, I found the issue: it was an edge case that should be properly handled now. My fix is sufficient.

BertPOP•3y ago

thanks for the help!
