BertP
BertP15mo ago

I seem to have some zombie runs

I seem to have some zombie runs triggered by a schedule. This concerns a flow that I was editing without having shut down the schedule first. The common symptoms: the flow shows "flow is running" with step 1 of 2 completed. Step one produces an array of identifiers; step two is a for loop. The job details say "Job loading.." for step 2. There are no errors in the worker logs, and I suppose no worker starvation either. The flow has 2 steps within the for loop with retries enabled, but the first run that hangs shows no indication of being retried.
13 Replies
BertP
BertP15mo ago
2023-08-16T12:45:00.908654Z INFO windmill_worker::worker: fetched job 0189fe5c-2caa-1bfb-be33-4dc2b7c0f50d, root job: none worker=wk-b660936c0a6c-MY9lW workspace_id=e-power-nieuwerkerken id=0189fe5c-2caa-1bfb-be33-4dc2b7c0f50d root_id=none

thread 'tokio-runtime-worker' panicked at 'could not create job dir: Os { code: 17, kind: AlreadyExists, message: "File exists" }', /windmill/windmill-worker/src/worker.rs:689:26

note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Error: task 50 panicked

2023-08-16T12:45:01.692046Z INFO windmill: Connecting to database...

2023-08-16T12:45:01.737229Z INFO windmill: Database connected

2023-08-16T12:45:01.745235Z INFO windmill:

##############################

Windmill Community Edition v1.148.0-7-g0af264f6

##############################
Every 5 minutes a new zombie spawns, and this seems to be the related message. The flow has a shared directory activated. This flow has been ported through 2 months of upgrades, and I see schedules now also have a path; maybe that's related.
rubenf
rubenf15mo ago
Schedules always had a path. I just did a fix where the error you're seeing would not appear. That error would have happened prior to the fix if the same job was re-run on the same worker, which can happen in some cases, but I'd like to make sure we root cause this completely. Could you give me all the logs related to job 0189fe5c-2caa-1bfb-be33-4dc2b7c0f50d? You can grep your logs with that id.
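For readers following along: the panic in the log ("could not create job dir ... code: 17, kind: AlreadyExists") is what a strict directory create does when it runs a second time for the same path. Below is a minimal Python sketch of that failure mode and of an idempotent alternative. It is only an illustration under assumptions: the function names and the directory layout are hypothetical, and this is not Windmill's worker code or the actual fix.

```python
import errno
import os
import tempfile


def create_job_dir_strict(base: str, job_id: str) -> str:
    """Create the job directory and fail if it already exists."""
    path = os.path.join(base, job_id)
    # Raises FileExistsError (EEXIST, errno 17 on Linux) on a second call.
    os.mkdir(path)
    return path


def create_job_dir_idempotent(base: str, job_id: str) -> str:
    """Create the job directory, tolerating one left over from an earlier run."""
    path = os.path.join(base, job_id)
    os.makedirs(path, exist_ok=True)
    return path


if __name__ == "__main__":
    job_id = "0189fe5c-2caa-1bfb-be33-4dc2b7c0f50d"
    with tempfile.TemporaryDirectory() as base:
        create_job_dir_strict(base, job_id)
        try:
            # Simulates the same job being fetched again on the same worker.
            create_job_dir_strict(base, job_id)
        except FileExistsError as exc:
            print("second strict create failed with EEXIST:", exc.errno == errno.EEXIST)
        print("idempotent create returned:", create_job_dir_idempotent(base, job_id))
```

Whether tolerating an existing directory is the right behaviour depends on why the job was fetched twice, which is the root-cause question raised above.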
BertP
BertP15mo ago
1 of 3 workers had these lines in their log
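Collecting the lines for one job id across several worker logs, as requested above, can be sketched roughly like this in Python. The helper name `lines_for_job` is hypothetical, and the sample lines are abridged from the log earlier in this thread.

```python
from typing import Iterable, Iterator


def lines_for_job(lines: Iterable[str], job_id: str) -> Iterator[str]:
    """Yield only the log lines that mention the given job id."""
    for line in lines:
        if job_id in line:
            yield line.rstrip("\n")


if __name__ == "__main__":
    # Sample lines abridged from the worker log posted in this thread.
    sample = [
        "2023-08-16T12:45:00.908654Z INFO windmill_worker::worker: "
        "fetched job 0189fe5c-2caa-1bfb-be33-4dc2b7c0f50d, root job: none\n",
        "2023-08-16T12:45:01.692046Z INFO windmill: Connecting to database...\n",
    ]
    for line in lines_for_job(sample, "0189fe5c-2caa-1bfb-be33-4dc2b7c0f50d"):
        print(line)
```

Note that a plain id filter skips lines that do not carry the id, such as the panic message itself in the log above, so it can help to also keep a few lines of context around each match.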
BertP
BertP15mo ago
script/flow data is not sensitive, names are just random words thrown together
rubenf
rubenf15mo ago
do you use any sleep/suspend in that flow ?
BertP
BertP15mo ago
No, this started happening after I added a shared directory + exponential backoffs on two steps
BertP
BertP15mo ago
To clarify: no sleeps or suspends. These symptoms appeared after I added a shared directory + exponential backoff on the two scripts inside the for loop. The "fetch updated invoices" step returns an empty array, so at a high level the for loop should "just" complete because there is nothing to do. The done indicator shows 100% completed, but the step breakdown says the for-loop component hasn't run yet. What this means at a low level I assume you can deduce from the logs?
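The expectation described above (a for loop over an empty array completes immediately, even when its inner steps have exponential-backoff retries) can be illustrated with a generic Python sketch. All names here (`backoff_delays`, `run_with_retries`, `for_each`) are hypothetical, and this is not how Windmill implements flows or retries.

```python
import time
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def backoff_delays(base_s: float, factor: float, attempts: int) -> List[float]:
    """Delays before each retry: base, base*factor, base*factor**2, ..."""
    return [base_s * factor**i for i in range(attempts)]


def run_with_retries(
    step: Callable[[T], R],
    item: T,
    delays: List[float],
    sleep: Callable[[float], None] = time.sleep,
) -> R:
    """Call step(item); on failure wait the next delay and try again."""
    for delay in delays:
        try:
            return step(item)
        except Exception:
            sleep(delay)
    return step(item)  # final attempt; any error propagates to the caller


def for_each(items: Iterable[T], step: Callable[[T], R], delays: List[float]) -> List[R]:
    """Run step on every item; an empty input yields an empty result at once."""
    return [run_with_retries(step, item, delays) for item in items]


if __name__ == "__main__":
    delays = backoff_delays(1.0, 2.0, 3)
    # With nothing to iterate over, no step runs and no retry is scheduled.
    print(for_each([], lambda invoice_id: invoice_id, delays))
```

With zero items the retry settings never come into play, which is why a loop that stays "not run yet" on an empty array points at job scheduling rather than at the steps inside the loop.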
rubenf
rubenf15mo ago
I'm still confused why that flow is being fetched as a job twice. I'm on it, but I have some clues about what's happening; the fix I did earlier should work.
BertP
BertP15mo ago
I suppose that fix wasn't pushed into the CE Docker image before 10:30 today? I'll pull the newest images once they become available.
rubenf
rubenf15mo ago
I just did the fix
BertP
BertP15mo ago
oki
rubenf
rubenf15mo ago
Ok, I found the issue. It was an edge case that should be properly handled now; my fix is sufficient.
BertP
BertP15mo ago
thanks for the help!