I seem to have some zombie runs
I seem to have some zombie runs triggered by a schedule. This is about a flow that I was editing without shutting down the schedule first. The common symptoms are "flow is running" with step 1 of 2 completed. Step one produces an array of identifiers, step two is a for loop. Job details say "Job loading.." for step 2. No errors in worker logs, and I suppose no worker starvation either. The for loop contains 2 steps with retries enabled, but the first run that hangs shows no indication of being retried.
every 5 minutes a new zombie spawns, and this seems to be the related message
the flow has a shared directory activated
this flow has been carried through 2 months of upgrades, and i see schedules now also have a path. maybe that's related
Schedules always had a path
I just did a fix where the error you're seeing would not appear
So that error would have happened prior to the fix if the same job was re-run on the same worker
which can happen in some cases but i'd like to make sure we root cause completely
Could you give me all the logs related to job:
0189fe5c-2caa-1bfb-be33-4dc2b7c0f50d
you can grep your logs with that id
1 of 3 workers had these lines in their log
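For anyone following along, the log search suggested above can be sketched in Python as below. This is only an illustration: the helper names `lines_mentioning` and `search_log_file` are hypothetical, and the location and format of the worker logs depend on your deployment.

```python
from pathlib import Path

# Job id quoted in the thread.
JOB_ID = "0189fe5c-2caa-1bfb-be33-4dc2b7c0f50d"


def lines_mentioning(job_id, lines):
    """Return (line_number, text) pairs for every line that contains job_id."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if job_id in line
    ]


def search_log_file(job_id, path):
    """Scan one log file; `path` is whatever file your worker writes to (assumption)."""
    with Path(path).open(encoding="utf-8", errors="replace") as handle:
        return lines_mentioning(job_id, handle)
```

Running `search_log_file(JOB_ID, ...)` once per worker log shows which workers picked up the job, which is the information asked for here.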
script/flow data is not sensitive, names are just random words thrown together
do you use any sleep/suspend in that flow?
No, this started happening after I added a shared directory + exponential backoffs on two steps
to clarify; no sleep or suspends. and these symptoms appeared after I added a shared directory + exponential backoff on the two scripts inside the for loop
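As background on the exponential backoff mentioned above, here is a minimal generic sketch of how such retry delays usually grow. The function name, parameters, and default values are assumptions for illustration only; this is not the scheduler's actual retry formula or configuration.

```python
def backoff_delays(attempts, base_seconds=1.0, multiplier=2.0, max_seconds=60.0):
    """Delay before each retry: base * multiplier**attempt, capped at max_seconds.

    All parameter names and defaults are hypothetical.
    """
    return [
        min(base_seconds * multiplier ** attempt, max_seconds)
        for attempt in range(attempts)
    ]
```

With these assumed defaults the wait doubles on every retry until it reaches the cap, so a step that keeps failing is retried less and less often.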
"fetch updated invoices" returns an empty array. at a high level the for loop should "just" complete because there is nothing to do. The done indicator shows 100% completed, but the step breakdown says the for loop component hasn't run yet. what this means at a low level i assume you can deduce from the logs?
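The expectation described above, that a for loop over an empty array completes immediately with zero iterations, can be written down as a toy model. This is a sketch of the expected behaviour only, with a hypothetical `run_for_loop` helper; it does not reflect how the flow executor is actually implemented.

```python
def run_for_loop(items, step):
    """Toy model of a for-loop step: apply `step` to each item, then report completion.

    With an empty `items` list the body never runs and the loop is still "completed".
    """
    results = [step(item) for item in items]
    return {"status": "completed", "iterations": len(results), "results": results}
```

In this model an empty result from the previous step is a normal, successful outcome and not something that should leave the loop waiting.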
i'm still confused why that flow is being fetched as a job twice
I'm on it and I have some clues about what's happening, the fix I did earlier should work
i suppose that fix wasn't pushed into the CE docker image before 10:30 today? i'll pull the newest images once they become available
I just did the fix
oki
Ok I found the issue and it was an edge-case that should be properly handled now
my fix is sufficient
thanks for the help!