Tiago Serafim
Tiago Serafim7mo ago

Deadlock and "was unable to make the last transition"

I'm running EE v1.303.4. I have a flow that only uses REST scripts; one of them is a call to OpenAI's chat completion, cached by Cloudflare. I noticed that when I test this flow with the results already cached by Cloudflare, the calls return almost instantly and the flow executes very fast, with the errors in the screenshot. I did not even set an error handler, yet it shows up with the error message InternalErr: Sql error: error returned from database: deadlock detected. In the console I only see the same error as in the UI: Flow 018eb3b7-2601-ffbd-c7c0-72933c171ae7 cancelled as one of the parallel branch 018eb3b7-2655-648a-161f-c4cb1314d5e9 was unable to make the last transition. Since it's on my dev machine, I don't think it's related to a lack of compute resources. This happens even with lower values of parallelism in the for loop, such as 5.
21 Replies
rubenf
rubenf7mo ago
@Tiago Serafim we very likely solved that today in v1.304.0. That release actually had an issue, but 1.304.1 should work.
Tiago Serafim
Tiago Serafim7mo ago
Thank you so much!
Tiago Serafim
Tiago Serafim7mo ago
Now on "EE v1.304.2-7-g587824ccf", I'm still getting some strange errors. The flow is reported as successful on the Runs page, but the outer loop is shown in red and some of the iterations return the error in the second screenshot. cc @rubenf
rubenf
rubenf7mo ago
Is that the same flow run or a new flow run?
Tiago Serafim
Tiago Serafim7mo ago
Same
rubenf
rubenf7mo ago
Can you check whether you get the same error on a new flow run? We have fixed it so that a new flow run does not enter a deadlock state. What you see is just our monitoring reporting that it has errored a flow that didn't progress.
Tiago Serafim
Tiago Serafim7mo ago
Thanks, will check as soon as I get back to the PC. Sorry, do you mean that I should not click on "Run Again" and instead input the same parameters on a new run?
Tiago Serafim
Tiago Serafim7mo ago
Tried again by clicking on Run and pasting a different input. The attached screenshot is from the subflow. The iterator was stuck on 500/500 from the start, and the screen was getting spammed with toasts saying that it couldn't fetch job details. Afterwards, the browser tab crashed. After reopening it, the subflow showed this error and I manually cancelled the outer flow (the outer flow splits the 2000-item input into 4 batches and sends 500 items to each subflow run).
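For readers following along, the batching described above (2000 items split into 4 batches of 500) can be sketched roughly as below. This is only an illustration in plain Python; `split_into_batches` is a hypothetical helper name, not something taken from the flow in this thread.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_into_batches(items: Sequence[T], batch_size: int) -> List[List[T]]:
    """Split items into consecutive batches of at most batch_size elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [list(items[i:i + batch_size]) for i in range(0, len(items), batch_size)]


# Hypothetical input standing in for the 2000 items mentioned in the thread.
items = list(range(2000))
batches = split_into_batches(items, 500)
```

Each batch would then be handed to one subflow run, which iterates over its 500 items in parallel.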
Tiago Serafim
Tiago Serafim7mo ago
Also, this is running on latest, since I stopped Docker on my local dev machine and started it again before retrying.
rubenf
rubenf7mo ago
What's your parallelism?
Tiago Serafim
Tiago Serafim7mo ago
I bumped it back to 50 this morning, but on Saturday it was erroring even with 5.
rubenf
rubenf7mo ago
That error is with 50, right?
Tiago Serafim
Tiago Serafim7mo ago
The errors today, yes
rubenf
rubenf7mo ago
How consistently do you get that error?
Tiago Serafim
Tiago Serafim7mo ago
In the first run today I got the error in 2 of the 4 outer loop iterations. In this latest run, I got it on the first iteration and cancelled. Since it's running live against OpenAI, I'm wary of trying too much and burning credits.
rubenf
rubenf7mo ago
Yup, I can reproduce some of that issue. Will attempt to fix it further.
Tiago Serafim
Tiago Serafim7mo ago
Thank you! I'm using this for the native workers:
replicas: 64
resources:
  limits:
    cpus: "0.1"
    memory: 128M
I don't know if that might be too tight for a parallelism of 50.
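As a rough sanity check on those numbers, the totals implied by the settings above can be worked out as below. This is a minimal sketch that assumes the limits apply per replica; `parse_memory_mb` is a hypothetical helper that only understands the `M` and `G` suffixes.

```python
def parse_memory_mb(value: str) -> int:
    """Convert a memory string such as '128M' or '1G' to megabytes (hypothetical helper)."""
    units = {"M": 1, "G": 1024}
    suffix = value[-1].upper()
    if suffix not in units:
        raise ValueError(f"unsupported memory unit in {value!r}")
    return int(value[:-1]) * units[suffix]


# Values copied from the worker settings quoted in the thread.
replicas = 64
cpus_per_replica = 0.1
memory_per_replica = "128M"
parallelism = 50

total_cpus = replicas * cpus_per_replica
total_memory_mb = replicas * parse_memory_mb(memory_per_replica)
workers_per_parallel_branch = replicas / parallelism

print(f"total CPU limit: {total_cpus:.1f} cores")
print(f"total memory limit: {total_memory_mb} MB")
print(f"workers per parallel branch: {workers_per_parallel_branch:.2f}")
```

So there are more workers than parallel branches at a parallelism of 50, but each worker gets only a tenth of a core. Whether that is actually too tight depends on the workload and is not something this arithmetic can settle.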
rubenf
rubenf7mo ago
@Tiago Serafim you should try with latest releases, quite a few improvements
Tiago Serafim
Tiago Serafim7mo ago
Thanks, will do! Now it works, thanks! The only thing I still notice is that the iteration counter in the parallel subflow is always stuck at N/N.
rubenf
rubenf7mo ago
It's not stuck there: there are literally N flows started, they're just not all progressing :>
Tiago Serafim
Tiago Serafim7mo ago
Understood, I thought the counter was supposed to show the progress of the completed jobs. Thanks!
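The difference between "started" and "completed" can be illustrated with a small asyncio sketch. This is not Windmill's scheduler, only an assumption-laden model in which all N branches are created up front and a semaphore limits how many progress at once; `run_parallel` and the `stats` keys are made-up names.

```python
import asyncio
from typing import Dict


async def run_parallel(n_items: int, parallelism: int) -> Dict[str, int]:
    """Start n_items branches at once; at most `parallelism` of them progress together."""
    semaphore = asyncio.Semaphore(parallelism)
    stats = {"started": 0, "running": 0, "max_running": 0, "completed": 0}

    async def branch(index: int) -> None:
        async with semaphore:  # wait here until a slot is free
            stats["running"] += 1
            stats["max_running"] = max(stats["max_running"], stats["running"])
            await asyncio.sleep(0)  # stand-in for the real work of one iteration
            stats["running"] -= 1
            stats["completed"] += 1

    tasks = []
    for i in range(n_items):
        tasks.append(asyncio.create_task(branch(i)))
        stats["started"] += 1

    # Every branch has been started, but none has had a chance to finish yet.
    stats["started_at_launch"] = stats["started"]
    stats["completed_at_launch"] = stats["completed"]

    await asyncio.gather(*tasks)
    return stats


stats = asyncio.run(run_parallel(n_items=500, parallelism=50))
print(stats)
```

In this model a counter based on started branches reads 500/500 right after launch, while a counter based on completed branches would climb from 0 to 500 as the work finishes.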