rubenf (Windmill, 2y ago)
I will write a blog post on it asap, but this weekend we achieved a very cool property for a distributed system:
- Scripts were always 100% reliable, as in they would either execute and complete with a success or failure, or be retried if the worker crashed at ANY point (and I really mean any, even mid-transaction; that's the beauty of relying on the beast that is PostgreSQL). This was achieved using atomic statements for pulling jobs and for writing back their progress timestamps, both regularly and on completion.
- Flows were 99% reliable but had a few extremely short-lived windows in which, if a crash happened, a flow could be stuck forever. Those events were so rare and unlikely on modern infra that we didn't prioritize fixing them, but that is now done:
Flows are now guaranteed to complete once they are scheduled, provided enough workers are there to process them. This is done through a series of atomic statements in the right places of the finite state machine that runs the flows. If such a crash happens on a machine, the flow is guaranteed to progress in a finite amount of time and propagate the error back up, where it will be treated by error handlers, if any, making Windmill 100% observable.