rubenf
rubenf9mo ago

I will write a blog post asap on it, but this weekend we achieved a very cool property for a distributed system:
- Scripts were always 100% reliable, as in: they would either execute to completion with a success or failure, or be retried if the worker crashed at ANY point (and I really mean any, even mid-transaction; that's the beauty of relying on the beast that is PostgreSQL). This was achieved using atomic statements to pull jobs and to write back their progress timestamps, both regularly and on completion.
- Flows were 99% reliable, but there were some extremely ephemeral points in time where, if a crash happened, a flow could be stuck forever. Those events were so rare and unlikely on modern infra that we didn't prioritize improving that, but that is now done: flows are now guaranteed to complete once scheduled, given that enough workers are there to process them. This is done through a series of atomic statements in the right places of the finite state machine that runs the flows. If such a machine crash happens, the flow is guaranteed to progress in a finite amount of time and to propagate the error back up, where it is then handled by error handlers if any, making Windmill 100% observable.
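[Editor's note: for illustration only, here is a minimal sketch of what such an atomic job pull can look like in Postgres. The queue table, its columns, and the connection string are hypothetical placeholders, not Windmill's actual schema; the point is that claiming the job and stamping its progress happen in a single statement, so a crash at any moment leaves the job either unclaimed or detectably stale.]

import psycopg2

conn = psycopg2.connect("dbname=jobs")  # hypothetical connection string

def pull_job(worker_id):
    # Claim the oldest available job and record a heartbeat timestamp in
    # one atomic UPDATE; SKIP LOCKED lets concurrent workers pull safely.
    with conn:  # commits on success, rolls back if anything fails mid-way
        with conn.cursor() as cur:
            cur.execute(
                """
                UPDATE queue
                   SET running = true,
                       worker = %s,
                       last_ping = now()
                 WHERE id = (
                       SELECT id
                         FROM queue
                        WHERE running = false
                        ORDER BY scheduled_for
                        LIMIT 1
                        FOR UPDATE SKIP LOCKED)
                RETURNING id, payload
                """,
                (worker_id,),
            )
            return cur.fetchone()  # None if no job is currently available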
andness
andness9mo ago
This is very interesting but I'm missing one clarification: this only applies to scripts that are part of a flow and have retries enabled, right?
rubenf
rubenfOP9mo ago
No, it applies to all flows. It's not about a flow failing because the script errored; it's about nodes/machines literally crashing without Windmill being informed of it. Windmill will now handle that properly 100% of the time.
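[Editor's note: a hedged sketch of how such crash detection can work, reusing the same hypothetical queue table as above rather than Windmill's real mechanism: any running job whose heartbeat is older than a timeout is handed back to the pool in one atomic statement, so the flow's state machine can resume it or propagate the failure.]

ORPHAN_TIMEOUT = "60 seconds"  # assumed value, purely for illustration

def requeue_orphaned_jobs(conn):
    # Jobs whose worker stopped pinging are treated as orphaned by a dead
    # machine and made pullable again in a single atomic UPDATE.
    with conn, conn.cursor() as cur:
        cur.execute(
            """
            UPDATE queue
               SET running = false,
                   worker = NULL
             WHERE running = true
               AND last_ping < now() - %s::interval
            """,
            (ORPHAN_TIMEOUT,),
        )
        return cur.rowcount  # how many jobs were handed back to the pool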
andness
andness9mo ago
Ok so you can actually restart the script exactly at the point it stopped? Like ...
db.execute("drop table foo")
--- CRASH ---
db.execute("create table bar")
db.execute("drop table foo")
--- CRASH ---
db.execute("create table bar")
And you will not retry the drop table but continue with the create table?
rubenf
rubenfOP9mo ago
Not exactly where it stopped; it's restarted from the beginning.
andness
andness9mo ago
Yeah, so if the script isn't written to be restartable it will fail then (since it tries to drop the now non-existent table foo).
rubenf
rubenfOP9mo ago
Yes, idempotency right now still needs to be implemented at the user level.
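[Editor's note: for example, one way to make the snippet above idempotent at the user level, reusing the hypothetical db handle from it, is to guard the DDL so a full re-run after a mid-script crash is harmless.]

db.execute("drop table if exists foo")        # no-op if foo was already dropped before the crash
db.execute("create table if not exists bar")  # no-op if bar was already created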
andness
andness9mo ago
For data pipelines I generally try to write them so that they are idempotent and restartable, since that makes for a more robust system. But a risk with this auto-restarting is that you might get duplicated data if you're not aware of it. Say a flow uses its first script to determine where to start loading data, and a second script does the loading: if you restart only the second script, you risk loading all the data twice. When I implement things like that I try to make the flow as a whole restartable, and having it automatically restart one step like that could actually undermine the idempotency I've built in.
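[Editor's note: a sketch of one way to keep such a load step idempotent under automatic restarts, with a hypothetical events table and time window rather than any Windmill API: doing the delete and the insert for the same window inside one transaction means a restarted attempt first wipes whatever a crashed attempt wrote, so nothing is loaded twice.]

def load_window(conn, start_ts, end_ts, rows):
    # Re-running this for the same [start_ts, end_ts) window is safe: the
    # delete removes any rows a previous, partially completed attempt wrote,
    # and both statements commit (or roll back) as a single transaction.
    with conn, conn.cursor() as cur:
        cur.execute(
            "DELETE FROM events WHERE event_ts >= %s AND event_ts < %s",
            (start_ts, end_ts),
        )
        cur.executemany(
            "INSERT INTO events (event_ts, payload) VALUES (%s, %s)",
            [(r["event_ts"], r["payload"]) for r in rows],
        )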