rubenf · 8mo ago
I will write a blog post on it asap, but this weekend we achieved a very cool property for a distributed system:

- Scripts were always 100% reliable, as in: they would either execute and complete with a success or failure, or be retried if the worker crashed at ANY point (and I really mean any, even mid-transaction; that's the beauty of relying on the beast that is PostgreSQL). This was achieved using atomic statements for pulling jobs and writing back their progress timestamps, regularly and on completion.
- Flows were 99% reliable but had some extremely ephemeral points in time where, if a crash happened, a flow could be stuck forever. Those events were so rare and unlikely on modern infra that we didn't prioritize improving that, but that is now done: flows are now guaranteed to complete once they are scheduled, given that enough workers are there to process them. This is done through a series of atomic statements in the right places of the finite state machine that runs the flows. If such a crash happens on a machine, the flow is guaranteed to progress in a finite amount of time and propagate the error back up, where it is then handled by error handlers if any, making Windmill 100% observable.
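(For context, here is a minimal sketch of the kind of atomic pull-and-heartbeat pattern described above, written against plain Postgres with psycopg2. The table and column names, job_queue, running, last_ping, scheduled_for, are illustrative assumptions, not Windmill's actual schema or queries.)

# Sketch of atomic job pulling plus heartbeat timestamps on Postgres.
# Table/column names are hypothetical, not Windmill's real schema.
import psycopg2

conn = psycopg2.connect("dbname=jobs")
conn.autocommit = True  # each statement below is atomic on its own

def pull_job(worker_id):
    # Claim one queued job atomically; concurrent workers skip locked rows,
    # so a job can never be pulled twice.
    with conn.cursor() as cur:
        cur.execute(
            """
            UPDATE job_queue
               SET running = true,
                   worker_id = %s,
                   last_ping = now()
             WHERE id = (
                   SELECT id FROM job_queue
                    WHERE running = false
                    ORDER BY scheduled_for
                      FOR UPDATE SKIP LOCKED
                    LIMIT 1)
         RETURNING id, payload
            """,
            (worker_id,),
        )
        return cur.fetchone()  # None if the queue is empty

def heartbeat(job_id):
    # Refresh the progress timestamp regularly; a monitor can re-queue any job
    # whose last_ping is too old, which covers workers crashing at any point.
    with conn.cursor() as cur:
        cur.execute("UPDATE job_queue SET last_ping = now() WHERE id = %s", (job_id,))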
andness · 8mo ago
This is very interesting, but I'm missing one clarification: this only applies to scripts that are part of a flow and have retries enabled, right?
rubenf · 8mo ago
No, it applies to all flows. It's not about a flow failing because the script errored; it's about nodes/machines literally crashing without Windmill being informed of it. Windmill will now handle that properly 100% of the time.
andness · 8mo ago
OK, so you can actually restart the script exactly at the point where it stopped? Like...
db.execute("drop table foo")
--- CRASH ---
db.execute("create table bar")
db.execute("drop table foo")
--- CRASH ---
db.execute("create table bar")
And you will not retry the drop table but continue with the create table?
rubenf · 8mo ago
Not exactly where it stopped; the script is restarted from the beginning.
andness · 8mo ago
Yeah, so if the script isn't written to be restartable, it will fail then (since it tries to drop the now non-existent table foo).
rubenf · 8mo ago
Yes, idempotency still needs to be implemented at the user level right now.
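(As an illustration, here is a hypothetical idempotent rewrite of the drop/create snippet above; sqlite3 is used only to make it runnable, and the same IF EXISTS / IF NOT EXISTS guards work in Postgres.)

import sqlite3

# Hypothetical idempotent version of the earlier snippet: if the worker crashes
# anywhere in here and the whole script is re-run from the top, every statement
# still succeeds because each one tolerates its own prior effect.
db = sqlite3.connect("example.db")
db.execute("DROP TABLE IF EXISTS foo")
db.execute("CREATE TABLE IF NOT EXISTS bar (id INTEGER PRIMARY KEY)")
db.commit()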
andness · 8mo ago
For data pipelines I generally try to write them so that they are idempotent and restartable, since that makes for a more robust system. But a risk with this auto-restarting is that you might get duplicated data if you're not aware of it. For example, if a flow uses a first script to determine where to start loading data, and a second script does the loading, restarting the second script risks loading all the data twice. When I implement things like that, I try to make the flow as a whole restartable, and having one step automatically restart like that would actually undermine the idempotency I've built in.
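(A hedged sketch of one way to keep a single load step safe to re-run after an automatic restart: delete and re-insert the batch inside one transaction, so a crash-and-retry redoes the batch instead of duplicating it. The events table, batch_id column, and load_batch helper are made up for illustration; sqlite3 is used only so the example runs on its own.)

import sqlite3

db = sqlite3.connect("warehouse.db")
db.execute("CREATE TABLE IF NOT EXISTS events (batch_id TEXT, payload TEXT)")
db.commit()

def load_batch(batch_id, payloads):
    # Delete and re-insert the whole batch in one transaction: if the worker
    # crashes mid-load and the step is restarted, the batch is redone once,
    # never duplicated.
    with db:  # the connection context manager commits on success, rolls back on error
        db.execute("DELETE FROM events WHERE batch_id = ?", (batch_id,))
        db.executemany(
            "INSERT INTO events (batch_id, payload) VALUES (?, ?)",
            [(batch_id, p) for p in payloads],
        )

load_batch("2024-01-01", ["a", "b", "c"])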