DuckLake Speed
This is mostly a general question to see if my issue is simple (and on my end).
I got DuckLake working for simple table creation and insert scripts. I can view the table and records inserted.
But it’s very slow to take any action: 10-15 seconds for any kind of write and 5-10 seconds for a read, with one table, fewer than 50 records, and only one or two records written at a time. I understand the difference between a standard transactional database and this, but it still seems slower than I’d expect.
I’m using CE with docker-compose, hosting all the containers on a Hetzner VM with 8 GB of RAM, and using Hetzner’s S3-compatible object storage.
I know that 8gb isn’t much for numerous containers all working together, but with all other actions the performance is perfectly acceptable.
So, is it possible the server specs are the issue? Something different about Hetzner’s object storage? A misconfiguration? Or do I just not understand how DuckLake is expected to work?
Thanks for your help!
Just thought of something else: I am using a Germany-based Hetzner server from the US. When I perform actions like HTTP requests, that results in a predictable delay in the hundreds-of-milliseconds range. When working with DuckLake, is there enough happening on the front end that would contribute to it feeling slow? (10-15 seconds still seems slower than I’d expect, but depending on how many round trips there are, I could see it adding up to noticeable slowness.)
(Context: I stood up my own instance partially because I am very much enjoying Windmill, and also because we are seriously considering moving to it in the college IT team I work in — self-hosted EE. The performance of DuckLake won’t factor into deciding one way or the other - but it definitely might be something we use for data warehousing type work if it meets our needs).
I will do more detailed benchmarks tomorrow to give a better answer
I have some caching optimisations left to do to avoid some round trips to the wmill database in the S3 Proxy,
but the DuckDB executor should be very fast (no additional round trips to db/backend)
A data lake is expected to be slow, even without the Windmill integration
But 15 seconds does seem like a lot, I don’t recall it being so slow for such a small dataset
It's very probably due to S3 roundtrips
I could not reproduce such high latency
When fully local (MinIO S3), it's really really fast, insertion takes ~300ms in a scenario similar to yours (simple table with 50 rows inserted individually), same for read.
It starts to get a little slower on AWS S3 but more like ~2.5s
How is the performance if you try to write to S3 in DuckDB outside of ducklake:
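The snippet itself wasn’t captured in the thread; a minimal sketch of such a check, assuming the httpfs extension and placeholder bucket/credential values:

```sql
-- Sketch only: time a plain Parquet write to S3 from DuckDB, no DuckLake.
-- Bucket, keys, and endpoint are placeholders for your Hetzner values.
INSTALL httpfs;
LOAD httpfs;
CREATE SECRET hetzner (
    TYPE s3,
    KEY_ID 'your-access-key',
    SECRET 'your-secret-key',
    ENDPOINT 'fsn1.your-objectstorage.com'
);
.timer on
COPY (SELECT 1 AS id, 'hello' AS msg) TO 's3://my-bucket/latency-test.parquet';
```

If this single COPY already takes multiple seconds, the bottleneck is the path to S3 itself rather than anything DuckLake does on top.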
And outside of DuckDB:
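The original Bun script wasn’t reproduced in the thread; a minimal sketch of a raw round-trip timing check, assuming only the standard `fetch` API and a placeholder endpoint URL:

```typescript
// Hypothetical round-trip timer for an S3 endpoint, independent of DuckDB.
// The endpoint URL below is a placeholder -- point it at your bucket host.

// Time a single async operation in milliseconds.
async function timeIt(fn: () => Promise<unknown>): Promise<number> {
  const start = performance.now();
  await fn();
  return performance.now() - start;
}

async function main(): Promise<void> {
  const endpoint = "https://fsn1.your-objectstorage.com"; // placeholder
  for (let i = 1; i <= 3; i++) {
    // A HEAD request measures pure network + TLS round-trip cost.
    const ms = await timeIt(() => fetch(endpoint, { method: "HEAD" }));
    console.log(`round trip ${i}: ${ms.toFixed(0)} ms`);
  }
}

// Only hit the network when explicitly asked to.
if (process.env.RUN_S3_CHECK) {
  main().catch(console.error);
}
```

Multiply the per-round-trip number by however many object operations a single insert needs and you get a rough floor for how fast any write can be.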
If these are slow too, then most probably it's a problem with your setup that makes roundtrips to S3 slow for some reason
Thanks for checking on that. I’ll probably have a chance to try that later today.
If we did use it, it would be with on-premise S3-compatible storage, so it’s nice to see the performance on a local setup (over a local network it would be a little slower, but data still wouldn’t have far to travel).
I can confirm high latency. I am currently using external Neon DB (Azure Germany West Central, Frankfurt) and Cloudflare R2 (EEUR). The following script takes 8-20 seconds to complete.
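The script itself wasn’t reproduced above; a representative sketch of that kind of workload, assuming DuckLake’s Postgres-catalog ATTACH syntax with placeholder connection details:

```sql
-- Representative sketch only: remote Postgres catalog (Neon) + R2 data path.
-- All connection details and names are placeholders.
INSTALL ducklake;
LOAD ducklake;
ATTACH 'ducklake:postgres:host=myhost.neon.tech dbname=lakecat user=me' AS lake
    (DATA_PATH 's3://my-bucket/lake/');
USE lake;
CREATE TABLE IF NOT EXISTS events (id INTEGER, msg VARCHAR);
-- Each statement is its own DuckLake commit: catalog round trips + S3 writes.
INSERT INTO events VALUES (1, 'hello');
SELECT * FROM events;
```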
I reproduced that very high latency, it's caused by the remote catalog DB. I will investigate
DuckLake is still an early Windmill feature; thanks for reporting
Hello,
I investigated a bit and couldn't get Windmill to perform any worse than vanilla DuckDB
The following :
took ~17s in both Windmill and the DuckDB CLI (replacing the attach statement with the correct syntax; same DB and same S3, both remote)
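The benchmark snippet wasn’t captured in the transcript; a sketch of a comparable test for the DuckDB CLI side, with placeholder catalog and bucket names:

```sql
-- Sketch (placeholders throughout): a few individual inserts against
-- a remote catalog and remote S3, timed in the DuckDB CLI.
.timer on
ATTACH 'ducklake:postgres:host=mycatalog.example.com dbname=lakecat' AS lake
    (DATA_PATH 's3://my-bucket/lake/');
CREATE TABLE IF NOT EXISTS lake.t (id INTEGER, v VARCHAR);
-- Three separate DuckLake transactions, not one batched write.
INSERT INTO lake.t VALUES (1, 'a');
INSERT INTO lake.t VALUES (2, 'b');
INSERT INTO lake.t VALUES (3, 'c');
```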
This is acceptable for 3 inserts, on Snowflake it's been much slower than that in my experience
There must be some bottleneck between your workers and the S3 remote or the catalog db
Did you try the script I provided in TypeScript (Bun) to check that?
Another note: it is indeed much slower on the Windmill cloud than on my computer (even though they run the same binary), and the common factor I see is that your setup and the Windmill cloud are in the US, while my catalog DB, S3, and I are all in the EU