tl_jacob · 3mo ago

Understanding worker memory logs

We are running Windmill on ECS (Windmill Enterprise Edition v1.480.0) and seeing OOM errors after enabling S3 log forwarding. After inspecting the logs, we see messages like
{"timestamp":"2025-06-13T16:24:27.339628Z","level":"INFO","message":"ping update, memory: container=1742MB, windmill=1842MB","worker":"<worker_id>","hostname":"<host_name>","target":"windmill_worker::worker"}
{"timestamp":"2025-06-13T16:24:27.339628Z","level":"INFO","message":"ping update, memory: container=1742MB, windmill=1842MB","worker":"<worker_id>","hostname":"<host_name>","target":"windmill_worker::worker"}
What is the distinction between "container" and "windmill" memory?
rubenf · 3mo ago
the 1842MB bit is suspicious indeed, windmill workers shouldn't use more than 100MB (for windmill itself). When you said you enabled S3 log forwarding, what do you mean exactly?
tl_jacob (OP) · 3mo ago
@rubenf Turned on + configured "Instance object Storage"
[screenshot attached]
rubenf · 3mo ago
It doesn't do just logs. We will need more info to investigate, but we want to get to the bottom of this. In your logs, is it a sudden rise for the windmill part (the windmill=X value), or did it slowly increase over time?
tl_jacob (OP) · 3mo ago
@rubenf it's a sudden spike
tl_jacob (OP) · 3mo ago
Here is the memory usage graph from the Metrics tab of this run
[memory usage graph attached]
tl_jacob (OP) · 3mo ago
amending the first plot one second...done
tl_jacob (OP) · 3mo ago
[amended memory usage plot attached]
tl_jacob (OP) · 3mo ago
^ This plot also includes data from messages like this:
{"timestamp":"2025-06-13T16:24:26.252261Z","level":"INFO","message":"job 01976a1a-1acb-229e-b7a0-91bc32573e21 on <worker> in <workspace> worker memory snapshot 2094640kB/1886576kB","target":"windmill_worker::handle_child","span":{"otel.name":"python run","name":"run_subprocess"}}
We've also seen cases where the windmill memory spikes after a job runs successfully: https://gist.github.com/treeline-jacob/44fdc08ff8f37c28a2d165e3d928c460
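In case it helps, here's a minimal sketch of how values like these can be pulled out of the JSONL log stream for plotting. It only assumes log lines shaped like the samples quoted above; it isn't Windmill code, and the meaning of the two kB values in the snapshot lines is reported as-is.
```python
import json
import re
import sys

# "ping update, memory: container=1742MB, windmill=1842MB"
PING_RE = re.compile(r"container=(\d+)MB, windmill=(\d+)MB")
# "... worker memory snapshot 2094640kB/1886576kB" (two kB values, passed through unchanged)
SNAPSHOT_RE = re.compile(r"worker memory snapshot (\d+)kB/(\d+)kB")

def parse_memory_lines(lines):
    """Yield (timestamp, kind, value_a, value_b) tuples from JSONL log lines."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines
        message = record.get("message", "")
        if m := PING_RE.search(message):
            yield record["timestamp"], "ping_mb", int(m.group(1)), int(m.group(2))
        elif m := SNAPSHOT_RE.search(message):
            yield record["timestamp"], "snapshot_kb", int(m.group(1)), int(m.group(2))

if __name__ == "__main__":
    # e.g. pipe the worker log file in and load the TSV into your plotting tool
    for ts, kind, a, b in parse_memory_lines(sys.stdin):
        print(f"{ts}\t{kind}\t{a}\t{b}")
```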
rubenf · 3mo ago
The Metrics tab reports the memory usage of the fork (the job process), not of windmill itself. Is that job producing tons of logs very fast?
tl_jacob (OP) · 3mo ago
@rubenf no, it's not generating a ton of logs very quickly. The jobs also run in my local dockerized Windmill environment without OOM'ing.
rubenf · 3mo ago
But do you see the same pattern there, where the windmill=X MB logs increase up to 2GB? And does that happen only when S3 storage is set, and not when it is unset?
tl_jacob (OP) · 3mo ago
1. (local environment) I don't see the same pattern where the windmill=X MB logs spike to 2GB locally. S3 storage is unset here.
2. (ECS Windmill) The windmill=X MB logs stay flat at ~17MB when S3 storage is not set.
rubenf · 3mo ago
So it only happens when you set S3 storage. Does windmill=X get much lower after the execution of that job? Would you be able to reproduce with a script you can share with us and that we could run ourselves?
tl_jacob (OP) · 3mo ago
@rubenf yes, I'm only seeing this when S3 storage is set, but I'm having trouble reliably reproducing it.
"Does windmill=X get much lower after the execution of that job?"
When windmill=X spikes that high, the ECS task OOMs and restarts.
"Would you be able to reproduce with a script you can share with us and that we could run ourselves?"
I'm having trouble reliably reproducing it, but I'll try my best.
tl_jacob (OP) · 3mo ago
@rubenf okay, I've discovered that the spike in windmill memory comes from the process that sends piptars to the S3 Python dependency cache. I created a GitHub issue with steps to reproduce here: https://github.com/windmill-labs/windmill/issues/5968#issue-3154973913
GitHub: bug: Memory spike during piptar upload · Issue #5968
rubenf · 3mo ago
Thanks a lot, that's very valuable.
tl_jacob (OP) · 3mo ago
@rubenf thank you for working on this issue so quickly!! Should we expect the memory spike to be fixed / improved in 1.500.0?
rubenf · 3mo ago
@tl_jacob yes, it should be fixed. I wasn't able to fully reproduce, but from first principles, uploading all piptars in parallel didn't make sense and could result in what you've seen, which is what got improved. A next improvement would be to ensure that no single piptar can take too much memory on building/upload.
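To sketch the idea in rough Python, purely as an illustration rather than Windmill's actual implementation (the bucket name, file layout, and boto3 client below are all assumptions): bounding the upload concurrency and streaming each piptar from disk keeps peak memory roughly flat no matter how many cached packages need to be pushed.
```python
import concurrent.futures
from pathlib import Path

import boto3  # illustrative only; Windmill's real uploader is not Python

s3 = boto3.client("s3")
BUCKET = "instance-object-storage"  # hypothetical bucket name
MAX_PARALLEL_UPLOADS = 4            # bounded, instead of uploading every piptar at once

def upload_piptar(path: Path) -> str:
    # upload_file streams the file from disk in chunks (multipart for large files),
    # so no single tarball is ever buffered fully in memory
    s3.upload_file(str(path), BUCKET, f"piptars/{path.name}")
    return path.name

def upload_all(piptar_dir: Path) -> None:
    piptars = sorted(piptar_dir.glob("*.tar"))
    with concurrent.futures.ThreadPoolExecutor(max_workers=MAX_PARALLEL_UPLOADS) as pool:
        for name in pool.map(upload_piptar, piptars):
            print(f"uploaded {name}")

if __name__ == "__main__":
    upload_all(Path("./pip_cache"))
```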
