tl_jacob · 3mo ago

Understanding worker memory logs

We are running Windmill on ECS (Windmill Enterprise Edition v1.480.0) and seeing OOM errors after enabling S3 log forwarding. After inspecting the logs, we see messages like
{"timestamp":"2025-06-13T16:24:27.339628Z","level":"INFO","message":"ping update, memory: container=1742MB, windmill=1842MB","worker":"<worker_id>","hostname":"<host_name>","target":"windmill_worker::worker"}
{"timestamp":"2025-06-13T16:24:27.339628Z","level":"INFO","message":"ping update, memory: container=1742MB, windmill=1842MB","worker":"<worker_id>","hostname":"<host_name>","target":"windmill_worker::worker"}
What is the distinction between "container" and "windmill" memory?
rubenf · 3mo ago
the 1842MB bit is suspicious indeed, windmill workers shouldn't use more than 100MB (for windmill itself). When you said you enabled S3 log forwarding, what do you mean exactly?
tl_jacob (OP) · 3mo ago
@rubenf Turned on + configured "Instance object Storage"
[screenshot attached]
rubenf · 3mo ago
It doesn't do just logs. We will need more info to investigate, but we want to get to the bottom of this. In your logs, is it a sudden rise for the windmill part (the windmill=X value), or did it slowly increase over time?
tl_jacob (OP) · 3mo ago
@rubenf it's a sudden spike
tl_jacob (OP) · 3mo ago
Here is the memory usage graph from the Metrics tab of this run
[memory usage graph attached]
tl_jacob (OP) · 3mo ago
amending the first plot one second...done
tl_jacob (OP) · 3mo ago
[amended memory usage plot attached]
tl_jacob (OP) · 3mo ago
^ This plot also includes data from messages like this:
{"timestamp":"2025-06-13T16:24:26.252261Z","level":"INFO","message":"job 01976a1a-1acb-229e-b7a0-91bc32573e21 on <worker> in <workspace> worker memory snapshot 2094640kB/1886576kB","target":"windmill_worker::handle_child","span":{"otel.name":"python run","name":"run_subprocess"}}
We've also seen cases where the windmill memory spikes after a job runs successfully: https://gist.github.com/treeline-jacob/44fdc08ff8f37c28a2d165e3d928c460
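In case it helps, here's a minimal sketch of how values like these can be pulled out of the JSONL log stream for plotting. It only assumes log lines shaped like the samples quoted above; it isn't Windmill code, and the meaning of the two kB values in the snapshot lines is reported as-is.
```python
import json
import re
import sys

# "ping update, memory: container=1742MB, windmill=1842MB"
PING_RE = re.compile(r"container=(\d+)MB, windmill=(\d+)MB")
# "... worker memory snapshot 2094640kB/1886576kB" (two kB values, passed through unchanged)
SNAPSHOT_RE = re.compile(r"worker memory snapshot (\d+)kB/(\d+)kB")

def parse_memory_lines(lines):
    """Yield (timestamp, kind, value_a, value_b) tuples from JSONL log lines."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines
        message = record.get("message", "")
        if m := PING_RE.search(message):
            yield record["timestamp"], "ping_mb", int(m.group(1)), int(m.group(2))
        elif m := SNAPSHOT_RE.search(message):
            yield record["timestamp"], "snapshot_kb", int(m.group(1)), int(m.group(2))

if __name__ == "__main__":
    # e.g. pipe the worker log file in and load the TSV into your plotting tool
    for ts, kind, a, b in parse_memory_lines(sys.stdin):
        print(f"{ts}\t{kind}\t{a}\t{b}")
```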
rubenf · 3mo ago
The Metrics tab reports the memory usage of the fork (the job process), not of windmill itself. Is that job producing tons of logs very fast?
tl_jacob (OP) · 3mo ago
@rubenf no, it's not generating a ton of logs very quickly. The jobs also run in my local dockerized Windmill environment without OOM'ing.
rubenf · 3mo ago
But do you see the same pattern there, where the windmill=X MB logs increase up to 2GB? And does that happen only when S3 storage is set, and not when it is unset?
tl_jacob (OP) · 3mo ago
1. (local environment) I don't see the same pattern where the windmill=X MB logs spike to 2GB locally. S3 storage is unset here.
2. (ECS Windmill) The windmill=X MB logs stay flat at ~17MB when S3 storage is not set.
rubenf · 3mo ago
So it only happens when you set S3 storage. Does windmill=X get much lower after the execution of that job? Would you be able to reproduce with a script you can share with us and that we could run ourselves?
tl_jacob (OP) · 3mo ago
@rubenf yes, I'm only seeing this when S3 storage is set, but I'm having trouble reliably reproducing it.
"Does windmill=X get much lower after the execution of that job?"
When windmill=X spikes that high, the ECS task OOMs and restarts.
"Would you be able to reproduce with a script you can share with us and that we could run ourselves?"
I'm having trouble reliably reproducing it, but I'll try my best.
tl_jacob (OP) · 3mo ago
@rubenf okay, I've discovered that the spike in windmill memory comes from the process that sends piptars to the S3 Python dependency cache. I created a GitHub issue with steps to reproduce here: https://github.com/windmill-labs/windmill/issues/5968#issue-3154973913
GitHub: bug: Memory spike during piptar upload · Issue #5968
rubenf · 3mo ago
Thanks a lot, that's very valuable.
tl_jacob (OP) · 3mo ago
@rubenf thank you for working on this issue so quickly!! Should we expect the memory spike to be fixed / improved in 1.500.0?
rubenf · 3mo ago
@tl_jacob yes, it should be fixed. I wasn't able to fully reproduce, but from first principles, uploading all piptars in parallel didn't make sense and could result in what you've seen, which is what got improved. A next improvement would be to ensure that no single piptar can take too much memory on building/upload.
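To sketch the idea in rough Python, purely as an illustration rather than Windmill's actual implementation (the bucket name, file layout, and boto3 client below are all assumptions): bounding the upload concurrency and streaming each piptar from disk keeps peak memory roughly flat no matter how many cached packages need to be pushed.
```python
import concurrent.futures
from pathlib import Path

import boto3  # illustrative only; Windmill's real uploader is not Python

s3 = boto3.client("s3")
BUCKET = "instance-object-storage"  # hypothetical bucket name
MAX_PARALLEL_UPLOADS = 4            # bounded, instead of uploading every piptar at once

def upload_piptar(path: Path) -> str:
    # upload_file streams the file from disk in chunks (multipart for large files),
    # so no single tarball is ever buffered fully in memory
    s3.upload_file(str(path), BUCKET, f"piptars/{path.name}")
    return path.name

def upload_all(piptar_dir: Path) -> None:
    piptars = sorted(piptar_dir.glob("*.tar"))
    with concurrent.futures.ThreadPoolExecutor(max_workers=MAX_PARALLEL_UPLOADS) as pool:
        for name in pool.map(upload_piptar, piptars):
            print(f"uploaded {name}")

if __name__ == "__main__":
    upload_all(Path("./pip_cache"))
```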
