develoco•2w ago

Workers show up in service logs, but not in the Workers UI

I have connected one native worker (ECS service) and one GPU worker (on-prem server, connected to VPN/VPC). I have added routing to my VPN config and can connect to the PG DB using pgAdmin and psql when SSH-ed into both of these machines. Docker logs for the worker containers show a normal connection to the PG instance and, most importantly, the workers show up in the service logs (screenshot attached). The problem is that the Workers screen shows 0 workers for all three worker groups (screenshot attached). This also prevents me from creating a root admin user, since the Deno job which (I assume) is supposed to do this doesn't find an available worker. I can see that my job queue has some jobs, probably reflecting the fact that I tried to create the root user 3 times without success. Are there any additional ports I need to open besides 5432? If not, how can I further debug the issue? Thanks!

60 Replies

develocoOP•2w ago

Here's proof that the queue in the fresh server/DB has some jobs (admin account creation intents), plus logs from the on-prem worker showing that the DB connection happened successfully. Interestingly, there are warnings about the DB being undersized. Could this be relevant?

develocoOP•2w ago

Not sure if this console output is relevant. The first one is an EE feature, but the one below it seems suspicious... My network tab only shows a single EE endpoint failing (/list), otherwise all normal.

rubenf•2w ago

what is that bad request response exactly?

develocoOP•2w ago

Not sure, it seems like there is only a single bad request in the whole network tab, and that's the EE /list endpoint (expected), so it seems like the second console log comes from that request? :/

rubenf•2w ago

actually that's from the queue drawer, it's unrelated

develocoOP•2w ago

I'd understand if the worker couldn't talk to the server/DB at all, but the fact that I see both the native and GPU worker in the service logs tells me at least the DB connection is working properly... Is there some kind of port I need to open in my SGs, or is it only one-way communication?

rubenf•2w ago

no it should work

develocoOP•2w ago

I was hoping you would not say that 😄

rubenf•2w ago

I will try to reproduce quickly, but we would probably know if it was a common issue. I can't reproduce. The relevant API call is: /api/workers/list?per_page=1000&ping_since=300
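
For anyone repeating this check outside the browser, here is a minimal sketch of the call mentioned above. The path and query parameters come from the message; the base URL, the helper names and the `Authorization: Bearer` header are assumptions for illustration, not something confirmed in this thread.

```python
import json
from urllib.parse import urlencode, urljoin
from urllib.request import Request, urlopen


def workers_list_url(base_url: str, ping_since: int = 300, per_page: int = 1000) -> str:
    """Build the workers list URL quoted in the thread (hypothetical helper)."""
    query = urlencode({"per_page": per_page, "ping_since": ping_since})
    return urljoin(base_url.rstrip("/") + "/", "api/workers/list") + "?" + query


def fetch_workers(base_url: str, token: str, ping_since: int = 300):
    """Fetch and decode the list. The bearer-token header is an assumption."""
    request = Request(
        workers_list_url(base_url, ping_since),
        headers={"Authorization": f"Bearer {token}"},
    )
    with urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Placeholder host; replace with your own instance.
print(workers_list_url("https://windmill.example.com"))
```

An empty list with status 200 means the request itself succeeded and the filtering happened on the database side, which is where the rest of the thread ends up looking.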

develocoOP•2w ago

That returns an empty list, status 200.

rubenf•2w ago

then I would investigate the worker_ping table in the db. your logs show the ping is sent, so maybe a timezone issue, unclear to me

develocoOP•2w ago

There is a 2-hour difference between the actual time (17:01) and the timestamp in the DB (15:01), but why would that be an issue?

rubenf•2w ago

we only display worker pings from the last 300s, but it's probably just the timezone difference, which means the time is correct

develocoOP•2w ago

yup, I am in GMT+2 and the zone in that field is GMT+0, so all good there
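
To make the timezone point concrete, here is a small sketch of the arithmetic only (it is not Windmill's code, and the function names and example timestamps are made up for illustration). When both timestamps carry their offset, a ping stored as 15:01 GMT+0 and a local clock reading 17:01 GMT+2 are the same instant, so the 300s window is unaffected. The 2-hour gap only appears if the offsets are thrown away before subtracting.

```python
from datetime import datetime, timedelta, timezone


def ping_age_seconds(ping_at: datetime, now: datetime) -> float:
    """Age of a ping in seconds; both values must be timezone-aware."""
    if ping_at.tzinfo is None or now.tzinfo is None:
        raise ValueError("both timestamps must be timezone-aware")
    return (now - ping_at).total_seconds()


def is_recent(ping_at: datetime, now: datetime, window_seconds: int = 300) -> bool:
    """True if the ping falls inside the display window."""
    return ping_age_seconds(ping_at, now) < window_seconds


# Illustrative values: stored in GMT+0, wall clock in GMT+2, 30 seconds apart.
ping_at_utc = datetime(2025, 5, 20, 15, 1, 0, tzinfo=timezone.utc)
now_local = datetime(2025, 5, 20, 17, 1, 30, tzinfo=timezone(timedelta(hours=2)))

aware_age = ping_age_seconds(ping_at_utc, now_local)

# The pitfall: dropping the offsets makes the same pair look 2 hours apart.
naive_age = (now_local.replace(tzinfo=None) - ping_at_utc.replace(tzinfo=None)).total_seconds()

print(aware_age, is_recent(ping_at_utc, now_local), naive_age)
```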

rubenf•2w ago

keep your instance up, verify that ping_at is less than 300s old, then look at the workers page

develocoOP•2w ago

just did and they don't show up... but one question: are we sure that local time is correctly mapped to GMT before comparing the diff with ping_at? because if not, this would explain me getting the empty array back... I'm getting 2 hrs of diff and everything greater than 300s is discarded. I'll try manually pinging the same API with a very large ping_since param to see if I start getting those. If yes, that will confirm the doubt.

rubenf•2w ago

yes i'm sure we handle timezone correctly

develocoOP•2w ago

I believe you, but let me quickly check just in case 😄 Yeah, still an empty list.

develocoOP•2w ago

https://github.com/windmill-labs/windmill/blob/422a02d8f78cae8e71ace405f1d423978054cb0b/backend/windmill-api/src/workers.rs#L98 so weird, looking into the code it should return it :/

rubenf•2w ago

this is the real query:

SELECT worker, worker_instance,  EXTRACT(EPOCH FROM (now() - ping_at))::integer as last_ping, started_at, ip, jobs_executed,
        CASE WHEN $4 IS TRUE THEN current_job_id ELSE NULL END as last_job_id, CASE WHEN $4 IS TRUE THEN current_job_workspace_id ELSE NULL END as last_job_workspace_id, 
        custom_tags, worker_group, wm_version, occupancy_rate, occupancy_rate_15s, occupancy_rate_5m, occupancy_rate_30m, memory, vcpus, memory_usage, wm_memory_usage
        FROM worker_ping
        WHERE ($1::integer IS NULL AND ping_at > now() - interval '5 minute') OR (ping_at > now() - ($1 || ' seconds')::interval)
        ORDER BY ping_at desc LIMIT $2 OFFSET $3

        query.ping_since,
        per_page as i64,
        offset as i64,
        is_super_admin

develocoOP•2w ago

interesting, substituting the arguments (admin=TRUE, ping_since=300, offset=0, per_page=1000) returns all those native workers O.o I am assuming the fact that I only have the single default admin@windmill.dev user is not relevant?

rubenf•2w ago

what about the gpu worker group? are you an admin on that workspace?

develocoOP•2w ago

Yes, this is the default superadmin account admin@windmill.dev. I am not sure if I had that worker on when I was performing the tests. Just checked and I didn't have it running, so the output was expected.

develocoOP•2w ago

and starting it makes it show in the query result immediately, so all good there

rubenf•2w ago

Do you see it in the workers page?

develocoOP•2w ago

no. If I take exactly the same worker config and just point it to my locally hosted instance (on the same local network as the worker), it works nicely, but pointing it to the RDS database doesn't work

rubenf•2w ago

what's the result of:

/api/w/demo/users/whoami

replacing demo with your current workspace. and can you list the users in your RDS including their attributes, the equivalent of \du+ in psql

develocoOP•2w ago

{"workspace_id":"admin","email":"admin@windmill.dev","username":"admin@windmill.dev","is_admin":true,"is_super_admin":true,"created_at":"2025-05-20T07:50:09.178777677Z","groups":[],"operator":false,"disabled":false,"role":"superadmin","folders_read":[],"folders":[],"folders_owners":[],"name":null}

I didn't get the chance to create any user other than the default admin one created by migrations

rubenf•2w ago

can you still list them with their attributes please, I want to check for Bypass RLS. also, can you run the same query above but as windmill_admin instead of your db user

develocoOP•2w ago

seems like I have windmill_user and windmill_admin groups, not users... and when I try to connect to my DB using windmill_admin/changeme it says the credentials are wrong

develocoOP•2w ago

windmill_admin has the Bypass_RLS set, _user doesn't

develocoOP•2w ago

and the user I'm connecting to the db with is called "samantha", and that user is a member of the windmill_admin group (and windmill_user) ...so running the query as windmill_admin is not possible, since I don't have a windmill_admin user.

rubenf•2w ago

you need to set role once connected. they have no login, it's normal

develocoOP•2w ago

got it, and I think we might be onto something 😄 when setting the role to windmill_admin I get no results back from the query! O.o

rubenf•2w ago

it being bypass rls, it doesn't make much sense. if you select * from worker_ping, do you also get no results?

develocoOP•2w ago

this works:

SET ROLE samantha; select * from worker_ping;

this doesn't:

SET ROLE windmill_admin; select * from worker_ping;

this doesn't work either:

SET ROLE windmill_user; select * from worker_ping;

rubenf•2w ago

can you show all the policies and grants for windmill_admin on worker_ping
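
One possible way to answer this from psql is sketched below. It is an illustration under assumptions, not an official Windmill procedure. The helper names are hypothetical; the statements it prints rely only on standard PostgreSQL features (`has_schema_privilege`, `has_table_privilege`, `information_schema.role_table_grants`, `pg_policies`). The role, schema and table names are the ones mentioned in this thread.

```python
def sql_literal(value: str) -> str:
    """Quote a value as a SQL string literal (doubles embedded single quotes)."""
    return "'" + value.replace("'", "''") + "'"


def privilege_checks(role: str, schema: str, table: str) -> list:
    """Return read-only statements that show what `role` may do on schema.table."""
    qualified = f"{schema}.{table}"
    return [
        # Can the role look inside the schema at all?
        f"SELECT has_schema_privilege({sql_literal(role)}, {sql_literal(schema)}, 'USAGE');",
        # Can the role read the table?
        f"SELECT has_table_privilege({sql_literal(role)}, {sql_literal(qualified)}, 'SELECT');",
        # Explicit grants on the table that are visible to the current user.
        "SELECT grantee, privilege_type FROM information_schema.role_table_grants "
        f"WHERE table_schema = {sql_literal(schema)} AND table_name = {sql_literal(table)};",
        # Row-level security policies defined on the table.
        "SELECT policyname, roles, cmd, qual FROM pg_policies "
        f"WHERE schemaname = {sql_literal(schema)} AND tablename = {sql_literal(table)};",
    ]


statements = privilege_checks("windmill_admin", "samantha", "worker_ping")
for statement in statements:
    print(statement)
```

Paste the printed statements into a psql session opened as the login user (samantha in this thread). If either of the first two returns `f`, or the grants query returns no rows for the role, that matches the missing-grant situation described in the next reply.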

develocoOP•2w ago

you hit the nail on the head! there were none 🙂 After granting them it all started working! Thanks for the support Ruben, this is amazing! I'll keep recommending Windmill to my friends and colleagues, these kinds of things make so much difference!

rubenf•2w ago

just to be sure, what did you do to make it work? you did a grant on worker_ping to windmill_admin? What's odd is that if those were missing you should have had an error

develocoOP•2w ago

yes, but I did have an error on the runs table as well, so I had to grant all tables to windmill_admin, and then everything started working: my user creation job executed and the whole instance became ready for use. Most probably those were not granted because I created a schema manually in a shared RDS DB, and it wasn't called "public" but "samantha". Could it be something with that?

rubenf•2w ago

did you create the schema after the initial migrations?

develocoOP•2w ago

no, before

rubenf•2w ago

you're supposed to use PG_SCHEMA=samantha

develocoOP•2w ago

when?

rubenf•2w ago

on every server. that's the real fix

develocoOP•2w ago

sorry, but where should I set that env var? in my ECS task definitions (essentially in the docker container running my server)?

rubenf•2w ago

yes

develocoOP•2w ago

ah ok, I will do that. did I miss it in the documentation? should this be set on workers as well?

rubenf•2w ago

no need for the workers

develocoOP•2w ago

amazing, thanks. unfortunately, I'm still having PG permission issues even though I have set PG_SCHEMA=samantha in my server docker container... do I need to nuke the DB to force migrations to run again in this new setup?

rubenf•2w ago

not too sure unfortunately, but if you can use the default schema, I would start with that

develocoOP•2w ago

I can't :/ it is a shared DB, each product has its own schema and user

rubenf•2w ago

Then PG_SCHEMA=samantha and nuke the db might work

develocoOP•2w ago

I will try that, thank you Ruben!

develocoOP•2w ago

migrations did run (screenshot), and then I got the setup screen. Right after saving my instance settings (so, before it asked me to change the default superuser) I got errors on this route: apps/get/g/all/setup_app?nomenubar=true&workspace=admins (screenshot attached)

SqlErr: error returned from database: relation "app" does not exist @apps.rs:494:17

SqlErr: error returned from database: relation "script" does not exist @scripts.rs:323:16

SqlErr: error returned from database: relation "websocket_trigger" does not exist @workspaces.rs:1321:26

develocoOP•2w ago

I can confirm that workers are pinging the new DB (I see the logs with samantha ROLE)

rubenf•2w ago

can you set role as windmill_user and see if you see those tables in the samantha schema? also, you're sure you passed PG_SCHEMA?

develocoOP•2w ago

absolutely sure, checked it in the container interactive session :/ in the end I solved it just by manually granting the permissions to both _user and _admin
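
The exact statements used were not posted, so the following is only a sketch of what such manual grants might look like for a custom schema. It assumes standard PostgreSQL `GRANT` syntax. The helper names are hypothetical, and the schema `USAGE` and sequence grants are an assumption beyond the "grant all tables" described earlier. `GRANT ALL` is deliberately broad; narrow it if your security policy requires.

```python
def quote_ident(name: str) -> str:
    """Quote a SQL identifier (doubles embedded double quotes)."""
    return '"' + name.replace('"', '""') + '"'


def grant_statements(schema: str, roles: list) -> list:
    """Return GRANT statements giving each role access to everything in `schema`."""
    quoted_schema = quote_ident(schema)
    statements = []
    for role in roles:
        quoted_role = quote_ident(role)
        # Without USAGE on the schema, table-level grants are not enough.
        statements.append(f"GRANT USAGE ON SCHEMA {quoted_schema} TO {quoted_role};")
        statements.append(f"GRANT ALL ON ALL TABLES IN SCHEMA {quoted_schema} TO {quoted_role};")
        # Assumption: sequences are included so inserts relying on them keep working.
        statements.append(f"GRANT ALL ON ALL SEQUENCES IN SCHEMA {quoted_schema} TO {quoted_role};")
    return statements


grants = grant_statements("samantha", ["windmill_user", "windmill_admin"])
for statement in grants:
    print(statement)
```

These grants only cover objects that exist when they are run, so tables created by later migrations may need the same treatment again. Setting PG_SCHEMA on the servers, as suggested above, remains the recommended fix from this thread.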
