develoco
develoco•2w ago

Workers show up in service logs, but not in the Workers UI

I have connected one native worker (ECS service) and one GPU worker (on-prem server, connected to the VPN/VPC). I have added routing to my VPN config and can connect to the PG DB using pgAdmin and psql when SSH-ed into both of these machines. Docker logs for the worker containers show a normal connection to the PG instance and, most importantly, the workers show up in the service logs (screenshot attached). The problem is that the Workers screen shows 0 workers for all three worker groups (screenshot attached). This also prevents me from creating a root admin user, since the Deno job that is (I assume) supposed to do this doesn't find an available worker. I can see that my job queue has some jobs, probably reflecting the fact that I tried to create the root user 3 times without success. Are there any additional ports I need to open besides 5432? If not, how can I further debug the issue? Thanks!
60 Replies
develoco
develocoOP•2w ago
Here's proof that the queue in the fresh server/DB has some jobs (admin account creation intents), plus logs from the on-prem worker showing that the DB connection succeeded. Interestingly, there are warnings about the DB being undersized. Could this be relevant?
develoco
develocoOP•2w ago
Not sure if this console output is relevant. The first one is an EE feature, but the one below it seems suspicious... My network tab only shows a single EE endpoint failing (/list); otherwise all normal.
rubenf
rubenf•2w ago
what is that bad request response exactly?
develoco
develocoOP•2w ago
Not sure, it seems like there is only a single bad request in the whole network tab, and that's the EE /list endpoint (expected), so it seems like the second console log comes from that request? :/
rubenf
rubenf•2w ago
actually that's from the queue drawer, it's unrelated
develoco
develocoOP•2w ago
I'd understand if the worker couldn't talk to the server/db at all, but the fact that I see both the native and the gpu worker in the service logs tells me at least the DB connection is working properly... Is there some kind of port I need to open in my SGs, or is it only one-way communication?
rubenf
rubenf•2w ago
no it should work
develoco
develocoOP•2w ago
I was hoping you will not say that 😄
rubenf
rubenf•2w ago
I will try to reproduce quickly, but we would probably know if it was a common issue
I can't reproduce
the relevant api call is: /api/workers/list?per_page=1000&ping_since=300
develoco
develocoOP•2w ago
That returns an empty list, status 200.
rubenf
rubenf•2w ago
then I would investigate the worker_ping table in the db
your logs show the ping is sent so maybe a timezone issue, unclear to me
develoco
develocoOP•2w ago
There are 2 hours of difference between the actual time (17:01) and the timestamp in the DB (15:01), but why would that be an issue?
rubenf
rubenf•2w ago
we only display the last 300s worker pings
but also it's probably just the timezone difference, which means the time is correct
develoco
develocoOP•2w ago
yup, I am in GMT+2 and the zone in that field is GMT+0, so all good there
rubenf
rubenf•2w ago
keep your instance up, verify that ping_at is less than 300s old, then look at the workers page
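A minimal sketch of that check in psql, reusing the worker_ping columns from the Windmill query quoted further down in this thread; seconds_since_ping is only an illustrative alias. Because the age is computed on the database server with now(), the client's timezone should not affect the result.

```sql
-- Show every recorded ping with its age in seconds, newest first.
-- Rows older than 300 seconds fall outside the window the Workers page displays.
SELECT worker,
       worker_group,
       ping_at,
       EXTRACT(EPOCH FROM (now() - ping_at))::integer AS seconds_since_ping
FROM worker_ping
ORDER BY ping_at DESC;
```

If this returns fresh rows while the API still returns an empty list, the difference is more likely to be in who is running the query than in the timestamps.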
develoco
develocoOP•2w ago
just did and they don't show up... but one question: are we sure that local time is correctly mapped to GMT before comparing the diff with ping_at? because if not, this would explain me getting the empty array back... I'm getting 2hrs of diff and everything greater than 300s is discarded
I'll try manually pinging the same API with a very large ping_since param to see if I start getting those
if yes, that will confirm the doubt
rubenf
rubenf•2w ago
yes i'm sure we handle timezone correctly
develoco
develocoOP•2w ago
I believe you, but let me quickly check just in case 😄
yeah, still empty string
develoco
develocoOP•2w ago
GitHub
windmill/backend/windmill-api/src/workers.rs at 422a02d8f78cae8e71a...
Open-source developer platform to power your entire infra and turn scripts into webhooks, workflows and UIs. Fastest workflow engine (13x vs Airflow). Open-source alternative to Retool and Temporal...
rubenf
rubenf•2w ago
this is the real query:
SELECT worker, worker_instance, EXTRACT(EPOCH FROM (now() - ping_at))::integer as last_ping, started_at, ip, jobs_executed,
CASE WHEN $4 IS TRUE THEN current_job_id ELSE NULL END as last_job_id, CASE WHEN $4 IS TRUE THEN current_job_workspace_id ELSE NULL END as last_job_workspace_id,
custom_tags, worker_group, wm_version, occupancy_rate, occupancy_rate_15s, occupancy_rate_5m, occupancy_rate_30m, memory, vcpus, memory_usage, wm_memory_usage
FROM worker_ping
WHERE ($1::integer IS NULL AND ping_at > now() - interval '5 minute') OR (ping_at > now() - ($1 || ' seconds')::interval)
ORDER BY ping_at desc LIMIT $2 OFFSET $3
query.ping_since,
per_page as i64,
offset as i64,
is_super_admin
develoco
develocoOP•2w ago
interesting, substituting the arguments (admin=TRUE, ping_since=300, offset=0, per_page=1000) returns all those native workers O.o
I am assuming the fact I have only a single default admin@windmill.dev user is not relevant?
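For reference, the substitution described above might look roughly like this when run directly in psql. It is a sketch: the column list is trimmed and the $4 super-admin branches are dropped for brevity, and the literals simply mirror the values mentioned in the message.

```sql
-- $1 = 300 (ping_since), $2 = 1000 (per_page), $3 = 0 (offset)
SELECT worker,
       worker_instance,
       EXTRACT(EPOCH FROM (now() - ping_at))::integer AS last_ping,
       started_at,
       ip,
       jobs_executed,
       custom_tags,
       worker_group,
       wm_version
FROM worker_ping
WHERE ping_at > now() - (300 || ' seconds')::interval
ORDER BY ping_at DESC
LIMIT 1000 OFFSET 0;
```

Note that this runs as whichever database user opened the session, which is not necessarily the role the Windmill server uses for the same query.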
rubenf
rubenf•2w ago
what about the gpu worker group? are you an admin on that workspace?
develoco
develocoOP•2w ago
Yes, this is the default superadmin account admin@windmill.dev
I am not sure if I had that worker on when I was performing the tests
Just checked and I didn't have it running, so the output was expected.
develoco
develocoOP•2w ago
and starting it makes it show in the query result immediately, so all good there
rubenf
rubenf•2w ago
Do you see it in the workers page?
develoco
develocoOP•2w ago
no
I take exactly the same worker config, just point it to my locally hosted instance (the same local network as the worker) and it works nicely, but pointing it to the RDS database doesn't work
rubenf
rubenf•2w ago
what's the result of:
/api/w/demo/users/whoami
replacing demo with your current workspace
and can you list your users in your RDS including their attributes, the equivalent of \du+ in psql
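Where \du+ is not convenient (for example from pgAdmin), roughly the same information can be read from the system catalogs. This is a sketch using the standard pg_roles and pg_auth_members catalogs; the filter on the pg_ prefix is only an assumption to hide built-in roles.

```sql
-- Role attributes, including login capability and Bypass RLS.
SELECT rolname, rolcanlogin, rolsuper, rolbypassrls
FROM pg_roles
WHERE rolname NOT LIKE 'pg\_%'
ORDER BY rolname;

-- Which roles are members of which other roles.
SELECT r.rolname AS role_name,
       m.rolname AS member_name
FROM pg_auth_members am
JOIN pg_roles r ON r.oid = am.roleid
JOIN pg_roles m ON m.oid = am.member
ORDER BY r.rolname, m.rolname;
```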
develoco
develocoOP•2w ago
{"workspace_id":"admin","email":"admin@windmill.dev","username":"admin@windmill.dev","is_admin":true,"is_super_admin":true,"created_at":"2025-05-20T07:50:09.178777677Z","groups":[],"operator":false,"disabled":false,"role":"superadmin","folders_read":[],"folders":[],"folders_owners":[],"name":null}
I didn't get the chance to create any user other than the default admin one created by migrations
rubenf
rubenf•2w ago
can you still list them with their attributes please, I want to check for Bypass RLS
also can you run the same query above but as windmill_admin instead of your db user
develoco
develocoOP•2w ago
seems like I have windmill_user and windmill_admin groups, not users... and when I try to connect to my DB using the windmill_admin/changeme it says the credentials are wrong
develoco
develocoOP•2w ago
windmill_admin has the Bypass_RLS set, _user doesn't
develoco
develocoOP•2w ago
and then the user I'm connecting to the db with is called "samantha", and that user is a member of the windmill_admin group (and windmill_user)
...so this is not possible since I don't have the windmill_admin user.
rubenf
rubenf•2w ago
you need to set role once connected
they have no login, it's normal
develoco
develocoOP•2w ago
got it, and I think we might be onto something 😄 when setting the role to windmill_admin I get no results back from the query! O.o
rubenf
rubenf•2w ago
it being bypass rls it doesn't make much sense
if you select * from worker_ping you also get no results?
develoco
develocoOP•2w ago
this works:
SET ROLE samantha; select * from worker_ping;
this doesn't:
SET ROLE windmill_admin; select * from worker_ping;
this doesn't work either:
SET ROLE windmill_user; select * from worker_ping;
rubenf
rubenf•2w ago
can you show all the policies and grants for windmill_admin on worker_ping
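One possible way to pull that information, as a sketch built on standard PostgreSQL catalog views and functions. The unqualified table name in has_table_privilege is resolved through the session's search_path, so a schema-qualified name may be needed in a non-default schema.

```sql
-- Table-level grants recorded for worker_ping (visible for roles the current user belongs to).
SELECT table_schema, grantee, privilege_type
FROM information_schema.role_table_grants
WHERE table_name = 'worker_ping'
ORDER BY grantee, privilege_type;

-- Row-level security policies defined on worker_ping, if any.
SELECT schemaname, policyname, roles, cmd, qual
FROM pg_policies
WHERE tablename = 'worker_ping';

-- Direct yes/no answer for one role and one privilege.
SELECT has_table_privilege('windmill_admin', 'worker_ping', 'SELECT') AS admin_can_select;
```

The psql meta-command \dp worker_ping shows the same grants and policies in a more compact form.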
develoco
develocoOP•2w ago
you hit the nail on the head! there were none 🙂
After granting them it all started working!
Thanks for the support Ruben, this is amazing! I'll keep recommending Windmill to my friends and colleagues, these kinds of things make so much difference!
rubenf
rubenf•2w ago
just to be sure, what did you do to make it work? you did a grant of worker_ping to windmill_admin?
What's odd is that if those were missing, you should have had an error
develoco
develocoOP•2w ago
yes, but I did have an error on the runs table as well, so I had to grant all tables to windmill_admin and then everything started working, my user creation job executed and the whole instance became ready for use
most probably those were not granted since I created a schema manually in a shared RDS DB and it wasn't called "public" but "samantha", could it be something with that?
rubenf
rubenf•2w ago
did you create the schema after the initial migrations?
develoco
develocoOP•2w ago
no, before
rubenf
rubenf•2w ago
you're supposed to use PG_SCHEMA=samantha
develoco
develocoOP•2w ago
when?
rubenf
rubenf•2w ago
on every server
that's the real fix
develoco
develocoOP•2w ago
sorry, but where should I set that env var? in my ECS task definitions (essentially in the docker container running my server)?
rubenf
rubenf•2w ago
yes
develoco
develocoOP•2w ago
ah ok I will do that
did I miss it in the documentation?
should this be set on workers as well?
rubenf
rubenf•2w ago
no need for the workers
develoco
develocoOP•2w ago
amazing, thanks
unfortunately, I'm still having PG permission issues even though I have set PG_SCHEMA=samantha in my server docker container... do I need to nuke the DB to force migrations to run again in this new setup?
rubenf
rubenf•2w ago
not too sure unfortunately
but if you can use the default schema, I would start with that
develoco
develocoOP•2w ago
I can't :/ it is a shared DB
each product has its own schema and a user
rubenf
rubenf•2w ago
Then PG_SCHEMA=samantha and nuke the db might work
develoco
develocoOP•2w ago
I will try that
thank you Ruben!
develoco
develocoOP•2w ago
migrations did run (screenshot), and then I got the setup screen, and right after saving my instance settings (so, before asking me to change the default superuser) I got errors on this route: /apps/get/g/all/setup_app?nomenubar=true&workspace=admins (screenshot attached)
SqlErr: error returned from database: relation "app" does not exist @apps.rs:494:17
SqlErr: error returned from database: relation "script" does not exist @scripts.rs:323:16
SqlErr: error returned from database: relation "websocket_trigger" does not exist @workspaces.rs:1321:26
develoco
develocoOP•2w ago
I can confirm that workers are pinging the new DB (I see the logs with samantha ROLE)
rubenf
rubenf•2w ago
can you set role as windmill_user and see if you see those tables in the samantha schema
also you're sure you passed PG_SCHEMA?
develoco
develocoOP•2w ago
absolutely sure, checked it in the container interactive session :/
in the end I solved it just by manually granting the permissions to both _user and _admin
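The thread does not show the exact statements that were run, so the following is only a guess at what such manual grants could look like. It assumes the schema name samantha and the windmill_admin and windmill_user roles mentioned above, and that the session belongs to a role allowed to grant on those objects; the breadth of privileges (ALL) is an assumption and may be wider than needed.

```sql
-- Let both roles resolve objects inside the custom schema.
GRANT USAGE ON SCHEMA samantha TO windmill_admin, windmill_user;

-- Grant access to the tables and sequences that already exist.
GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA samantha TO windmill_admin, windmill_user;
GRANT ALL PRIVILEGES ON ALL SEQUENCES IN SCHEMA samantha TO windmill_admin, windmill_user;

-- Cover tables created later by the current role (for example by future migrations).
ALTER DEFAULT PRIVILEGES IN SCHEMA samantha
    GRANT ALL ON TABLES TO windmill_admin, windmill_user;
```

Re-running the SET ROLE checks from earlier in the thread afterwards is a quick way to confirm that both roles can now read worker_ping.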
