Ross Creighton, 15h ago

Error: Connecting to database: pool timed out while waiting for an open connection

Windmill server and worker containers deployed to ECS exit with the above error. The RDS Postgres logs show "could not receive data from client: Connection reset by peer", which suggests a connection is being made but the client side (the server/worker containers) is killing it.

The RDS instance is a db.t4g.large. I confirmed the max connections setting is still the default (which allows ~900 connections for this instance size). The ECS tasks are deployed on Fargate with 2 vCPU and 4 GB of memory, and I don't see any evidence of memory or CPU constraints in monitoring.

I can successfully connect to the database via psql from an EC2 bastion using the same secrets and the same security group configuration as the ECS Fargate tasks, but connections from the Fargate tasks are getting killed.

The tasks use the ghcr.io/windmill-labs/windmill image. I've tried redeploying the services with the :main, :latest, and :1.547.0 tags. I also tried rebooting the RDS database and deleting everything from the jobs tables, although this is a new Windmill instance so there isn't much in the database. I'm at a bit of a loss at the moment.
3 Replies
rubenf, 15h ago
Maybe try running Windmill against the same DB outside of ECS to isolate the issue.
Ross Creighton (OP), 13h ago
Figured it out. The RDS instance was set to publicly accessible = true, so the DB endpoint had public DNS. My ECS tasks are in a private subnet with a NAT gateway, so DB connections from the tasks were being routed out through the NAT gateway to the public internet and then back in to the RDS public endpoint. The NAT gateway was timing out the connections. https://aws.amazon.com/blogs/networking-and-content-delivery/implementing-long-running-tcp-connections-within-vpc-networking/
Amazon Web Services
Implementing long-running TCP Connections within VPC networking | A...
Many network appliances define idle connection timeout to terminate connections after an inactivity period. For example, appliances like NAT Gateway, Amazon Virtual Private Cloud (Amazon VPC) Endpoints, and Network Load Balancer (NLB) currently have a fixed idle timeout of 350 seconds. Packets sent after the idle timeout expired aren’t delive...
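For anyone hitting the same symptom, one way to check for this kind of routing is to resolve the DB endpoint from inside the task and see whether it comes back as a private or public address. Below is a minimal sketch using only the Python standard library. The function names `resolve_host` and `classify_addresses` are made up for illustration, and the endpoint hostname is whatever your own instance uses (passed as a command-line argument here).

```python
import ipaddress
import socket
import sys


def classify_addresses(addresses):
    """Map each IP address string to 'private' or 'public'."""
    return {
        addr: "private" if ipaddress.ip_address(addr).is_private else "public"
        for addr in addresses
    }


def resolve_host(hostname, port=5432):
    """Return the sorted, de-duplicated IPs that `hostname` resolves to."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


# Run as: python check_endpoint.py <db-endpoint-hostname>
if __name__ == "__main__" and len(sys.argv) > 1:
    host = sys.argv[1]
    for addr, kind in classify_addresses(resolve_host(host)).items():
        print(f"{host} -> {addr} ({kind})")
```

If the endpoint resolves to a public address from a task in a private subnet, the traffic is likely leaving through the NAT gateway rather than staying inside the VPC.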
Ross Creighton (OP), 13h ago
Setting publicly accessible = false allowed the connections to route privately within the VPC and fixed the timeout issue.
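The thread doesn't say how the setting was changed (console, CLI, or infrastructure-as-code). As a rough sketch of one option, the same change can be made through boto3's RDS `modify_db_instance` call. The helper name `make_private` and the instance identifier in the usage comment are placeholders, not values from this thread.

```python
def make_private(rds_client, instance_id):
    """Turn off public accessibility for one RDS instance.

    `rds_client` is expected to be a boto3 RDS client, and `instance_id`
    is the DB instance identifier (hypothetical in the usage below).
    """
    return rds_client.modify_db_instance(
        DBInstanceIdentifier=instance_id,
        PubliclyAccessible=False,
        ApplyImmediately=True,  # apply now, not at the next maintenance window
    )


# Usage (requires boto3 and AWS credentials; identifier is a placeholder):
#   import boto3
#   make_private(boto3.client("rds"), "my-windmill-db")
```

Passing the client in as an argument keeps the helper easy to try against a stub before pointing it at a real instance.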