Intermittent 401 errors on private NPM packages
We're seeing some 401 errors in our Bun scripts during bun install of the form:
These errors only happen occasionally, and I can't figure out what determines when they occur. My assumption would be that specific workers are having trouble installing the package, but I am unsure how I would even debug this.
We have Bunfig install scopes configured at the Instance level in Windmill, so my assumption would be that all workers should be able to install private packages. I've also tried clearing the cache multiple times, which I thought used to fix the issue, but I cannot confirm that, as the errors seem to just come back eventually.
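For context, the instance-level setting effectively boils down to a bunfig install scopes block along these lines; the scope name, token variable, and registry URL below are placeholders rather than our real values (we pull private packages from GitHub's npm registry):

```toml
[install.scopes]
# placeholder scope/token/URL; the real values come from the instance-level
# "bunfig install scopes" setting in Windmill
"@our-org" = { token = "$GITHUB_PACKAGES_TOKEN", url = "https://npm.pkg.github.com/" }
```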
I don't know how the bunfig install scopes are passed to the workers, but it may be worth mentioning that we use autoscaling, so it makes some sense to me that a race condition in the worker initialization itself could be causing this. Again, I am not sure how to confirm or rule this out, so I would appreciate any pointers on how to collect more information about this issue.
Oh, I almost forgot: we're running Windmill EE v1.468.0, hosted on fly.io; autoscaling is set up with a custom script that creates and destroys machines with flyctl accordingly
We will investigate
@invakid404 the logic we have doesn't seem susceptible to a race condition
you should see a "Loaded setting" log
at worker start with
bunfig_install_scopes
as the setting name for that worker
and a bunfig.toml is generated from it for every job prior to install
you can check that by toggling "keep job dir" in the core settings, ssh-ing into the worker after the error, and checking that the bunfig is there in the job dir
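something like this should surface any recently generated bunfig on the worker (assuming the default /tmp/windmill layout mentioned later in this thread):

```bash
# print any bunfig.toml generated in the last hour under the worker's job dirs
find /tmp/windmill -name bunfig.toml -mmin -60 -exec cat {} \;
```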
also, if you never had those issues before, it might be due to a bun bug? We recently updated to 1.2.3
I'll check out the logs and I'll let you know
I cannot tell you with confidence whether it happened after the bun upgrade unfortunately
I suspected it may be a more niche bug as I didn't find any other reports of this
what's for sure is that we haven't changed that logic in a long long time
@rubenf this is interesting, I can see the "Loaded setting" log at some point before the job, I can see the bunfig.toml in the job directory, yet the job still failed
here are the only relevant jobs related to the job that failed
keep in mind it took me like 50 attempts to get it to fail
so it happens very rarely, but it definitely happens
i am wondering whether it is possible this is a GHCR issue and not a windmill issue now
but I find it unlikely for it to cut me off for a single run
then succeed on the next
if it is indeed GHCR cutting me off
it is the same token and the same package each time
It's either a file system sync issue or a bun issue
Or a ghcr issue
as I said, I highly doubt it is a GHCR issue; only individual install attempts fail
then it succeeds right after
I don't have any fancy mounts for /tmp/windmill specifically
and I am not sure how I would check whether it is a bun issue
it is also worth mentioning that we've been using private GHCR packages for quite some time
I am currently looking for the first instance of this kind of error that I can find
if I am not mistaken, the first time we've seen such an error according to our slack webhook channel is Feb 19, and it appears we were running Windmill EE 1.463.3 at the time
if that helps narrow it down at all
There is not much we can do on our side
We write the bunfig file
And then bun install
There is nothing fancy that we do that would cause the issue to be windmill specific
well this is causing me production issues and i need a solution
so if you have any suggestions at all
i'd take them
i don't see why windmill couldn't retry the bun install if it crashes
considering it does appear to solve the problem
We would need to detect that it is an issue that may be temporary
You mentioned auto scaling, does it happen when it's the first job of that kind on a worker?
Also, if you use our S3 instance storage, the bundles are saved there
Saving the need for bun install
Or at least doing it all the time
autoscaling was a shot in the dark, it does not appear to have anything to do with it
it happened again as we speak, same exact error
i sshed into the worker, ran bun install myself
and it succeeded
so idk
and we cannot use the S3 instance storage unfortunately as we are on the Windmill Pro plan
Right
in theory i could hunt down every single node in every single flow that uses any private package and slap a retry on top
but i am bound to miss at least one and find out too late about it
so i am not quite sure what I could do on my end
I will see what we can do
to me it sounds like if a script deployed successfully and it has a generated lockfile and so on
it is somewhat reasonable to assume that bun install should succeed
yet sometimes it doesn't on our end
so i don't know
Yes I agree with your assumptions
It doesn't mean that it's windmill failing per se
What it does is very straightforward
For now I would mount a volume
That will be the most straightforward
mounting a volume is an absolute pain on Fly as their VMs lack the necessary modules, so you have to resort to userspace solutions
for the time being, I replaced the bun executable in the docker image with a wrapper script that retries a few times when the command is bun install (rough sketch below)
and I'll set up an alert to fire if that ever happens so I can keep a note of whether the issue is gone or not
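roughly what that wrapper can look like (a sketch only; the bun-real location, retry count, and sleep are arbitrary choices, not anything Windmill-specific):

```bash
#!/bin/bash
# stand-in for the real bun binary, which was moved out of the way beforehand;
# retries `bun install` a few times and passes every other command through
REAL_BUN=/usr/local/bin/bun-real   # illustrative: wherever the original binary was moved

if [ "$1" = "install" ]; then
  for attempt in 1 2 3; do
    "$REAL_BUN" "$@" && exit 0
    echo "bun install failed (attempt $attempt), retrying..." >&2
    sleep 2
  done
  exit 1
fi

exec "$REAL_BUN" "$@"
```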
@invakid404 there's a simpler option: you can do that in the init script
the install dir for bun would be
/tmp/windmill/cache_nomount/bun/
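a minimal sketch of that variant, assuming the bun binary Windmill invokes actually lives in that directory (worth verifying) and reusing the same retry-wrapper idea:

```bash
#!/bin/bash
# init script sketch: swap the bun binary under /tmp/windmill/cache_nomount/bun/
# for a wrapper that retries `bun install`; paths and retry logic are illustrative
BUN_DIR=/tmp/windmill/cache_nomount/bun
if [ -x "$BUN_DIR/bun" ] && [ ! -e "$BUN_DIR/bun-real" ]; then
  mv "$BUN_DIR/bun" "$BUN_DIR/bun-real"
  cat > "$BUN_DIR/bun" <<'EOF'
#!/bin/bash
REAL=/tmp/windmill/cache_nomount/bun/bun-real
if [ "$1" = "install" ]; then
  for i in 1 2 3; do "$REAL" "$@" && exit 0; sleep 2; done
  exit 1
fi
exec "$REAL" "$@"
EOF
  chmod +x "$BUN_DIR/bun"
fi
```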
we already have a custom dockerfile so it's fine
mostly because we run tailscale on all workers for an unrelated reason
and because we need a userspace solution for a logs shared volume