Intermittent 401 errors on private NPM packages
We're seeing some 401 errors in our Bun scripts during bun install of the form:
These errors only happen occasionally, and I can't figure out what determines when they occur. My assumption would be that specific workers are having trouble installing the package, but I am unsure how I would even debug this.
We have Bunfig install scopes configured at the Instance level in Windmill, so my assumption would be that all workers should be able to install private packages. I've also tried clearing the cache multiple times, which I thought used to fix the issue, but I cannot confirm that, as the errors seem to just come back eventually.
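For context, the instance-level setting effectively boils down to a bunfig install scopes block along these lines; the scope name, token variable, and registry URL below are placeholders rather than our real values (we pull private packages from GitHub's npm registry):

```toml
[install.scopes]
# placeholder scope/token/URL; the real values come from the instance-level
# "bunfig install scopes" setting in Windmill
"@our-org" = { token = "$GITHUB_PACKAGES_TOKEN", url = "https://npm.pkg.github.com/" }
```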
I don't know how the bunfig install scopes are passed to the workers, but it may be worth mentioning that we use autoscaling, so it makes some sense to me that a race condition in the worker initialization itself could be causing this. Again, I am not sure how to confirm or rule this out, so I would appreciate any pointers on how to collect more information about this issue.
Oh, I almost forgot: we're running Windmill EE v1.468.0, hosted on fly.io; autoscaling is set up with a custom script that creates and destroys machines with flyctl accordingly
We will investigate
@invakid404 the logic we have doesn't seem susceptible to a race condition
you should see a "Loaded setting" log
at worker start with
bunfig_install_scopes
as the setting name for that worker
and a bunfig.toml is generated from it for every job prior to install
you can check that by toggling "keep job dir" in the core settings, ssh-ing into the worker after the error, and checking that the bunfig is there in the job dir
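something like this should surface any recently generated bunfig on the worker (assuming the default /tmp/windmill layout mentioned later in this thread):

```bash
# print any bunfig.toml generated in the last hour under the worker's job dirs
find /tmp/windmill -name bunfig.toml -mmin -60 -exec cat {} \;
```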
also, if you never had those issues before, it might be due to a bun bug? We recently updated to 1.2.3
I'll check out the logs and I'll let you know
I cannot tell you with confidence whether it happened after the bun upgrade unfortunately
I suspected it may be a more niche bug as I didn't find any other reports of this
what's for sure is that we haven't changed that logic in a long long time
@rubenf this is interesting, I can see the "Loaded setting" log at some point before the job, I can see the bunfig.toml in the job directory, yet the job still failed
here are the only relevant jobs related to the job that failed
keep in mind it took me like 50 attempts to get it to fail
so it happens very rarely, but it definitely happens
i am wondering whether it is possible this is a GHCR issue and not a windmill issue now
but I find it unlikely for it to cut me off for a single run
then succeed on the next
if it is indeed GHCR cutting me off
it is the same token and the same package each time
It's either a file system sync issue or a bun issue
Or a ghcr issue
as I said, I highly doubt it is a GHCR issue; only individual install attempts fail
then it succeeds right after
I don't have any fancy mounts for /tmp/windmill specifically
and I am not sure how I would check whether it is a bun issue
it is also worth mentioning that we've been using private GHCR packages for quite some time
I am currently looking for the first instance of this kind of error that I can find
if I am not mistaken, the first time we've seen such an error according to our slack webhook channel is Feb 19, and it appears we were running Windmill EE 1.463.3 at the time
if that helps narrow it down at all
There is not much we can do on our side
We write the bunfig file
And then bun install
There is nothing fancy that we do that would cause the issue to be windmill specific
well this is causing me production issues and i need a solution
so if you have any suggestions at all
i'd take them
i don't see why windmill couldn't retry the bun install if it crashes
considering it does appear to solve the problem
We would need to detect that it is an issue that may be temporary
You mentioned auto scaling, does it happen when it's the first job of that kind on a worker?
Also, if you use our S3 instance storage, the bundles are saved there
Saving the need for bun install
Or at least doing it all the time
autoscaling was a shot in the dark, it does not appear to have anything to do with it
it happened again as we speak, same exact error
i sshed into the worker, ran bun install myself
and it succeeded
so idk
and we cannot use the S3 instance storage unfortunately as we are on the Windmill Pro plan
Right
in theory i could hunt down every single node in every single flow that uses any private package and slap a retry on top
but i am bound to miss at least one and find out too late about it
so i am not quite sure what I could do on my end
I will see what we can do
to me it sounds like if a script deployed successfully and it has a generated lockfile and so on
it is somewhat reasonable to assume that bun install should succeed
yet sometimes it doesn't on our end
so i don't know
Yes I agree with your assumptions
It doesn't mean that it's windmill failing per se
What it does is very straightforward
For now I would mount a volume
That will be the most straightforward
mounting a volume is an absolute pain on Fly as their VMs lack the necessary modules, so you have to resort to userspace solutions
for the time being, I replaced the bun executable in the docker image with a wrapper script that retries a few times when the command is bun install (rough sketch below)
and I'll set up an alert to fire if that ever happens so I can keep a note of whether the issue is gone or not
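roughly what that wrapper can look like (a sketch only; the bun-real location, retry count, and sleep are arbitrary choices, not anything Windmill-specific):

```bash
#!/bin/bash
# stand-in for the real bun binary, which was moved out of the way beforehand;
# retries `bun install` a few times and passes every other command through
REAL_BUN=/usr/local/bin/bun-real   # illustrative: wherever the original binary was moved

if [ "$1" = "install" ]; then
  for attempt in 1 2 3; do
    "$REAL_BUN" "$@" && exit 0
    echo "bun install failed (attempt $attempt), retrying..." >&2
    sleep 2
  done
  exit 1
fi

exec "$REAL_BUN" "$@"
```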
@invakid404 there's a simpler option: you can do that in the init script
the install dir for bun would be
/tmp/windmill/cache_nomount/bun/
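a minimal sketch of that variant, assuming the bun binary Windmill invokes actually lives in that directory (worth verifying) and reusing the same retry-wrapper idea:

```bash
#!/bin/bash
# init script sketch: swap the bun binary under /tmp/windmill/cache_nomount/bun/
# for a wrapper that retries `bun install`; paths and retry logic are illustrative
BUN_DIR=/tmp/windmill/cache_nomount/bun
if [ -x "$BUN_DIR/bun" ] && [ ! -e "$BUN_DIR/bun-real" ]; then
  mv "$BUN_DIR/bun" "$BUN_DIR/bun-real"
  cat > "$BUN_DIR/bun" <<'EOF'
#!/bin/bash
REAL=/tmp/windmill/cache_nomount/bun/bun-real
if [ "$1" = "install" ]; then
  for i in 1 2 3; do "$REAL" "$@" && exit 0; sleep 2; done
  exit 1
fi
exec "$REAL" "$@"
EOF
  chmod +x "$BUN_DIR/bun"
fi
```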
we already have a custom dockerfile so it's fine
mostly because we run tailscale on all workers for an unrelated reason
and because we need a userspace solution for a logs shared volume