invakid404
invakid4042d ago

Intermittent 401 errors on private NPM packages

We're seeing some 401 errors in our Bun scripts during bun install of the form:
bun install v1.2.3 (8c4d3ff8)
error: GET https://npm.pkg.github.com/download/x/x0.0.1/xxx - 401
These errors only happen occasionally, and I cannot seem to understand on what basis. My assumption would be that specific workers are having trouble installing it, but I am unsure how I would debug this at all. We have Bunfig install scopes configured at the Instance level in Windmill, so my assumption would be that all workers should be able to install private packages. I've also tried clearing the cache multiple times, which I thought used to fix the issue, but I cannot confirm that, as the errors seem to just come back eventually. I don't know how the bunfig install scopes are given to the workers, but it may be worth mentioning that we use autoscaling, so in my head it does make some sense that it could be some race condition in the worker initialization itself that is causing this. Again, I am not sure how to confirm or deny this, so I would appreciate any pointers on how to collect more information regarding this issue.
21 Replies
invakid404
invakid404OP2d ago
Oh, I almost forgot: we're running Windmill EE v1.468.0, hosted on fly.io. Autoscaling is set up with a custom script that creates and destroys machines with flyctl accordingly.
rubenf
rubenf2d ago
We will investigate, @invakid404. The logic we have doesn't seem to be vulnerable to a race condition. You should see a:
tracing::info!("Loaded setting {setting_name} from db config: {:#?}", &q);
at start in that worker's logs, with bunfig_install_scopes as the setting name. A bunfig.toml is generated from it for every job prior to install. You can check that by toggling the core setting to keep the job dir, SSHing into the worker after the error, and checking that the bunfig is there in the job dir. Also, if you never had those issues before, it might be due to a bun bug? We recently updated to 1.2.3.
invakid404
invakid404OP2d ago
I'll check out the logs and let you know. Unfortunately, I cannot tell you with confidence whether it started after the bun upgrade. I suspected it may be a fairly niche bug, as I didn't find any other reports of this.
rubenf
rubenf2d ago
what's for sure is that we haven't changed that logic in a long long time
invakid404
invakid404OP2d ago
@rubenf this is interesting, I can see the "Loaded setting" log at some point before the job, I can see the bunfig.toml in the job directory, yet the job still failed
invakid404
invakid404OP2d ago
here are the only relevant jobs related to the job that failed
invakid404
invakid404OP2d ago
Keep in mind it took me like 50 attempts to get it to fail, so it happens very rarely, but it definitely happens. I am now wondering whether this could be a GHCR issue and not a Windmill issue, but if it is indeed GHCR cutting me off, I find it unlikely that it would do so for a single run and then succeed on the next. It is the same token and the same package each time.
rubenf
rubenf2d ago
It's either a file system sync issue, a bun issue, or a GHCR issue.
invakid404
invakid404OP2d ago
As I said, I highly doubt it is a GHCR issue: only single attempts to install fail, then it succeeds right after. I don't have any fancy mounts for /tmp/windmill specifically, and I am not sure how I would check whether it is a bun issue. It is also worth mentioning that we've been using private GHCR packages for quite some time. I am currently looking for the first instance of this kind of error that I can find. If I am not mistaken, the first time we saw such an error, according to our Slack webhook channel, was Feb 19, and it appears we were running Windmill EE 1.463.3 at the time, if that helps narrow it down at all.
rubenf
rubenf2d ago
There is not much we can do on our side. We write the bunfig file and then run bun install. There is nothing fancy that we do that would make the issue Windmill-specific.
invakid404
invakid404OP2d ago
Well, this is causing me production issues and I need a solution, so if you have any suggestions at all, I'd take them. I don't see why Windmill couldn't retry the bun install if it crashes, considering a retry does appear to solve the problem.
rubenf
rubenf2d ago
We would need to detect that it is an issue that may be temporary. You mentioned autoscaling: does it happen when it's the first job of that kind on a worker? Also, if you use our S3 instance storage, the bundles are saved there, which saves the need for bun install, or at least the need to do it every time.
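A rough sketch of what "detecting a possibly temporary issue" could look like, based only on the error line format quoted at the top of this thread. The set of statuses treated as retryable is an assumption, not anything Windmill or Bun defines, and the names `RETRYABLE_STATUSES` and `retryable_failures` are hypothetical.

```python
import re

# Matches lines such as "error: GET https://registry.example.invalid/pkg - 401".
ERROR_LINE = re.compile(
    r"^error: (?P<method>[A-Z]+) (?P<url>\S+) - (?P<status>\d{3})\s*$",
    re.MULTILINE,
)

# Assumption: which statuses deserve a retry is a policy choice. 401 is
# included only because this thread observes it succeeding on the next attempt.
RETRYABLE_STATUSES = {401, 429, 500, 502, 503, 504}


def retryable_failures(output: str) -> list[tuple[str, int]]:
    """Return (url, status) pairs from install output that look worth retrying."""
    return [
        (match["url"], int(match["status"]))
        for match in ERROR_LINE.finditer(output)
        if int(match["status"]) in RETRYABLE_STATUSES
    ]
```

Treating 401 as retryable would also retry genuinely bad tokens, so a small attempt limit would be sensible with a check like this.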
invakid404
invakid404OP2d ago
Autoscaling was a shot in the dark; it does not appear to have anything to do with it. It happened again as we speak, same exact error. I SSHed into the worker, ran bun install myself, and it succeeded, so idk. And unfortunately we cannot use the S3 instance storage, as we are on the Windmill Pro plan.
rubenf
rubenf2d ago
Right
invakid404
invakid404OP2d ago
In theory I could hunt down every single node in every single flow that uses any private package and slap a retry on top, but I am bound to miss at least one and find out about it too late, so I am not quite sure what I could do on my end.
rubenf
rubenf2d ago
I will see what we can do
invakid404
invakid404OP2d ago
To me it sounds like, if a script deployed successfully and it has a generated lockfile and so on, it is somewhat reasonable to assume that bun install should succeed. Yet sometimes it doesn't on our end, so I don't know.
rubenf
rubenf2d ago
Yes, I agree with your assumptions. It doesn't mean that it's Windmill failing per se; what it does is very straightforward. For now I would mount a volume. That will be the most straightforward fix.
invakid404
invakid404OP2d ago
Mounting a volume is an absolute pain on Fly, as their VMs lack the necessary modules, so you have to resort to userspace solutions. For the time being, I replaced the bun executable in the docker image with a script that tries a few times if the command is bun install, and I'll set up an alert to fire if a retry ever happens, so I can keep track of whether the issue is gone or not.
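The wrapper described above was not shared in the thread. A minimal sketch of the idea follows, assuming the original binary has been moved to a hypothetical path `/usr/local/bin/bun.real` and that the install is invoked with `install` as the first argument. The attempt count and delay are arbitrary.

```python
import subprocess
import sys
import time
from typing import Callable, Sequence

REAL_BUN = "/usr/local/bin/bun.real"  # hypothetical location of the renamed real binary
MAX_ATTEMPTS = 3                      # arbitrary choice
DELAY_SECONDS = 2.0                   # arbitrary choice


def _run(cmd: Sequence[str]) -> int:
    """Run the command with inherited stdio and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def run_with_retry(
    args: Sequence[str],
    run: Callable[[Sequence[str]], int] = _run,
    attempts: int = MAX_ATTEMPTS,
    delay: float = DELAY_SECONDS,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Forward args to the real bun, retrying only when the subcommand is `install`."""
    is_install = len(args) > 0 and args[0] == "install"
    tries = attempts if is_install else 1
    code = 1
    for attempt in range(1, tries + 1):
        code = run([REAL_BUN, *args])
        if code == 0:
            return 0
        if attempt < tries:
            # This stderr line is what an alert could key on.
            print(
                f"bun-wrapper: install failed with exit code {code} "
                f"(attempt {attempt}/{tries}), retrying",
                file=sys.stderr,
            )
            sleep(delay)
    return code


def main() -> int:
    return run_with_retry(sys.argv[1:])
```

Installed in place of `bun`, the script would end with `sys.exit(main())`. Every other subcommand is passed through exactly once, so only installs pay the retry cost.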
rubenf
rubenf2d ago
@invakid404 there is a simpler way: you can do that in the init script. The install dir for bun would be /tmp/windmill/cache_nomount/bun/
invakid404
invakid404OP2d ago
We already have a custom Dockerfile, so it's fine. That's mostly because we run Tailscale on all workers for an unrelated reason, and because we need a userspace solution for a shared logs volume.
