I recently switched to CachyOS and I like it. After hopping from distro to distro every couple of years, I am finally back on “Arch” (in a broader sense); I think the last time was around 2010 or so.
So here is the thing: I am using Ansible (more precisely, ansible-playbook) to manage some systems (a webserver plus some local VMs). So far this had never been a problem; the playbooks ran smoothly. But on CachyOS this is not the case: they are slow as hell. I tried quite a lot of things:
running them from a VM: no problem. Even an Arch Linux VM had no problem.
using a container on CachyOS with podman
switching schedulers and even trying other kernels (including core/linux from the Arch repo)
All of those run fine, except the kernel/scheduler switch on CachyOS (no matter which combination, same “bad” performance). My main playbook is for my webserver and I run it quite regularly. It normally executes in ~1min (sure, there is some variation); on CachyOS it is more like 3m30s+ (up to 12 min!). I also tested it in a “blank”, almost unconfigured CachyOS VM installation: same issue.
So I am a little bit confused and have no idea where to look further. It’s not really a showstopper for me, since it still works (just very slowly) and there are other options I can use (podman + container, or a small VM that I start up and shut down afterwards). But I would like to understand the underlying problem: where is the difference between CachyOS and any other distro here? As Ansible is actually quite simple (copying some Python scripts to a remote host via scp and executing them), this may also affect other programs.
Hm, I fear it may be some strange SSH configuration or network latency issue. I haven’t figured it out so far, but running it “locally” (against a test VM webserver clone I set up to test updates and such) works fine. I am really out of ideas, but I want to understand what makes CachyOS so terribly slow in this specific case. I am not running a VPN or anything like that (and if I were, the Docker container/Ansible VM would have the same issue as the host).
So, Ansible run locally against a test VM (nothing to do on the VM side):
~20s from a VM and also from CachyOS
Running it against an external webserver (located in the same country):
~1min from a VM or Docker container (the container should use the same kernel and scheduler as the host)
~3min+ (up to 10 min!) from CachyOS (host or a test VM); each task/step is significantly slower!
I checked quite a lot of settings; today I compared the network settings in sysctl with an Arch Linux default, without much luck. A tcpdump shows a “high” number of “spurious retransmissions” when I execute Ansible from the command line. From Docker/Podman those aren’t there. I may try it over a network cable too; maybe it’s some strange handling of the WLAN adapter, although this should usually also affect a container/VM. I even downgraded ssh to the Arch Linux version.
Further ideas would be great; I would like to solve this issue.
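In case someone wants to repeat the sysctl comparison, it can be scripted. This is only a minimal sketch in Python, assuming the output of `sysctl -a` (or the contents of a sysctl.d file) from each system has been saved as text. The function names and the sample values are made up for illustration.

```python
def parse_sysctl(text):
    """Parse 'key = value' lines (sysctl -a output or a sysctl.d file) into a dict."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comments ('#' and ';' start comments in sysctl.d files).
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a 'key = value' line
        settings[key.strip()] = value.strip()
    return settings


def diff_sysctl(a, b):
    """Return {key: (value_in_a, value_in_b)} for every key whose value differs."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


if __name__ == "__main__":
    # Made-up sample data, not real dumps from either system.
    system_a = "vm.swappiness = 100\nnet.core.somaxconn = 4096\n"
    system_b = "vm.swappiness = 60\nnet.core.somaxconn = 4096\n"
    differences = diff_sysctl(parse_sysctl(system_a), parse_sysctl(system_b))
    for key, (left, right) in differences.items():
        print(f"{key}: {left} != {right}")
```

A key that exists on only one side shows up with `None` for the other side, so settings added by a distro-specific file are listed as well.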
OK, I am talking a little bit to myself here, but I found the solution: setting net.ipv4.tcp_ecn = 2. In the default configuration it is set to 1 (by /usr/lib/sysctl.d/99-cachyos-settings.conf). Since I set it to 2, there are no issues anymore. It looks like containers and VMs can use their own TCP stack settings, which would explain why the container and the Arch Linux VM were not affected.
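For anyone who wants to check their own system: according to the kernel's ip-sysctl documentation, 0 disables ECN, 1 also requests ECN on outgoing connections, and 2 (the kernel default) only uses ECN when the peer asks for it. Below is a small Python sketch that reads the current value and prints what it means. The function names are my own, and it assumes a Linux system with /proc mounted.

```python
from pathlib import Path

# Meaning of the net.ipv4.tcp_ecn values, per the kernel's ip-sysctl documentation.
ECN_MODES = {
    0: "ECN disabled",
    1: "ECN requested on outgoing connections and accepted on incoming ones",
    2: "ECN accepted on incoming connections only, not requested on outgoing ones (kernel default)",
}


def describe_ecn(value):
    """Return a human-readable description for a tcp_ecn value."""
    return ECN_MODES.get(value, f"unknown value {value}")


def read_tcp_ecn(path="/proc/sys/net/ipv4/tcp_ecn"):
    """Return the current setting as an int, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    current = read_tcp_ecn()
    if current is None:
        print("could not read net.ipv4.tcp_ecn")
    else:
        print(f"net.ipv4.tcp_ecn = {current}: {describe_ecn(current)}")
```

Running the same script inside the podman container should show which value the container's network namespace actually uses. To keep the fix across reboots, one option is a drop-in file under /etc/sysctl.d/ that contains net.ipv4.tcp_ecn = 2 and whose name sorts after 99-cachyos-settings.conf (the file name is up to you).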