Issue with apparently random system freezes

Hi there
Issue:
I am having apparently random system freezes.
I play modded Minecraft and the system will suddenly just freeze. It no longer responds to any input, and the network controller no longer responds to ping.
I cannot find anything in the logs; it appears that whatever is happening stops the system before it gets a chance to write to file.

Analysis:
Journal Log:
journalctl -b -1 | paste-cachyos

Version:

uname -r
6.9.8-2-cachyos

GPU:

โฏ sudo inxi -G && chwd --list
[sudo] password for steve: 
Graphics:
  Device-1: AMD Navi 23 [Radeon RX 6600/6600 XT/6600M] driver: amdgpu
    v: kernel
  Display: unspecified server: X.Org v: 24.1 with: Xwayland v: 24.1.0
    driver: X: loaded: modesetting dri: radeonsi gpu: amdgpu resolution:
    1: 2560x1440~165Hz 2: 1440x2560~165Hz
  API: EGL v: 1.5 drivers: radeonsi,swrast
    platforms: gbm,x11,surfaceless,device
  API: OpenGL v: 4.6 compat-v: 4.5 vendor: amd mesa v: 24.1.3-arch1.2
    renderer: AMD Radeon RX 6600 XT (radeonsi navi23 LLVM 18.1.8 DRM 3.57
    6.9.8-2-cachyos)
  API: Vulkan v: 1.3.285 drivers: radv surfaces: xcb,xlib
> 0000:07:00.0 (0280:14e4:43a0) Network controller Broadcom Inc. and subsidiaries:

╭─────────────┬─────────╮
│ Name        ┆ NonFree │
╞═════════════╪═════════╡
│ broadcom-wl ┆ false   │
╰─────────────┴─────────╯

> 0000:03:00.0 (0300:1002:73ff) VGA compatible controller Advanced Micro Devices, Inc. [AMD/ATI]:

╭──────────┬─────────╮
│ Name     ┆ NonFree │
╞══════════╪═════════╡
│ amd      ┆ false   │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ fallback ┆ false   │
╰──────────┴─────────╯

Query:

I am wondering if it's a memory leak issue?
Or maybe the AMD driver?
I am unsure how to diagnose this properly without the logs being written. Any help tracking this down is greatly appreciated!

Cheers!

I highly doubt it is a memory leak. That kind of thing is a very rare occurrence in this day and age. I see that you are using X11, and there seem to have been a lot of problems with it lately. KDE Plasma seems to be having input issues: it just starts thrashing if you throw too much at it in a short period of time.

There is nothing really visible in the logs.

How do you reproduce this issue? Did your RAM maybe get too full?

Welcome Steve,

I have heard some reports about this as well.

@steve-rackham have you tried logging in with Wayland to see if the problem persists?

(I have 0 problems on Wayland with my AMD 6000 series.)

Hi,

Could you please check whether using a different kernel removes the freezing?
Either linux-cachyos-lts or linux-cachyos-rc.

Hmm. I thought I was. The login screen shows Wayland, but I had mistakenly taken that to mean Wayland was selected, rather than it being an option.
I have logged in and selected Wayland.

Confirmed using Wayland now:

โฏ echo $XDG_SESSION_TYPE

wayland

I assumed memory leak as the behaviour has vastly improved after running some performance and memory leak mods on Minecraft. I was surprised when it happened again after a few days of stability.

I'll run this for a bit and report back. I understand that it's potentially the kernel and note that a kernel change is also suggested. I'll eliminate one thing at a time (and would prefer Wayland regardless). If the behaviour continues I'll try rolling back to the LTS kernel and go from there.

Thanks for your help, team!

Cheers.

Thanks for the suggestion! Another member suggested using Wayland. I will try that first, as it's easy to do and I want to eliminate one thing at a time. If the behaviour persists, then I will use this as the next step.

Thanks for your help!

Do you have swap?

Yes.

โฏ cat /proc/swaps 
Filename                                Type            Size            Used            Priority
/dev/zram0                              partition       16300028        485888          100

It seems to be more stable on Wayland.
However, I am still noticing available memory dropping over time. I am wondering if it's just a Minecraft thing; that's why I suspected a memory leak and applied the appropriate performance mods.
I also have a Manjaro installation where I experienced the same thing when watching YouTube, and it's been rock solid since changing to Wayland.
At present, I am just monitoring the RAM, and when it drops too far I save the game, quit, then go back in.
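For anyone wanting to automate that manual check, here is a minimal sketch, assuming the `MemAvailable` and `MemTotal` fields of `/proc/meminfo` are a reasonable proxy for "RAM dropping too far". The function names and the 10% threshold are hypothetical, picked only for illustration; they are not something from the game or the distro.

```python
def parse_meminfo(text: str) -> dict[str, int]:
    """Map each /proc/meminfo key to its numeric value (KiB for most fields)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Key: value" line
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def available_fraction(meminfo: dict[str, int]) -> float:
    """Fraction of total RAM the kernel estimates is still available."""
    return meminfo["MemAvailable"] / meminfo["MemTotal"]


def should_restart_game(meminfo: dict[str, int], threshold: float = 0.10) -> bool:
    """True when available memory is below the threshold (hypothetical 10% default)."""
    return available_fraction(meminfo) < threshold
```

On a live system the input would come from reading `/proc/meminfo`, for example `parse_meminfo(open("/proc/meminfo").read())`, polled every so often while the game runs.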
The thing that led me to write this post, though, is that this is relatively new behaviour. I have been putting up with it for a few weeks, maybe a bit more?

Would the next step be to downgrade the kernel? I don't want to waste anyone's time if the consensus is that this is not related to the OS itself.

Cheers!

No, I don't think you have. You only have zram, which is not the same as swap on disk. I would consider adding a swapfile; it only takes a few minutes. Swapping to disk is important, otherwise you can crash or freeze your system. See Swap - ArchWiki.

But even with swap you will still get slowdowns. The only things you can do are to lower the game's quality settings or get more RAM.
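To make the zram versus disk distinction concrete, here is a small sketch that parses the contents of `/proc/swaps` and picks out the entries that are not zram devices. It assumes that anything whose name does not start with `/dev/zram` is backed by a disk; `SwapEntry`, `parse_swaps` and `disk_backed` are illustrative names, not part of any tool.

```python
from dataclasses import dataclass


@dataclass
class SwapEntry:
    filename: str
    kind: str      # "partition" or "file"
    size_kib: int
    used_kib: int
    priority: int


def parse_swaps(text: str) -> list[SwapEntry]:
    """Parse the contents of /proc/swaps, skipping the header line."""
    entries = []
    for line in text.strip().splitlines()[1:]:
        fields = line.split()
        if len(fields) != 5:
            continue  # ignore anything that is not a regular entry
        name, kind, size, used, prio = fields
        entries.append(SwapEntry(name, kind, int(size), int(used), int(prio)))
    return entries


def disk_backed(entries: list[SwapEntry]) -> list[SwapEntry]:
    """Entries that are not zram devices (assumed here to live on disk)."""
    return [e for e in entries if not e.filename.startswith("/dev/zram")]
```

With only `/dev/zram0` listed, `disk_backed(...)` comes back empty, which matches the situation described above. On a live system the text would come from `open("/proc/swaps").read()`.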

OH! My apologies. I should have noted the size as well!

Understood. That is precisely what is happening.
Ok, I'll go through the article and then test that out. Thank you!

Be careful, if you have btrfs, it's a different procedure.

Good call out. Yes, using btrfs.
I am not averse to a reinstall if the advice is to use another file system. I have backups, so all good.

Well, I'm new to btrfs and I didn't have the patience to figure it out, so I just made a separate regular swap partition during install. You should stay on btrfs; automatic snapshots are great. You could also reformat another partition.

Or you could try what the btrfs wiki says... I just went with what I already know, because I think that article is missing a few commands.

Filename                                Type            Size            Used            Priority
/dev/zram0                              partition       14143484        512             100
/dev/nvme0n1p2                          partition       14142784        0               -2

Yeah, the article does appear to be missing something, as the commands don't complete. And I don't want to follow it blindly. Hmm.

If I use free -h I can see swap?

โฏ free -h
               total        used        free      shared  buff/cache   available
Mem:            15Gi       4.1Gi       6.2Gi       192Mi       5.8Gi        11Gi
Swap:           15Gi       473Mi        15Gi

But that does not look right?

That's just your zram.

Run this: lsblk -o NAME,FSTYPE,SIZE,MOUNTPOINT

Ah! Yes, you are correct. Ok, we have a path forward.

โฏ lsblk -o NAME,FSTYPE,SIZE,MOUNTPOINT
NAME        FSTYPE   SIZE MOUNTPOINT
sda                232.9G 
โ””โ”€sda1      ext4   232.9G 
sdb                111.8G 
โ”œโ”€sdb1                 1M 
โ”œโ”€sdb2      ext4       1G 
โ””โ”€sdb3      btrfs  110.8G 
sdc                223.6G 
โ””โ”€sdc1      btrfs  223.6G /var/tmp
zram0               15.5G [SWAP]
nvme0n1            931.5G 
โ””โ”€nvme0n1p1 ext4   931.5G 

OK. So a swap file helped with RAM starvation.
However, the issue just reoccurred. What is different this time is that I can ping the machine and even SSH. This is new.
The question is, what to look for?

Checking the journal I cannot see anything obvious. Any clues?

Jul 14 15:33:39 steve-cachyos steam-native[562660]: ERROR: ld.so: object '/usr/lib32/libextest.so' from LD_PRELOAD cannot be preloaded (wrong ELF class: ELFCLASS32): ignored.
Jul 14 15:33:39 steve-cachyos steam-native[562660]: ERROR: ld.so: object '/usr/lib32/libextest.so' from LD_PRELOAD cannot be preloaded (wrong ELF class: ELFCLASS32): ignored.
Jul 14 15:49:12 steve-cachyos steam-native[566931]: ERROR: ld.so: object '/usr/lib32/libextest.so' from LD_PRELOAD cannot be preloaded (wrong ELF class: ELFCLASS32): ignored.
Jul 14 15:49:12 steve-cachyos steam-native[566931]: ERROR: ld.so: object '/usr/lib32/libextest.so' from LD_PRELOAD cannot be preloaded (wrong ELF class: ELFCLASS32): ignored.
Jul 14 16:04:13 steve-cachyos steam-native[571083]: ERROR: ld.so: object '/usr/lib32/libextest.so' from LD_PRELOAD cannot be preloaded (wrong ELF class: ELFCLASS32): ignored.
Jul 14 16:04:13 steve-cachyos steam-native[571083]: ERROR: ld.so: object '/usr/lib32/libextest.so' from LD_PRELOAD cannot be preloaded (wrong ELF class: ELFCLASS32): ignored.
Jul 14 16:18:57 steve-cachyos steam-native[574882]: ERROR: ld.so: object '/usr/lib32/libextest.so' from LD_PRELOAD cannot be preloaded (wrong ELF class: ELFCLASS32): ignored.
Jul 14 16:18:57 steve-cachyos steam-native[574882]: ERROR: ld.so: object '/usr/lib32/libextest.so' from LD_PRELOAD cannot be preloaded (wrong ELF class: ELFCLASS32): ignored.
Jul 14 16:34:09 steve-cachyos steam-native[578687]: ERROR: ld.so: object '/usr/lib32/libextest.so' from LD_PRELOAD cannot be preloaded (wrong ELF class: ELFCLASS32): ignored.
Jul 14 16:34:09 steve-cachyos steam-native[578687]: ERROR: ld.so: object '/usr/lib32/libextest.so' from LD_PRELOAD cannot be preloaded (wrong ELF class: ELFCLASS32): ignored.
Jul 14 16:41:13 steve-cachyos systemd[1]: Starting Cleanup of Temporary Directories...
Jul 14 16:41:13 steve-cachyos systemd[1]: systemd-tmpfiles-clean.service: Deactivated successfully.
Jul 14 16:41:13 steve-cachyos systemd[1]: Finished Cleanup of Temporary Directories.
Jul 14 16:41:13 steve-cachyos systemd[1]: run-credentials-systemd\x2dtmpfiles\x2dclean.service.mount: Deactivated successfully.
Jul 14 16:49:37 steve-cachyos steam-native[582951]: ERROR: ld.so: object '/usr/lib32/libextest.so' from LD_PRELOAD cannot be preloaded (wrong ELF class: ELFCLASS32): ignored.
Jul 14 16:49:37 steve-cachyos steam-native[582951]: ERROR: ld.so: object '/usr/lib32/libextest.so' from LD_PRELOAD cannot be preloaded (wrong ELF class: ELFCLASS32): ignored.
Jul 14 16:50:19 steve-cachyos sshd-session[583073]: pam_systemd_home(sshd:auth): New sd-bus connection (system-bus-pam-systemd-home-583073) opened.

I can see some cleanup of temporary files right beforehand and then me logging in remotely, but nothing else. I have htop running and it has not stopped. But the system is unresponsive to keyboard and mouse, and the game is hung.

Scratch that. I can get into the system locally.
It's all responsive except for Minecraft itself.

I think we can safely call this another issue.

If I check the Minecraft logs I can see errors that correlate with the crash, and the log ends with:

Process crashed with exitcode 9.

So I think it's just pushing the machine too hard.
Although it didn't solve the problem, adding swap definitely helped, so thanks for that!
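One possible way to follow up on that exit code: if the launcher reports the number of the terminating signal as the exit code (an assumption, not confirmed in this thread), then 9 would be SIGKILL, which is what the kernel OOM killer sends. The sketch below looks that name up and scans kernel log lines for the kernel's "Out of memory: Killed process" message. `signal_name` and `find_oom_kills` are illustrative helper names.

```python
import re
import signal

# Message the kernel OOM killer logs when it kills a process.
OOM_KILL = re.compile(r"Out of memory: Killed process (\d+) \(([^)]+)\)")


def signal_name(exit_code: int) -> str:
    """Name of the signal with this number.

    Assumes the launcher reports the terminating signal number as the
    exit code, which is a guess rather than documented behaviour.
    """
    return signal.Signals(exit_code).name


def find_oom_kills(lines):
    """Return (pid, process name) for every kernel OOM-killer line found."""
    hits = []
    for line in lines:
        match = OOM_KILL.search(line)
        if match:
            hits.append((int(match.group(1)), match.group(2)))
    return hits
```

The lines could come from the output of `journalctl -k -b` split on newlines. An empty result does not rule out memory pressure: this only catches the kernel OOM killer, and a userspace OOM daemon, if one is enabled, would log elsewhere.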