Frequent kernel panics caused by race condition in amdgpu

Hi!

Long story short: I get frequent kernel panics (roughly one per hour) while playing WoW. I haven't tested other games yet. I finally got kdump working, and here is the first dmesg output:

[  584.908903] [     C10] BUG: kernel NULL pointer dereference, address: 0000000000000000
[  584.908907] [     C10] #PF: supervisor read access in kernel mode
[  584.908910] [     C10] #PF: error_code(0x0000) - not-present page
[  584.908912] [     C10] PGD 0 P4D 0 
[  584.908915] [     C10] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
[  584.908918] [     C10] CPU: 10 PID: 4276 Comm: wineserver Kdump: loaded Tainted: G           OE      6.10.8-2-cachyos #1 56996ea3d65410c787cafb5fc91984f901b413d0
[  584.908922] [     C10] Hardware name: To Be Filled By O.E.M. X570 Taichi/X570 Taichi, BIOS P5.63 08/22/2024
[  584.908924] [     C10] RIP: 0010:dcn10_set_drr+0xe2/0x110 [amdgpu]
[  584.909202] [     C10] Code: 85 ff 74 44 48 8b 17 48 85 d2 74 3c 48 8b 92 28 01 00 00 48 85 d2 74 0b 48 89 ee e8 98 84 5f dc 48 8b 03 48 8b b8 f8 00 00 00 <48> 8b 07 48 8b 80 40 01 00 00 48 85 c0 74 0f ba 02 00 00 00 be 00
[  584.909204] [     C10] RSP: 0018:ffffb40e00424db0 EFLAGS: 00010086
[  584.909207] [     C10] RAX: ffff89f2e1ac0be0 RBX: ffffb40e00424e08 RCX: 0000000000000000
[  584.909209] [     C10] RDX: 0000000080010035 RSI: 00000000000141e4 RDI: 0000000000000000
[  584.909211] [     C10] RBP: ffffb40e00424db4 R08: 0000000000000008 R09: 000000009f3f07ff
[  584.909212] [     C10] R10: ffffb40e00424da0 R11: 0000000080000000 R12: ffffb40e00424e10
[  584.909214] [     C10] R13: ffff89ef55f80178 R14: ffff89ef55f804f0 R15: ffff89ef453faf60
[  584.909216] [     C10] FS:  00007a9dd4731b40(0000) GS:ffff89f66ed00000(0000) knlGS:0000000000000000
[  584.909218] [     C10] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  584.909220] [     C10] CR2: 0000000000000000 CR3: 000000023add8000 CR4: 0000000000f50ef0
[  584.909222] [     C10] PKRU: 55555554
[  584.909224] [     C10] Call Trace:
[  584.909226] [     C10]  <IRQ>
[  584.909228] [     C10]  ? __die_body.cold+0x8/0x12
[  584.909233] [     C10]  ? page_fault_oops+0x15a/0x2e0
[  584.909237] [     C10]  ? generic_reg_set_ex+0x156/0x2d0 [amdgpu d42adf081d1bd1efedaa3586cdc4ecc046215f96]
[  584.909454] [     C10]  ? exc_page_fault+0x81/0x190
[  584.909458] [     C10]  ? asm_exc_page_fault+0x26/0x30
[  584.909464] [     C10]  ? dcn10_set_drr+0xe2/0x110 [amdgpu d42adf081d1bd1efedaa3586cdc4ecc046215f96]
[  584.909706] [     C10]  dc_stream_adjust_vmin_vmax+0x195/0x360 [amdgpu d42adf081d1bd1efedaa3586cdc4ecc046215f96]
[  584.909876] [     C10]  dm_crtc_high_irq+0x230/0x2b0 [amdgpu d42adf081d1bd1efedaa3586cdc4ecc046215f96]
[  584.910066] [     C10]  amdgpu_dm_irq_handler+0x85/0x1f0 [amdgpu d42adf081d1bd1efedaa3586cdc4ecc046215f96]
[  584.910253] [     C10]  amdgpu_irq_dispatch+0xd6/0x210 [amdgpu d42adf081d1bd1efedaa3586cdc4ecc046215f96]
[  584.910402] [     C10]  amdgpu_ih_process+0x83/0x100 [amdgpu d42adf081d1bd1efedaa3586cdc4ecc046215f96]
[  584.910544] [     C10]  amdgpu_irq_handler+0x23/0x60 [amdgpu d42adf081d1bd1efedaa3586cdc4ecc046215f96]
[  584.910685] [     C10]  __handle_irq_event_percpu+0x4d/0x1b0
[  584.910688] [     C10]  handle_irq_event+0x3b/0x90
[  584.910690] [     C10]  handle_edge_irq+0x9a/0x260
[  584.910693] [     C10]  __common_interrupt+0x41/0xa0
[  584.910696] [     C10]  common_interrupt+0x80/0xa0
[  584.910698] [     C10]  </IRQ>
[  584.910699] [     C10]  <TASK>
[  584.910701] [     C10]  asm_common_interrupt+0x26/0x40
[  584.910703] [     C10] RIP: 0010:_raw_spin_unlock_irqrestore+0x1d/0x40
[  584.910705] [     C10] Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 c6 07 00 0f 1f 00 f7 c6 00 02 00 00 74 06 fb 0f 1f 44 00 00 <65> ff 0d 74 1c b7 62 74 05 e9 f0 04 24 00 e8 10 33 f4 fe e9 e6 04
[  584.910707] [     C10] RSP: 0018:ffffb40e1cfc7bb8 EFLAGS: 00000206
[  584.910709] [     C10] RAX: ffff89f66ed25a00 RBX: ffffb40e1cfc7bc0 RCX: ffffb40e26f03d68
[  584.910710] [     C10] RDX: 0000000000000000 RSI: 0000000000000287 RDI: ffff89f66ed259c0
[  584.910711] [     C10] RBP: ffff89ef99788000 R08: ffff89f66ed25a20 R09: 0000000000000000
[  584.910713] [     C10] R10: 0000000000000000 R11: 0000000000000100 R12: ffff89f66ed25a00
[  584.910714] [     C10] R13: 0000000000000287 R14: ffff89f66ed259c0 R15: ffff89f66ed259c0
[  584.910717] [     C10]  ? srso_alias_return_thunk+0x5/0xfbef5
[  584.910720] [     C10]  schedule_hrtimeout_range_clock+0x203/0x2c0
[  584.910722] [     C10]  ? __pfx_hrtimer_wakeup+0x10/0x10
[  584.910726] [     C10]  ep_poll+0x623/0x6f0
[  584.910730] [     C10]  ? __pfx_ep_autoremove_wake_function+0x10/0x10
[  584.910733] [     C10]  __x64_sys_epoll_wait+0x19b/0x1e0
[  584.910737] [     C10]  do_syscall_64+0x82/0x190
[  584.910739] [     C10]  ? do_syscall_64+0x8e/0x190
[  584.910742] [     C10]  ? srso_alias_return_thunk+0x5/0xfbef5
[  584.910743] [     C10]  ? syscall_exit_to_user_mode_prepare+0x148/0x170
[  584.910746] [     C10]  ? srso_alias_return_thunk+0x5/0xfbef5
[  584.910748] [     C10]  ? syscall_exit_to_user_mode+0x73/0x1f0
[  584.910750] [     C10]  ? srso_alias_return_thunk+0x5/0xfbef5
[  584.910751] [     C10]  ? do_syscall_64+0x8e/0x190
[  584.910756] [     C10]  ? srso_alias_return_thunk+0x5/0xfbef5
[  584.910757] [     C10]  ? syscall_exit_to_user_mode_prepare+0x148/0x170
[  584.910760] [     C10]  ? srso_alias_return_thunk+0x5/0xfbef5
[  584.910761] [     C10]  ? syscall_exit_to_user_mode+0x73/0x1f0
[  584.910763] [     C10]  ? srso_alias_return_thunk+0x5/0xfbef5
[  584.910765] [     C10]  ? do_syscall_64+0x8e/0x190
[  584.910767] [     C10]  ? do_syscall_64+0x8e/0x190
[  584.910769] [     C10]  ? srso_alias_return_thunk+0x5/0xfbef5
[  584.910771] [     C10]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  584.910773] [     C10] RIP: 0033:0x7a9dd4bded17
[  584.910793] [     C10] Code: ff ff ff ff eb ba 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 f3 0f 1e fa 80 3d 55 c3 0d 00 00 41 89 ca 74 10 b8 e8 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 59 c3 48 83 ec 28 89 54 24 18 48 89 74 24 10
[  584.910794] [     C10] RSP: 002b:00007fffdaf84038 EFLAGS: 00000202 ORIG_RAX: 00000000000000e8
[  584.910796] [     C10] RAX: ffffffffffffffda RBX: 00007fffdaf84050 RCX: 00007a9dd4bded17
[  584.910797] [     C10] RDX: 0000000000000080 RSI: 00007fffdaf84040 RDI: 000000000000000e
[  584.910799] [     C10] RBP: 00007fffdaf84040 R08: 0000000000000007 R09: 00005ee10e8cd600
[  584.910800] [     C10] R10: 000000000000000a R11: 0000000000000202 R12: 00007fffdaf84050
[  584.910801] [     C10] R13: 00007fffdaf84808 R14: 0000000000000000 R15: 0000000000000001
[  584.910805] [     C10]  </TASK>
[  584.910806] [     C10] Modules linked in: rfcomm snd_seq_dummy snd_hrtimer snd_seq nf_conntrack_netbios_ns nf_conntrack_broadcast nft_masq nft_reject_ipv4 bridge stp llc nf_nat_tftp nf_conntrack_tftp nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib ccm nft_reject_inet algif_aead nf_reject_ipv4 crypto_null nf_reject_ipv6 nft_reject des3_ede_x86_64 cbc nft_ct des_generic libdes md4 nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables cmac algif_hash algif_skcipher af_alg bnep vfat fat joydev mousedev intel_rapl_msr amd_atl intel_rapl_common kvm_amd iwlmvm hid_generic kvm snd_hda_codec_hdmi crct10dif_pclmul uvcvideo mac80211 videobuf2_vmalloc crc32_pclmul snd_hda_intel polyval_clmulni uvc snd_usb_audio snd_intel_dspcfg videobuf2_memops polyval_generic snd_virtuoso btusb ghash_clmulni_intel snd_intel_sdw_acpi videobuf2_v4l2 libarc4 snd_hda_codec snd_oxygen_lib snd_usbmidi_lib sha512_ssse3 btrtl ucsi_ccg videodev sha1_ssse3 snd_mpu401_uart snd_ump iwlwifi typec_ucsi btintel snd_hda_core aesni_intel snd_rawmidi
[  584.910857] [     C10]  btbcm typec videobuf2_common snd_seq_device gf128mul btmtk snd_hwdep crypto_simd roles usbhid cryptd mc snd_pcm bluetooth cfg80211 wmi_bmof mxm_wmi intel_wmi_thunderbolt ccp rapl igb pcspkr snd_timer k10temp i2c_piix4 ptp crc16 snd pps_core thunderbolt dca rfkill soundcore mac_hid lz4 lz4_compress winesync(OE) pkcs8_key_parser i2c_dev crypto_user dm_mod loop nfnetlink zram ip_tables x_tables amdgpu btrfs blake2b_generic video libcrc32c amdxcp xor i2c_algo_bit raid6_pq drm_ttm_helper ttm drm_exec gpu_sched drm_suballoc_helper drm_buddy nvme drm_display_helper nvme_core sha256_ssse3 cec nvme_auth xhci_pci xhci_pci_renesas wmi crc32c_generic crc32c_intel
[  584.910902] [     C10] CR2: 0000000000000000
[  584.910904] [     C10] ---[ end trace 0000000000000000 ]---
[  584.910905] [     C10] RIP: 0010:dcn10_set_drr+0xe2/0x110 [amdgpu]
[  584.911137] [     C10] Code: 85 ff 74 44 48 8b 17 48 85 d2 74 3c 48 8b 92 28 01 00 00 48 85 d2 74 0b 48 89 ee e8 98 84 5f dc 48 8b 03 48 8b b8 f8 00 00 00 <48> 8b 07 48 8b 80 40 01 00 00 48 85 c0 74 0f ba 02 00 00 00 be 00
[  584.911139] [     C10] RSP: 0018:ffffb40e00424db0 EFLAGS: 00010086
[  584.911142] [     C10] RAX: ffff89f2e1ac0be0 RBX: ffffb40e00424e08 RCX: 0000000000000000
[  584.911144] [     C10] RDX: 0000000080010035 RSI: 00000000000141e4 RDI: 0000000000000000
[  584.911146] [     C10] RBP: ffffb40e00424db4 R08: 0000000000000008 R09: 000000009f3f07ff
[  584.911148] [     C10] R10: ffffb40e00424da0 R11: 0000000080000000 R12: ffffb40e00424e10
[  584.911149] [     C10] R13: ffff89ef55f80178 R14: ffff89ef55f804f0 R15: ffff89ef453faf60
[  584.911151] [     C10] FS:  00007a9dd4731b40(0000) GS:ffff89f66ed00000(0000) knlGS:0000000000000000
[  584.911154] [     C10] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  584.911156] [     C10] CR2: 0000000000000000 CR3: 000000023add8000 CR4: 0000000000f50ef0
[  584.911158] [     C10] PKRU: 55555554
[  584.911160] [     C10] Kernel panic - not syncing: Fatal exception in interrupt
[  584.912166] [     C10] Kernel Offset: 0x1b400000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
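
For anyone who wants to pull the key facts out of a trace like this, here is a minimal Python sketch. It is not part of the original report: the helper name `summarize_oops` and the trimmed `OOPS` excerpt (timestamps and CPU tags removed) are made up for illustration. It extracts the faulting function from the `RIP:` line, the instruction bytes that the kernel marks with `<..>` in the `Code:` line, and the value of RDI. On x86-64 the bytes `48 8b 07` encode `mov (%rdi),%rax`, and RDI is 0 here, which matches the reported NULL pointer dereference at address 0.

```python
import re

# Trimmed excerpt of the oops above (assumption: timestamps and CPU tags removed).
OOPS = """\
BUG: kernel NULL pointer dereference, address: 0000000000000000
RIP: 0010:dcn10_set_drr+0xe2/0x110 [amdgpu]
Code: 85 ff 74 44 48 8b 17 48 85 d2 74 3c 48 8b 92 28 01 00 00 48 85 d2 74 0b 48 89 ee e8 98 84 5f dc 48 8b 03 48 8b b8 f8 00 00 00 <48> 8b 07 48 8b 80 40 01 00 00 48 85 c0 74 0f ba 02 00 00 00 be 00
RDX: 0000000080010035 RSI: 00000000000141e4 RDI: 0000000000000000
"""


def summarize_oops(text):
    """Return the faulting symbol, the marked instruction bytes, and RDI."""
    rip = re.search(
        r"RIP: \w+:(\w+)\+(0x[0-9a-f]+)/(0x[0-9a-f]+)(?: \[(\w+)\])?", text
    )
    code = re.search(r"Code: ([0-9a-f <>]+)", text)
    rdi = re.search(r"RDI: ([0-9a-f]{16})", text)

    tokens = code.group(1).split()
    # The kernel wraps the first byte of the faulting instruction in <..>.
    start = next(i for i, tok in enumerate(tokens) if tok.startswith("<"))
    # Take three bytes: enough for the REX prefix, opcode, and ModRM byte here.
    faulting = " ".join(tok.strip("<>") for tok in tokens[start:start + 3])

    return {
        "function": rip.group(1),
        "offset": int(rip.group(2), 16),
        "size": int(rip.group(3), 16),
        "module": rip.group(4),
        "faulting_bytes": faulting,
        "rdi": int(rdi.group(1), 16),
    }


if __name__ == "__main__":
    for key, value in summarize_oops(OOPS).items():
        print(f"{key}: {value}")
```

The three-byte window is specific to this instruction; a general decoder would need a real disassembler (for example the kernel's `scripts/decodecode`).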

Is there something I can do? I’ll try to downgrade to an older kernel now.

Best regards,
Mr nUUb

EDIT: Is there an archive where I can download older kernel versions? A few days ago I accidentally deleted /var/cache/*.

EDIT 2: I think I found the issue: Kernel panic during amdgpu IRQ (#3142) · Issues · drm / amd · GitLab

EDIT 3: Temporary workaround: disable VRR. With VRR enabled, I can very easily trigger the crash (and graphical artifacts) by rapidly enabling/disabling nightlight (KDE Plasma). With VRR disabled, I cannot reproduce it anymore.

@ptr1337 Would it be possible to include the patch mentioned in the issue Kernel panic during amdgpu IRQ (#3142) · Issues · drm / amd · GitLab? I switched to Cachy because Arch kept crashing on me. I suspect this race condition was the cause, and now it is happening to me on Cachy as well.

I will put it into the RC Kernel for now.

Added as of 6.11: Add missing amdgpu irq handling fixes · CachyOS/kernel-patches@0c4847f · GitHub.