Intermittent NVMe Issue with KLEVV CRAS C910 4TB (VF101C59)

I've been having this issue for a while: this drive only works under Linux (Arch-based, currently CachyOS) in specific conditions.

The condition: cold boot into Win11 from either GRUB or the BIOS, confirm the drive is present, then reboot into CachyOS. The drive usually shows up.

Sometimes the drive won’t show, so I must go back to Windows 11 from a COLD boot and try again to get it to show under Linux.

I have been collecting loads of data on this issue; at present the drive is mounted. But before I continue, please understand that this issue does not appear to be related to the problem addressed by nvme_core.default_ps_max_latency_us=0, the Arch-recommended fix. Setting it does nothing.
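For anyone wanting to double-check that the parameter really was applied on a given boot, here is a minimal Python sketch. The helper name `kernel_param` is made up for illustration. The sysfs path is the usual location for `nvme_core` module parameters, and the script only reads it if it exists.

```python
from pathlib import Path

PARAM = "nvme_core.default_ps_max_latency_us"

def kernel_param(cmdline, name):
    """Return the value of `name` from a kernel command line, or None if absent."""
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == name and sep:
            return value
    return None

if __name__ == "__main__":
    # Value requested on the kernel command line for this boot.
    proc = Path("/proc/cmdline")
    if proc.exists():
        print("cmdline value:", kernel_param(proc.read_text(), PARAM))
    # Value the loaded module is actually using (assumed standard sysfs location).
    sysfs = Path("/sys/module/nvme_core/parameters/default_ps_max_latency_us")
    if sysfs.exists():
        print("live value:", sysfs.read_text().strip())
```

If both values read 0 on a boot where the drive is still missing, that supports the claim that this parameter is not the fix here.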

I need help in isolating how to push this device bug upstream to get a patch made for Linux. Kind of sick of running into this issue and having no SOLID fix!

The motherboard in use is an ASUS PRIME B840M-A WIFI with the latest BIOS.

CACHYOS SYSTEM INFO

          .-------------------------:                    gerarderloper@underverse-host
          .+=========================.                    -----------------------------
         :++===++==================-       :++-           OS: CachyOS x86_64
        :*++====+++++=============-        .==:           Kernel: Linux 6.13.6-2-cachyos
       -*+++=====+***++==========:                        Uptime: 20 mins
      =*++++========------------:                         Packages: 1415 (pacman), 51 (flatpak)
     =*+++++=====-                     ...                Shell: fish 4.0.0
   .+*+++++=-===:                    .=+++=:              Display (KAMN26F7SA): 2560x1080 @ 75 Hz (as 2048x864) in 26" [External]
  :++++=====-==:                     -*****+              Display (LG TV SSCR2): 3840x2160 @ 144 Hz (as 2560x1440) in 72" [External, HDR] *
 :++========-=.                      .=+**+.              DE: KDE Plasma 6.3.3
.+==========-.                          .                 WM: KWin (Wayland)
 :+++++++====-                                .--==-.     WM Theme: Breeze
  :++==========.                             :+++++++:    Theme: Breeze (Dark) [Qt], Breeze-Dark [GTK2], Breeze [GTK3]
   .-===========.                            =*****+*+    Icons: breeze-dark [Qt], breeze-dark [GTK2/3/4]
    .-===========:                           .+*****+:    Font: Noto Sans (10pt) [Qt], Noto Sans (10pt) [GTK2/3/4]
      -=======++++:::::::::::::::::::::::::-:  .---:      Cursor: volantes (32px)
       :======++++====+++******************=.             Terminal: konsole 24.12.3
        :=====+++==========++++++++++++++*-               CPU: AMD Ryzen 7 7800X3D (16) @ 5.05 GHz
         .====++==============++++++++++*-                GPU: NVIDIA GeForce RTX 4090 [Discrete]
          .===+==================+++++++:                 Memory: 7.23 GiB / 62.45 GiB (12%)
           .-=======================+++:                  Swap: 0 B / 62.45 GiB (0%)
             ..........................                   Disk (/): 75.92 GiB / 1.79 TiB (4%) - ext4
                                                          Disk (/mnt/GamesNVMe): 593.77 GiB / 3.64 TiB (16%) - ntfs3
                                                          Disk (/mnt/GamesNVMe2): 95.23 MiB / 1.86 TiB (0%) - btrfs
                                                          Disk (/mnt/StorageNVMe): 1.34 TiB / 1.82 TiB (74%) - btrfs
                                                          Local IP (wlan0): 192.168.1.20/24
                                                          Locale: en_AU.UTF-8

Incoming data dump:

PCI ID:

09:00.0 Non-Volatile memory controller: Realtek Semiconductor Co., Ltd. RTS5772DL NVMe SSD Controller (DRAM-less) (rev 01) (prog-if 02 [NVM Express])
	Subsystem: Realtek Semiconductor Co., Ltd. RTS5772DL NVMe SSD Controller (DRAM-less)
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 34
	IOMMU group: 16
	Region 0: Memory at f6c00000 (64-bit, non-prefetchable) [size=16K]
	Region 5: Memory at f6c04000 (32-bit, non-prefetchable) [size=8K]
	Capabilities: [40] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [50] MSI: Enable- Count=1/8 Maskable- 64bit+
		Address: 0000000000000000  Data: 0000
	Capabilities: [70] Express (v2) Endpoint, IntMsgNum 0
		DevCap:	MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 75W TEE-IO-
		DevCtl:	CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop- FLReset-
			MaxPayload 512 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 16GT/s, Width x4, ASPM L1, Exit Latency L1 <64us
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM L1 Enabled; RCB 64 bytes, LnkDisable- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 16GT/s, Width x4
			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR+
			 10BitTagComp+ 10BitTagReq- OBFF Via message/WAKE#, ExtFmt- EETLPPrefix-
			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
			 FRS- TPHComp- ExtTPHComp-
			 AtomicOpsCap: 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis+
			 AtomicOpsCtl: ReqEn-
			 IDOReq- IDOCompl- LTR+ EmergencyPowerReductionReq-
			 10BitTagReq- OBFF Disabled, EETLPPrefixBlk-
		LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
		LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
		LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+
			 EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
			 Retimer- 2Retimers- CrosslinkRes: Upstream Port
	Capabilities: [b0] MSI-X: Enable+ Count=17 Masked-
		Vector table: BAR=0 offset=00002000
		PBA: BAR=0 offset=00003000
	Capabilities: [100 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP-
			ECRC- UnsupReq- ACSViol- UncorrIntErr- BlockedTLP- AtomicOpBlocked- TLPBlockedErr-
			PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP-
			ECRC- UnsupReq- ACSViol- UncorrIntErr+ BlockedTLP- AtomicOpBlocked- TLPBlockedErr-
			PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+
			ECRC- UnsupReq- ACSViol- UncorrIntErr+ BlockedTLP- AtomicOpBlocked- TLPBlockedErr-
			PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr- CorrIntErr- HeaderOF-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+ CorrIntErr+ HeaderOF+
		AERCap:	First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
			MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
		HeaderLog: 00000000 00000000 00000000 00000000
	Capabilities: [148 v1] Virtual Channel
		Caps:	LPEVC=0 RefClk=100ns PATEntryBits=1
		Arb:	Fixed+ WRR32+ WRR64+ WRR128+
		Ctrl:	ArbSelect=Fixed
		Status:	InProgress-
		Port Arbitration Table [1b8] <?>
		VC0:	Caps:	PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
			Arb:	Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
			Ctrl:	Enable+ ID=0 ArbSelect=Fixed TC/VC=01
			Status:	NegoPending- InProgress-
	Capabilities: [1f8 v1] Device Serial Number 00-00-00-01-00-4c-e0-00
	Capabilities: [208 v1] Power Budgeting <?>
	Capabilities: [218 v1] Secondary PCI Express
		LnkCtl3: LnkEquIntrruptEn- PerformEqu-
		LaneErrStat: 0
	Capabilities: [238 v1] Physical Layer 16.0 GT/s <?>
	Capabilities: [25c v1] Lane Margining at the Receiver
		PortCap: Uses Driver-
		PortSta: MargReady+ MargSoftReady+
	Capabilities: [274 v1] Latency Tolerance Reporting
		Max snoop latency: 1048576ns
		Max no snoop latency: 1048576ns
	Capabilities: [27c v1] L1 PM Substates
		L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
			  PortCommonModeRestoreTime=60us PortTPowerOnTime=60us
		L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
			   T_CommonMode=0us LTR1.2_Threshold=32768ns
		L1SubCtl2: T_PwrOn=60us
	Capabilities: [28c v1] Vendor Specific Information: ID=0002 Rev=4 Len=100 <?>
	Capabilities: [38c v1] Vendor Specific Information: ID=0001 Rev=1 Len=038 <?>
	Capabilities: [3c4 v1] Data Link Feature <?>
	Kernel driver in use: nvme
	Kernel modules: nvme
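
The dump above also records the PCIe link power settings (ASPM and L1 substates). These are a separate mechanism from NVMe APST, so they may be worth comparing between a working boot and a failing one. A small sketch for pulling them out of saved `lspci -vv` text follows. The function name `link_power_summary` is hypothetical, and the regexes assume the lspci layout shown above.

```python
import re

def link_power_summary(lspci_text):
    """Extract the ASPM setting and any enabled L1 substates from lspci -vv text."""
    # "LnkCtl:" (with the colon) does not match the LnkCtl2/LnkCtl3 lines.
    aspm = re.search(r"LnkCtl:\s+ASPM ([^;]+);", lspci_text)
    l1sub = re.search(r"L1SubCtl1:\s+(.*)", lspci_text)
    enabled = []
    if l1sub:
        # lspci marks an enabled flag with a trailing '+', a disabled one with '-'.
        enabled = [flag[:-1] for flag in l1sub.group(1).split() if flag.endswith("+")]
    return {
        "aspm": aspm.group(1).strip() if aspm else None,
        "l1_substates_enabled": enabled,
    }
```

On the output above, this reports ASPM as "L1 Enabled" with no L1 substates enabled.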

Some fdisk Info:


Disk /dev/nvme3n1: 3.64 TiB, 4000787030016 bytes, 7814037168 sectors
Disk model: KLEVV CRAS C910 M.2 NVMe SSD 4TB        
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 6560ED47-9A4F-4352-9C46-B12D78A3041F

Device         Start        End    Sectors  Size Type
/dev/nvme3n1p1    34      32767      32734   16M Microsoft reserved
/dev/nvme3n1p2 32768 7813152767 7813120000  3.6T Microsoft basic data

dmesg nvme block: I believe it's nvme3 (PCI function 0000:09:00.0, matching the lspci and fdisk output above)

[    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-linux-cachyos root=UUID=2c2f2818-a36a-4edc-91b2-a6df63083de0 rw nowatchdog zswap.enabled=0 nvme_load=yes nvme_core.default_ps_max_latency_us=0 nvidia.NVreg_EnableGpuFirmware=0 splash loglevel=3
[    0.042622] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-linux-cachyos root=UUID=2c2f2818-a36a-4edc-91b2-a6df63083de0 rw nowatchdog zswap.enabled=0 nvme_load=yes nvme_core.default_ps_max_latency_us=0 nvidia.NVreg_EnableGpuFirmware=0 splash loglevel=3
[    0.042655] Unknown kernel command line parameters "splash BOOT_IMAGE=/boot/vmlinuz-linux-cachyos nvme_load=yes", will be passed to user space.
[    3.238162]     nvme_load=yes
[    5.378883] nvme nvme1: pci function 0000:05:00.0
[    5.378885] nvme nvme2: pci function 0000:06:00.0
[    5.378887] nvme nvme0: pci function 0000:02:00.0
[    5.378888] nvme nvme3: pci function 0000:09:00.0
[    5.412406] nvme nvme1: missing or invalid SUBNQN field.
[    5.418195] nvme nvme1: 15/0/0 default/read/poll queues
[    5.420606] nvme nvme1: Ignoring bogus Namespace Identifiers
[    5.421826]  nvme1n1: p1
[    5.422636] nvme nvme0: D3 entry latency set to 10 seconds
[    5.426498] nvme nvme0: 16/0/0 default/read/poll queues
[    5.429545]  nvme0n1: p1 p2
[    5.483926] nvme nvme2: allocated 64 MiB host memory buffer (1 segment).
[    5.509360] nvme nvme3: allocated 64 MiB host memory buffer (1 segment).
[    5.513901] nvme nvme3: 16/0/0 default/read/poll queues
[    5.521797]  nvme3n1: p1 p2
[    5.523057] nvme nvme2: 8/0/0 default/read/poll queues
[    5.546060] nvme nvme2: Ignoring bogus Namespace Identifiers
[    5.734198] EXT4-fs (nvme0n1p2): mounted filesystem 2c2f2818-a36a-4edc-91b2-a6df63083de0 r/w with ordered data mode. Quota mode: none.
[    6.142619] EXT4-fs (nvme0n1p2): re-mounted 2c2f2818-a36a-4edc-91b2-a6df63083de0 r/w. Quota mode: none.
[    7.152161] BTRFS: device label StorageNVMe devid 1 transid 326 /dev/nvme2n1 (259:8) scanned by mount (719)
[    7.153046] BTRFS: device label GamesNVMe2 devid 1 transid 100 /dev/nvme1n1p1 (259:1) scanned by mount (718)
[    7.153453] BTRFS info (device nvme2n1): first mount of filesystem e3682e1d-ca3f-4543-989a-da86d2bf0044
[    7.153465] BTRFS info (device nvme2n1): using crc32c (crc32c-intel) checksum algorithm
[    7.153636] BTRFS info (device nvme1n1p1): first mount of filesystem 093fc491-866d-4632-b6e7-91c4f30013a7
[    7.153638] BTRFS info (device nvme1n1p1): using crc32c (crc32c-intel) checksum algorithm
[   19.333266] nvme nvme0: using unchecked data buffer
[   19.399823] block nvme3n1: No UUID available providing old NGUID
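
Since the nvmeN numbering can differ between boots, it helps to map each controller to its PCI address straight from the log instead of guessing. A minimal sketch, assuming the "pci function" line format shown above (`controllers_by_pci` is a made-up helper name):

```python
import re

# Matches lines such as: "nvme nvme3: pci function 0000:09:00.0"
PCI_LINE = re.compile(
    r"nvme (nvme\d+): pci function "
    r"([0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])"
)

def controllers_by_pci(dmesg_text):
    """Map PCI address -> nvme controller name, as announced in dmesg."""
    return {m.group(2): m.group(1) for m in PCI_LINE.finditer(dmesg_text)}
```

Fed the log above, the entry for 0000:09:00.0 (the Realtek controller in the lspci dump) is nvme3, which agrees with fdisk listing the KLEVV drive as /dev/nvme3n1.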

More info to come no doubt.

When the drive fails to initialize under Linux, and is thus excluded from system devices entirely, the following error appears in dmesg:

[ 3.272061] nvme nvme3: pci function 0000:09:00.0
[ 23.304409] nvme nvme3: Device not ready; aborting reset, CSTS=0x1
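
To make the failure easier to spot and report, here is a hedged sketch that finds this message in saved dmesg text and decodes the CSTS value. The bit layout follows the NVMe Controller Status register (RDY bit 0, CFS bit 1, SHST bits 3:2, NSSRO bit 4, PP bit 5). The function names are hypothetical, and the regex assumes the message format shown above.

```python
import re

FAIL_LINE = re.compile(
    r"nvme (nvme\d+): Device not ready; aborting (\w+), CSTS=(0x[0-9a-fA-F]+)"
)
SHST = {0: "normal operation", 1: "shutdown in progress",
        2: "shutdown complete", 3: "reserved"}

def decode_csts(value):
    """Split a CSTS register value into its named fields."""
    return {
        "RDY": bool(value & 0x1),          # controller ready
        "CFS": bool(value & 0x2),          # controller fatal status
        "SHST": SHST[(value >> 2) & 0x3],  # shutdown status
        "NSSRO": bool(value & 0x10),       # NVM subsystem reset occurred
        "PP": bool(value & 0x20),          # processing paused
    }

def find_failures(dmesg_text):
    """Return (controller, operation, decoded CSTS) for each failure line."""
    return [
        (m.group(1), m.group(2), decode_csts(int(m.group(3), 16)))
        for m in FAIL_LINE.finditer(dmesg_text)
    ]
```

For CSTS=0x1 this decodes to RDY set with CFS clear, so the controller is not flagging a fatal error. The two log lines are also about 20 seconds apart, which looks like a timeout. What the driver was waiting for at that point is not confirmed here and would need checking against the kernel source.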

This does not appear to be related to the power management (APST) issue that many other NVMe drives have been hit with, and there is no patch in the kernel that resolves it for this drive.

Reference: Solid state drive/NVMe - ArchWiki

In this state, if I soft boot from GRUB into Windows 11, the drive is typically missing there as well (put into a failed state by Linux for some reason).
A cold boot / hard reset usually kicks it out of this state, and Windows 11 can then see it again.

PS. Because I didn’t build my own kernel, I can’t post on Bugzilla at the moment. However, I have tested several Arch-based distro kernels and the issue is universal. The next thing to test is whether non-Arch distros have this issue.
I do not believe compiling my own kernel will help, because this is intermittent rather than a straight-up failure to work.