Strange NVME Cache issue


Solved by Fluxonium


Hi All, 

 

I am new to Unraid. I swapped from a QNAP NAS after it failed and QNAP wanted £1,040 to attempt a repair (or I could buy a new unit) :(

 

I have the following:

  • Asus PRIME B660-PLUS D4
  • i5-12400
  • 48GB DDR4-3200 (2x 8GB + 2x 16GB)
  • 1x LSI 9211-8i HBA (IT mode, SAS/SATA)
  • 8x Seagate IronWolf 3TB drives (ST3000VN007), connected to the LSI card
  • 2x 1TB WDC WDS100T2B0C in a two-port QNAP NVMe-to-PCIe adapter (moved over from the QNAP NAS)
  • 1x 1TB WDC WDS100T2B0C in a motherboard M.2 slot
  • 1x Intel SSDPEKNW512G8

 

Main Array (8x ST3000VN007, connected to the LSI Card)

  1. Parity (ST3000VN007)
  2. Parity2 (ST3000VN007)
  3. Disks 1-6: ST3000VN007 (xfs)

Cache_ssd (xfs) - used for iso share

  1. INTEL_SSDPEKNW512G8_PHNH9420007Q512A - 512 GB (nvme1n1)

Cache_nvme (xfs) - used for docker, vm and system folders

  1. WDC_WDS100T2B0C-00PXH0_2131CQ471804 - 1 TB (nvme0n1)

Cache_protected (btrfs mirror) - used to cache data to the main array - Connected via the QNAP nvme to PCIe card

  1. WDC_WDS100T2B0C-00PXH0_20427P467007 - 1 TB (nvme2n1) 

  2. WDC_WDS100T2B0C-00PXH0_2052FR446602 - 1 TB (nvme3n1)    

 

I have been testing Unraid before moving any data across, so I copied four movie files (~50-80GB each), and I am having some major issues with the Cache_protected pool.

 

  1. Copying from a Windows PC -> Cache_protected over 1Gb Ethernet works fine (~280GB, 4 files).
  2. Running Mover, Array -> Cache_protected, works every time I have tested it (4-5 times now).
  3. Running Mover, Cache_protected -> Array: after 150-200GB, nvme3n1 (the second disk in Cache_protected) disconnects from Unraid and the log shows the following errors:

 

Mar 27 13:45:39 Obsidian move: file: /mnt/cache_protected/movies/test.mkv
Mar 27 13:53:06 Obsidian kernel: nvme nvme3: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
Mar 27 13:53:06 Obsidian kernel: nvme nvme3: Does your device have a faulty power saving mode enabled?
Mar 27 13:53:06 Obsidian kernel: nvme nvme3: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
Mar 27 13:53:08 Obsidian kernel: nvme3n1: I/O Cmd(0x2) @ LBA 111992656, 1024 blocks, I/O Error (sct 0x3 / sc 0x71) 
Mar 27 13:53:08 Obsidian kernel: I/O error, dev nvme3n1, sector 111992656 op 0x0:(READ) flags 0x84700 phys_seg 5 prio class 2
Mar 27 13:53:08 Obsidian kernel: nvme3n1: I/O Cmd(0x2) @ LBA 111993680, 1024 blocks, I/O Error (sct 0x3 / sc 0x71) 
Mar 27 13:53:08 Obsidian kernel: I/O error, dev nvme3n1, sector 111993680 op 0x0:(READ) flags 0x84700 phys_seg 5 prio class 2
Mar 27 13:53:08 Obsidian kernel: nvme3n1: I/O Cmd(0x2) @ LBA 111994704, 1024 blocks, I/O Error (sct 0x3 / sc 0x71) 
Mar 27 13:53:08 Obsidian kernel: I/O error, dev nvme3n1, sector 111994704 op 0x0:(READ) flags 0x84700 phys_seg 5 prio class 2
Mar 27 13:53:08 Obsidian kernel: nvme3n1: I/O Cmd(0x2) @ LBA 111995728, 1024 blocks, I/O Error (sct 0x3 / sc 0x71) 
Mar 27 13:53:08 Obsidian kernel: I/O error, dev nvme3n1, sector 111995728 op 0x0:(READ) flags 0x84700 phys_seg 5 prio class 2
Mar 27 13:53:08 Obsidian kernel: nvme3n1: I/O Cmd(0x2) @ LBA 111996752, 1024 blocks, I/O Error (sct 0x3 / sc 0x71) 
Mar 27 13:53:08 Obsidian kernel: I/O error, dev nvme3n1, sector 111996752 op 0x0:(READ) flags 0x84700 phys_seg 5 prio class 2
Mar 27 13:53:08 Obsidian kernel: nvme3n1: I/O Cmd(0x2) @ LBA 111997776, 1024 blocks, I/O Error (sct 0x3 / sc 0x71) 
Mar 27 13:53:08 Obsidian kernel: I/O error, dev nvme3n1, sector 111997776 op 0x0:(READ) flags 0x84700 phys_seg 5 prio class 2
Mar 27 13:53:08 Obsidian kernel: nvme3n1: I/O Cmd(0x2) @ LBA 111998800, 1024 blocks, I/O Error (sct 0x3 / sc 0x71) 
Mar 27 13:53:08 Obsidian kernel: I/O error, dev nvme3n1, sector 111998800 op 0x0:(READ) flags 0x84700 phys_seg 24 prio class 2
Mar 27 13:53:08 Obsidian kernel: nvme3n1: I/O Cmd(0x2) @ LBA 111999824, 1024 blocks, I/O Error (sct 0x3 / sc 0x71) 
Mar 27 13:53:08 Obsidian kernel: I/O error, dev nvme3n1, sector 111999824 op 0x0:(READ) flags 0x80700 phys_seg 32 prio class 2
Mar 27 13:53:08 Obsidian kernel: nvme3n1: I/O Cmd(0x2) @ LBA 112000848, 1024 blocks, I/O Error (sct 0x3 / sc 0x71) 
Mar 27 13:53:08 Obsidian kernel: I/O error, dev nvme3n1, sector 112000848 op 0x0:(READ) flags 0x84700 phys_seg 32 prio class 2
Mar 27 13:53:08 Obsidian kernel: nvme3n1: I/O Cmd(0x2) @ LBA 112001872, 1024 blocks, I/O Error (sct 0x3 / sc 0x71) 
Mar 27 13:53:08 Obsidian kernel: I/O error, dev nvme3n1, sector 112001872 op 0x0:(READ) flags 0x84700 phys_seg 20 prio class 2
Mar 27 13:53:08 Obsidian kernel: nvme 0000:09:00.0: Unable to change power state from D3cold to D0, device inaccessible
Mar 27 13:53:08 Obsidian kernel: nvme nvme3: Removing after probe failure status: -19
Mar 27 13:53:08 Obsidian kernel: nvme3n1: detected capacity change from 1953525168 to 0
Mar 27 13:53:08 Obsidian kernel: BTRFS error (device nvme2n1p1): bdev /dev/nvme3n1p1 errs: wr 21837240, rd 7, flush 5, corrupt 1, gen 0
Mar 27 13:53:08 Obsidian kernel: BTRFS error (device nvme2n1p1): bdev /dev/nvme3n1p1 errs: wr 21837240, rd 7, flush 5, corrupt 1, gen 0

 

It is strange that nvme3n1 only disconnects / has issues when moving from Cache_protected to the Array; every other operation works fine.
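One detail in the log worth noting: the "detected capacity change from 1953525168 to 0" line means the whole device dropped off the PCIe bus rather than just a few sectors failing - 1953525168 is the drive's size in 512-byte sectors. A quick sanity check of that number (plain shell arithmetic, no hardware needed):

```shell
# 1953525168 sectors (from the "detected capacity change" log line) x 512 bytes/sector
sectors=1953525168
echo $((sectors * 512))   # prints 1000204886016, i.e. the full 1 TB WD Blue
```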

 

I have removed Parity2 from the array to speed up testing, and the Mover now seems to copy more data before nvme3n1 disconnects. Any ideas or suggestions?

 

I am going to try swapping nvme3n1 and nvme2n1 to see if the error follows the port or the drive.

 

NOTE: I have tested the RAM with memtest overnight and it passed with no errors.

NOTE: The LSI card and all eight drives have been tested in Windows using HD Sentinel (an Extended SMART test followed by two write-read surface scans) - all working without error.

NOTE: I have no docker containers or VMs running.


"nvme_core.default_ps_max_latency_us=0 pcie_aspm=off"

 

I will give it a go, though it seems unlikely to fix this issue, as the other WDS100T2B0C has no problems.
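For anyone following along: in Unraid those parameters go on the append line of /boot/syslinux/syslinux.cfg (Main -> Flash -> Syslinux Configuration). A sketch of what the default boot label would look like afterwards - your existing append line may already carry other options, so add to it rather than replace it:

```
label Unraid OS
  menu default
  kernel /bzimage
  append nvme_core.default_ps_max_latency_us=0 pcie_aspm=off initrd=/bzroot
```

After a reboot, `cat /proc/cmdline` should show both parameters if they were applied.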

 

I have also tested the drives and controller in Windows using HD Sentinel (an Extended SMART test followed by two write-read surface scans) with no issues. I also copied enough data in Windows to fill the drive to 90%, with no issues or warnings, on the same hardware with just a different boot drive - so this looks like an issue with Linux / Unraid for some reason.

 

I am also getting the following in the log from time to time:

 

Mar 27 16:03:29 Obsidian kernel: pcieport 0000:00:1c.4: AER: Multiple Corrected error received: 0000:08:00.0
Mar 27 16:03:29 Obsidian kernel: nvme 0000:08:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Mar 27 16:03:29 Obsidian kernel: nvme 0000:08:00.0:   device [15b7:5009] error status/mask=00000001/0000e000
Mar 27 16:03:29 Obsidian kernel: nvme 0000:08:00.0:    [ 0] RxErr                 
Mar 27 16:24:19 Obsidian kernel: pcieport 0000:00:1c.4: AER: Multiple Corrected error received: 0000:09:00.0
Mar 27 16:24:19 Obsidian kernel: nvme 0000:09:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Mar 27 16:24:19 Obsidian kernel: nvme 0000:09:00.0:   device [15b7:5009] error status/mask=00000001/0000e000
Mar 27 16:24:19 Obsidian kernel: nvme 0000:09:00.0:    [ 0] RxErr                 

 

Device [15b7:5009] corresponds to the two WD drives sitting in the QNAP NVMe-to-PCIe adapter (nvme2n1 and nvme3n1):

 

IOMMU group 22:	[15b7:5009] 08:00.0 Non-Volatile memory controller: Sandisk Corp SanDisk Ultra 3D / WD Blue SN550 NVMe SSD (rev 01)
	[N:2:1:1]    disk    WDC WDS100T2B0C-00PXH0__1                  /dev/nvme2n1  1.00TB
IOMMU group 23:	[15b7:5009] 09:00.0 Non-Volatile memory controller: Sandisk Corp SanDisk Ultra 3D / WD Blue SN550 NVMe SSD (rev 01)
	[N:3:1:1]    disk    WDC WDS100T2B0C-00PXH0__1                  /dev/nvme3n1  1.00TB
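Those corrected errors are worth decoding: in the PCIe AER Correctable Error Status register, bit 0 is Receiver Error, which is exactly what `error status/mask=00000001/0000e000` and the `RxErr` line report - bad TLPs/DLLPs arriving at the physical layer, i.e. a link-integrity problem on the path to the drives. A minimal decode of the status value copied from the log (shell arithmetic only, nothing touches the hardware):

```shell
# AER correctable error status taken from the log above
status=0x00000001
# Bit 0 of the Correctable Error Status register = Receiver Error (RxErr)
if [ $((status & 1)) -ne 0 ]; then
  echo "RxErr"   # prints RxErr
fi
```

Corrected errors are recovered by the link, but a steady stream of RxErr usually points at signal quality (adapter card, seating, or low-power link transitions) rather than the SSDs themselves.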

 

The BIOS has been updated to the latest version, and all drive firmware has been checked in Windows and is up to date.

 

Thanks for your help.

Edited by Fluxonium

Unfortunately, "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" did not work.

 

The strange thing is that I cannot break the drives in Windows; both work perfectly (on the same hardware, other than the boot disk). I am going to try the following, but I cannot see it working unless Windows is doing something Unraid is not (better / more stable drivers):

 

I have even tried sleep / hibernate from Windows, and both drives still work.

 

  1. Remove adapter and swap drive positions (clean drive and adapter contacts)
  2. Remove both drives from the adapter and insert into motherboard and re-test

 

Edited by Fluxonium
  • Solution

OK, after some more testing it looks like the issue is the QNAP QM2-2P-374, as the NVMe drives work in the motherboard slots without error (so far :)).

 

I am guessing this is a Linux / Unraid driver issue, as the card works fine in Windows.
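In case it recurs, the drop has a distinctive syslog signature that is easy to watch for. A small filter, fed here with two sample lines copied from the log earlier in the thread (in practice you would point the same grep at /var/log/syslog):

```shell
# Count occurrences of the controller-drop signature; sample lines from this thread
printf '%s\n' \
  'kernel: nvme nvme3: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff' \
  'kernel: nvme nvme3: Removing after probe failure status: -19' |
grep -cE 'controller is down|Removing after probe failure'   # prints 2
```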

