Strange NVME Cache issue


Solved by Fluxonium


Hi All, 

 

I am new to Unraid. I swapped from a QNAP NAS after it failed and QNAP wanted £1,040 to attempt a repair (or I could buy a new unit) :(

 

I have the following:

  • Asus PRIME B660-PLUS D4
  • i5-12400
  • 48GB DDR4-3200 (2x 8GB + 2x 16GB)
  • 1x LSI 9211-8i HBA (IT mode, SAS/SATA)
  • 8x Seagate IronWolf 3TB drives (ST3000VN007), connected to the LSI card
  • 2x 1TB WDC WDS100T2B0C in a two-port QNAP NVMe-to-PCIe adapter (moved over from the QNAP NAS)
  • 1x 1TB WDC WDS100T2B0C in a motherboard M.2 slot
  • 1x Intel SSDPEKNW512G8

 

Main Array (8x ST3000VN007, connected to the LSI Card)

  1. Parity (ST3000VN007)
  2. Parity2 (ST3000VN007)
  3. Disks 1-6: ST3000VN007 (xfs)

Cache_ssd (xfs) - used for iso share

  1. INTEL_SSDPEKNW512G8_PHNH9420007Q512A - 512 GB (nvme1n1)

Cache_nvme (xfs) - used for docker, vm and system folders

  1. WDC_WDS100T2B0C-00PXH0_2131CQ471804 - 1 TB (nvme0n1)

Cache_protected (btrfs mirror) - used to cache data to the main array - Connected via the QNAP nvme to PCIe card

  1. WDC_WDS100T2B0C-00PXH0_20427P467007 - 1 TB (nvme2n1) 

  2. WDC_WDS100T2B0C-00PXH0_2052FR446602 - 1 TB (nvme3n1)    

 

I have been testing Unraid before moving any data across, so I copied four movie files (~50-80GB each), and I am having some major issues with the Cache_protected pool.

 

  1. Copying from a Windows PC -> Cache_protected over 1Gb Ethernet works fine (~280GB, 4 files).
  2. Running Mover, Array -> Cache_protected, works every time I have tested it (4-5 times now).
  3. Running Mover, Cache_protected -> Array: after 150-200GB, nvme3n1 (the second disk in Cache_protected) disconnects from Unraid and the log shows the following errors:

 

Mar 27 13:45:39 Obsidian move: file: /mnt/cache_protected/movies/test.mkv
Mar 27 13:53:06 Obsidian kernel: nvme nvme3: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
Mar 27 13:53:06 Obsidian kernel: nvme nvme3: Does your device have a faulty power saving mode enabled?
Mar 27 13:53:06 Obsidian kernel: nvme nvme3: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
Mar 27 13:53:08 Obsidian kernel: nvme3n1: I/O Cmd(0x2) @ LBA 111992656, 1024 blocks, I/O Error (sct 0x3 / sc 0x71) 
Mar 27 13:53:08 Obsidian kernel: I/O error, dev nvme3n1, sector 111992656 op 0x0:(READ) flags 0x84700 phys_seg 5 prio class 2
Mar 27 13:53:08 Obsidian kernel: nvme3n1: I/O Cmd(0x2) @ LBA 111993680, 1024 blocks, I/O Error (sct 0x3 / sc 0x71) 
Mar 27 13:53:08 Obsidian kernel: I/O error, dev nvme3n1, sector 111993680 op 0x0:(READ) flags 0x84700 phys_seg 5 prio class 2
Mar 27 13:53:08 Obsidian kernel: nvme3n1: I/O Cmd(0x2) @ LBA 111994704, 1024 blocks, I/O Error (sct 0x3 / sc 0x71) 
Mar 27 13:53:08 Obsidian kernel: I/O error, dev nvme3n1, sector 111994704 op 0x0:(READ) flags 0x84700 phys_seg 5 prio class 2
Mar 27 13:53:08 Obsidian kernel: nvme3n1: I/O Cmd(0x2) @ LBA 111995728, 1024 blocks, I/O Error (sct 0x3 / sc 0x71) 
Mar 27 13:53:08 Obsidian kernel: I/O error, dev nvme3n1, sector 111995728 op 0x0:(READ) flags 0x84700 phys_seg 5 prio class 2
Mar 27 13:53:08 Obsidian kernel: nvme3n1: I/O Cmd(0x2) @ LBA 111996752, 1024 blocks, I/O Error (sct 0x3 / sc 0x71) 
Mar 27 13:53:08 Obsidian kernel: I/O error, dev nvme3n1, sector 111996752 op 0x0:(READ) flags 0x84700 phys_seg 5 prio class 2
Mar 27 13:53:08 Obsidian kernel: nvme3n1: I/O Cmd(0x2) @ LBA 111997776, 1024 blocks, I/O Error (sct 0x3 / sc 0x71) 
Mar 27 13:53:08 Obsidian kernel: I/O error, dev nvme3n1, sector 111997776 op 0x0:(READ) flags 0x84700 phys_seg 5 prio class 2
Mar 27 13:53:08 Obsidian kernel: nvme3n1: I/O Cmd(0x2) @ LBA 111998800, 1024 blocks, I/O Error (sct 0x3 / sc 0x71) 
Mar 27 13:53:08 Obsidian kernel: I/O error, dev nvme3n1, sector 111998800 op 0x0:(READ) flags 0x84700 phys_seg 24 prio class 2
Mar 27 13:53:08 Obsidian kernel: nvme3n1: I/O Cmd(0x2) @ LBA 111999824, 1024 blocks, I/O Error (sct 0x3 / sc 0x71) 
Mar 27 13:53:08 Obsidian kernel: I/O error, dev nvme3n1, sector 111999824 op 0x0:(READ) flags 0x80700 phys_seg 32 prio class 2
Mar 27 13:53:08 Obsidian kernel: nvme3n1: I/O Cmd(0x2) @ LBA 112000848, 1024 blocks, I/O Error (sct 0x3 / sc 0x71) 
Mar 27 13:53:08 Obsidian kernel: I/O error, dev nvme3n1, sector 112000848 op 0x0:(READ) flags 0x84700 phys_seg 32 prio class 2
Mar 27 13:53:08 Obsidian kernel: nvme3n1: I/O Cmd(0x2) @ LBA 112001872, 1024 blocks, I/O Error (sct 0x3 / sc 0x71) 
Mar 27 13:53:08 Obsidian kernel: I/O error, dev nvme3n1, sector 112001872 op 0x0:(READ) flags 0x84700 phys_seg 20 prio class 2
Mar 27 13:53:08 Obsidian kernel: nvme 0000:09:00.0: Unable to change power state from D3cold to D0, device inaccessible
Mar 27 13:53:08 Obsidian kernel: nvme nvme3: Removing after probe failure status: -19
Mar 27 13:53:08 Obsidian kernel: nvme3n1: detected capacity change from 1953525168 to 0
Mar 27 13:53:08 Obsidian kernel: BTRFS error (device nvme2n1p1): bdev /dev/nvme3n1p1 errs: wr 21837240, rd 7, flush 5, corrupt 1, gen 0
Mar 27 13:53:08 Obsidian kernel: BTRFS error (device nvme2n1p1): bdev /dev/nvme3n1p1 errs: wr 21837240, rd 7, flush 5, corrupt 1, gen 0

 

It is strange that nvme3n1 only disconnects / has issues when moving from Cache_protected to the Array; every other operation works fine.
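One detail in the log worth noting: the "detected capacity change from 1953525168 to 0" line means the whole device dropped off the PCIe bus rather than just a few sectors failing - 1953525168 is the drive's size in 512-byte sectors. A quick sanity check of that number (plain shell arithmetic, no hardware needed):

```shell
# 1953525168 sectors (from the "detected capacity change" log line) x 512 bytes/sector
sectors=1953525168
echo $((sectors * 512))   # prints 1000204886016, i.e. the full 1 TB WD Blue
```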

 

I have removed Parity2 from the array to speed up testing, and the Mover now seems to copy more data before nvme3n1 disconnects. Any ideas or suggestions?

 

I am going to try swapping nvme3n1 and nvme2n1 to see if the error follows the port or the drive.

 

NOTE: I have tested the RAM with memtest overnight and it passed with no errors.

NOTE: The LSI card and all eight drives have been tested in Windows using HD Sentinel (an Extended SMART test followed by two write-read surface scans) - all working without error.

NOTE: I have no docker containers or VMs running.


"nvme_core.default_ps_max_latency_us=0 pcie_aspm=off"

 

I will give it a go, though it seems unlikely to fix this issue, as the other WDS100T2B0C has no problems.
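For anyone following along: in Unraid those parameters go on the append line of /boot/syslinux/syslinux.cfg (Main -> Flash -> Syslinux Configuration). A sketch of what the default boot label would look like afterwards - your existing append line may already carry other options, so add to it rather than replace it:

```
label Unraid OS
  menu default
  kernel /bzimage
  append nvme_core.default_ps_max_latency_us=0 pcie_aspm=off initrd=/bzroot
```

After a reboot, `cat /proc/cmdline` should show both parameters if they were applied.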

 

I have also tested the drives and controller in Windows using HD Sentinel (an Extended SMART test followed by two write-read surface scans) with no issues. I also copied enough data in Windows to fill the drive to 90%, with no issues or warnings, on the same hardware with just a different boot drive - so this looks like an issue with Linux / Unraid for some reason.

 

I am also getting the following in the log from time to time:

 

Mar 27 16:03:29 Obsidian kernel: pcieport 0000:00:1c.4: AER: Multiple Corrected error received: 0000:08:00.0
Mar 27 16:03:29 Obsidian kernel: nvme 0000:08:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Mar 27 16:03:29 Obsidian kernel: nvme 0000:08:00.0:   device [15b7:5009] error status/mask=00000001/0000e000
Mar 27 16:03:29 Obsidian kernel: nvme 0000:08:00.0:    [ 0] RxErr                 
Mar 27 16:24:19 Obsidian kernel: pcieport 0000:00:1c.4: AER: Multiple Corrected error received: 0000:09:00.0
Mar 27 16:24:19 Obsidian kernel: nvme 0000:09:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Mar 27 16:24:19 Obsidian kernel: nvme 0000:09:00.0:   device [15b7:5009] error status/mask=00000001/0000e000
Mar 27 16:24:19 Obsidian kernel: nvme 0000:09:00.0:    [ 0] RxErr                 

 

Device [15b7:5009] corresponds to the two WD drives sitting in the QNAP NVMe-to-PCIe adapter (nvme2n1 and nvme3n1):

 

IOMMU group 22:	[15b7:5009] 08:00.0 Non-Volatile memory controller: Sandisk Corp SanDisk Ultra 3D / WD Blue SN550 NVMe SSD (rev 01)
	[N:2:1:1]    disk    WDC WDS100T2B0C-00PXH0__1                  /dev/nvme2n1  1.00TB
IOMMU group 23:	[15b7:5009] 09:00.0 Non-Volatile memory controller: Sandisk Corp SanDisk Ultra 3D / WD Blue SN550 NVMe SSD (rev 01)
	[N:3:1:1]    disk    WDC WDS100T2B0C-00PXH0__1                  /dev/nvme3n1  1.00TB
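Those corrected errors are worth decoding: in the PCIe AER Correctable Error Status register, bit 0 is Receiver Error, which is exactly what `error status/mask=00000001/0000e000` and the `RxErr` line report - bad TLPs/DLLPs arriving at the physical layer, i.e. a link-integrity problem on the path to the drives. A minimal decode of the status value copied from the log (shell arithmetic only, nothing touches the hardware):

```shell
# AER correctable error status taken from the log above
status=0x00000001
# Bit 0 of the Correctable Error Status register = Receiver Error (RxErr)
if [ $((status & 1)) -ne 0 ]; then
  echo "RxErr"   # prints RxErr
fi
```

Corrected errors are recovered by the link, but a steady stream of RxErr usually points at signal quality (adapter card, seating, or low-power link transitions) rather than the SSDs themselves.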

 

The BIOS has been updated to the latest version, and all drive firmware has been checked in Windows and is up to date.

 

Thanks for your help.

Edited by Fluxonium

Unfortunately, "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" did not work.

 

The strange thing is that I cannot break the drives in Windows; both work perfectly (on the same hardware, other than the boot disk). I am going to try the following, but I cannot see it working unless Windows is doing something Unraid is not (better / more stable drivers):

 

I have even tried sleep / hibernate from Windows, and both drives still work.

 

  1. Remove adapter and swap drive positions (clean drive and adapter contacts)
  2. Remove both drives from the adapter and insert into motherboard and re-test

 

Edited by Fluxonium
  • Solution

OK, after some more testing it looks like the issue is the QNAP QM2-2P-374, as the NVMe drives work in the motherboard slots without error (so far :)).

 

I am guessing this is a Linux / Unraid driver issue, as the card works fine in Windows.
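In case it recurs, the drop has a distinctive syslog signature that is easy to watch for. A small filter, fed here with two sample lines copied from the log earlier in the thread (in practice you would point the same grep at /var/log/syslog):

```shell
# Count occurrences of the controller-drop signature; sample lines from this thread
printf '%s\n' \
  'kernel: nvme nvme3: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff' \
  'kernel: nvme nvme3: Removing after probe failure status: -19' |
grep -cE 'controller is down|Removing after probe failure'   # prints 2
```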

