Dual NVME Cache Drive Errors



New system build and new to Unraid, so this could be something simple. I've been having a weird issue: whenever I enable cache disks on a share, the cache drives error out and force a reboot. The cache drives are 2x Samsung M.2 NVMe drives set up in a pool for redundancy. Whenever I start a file transfer using Krusader to move files from a network NAS, it will 100% error out and require a reboot if the share is using the cache drives. I've tested with cache drive usage disabled for the share and get no errors.

 

Diagnostics attached; snapshot taken from a non-error state.

Syslog attached from the error state.

 

Any help or insight would be appreciated because I've spent hours trying to isolate this issue and it's driving me nuts.

empunraid-diagnostics-20191110-0800.zip empunraid-syslog-20191110-0700.zip
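For anyone reading along, the failures are usually easy to spot in the syslog by filtering for the NVMe driver. A minimal sketch, assuming the saved syslog lives at /boot/logs/syslog (adjust the path to wherever your copy actually is):

# Pull out NVMe-related kernel messages; controller resets and
# I/O error / timeout entries are the typical signature of a drive dropping off.
grep -iE 'nvme|blk_update_request' /boot/logs/syslog | less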


Update:

I've tried taking both NVMe drives out of the cache pool and clearing them - crashed Unraid.

I've tried taking both NVMe drives out of the cache pool and checking them - crashed Unraid.

I've tried breaking the pool, reformatting a single NVMe drive as XFS, and re-adding it as a cache drive - crashed Unraid. (Rough command-line equivalents of these steps are sketched below.)
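For reference, the GUI steps above correspond roughly to the following commands. This is only a sketch; /dev/nvme0n1 is an assumed device name, so check yours with lsblk before touching anything:

# Wipe any existing filesystem/pool signatures from the drive
wipefs -a /dev/nvme0n1

# Discard all blocks (the "clear" step on an SSD)
blkdiscard /dev/nvme0n1

# Reformat as XFS before re-adding it as a single cache device
mkfs.xfs -f /dev/nvme0n1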

 

Edited by dis3as3d

The strange thing is that it happens to either of the two NVMe drives independently. I've tested both drives on their own and they both drop offline whenever I initiate a large data transfer or I/O-heavy operation. I may be holding out hope, but I'm wondering if this could be a driver issue.


I flashed the BIOS to the latest version last night as well, no dice. Agreed it could be a motherboard/controller issue, but the fact that it only happens under heavy I/O feels more like software. There's also this long thread dating back to 2017 (seriously, why is a bug this old still open?!) about issues with some Samsung NVMe drives and Unraid. It seems strangely similar, and the thread goes on to talk about the sector size on some Samsung NVMe drives causing issues (a quick way to check that is sketched below).
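The sector-size angle can be checked without reformatting by reading the drive's current LBA format. A minimal sketch, assuming nvme-cli and smartmontools are available and the device is /dev/nvme0n1 (both are assumptions):

# List the namespace's LBA formats; the entry marked "(in use)" shows
# the current logical sector size (typically 512 or 4096 bytes).
nvme id-ns /dev/nvme0n1 -H | grep -i 'lba format'

# smartctl reports the same thing as "Namespace 1 Formatted LBA Size"
smartctl -a /dev/nvme0n1 | grep -i 'lba size'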

 

I'm new to Linux, so I've got no clue where to start troubleshooting this. Might just give up on Unraid and run Windows.

 

 

10 minutes ago, uldise said:

Maybe simple overheating on the controller? How are you connecting these drives to the motherboard?

Yes, the motherboard has 3 M.2 slots directly on the board. The drives are in slots 0 and 1 at the moment. The other HDDs are running off an LSI 9211 since the NVMe slots share a bus with the SATA ports. I haven't smelled any burning, and I even ran an IR gun over the board and drives looking for hot spots and didn't find anything. I think I'd have to be very specific about where I'm reading the temperature, so I'm not sure I would have caught the controller overheating.

 

*Edit - The drives crash within 2-5 minutes of starting a heavy I/O operation as well. I'd expect any overheating to take longer than that.

Edited by dis3as3d
update
14 minutes ago, dis3as3d said:

The drives crash within 2-5 minutes of starting a heavy I/O operation as well

That's more than enough time for overheating to set in if your case doesn't have good ventilation.

But you can do a simple test: place a fan near the drives to blow the hot air away from them.

I have no personal experience with NVMe drives, but I've seen many PCIe NVMe adapter cards that come with their own fans.

5 minutes ago, uldise said:

That's more than enough time for overheating to set in if your case doesn't have good ventilation.

But you can do a simple test: place a fan near the drives to blow the hot air away from them.

I have no personal experience with NVMe drives, but I've seen many PCIe NVMe adapter cards that come with their own fans.

The drives themselves aren't overheating. I'll have to look up where the controller is on the board and give that a test.
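One way to narrow this down without an IR gun: NVMe drives report their own temperature sensors over SMART, and on some drives a secondary sensor sits closer to the controller. A minimal sketch, assuming nvme-cli is available and the controller is /dev/nvme0 (assumptions):

# Composite temperature plus any per-sensor readings the drive exposes
nvme smart-log /dev/nvme0 | grep -i temperature

# smartctl shows the same readings along with the warning/critical thresholds
smartctl -a /dev/nvme0 | grep -i temperature

Watching these during a heavy transfer, e.g. with watch -n 5 wrapped around either command, would show whether anything spikes in the 2-5 minutes before the crash.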


New theory: the NVMe drives are dropping into a lower-power standby state (APST), which Samsung drives in particular seem to handle badly under Linux. While in standby you can still do light I/O, and that lines up with my issue: I can see the drive and even write to it a bit, but larger I/O transfers give me problems.
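A quick way to see whether power-state transitions are even in play, sketched with nvme-cli and the usual sysfs path (the device name is an assumption):

# Kernel-side cap on power-state transition latency (typically 100000 by default)
cat /sys/module/nvme_core/parameters/default_ps_max_latency_us

# Power states the drive advertises, with their entry/exit latencies
nvme id-ctrl /dev/nvme0 | grep -A2 '^ps '

# Whether the Autonomous Power State Transition feature is currently enabled
nvme get-feature /dev/nvme0 -f 0x0c -H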

 

The recommended fix is to add this to the syslinux.cfg:

nvme_core.default_ps_max_latency_us=5500

Being new to Linux, I tried it and couldn't get it working. Below is my edited syslinux.cfg. What did I do wrong?

default menu.c32
menu title Lime Technology, Inc.
prompt 0
timeout 50
label Unraid OS
  menu default
  kernel /bzimage
  append initrd=/bzroot
  append nvme_core.default_ps_max_latency_us=5500
label Unraid OS GUI Mode
  kernel /bzimage
  append initrd=/bzroot,/bzroot-gui
label Unraid OS Safe Mode (no plugins, no GUI)
  kernel /bzimage
  append initrd=/bzroot unraidsafemode
label Unraid OS GUI Safe Mode (no plugins)
  kernel /bzimage
  append initrd=/bzroot,/bzroot-gui unraidsafemode
label Memtest86+
  kernel /memtest
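For anyone who finds this later: syslinux only uses one append line per label (a second append line is not merged with the first), so the parameter needs to go on the same line as the initrd. Assuming the stock Unraid syslinux.cfg, the corrected label would look something like this:

label Unraid OS
  menu default
  kernel /bzimage
  append initrd=/bzroot nvme_core.default_ps_max_latency_us=5500

After rebooting, running cat /proc/cmdline should show the parameter if it was picked up.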

 
