dis3as3d Posted November 10, 2019

New system build and new to Unraid, so this could be something simple. I've been having a weird issue: whenever I enable cache disks on a share, the cache drives error out and force a reboot. The cache is 2x Samsung M.2 NVMe drives set up as a pool for redundancy. Whenever I start a file transfer using Krusader to pull files from a network NAS, it will 100% error out and require a reboot if the share is using the cache drives. I've tested with cache usage disabled for the share and get no errors.

Attached: diagnostics (snapshot taken in a non-error state) and a syslog from the error state. Any help or insight would be appreciated, because I've spent hours trying to isolate this issue and it's driving me nuts.

empunraid-diagnostics-20191110-0800.zip
empunraid-syslog-20191110-0700.zip
dis3as3d Posted November 11, 2019 (Author)

Update:
I've tried taking both NVMe drives out of the cache pool and clearing them - crashed Unraid.
I've tried taking both NVMe drives out of the cache pool and checking them - crashed Unraid.
I've tried breaking the pool, reformatting a single NVMe drive as XFS, and re-adding it as a cache drive - crashed Unraid.
JorgeB Posted November 11, 2019

One of the NVMe devices is dropping offline; hardware problem.
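For anyone hitting the same symptom, a dropped NVMe controller usually leaves tell-tale lines in the syslog. A minimal sketch of how to look, assuming Unraid's default syslog path (the exact message wording varies by kernel version):

```shell
# Search the syslog for the kernel messages a dropping NVMe device typically
# produces: I/O timeouts, controller resets, or the device being removed.
grep -iE 'nvme.*(timeout|reset|i/o error|removing|frozen)' /var/log/syslog 2>/dev/null || true
```

If lines like "nvme nvme0: I/O ... timeout, reset controller" show up right before the crash, the device itself went away, which points at hardware, firmware, or power management rather than the filesystem.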
dis3as3d Posted November 11, 2019 (Author)

The strange thing is it happens to either of the two NVMe drives independently. I've tested both drives on their own, and each drops offline whenever I initiate a large data transfer or I/O-heavy operation. I may be holding out hope, but I'm wondering if this could be a driver issue.
JorgeB Posted November 11, 2019

Just now, dis3as3d said: "I may be holding out hope, but I'm wondering if this could be a driver issue."

It could, but IMHO more likely a board/controller/BIOS issue.
dis3as3d Posted November 11, 2019 (Author)

I flashed the BIOS to the latest version last night as well; no dice. Agreed it could be a mobo/controller issue, but the fact it only happens under heavy I/O feels more like software. There's also this long thread dating back to 2017 (seriously, why is a bug this old still open?!) about issues with some Samsung NVMe drives and Unraid. It seems strangely similar, and the thread goes on to discuss sector sizes on some Samsung NVMe drives causing issues. I'm new to Linux, so I've got no clue where to start troubleshooting this. Might just give up on Unraid and run Windows.
JorgeB Posted November 11, 2019

31 minutes ago, dis3as3d said: "Seems strangely similar"

Not at all, since no devices drop offline in that case.
uldise Posted November 11, 2019

52 minutes ago, dis3as3d said: "but the fact it only happens under heavy I/O"

Maybe simple overheating of the controller? How are these drives connected to the mobo?
dis3as3d Posted November 11, 2019 (Author)

10 minutes ago, uldise said: "Maybe simple overheating of the controller? How are these drives connected to the mobo?"

The mobo has 3 M.2 slots directly on the board; the drives are in slots 0 and 1 at the moment. The other HDDs run off an LSI 9211, since the NVMe slots share a bus with the SATA ports. I haven't smelled any burning, and I even ran an IR gun over the board and drives looking for hot spots and didn't find anything. I'd probably have to be very specific about where I read the temp, though, so I'm not sure I would've caught the controller overheating.

Edit: the drives crash within 2-5 minutes of starting a heavy I/O operation. I'd expect any overheating to take longer than that.
uldise Posted November 11, 2019

14 minutes ago, dis3as3d said: "The drives crash within 2-5 minutes of starting a heavy I/O operation"

That's more than enough time to start overheating if your case doesn't have good ventilation. But you can run a simple test: place a fan near the drives to blow the hot air away. I have no personal experience with NVMe drives, but I see many PCIe NVMe adapters that come with their own fans.
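A more direct check than an IR gun is to read the drive's own temperature sensor during a heavy transfer. A hedged sketch; smartctl ships with Unraid, and `/dev/nvme0` is an assumption here (adjust to your device, e.g. `/dev/nvme1` for the second drive):

```shell
# Print the NVMe health-log lines that matter for the overheating theory:
# the composite temperature and the thermal warning/critical time counters.
smartctl -A /dev/nvme0 2>/dev/null | grep -iE 'temperature|warning' || true

# nvme-cli reports the same log, if it's installed:
nvme smart-log /dev/nvme0 2>/dev/null | grep -iE 'temperature|warning' || true
```

Run it in a second shell while the transfer is choking; if the temperature is still well under the drive's throttle threshold when it drops offline, heat is probably not the culprit.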
dis3as3d Posted November 11, 2019 (Author)

5 minutes ago, uldise said: "That's more than enough time to start overheating if your case doesn't have good ventilation. But you can run a simple test: place a fan near the drives to blow the hot air away."

The drives themselves aren't overheating. I'll have to look up where the controller is on the board and give that a test.
dis3as3d Posted November 11, 2019 (Author)

Oh, it looks like NVMe is handled by the southbridge and not a separate chip. I definitely pointed the IR gun at the southbridge, so I don't think that's the issue.
dis3as3d Posted November 12, 2019 (Author)

New theory: the NVMe drives go into a lower-power standby mode, and Samsung drives in particular seem to give Linux problems with it. While in standby you can still do low-I/O transfers, which lines up with my issue: I can see the drive and even write to it some, but larger I/O transfers give me problems. The recommended fix is to add this to syslinux.cfg:

nvme_core.default_ps_max_latency_us=5500

Being new to Linux, I tried and couldn't get it working. Below is my edited syslinux.cfg. What did I do wrong?

default menu.c32
menu title Lime Technology, Inc.
prompt 0
timeout 50
label Unraid OS
  menu default
  kernel /bzimage
  append initrd=/bzroot
  append nvme_core.default_ps_max_latency_us=5500
label Unraid OS GUI Mode
  kernel /bzimage
  append initrd=/bzroot,/bzroot-gui
label Unraid OS Safe Mode (no plugins, no GUI)
  kernel /bzimage
  append initrd=/bzroot unraidsafemode
label Unraid OS GUI Safe Mode (no plugins)
  kernel /bzimage
  append initrd=/bzroot,/bzroot-gui unraidsafemode
label Memtest86+
  kernel /memtest
uldise Posted November 12, 2019

Don't use a second append line; just add it at the end of the first one.
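In other words, the Unraid OS entry should carry a single space-separated append line (only the first label shown here; the rest of the file stays as-is):

```
label Unraid OS
  menu default
  kernel /bzimage
  append initrd=/bzroot nvme_core.default_ps_max_latency_us=5500
```

Syslinux treats each append directive as a complete kernel command line, so a second append replaces the first rather than adding to it, and the initrd= part gets lost.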
dis3as3d Posted November 12, 2019 (Author)

Is it space-delimited? Just a space and then nvme_core...?
uldise Posted November 12, 2019

I think so, see https://wiki.unraid.net/Boot_Codes
dis3as3d Posted November 12, 2019 (Author)

2 minutes ago, uldise said: "I think so, see https://wiki.unraid.net/Boot_Codes"

Thanks for finding this!
dis3as3d Posted November 13, 2019 (Author)

Well, that didn't work; back to square one. I tried both of the below and the drive still crashes. Anyone got any ideas?

nvme_core.default_ps_max_latency_us=0
nvme_core.default_ps_max_latency_us=5500
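Before ruling the parameter out, it's worth confirming it actually took effect after the reboot. A quick sketch using standard Linux procfs/sysfs paths (these exist on any box where the nvme_core module is loaded):

```shell
# If the syslinux.cfg edit worked, the flag appears on the kernel command line:
grep -o 'nvme_core\.default_ps_max_latency_us=[0-9]*' /proc/cmdline 2>/dev/null || true

# And the module reports the value it actually loaded with
# (should print 5500, or 0 if you disabled power-state transitions entirely):
cat /sys/module/nvme_core/parameters/default_ps_max_latency_us 2>/dev/null || true
```

If the first command prints nothing, the edit never reached the kernel (wrong boot entry, or the file wasn't saved to the flash drive); if it prints the value but the drives still drop, the power-state theory itself is likely wrong for this hardware.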
dis3as3d Posted November 13, 2019 (Author)

One other thing I tried: updating to the unstable version of Unraid. Still had issues.
LynxNZ Posted August 23, 2020

Sorry to necro, but this is the exact issue I'm facing with an NVMe drive at the moment. Fine under small loads, then it chokes when things get busy. XFS, single cache drive. Expecting it to be a hardware issue at this point.
trurl Posted August 23, 2020

The latest beta has a different partition alignment for SSDs, which may improve performance with some devices.