BIGtoeknee Posted July 12

Hello, I am having all kinds of issues with my NVMe cache pool. I know I am probably going overkill here, but it's what I wanted: two 4 TB Samsung 990 Pros and two 1 TB NVMe drives (I know they aren't the same size and speed). One of the 990 Pros disconnects under heavy usage. I get some heat warnings, but these 990 Pros have heatsinks and I have plenty of fans in a Fractal Meshify XL case; everything else is fine temperature-wise. I have updated the firmware on the drives and even replaced some drives with the same make and model. Not sure if this is something Unraid is doing or if my motherboard is causing the issues. I've included the diag file. Thanks in advance for any insight.

flair-diagnostics-20240712-1114.zip
JorgeB Posted July 12

The logs are being spammed with btrfs checksum errors. Reboot and post new diags after array start.
BIGtoeknee (Author) Posted July 12

10 minutes ago, JorgeB said:
Logs are being spammed with btrfs checksum errors, reboot and post new diags after array start

New diagnostics after a fresh reboot.

flair-diagnostics-20240712-1210.zip
JorgeB Posted July 12

Run a correcting scrub for the pool and post the results.
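For reference, the scrub can also be run from a terminal instead of the pool's GUI page; a rough sketch, assuming the pool is mounted at /mnt/cache (substitute your own pool name):

    btrfs scrub start /mnt/cache     # repairs correctable errors by default when the pool is mounted read-write
    btrfs scrub status /mnt/cache    # check progress and the error summary once it finishes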
BIGtoeknee (Author) Posted July 12

44 minutes ago, JorgeB said:
Run a correcting scrub for the pool and post the results.

UUID: 5fd43c19-2e69-4009-9ee4-b3a556723527
Scrub started: Fri Jul 12 13:14:29 2024
Status: finished
Duration: 0:16:19
Total to scrub: 1.99TiB
Rate: 2.08GiB/s
Error summary: read=4603 csum=1408
Corrected: 0
Uncorrectable: 6011
Unverified: 0

This is the second run. The first run corrected some errors but aborted for some reason.
JorgeB Posted July 12

Check the syslog for the list of corrupt files; they should be deleted or restored from a backup. Then run another scrub to confirm 0 errors.
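If you're comfortable with the terminal, the relevant warnings can be pulled out with something like this (assuming the standard /var/log/syslog location on Unraid):

    grep 'checksum error' /var/log/syslog | less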
BIGtoeknee (Author) Posted July 12 (edited)

1 hour ago, JorgeB said:
Look at the syslog for the corrupt file list, they should be deleted or restore from a backup, then run another scrub to confirm 0 errors.

Will do. I looked in the syslog and I see many errors like:

Jul 12 14:56:15 Flair kernel: BTRFS error (device nvme1n1p1): bdev /dev/nvme1n1p1 errs: wr 1691137, rd 43454, flush 31324, corrupt 354119, gen 5888

How would I go about locating the corrupted files to remove (I am super new to Unraid)?
BIGtoeknee (Author) Posted July 13

5 hours ago, JorgeB said:
Post the current diagnostics.

New diagnostics:

flair-diagnostics-20240713-0955.zip
JorgeB Posted July 14

Look for lines like this:

Jul 13 09:50:56 Flair kernel: BTRFS warning (device nvme1n1p1): checksum error at logical 17587691520 on dev /dev/nvme1n1p1, physical 5516550144, root 5, inode 105772, offset 0, length 4096, links 1 (path: appdata/Plex-Media-Server/Library/Application Support/Plex Media Server/Media/localhost/1/881b9ccef7d74c23387b88ca810de0f432baf99.bundle/Contents/Chapters/chapter13.jpg)

They show the path to the corrupt file; those are the files that need to be deleted or restored from a backup.
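If there are a lot of them, a rough one-liner like this should print just the file paths (assuming your grep supports -P and the messages follow the format above):

    grep 'checksum error at logical' /var/log/syslog | grep -oP '\(path: \K[^)]+' | sort -u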
BIGtoeknee (Author) Posted July 14

3 hours ago, JorgeB said:
Look for lines like this:
They show the path to the corrupt file, those are the ones that need to be deleted/restored from a backup.

I don't see any paths in the system logs. How did you get that?
JorgeB Posted July 14

That's from the diags you posted; see the date and time.
BIGtoeknee (Author) Posted July 19

Got it. I was able to remove those files and it seemed fine, but then another NVMe drive dropped out on me.

flair-diagnostics-20240719-1018.zip
JorgeB Posted July 19 (Solution)

Jul 19 02:38:55 Flair kernel: nvme nvme2: Does your device have a faulty power saving mode enabled?
Jul 19 02:38:55 Flair kernel: nvme nvme2: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug

Add nvme_core.default_ps_max_latency_us=0 pcie_aspm=off to the syslinux boot options, then power cycle the server to bring the device back (a reboot is usually not enough), and run another scrub.
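For anyone finding this later: the boot options are edited on the Main tab by clicking the flash device and scrolling to the Syslinux Configuration section. A sketch of what the default boot entry looks like with the parameters added (your existing append line may already contain other options; keep those and just add these two):

    label Unraid OS
      menu default
      kernel /bzimage
      append nvme_core.default_ps_max_latency_us=0 pcie_aspm=off initrd=/bzroot

After the power cycle you can confirm the parameters took effect with cat /proc/cmdline.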
BIGtoeknee (Author) Posted July 25

On 7/19/2024 at 10:30 AM, JorgeB said:
Add nvme_core.default_ps_max_latency_us=0 pcie_aspm=off to the syslinux boot options, then power cycle the server to bring the device back, a reboot is usually not enough, and run another scrub.

I've been putting these drives through the gauntlet since I applied this and they haven't dropped. I believe this is the solution. Thanks!