BIGtoeknee Posted July 12

Hello, I am having all kinds of issues with my NVMe cache pool. I know I am probably going overkill here, but it's what I wanted: two 4 TB Samsung 990 Pros and two 1 TB NVMe drives (I know they aren't the same size and speed). One of the 990 Pros disconnects under heavy usage. I get some heat warnings, but these 990 Pros have heatsinks and I have plenty of fans in a Fractal Meshify XL case; everything else is fine temperature-wise. I have updated the firmware on the drives and even replaced some drives with the same make and model. Not sure if this is something Unraid is doing or if my motherboard is causing the issues. I've included the diag file. Thanks in advance for any insight.

flair-diagnostics-20240712-1114.zip
JorgeB Posted July 12

The logs are being spammed with btrfs checksum errors. Reboot and post new diags after array start.
BIGtoeknee (Author) Posted July 12

10 minutes ago, JorgeB said:
Logs are being spammed with btrfs checksum errors, reboot and post new diags after array start

New diagnostics after a fresh reboot.

flair-diagnostics-20240712-1210.zip
JorgeB Posted July 12

Run a correcting scrub for the pool and post the results.
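For reference, the scrub can also be run from a terminal instead of the pool's GUI page; a rough sketch, assuming the pool is mounted at /mnt/cache (substitute your own pool name):

    btrfs scrub start /mnt/cache     # repairs correctable errors by default when the pool is mounted read-write
    btrfs scrub status /mnt/cache    # check progress and the error summary once it finishes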
BIGtoeknee (Author) Posted July 12

44 minutes ago, JorgeB said:
Run a correcting scrub for the pool and post the results.

UUID: 5fd43c19-2e69-4009-9ee4-b3a556723527
Scrub started: Fri Jul 12 13:14:29 2024
Status: finished
Duration: 0:16:19
Total to scrub: 1.99TiB
Rate: 2.08GiB/s
Error summary: read=4603 csum=1408
Corrected: 0
Uncorrectable: 6011
Unverified: 0

This is the second run. The first run corrected some errors but aborted for some reason.
JorgeB Posted July 12

Check the syslog for the list of corrupt files; they should be deleted or restored from a backup. Then run another scrub to confirm 0 errors.
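If you're comfortable with the terminal, the relevant warnings can be pulled out with something like this (assuming the standard /var/log/syslog location on Unraid):

    grep 'checksum error' /var/log/syslog | less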
BIGtoeknee (Author) Posted July 12 (edited)

1 hour ago, JorgeB said:
Look at the syslog for the corrupt file list, they should be deleted or restore from a backup, then run another scrub to confirm 0 errors.

Will do. I looked in the syslog and I see many errors like:

Jul 12 14:56:15 Flair kernel: BTRFS error (device nvme1n1p1): bdev /dev/nvme1n1p1 errs: wr 1691137, rd 43454, flush 31324, corrupt 354119, gen 5888

How would I go about locating the corrupted files to remove (I am super new to Unraid)?
BIGtoeknee (Author) Posted July 13

5 hours ago, JorgeB said:
Post the current diagnostics.

New diagnostics:

flair-diagnostics-20240713-0955.zip
JorgeB Posted July 14

Look for lines like this:

Jul 13 09:50:56 Flair kernel: BTRFS warning (device nvme1n1p1): checksum error at logical 17587691520 on dev /dev/nvme1n1p1, physical 5516550144, root 5, inode 105772, offset 0, length 4096, links 1 (path: appdata/Plex-Media-Server/Library/Application Support/Plex Media Server/Media/localhost/1/881b9ccef7d74c23387b88ca810de0f432baf99.bundle/Contents/Chapters/chapter13.jpg)

They show the path to the corrupt file; those are the files that need to be deleted or restored from a backup.
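If there are a lot of them, a rough one-liner like this should print just the file paths (assuming your grep supports -P and the messages follow the format above):

    grep 'checksum error at logical' /var/log/syslog | grep -oP '\(path: \K[^)]+' | sort -u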
BIGtoeknee (Author) Posted July 14

3 hours ago, JorgeB said:
Look for lines like this:
They show the path to the corrupt file, those are the ones that need to be deleted/restored from a backup.

I don't see any paths in the system logs. How did you get that?
JorgeB Posted July 14

That's from the diags you posted; see the date and time.
BIGtoeknee (Author) Posted July 19

Got it. I was able to remove those files and it seemed fine, but then another NVMe drive dropped out on me.

flair-diagnostics-20240719-1018.zip
JorgeB Posted July 19 (Solution)

Jul 19 02:38:55 Flair kernel: nvme nvme2: Does your device have a faulty power saving mode enabled?
Jul 19 02:38:55 Flair kernel: nvme nvme2: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug

Add nvme_core.default_ps_max_latency_us=0 pcie_aspm=off to the syslinux boot options, then power cycle the server to bring the device back (a reboot is usually not enough), and run another scrub.
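For anyone finding this later: the boot options are edited on the Main tab by clicking the flash device and scrolling to the Syslinux Configuration section. A sketch of what the default boot entry looks like with the parameters added (your existing append line may already contain other options; keep those and just add these two):

    label Unraid OS
      menu default
      kernel /bzimage
      append nvme_core.default_ps_max_latency_us=0 pcie_aspm=off initrd=/bzroot

After the power cycle you can confirm the parameters took effect with cat /proc/cmdline.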
BIGtoeknee (Author) Posted July 25

On 7/19/2024 at 10:30 AM, JorgeB said:
Add nvme_core.default_ps_max_latency_us=0 pcie_aspm=off to the syslinux boot options, then power cycle the server to bring the device back, a reboot is usually not enough, and run another scrub.

I've been putting these drives through the gauntlet since I applied this and they haven't dropped. I believe this is the solution. Thanks!