
NVMe drives disconnecting due to heat?


Go to solution Solved by JorgeB,


Hello, I am having all kinds of issues with my NVMe cache array. I know this setup is probably overkill, but it's what I wanted...

 

I have two 4 TB 990 Pros and two 1 TB NVMe drives (I know they aren't the same size and speed). One of the 990 Pros disconnects under heavy usage. I get some heat warnings, but the 990 Pros have heatsinks and I have tons of fans in a Fractal Meshify XL case. Every other temperature looks great. I have updated the firmware on the drives and even replaced some drives with the same make and model. I'm not sure whether Unraid is doing something here or my motherboard is causing the issues. I've attached the diagnostics file. Thanks in advance for any insight.

flair-diagnostics-20240712-1114.zip

44 minutes ago, JorgeB said:

Run a correcting scrub for the pool and post the results.


UUID:             5fd43c19-2e69-4009-9ee4-b3a556723527
Scrub started:    Fri Jul 12 13:14:29 2024
Status:           finished
Duration:         0:16:19
Total to scrub:   1.99TiB
Rate:             2.08GiB/s
Error summary:    read=4603 csum=1408
  Corrected:      0
  Uncorrectable:  6011
  Unverified:     0

This is the second run. The first run corrected some errors but aborted for some reason.

1 hour ago, JorgeB said:

Look at the syslog for the list of corrupt files; they should be deleted or restored from a backup. Then run another scrub to confirm 0 errors.

Will do. I looked in the syslog and see many errors like:

Jul 12 14:56:15 Flair kernel: BTRFS error (device nvme1n1p1): bdev /dev/nvme1n1p1 errs: wr 1691137, rd 43454, flush 31324, corrupt 354119, gen 5888
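For anyone reading along: the five numbers in a line like this are the per-device btrfs error counters (write, read, flush, corruption and generation errors). A minimal Python sketch for pulling them out of a syslog line, assuming the line format is exactly as shown above; `parse_errs` is just an illustrative name, not part of any tool:

```python
import re

# Pattern for the "bdev ... errs:" summary that btrfs logs per device.
ERRS_RE = re.compile(
    r"bdev (?P<dev>\S+) errs: wr (?P<wr>\d+), rd (?P<rd>\d+), "
    r"flush (?P<flush>\d+), corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)

def parse_errs(line):
    """Return (device, {counter: value}) for a btrfs errs line, or None if it does not match."""
    m = ERRS_RE.search(line)
    if m is None:
        return None
    fields = m.groupdict()
    dev = fields.pop("dev")
    return dev, {name: int(value) for name, value in fields.items()}
```

These counters only say which device is misbehaving and how badly; they do not name the affected files.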

How would I go about locating the corrupted files to remove? (I am super new to Unraid.)

Edited by BIGtoeknee

Look for lines like this:

 

Quote

Jul 13 09:50:56 Flair kernel: BTRFS warning (device nvme1n1p1): checksum error at logical 17587691520 on dev /dev/nvme1n1p1, physical 5516550144, root 5, inode 105772, offset 0, length 4096, links 1 (path: appdata/Plex-Media-Server/Library/Application Support/Plex Media Server/Media/localhost/1/881b9ccef7d74c23387b88ca810de0f432baf99.bundle/Contents/Chapters/chapter13.jpg)

 

They show the path to the corrupt file; those are the files that need to be deleted or restored from a backup.
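If there are a lot of these lines, a small script can collect the paths. This is only a rough Python sketch: it assumes each checksum-error line ends with `(path: ...)` exactly as in the quoted example, and `corrupt_paths` is an illustrative name:

```python
import re

# Capture everything between "(path: " and the final ")" on a checksum-error line.
# The greedy match keeps file names that themselves contain ")".
PATH_RE = re.compile(r"checksum error .*\(path: (.*)\)\s*$")

def corrupt_paths(lines):
    """Return the unique file paths named in checksum-error lines, in first-seen order."""
    seen = {}
    for line in lines:
        m = PATH_RE.search(line)
        if m:
            seen.setdefault(m.group(1), None)
    return list(seen)
```

To use it, pass an open log file, for example `corrupt_paths(open("/var/log/syslog", errors="replace"))`; the log location is an assumption, so adjust it to wherever your syslog is saved. The paths are relative to the pool, as in the example line.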

3 hours ago, JorgeB said:

Look for lines like this:

They show the path to the corrupt file; those are the files that need to be deleted or restored from a backup.

I don't see any paths in the system logs. How did you get that?

  • Solution
Jul 19 02:38:55 Flair kernel: nvme nvme2: Does your device have a faulty power saving mode enabled?
Jul 19 02:38:55 Flair kernel: nvme nvme2: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug

 

Add 

nvme_core.default_ps_max_latency_us=0 pcie_aspm=off

to the syslinux boot options, then power cycle the server to bring the device back (a reboot is usually not enough) and run another scrub.
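After the power cycle it is worth confirming that the kernel actually booted with both options. A minimal Python sketch that checks a kernel command line string, assuming you read it from `/proc/cmdline`; `missing_options` and `REQUIRED` are illustrative names:

```python
# The two options suggested by the kernel message above.
REQUIRED = {
    "nvme_core.default_ps_max_latency_us": "0",
    "pcie_aspm": "off",
}

def missing_options(cmdline, required=REQUIRED):
    """Return the required key=value options that are absent from a kernel command line."""
    present = dict(tok.split("=", 1) for tok in cmdline.split() if "=" in tok)
    return [f"{key}={value}" for key, value in required.items()
            if present.get(key) != value]
```

For example, `missing_options(open("/proc/cmdline").read())` should return an empty list once the options have taken effect; anything it returns still needs to be added to the boot line.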

On 7/19/2024 at 10:30 AM, JorgeB said:
Jul 19 02:38:55 Flair kernel: nvme nvme2: Does your device have a faulty power saving mode enabled?
Jul 19 02:38:55 Flair kernel: nvme nvme2: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug

 

Add 

nvme_core.default_ps_max_latency_us=0 pcie_aspm=off

to the syslinux boot options, then power cycle the server to bring the device back (a reboot is usually not enough) and run another scrub.

I've been putting these drives through the gauntlet since I applied this and they haven't dropped. I believe this is the solution. Thanks!
