SSDs dropping offline

July 16, 2024

So I have this issue where my Unraid machine (6.12.10) starts freezing up after one or more days. The GUI and array still sort of work, but CPU usage is pegged at 100% and Docker containers and VMs are no longer responsive.
I am pretty sure I have nailed the issue down to an SSD that I have running as a pool for my VMs. At first, the issue occurred several times after about one day, when the SSD attempted to go into suspend mode. After setting the spin down delay to 'never' I thought I had fixed the issue, as the machine was stable for over 6 days. However, this morning the issue reappeared. The machine was 'frozen' and the SSD showed a * where the temperature would normally be displayed, so apparently it still went into sleep mode or became unavailable in some other manner.

I remember having this issue with another SSD a year or two ago, but back then setting the spin down delay to 'never' fixed the issue for me. This time I am not so lucky apparently. Is this known behaviour for Unraid, has anyone seen this as well?

The SSD in question is a Kingspec 2TB M.2 SATA SSD, formatted with ZFS, in a single-device pool. I might replace it with a better-known brand if I knew that would fix the issue.

I forgot to copy the log before rebooting the machine, but if it freezes up again (I hope not) I can add it here if that would help in identifying the issue.
For now I am just wondering if this is a known issue, and if there is an (easy) fix for it.

Thx a lot for any thoughts!

Regards, Danny

July 16, 2024

Community Expert

26 minutes ago, Sten3danny said:

but if it freezes up again (I hope not) I can add it here if that would help in identifying the issue.

Do that.

July 16, 2024

Author

Unfortunately I did not have to wait long, I just noticed the issue occurred again.

I have attached the syslog, not sure if it holds any relevant information, the error messages don't mean that much to me.

I would really appreciate any insights! :)

olifant-syslog-20240716-1831.zip

July 17, 2024

Community Expert

Jul 16 18:40:15 Olifant kernel: WARNING: Pool 'kingspec' has encountered an uncorrectable I/O failure and has been suspended.

This device dropped offline, and since there's no redundancy, ZFS suspended the pool. A suspended pool will leave you unable to stop the array and do a clean reboot. Start by replacing the cables for that device.
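
For anyone following along, here is a minimal Python sketch of how one might scan a saved syslog for the pool-suspension message quoted above. The pattern is copied from that kernel line; the function name `find_suspended_pools` and the second sample line are made up for illustration, and the timestamp slicing assumes the classic 15-character syslog prefix.

```python
import re

# Pattern copied from the kernel message quoted in this thread; captures the pool name.
SUSPEND_RE = re.compile(
    r"WARNING: Pool '(?P<pool>[^']+)' has encountered an uncorrectable "
    r"I/O failure and has been suspended"
)


def find_suspended_pools(lines):
    """Return (timestamp, pool) tuples for every pool-suspension message."""
    hits = []
    for line in lines:
        match = SUSPEND_RE.search(line)
        if match:
            # Assumption: lines start with a 15-character "Mon DD HH:MM:SS" stamp.
            hits.append((line[:15].strip(), match.group("pool")))
    return hits


sample = [
    # Real line from the thread:
    "Jul 16 18:40:15 Olifant kernel: WARNING: Pool 'kingspec' has encountered "
    "an uncorrectable I/O failure and has been suspended.",
    # Hypothetical filler line, only here to show non-matching input is skipped:
    "Jul 16 18:40:16 Olifant kernel: some unrelated message",
]

print(find_suspended_pools(sample))
```

To run it against a real log, one could pass an open file instead of the sample list, for example `find_suspended_pools(open("syslog.txt"))`.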

July 17, 2024

Author

Hi JorgeB, thx for your reply. It is an M.2 drive plugged directly into the motherboard, so no cables to replace. Could it be that it is just a bad quality drive? If so, I might try replacing it.

Alternatively, I was thinking to reformat the drive to XFS, because I remember reading that ZFS is not recommended for a single device pool. Or is that not the case any more?
What are your thoughts?

July 17, 2024

Community Expert
Solution

59 minutes ago, Sten3danny said:

Could it be that it is just a bad quality drive?

It's possible, you can try swapping/using a different m.2 slot if available.

1 hour ago, Sten3danny said:

Alternatively, I was thinking to reformat the drive to XFS, because I remember reading that ZFS is not recommended for a single device pool. Or is that not the case any more?

ZFS isn't discouraged for a single-device pool, but if you don't care about checksums or snapshots, XFS is fine. The issue was not caused by ZFS, though; ZFS just adds the problem of not being able to stop the array, which won't happen with btrfs or XFS.

July 17, 2024

Author

Unfortunately, I have no empty m.2 slots to swap it to.

I will try to reformat to xfs because I don't really NEED snapshots. If that doesn't fix the issue, I will probably replace the drive.

Thanks for your help JorgeB!

Oh, one last question: when changing the FS, should I use the 'Erase' function? I am not really sure what that does and couldn't find it in the manual.

July 17, 2024

Community Expert

11 minutes ago, Sten3danny said:

should I use the 'Erase' function?

You can, that's the easiest way.

July 17, 2024

Author

Thx, I will go ahead with that then..

July 27, 2024

Author

So, ten days in after changing the filesystem to XFS, the issue hasn't recurred (yet).
Not sure if I've just been lucky so far, or if this has actually 'fixed' the issue, or if there are any other variables at play here, but I am happy for now.

May 11, 2025

Author

Hi all, I am reopening this thread because again I am having issues with an SSD dropping offline in Unraid. Luckily this time it concerns an SSD in a ZFS pool with redundancy so I have not experienced any crashes or data loss. I have already ordered a replacement drive, hopefully I will not have any issues after replacing it.

However, given that 1) this concerns a brand new SSD from a reputable brand (WD - just a few weeks old) and 2) the drive I previously had issues with has since (over half a year) been sitting happily without issues in my Windows desktop (98% CrystalDiskInfo score), I am wondering if Unraid/Linux/ZFS is just particularly sensitive to SSD imperfections, or if maybe something else is going wrong with my system that I am unaware of.

Can anyone else confirm having similar experiences, or maybe provide me with some useful insights? See also the history of this thread for more background information.

PS - I swapped the drive (power and SATA cable) with an identical drive to confirm that the issue is actually with the drive itself, and not with the cabling.
I am attaching the syslog and also a short snippet of the log file from when the failure occurred.
I would appreciate any and all insights!

syslog.txt failure_log.txt

May 11, 2025

Community Expert

Please post the complete diagnostics.

May 12, 2025

Author

Done!

olifant-diagnostics-20250510-0005.zip

May 12, 2025

Community Expert

May  9 03:17:12 Olifant kernel: ata10.00: revalidation failed (errno=-19)
May  9 03:17:12 Olifant kernel: ata10.00: disable device
May  9 03:17:12 Olifant kernel: sd 11:0:0:0: rejecting I/O to offline device

The device dropped offline; this is typically a power/connection issue.
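
As a rough illustration of how these messages can be pulled out of a long syslog, the Python sketch below groups ATA link messages by port and collects the SCSI addresses reported as offline. The regular expressions are derived only from the three lines quoted above, so other kernel message variants may not match; `summarize_drops` is a hypothetical helper name.

```python
import re
from collections import defaultdict

# Both patterns are derived from the three kernel lines quoted in this thread.
ATA_RE = re.compile(r"kernel: (?P<port>ata\d+)(?:\.\d+)?: (?P<msg>.+)$")
OFFLINE_RE = re.compile(
    r"kernel: sd (?P<scsi>[\d:]+): rejecting I/O to offline device"
)


def summarize_drops(lines):
    """Group ATA messages by port and list SCSI addresses that went offline."""
    by_port = defaultdict(list)
    offline = set()
    for line in lines:
        line = line.rstrip()
        match = ATA_RE.search(line)
        if match:
            by_port[match.group("port")].append(match.group("msg"))
            continue
        match = OFFLINE_RE.search(line)
        if match:
            offline.add(match.group("scsi"))
    return dict(by_port), sorted(offline)


sample = """\
May  9 03:17:12 Olifant kernel: ata10.00: revalidation failed (errno=-19)
May  9 03:17:12 Olifant kernel: ata10.00: disable device
May  9 03:17:12 Olifant kernel: sd 11:0:0:0: rejecting I/O to offline device
""".splitlines()

ports, offline = summarize_drops(sample)
print(ports)
print(offline)
```

If the same port keeps showing up across several drives, that points away from the drive and toward the port, cable, power, or controller side.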

May 12, 2025

Author

Hi JorgeB, thank you for your response.
As I mentioned in my post, I swapped the drive's position with another drive in my system (so power and SATA cable), and still this same drive was dropping offline (and the other was not), which led me to believe the issue is with the drive itself somehow.

Maybe it is a power/connection issue like you say, but then still why is it affecting only this one particular ssd, irrespective of where I put it in my system?
Is there any other information to be extracted from the diagnostics file? I will also have another close look myself, but I am no expert, so any other suggestions will be appreciated!
Thx again!

Edited May 12, 2025 by Sten3danny

May 12, 2025

Community Expert

If you already replaced the cables, try replacing the device; not all types of issues show up in SMART.

May 12, 2025

Author

I replaced the ssd with a new one this morning and resilvered the pool.

I will do a full SMART check of the 'old' device and place it in another system for testing.
Will post what happens..

In the meantime, is there anyone else out there having similar issues?

May 12, 2025

Author

By the way, how can I change the title of this thread? I cannot seem to find it..

May 12, 2025

Community Expert

Long click on the title.

May 17, 2025

Author

Okay, an update:

So I replaced the drive that was dropping offline with a shiny new one a few days ago. All was well for a few days until today the new drive dropped offline.
(Note that previously I had already swapped position (sata and power cable) of the old 'faulty' drive with another drive and still that one drive was dropping offline. Now the new drive dropped offline at that same position. So this is still a bit mysterious to me, but never mind that for now)

For now it seems pretty safe to conclude that the issue was/is not with the drive itself. I didn't really know what to do next, until I realized that today, for the first time since replacing the drive, I had spun up one of my VMs that has a graphics card (and corresponding sound card) passed through.
Could this be somehow related? The VM in itself was working fine. For now I will not spin up this particular VM, to confirm if the drive will still drop offline without the VM running.
I have attached new diagnostics from just after the drive dropping offline, really hope one of you experts can retrieve some useful information from it.
Many thanks in advance!

PS: two facts that might be of interest:
1) Unraid was previously throwing up a VFIO bind error that I did not understand. I have now checked the check boxes next to the graphics card and sound card that are being passed through to the VM and restarted Unraid. They are together in one IOMMU group.
2) This morning I updated to Unraid 7.1.2. Might not have been the smartest move given I am in a troubleshooting process, but it's done.

olifant-diagnostics-20250517-1442.zip

May 18, 2025

Community Expert

18 hours ago, Sten3danny said:

this particular VM

Which VM is it? There are several.

May 18, 2025

Author

Hi JorgeB, it's Jabberwocky.

May 18, 2025

Community Expert

The VM config looks good to me; only the GPU is being passed through, so it should be unrelated. Maybe you have PCIe ACS override enabled; that can sometimes cause issues.

May 18, 2025

Author

Hi JorgeB, thanks for your response.
I do not have PCIe ACS override enabled, but as I mentioned I did see a VFIO bind error popping up every time on startup. I have now checked the checkboxes next to the graphics card and sound card that I am passing through and the error is no longer popping up. But tbh, I don't really understand how this works. I always thought that if the devices that are to be passed through are together in one IOMMU group, it was not needed to bind them, but apparently it is (?)
In any case, it's probably unrelated as you say, but I will not spin up the VM for now to see what happens.
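
For readers with the same question about IOMMU groups and binding, a small Python sketch like the one below can show which devices share a group and which kernel driver each one is currently bound to. It only reads the standard Linux sysfs layout (`/sys/kernel/iommu_groups` and `/sys/bus/pci/devices/<address>/driver`); the function name `iommu_groups` and the `sysfs_root` parameter are hypothetical, and how Unraid's checkboxes map onto this is an assumption rather than something confirmed in this thread.

```python
import os


def iommu_groups(sysfs_root="/sys"):
    """Map IOMMU group number -> list of (PCI address, bound driver or None)."""
    groups_dir = os.path.join(sysfs_root, "kernel", "iommu_groups")
    result = {}
    if not os.path.isdir(groups_dir):
        return result  # IOMMU not enabled, or not a Linux host
    for group in sorted(os.listdir(groups_dir), key=int):
        dev_dir = os.path.join(groups_dir, group, "devices")
        entries = []
        for addr in sorted(os.listdir(dev_dir)):
            # The "driver" entry is a symlink to the bound driver's directory;
            # it is absent when no driver has claimed the device.
            link = os.path.join(sysfs_root, "bus", "pci", "devices", addr, "driver")
            driver = os.path.basename(os.readlink(link)) if os.path.islink(link) else None
            entries.append((addr, driver))
        result[int(group)] = entries
    return result


if __name__ == "__main__":
    for group, devices in iommu_groups().items():
        for addr, driver in devices:
            print(f"group {group}: {addr} driver={driver or 'none'}")
```

A device intended for passthrough would normally be expected to show `vfio-pci` as its driver; sharing an IOMMU group does not by itself change which driver claims a device.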

June 30, 2025

Author

After a few more weeks of fiddling I think I can draw some kind of conclusion. I am now pretty sure that my issues were somehow related to that VFIO bind error. Over the past few weeks I have made hardware changes a few times (such as adding an NVMe drive), and every time that VFIO bind error would pop up and one of my SATA SSDs would drop offline. After rebooting it would reappear, but as an unassigned drive, meaning the ZFS pool to which it belonged would be in an error state. Then, after fixing the VFIO bind error, removing the ZFS pool, recreating the ZFS pool and resilvering, all would be okay again. As long as I don't make any hardware changes and the VFIO bind error does not exist, everything seems hunky-dory.

Not sure if this is an Unraid issue, or something particular to my system, but just thought I would share my findings here.
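
Since the pool ended up in an error state after each hardware change, a quick health check after every reboot may catch this early. The Python sketch below parses the tab-separated output of `zpool list -H -o name,health` and reports any pool that is not `ONLINE`. The pool names in the sample are made up, and `read_pool_health` assumes the `zpool` command is on the PATH.

```python
import subprocess


def parse_pool_health(text):
    """Parse tab-separated 'name<TAB>health' lines into a dict."""
    pools = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        name, health = line.split("\t")[:2]
        pools[name] = health
    return pools


def unhealthy_pools(pools):
    """Return only the pools whose health is anything other than ONLINE."""
    return {name: health for name, health in pools.items() if health != "ONLINE"}


def read_pool_health():
    """Run zpool in scripted mode (-H: no header, tab-separated fields)."""
    out = subprocess.run(
        ["zpool", "list", "-H", "-o", "name,health"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_pool_health(out)


# Hypothetical sample output; these pool names are not from the thread.
sample = "cache\tONLINE\nvmpool\tDEGRADED\n"
print(unhealthy_pools(parse_pool_health(sample)))
```

On a live system, `unhealthy_pools(read_pool_health())` would return an empty dict when every pool is healthy.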
