July 16, 2024

So I have this issue where my Unraid machine (6.12.10) starts freezing up after one or more days. The GUI and array still sort of work, but CPU usage is pegged at 100% and Docker containers and VMs are no longer responsive. I am pretty sure I have nailed the issue down to an SSD that I have running as a pool for my VMs.

At first, the issue occurred several times after about one day, when the SSD attempted to go into suspend mode. After setting the spin down delay to 'never' I thought I had fixed the issue, as the machine was stable for over 6 days. However, this morning the issue reappeared. The machine was 'frozen' and the SSD showed a * where the temperature would normally be displayed, so apparently it still went into sleep mode or became unavailable in some other manner.

I remember having this issue with another SSD a year or two ago, but back then setting the spin down delay to 'never' fixed the issue for me. This time I am not so lucky, apparently. Is this known behaviour for Unraid? Has anyone else seen this as well?

The SSD in question is a Kingspec 2TB m.2 SATA SSD, formatted with ZFS, in a single-device pool. I might replace it with a better-known brand if I knew that would fix the issue.

I forgot to copy the log before rebooting the machine, but if it freezes up again (don't hope so) I can add it here if that would help in identifying the issue. For now I am just wondering if this is a known issue, and if there is an (easy) fix for it.

Thx a lot for any thoughts!

Regards, Danny
July 16, 2024 Community Expert

26 minutes ago, Sten3danny said: but if it freezes up again (don't hope so) I can add it here if that would help in identifying the issue.

Do that.
July 16, 2024 Author

Unfortunately I did not have to wait long: I just noticed the issue occurred again. I have attached the syslog. I am not sure if it holds any relevant information, as the error messages don't mean that much to me. I would really appreciate any insights! :)

Attachment: olifant-syslog-20240716-1831.zip
July 17, 2024 Community Expert

Jul 16 18:40:15 Olifant kernel: WARNING: Pool 'kingspec' has encountered an uncorrectable I/O failure and has been suspended.

This device dropped offline, and since there's no redundancy, ZFS suspends the pool. This will make you unable to stop the array and do a clean reboot. Start by replacing the cables for that device.
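For anyone who wants to check a saved syslog for the same condition, here is a minimal Python sketch. It assumes the kernel message uses exactly the wording quoted above (it may differ between ZFS versions), and the function name `find_suspended_pools` is made up for illustration; it is not part of Unraid or ZFS.

```python
import re

# Pattern based on the kernel message quoted in this thread; the exact
# wording is an assumption and may differ between ZFS versions.
SUSPENDED_RE = re.compile(
    r"WARNING: Pool '(?P<pool>[^']+)' has encountered an uncorrectable "
    r"I/O failure and has been suspended"
)


def find_suspended_pools(lines):
    """Return names of pools reported as suspended, in first-seen order."""
    pools = []
    for line in lines:
        match = SUSPENDED_RE.search(line)
        if match and match.group("pool") not in pools:
            pools.append(match.group("pool"))
    return pools


if __name__ == "__main__":
    sample = [
        "Jul 16 18:40:15 Olifant kernel: WARNING: Pool 'kingspec' has "
        "encountered an uncorrectable I/O failure and has been suspended.",
    ]
    print(find_suspended_pools(sample))
```

To run it against a real log, pass an open file instead of the sample list, for example `find_suspended_pools(open("syslog.txt", errors="replace"))`.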
July 17, 2024 Author

Hi JorgeB, thx for your reply. It is an m.2 drive plugged directly into the motherboard, so there are no cables to replace. Could it be that it is just a bad quality drive? If so, I might try replacing it. Alternatively, I was thinking to reformat the drive to XFS, because I remember reading that ZFS is not recommended for a single device pool. Or is that not the case any more? What are your thoughts?
July 17, 2024 Community Expert Solution

59 minutes ago, Sten3danny said: Could it be that it is just a bad quality drive?

It's possible. You can try using a different m.2 slot, if available.

1 hour ago, Sten3danny said: Alternatively, I was thinking to reformat the drive to XFS, because I remember reading that ZFS is not recommended for a single device pool. Or is that not the case any more?

It's not that ZFS is not recommended, but for a single device pool, if you don't care about checksums or snapshots, XFS is fine. The issue was not caused by ZFS, though. ZFS just adds the problem of not being able to stop the array, which won't happen with btrfs or XFS.
July 17, 2024 Author

Unfortunately, I have no empty m.2 slots to swap it to. I will try to reformat to XFS because I don't really NEED snapshots. If that doesn't fix the issue, I will probably replace the drive. Thanks for your help JorgeB!

Oh, one last question: when changing the FS, should I use the 'Erase' function? I am not really sure what that does and couldn't find it in the manual.
July 17, 2024 Community Expert

11 minutes ago, Sten3danny said: should I use the 'Erase' function?

You can, that's the easiest way.
July 27, 2024 Author

So, ten days after changing the filesystem to XFS, the issue hasn't recurred (yet). I am not sure if I've just been lucky so far, if this has actually 'fixed' the issue, or if there are other variables at play here, but I am happy for now.
May 11, 2025 Author

Hi all, I am reopening this thread because I am again having issues with an SSD dropping offline in Unraid. Luckily, this time it concerns an SSD in a ZFS pool with redundancy, so I have not experienced any crashes or data loss. I have already ordered a replacement drive; hopefully I will not have any issues after replacing it.

However, given that 1) this concerns a brand new SSD from a reputable brand (WD - just a few weeks old) and 2) the drive I previously had issues with has since (for over half a year) been sitting happily without issues in my Windows desktop (98% CrystalDiskInfo score), I am wondering if Unraid/Linux/ZFS is just particularly sensitive to SSD imperfections, or if maybe something else is going wrong with my system that I am unaware of. Can anyone else confirm having similar experiences, or maybe provide me with some useful insights? See also the history of this thread for more background information.

PS - I swapped the drive (power and SATA cable) with an identical drive to confirm that the issue is actually with the drive itself, and not with the cabling.

I am attaching the syslog and also a short snippet of the log file from when the failure occurred. I would appreciate any and all insights!

Attachments: syslog.txt, failure_log.txt
May 12, 2025 Community Expert

May 9 03:17:12 Olifant kernel: ata10.00: revalidation failed (errno=-19)
May 9 03:17:12 Olifant kernel: ata10.00: disable device
May 9 03:17:12 Olifant kernel: sd 11:0:0:0: rejecting I/O to offline device

The device dropped offline. This is typically a power/connection issue.
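When the same symptom keeps coming back, it can help to tally which ATA port the kernel disabled each time, to see whether the drops follow a port or a drive. The following is a rough Python sketch, assuming the log lines look like the excerpt above. The function name `count_ata_disables` is hypothetical, and ATA port numbers may be assigned differently after hardware changes, so compare counts only across logs from the same configuration.

```python
import re
from collections import Counter

# Matches libata lines such as "kernel: ata10.00: disable device".
# The format is taken from the log excerpt in this thread.
ATA_DISABLE_RE = re.compile(r"kernel: (ata\d+)\.\d+: disable device")


def count_ata_disables(lines):
    """Count 'disable device' events per ATA port (for example 'ata10')."""
    counts = Counter()
    for line in lines:
        match = ATA_DISABLE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "May 9 03:17:12 Olifant kernel: ata10.00: revalidation failed (errno=-19)",
        "May 9 03:17:12 Olifant kernel: ata10.00: disable device",
        "May 9 03:17:12 Olifant kernel: sd 11:0:0:0: rejecting I/O to offline device",
    ]
    for port, count in count_ata_disables(sample).items():
        print(port, count)
```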
May 12, 2025 Author

Hi JorgeB, thank you for your response. As I mentioned in my post, I swapped the drive's position with another drive in my system (so power and SATA cable), and still this same drive was dropping offline (and the other was not), which led me to believe the issue is with the drive itself somehow. Maybe it is a power/connection issue like you say, but then why is it affecting only this one particular SSD, irrespective of where I put it in my system?

Is there any other information to be extracted from the diagnostics file? I will also have another close look myself, but I am no expert, so any other suggestions will be appreciated! Thx again!

Edited May 12, 2025 by Sten3danny
May 12, 2025 Community Expert

If you already replaced the cables, try replacing the device. Not all types of issues can be seen in SMART.
May 12, 2025 Author

I replaced the SSD with a new one this morning and resilvered the pool. I will do a full SMART check of the 'old' device and place it in another system for testing. Will post what happens. In the meantime, is there anyone else out there having similar issues?
May 12, 2025 Author

By the way, how can I change the title of this thread? I cannot seem to find the option.
May 17, 2025 Author

Okay, an update: I replaced the drive that was dropping offline with a shiny new one a few days ago. All was well for a few days, until today the new drive dropped offline. (Note that previously I had already swapped the position (SATA and power cable) of the old 'faulty' drive with another drive, and still that one drive was dropping offline. Now the new drive dropped offline at that same position. So this is still a bit mysterious to me, but never mind that for now.) For now it seems pretty safe to conclude that the issue was/is not with the drive itself.

I didn't really know what to do next, until I realized that today, for the first time since replacing the drive, I had spun up one of my VMs that has a graphics card (and corresponding sound card) passed through. Could this be somehow related? The VM in itself was working fine. For now I will not spin up this particular VM, to confirm if the drive will still drop offline without the VM running.

I have attached new diagnostics from just after the drive dropped offline. I really hope one of you experts can retrieve some useful information from it. Many thanks in advance!

PS: two facts that might be of interest:

1) Unraid was previously throwing up a VFIO bind error that I did not understand. I have now checked the check boxes next to the graphics card and sound card that are being passed through to the VM and restarted Unraid. They are together in one IOMMU group.

2) This morning I updated to Unraid 7.1.2. Might not have been the smartest move given I am in a troubleshooting process, but it's done.

Attachment: olifant-diagnostics-20250517-1442.zip
May 18, 2025 Community Expert

18 hours ago, Sten3danny said: this particular VM

Which VM is it? There are several.
May 18, 2025 Community Expert

The VM config looks good to me. Only the GPU is being passed through, so it should be unrelated. Maybe you have PCIe ACS override enabled; that can sometimes cause issues.
May 18, 2025 Author

Hi JorgeB, thanks for your response. I do not have PCIe ACS override enabled, but as I mentioned, I did see a VFIO bind error popping up every time on startup. I have now checked the checkboxes next to the graphics card and sound card that I am passing through, and the error is no longer popping up. But tbh, I don't really understand how this works. I always thought that if the devices that are to be passed through are together in one IOMMU group, it was not needed to bind them, but apparently it is (?)

In any case, it's probably unrelated as you say, but I will not spin up the VM for now to see what happens.
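To double-check which PCI devices share an IOMMU group outside the GUI, the groups can be read from sysfs. Below is a minimal Python sketch, assuming the usual Linux layout `/sys/kernel/iommu_groups/<group>/devices/<PCI address>`; the function name `list_iommu_groups` is made up for illustration. It only shows grouping and says nothing about which driver each device is bound to.

```python
from pathlib import Path


def list_iommu_groups(root="/sys/kernel/iommu_groups"):
    """Map each IOMMU group number to the sorted PCI addresses it contains."""
    groups = {}
    root_path = Path(root)
    if not root_path.is_dir():
        # No such directory: IOMMU is likely disabled, or this is not Linux.
        return groups
    for group_dir in root_path.iterdir():
        devices_dir = group_dir / "devices"
        # Group directories are named with plain numbers; skip anything else.
        if group_dir.name.isdigit() and devices_dir.is_dir():
            groups[int(group_dir.name)] = sorted(
                entry.name for entry in devices_dir.iterdir()
            )
    return groups


if __name__ == "__main__":
    for number, devices in sorted(list_iommu_groups().items()):
        print(f"group {number}: {' '.join(devices)}")
```

Taking the root directory as a parameter keeps the function easy to try against a copied or mocked directory tree instead of the live system.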
June 30, 2025 Author

After a few more weeks of fiddling, I think I can draw some kind of conclusion. I am now pretty sure that my issues were somehow related to that VFIO bind error. Over the past few weeks I have made hardware changes a few times (such as adding an NVMe drive), and every time that VFIO bind error would pop up and one of my SATA SSDs would drop offline. After rebooting, the drive would reappear, but as an unassigned drive, meaning the ZFS pool to which it belonged would be in an error state. Then, after fixing the VFIO bind error, removing the ZFS pool, recreating the ZFS pool and resilvering, all would be okay again.

As long as I don't make any hardware changes and the VFIO bind error does not exist, everything seems hunky-dory. I am not sure if this is an Unraid issue or something particular to my system, but I thought I would share my findings here.