-
[7.1.4] libvirt prevented unmounting the SSD cache pool
No, it occasionally does this. I will post here if I encounter the same issue on 7.2.0.
-
[7.1.4] libvirt prevented unmounting the SSD cache pool
Hi, I was attempting the upgrade from 7.1.4 to 7.2.0 and encountered this issue while trying to stop the array, after the updater finished expanding the new image onto the flash drive. I decided to stop the array before restarting the server because I've had instances where the array didn't cleanly unmount before, and the built-in timeout caused the reboot attempt to be marked as an unclean shutdown. Sure enough, the array would not cleanly unmount today, with the "Retry unmounting disk share(s)" message printed in the lower-left corner. I checked via SSH and while the main array had already unmounted, Unraid was stuck unmounting the cache pool:

Oct 31 23:58:47 dipper emhttpd: shcmd (40096042): umount /mnt/cache
Oct 31 23:58:47 dipper root: umount: /mnt/cache: target is busy.
Oct 31 23:58:47 dipper emhttpd: shcmd (40096042): exit status: 32
Oct 31 23:58:47 dipper emhttpd: Retry unmounting disk share(s)...
Oct 31 23:58:52 dipper emhttpd: Unmounting disks...
Oct 31 23:58:52 dipper emhttpd: shcmd (40096043): umount /mnt/cache
Oct 31 23:58:52 dipper root: umount: /mnt/cache: target is busy.
Oct 31 23:58:52 dipper emhttpd: shcmd (40096043): exit status: 32
Oct 31 23:58:52 dipper emhttpd: Retry unmounting disk share(s)...

I checked what was using the cache pool, and found some processes using /etc/libvirt and not allowing the cache pool to unmount. After sending SIGTERM to all of those processes, Unraid proceeded with the rest of the array stop procedure and I was able to stop and restart the server.

root@dipper:/mnt# mount | grep cache
/dev/mapper/nvme0n1p1 on /mnt/cache type btrfs (rw,noatime,ssd,discard=async,space_cache=v2,subvolid=5,subvol=/)
/mnt/cache/system/libvirt/libvirt.img on /etc/libvirt type btrfs (rw,noatime,ssd,discard=async,space_cache=v2,subvolid=5,subvol=/)
root@dipper:/mnt# umount /etc/libvirt
umount: /etc/libvirt: target is busy.
root@dipper:/mnt# lsof /etc/libvirt
COMMAND       PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
notify_po 3754678 root   31r   REG   0,53      547  260 /etc/libvirt/libvirt.conf
session_c 3754681 root   31r   REG   0,53      547  260 /etc/libvirt/libvirt.conf
system_te 3754683 root   31r   REG   0,53      547  260 /etc/libvirt/libvirt.conf
device_li 3754688 root   31r   REG   0,53      547  260 /etc/libvirt/libvirt.conf
disk_load 3754696 root   31r   REG   0,53      547  260 /etc/libvirt/libvirt.conf
parity_li 3754700 root   31r   REG   0,53      547  260 /etc/libvirt/libvirt.conf
sleep     3879329 root   31r   REG   0,53      547  260 /etc/libvirt/libvirt.conf
root@dipper:/mnt# kill 3754678
root@dipper:/mnt# kill 3754681
root@dipper:/mnt# kill 3754683
root@dipper:/mnt# kill 3754688
root@dipper:/mnt# kill 3754696
root@dipper:/mnt# kill 3754700
root@dipper:/mnt# kill 3879329
-bash: kill: (3879329) - No such process
root@dipper:/mnt# lsof /etc/libvirt
root@dipper:/mnt# umount /etc/libvirt
root@dipper:/mnt#

I was wondering if this was a known defect? Was there a reason why all of those processes were running in /etc/libvirt? I don't remember running any active VMs at the moment on my server. Attached diagnostics. dipper-diagnostics-20251031-1100-anonymized.zip
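For anyone who hits the same thing and doesn't want to copy PIDs by hand, here is a rough sketch of pulling them out of plain `lsof <path>` output. The function name `pids_from_lsof` and the `SAMPLE` text are made up for illustration (the sample is trimmed from the output above), and the sketch assumes the default column layout where the PID is the second whitespace-separated field. It only lists PIDs; it does not kill anything.

```python
def pids_from_lsof(output: str) -> list[int]:
    """Return the unique PIDs found in plain `lsof <path>` output, in order.

    Assumes the default lsof layout: COMMAND in the first column and the
    PID in the second. The header row is skipped because its second field
    ("PID") is not numeric.
    """
    pids: list[int] = []
    for line in output.splitlines():
        fields = line.split()
        # Skip blank lines and the COMMAND/PID/USER header row.
        if len(fields) < 2 or not fields[1].isdigit():
            continue
        pid = int(fields[1])
        if pid not in pids:
            pids.append(pid)
    return pids


# Hypothetical sample, trimmed from the lsof output in the post above.
SAMPLE = """\
COMMAND       PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
notify_po 3754678 root   31r   REG   0,53      547  260 /etc/libvirt/libvirt.conf
sleep     3879329 root   31r   REG   0,53      547  260 /etc/libvirt/libvirt.conf
"""

if __name__ == "__main__":
    for pid in pids_from_lsof(SAMPLE):
        print(pid)
```

Note that `lsof -t <path>` already prints bare PIDs, so the parsing step is only needed when working from saved output like the listing above. Either way, it is worth checking what each process is before sending it a signal.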
-
-
I/O errors on brand new hard drives, cables, and SATA controller card
Thank you @JorgeB! Your diagnosis was right. It turned out that one or both of the Molex plugs had developed an unreliable connection, possibly due to me going slightly above the cable rating when the drives all spun up at once. The server is now running again with all drives restored, temporarily powered from the other connectors on the same Molex cable. If anyone is interested I wrote a blog post about it here: https://ericswpark.com/blog/2025/2025-03-04-i-nearly-lost-my-entire-server/ Sorry for the late reply; I just about got everything wrapped up this morning.
-
I/O errors on brand new hard drives, cables, and SATA controller card
This is sort of a continuation of this post: TL;DR - two hard drives dropped offline, replaced them, same errors occurred on the new drives, rechecked connections and discovered what appeared to be a loose Molex connection. Thought the errors were now gone and jinxed myself, because now I'm posting again... Unfortunately since that post, the I/O errors came back. To eliminate more variables, I next replaced the LSI RAID card with an AS1064 4-port SATA card, and got brand new SATA cables to rule out any cabling problems. I replaced the card and SATA cables today, booted up the server, and decided to run preclears on the new drives before doing anything with the array. I watched as both of the drives failed the pre-read cycle of the preclear minutes after starting the operation. Now I'm completely lost as to what to do. The only hardware that I haven't swapped out yet is the PSU and the backplane built into the NAS chassis. But I don't think it's the backplane because I've tried swapping around the drives and the problems follow the drives; if the backplane were faulty, I think whatever drive was connected to the "faulty" slot would have issues instead. The PSU is a Lian Li SFX-sized SP750 unit. It's been running fine since June 2022, and according to PSU calculators it is completely overkill for this build, since the recommended wattage is only about 380W. I was wondering if anyone has encountered anything like this, where the software I/O errors persist despite replacing nearly all the components of the NAS. Diagnostics attached. dipper-diagnostics-20250205-0214.zip
-
Two drives suddenly dropped offline with zero warning, connection issue?
I decided to take another shot at remote troubleshooting today with a different family member and we found out that one of the Molex power connections had come loose on the backplane. It was just loose enough that it made imperfect contact and _appeared_ to power the drives. For context, the backplane has two Molex plugs that split off to the five disks. I'm not sure how the Jonsbo N1 has the two connectors routed on the backplane PCB, but I believe that it was trying to draw all the current from one Molex connector, and the resulting voltage sag was manifesting as disks dropping off of the array. I am not sure why or how the failures followed the two drives, but perhaps they are the most susceptible to the voltage sag? After firmly seating the Molex connectors and rechecking all connections, the server was turned back on and the two new drives immediately showed up. I'm currently running a parity-swap procedure and it has been going well for half an hour now. Previously even a simple parity check would fail after 30 seconds. Knock on wood and all that, but I think this problem has actually been fixed this time. If we had checked the connections properly, it would've saved all the headache (not to mention the cost of the new drives... at least now I have two cold spares?) But troubleshooting remotely with family has limitations: I can't exactly tell if connections are firmly pressed in over a video call and have to take the other person's word for it. So unfortunate, but glad it was resolved in the end. Thank you @JorgeB for the troubleshooting help! I've marked your initial reply as the solution because it was indeed a connection problem.
-
Two drives suddenly dropped offline with zero warning, connection issue?
Just ordered a new H1110 LSI 9211-4i card with a new SAS-to-SATA breakout cable. I'm really hoping this does the trick because I'd hate for the problem to be the backplane of the case.
-
Two drives suddenly dropped offline with zero warning, connection issue?
So I finally got the drives in today and connected them, but I'm _still_ somehow getting the same disk I/O errors on the brand new drives. I'm wondering now if the LSI card that I'm using has died. I'm using an LSI 9240-8i SAS card flashed to IT mode that I bought off of eBay in 2022. Should I try getting a replacement card?
-
Two drives suddenly dropped offline with zero warning, connection issue?
Thanks, but are two rebuilds necessary (or recommended) as outlined in the alternate approach I laid out? Or would that cause unnecessary strain, so that it's fine to just rebuild in one go?
-
Two drives suddenly dropped offline with zero warning, connection issue?
Quick question regarding the rebuild process. When I get the new drives I suppose I will have to do a parity swap procedure, as all the drives that I had were 8 TB, and I'm replacing a parity drive and a data drive that were both 8 TB. For parity it's a larger drive so that's fine, but I assume I'd have to move parity 2 to data 3, assign the new 16 TB drives to parity 1 and 2, and start the array. Then UnRAID will offer to copy the parity data over, zero the remaining bits, then reconstruct data 3. Is my understanding correct? Alternatively, I could assign parity 1, leave data 3 blank, rebuild, then do the parity swap procedure (parity 2 to data 3 slot, new drive in parity 2, copy parity data, rebuild data 3). This does mean I will rebuild twice, but it seems less risky (?) since I can have another drive fail during the second rebuild if I'm understanding it correctly. Which approach should I go for? Both replacement drives are 16 TB.
-
Two drives suddenly dropped offline with zero warning, connection issue?
I had a chance today to go over swapping drive slots around and believe that I've isolated the issue down to the drives. The errors followed data drive #3 and parity drive #1 shortly thereafter. I've ordered up new replacements for both. Thank you @JorgeB for the assistance. Very strange that two drives just went like that, but I'm grateful that I went with dual parity now.
-
Two drives suddenly dropped offline with zero warning, connection issue?
That's good to know, thank you. So if I swap the drives around in the bay and get similar errors on the other two drives then unRAID will just stop in a safe manner? I'm planning to use maintenance mode so that no further reads/writes of data occur while I troubleshoot. The main concern is having four out of five drives go out of sync, since it would be quite the headache to verify that the data is all there and correct. I do have a backup of the entire server if the array completely fails, but that would take even longer since it has to restore over the Internet.
-
Two drives suddenly dropped offline with zero warning, connection issue?
I've just ordered up a new cable that should arrive in a couple of days. I'll post an update once the cable is swapped and the array starts to rebuild.
-
Two drives suddenly dropped offline with zero warning, connection issue?
@JorgeB the issue is the array is already in a degraded state, and the errors don't seem to show up until UnRAID tries to read/write to them. I'd much rather not risk more drives getting disabled.
-
Two drives suddenly dropped offline with zero warning, connection issue?
@JorgeB I just made sure that the cables were firmly connected while reseating them. I'm using the Jonsbo N1 case that has five drive bays and a backplane that has two Molex power ports and 5 SATA connections. 1 SATA cable goes direct to the MB, while the other 4 are connected via 1 SAS-to-SATA right-angle cable that is connected to an LSI RAID card in HBA mode. Unfortunately I do not know which cable goes to which HDD, but I think it's safe to guess that both failing HDDs are connected to the HBA card. The SAS-to-SATA cable was replaced 6 months ago with a right-angle one because I found that the straight one fails after a while because of the tight bend that the case necessitates. Even though it no longer has a bend, I'm now wondering if either the cabling or the HBA card itself is dying. The cable itself is from AliExpress since that's the only marketplace where I could find a right-angle cable. Right now I'm planning to order another SAS cable, and test the drives on another machine with something like a SMART long test to make sure the drives themselves aren't faulty. Please let me know if this is a good idea (and any recommendations on better sources for good quality SAS cables...)
-
Two drives suddenly dropped offline with zero warning, connection issue?
Parity just dropped offline again as well. Attaching new diagnostics: dipper-diagnostics-20250113-0630.zip
ericswpark