January 15, 2025: When I try to stop the array, the system seems to get stuck at Sync Filesystem; the only way to move forward is to reboot the machine. I attached the diagnostics file for reference. Before writing here I searched the forums, and the indication was that a disk, or a connection to a disk, may be going bad. I found two disks with some "issues" (see attached images) and replaced them. I also reseated all the power and SATA connectors on the drives and on the board, but no dice. tower-diagnostics-20250114-2152.zip
January 15, 2025, Community Expert: Disable the Docker and VM services, reboot in safe mode, start and stop the array, and see if it still does the same thing.
January 16, 2025, Author: With no Docker containers, no VMs, and in safe mode, the array eventually stopped after 15 minutes in the sync filesystems step. On other servers that step takes only seconds. I'm adding a new diagnostics file: tower-diagnostics-20250116-1044.zip
January 16, 2025, Community Expert: I don't see anything logged that explains that; it could be a slow filesystem. I would do a new config without the array, just the pool, then retest.
January 17, 2025, Author: I ran the DiskSpeed docker and the results are attached. Disk 6 seems to be on the slow side, but even that is between 119 MB/s and 56 MB/s. I'm not sure I understand your suggestion about the new config without the array.
January 18, 2025, Community Expert: (quoting soana: "I'm not sure I understand your suggestion about the new config without the array.") It would confirm whether the pool alone unmounts quickly, which would mean the issue is with one of the array filesystems, not necessarily the disks.
January 19, 2025, Author: With just the pool, I started and stopped the array and everything seemed to work OK. Stopping services took some time, but the sync filesystems step either did not appear or was instantaneous, since I could not see it while stopping the array with just the pool.
January 19, 2025, Community Expert: That suggests one or more of the array disks is the problem, but you would need to test them one by one, and then run a correcting parity check at the end.
January 19, 2025, Author: Thanks for the suggestion. It will be a long task, but not impossible. A couple of questions:
1. I keep getting these messages in the log. Is it possible that one of my controllers, or the connection from HDD -> backplane -> controller, is bad? I'm wondering whether it's worth replacing my two SAS controllers with a new one before I start replacing one disk at a time.
Jan 19 08:45:48 Tower kernel: sas: Enter sas_scsi_recover_host busy: 1 failed: 1
Jan 19 08:45:48 Tower kernel: sas: ata7: end_device-1:0: cmd error handler
Jan 19 08:45:48 Tower kernel: sas: ata7: end_device-1:0: dev error handler
Jan 19 08:45:48 Tower kernel: sas: ata8: end_device-1:1: dev error handler
Jan 19 08:45:48 Tower kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 1 tries: 1
Jan 19 08:45:48 Tower kernel: sas: Enter sas_scsi_recover_host busy: 1 failed: 1
Jan 19 08:45:48 Tower kernel: sas: ata10: end_device-2:1: cmd error handler
Jan 19 08:45:48 Tower kernel: sas: ata9: end_device-2:0: dev error handler
Jan 19 08:45:48 Tower kernel: sas: ata10: end_device-2:1: dev error handler
Jan 19 08:45:48 Tower kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 1 tries: 1
2. During the 12 minutes of "sync filesystems", can I run any commands to see which disks are being accessed? I'm hoping that can narrow down the culprit.
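Editorial sketch, not part of the original post: to see whether the same ports keep showing up in messages like the ones quoted in question 1, the "error handler" lines can be tallied per ATA port. The function name `tally_handlers` and the regular expression are illustrative assumptions based only on the log format quoted above; a count says nothing by itself about whether a controller, cable, or disk is at fault.

```python
import re
from collections import Counter

# Matches lines such as:
#   "... kernel: sas: ata7: end_device-1:0: cmd error handler"
# The pattern is an assumption derived from the log excerpt in this thread.
HANDLER = re.compile(r"sas: (ata\d+): (end_device-\d+:\d+): (cmd|dev) error handler")

def tally_handlers(log_text):
    """Count error-handler lines per (ata port, end device, handler type)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = HANDLER.search(line)
        if match:
            counts[match.groups()] += 1
    return counts

def report(log_text):
    """Print the tally, most frequent entries first."""
    for (port, device, kind), n in tally_handlers(log_text).most_common():
        print(f"{port} {device} {kind}: {n}")
```

To use it, read a syslog file from the diagnostics archive into a string and pass it to `report`. Ports that appear only in "dev error handler" lines next to a single "cmd error handler" port may simply share a link with the port whose command failed.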
January 19, 2025, Community Expert: Those errors can be fairly innocuous with some Marvell-based HBAs. For the disks, iotop could help with that, but I'm not sure it's easy to install at the moment.
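Editorial sketch, not part of the original reply: if iotop is not available, one possible alternative is to sample the kernel's per-device counters in /proc/diskstats twice and compare them. This does not show per-process I/O the way iotop does; it only shows which block devices moved data during the interval. The function names and the 5-second default are illustrative assumptions; the field positions follow the kernel's documented diskstats layout, where sector counts are in 512-byte units.

```python
import time

def parse_diskstats(text):
    """Return {device: (sectors_read, sectors_written)} from /proc/diskstats text."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 10:
            continue  # skip blank or truncated lines
        # fields[2] = device name, fields[5] = sectors read, fields[9] = sectors written
        stats[fields[2]] = (int(fields[5]), int(fields[9]))
    return stats

def busiest(before, after, top=5):
    """Rank devices by sectors transferred between two samples."""
    deltas = []
    for name, (r1, w1) in after.items():
        r0, w0 = before.get(name, (r1, w1))
        if (r1 - r0) + (w1 - w0) > 0:
            deltas.append((name, r1 - r0, w1 - w0))
    deltas.sort(key=lambda d: d[1] + d[2], reverse=True)
    return deltas[:top]

def read_diskstats(path="/proc/diskstats"):
    with open(path) as f:
        return parse_diskstats(f.read())

def watch(interval=5):
    """Take two samples `interval` seconds apart and print the busiest devices."""
    first = read_diskstats()
    time.sleep(interval)
    second = read_diskstats()
    for name, rd, wr in busiest(first, second):
        print(f"{name}: read {rd * 512 / 1e6:.1f} MB, "
              f"wrote {wr * 512 / 1e6:.1f} MB in {interval}s")
```

Calling `watch()` from a terminal while the array is sitting in the sync filesystems step would list the devices still transferring data. Partitions appear alongside whole disks in /proc/diskstats, so the same disk may show up more than once.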
January 19, 2025, Author: Hmm, not sure if this is related, but I just lost all the user shares. Looking through the forums, this seems to be a known issue that can be resolved with a reboot.
Jan 19 10:05:07 Tower shfs: shfs: ../lib/fuse.c:1402: unlink_node: Assertion `node->nlookup > 1' failed.
Edit: a reboot brought the user shares back.
January 19, 2025, Community Expert: (quoting soana: "known issue that can be resolved with a reboot.") Yep, v7.0 has some new tunables that can help with that, but they are only worth trying if it's a frequent error.
January 20, 2025, Author: Some progress. Disk10 in my array was formatted as ZFS. After removing everything from it and reformatting it as XFS, the time for sync filesystems dropped from about 12 minutes to about 3 minutes.
January 20, 2025, Author: Yes, it did have snapshots. The pool is also ZFS with snapshots.
January 20, 2025, Community Expert: If there are lots of snapshots, it can make mount/unmount extra slow.
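Editorial sketch, not part of the original reply: to get a rough idea of how many snapshots each dataset carries, the output of `zfs list -H -t snapshot -o name` (one `dataset@snapshot` name per line) can be counted per dataset. The function names are illustrative assumptions, and the thread does not say how many snapshots count as "lots".

```python
import subprocess
from collections import Counter

def count_snapshots(listing):
    """Count snapshots per dataset from lines of the form 'dataset@snapshot'."""
    counts = Counter()
    for line in listing.splitlines():
        name = line.strip()
        if "@" not in name:
            continue  # skip blank lines and anything that is not a snapshot name
        dataset, _, _ = name.partition("@")
        counts[dataset] += 1
    return counts

def list_snapshots():
    """Return the raw snapshot listing; requires the zfs CLI on the server."""
    result = subprocess.run(
        ["zfs", "list", "-H", "-t", "snapshot", "-o", "name"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

def report():
    """Print datasets ordered by snapshot count, largest first."""
    for dataset, n in count_snapshots(list_snapshots()).most_common():
        print(f"{n:6d}  {dataset}")
```

Running `report()` on the server would show which datasets hold the most snapshots, which may help decide where pruning is worth testing first.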
February 11, 2025, Author: After swapping out all the mechanical disks, changing from the two Dell PERC H310 controllers to one LSI SAS9300-16i, and installing new SAS cables, the sync filesystems step still lasts about 3 minutes. I guess that, moving forward, to avoid a parity check every time I reboot the server, I will have to stop the array manually first and then initiate the shutdown or reboot command.