Cannot stop array

January 15, 2025

When I try to stop the array, the system seems to get stuck at "Sync filesystems"; the only way to move forward is to reboot the machine.

I attached the diagnostics file for reference.

Before posting here I searched the forums, and the suggestion was that either a disk or a connection to a disk may be going bad.

I did find two disks with some "issues" (see attached images) and replaced them. I also reconnected all the power and SATA connectors to the drives and to the board, but no dice.

tower-diagnostics-20250114-2152.zip

January 15, 2025

Community Expert

Disable the Docker and VM services, reboot in safe mode, start and stop the array, and see if it still does the same thing.

January 16, 2025

Author

With no Docker containers or VMs running, and in safe mode, the array eventually stopped after 15 minutes in the "sync filesystems" step. On other servers that step takes only seconds.

I'm adding a new diagnostics file:

tower-diagnostics-20250116-1044.zip

January 16, 2025

Community Expert

I don't see anything logged that explains that; it could be a slow filesystem. I would do a new config without the array, just the pool, then retest.

January 17, 2025

Author

I did run the DiskSpeed docker and the results are attached.

Disk 6 seems to be on the slow side, but even that one runs between 56 MB/s and 119 MB/s.

I'm not sure I understand your suggestion about the new config without the array.

January 18, 2025

Community Expert

8 hours ago, soana said:

I'm not sure I understand your suggestion about the new config without the array.

It would confirm whether the pool alone unmounts quickly, which would mean the issue is with one of the array filesystems, not necessarily the disks.

January 19, 2025

Author

With just the pool, I started and stopped the array and everything seemed to work OK.

Stopping services took some time, but the "sync filesystems" step either did not appear or was instantaneous, since I could not see it while stopping the array with just the pool.

January 19, 2025

Community Expert

That suggests one or more of the array disks is the problem, but you would need to test one by one, and then run a correcting parity check at the end.

January 19, 2025

Author

Thanks for the suggestion. It will be a long task, but not impossible.

Couple of questions:

1. I keep getting these messages in the log, is it possible that one of my controllers or the connection from HDD->backplane->controller is bad?

I'm wondering whether it's worth replacing my two SAS controllers with a new one before I start replacing one disk at a time.

Jan 19 08:45:48 Tower kernel: sas: Enter sas_scsi_recover_host busy: 1 failed: 1
Jan 19 08:45:48 Tower kernel: sas: ata7: end_device-1:0: cmd error handler
Jan 19 08:45:48 Tower kernel: sas: ata7: end_device-1:0: dev error handler
Jan 19 08:45:48 Tower kernel: sas: ata8: end_device-1:1: dev error handler
Jan 19 08:45:48 Tower kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 1 tries: 1
Jan 19 08:45:48 Tower kernel: sas: Enter sas_scsi_recover_host busy: 1 failed: 1
Jan 19 08:45:48 Tower kernel: sas: ata10: end_device-2:1: cmd error handler
Jan 19 08:45:48 Tower kernel: sas: ata9: end_device-2:0: dev error handler
Jan 19 08:45:48 Tower kernel: sas: ata10: end_device-2:1: dev error handler
Jan 19 08:45:48 Tower kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 1 tries: 1

2. During the 12 minutes of "sync filesystems", can I run any commands to see which disks are being accessed? I'm hoping that can narrow down the culprit.
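
For question 1, one rough way to judge whether the recovery messages cluster on a single port is to tally them per ATA port. This is only a minimal Python sketch, assuming the syslog lines have the same shape as the ones quoted above; the function name `tally_sas_errors` and the `SAMPLE` excerpt are illustrative, not part of any Unraid tooling.

```python
import re
from collections import Counter

# Matches lines such as:
#   "... kernel: sas: ata7: end_device-1:0: cmd error handler"
HANDLER_RE = re.compile(
    r"sas: (ata\d+): (end_device-\d+:\d+): (cmd|dev) error handler"
)

def tally_sas_errors(lines):
    """Count error-handler messages per (ata port, end device, handler kind)."""
    counts = Counter()
    for line in lines:
        match = HANDLER_RE.search(line)
        if match:
            counts[match.groups()] += 1
    return counts

# Excerpt copied from the log above, used here as sample input.
SAMPLE = """\
Jan 19 08:45:48 Tower kernel: sas: Enter sas_scsi_recover_host busy: 1 failed: 1
Jan 19 08:45:48 Tower kernel: sas: ata7: end_device-1:0: cmd error handler
Jan 19 08:45:48 Tower kernel: sas: ata7: end_device-1:0: dev error handler
Jan 19 08:45:48 Tower kernel: sas: ata8: end_device-1:1: dev error handler
""".splitlines()

for (port, device, kind), n in sorted(tally_sas_errors(SAMPLE).items()):
    print(port, device, kind, n)
```

To run it against a real log, pass an open file object instead of `SAMPLE`. If the counts are spread evenly across ports, a single bad disk or cable looks less likely than a controller-wide quirk.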

January 19, 2025

Community Expert

Those errors can be fairly innocuous with some Marvell-based HBAs.

For the disks, iotop could help with that, but I'm not sure it's easy to install at the moment.
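
If iotop is not available, a rough substitute is to sample the kernel's per-device counters in `/proc/diskstats` twice and report which devices moved in between. This is a minimal Python sketch, assuming the standard Linux field layout (device name in column 3, sectors read in column 6, sectors written in column 10, I/Os in progress in column 12); the helper names `parse_diskstats`, `busy_devices`, and `sample` are made up for illustration.

```python
import time

def parse_diskstats(text):
    """Return {device: (sectors_read, sectors_written, ios_in_progress)}."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 14:
            continue  # skip blank or unexpectedly short lines
        stats[fields[2]] = (int(fields[5]), int(fields[9]), int(fields[11]))
    return stats

def busy_devices(before, after):
    """Devices whose sector counters changed or that still have I/O in flight."""
    busy = {}
    for name, (rd, wr, inflight) in after.items():
        rd0, wr0, _ = before.get(name, (rd, wr, 0))
        delta_rd, delta_wr = rd - rd0, wr - wr0
        if delta_rd or delta_wr or inflight:
            busy[name] = (delta_rd, delta_wr, inflight)
    return busy

def sample(interval=5.0, path="/proc/diskstats"):
    """Take two snapshots `interval` seconds apart and return the busy devices."""
    with open(path) as fh:
        before = parse_diskstats(fh.read())
    time.sleep(interval)
    with open(path) as fh:
        after = parse_diskstats(fh.read())
    return busy_devices(before, after)
```

Calling `sample()` repeatedly from a terminal while the array is stuck at "sync filesystems" should show which devices are still being written to. The deltas are in 512-byte sectors.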

January 19, 2025

Author

Hmm, not sure if related but I just lost all the user shares.

Looking into the forums I seem to have a known issue that can be resolved with a reboot.

Jan 19 10:05:07 Tower shfs: shfs: ../lib/fuse.c:1402: unlink_node: Assertion `node->nlookup > 1' failed.

Edited January 19, 2025 by soana
Reboot brought the user shares back

January 19, 2025

Community Expert

4 hours ago, soana said:

known issue that can be resolved with a reboot.

Yep, v7.0 has some new tunables that can help with that, but only worth trying if it's a frequent error.

January 20, 2025

Author

Some progress.

Disk10 in my array was formatted as ZFS. After removing everything from it and reformatting it as XFS, the time for the "sync filesystems" step dropped from about 12 minutes to about 3 minutes.

January 20, 2025

Community Expert

Did it have many snapshots?

January 20, 2025

Author

Yes, it did have snapshots.

The pool is also ZFS with snapshots.

Edited January 20, 2025 by soana
added info

January 20, 2025

Community Expert

If there are lots of snapshots it can make mount/unmount extra slow.
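
To get a feel for how many snapshots each dataset carries, the output of `zfs list -H -t snapshot -o name` can be grouped by dataset. This is a minimal Python sketch, assuming the `zfs` command is on the PATH; the helper names and the dataset names in the usage note are hypothetical.

```python
import subprocess
from collections import Counter

def count_snapshots(listing):
    """Count snapshots per dataset from lines of the form 'dataset@snapshot'."""
    counts = Counter()
    for line in listing.splitlines():
        name = line.strip()
        if "@" in name:
            counts[name.split("@", 1)[0]] += 1
    return counts

def list_snapshots():
    """Return the raw snapshot listing, one 'dataset@snapshot' name per line."""
    result = subprocess.run(
        ["zfs", "list", "-H", "-t", "snapshot", "-o", "name"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout

def report():
    """Print datasets ordered by snapshot count, largest first."""
    for dataset, n in count_snapshots(list_snapshots()).most_common():
        print(f"{n:6d}  {dataset}")
```

For example, `count_snapshots("cache/appdata@a\ncache/appdata@b\ncache/vm@a\n")` gives a count of 2 for `cache/appdata` and 1 for `cache/vm`. Calling `report()` on the server lists the datasets with the most snapshots first.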

February 11, 2025

Author

After swapping out all the mechanical disks, changing from the two Dell PERC H310 controllers to one LSI SAS9300-16i, and fitting new SAS cables, the "sync filesystems" step still lasts about 3 minutes.

I guess that, going forward, to avoid a parity check every time I reboot the server, I will have to stop the array manually first and then initiate the shutdown or reboot command.

Edited February 11, 2025 by soana
