nwootton Posted February 15

Going a little nuts here: I'm getting multiple repeating failure scenarios with the same disk number, but no actual errors when the disks are checked.

Background. I recently upgraded to the latest version (6.12.6 at time of writing). As part of this I also installed the Fix Common Problems plugin, which flagged 2 issues:

1. I still had 4 reiserfs-formatted disks in my box (cache & 3 array drives).
2. Some of my dockers were pointing at a non-array disk (unassigned device).

Nothing significant, so the upgrade was done. I do need to get rid of the reiserfs disks before the capability gets removed from the kernel, so after some research I followed this method to move data off them. Using a brand new 4TB disk, I followed the procedure (rsync in a tmux session, with the web UI open to watch the read & write numbers changing) and got the first disk migrated without issue. I then left the array to 'settle' for a couple of days to make sure everything was OK.

I went back to do the same with the next disk. At some point the web UI stopped responding: although I could switch between tabs, nothing updated. In the tmux session the rsync was proceeding, but after it completed the web UI still failed to act on commands. E.g. I'd request a disk spin-down and it would indicate it was doing it, but the disk wouldn't change. Stop array, reboot and shutdown all produced the correct dialogues, but the event wouldn't happen. I left the web UI open but unused for a while and it remained unhelpful; opening it in other browsers and forcing a cache clear also failed to restore control. Eventually, as a last resort, I did a shutdown via ssh. I checked all the connections and then rebooted. The server came up and informed me that I no longer had a license key. After multiple reboots the system now agrees that I do still have a valid license, and it began to work as expected. This leaves me with 2 remaining reiserfs disks I want to migrate.
Issue

I left the server running for several days and it appeared fine. Then I got a failure on disk4: an "array is emulated" warning. I checked SMART status and there were no errors. I put the array into maintenance mode and ran a filesystem check on the XFS drive: no issues. I ran the fix anyway. In fact nothing I did indicated an issue with the disk.

I swapped the disk out for another; the parity rebuild took place (12 hrs) and the new disk was running. The array appeared OK, so I turned the Docker containers back on. Next morning, disk4 was in an error state and the array was emulated. I ran the same SMART and XFS routines and found no issues.

I swapped the disk out for a third; the parity rebuild took place and the array was happy again. I turned on the minimum Dockers to keep the family happy. Next day, disk4 was again in an error state. No errors in SMART; the XFS check found no errors, and I ran the XFS fix anyway just in case: nothing done.

I replaced the disk with the original one. The parity rebuild took place and the array said it was happy. I left all Dockers off. I looked at the 'failed' disks on another laptop and still found no errors on them. I've run a parity read-check to make sure everything agrees.

This morning I got another message that disk4 is in an error state. The logs show read & write errors on disk4 around the time the error message about the array state was sent:

```
....
Feb 14 21:27:16 Tower kernel: md: disk4 read error, sector=1381277744
Feb 14 21:27:16 Tower kernel: md: disk4 read error, sector=1381277752
Feb 14 21:27:16 Tower kernel: md: disk4 read error, sector=1381277760
Feb 14 21:27:16 Tower kernel: md: disk4 read error, sector=1381277768
Feb 14 21:27:16 Tower kernel: md: disk4 read error, sector=1381277776
Feb 14 21:27:16 Tower kernel: md: disk4 read error, sector=1381277784
...
Feb 14 21:27:26 Tower kernel: md: disk4 write error, sector=1381277744
Feb 14 21:27:26 Tower kernel: md: disk4 write error, sector=1381277752
Feb 14 21:27:26 Tower kernel: md: disk4 write error, sector=1381277760
Feb 14 21:27:26 Tower kernel: md: disk4 write error, sector=1381277768
Feb 14 21:27:26 Tower kernel: md: disk4 write error, sector=1381277776
Feb 14 21:27:26 Tower kernel: md: disk4 write error, sector=1381277784
```

Can anyone suggest something that could explain the repeated failure of different disks in the same slot? Did the migration process do something that is causing a conflict? Is it something in the current version? I've been running unRAID since about version 4, and prior to this all I'd had was the odd disk failure, which was easy to handle. I've spent more time dealing with issues in the last 2 weeks than in the previous blah years, and I'm now completely out of my depth with an array that no longer works.

Update: the hard drives are all 4TB in size. The original disk4 was a WD Red, replaced by a Seagate Barracuda, then by a Seagate IronWolf. tmux was installed via NerdTools.

tower-diagnostics-20240215-0819.zip

Edited February 15 by nwootton
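For anyone hitting similar log spam, a small helper can tally the md read/write errors per disk from a saved syslog. This is a sketch; the function name is mine, and the `/var/log/syslog` path in the usage note is an assumption about where the log lives:

```shell
# Count md read or write errors for a given disk in a syslog file.
# Matches kernel lines of the form:
#   ... kernel: md: disk4 read error, sector=...
count_md_errors() {
  log="$1"    # path to the syslog file
  disk="$2"   # e.g. "disk4"
  kind="$3"   # "read" or "write"
  grep -c "md: ${disk} ${kind} error" "$log"
}

# Usage on a live system (path is an assumption):
# count_md_errors /var/log/syslog disk4 read
# count_md_errors /var/log/syslog disk4 write
```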
JorgeB Posted February 15

Those Marvell controllers are known to drop disks; if possible I would recommend replacing them. You can also try disabling spin-down, since the issue happened right after a disk spun up.
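As a quick way to confirm which SATA controllers are in the box, the lspci output can be filtered. A minimal sketch; the helper name is mine, and the note that Marvell's PCI vendor ID is 1b4b is my assumption worth double-checking:

```shell
# Filter `lspci -nn`-style output (read from stdin) down to SATA/AHCI
# controller lines, so a Marvell controller (vendor ID 1b4b) is easy to spot.
list_sata_controllers() {
  grep -Ei 'sata|ahci'
}

# On a live system:
# lspci -nn | list_sata_controllers
```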
nwootton Posted February 15 (Author)

The same cards have run for the past x years without an issue, never once dropping a disk. What would cause them to start now? Spin-down was only turned on after the third failure (though maybe the constant running of the disks could be a factor); it was off after the update and during the earlier failures. What are the currently recommended alternatives?

Edited February 15 by nwootton
nwootton Posted February 15 (Author)

Just ordered a Dell PERC H310 for £20. I'm hoping it will fit my motherboard and can act as a stop-gap until I can source a couple of LSI cards, probably 9211-8i or later. We'll have to see if that solves the issue.
nwootton Posted March 1 (Author, marked as Solution)

I ended up rolling back to 6.12.4 until the new Dell PERC card arrived. The system remained stable with no further issues for the following 2 weeks (to 29th Feb). Yesterday I installed the new HBA card. Still running 6.12.4 without issues.