Problem onlining failed parity disk


Beardmann


Hi there

I am currently just testing a future setup with a server connected to a NetApp DS4246 shelf with SAS drives installed.

I created a simple setup with two parity disks and 4 data disks, which works...

I then tried to pull one of the data disks to simulate a failure (the shelf is hot plug)...

It took a while before the system noticed the disk was missing (maybe because I didn't have any load on the system)..

But it wasn't just the data disk that failed; one of the two parity disks was also marked as failed, which I of course find worrying...

I then tried to reseat the parity disk, but it remained failed.

I even tried to stop the array, remove the parity disk from the array, and clear the disk (from the GUI) as much as I could, but no matter what I did, I was unable to add it back in its original location... it just remained "disabled"... as if it knew it was the same disk somehow?

I even tried to reboot the server and try it all again, but no dice...

I then replaced the disk with a new one, and this worked just fine, and the parity is not rebuilding...

 

Can anyone please explain why I was unable to reuse the "failed" disk?  Because it is not failed, it works just fine, and I was able to do a complete SMART check with no errors...

Does Unraid save the unique serial number of every disk it has seen, and just flat out deny you the possibility of reusing it again?

 

I am also a bit worried by the fact that pulling one disk contributed to another one being marked as failed...  not sure how this is even possible...

Sadly, as I rebooted, I also lost /var/log/messages, so I can't see what happened...  (I will of course try this again once the resync has completed)
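(Note to self for next time: a minimal sketch for saving the log before a reboot, assuming the usual /var/log/syslog location; the /boot/logs target folder is just a convention I picked, not an Unraid requirement.)

  # copy the in-RAM syslog to the flash drive so it survives the reboot
  mkdir -p /boot/logs
  cp /var/log/syslog /boot/logs/syslog-$(date +%Y%m%d-%H%M%S).txt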

 

...and no, there are no problems with my HBA or the SAS cable I use to connect to the DS4246 shelf...  I am quite sure about this because, as it rebuilds now, there are no errors in the logs from the HBA or any of the disks...
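(For what it's worth, this is roughly how I checked; the grep patterns are guesses aimed at an LSI/mpt-style HBA and may need adjusting for other controllers.)

  # look for HBA, SAS link and block-layer errors in the live log
  grep -iE 'mpt|sas|link reset|i/o error' /var/log/syslog | tail -n 50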

 

As explained this is just a test setup with 3TB disks...  the final setup will be with 18TB disks, and failures like this will take forever to rebuild ;-)

(with 3TB disks I'm looking at 5-6 hours, which I can then multiply by 6) 🙂
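(Rough math, assuming rebuild time scales more or less linearly with capacity at the same sustained speed:)

  # 18 TB is 6x the 3 TB test disks, so 6 * 5-6 h is roughly 30-36 h per rebuild
  echo "$(( 18 / 3 * 5 ))-$(( 18 / 3 * 6 )) hours"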

 

Any help is appreciated.

 

/B 

 

4 minutes ago, Beardmann said:

then replaced the disk with a new one, and this worked just fine, and the parity is not rebuilding...

 

5 minutes ago, Beardmann said:

try this again once the resync has completed

So what you really meant to say was "the parity IS rebuilding"?

 

Bad connections are much more common than bad disks, but when a write to a disk fails, Unraid has to kick it out of the array since it is out-of-sync. To get a disk enabled again it has to be rebuilt so it is in-sync.

 

Attach diagnostics to your NEXT post in this thread and we will see where to go from here
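(If you prefer the command line, a rough sketch, assuming a current Unraid release; I believe the archive lands under /boot/logs, but the Tools -> Diagnostics page in the GUI works just as well.)

  # generate the anonymized diagnostics archive and locate the newest zip to attach
  diagnostics
  ls -lt /boot/logs/*.zip | head -n 1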


Hi trurl

 

Sorry for not uploading any diag... as mentioned this is a test system, so not at all critical.

The system is rebuilding yes, but I had to replace the physical disk with a disk that the system has not seen, in order to get it to work.

So I guess my question was whether Unraid keeps track of the physical disks it has seen?  Because no matter what I tried (from the GUI), I was unable to re-introduce the same disk as parity again... even after reboots...   and on another system I tested the "failed" parity disk with an extended SMART test, which showed no errors...

 

I do not entirely agree with you on "failed connections" being more common than failed disks... maybe with consumer-grade hardware, but this is enterprise gear all the way, and in 15+ years of working with NetApp gear I have yet to see a copper SAS cable fail once it was working... 🙂  Of course laser optics/cables and SFP modules can fail... but copper SAS cables, not so much 🙂   but let's not get bogged down in semantics ;-)

 

Just a quick note...

If I do a "cat /proc/mdstat" I can see that the last entries state this:

diskName.29=
diskSize.29=2930266532
diskState.29=6
diskId.29=ST33000650NS_SA_Z294PYBV_350000c9000354a4c
rdevNumber.29=29
rdevStatus.29=DISK_INVALID
rdevName.29=sdg
rdevOffset.29=64
rdevSize.29=2930266532
rdevId.29=ST33000650NS_SA_Z294PYBV_350000c9000354a4c
rdevReads.29=0
rdevWrites.29=122410131
rdevNumErrors.29=0

 

Which to me suggests that Unraid keeps track of the failed disk?  Any way to make it forget this?
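(A quick way to pull just those state fields for every slot, using only the /proc/mdstat file shown above:)

  # show the assignment and status for each array slot
  grep -E '^(diskName|diskState|rdevStatus|rdevId)\.' /proc/mdstat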

 

 

/B

19 hours ago, Beardmann said:

The system is rebuilding yes, but I had to replace the physical disk with a disk that the system has not seen, in order to get it to work.

If you want to re-use an existing disk then you need to make Unraid ‘forget’ about it by:

  • setting it to Unassigned
  • starting the array to commit that change
  • stopping the array

This is covered in the online documentation, accessible via the ‘Manual’ link at the bottom of the GUI or the DOCS link at the top of each forum page.
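(One way to confirm the slot really has been forgotten before reassigning the old disk, using only the /proc/mdstat output already posted above; slot .29 was the parity slot in that output, and the exact status strings can vary between releases.)

  # check what the slot reports after the unassign / start / stop cycle (it showed DISK_INVALID before)
  grep -E '^(diskId|rdevStatus)\.29=' /proc/mdstat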

On 9/28/2022 at 11:35 AM, Beardmann said:

do not entirely agree with you on the "failed connections" being more common than failed disks

That is based on my experience helping many, many users on this forum. We probably don't even see a lot of these cases, because experienced users already know what to do about them.

 

And in your case, it was indeed a bad connection.

 

On 9/28/2022 at 11:07 AM, Beardmann said:

(the shelf is hot plug)

Also note there is no point in hot plugging with Unraid, since it won't do anything with a disk until you assign it (or reassign it in your test), and you can't make any assignment changes with the array started.

 

 

 

