Order of Operations Replacing failed drive after unclean shutdown

Sands_at_Pier147 · January 27, 2021

Today I was writing to a data drive and I started to get CRC Errors. Tried a few things, and decided it is probably a failing disk, so I will go ahead and replace it. I don't have one on hand, so it will come tomorrow.

Everything was fine before it failed.

Separately, I just inadvertently unplugged the server. I was doing some maintenance in the equipment rack. The array was started. Six of the seven data drives were mounted, and so was the parity drive. Only the failed seventh data drive was not mounted (if I am using the terminology correctly).

I know from previous experience when I start up the array from an unclean shutdown, it is going to want to run a parity check immediately. But what happens if it runs a parity check when one of the drives needs to be rebuilt? Am I hosed? What order should I do things (start array, parity check, replace drive, rebuild drive, etc.) to make sure I can get the data back?

I was all set to rebuild the drive until I accidentally unplugged the server.

Thanks,

Brian

trurl · January 27, 2021

It can't do a parity check with a disabled disk.

crc errors are bad connections not bad disk.

Go to Tools-Diagnostics and attach the complete Diagnostics ZIP file to your NEXT post in this thread.

Sands_at_Pier147 · January 27, 2021

Here is the diagnostic file.

Thanks for your help.

cinemarougea-diagnostics-20210126-2059.zip

trurl · January 27, 2021

12 hours ago, Sands_at_Pier147 said:

The array was started. Six of the seven data drives were mounted, and so was the parity drive. Only the failed seventh data drive was not mounted (if I am using the terminology correctly).

The array is not started in those diagnostics, so I can't see which disks are mounting.

Technically, parity drive doesn't "mount" because it has no filesystem.

Is it disabled disk1 that wasn't mountable?

Just to make sure there isn't some confusion in terminology, what did you see that made you think it wasn't mounted?

Sands_at_Pier147 · January 27, 2021

I started the array. I didn't realize that the array not being started affected the diagnostics. Sorry about that.

Disk 1 (sdc) is the disk that has failed. It is one of seven 3TB data disks. There is an eighth drive, the parity drive, of the same size.

I guess I was thinking that a disabled drive was not part of the array, and therefore was not mounted. I may have been using the terminology incorrectly.

Disk 1 is disabled, I guess. And emulated.

I can try to reseat the drive and cables if that is what might be causing UDMA CRC errors.

If I swap out the disk, will the system assume the parity is correct, and rebuild the drive accordingly? Even with the unclean shutdown?

cinemarougea-diagnostics-20210127-0734.zip

trurl · January 27, 2021

Unclean shutdown doesn't invalidate parity, it only triggers a parity check. When you manually start a parity check, parity is still considered valid. Same with unclean shutdown.

Disabled and emulated disk1 is mounted. SMART for that disk looks OK. SMART extended test was aborted before it could complete some time in the past. You could run another if you want.

Rebuilding to another disk allows you to keep the original disk unchanged in case there are problems with the rebuild, but you can rebuild to the same disk if you want.

Do you know how to proceed?

Sands_at_Pier147 · January 27, 2021

I tried rebuilding to the same disk yesterday, because it looked like SMART data was fine. But I kept getting additional UDMA CRC errors. I assume I need to correct those first? How do I tell the system the drive is good if it thinks it is not?

trurl · January 27, 2021

4 hours ago, Sands_at_Pier147 said:

I tried rebuilding to the same disk yesterday, because it looked like SMART data was fine. But I kept getting additional UDMA CRC errors. I assume I need to correct those first? How do I tell the system the drive is good if it thinks it is not?

You didn't mention you had already tried to rebuild. And since you rebooted, I can't see anything about the original problem or the rebuild attempt in those diagnostics. Syslog resets on reboot since it is in RAM. There are ways to set things up so the syslog gets saved somewhere, but that doesn't help with what already happened.

It doesn't think the drive is bad. It thinks it needs to be rebuilt because it is out-of-sync.

When a write to a disk fails, Unraid disables it so nothing else can happen to it, but it uses that failed write to update parity anyway. Any access to the disk after it is disabled is using the "emulated" contents of the disk which comes from calculating its contents from parity and all the other disks.

The emulated disk is used whether reading or writing, with parity being updated to emulate the writes. So, after a disk is disabled, that original failed write, and any subsequent writes to that emulated disk, can still be recovered from the parity calculation. But the physical disk itself is out-of-sync with the array (the emulated contents) and needs to be rebuilt so the emulated contents can be written back to the disk.
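To make the emulation idea concrete, here is a minimal sketch, assuming simple single-parity XOR across the data disks. The tiny 4-byte "disks" and the helper name `xor_blocks` are made up for illustration; this is not Unraid's actual code, just the arithmetic behind it.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Hypothetical 4-byte "disks"; the same arithmetic applies to every sector.
disk1 = bytes([0x11, 0x22, 0x33, 0x44])
disk2 = bytes([0xA0, 0xB0, 0xC0, 0xD0])
disk3 = bytes([0x0F, 0x0E, 0x0D, 0x0C])
parity = xor_blocks([disk1, disk2, disk3])

# disk1 gets disabled. Reads are now served from the emulated contents:
# parity XOR all the other data disks.
emulated_disk1 = xor_blocks([parity, disk2, disk3])
reads_match_before_write = (emulated_disk1 == disk1)

# A write to the emulated disk1 never touches the physical disk.
# Only parity is updated so that the emulation reflects the new data.
new_data = bytes([0xDE, 0xAD, 0xBE, 0xEF])
parity = xor_blocks([new_data, disk2, disk3])
emulated_disk1 = xor_blocks([parity, disk2, disk3])

# The physical disk still holds the old bytes, so it is out-of-sync.
physical_disk1 = disk1
out_of_sync = (physical_disk1 != emulated_disk1)

# A rebuild writes the emulated contents back to a physical disk.
rebuilt_disk1 = emulated_disk1
```

After the rebuild, the physical disk and the emulated contents agree again, which is why the rebuild can target either the same disk or a replacement.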

If CRC is increasing you need to fix that, but CRC errors are just the disk keeping track of the bad communications it has had (CRC doesn't match the data). You need to fix the connection or controller issue causing that. And often, there will be connection or controller issues the drive doesn't even know about since it never received any data.

On the Dashboard page, you can click on the SMART warning for that disk and it will let you Acknowledge the current CRC count, and it will warn again if it increases.
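The acknowledge-then-warn-again behavior can be pictured with a small sketch. The class and method names here (`CrcWatch`, `acknowledge`, `should_warn`) are hypothetical and only illustrate the logic; they are not how Unraid implements it.

```python
class CrcWatch:
    """Tracks an acknowledged CRC error count and flags any later increase."""

    def __init__(self):
        # Nothing acknowledged yet, so any nonzero count is worth a warning.
        self.acknowledged = 0

    def acknowledge(self, current_count):
        # The user has seen this count; treat it as the new baseline.
        self.acknowledged = current_count

    def should_warn(self, current_count):
        # The SMART CRC counter does not reset once the cabling is fixed,
        # so only a count above the baseline means new errors.
        return current_count > self.acknowledged


watch = CrcWatch()
first_warning = watch.should_warn(12)      # count seen for the first time
watch.acknowledge(12)
quiet_after_ack = watch.should_warn(12)    # unchanged count, no warning
warns_on_increase = watch.should_warn(13)  # new errors since acknowledging
```

The point of the baseline is that a fixed connection leaves the old count in place, so what matters is whether the number keeps climbing.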

Just noticed you have Marvell controllers. Those are known to cause problems, but some people seem to be able to make them work as long as they disable IOMMU (VT-d) in the BIOS. Do you have any VMs that need direct hardware access?

Why are you running such an old version of Unraid?

Sands_at_Pier147 · January 28, 2021

Alright, we're back in business.

Last night, you said "crc errors are bad connections not bad disk." My review of the SMART data seemed to support that. I disconnected everything and put it all back together, just to make sure everything is seated properly and securely. I have been doing some other maintenance in the equipment rack where this server is located, so I might have inadvertently knocked something loose. I expect computer gear to be more resilient than it apparently is. I rebuilt the drive onto itself, and everything seems happy. No more errors, no problems with the write cycle to rebuild the drive. Now I even have an extra drive on hand in case something really bad happens.

Thank you so much for the last explanation you posted. That made a lot of sense to me. I wasn't able to gain that much knowledge from the materials I read yesterday, so I really appreciate the full explanation of how things are working with respect to disabled disks and emulated content.

I've been using these machines (I have two identical ones) for years. One of them I built in 2019 and the other I built in 2009. Never seemed to have an issue with the controller cards that I knew of. As a precaution, maybe I should get them out of the system. I don't use virtual machines so maybe that's why I haven't seen any issues. But I/O cards are cheap in the grand scheme of things, so I should probably replace them before I put the server back in the rack.

As for the old version of Unraid, I view these servers like appliances. Set them and forget them. They work so well for what I need, I never have to do much to them. I run a parity check once or twice a year, but that is it. Maybe I should update now while I am working on them.

Thanks again for your help yesterday and today. I really appreciate the support.

Brian

trurl · January 28, 2021

9 hours ago, Sands_at_Pier147 said:

Set them and forget them

Make sure you set up Notifications to alert you immediately by email or other agent as soon as a problem is detected. Don't allow one problem to become multiple problems and data loss.
