Failing disk path forward



Looking for a bit of advice.

 

Backstory: Last week I replaced my 8TB parity drive with a new 14TB drive. (Old parity drive is still good.)

Around the same time I noticed a different drive (disk4) was beginning to have errors, but nothing that really worried me. (I've never lost a drive, so I wasn't very sensitive to the issue.)

Yesterday I installed an LSI HBA and moved all drives to it, installed a second cache SSD, and brought up the array. I began zeroing the old parity drive to add more space to the array (added as disk6).
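
(For context, Unraid handles the clearing itself when a disk is added to a new slot; done by hand from the console, zeroing a whole drive is roughly the dd sketch below, where /dev/sdX is only a placeholder for the old parity drive. It wipes everything on that device, so the device name has to be triple-checked first.)

# write zeros across the entire device; destroys all data on /dev/sdX
dd if=/dev/zero of=/dev/sdX bs=1M status=progress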

Yesterday in the late afternoon, the problematic disk (disk4) started spitting out tons of errors. Unraid has not marked it failed yet, but all the telltale signs are there. (I've been reading the forums since I first encountered the issues.)

 

Reallocated sector count: 102

Current pending sector: 2216

Offline uncorrectable: 266

UDMA CRC error count: 14

SMART short test: errored out.
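
(For reference, those values come straight from the disk's SMART attribute table; from the console the equivalent check is roughly the sketch below, with /dev/sdX standing in for disk4's device. Just a sketch of the usual smartctl calls, not necessarily what Unraid runs internally.)

# dump the SMART attribute table (reallocated, pending, offline uncorrectable, CRC counts)
smartctl -A /dev/sdX

# start a short self-test, then read the self-test log a few minutes later
smartctl -t short /dev/sdX
smartctl -l selftest /dev/sdX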

 

I just want to take the zeroed disk (disk6, not formatted yet) and use it to replace the failing disk (disk4), but since I've already added it to the array, Unraid sees it as an additional missing disk if I try to move it, even though it was never formatted. Parity "should be" good at this point; however, the scheduled check only runs monthly at the beginning of the month, which is a few days away.

 

Path forward?


I may have already started down another path by accident, and I'm not sure how to proceed.

 

I stopped the array and removed the failing device from the array.

Started the array.

Then I stopped the array again, hoping to move the zeroed drive into slot 4. No dice (2 missing drives).

 

Trying to put things back to normal, I added drive 4 back and started the array, and now Unraid is rebuilding onto the failing drive (thinking it's new for some reason).

 

What is the path forward now?

 

 


OK, well, 14 hours to go then.

Currently at:

Reallocated sector count: 290

Current pending sector: 936

Offline uncorrectable: 266

UDMA CRC error count: 14

 

I do have a new 8TB drive coming in on Tuesday that I can just drop in. It just needs to make it until then.


OK, so the rebuild of the failing drive is finished, and at the moment there are no errors in Unraid. The SMART data is as follows:

Reallocated sector count: 525

Current pending sector: 56

Offline uncorrectable: 266

UDMA CRC error count: 14

 

With no more errors in Unraid, would you suggest replacing the drive, or seeing how things play out? I still have the one drive zeroed and ready to drop in, and I also have an additional 8TB coming in the mail on Wednesday. If I do not replace the drive, I would consider using the drive coming in the mail as a second parity and the zeroed drive as additional storage space in the array. I am not interested in purchasing another drive in the near future to replace the failing one, but I also don't want to trash a drive that may still be good.
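
(One rough way to keep an eye on whether the pending-sector count keeps climbing while deciding, assuming /dev/sdX is the failing disk's device and that SMART attribute 197 is Current_Pending_Sector; just a sketch:)

# print a timestamp and the Current_Pending_Sector line once an hour
while true; do
    date '+%F %T'
    smartctl -A /dev/sdX | awk '$1 == 197'
    sleep 3600
done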


You should replace it, but since you haven't posted diagnostics I hesitate to make any recommendations, since you may have other problems you are unaware of.

 

On 6/26/2022 at 7:44 AM, bigrob8181 said:

all the telltale signs are there

 

You should have been getting notifications from Unraid about this disk. Do you have Notifications set up to alert you by email or another agent as soon as a problem is detected? Do any of your other disks have SMART warnings on the Dashboard page?


Diagnostics are attached.

 

I was getting a lot of notifications this last week. I have not gotten any additional notifications or emails since I accidentally rebuilt the failing drive onto itself.

 

Disregard; right after I sent this, I got a disk error notification and the counter is at 4. I suppose that makes it an easy decision to just swap it out and expand the storage on Wednesday. I suppose I will just have to go dual parity in the future. 👍

zeus-diagnostics-20220627-1134.zip

1 hour ago, trurl said:

I know you discussed adding a disk, but I'm not sure what happened with that. Was disk6 added but not formatted yet?

That's correct. I was adding it to increase the array size, but with disk4 failing I decided it would be better suited to replacing it.

I have a rack server case being delivered today, so I decided that whatever I do, it'll have to start after the case migration.


I guess my question wasn't clear. Your latest screenshot shows a disk assigned as disk6 but unmountable.

 

Did you replace disk4? Or did you rebuild to the original disk4?

 

Either way, did you also add a disk6, and it is unmountable because it hasn't been formatted yet?

 

 

3 hours ago, bigrob8181 said:

No other warnings.

I was talking about the Dashboard page, but you posted a screenshot of Main - Array Devices. That did answer the question about the "counter", but it also prompted the other questions I had about disk6.

 

I was still wondering about whether or not you had any SMART warnings on the Dashboard page, so I looked at the SMART reports of each disk in your Diagnostics.

 

On disks 1, 2, 3, you should be getting warnings for UDMA CRC error, unless you have already acknowledged them.

 

The disk4 in your diagnostics has the same problems as before, so I guess you rebuilt the original.

 

 

 

On 6/26/2022 at 8:22 AM, JorgeB said:

Wait for the rebuild to finish, then do what I posted above. It can get more complicated if the rebuild doesn't finish due to errors, but errors are much more likely on reads vs. writes.

OK, so I attempted what was instructed above and must have messed something up.

Disk6 is now in the disk4 spot. Unraid is re-zeroing it. It is NOT emulating the failing disk.

Parity shows as valid (although I am concerned, because the data is not being emulated, so I would think it might mess something up across the disks).

No data from the failing disk4 is on the array; however, it's still on the disk.

 

Is parity being marked valid, when it clearly isn't, going to be problematic?

 

I suppose I need to allow the zeroing to happen again and then use unBALANCE to move the data back onto the array. Is this the correct way to proceed?
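
(My understanding is that unBALANCE essentially does an rsync copy under the hood; done by hand it would look roughly like the sketch below, where /mnt/disks/old_disk4 is a made-up mount point for the old failing disk via Unassigned Devices and /mnt/disk4 is the array disk it's being moved back to.)

# copy everything from the old disk back onto the array disk, preserving attributes
rsync -av --progress /mnt/disks/old_disk4/ /mnt/disk4/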

2 hours ago, bigrob8181 said:

No data from the failing disk4 is on the array; however, it's still on the disk.

What do you mean? How do you know it's still on the disk?

 

2 hours ago, bigrob8181 said:

Disk6 is now in the disk4 spot. Unraid is re-zeroing it. It is NOT emulating the failing disk.

 

Unraid will not have made any assignment changes itself, and it would only clear a disk added to a new slot. Something is missing from your description.

 


I may be able to guess what you did by breaking this down some.

 

On 6/26/2022 at 7:50 AM, JorgeB said:

If disk6 was never formatted, you can do a new config without it (Tools -> New Config)

 

New Config with nothing assigned as disk6, but with the original disk4 assigned as disk4, and all other disks assigned just as they were.

 

On 6/26/2022 at 7:50 AM, JorgeB said:

check "parity is already valid" and start the array

 

At this point, all disks would be accepted just as they were assigned.
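
(The reason "parity is already valid" is safe here: single parity is just a bitwise XOR across the data disks, and an all-zero disk contributes nothing to that XOR, so dropping the never-formatted, cleared disk6 leaves parity unchanged. A toy illustration in shell arithmetic, nothing more:)

# parity byte over two data bytes, with and without an all-zero third disk
echo $(( 0xAA ^ 0x55 ))          # 255
echo $(( 0xAA ^ 0x55 ^ 0x00 ))   # still 255 -- the zeroed disk changes nothing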

 

On 6/26/2022 at 7:50 AM, JorgeB said:

stop the array, and replace disk4

 

Stop the array, assign the new disk as disk4. At this point, starting the array would make it rebuild the emulated disk4 onto the new disk.

 

What did you do instead of what I explained above?

 

For example, if you did a New Config without any disk assigned as disk4 and started the array, then added disk4, it would begin clearing the added disk.

 

If that's what happened, then assuming no other writes to the array, it should still be possible to get it to emulate disk4.

 

7 hours ago, trurl said:

I may be able to guess what you did by breaking this down some.

 

 

New Config with nothing assigned as disk6, but with the original disk4 assigned as disk4, and all other disks assigned just as they were.

 

 

At this point, all disks would be accepted just as they were assigned.

 

 

Stop the array, assign the new disk as disk4. At this point, starting the array would make it rebuild the emulated disk4 onto the new disk.

 

What did you do instead of what I explained above?

 

For example, if you did a New Config without any disk assigned as disk4 and started the array, then added disk4, it would begin clearing the added disk.

 

If that's what happened, then assuming no other writes to the array, it should still be possible to get it to emulate disk4.

 

I may have done a New Config both when removing the zeroed drive AND when moving the zeroed drive to slot 4. I understand now what I should have done.

Thoughts on doing a New Config with the failing drive back in its original location, start, stop, then replace it with the good drive? Any concerns with parity on the other drives? If I need to do an rsync between the two drives afterward to ensure all the data is correct, that's fine.

 

At this point I still have the data; I'm just looking for the quickest and most complete path forward.

 

Sorry for misunderstanding and causing a mess, although not a complete disaster.

1 hour ago, JorgeB said:

Should be OK, but some filesystem corruption can happen.

Would the corruption be only in the failing drive's files, or across the array? Since I still have the data, I should be able to recover from any corruption with a simple rsync command, I would think, unless it's in a bad spot on the failing drive.
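
(If it does come to that, I'm assuming a checksum-only dry run along these lines would show which files differ before anything gets copied back; /mnt/disks/old_disk4 is just a placeholder for wherever the old disk ends up mounted, and /mnt/disk4 is the copy on the array.)

# compare by checksum without copying anything: -n dry run, -c checksum, -i itemize differences
rsync -rcni /mnt/disks/old_disk4/ /mnt/disk4/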

