
Other disk failed after disk replacement



Hey,

 

Here is the situation. My Unraid server is on version 6.11.5 (I wanted to get the array back into a good state before updating).

 

- My Unraid server, built two years ago, was running fine with 2x 18TB (one as parity), a 14TB added later, and 4x 6TB from my old Synology (which already have over 7 years of runtime).

- The first disk started to show a few errors. Parity was always able to correct them, and the disk was never disabled.

- A second disk (disk3 in the logs) started to show occasional CRC errors. I had read that these can simply occur and are nothing to worry about.

- I bought another 18TB, precleared it, and kept it as a hot spare.

- A few days ago I finally decided to retire the first disk with the errors and replace it with the hot-spare 18TB. I followed the normal guide for that, then started the rebuild.

- Suddenly, after a few hundred GB, disk3 started throwing a lot of errors (nearly as many errors as reads). The rebuild slowed to 2 MB/s for several hours, then continued at the expected speed. It finished a few hours ago, reporting a lot of errors but still indicating that parity is valid.

- Directly afterwards, disk3 was disabled because of the error count. After a reboot (during which I switched to a different SATA cable), it is no longer even mountable ("Unmountable: Wrong or no file system"). The SMART report shows nothing strange as far as I can tell, apart from the CRC errors. An extended SMART test is running now.

What should I do now? Could there be data loss? I still have the old, removed HDD and haven't touched it yet. The new 18TB HDD could easily hold the data of both failing disks; is there a way to move the data onto it?

galileo-diagnostics-20240723-1912.zip

14 minutes ago, schmidtflo said:

Suddenly, after a few hundred GB, disk3 started to throw a lot of errors (nearly as much errors as reads).

At this point you should have stopped the rebuild, since it was rebuilding a corrupt disk.
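To see why a rebuild with a misbehaving disk3 produces a corrupt disk5, here is a toy model of single parity. It assumes XOR parity (how a single parity disk works conceptually) and assumes the bad reads returned wrong data rather than failing outright; the block contents and helper names are made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild(parity: bytes, surviving: list[bytes]) -> bytes:
    """Reconstruct the missing disk by XOR-ing parity with every surviving disk."""
    return xor_blocks([parity, *surviving])


# Hypothetical 4-byte blocks from three data disks at the same offset.
disk1 = bytes([0x10, 0x20, 0x30, 0x40])
disk3 = bytes([0xAA, 0xBB, 0xCC, 0xDD])
disk5 = bytes([0x01, 0x02, 0x03, 0x04])
parity = xor_blocks([disk1, disk3, disk5])

# Clean reads from disk1 and disk3: the rebuilt block matches disk5.
good = rebuild(parity, [disk1, disk3])

# disk3 returns a corrupted byte (e.g. over a bad SATA link): the rebuilt block
# is wrong, and with single parity there is nothing left to detect it.
bad_disk3 = bytes([0xAA, 0x00, 0xCC, 0xDD])
bad = rebuild(parity, [disk1, bad_disk3])

print(good == disk5)  # True
print(bad == disk5)   # False
```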

 

Disk3 looks OK. Lots of CRC errors, so it probably just needs a new SATA cable. Is the old disk5 still readable?


The old parity history shows nothing too strange except for the rebuild.
[Screenshot: parity check history]

 

Quote

Disk3 looks OK. Lots of CRC errors, so it probably just needs a new SATA cable

I already switched to a different SATA port at the reboot; afterwards it was not mountable anymore.

 

Quote

is old disk5 still readable?

I have removed it from the server. Should I reinstall it again to check? It was not disabled and was still readable when I removed it.

Edited by schmidtflo
30 minutes ago, schmidtflo said:

afterwards it was not mountable anymore.

Because it's disabled. See if it mounts with UD (Unassigned Devices).

 

30 minutes ago, schmidtflo said:

Should I reinstall it again to check?

Yes, check it with UD too, with the array stopped for this one.


Both mounted successfully. The GUI popup reported an error while mounting, but the syslog looks fine: both are listed as mounted in the GUI and can be accessed via the terminal (browsing the folder structure shows no obvious errors).

[Screenshot]

 

In the meantime I actually replaced the SATA cable. Before, I had just stopped using it, but when I reinstalled the old disk and put it on that cable, it also threw CRC errors. So I switched to a brand-new cable, and now no errors are shown anymore.
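UDMA CRC errors are transfer errors on the link between the drive and the controller, which is why a cable swap is the usual fix. One way to confirm the new cable helped is to compare SMART attribute 199 (`UDMA_CRC_Error_Count`) from `smartctl -A` before and after the swap: on most drives this raw counter is cumulative and cannot be reset, so what matters is whether it keeps growing. The sketch below is hypothetical: the helper names, the device path in the comment, and the sample counts are made up, and it assumes smartctl's standard 10-column attribute table.

```python
from typing import Optional


def udma_crc_count(smartctl_attrs: str) -> Optional[int]:
    """Return the raw value of SMART attribute 199 (UDMA_CRC_Error_Count)
    from `smartctl -A` output, or None if the drive doesn't report it."""
    for line in smartctl_attrs.splitlines():
        fields = line.split()
        # Columns: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[0] == "199":
            return int(fields[9])
    return None


def link_still_bad(before: str, after: str) -> bool:
    """The counter is cumulative, so only an increase points to a bad link."""
    b, a = udma_crc_count(before), udma_crc_count(after)
    return b is not None and a is not None and a > b


# Hypothetical snapshots taken before and some hours after the cable swap, e.g. from
# subprocess.run(["smartctl", "-A", "/dev/sdX"], capture_output=True, text=True).stdout
before = "199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 1532"
after = "199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 1532"
print(link_still_bad(before, after))  # False
```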

 

So both disks seem to be in a fine-ish state. How should I continue? Now that the SATA cable has been swapped, can disk3 (the one with the high error count) probably be mounted again without errors? Would everything then be fine, or do I have to "restart" the rebuild?


Yep, seems to be failing. 

My strategy would be to mount disk3 (the one with the broken SATA cable) again; how do I do this, since Unraid marks it as disabled? Then, do I have to redo the rebuild of disk5 (the disk I actually replaced), e.g. by formatting the new disk5 and starting a rebuild? Or is it fine, since the system didn't cancel the rebuild even though errors occurred?

galileo-smart-20240724-1446.zip

19 hours ago, schmidtflo said:

The old parity history shows nothing too strange except for the rebuild.

Any parity errors at all need to be understood and not dismissed as "nothing too strange".

 

The screenshot suggests you have configured scheduled parity checks to correct parity errors. It's better to schedule non-correcting checks. If errors are found, determine the cause, and only after eliminating it, correct parity.

 

You don't want to allow a bad disk or connection to corrupt parity.

 

The small number of parity errors on previous checks could be due to unclean shutdowns, or possibly RAM.
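The difference between correcting and non-correcting checks can be shown with a toy model, assuming XOR single parity and made-up 4-byte blocks. `parity_check` is a hypothetical helper, not Unraid's code: it counts mismatched bytes and, only when `correct=True`, replaces the stored parity with the recomputed value. If disk3 returns a bad read during the check, the non-correcting check just reports it, while the correcting one bakes the bad read into parity, so a later rebuild of another disk comes out wrong even after the cable is fixed.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def parity_check(data_disks: list[bytes], parity: bytes, correct: bool = False):
    """Return (mismatch_count, parity_after_check).

    With correct=True the stored parity is overwritten with the value recomputed
    from whatever the data disks returned, including bad reads."""
    expected = xor_blocks(data_disks)
    mismatches = sum(1 for p, e in zip(parity, expected) if p != e)
    return mismatches, (expected if correct else parity)


disk1 = bytes([0x10, 0x20, 0x30, 0x40])
disk3 = bytes([0xAA, 0xBB, 0xCC, 0xDD])
disk5 = bytes([0x01, 0x02, 0x03, 0x04])
parity = xor_blocks([disk1, disk3, disk5])

# During the check, disk3 is read through a faulty cable (hypothetical corruption).
disk3_bad_read = bytes([0xAA, 0x00, 0xCC, 0xDD])

found, kept = parity_check([disk1, disk3_bad_read, disk5], parity, correct=False)
found_c, overwritten = parity_check([disk1, disk3_bad_read, disk5], parity, correct=True)

# Later, disk5 fails and is rebuilt from parity plus the (now healthy) disk1 and disk3.
rebuilt_from_kept = xor_blocks([kept, disk1, disk3])
rebuilt_from_overwritten = xor_blocks([overwritten, disk1, disk3])

print(found, rebuilt_from_kept == disk5)           # 1 True
print(found_c, rebuilt_from_overwritten == disk5)  # 1 False
```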

Quote

If you were running a correcting check when you had the large number of parity errors, then you probably have corrupted parity, so the rebuild will not be good.

The large number of errors only occurred while rebuilding the array after replacing the faulty disk. So my idea was to bring the disk with the SATA cable errors back into the array (the array should then be good, except for the new disk, whose contents were calculated incorrectly), then format the new disk to start another rebuild.

 

What do you think of this, is this a viable solution?

50 minutes ago, JorgeB said:

Since disk5 is failing, you can mount it with UD and try to copy what you can to new disk(s), or try to clone it with ddrescue. When done, do a new config with the remaining disks to re-enable disk3.

Okay, just to make sure I understood correctly:

- clone old disk5 with ddrescue to a brand new one (I'll just go and buy one)

- take the new disk5 (18TB) that I just installed out of the NAS and don't use it (for now)

- use the cloned disk5 as the regular disk5 in the array from now on, and create a new config with all disks (including disk3, which becomes usable again with the new config, and the cloned disk5); this will regenerate parity on the parity drive

 

Or can I take the new disk5 (18TB), format it, ddrescue the old disk5 onto it, and then use it as the regular disk5 from now on?

11 minutes ago, schmidtflo said:

- clone old disk5 with ddrescue to a brand new one (I'll just go and buy one)

- take the new disk5 (18TB) that I just installed out of the NAS and don't use it (for now)

- use the cloned disk5 as the regular disk5 in the array from now on, and create a new config with all disks (including disk3, which becomes usable again with the new config, and the cloned disk5); this will regenerate parity on the parity drive

Correct. The cloned disk should have the same capacity as the old disk5, or it won't mount in the array; it will still mount with UD if needed.
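A minimal sketch of the ddrescue route, assuming GNU ddrescue and the two-pass approach commonly used for failing drives: a fast first pass with `-n` to skip the scraping phase, then a second pass with `-d` (direct disc access) and `-r3` (three retry passes), both sharing one mapfile so the second pass only revisits the bad areas. The device names `/dev/sdX` and `/dev/sdY`, the mapfile name, the example size, and the function `plan_ddrescue_clone` are hypothetical; the capacity check encodes the same-capacity rule from the reply above. The function only builds and prints the commands, it does not run them. `-f` is needed to write to a device and will overwrite the target, so double-check which device is which.

```python
import shlex


def plan_ddrescue_clone(source: str, target: str, source_bytes: int,
                        target_bytes: int, mapfile: str) -> list[list[str]]:
    """Return the two ddrescue passes for cloning `source` onto `target`.

    Sizes are supplied by the caller (e.g. from `lsblk -b`); nothing is executed."""
    if target_bytes < source_bytes:
        raise ValueError("target is smaller than source; the clone cannot fit")
    if target_bytes != source_bytes:
        # Per the thread: a different-size clone won't mount in the array slot,
        # though it should still mount with Unassigned Devices.
        print("warning: capacity differs from old disk5; usable via UD only")
    return [
        # Pass 1: copy the easy areas quickly, skipping the slow scraping phase.
        ["ddrescue", "-f", "-n", source, target, mapfile],
        # Pass 2: revisit the bad areas with direct disc access and 3 retry passes.
        ["ddrescue", "-d", "-f", "-r3", source, target, mapfile],
    ]


# Hypothetical devices and size: check `lsblk -b` for the real names and byte counts.
six_tb = 6_001_175_126_016
for cmd in plan_ddrescue_clone("/dev/sdX", "/dev/sdY", six_tb, six_tb, "disk5.map"):
    print(shlex.join(cmd))
```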
