
2 disks redballed - what order to rebuild?



I have a slightly dodgy Supermicro 5-in-3 drive cage, which sometimes seems to cause read errors/redballs. I've checked all the SATA cables, but I think it might be a power issue. Obviously I need to sort that out!

 

But this time it's hit 2 disks at the same time: Parity 2 and a data drive.

 

As I see it I have 3 options:

 

1) new config and force the disks back to good status (I hear this is a bad choice?)

2) rebuild the data disk to the spare disk, then rebuild parity 2 onto the old data disk (I have a spare - safer this way?)

3) rebuild parity 2 to the spare disk, then rebuild the data drive onto the old parity 2 disk.

 

I was planning either 2 or 3. Any recommendation which way to take it?

 

Thanks!

Link to comment

Hmm, after a bit more investigation, the plot thickens.

 

I actually now recall I had a power cut on 4th May, during a monthly scheduled correcting parity check, which had already started. I only powered the server back on today (long story, due to our internet being down etc.) to find the two red-balls.

 

And it looks like it was a pretty weird parity check from the data on the dashboard (see attached, but 1953506633 errors!?).

 

I now suspect that the backplane may have gone bad during the parity check.

 

But that means I'm not sure whether to trust the data disks or the parity. I'm leaning towards the data disks: if that error count is correct, the correcting check could have written bad parity data. If that's the case and I want to trust the data disks, how do I force a parity rebuild with 2 parity disks, 1 of which is currently red-balled?

 

Thoughts?

 

Thanks!

 

 

parity.jpg

Link to comment
4 minutes ago, ctrlbreak said:

If that's the case and I want to trust the data disks, how do I force a parity rebuild with 2 parity disks, 1 of which is currently red-balled?

 

At this point the system should be "emulating" the disabled data drive using the remaining data drives and the parity1 drive. Does the content of the emulated drive look correct? If so, this would suggest that parity1 is probably good.
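
As a rough illustration of what "emulating" means here, the toy sketch below rebuilds one missing drive from the others. It assumes parity1 is plain XOR parity across the data drives; the disk contents and the `xor_blocks` helper are made up for the example and are not Unraid code.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three toy "data disks" (hypothetical contents) and the parity computed from them.
disk1 = bytes([0x12, 0x34, 0x56])
disk2 = bytes([0xAB, 0xCD, 0xEF])
disk3 = bytes([0x00, 0xFF, 0x10])
parity1 = xor_blocks([disk1, disk2, disk3])

# Pretend disk2 is disabled: XOR parity with the surviving disks to get it back.
emulated_disk2 = xor_blocks([parity1, disk1, disk3])
print(emulated_disk2 == disk2)
```

The same arithmetic also shows why the emulated contents are only as good as parity1 and the other data drives: one wrong byte in any of them gives a wrong byte in the emulated drive.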

 

If so, the following steps should rebuild parity2:

  • Stop array
  • Unassign parity2
  • Unassign disabled data drive (to ensure it is left alone)
  • Start array to 'forget' parity2 assignment
  • Stop array
  • Assign parity2
  • Start array to rebuild parity2 based on the data drives (including the emulated one).

If at all possible, keep the physical drive underlying the disabled data drive intact for as long as you can. Its contents are probably fine, so it is your last resort for getting that data back. You could try mounting it read-only in Unassigned Devices (UD) to see if it mounts OK. If the above rebuild of parity2 goes well, you then have the option of rebuilding the data drive onto the spare disk you mentioned, which still keeps the 'old' data disk intact for any recovery that might be required.

Link to comment
3 hours ago, itimpi said:

 

Does the content of the emulated drive look correct?

Very hard to tell - it's primarily large media files, so there could be corruption to them. I don't run regular hashes/file integrity checks (maybe I should start!).

 

But, being paranoid, what would cause so many parity errors/corrections from that last parity check? The server had been mostly idle since the previous parity check (due to the aforementioned internet outage), so I doubt a lot of data was written to the disks. That makes me suspect something else weird was going on, so I wonder if it's worth digging deeper into the errors (no logs, sadly). I suppose I could try to compare the emulated disk 4 with the "real" disk 4 to look for any file hash discrepancies...
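
A minimal sketch of how that comparison could be done, assuming both the emulated disk and the physical disk can be mounted at the same time. The function names and the example paths in the usage note are hypothetical, and SHA-256 is just one reasonable choice of hash.

```python
import hashlib
from pathlib import Path


def file_hash(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_trees(emulated_root, physical_root):
    """Compare every file under emulated_root with the same relative path
    under physical_root. Returns (mismatched, missing) lists of relative paths."""
    emulated_root = Path(emulated_root)
    physical_root = Path(physical_root)
    mismatched, missing = [], []
    for src in sorted(p for p in emulated_root.rglob("*") if p.is_file()):
        rel = src.relative_to(emulated_root)
        other = physical_root / rel
        if not other.is_file():
            missing.append(str(rel))
        elif file_hash(src) != file_hash(other):
            mismatched.append(str(rel))
    return mismatched, missing
```

Usage would be something like `mismatched, missing = compare_trees("/mnt/disk4", "/mnt/disks/old_disk4")`, where both paths are assumptions about where the two copies end up mounted. With large media files this reads every byte on both disks, so expect it to take roughly as long as a parity check.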

 

Link to comment

FYI - I tried mounting the data disk with UD, but it complains about a clash with the (emulated) device:

 

May 13 21:08:43 bigboi kernel: XFS (sdn1): Filesystem has duplicate UUID 041b0c28-6d93-4c28-89e4-c85ff7eef5d2 - can't mount
May 13 21:08:43 bigboi unassigned.devices: Mount of '/dev/sdn1' failed: 'mount: /mnt/disks/WDC_WD140EMFZ-11A0WA0_9RKDW2VC: wrong fs type, bad option, bad superblock on /dev/sdn1, missing codepage or helper program, or other error. '
May 13 21:08:43 bigboi unassigned.devices: Partition 'WDC_WD140EMFZ-11A0WA0_9RKDW2VC' cannot be mounted.

 

Any ideas how/if I can mount it on the same server?
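
One thing that might be worth trying, untested here: XFS has a `nouuid` mount option that skips the duplicate-UUID check, and combining it with `ro` should keep the old disk from being written to. The sketch below only builds the command and prints it unless told otherwise; the mount point is a hypothetical directory that would need to exist first, and the helper names are made up.

```python
import subprocess


def build_mount_command(device, mount_point):
    """Build a read-only XFS mount command that ignores the duplicate-UUID check."""
    return ["mount", "-t", "xfs", "-o", "ro,nouuid", device, mount_point]


def mount_read_only(device, mount_point, dry_run=True):
    """Print the command (dry run, the default) or actually run it as root."""
    cmd = build_mount_command(device, mount_point)
    if dry_run:
        print(" ".join(cmd))
        return cmd
    subprocess.run(cmd, check=True)  # raises CalledProcessError if mount fails
    return cmd


# Device name taken from the log above; the mount point is an assumption.
mount_read_only("/dev/sdn1", "/mnt/disks/old_disk4")
```

The default dry run is deliberate, since the device letter can change between boots and should be checked before running anything for real.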

Link to comment
