
[SOLVED] Parity died? Suggestions on revive or rebuild


RobertP


1)

I have an 11-drive array; 2 of the drives are parity disks. Two data drives died when I kicked off Mover. I shut down, pulled them out, and put in replacements. Foolishly, at the same time I moved my parity 2 drive to a different slot. When I powered up, the two data drives still showed as needing a rebuild, of course. I started a rebuild and parity 2's status immediately changed to a red X. I immediately canceled the rebuild, shut down, and moved parity 2 back to its old slot, but it still shows a red X. I've shut down and tried moving parity 2 to other slots, but no luck. There is no visible damage to the plug.

I've run a quick SMART test on parity 2 and it "passes with no errors", but the drive still shows a red X.

I think my only choice is to replace parity 2 and rebuild my server from scratch - but wondering if anyone has suggestions on "reviving" a drive? (I can dream, can't I??)

Luckily this server is used to hold backups of my computers and my other unRAID, so TECHNICALLY nothing was lost (other than historical recoveries!), but DARN IT!  
Any suggestions on the best way to rebuild it from scratch while retaining all the share names?  Is there an easy way to save the data on the other drives that are not bad?  Isn't there a Windows tool that will read Linux/Unix drives, so I could plug them into my computer and copy off the data?

2)

I have what feels like an unusual number of sporadic drive failures in this system.  I suspect a bad SATA card or cables (or maybe some of the 5-unit drive cages).  I've tried unplugging and replugging the cables and card, but the failures continue. When I put the "bad" drives in my other system and run a preclear, the preclear completes OK!?!?  What should I do to make sure they really are OK?  Could a bad (or underpowered) power supply also cause this?  I'm not sure how to test the power supply.

(On a side note - this system used to get kinda hot, but I've added additional fans and now the drives stay in the normal range.  Maybe the previous overheating caused premature drive failures?  They never got into the danger temp zone, just the warning temp zone.)

Thanks,

Bob

Smart report -parity 2 drive - 20220926-1324.txt


Not sure how accurate the diag will be, I've stopped and restarted system several times since the first issue.  I've also already pulled out the 2 "bad" drives  and inserted two new ones in prep for doing the rebuild.  I have already begun a PRE-CLEAR test on the two "bad" drives via my other unRAID system (so far no issues).  Attached is a screen shot taken after the 2 drives "failed" and before the parity 2 failed, the SMART reports from the two data drives, and the DIAG from this system.

Failed drives - 2022-09-22.jpg

ur1-smart-20220922-1149-ST16000NM001G-2KK103_ZxxxxxxL (sdg).txt ur1-smart-20220922-1155-ST16000NM001G-2KK103_Zxxxxxx (sdh).txt ur1-wopr-diagnostics-20220926-1656.zip

Edited by RobertP

You can tell which smart report is for what using the sdg and sdh.

I started the preclear testing on them for the same reason I moved parity drive 2 while the system was still in an unstable state - 'cause I was not thinking!  I agree, it was a bad decision.  Hindsight is a wonderful thing . . .

 

At the time, I was sure the drives truly were bad and just wanted to confirm it via some testing before I tried getting any type of warranty replacement - I did not figure any data could be pulled off of them, nor did I think they would even function anymore.

 

I guess the serial # is not that critical/confidential?  Would it be better if I re-attached the reports w/o the blanked serial numbers?  Any security concerns w that?

 

Remember, I just did the diag today - after the "bad" drives had been pulled and replaced by new drives (new drives that have not yet been "rebuilt" with the data due to the parity 2 issue).

 

So far the preclear on the "bad" drives (on my other unRAID system) has not reported any errors, but they are only in stage 2 (it takes a LLLOOONNNGGG time to preclear drives that big!).

 

Thanks for trying to help me figure out what is going on with this system.

16 hours ago, RobertP said:

You can tell which smart report is for what using the sdg and sdh.

Apparently you had already pulled them, so they no longer had any sd designation.

16 hours ago, RobertP said:

just did the diag today - after the "bad" drives have been pulled

Those SMART reports (that were not in the diagnostics) looked fine. There is probably no good reason to continue preclearing them. For future reference, you can run an extended SMART self-test on a disk before even removing it from the array.
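
For anyone who prefers the command line over the GUI, here is a minimal sketch of the `smartctl` calls involved in an extended self-test. It assumes smartmontools is installed; `/dev/sdg` is only an example device name, and the helper `smart_extended_test_commands` is a hypothetical name for illustration. The script only prints the commands, it does not run anything against a disk.

```python
import shlex


def smart_extended_test_commands(device):
    """Return smartctl invocations as argv lists; nothing is executed here."""
    return [
        # Start the extended (long) self-test; it runs in the background on the drive.
        ["smartctl", "-t", "long", device],
        # Show the self-test log to check progress and the final result.
        ["smartctl", "-l", "selftest", device],
        # Overall health verdict plus the vendor attribute table.
        ["smartctl", "-H", "-A", device],
    ]


# Example device name only; substitute the sdX letter of the disk in question.
for argv in smart_extended_test_commands("/dev/sdg"):
    print(shlex.join(argv))
```

On drives this large the extended test can take many hours, so the self-test log is the place to look for the outcome rather than waiting on the first command.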

 

Also note that bad connections are much more common than bad disks, and they are often the reason a disk gets kicked from the array: a failed write leaves the disk out of sync with parity.
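
One common hint that points at cabling rather than the disk itself is SMART attribute 199 (`UDMA_CRC_Error_Count`), which counts interface CRC errors. Below is a rough sketch of pulling that raw value out of `smartctl -A` text. The sample text is made up for illustration (it is not from the reports in this thread), and `udma_crc_errors` is a hypothetical helper name. A raw value that keeps climbing would be a reason to look at cables and connectors first.

```python
def udma_crc_errors(smartctl_attr_output):
    """Return the raw value of attribute 199, or None if it is not listed."""
    for line in smartctl_attr_output.splitlines():
        fields = line.split()
        # Attribute rows have 10 columns; the last one is RAW_VALUE.
        if len(fields) >= 10 and fields[0] == "199":
            return int(fields[9])
    return None


# Hypothetical sample in the usual `smartctl -A` table layout.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       4711
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       12
"""

print(udma_crc_errors(SAMPLE))
```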

 

16 hours ago, RobertP said:

serial # is not that critical/confidential?

No. It's really the only way to reliably identify a disk. The diagnostics you posted had SMART reports for all attached disks, and those included their serial numbers.

7 hours ago, JorgeB said:

Diags are showing parity2 invalid? Looks valid on the screenshot, currently you won't be able to rebuild the 2 disabled disks.

The screen shot was from right after the 2 data drives failed - before I started the rebuild.  Parity 2 failed as soon as I started the rebuild, so the screen shot is from before parity 2's death.

 


Here are the reports with the serial numbers intact.  As a reminder - the sequence of events:
* Started mover; disks 6 & 8 "failed" immediately.
* Took screen shot.
* Ran and downloaded SMART reports on the "bad" drives.
* Powered down, pulled out 6 & 8, replaced them with new spare drives (already pre-cleared).
* Powered up, started rebuild.
* Parity 2 died right at the start of the rebuild.
* Canceled rebuild.
* Ran and downloaded SMART report on parity 2.
* Ran diagnostics report a day(?) later (thus the two "bad" data drives were no longer in the system for the DIAG report).

SO - screen shot is from BEFORE parity 2 drive died.  If I were to take screen shot now, 6 & 8 would show "emulated" (triangle icon) and parity 2 would show disabled (red X).  I have NOT tried reading any data since these two issues occurred.  It does seem odd that the rebuild did not cancel itself when it was trying to rebuild 2 drives with only 1 parity drive . . . 

I have NOT had a chance to pull this beast off the shelf to check cables and SATA card yet.
I have tried putting parity 2 drive in various different slots - no change.  (Server has 4 drive "cages(?)" w 5 drives slots each - I tried slots in various different cages).  If I remember correctly, all SATA cables come off of one "daughter card" - but I did not build the unit and have not been inside it in a long time, so my recollection of the cabling is a bit hazy.  So it MIGHT be possible a couple may come directly off of motherboard???

 

I have thought about replacing the daughter card (SATA card) and cables in the past due to odd intermittent things.  Any suggestion on a good 20-drive daughter SATA card and cables?  System is Intel(R) Xeon(R) CPU E3-1230 v6 @ 3.50GHz motherboard in a full tower case.

Failed drives - 2022-09-22.png

ST16000NM001G-2KK103_ZL2P6T1L-20220922-1149.txt ST16000NM001G-2KK103_ZL28QVDG-20220922-1155.txt ST16000NM001G-2KK103_ZL28QVAG-20220926-1324.txt

  • 2 weeks later...

Sorry for the delay - got side tracked by a "little thing" called Hurricane Ian.  (We are OK, no major damage to our property.)
How can I re-enable parity2?  I don't see anything obvious.  I realize I could change it to NO DEVICE, reboot, and then assign parity2 back to the same drive, but I'm afraid the system will then zero out that drive and try to rebuild parity - and remember that the initial problem is I already have TWO data drives being emulated.  (Not sure how TWO can be emulated with only ONE active parity drive!?!?)  Can you give me a step-by-step on how to re-enable parity 2 (hopefully without losing any data)?

Thanks,

Bob

8 hours ago, RobertP said:

I realize I could change it to NO DEVICE, reboot, and then put parity2 back to same drive,

No, you cannot do that since there are two more invalid disks. Instead, you can force-enable parity2 to see if both disabled disks can still be emulated correctly; if they can, you can then rebuild them.
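
To illustrate why two disabled disks need both parity disks to be valid, here is a minimal sketch assuming plain byte-wise XOR parity for a single parity disk (the second parity disk uses a different calculation, which is not modeled here). The tiny two-byte "disks" and the helper names `xor_parity` and `rebuild_one` are hypothetical and only for illustration. With one parity, one missing disk can be recomputed, but two missing disks are ambiguous because different contents produce the same parity.

```python
from functools import reduce


def xor_parity(blocks):
    """Byte-wise XOR across equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_one(surviving_blocks, parity):
    """Recover a single missing block from the survivors plus XOR parity."""
    return xor_parity(surviving_blocks + [parity])


# Three tiny made-up "data disks" and their single XOR parity.
data = [b"\x01\x02", b"\x10\x20", b"\x0f\xf0"]
parity = xor_parity(data)

# One disk missing: the survivors plus parity give it back exactly.
recovered = rebuild_one([data[0], data[2]], parity)

# Two disks missing (1 and 2): a different pair of contents yields the
# same parity, so one parity alone cannot tell which pair is the real one.
alternative = [data[0], b"\x00\x00", xor_parity([data[1], data[2]])]
same_parity = xor_parity(alternative) == parity

print(recovered == data[1], same_parity)
```

This is why the procedure only makes sense if parity2 still holds valid data: the second parity supplies the extra equation needed to pin down the second missing disk.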

 

This will only work if parity is still valid:

-Tools -> New Config -> Retain current configuration: All -> Apply
-Check all assignments and assign any missing disk(s) if needed.
-IMPORTANT - Check both "parity is already valid" and "maintenance mode" and start the array (note that the GUI will still show that data on parity disk(s) will be overwritten, this is normal as it doesn't account for the checkbox, but it won't be as long as it's checked)
-Stop array
-Unassign disks 6 and 8
-Start array (in normal mode now); ideally the emulated disks will now mount and their contents will look correct. If they don't, post new diags
-If the emulated disks mount and contents look correct stop the array
-Re-assign disks 6 and 8 and start array to begin rebuilding.

 

 

  • RobertP changed the title to [SOLVED] Parity died? Suggestions on revive or rebuild

Thanks to all who tried to help.  Someplace along the line I must have missed a "parity is valid" check box, so I was unable to do the rebuild.  I was able to save a lot of the data first (copied it to my other server), but the two drives that "died" - their data was lost.  Luckily it is a server that I only use to house backups of the household PCs.  And I was able to get the server back up and running, and I am back to my normal backup routines.  It has not been running long enough to verify it is fully stable, but I did replace all the SATA cables, so far so good (knock on wood!)  All of the drives that "died" in this server - I was able to put them in my other server and do preclears successfully on all of them.  If this server starts flaking out again, I'll try replacing the SATA card.  If problems persist, I guess I'll try replacing the drive cages.
