I've been randomly getting errors/warnings here and there for a while now... how do I know if a disk is bad?


johmei


I had been randomly getting errors/warnings for a while and everything seemed to work fine. Most of what popped up in my notification box was just warnings (I can't recall what they said, because I stupidly cleared the notifications), and I never really thought about it until now. The problem is that I don't really know how to tell whether a disk is bad and needs to be replaced. I came across a post here that looked similar to my situation, and after that user shared their diagnostics it was confirmed that one of their disks was toast. I have two 4 TB parity drives and just bought two new 12 TB parity drives, but I want to make sure everything is healthy before I replace them, because ideally I'd like to use the old parity drives as extra storage for now. Anyway, following that post, I'm attaching my diagnostics file, since I'm now worried and need to make sure my data is safe.

 

I also back up all my data to CrashPlan, so I'm hoping that no matter what, I should be fine here.

 

A short and sweet summary of what to look out for would also be good for future reference. If there's a section in the wiki all about disk health (I imagine there must be), that would probably be best for me to check out too.

 

Thanks very much for the assistance!

johnsnas-diagnostics-20211010-2004.zip

57 minutes ago, JorgeB said:

It might help if you mention on what disk(s) you have been getting warnings.

Oh yeah, lol. Well, let's see. I'm in the log right now and it looks like there are quite a few read errors on disk1 from yesterday (10/10/2021). I don't see anything else, but I swear there were warnings for other disks too (I just don't remember which ones, and I cleared the notifications). I actually thought they were mostly warnings, but all I see are several errors for disk1. The Dashboard also shows a SMART error on my first parity drive. So I guess disk1 and Parity.


Luckily I started that last night for both Parity and Disk1. Parity says all is good: completed without error. But Disk1 says errors occurred and to check the SMART report, which I've included in this post. I'm going to read over it to see if I can make sense of it, but I know you'll likely spot the issues immediately.

 

Thanks for the help in all this again btw.

johnsnas-smart-20211011-0357.zip

2 hours ago, trurl said:
# 1  Extended offline    Completed: read failure       10%     49547         7716317008

replace

What's the significance of the 10%?

 

And... what would be the best approach to replacing this drive? I was planning to replace both my 4 TB parity drives with 12 TB parity drives, but I wanted to make sure everything was healthy first. I want to be careful because, even with my offsite backup and two parity drives, I just don't want to risk anything going wrong. Having said that, would it be safe to replace my parity drives first? Or is there something else I need to do because of this disk failure before I consider that? Thanks!

19 hours ago, JorgeB said:

10% is the test amount remaining.
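For reference, that line comes from the smartctl self-test log, whose columns are Num, Test_Description, Status, Remaining, LifeTime(hours) and LBA_of_first_error. The small Python sketch below is a hypothetical helper that splits one entry into those fields; it is written against this one line format only, and other smartctl versions or drive types may lay the log out differently. Read that way, the extended test appears to have stopped at its first unreadable sector (LBA 7716317008) with roughly 10% of the test still to run, at 49547 power-on hours.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Sketch: matches one row of the smartctl self-test log, e.g.
# "# 1  Extended offline    Completed: read failure       10%     49547         7716317008"
_ROW = re.compile(
    r"#\s*(?P<num>\d+)\s+"
    r"(?P<desc>.+?)\s{2,}"        # description ends at the first run of 2+ spaces
    r"(?P<status>.+?)\s+"         # status text, up to the percentage column
    r"(?P<remaining>\d+)%\s+"     # percent of the test still remaining
    r"(?P<hours>\d+)\s+"          # power-on hours when the test ran
    r"(?P<lba>\d+|-)\s*$"         # LBA of the first error, or "-" if none
)


@dataclass
class SelfTestEntry:
    num: int
    description: str
    status: str
    remaining_pct: int
    lifetime_hours: int
    first_error_lba: Optional[int]

    @property
    def stopped_early(self) -> bool:
        """True when the test ended on an error with part of it still unrun."""
        return self.first_error_lba is not None and self.remaining_pct > 0


def parse_self_test_line(line: str) -> SelfTestEntry:
    m = _ROW.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognised self-test log line: {line!r}")
    lba = m.group("lba")
    return SelfTestEntry(
        num=int(m.group("num")),
        description=m.group("desc"),
        status=m.group("status"),
        remaining_pct=int(m.group("remaining")),
        lifetime_hours=int(m.group("hours")),
        first_error_lba=None if lba == "-" else int(lba),
    )
```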

 

You should replace the failing disk first. If you want to upgrade parity and use one of the old parity disks as the replacement for disk1, you can do a parity swap.

 

10% of the test remaining... does that mean it could not complete the final 10% of the test for whatever reason?

 

Using one of the parity disks to replace disk 1 would be ideal, but how would I do this, and would there be any risk? Are you suggesting I replace one of the parity drives with a 12 TB drive (so I'd have 1 x 4 TB parity and 1 x 12 TB parity) and then use the old parity drive to swap out the bad disk1?

 

I assume both doing a parity swap and replacing a bad drive would be simple enough to find procedures for in the unraid wiki/forums/reddit, but is there anything else I should know?

 

Thanks again!

12 minutes ago, johmei said:

assume both doing a parity swap and replacing a bad drive would be simple enough to find procedures for in the unraid wiki/forums/reddit, but is there anything else I should know?

Just use the ‘Manual’ link at the bottom of the Unraid GUI to get to the online documentation that covers this under the Storage Management section.

53 minutes ago, itimpi said:

Just use the ‘Manual’ link at the bottom of the Unraid GUI to get to the online documentation that covers this under the Storage Management section.

Roger that, thanks.

 

If anybody has anything else to add or anything else that might be important to note, please let me know!  Otherwise, I'll check back if I run into any further issues.  Thanks!


Phew... okay. I decided to make sure all my drives were good before proceeding with the parity swap, so I ran an extended SMART test on all of them, and it looks like disk 1 and disk 2 are bad (extended SMART log attached below). I'm already preclearing one of my 12 TB drives for the parity swap, but this is scary territory right now. Any suggestions on how to proceed? Should I switch out both parity drives one at a time and then rebuild the bad drives with the old parity drives? Switch one parity drive out, then rebuild one of the drives, switch the other one out and then rebuild the other?

 

Thanks,

 

johnsnas-smart-20211013-19502.zip

22 minutes ago, johmei said:

Should I switch out both parity drives one at a time and then rebuild the bad drives with the old parity drives? 

That's not possible: once you have swapped out the old parity drives, they are no longer available to support a rebuild :( 

 

23 minutes ago, johmei said:

Switch one parity drive out, then rebuild one of the drives, switch the other one out and then rebuild the other?

 

This is possible, especially if you swap out the drive that failed its SMART test first. However, you will not be protected against any other drive failing until the first parity rebuild completes.

 

28 minutes ago, johmei said:

Should I switch out both parity drives one at a time and then rebuild the bad drives with the old parity drives? 

I haven't tried doing simultaneous Parity Swap procedures, but as long as Unraid lets you do that, it would be the fastest.

 

Whichever route you take, keep the old data disks intact until you are fully protected again. If the Parity Swap goes wrong in any way, there is a good chance that most of the data on these drives would be recoverable.

 

4 hours ago, johmei said:

looks like disk 1 and disk 2 are bad

Apparently you have been ignoring SMART warnings on the Dashboard page, or at least never noticed them.

 

You must set up Notifications to alert you immediately by email or another agent as soon as a problem is detected. Don't let one problem become multiple problems and data loss.
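Alongside notifications, it can help to know which raw SMART counters usually deserve attention. The Python sketch below assumes the JSON layout produced by smartctl's --json option (smartmontools 7.0 or later), where attributes live under ata_smart_attributes.table and overall health under smart_status.passed. It flags a hypothetical watch list of attributes whose raw count is above zero; that list is an illustrative assumption, not Unraid's own configuration, and raw values can be vendor-specific on some drives.

```python
import json
from typing import Any, Dict, List

# Illustrative watch list (an assumption, not Unraid's configuration):
# ATA attribute ID -> usual attribute name.
WATCHED = {
    5: "Reallocated_Sector_Ct",     # sectors already remapped to the spare area
    197: "Current_Pending_Sector",  # unstable sectors waiting to be remapped
    198: "Offline_Uncorrectable",   # sectors found unreadable during offline scans
    199: "UDMA_CRC_Error_Count",    # link errors, usually a cable/connection issue
}


def flag_attributes(report: Dict[str, Any]) -> List[str]:
    """Return warnings from parsed `smartctl --json -a` output (assumed layout)."""
    warnings = []
    table = report.get("ata_smart_attributes", {}).get("table", [])
    for attr in table:
        attr_id = attr.get("id")
        raw = attr.get("raw", {}).get("value", 0)
        if attr_id in WATCHED and raw > 0:
            warnings.append(f"{attr_id} {WATCHED[attr_id]}: raw={raw}")
    # smart_status.passed is false when the drive reports an overall failure.
    if report.get("smart_status", {}).get("passed") is False:
        warnings.append("overall SMART health: FAILED")
    return warnings


def flag_from_json(text: str) -> List[str]:
    """Convenience wrapper for the raw JSON text printed by smartctl."""
    return flag_attributes(json.loads(text))
```

To feed it real data you could capture the output of smartctl --json -a /dev/sdX (the device name is a placeholder) with subprocess.run(..., capture_output=True, text=True) and pass its stdout to flag_from_json; smartctl encodes status bits in its exit code, so a nonzero exit does not by itself mean the output is unusable. A nonzero UDMA_CRC_Error_Count usually points at a cable or connection rather than the platters.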

4 hours ago, trurl said:

Apparently you have been ignoring SMART warnings on the Dashboard page, or at least never noticed them.

 

You must set up Notifications to alert you immediately by email or another agent as soon as a problem is detected. Don't let one problem become multiple problems and data loss.

 

Something I should have realized (when building a NAS of any kind) is that a failing drive will not necessarily be catastrophic, and it will not necessarily be all red lights flashing in your face. Because everything SEEMED to be working fine, I didn't know why the errors kept popping up. After getting them for a while, I realized I should really look into them. Honestly, I don't know how much longer I would have waited if I hadn't decided to buy two new parity drives when I did.

 

Now I know, and the mistake won't happen again (it's also why I have a full online backup). I must say, though, I do wish those notifications had been more "in your face". A message stating "This could be potentially very serious and may cause data loss if no action is taken" would catch my eye a lot more than random warnings and the occasional error. Still, lesson learned: I'll definitely be setting up email alerts or something similar once this is taken care of.

8 hours ago, JorgeB said:

This is what I would do.

Does the manual cover a simultaneous Parity Swap? Is there anything specific about that procedure I should know? I assume it would at least mean putting both 12 TB drives in at the same time, so I should preclear both 12 TB drives first, right? Anything else I should know, or any other tips/suggestions that might help?

 

Thanks!

1 hour ago, trurl said:

You should also install the Fix Common Problems plugin. Not only can it help you fix common problems, but it will tell you about problems you didn't know about.

Awesome, thank you so much for the suggestion! I'll definitely get that installed.

 

Something else I wanted to ask: would it be safest if I went out and got two 4 TB drives to rebuild the bad drives before messing with the parity drives? Or is the risk the same as doing a parity swap? And the risk, as I understand it, is that another drive will die while one drive is being rebuilt... not sure if there are other risks as well?

 

Thanks!

1 hour ago, johmei said:

the risk, as I understand it, is that another drive will die while one drive is being rebuilt

A disk actually dying is rare compared to other problems such as bad connections. People will often disturb connections any time they are messing around in the case. In your situation you do have multiple disks with known problems, but since these are both going to be eliminated during the double rebuild (or the double parity swap) that shouldn't be an issue.

 

You should either rebuild both at the same time to new disks, or do both parity swaps at the same time. You don't want either to be involved in rebuilding the other.

 

Parity swap doesn't seem any more risky to me than just a rebuild. The parity copy part of the procedure happens with the array offline, so there are no changes to parity during the copy. And that copy only concerns the new disk and parity. While copying parity you still have the original parity and could just New Config it back into the array and be back where you were. And the other part of the procedure is just a rebuild.
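To see why the rebuild step reads every remaining disk, and why keeping the two known-bad disks out of each other's rebuild matters, here is a minimal Python sketch of single (XOR) parity, which is how Unraid's first parity disk works. It is a toy model on short byte strings, not Unraid's code; the second parity disk uses a different calculation that is not shown here.

```python
from functools import reduce
from typing import List


def xor_blocks(blocks: List[bytes]) -> bytes:
    """Byte-wise XOR of equal-length blocks (toy model of one stripe)."""
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("need at least one block, all the same length")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def compute_parity(data_disks: List[bytes]) -> bytes:
    # Single parity is the XOR of every data disk at each position.
    return xor_blocks(data_disks)


def rebuild_missing(parity: bytes, surviving: List[bytes]) -> bytes:
    # The missing disk is parity XOR every surviving data disk, so each
    # surviving disk must be read cleanly for the rebuilt data to be right.
    return xor_blocks([parity, *surviving])
```

Because every surviving disk feeds into the result, a surviving disk that cannot be read cleanly leaves that part of the missing disk unrecoverable with single parity, and once the old parity disk has been replaced, its contents are no longer there to rebuild from.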

 

Ultimately, I always trust the advice given by this person:

11 hours ago, JorgeB said:

This is what I would do.

 

 

26 minutes ago, trurl said:

Parity swap doesn't seem any more risky to me than just a rebuild. The parity copy part of the procedure happens with the array offline, so there are no changes to parity during the copy. And that copy only concerns the new disk and parity. While copying parity you still have the original parity and could just New Config it back into the array and be back where you were. And the other part of the procedure is just a rebuild.

I haven't finished reading through the manual yet (going through it right now), but I have a quick question: does Unraid identify disks, and which array slot each one belongs to, by serial number or something else unique to the drive? Or does it identify them another way? Basically, I'm wondering whether it matters which controller or port a given drive is plugged into. I'm assuming it doesn't, and that I can do the Parity Swap with the 12 TB drives on the onboard SATA controller and then move them to my LSI controller once the smoke clears and my data is safe. But I want to make sure.

24 minutes ago, trurl said:

assuming the controller is giving that information correctly. That is one of the reasons RAID controllers and USB connections aren't recommended.

 

Looks like your hardware is good for that.
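A toy model of the idea: if array slots are tracked by the serial number the controller reports, then moving a drive from the onboard SATA ports to the LSI card changes its kernel device name (for example /dev/sdb) but not its slot, while a controller that hides or replaces the serial (as RAID controllers and some USB bridges can) makes the assigned slot look missing. The Python below is purely illustrative; the serials and function names are hypothetical, and this is not how Unraid actually stores its configuration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class DetectedDrive:
    serial: str  # identifier reported by the controller (hypothetical values)
    device: str  # kernel device name, e.g. /dev/sdb; can change with the port


def match_slots(
    assignments: Dict[str, str], detected: List[DetectedDrive]
) -> Tuple[Dict[str, str], List[str]]:
    """Map each assigned slot to its current device by serial.

    Returns (slot -> device for drives found, slots whose serial is missing).
    """
    by_serial = {d.serial: d.device for d in detected}
    found: Dict[str, str] = {}
    missing: List[str] = []
    for serial, slot in assignments.items():
        if serial in by_serial:
            found[slot] = by_serial[serial]
        else:
            missing.append(slot)
    return found, missing
```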

Ahh, I actually saw somebody mention that about USB drives in a post somewhere. That makes sense. The LSI SAS2008 controller is the only one I've used so far, and the Z97 SATA controller is in AHCI mode, so I'm assuming everything should be fine :) As soon as these preclears finish, I'll reboot and make absolutely sure the Intel SATA controller doesn't have any funky settings, just to be safe. I'll need it properly configured anyway, because my LSI controller only supports 8 drives and I want to add at least one more.

12 hours ago, JorgeB said:

This is what I would do.

 

I have some questions specific to a simultaneous Parity Swap procedure. The manual only outlines how to do this with a single parity drive and a single data drive, but can I assume the procedure is the same, just literally done twice at the same time? To be more specific, would the procedure be as follows:

 

  1. Stop the array
  2. Unassign both disk 1 and disk 2 (the failed data drives)
  3. Start the array and check the "Yes I want to do this" checkbox.
  4. Stop the array again
  5. (I currently have both 12 TB drives connected to the Z97 SATA controller doing a preclear, so I don't believe I need to actually disconnect any drives)
  6. Unassign BOTH Parity and Parity 2
  7. Assign the new 12 TB drives (after the preclear finishes) to Parity and Parity 2
  8. Assign the old Parity and Parity 2 drives to the disk 1 and disk 2 slots (I assume it doesn't matter which goes where)
  9. Go to the Main > Array Operation section, where there should be a Copy button
  10. Check the "Yes I want to do this" checkbox, click the "Copy" button, and wait a long, long time until the copy completes.
  11. The Start button will now be present, with a description indicating that it is ready to start a Data-Rebuild. I then check the "Yes I want to do this" checkbox and click "Start"
  12. Then I just wait, and minimize use of the array so it finishes faster and with fewer complications.
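One thing worth sanity-checking before starting is drive sizes: each parity disk has to be at least as large as the largest data disk, and each replacement data disk has to be at least as large as the disk it replaces. The Python sketch below checks a proposed double parity swap against those two rules. It is a hypothetical helper using sizes in TB, and it does not confirm that Unraid accepts two simultaneous swaps, which nobody in the thread had tried.

```python
from typing import List, Tuple


def check_dual_parity_swap(
    new_parity_tb: List[float],
    replacements: List[Tuple[float, float]],  # (old parity size, failed data disk size)
    other_data_tb: List[float],  # data disks that stay in the array unchanged
) -> List[str]:
    """Hypothetical size check for a parity swap plan; returns any problems found."""
    problems: List[str] = []
    # Each old parity disk must be big enough to take over its failed disk's slot.
    for old_parity, failed in replacements:
        if old_parity < failed:
            problems.append(
                f"{old_parity} TB replacement is smaller than the {failed} TB disk it replaces"
            )
    # After the swap, the old parity disks count as data disks too.
    data_after = other_data_tb + [old for old, _ in replacements]
    largest = max(data_after, default=0)
    for size in new_parity_tb:
        if size < largest:
            problems.append(
                f"{size} TB parity is smaller than the largest data disk ({largest} TB)"
            )
    return problems
```

For the plan above with made-up sizes for the other data disks, check_dual_parity_swap([12, 12], [(4, 4), (4, 3)], [4, 3, 2]) would return an empty list.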

 

Is that more or less right? I quoted JorgeB per their suggestion and trurl's vote of confidence in what they say, but really, if anybody can reassure me that I have the correct procedure, feel free to let me know :)

 

Thanks so much everybody!

