parity errors, possibly the same ones recurring from test to test



all drives show good smart reports, but even if they didn't, from a usability standpoint, if unRAID feels there are errors, it should make it easy to see exactly what those errors are, and perhaps offer to help/fix them.

 

Asking a "normal" user to run smart reports on all their drives, one at a time, then post those results to the forum and hope for someone to find the errors is NOT a good way to handle this situation.

 

unRAID "knows" what these 5 errors are, why should I have to go individually scanning a dozen different drives to try to figure it out?

 

Attached is a pic showing the SMART reporting for all my drives.  About 1/2 don't yet have extended tests, but I started an extended test for every one that still needs it.

SMART.png.22de4a32fec5ca7b4e4230cca7f849fe.png

Link to comment

The problem is that unRAID does NOT know which drive caused the errors - just that the data drives do not correspond to the parity drive.  That is a limitation of the simple XOR parity scheme currently being used.

 

When (if) unRAID moves to supporting dual parity drives, then I expect that the scheme chosen WILL allow identification of which drive caused an error.    However I would not hold your breath for that becoming available.
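To make the single-parity limitation above concrete, here is a minimal Python sketch. It is a toy model, not unRAID's actual driver code: the three "disks" and their 4-byte contents are made up for illustration. It shows that XOR parity can tell that something is wrong, but the mismatch looks identical no matter which disk was damaged.

```python
from functools import reduce

def xor_parity(blocks):
    """Return the bytewise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def count_mismatches(data_blocks, parity):
    """Count byte positions where the stored parity disagrees with the data."""
    return sum(1 for got, want in zip(xor_parity(data_blocks), parity) if got != want)

def flip_bit(disks, index):
    """Return a copy of disks with bit 0 of byte 2 flipped on one disk."""
    damaged = list(disks)
    block = bytearray(damaged[index])
    block[2] ^= 0x01
    damaged[index] = bytes(block)
    return damaged

# Three hypothetical data disks, one 4-byte stripe each (made-up values).
disks = [
    bytes([0x11, 0x22, 0x33, 0x44]),
    bytes([0xA0, 0xB0, 0xC0, 0xD0]),
    bytes([0x0F, 0x0E, 0x0D, 0x0C]),
]
parity = xor_parity(disks)

# Damage each disk in turn and record what the parity mismatch looks like.
mismatch_patterns = {
    bytes(x ^ y for x, y in zip(xor_parity(flip_bit(disks, i)), parity))
    for i in range(len(disks))
}

print(count_mismatches(disks, parity))               # healthy array
print(count_mismatches(flip_bit(disks, 1), parity))  # one damaged byte
print(len(mismatch_patterns))                        # distinct mismatch patterns
```

Every damaged disk produces the same mismatch pattern, so the check can only count errors, not attribute them to a drive. A damaged parity disk would look the same as well.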

Link to comment

all drives show good smart reports, but even if they didn't, from a usability standpoint, if unRAID feels there are errors, it should make it easy to see exactly what those errors are, and perhaps offer to help/fix them.

 

Asking a "normal" user to run smart reports on all their drives, one at a time, then post those results to the forum and hope for someone to find the errors is NOT a good way to handle this situation.

 

unRAID "knows" what these 5 errors are, why should I have to go individually scanning a dozen different drives to try to figure it out?

 

Attached is a pic showing the SMART reporting for all my drives.  About 1/2 don't yet have extended tests, but I started an extended test for every one that still needs it.

 

The Smart Report that they are asking for is this one.  You get to it by double-clicking on 'Disk 1' (or 'Disk X' for the Xth disk) on the Main tab.  Then click on the 'HEALTH' tab and the 'Disk attributes' in the box.

 

 

Added info in Edit:  Most of the time, attributes # 5, 196, 197, 198, and 199 are the ones that you should be concerned about.  If any of them are non-zero, that is an indication of a problem.
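If you would rather not eyeball every report, the check described above can be sketched in Python. This assumes the report text is in the usual `smartctl -A` attribute table layout (ID# in the first column, RAW_VALUE in the tenth). The sample report below is made up for illustration and is not taken from any drive in this thread.

```python
# Attribute IDs named in the post above.
WATCHED_IDS = {5, 196, 197, 198, 199}

# Made-up sample in the smartctl -A table layout (illustration only).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       3
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       12
"""

def nonzero_watched(report):
    """Return {id: (name, raw)} for watched attributes with a non-zero raw value."""
    flagged = {}
    for line in report.splitlines():
        fields = line.split()
        # Skip the header and anything that is not an attribute row.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id = int(fields[0])
        raw = fields[9]
        if attr_id in WATCHED_IDS and raw.isdigit() and int(raw) != 0:
            flagged[attr_id] = (fields[1], int(raw))
    return flagged

print(nonzero_watched(SAMPLE))
```

Note that attribute 187 is ignored here only because it is not in the watched list above; add it to `WATCHED_IDS` if you want it flagged too.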

SmartReport.JPG.324f377ad67f21000009f7362d496260.JPG

Link to comment

all drives show good smart reports, but even if they didn't, from a usability standpoint, if unRAID feels there are errors, it should make it easy to see exactly what those errors are, and perhaps offer to help/fix them.

You're expecting too much at this point in time. Sure, it would be nice, but it's not feasible right now except for SMART errors visible to unRAID6.  Even then, unRAID can't offer to fix them.

 

Asking a "normal" user to run smart reports on all their drives, one at a time, then post those results to the forum and hope for someone to find the errors is NOT a good way to handle this situation.

 

That's the nature of the beast right now.

 

unRAID "knows" what these 5 errors are, why should I have to go individually scanning a dozen different drives to try to figure it out?

Currently the driver doesn't know what the 5 errors are, only that there were 5 errors.

 

ALL the drive attributes need to be reviewed.

There could be drive errors, an interface issue, or possibly a memory issue.

 

Pending sectors are a key attribute to review.

 

If you have md5sums of the files, they can be used to check for bitrot, read errors, or other corruption.
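For anyone without md5sums yet, here is a minimal Python sketch of one way to build them using the standard `hashlib` module. The function names and the idea of keeping a path-to-hash "manifest" are my own illustration, not an unRAID feature. Build one manifest now, build another later, and compare.

```python
import hashlib
import os

def md5_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large media files do not fill memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest(root):
    """Map each file's path (relative to root) to its MD5 hex digest."""
    manifest = {}
    for folder, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(folder, name)
            manifest[os.path.relpath(full, root)] = md5_of_file(full)
    return manifest

def changed_files(old, new):
    """List files present in both manifests whose hashes differ."""
    return sorted(path for path in old if path in new and old[path] != new[path])
```

A file that shows up in `changed_files` without having been edited on purpose is a candidate for corruption. This only helps going forward: a manifest built today cannot say whether the data was already damaged before it was hashed.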

 

After the parity check, review the syslog to see if there were any ATA errors. That would be a telltale sign and point to a specific drive.
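A rough way to do that syslog review is sketched below in Python. It assumes the kernel's usual libata message style (`ataN.NN: exception ...`, `ataN.NN: failed command: ...`). The sample log lines, including the host name, are made up, and the regular expression is a starting point rather than a complete list of ATA error messages.

```python
import re
from collections import Counter

# Matches lines such as "ata5.00: exception Emask ..." (assumed message style).
ATA_ERROR = re.compile(
    r"\b(ata\d+)(?:\.\d+)?: .*(?:exception|error|failed command)",
    re.IGNORECASE,
)

# Made-up sample lines for illustration only.
SAMPLE_LOG = """\
Jan  1 03:12:01 Tower kernel: ata5.00: exception Emask 0x10 SAct 0x0 SErr 0x400100 action 0x6 frozen
Jan  1 03:12:01 Tower kernel: ata5.00: failed command: READ DMA EXT
Jan  1 03:12:02 Tower kernel: ata5: hard resetting link
Jan  1 03:15:44 Tower crond[1200]: exit status 0
"""

def ata_error_counts(log_text):
    """Count error-looking lines per ATA port (ata1, ata2, ...)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = ATA_ERROR.search(line)
        if match:
            counts[match.group(1).lower()] += 1
    return counts

print(ata_error_counts(SAMPLE_LOG))
```

The result is keyed by ATA port, not by disk name, so you still need to work out which drive sits on the port that keeps showing up (the boot-time lines in the same syslog normally list the drive model detected on each port).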

Link to comment

Asking a "normal" user to run smart reports on all their drives, one at a time, then post those results to the forum and hope for someone to find the errors is NOT a good way to handle this situation.

 

unRAID "knows" what these 5 errors are, why should I have to go individually scanning a dozen different drives to try to figure it out?

As was mentioned, unRAID does not know where the errors are.  At least with v6 it is easy to get the SMART reports via the standard GUI.  If you have notifications turned on, you can also be told about changes in key SMART attributes.
Link to comment

Okay, so I'm mistaken about unRAID knowing where the errors are; it happens :(

 

Meaning I have to find the problem(s) myself, with the smart reports (I've attached them all to this post).

 

I only saw a few errors while compiling the screenshots, most of which are on disk 5, which I mentioned in my first post, and linked to the thread discussing those errors.  They have not grown or changed for a long time (as far as I can tell), so I'm still not sure why running a correcting parity check would result in exactly 5 errors coming back every time I run a parity check.  I assume that they should be 'corrected' in the parity by this process, but it seems that's just not happening.

 

So, I still don't know what I can/should do to resolve this, nor do I know if my data is "okay" or not.  I have no MD5 information for anything, nor do I have a good grasp of how to generate such information.

SMART_2-a.png.fceb863b6f6a74f0b0886d0518074ebf.png

Link to comment

Meaning I have to find the problem(s) myself, with the smart reports (I've attached them all to this post).

Seem to be missing!

 

I only saw a few errors while compiling the screenshots, most of which are on disk 5, which I mentioned in my first post, and linked to the thread discussing those errors.  They have not grown or changed for a long time (as far as I can tell), so I'm still not sure why running a correcting parity check would result in exactly 5 errors coming back every time I run a parity check.  I assume that they should be 'corrected' in the parity by this process, but it seems that's just not happening.

That tends to mean that either a data disk is being read unreliably and does not return the same data each time, or there is a write issue on the parity disk so what is read back is not what was written.  Hopefully the reports will give a clue.
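The first case above (a data disk that does not return the same data each time) can be illustrated with a toy Python model. Everything here is an assumption made for illustration: the `FlakyDisk` class, the alternating read behaviour, and the five flaky positions are invented, and real drives do not misbehave this regularly. The point is only that a correcting check rewrites parity to match whatever was read on that pass, so an unstable read makes the "same" errors come back.

```python
class FlakyDisk:
    """Toy disk that returns flipped bits at some positions on every other read."""

    def __init__(self, data, flaky_positions=()):
        self.data = bytearray(data)
        self.flaky = set(flaky_positions)
        self.reads = 0

    def read(self):
        self.reads += 1
        out = bytearray(self.data)
        if self.reads % 2 == 0:  # every second read comes back wrong
            for pos in self.flaky:
                out[pos] ^= 0x01
        return bytes(out)

def correcting_check(disks, parity):
    """Recompute parity from what was read, fix mismatches, return the error count."""
    reads = [disk.read() for disk in disks]
    errors = 0
    for i in range(len(parity)):
        expected = 0
        for block in reads:
            expected ^= block[i]
        if parity[i] != expected:
            parity[i] = expected  # "correct" parity to match this pass's reads
            errors += 1
    return errors

good = FlakyDisk(b"\x10" * 8)
bad = FlakyDisk(b"\x20" * 8, flaky_positions=[0, 2, 4, 5, 7])
parity = bytearray(b"\x30" * 8)  # correct parity for the true data

results = [correcting_check([good, bad], parity) for _ in range(4)]
print(results)
```

In this model the first pass is clean, and every later pass reports 5 errors even though each pass "corrected" them, because parity keeps being rewritten to match data that reads back differently the next time.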

Link to comment

Disk 2, you have ID#199 CRC Errors.  This indicates a problem transferring data from the disk (where it was read correctly) to the motherboard.  This generally indicates cabling issues.  It could be a bad cable, a loose connection, or cross-talk between cables (caused by tying cables together to make things 'neat').  A remote possibility is a faulty SATA controller.

 

 

Disk 5, has ID# 5 reallocated sectors.  While the number is high, it is not an indication of a problem unless the number keeps increasing.

 

A number of your disks report ID# 187 Reported uncorrect errors.  I am not sure what the significance of this condition is...  (None of my disks even report this parameter, and a quick Google search found nothing to answer this.)

 

EDIT:  Look here for information on SMART attributes:

 

        https://en.wikipedia.org/wiki/S.M.A.R.T.#Known_ATA_S.M.A.R.T._attributes

 

 

Link to comment

Justin => I have to wonder if somehow the correcting/non-correcting status bit isn't set correctly.

 

Try "toggling" it.

 

i.e. Uncheck the box; start a parity check;  then Stop the check.    Now Check the box to correct the errors and then run another parity check.

 

At the end of the check, it should show that it corrected 5 errors (assuming this remains consistent).  If so, then run it again and see if it's finally stable.

 

Link to comment

Thanks for the analysis.

 

So, check/replace cable from Disk2 as step 1.  I've not tied, or otherwise bundled any of the cables, but there are many cables, so it could just be loose from moving everything else around.

 

It seems that the only other 'issue' is the "ID# 187 Reported uncorrect errors" on disks 5, 8 and the cache disk.  All 3 disks are different manufacturers, sizes and ages, so nothing really in common with them, other than this error; strange.

 

When I review/fix the cable for disk2, I'll just double check those disks also, and confirm if they are all connecting to the same SATA controller, or anything else they may have in common.

 

I'm still not too sure why the correcting parity check doesn't 'fix' the parity disk to reflect the array disks, even if one or more of the array disks have issues.  It seems like parity should adjust to match the array disks, but as you can see, I'm nowhere near an expert in any of this :)

Link to comment

Justin => I have to wonder if somehow the correcting/non-correcting status bit isn't set correctly.

 

Try "toggling" it.

 

i.e. Uncheck the box; start a parity check;  then Stop the check.    Now Check the box to correct the errors and then run another parity check.

 

At the end of the check, it should show that it corrected 5 errors (assuming this remains consistent).  If so, then run it again and see if it's finally stable.

 

Good idea.  I'll do that after I get a chance to check the disk cables, to hopefully eliminate that as a potential issue also.

 

I'm going to wait until all the extended SMART tests finish, so it'll be a couple of hours before I can do anything else.

Link to comment

Disk 5, has ID# 5 reallocated sectors.  While the number is high, it is not an indication of a problem unless the number keeps increasing.

This might account for the '5 errors that will not go away' as Pending sectors can return a different value each time they are read (which is one reason they tend to impact any recovery from failure).

Link to comment

Disk 5, has ID# 5 reallocated sectors.  While the number is high, it is not an indication of a problem unless the number keeps increasing.

This might account for the '5 errors that will not go away' as Pending sectors can return a different value each time they are read (which is one reason they tend to impact any recovery from failure).

 

It is my understanding that Reallocated Sectors are 'bad' sectors that have already been retired from service and replaced by 'good' sectors from a pool that the drive manufacturer set aside for this purpose.  Since none of the other parameters that are usually watched show any current failures, the drive may be OK.  As I understand it, what we don't want to see is the number increasing.

 

Oh, I went back through the disks again (I have problems reading the light gray text on black background) and noticed that Disks 4 and 6 also have CRC errors!  I would be checking to see if a SATA card was common to all of these disks!

Link to comment

Disk 5, has ID# 5 reallocated sectors.  While the number is high, it is not an indication of a problem unless the number keeps increasing.

This might account for the '5 errors that will not go away' as Pending sectors can return a different value each time they are read (which is one reason they tend to impact any recovery from failure).

 

No -- a reallocated sector is simply a sector that has been re-mapped to a spare sector.  It's NOT a bad sector.  This is a normal function for modern disks, which have a number of spare sectors that can be assigned to replace sectors that fail.    A more significant issue is "pending sectors" -- which are sectors that have exhibited issues but have not yet been reallocated.

 

 

Link to comment

Disk 5, has ID# 5 reallocated sectors.  While the number is high, it is not an indication of a problem unless the number keeps increasing.

This might account for the '5 errors that will not go away' as Pending sectors can return a different value each time they are read (which is one reason they tend to impact any recovery from failure).

 

No -- a reallocated sector is simply a sector that has been re-mapped to a spare sector.  It's NOT a bad sector.  This is a normal function for modern disks, which have a number of spare sectors that can be assigned to replace sectors that fail.    A more significant issue is "pending sectors" -- which are sectors that have exhibited issues but have not yet been reallocated.

You are right - I saw the figure '5' without noticing that it was Reallocated Sectors rather than Pending Sectors.  Hopefully this was a bit more obvious from the fact that the text I posted mentioned Pending Sectors.
Link to comment
