Errors during a parity rebuild?

May 23, 2008

I'm replacing my 750gig parity drive with a 1TB drive and rebuilding parity... but one of my data drives is reporting 128 errors (still rebuilding... so there could be more). Are errors during a parity rebuild ok? I've read of errors during a parity check happening because some data could be out of sync... but if my array is rebuilding parity and a drive is getting errors then those must be read errors of some sort?

I plan on adding my system log at the end of its rebuild (2 hours from now). I'm really just curious right now if errors that happen during a rebuild are acceptable?


May 23, 2008

I'm really just curious right now if errors that happen during a rebuild are acceptable?

That's an easy question: NOPE !!!

Actually, drive errors aren't acceptable at any time. Something is wrong.


May 23, 2008

Author


Not the answer I was looking for

Syslog is attached. I'm seeing these errors:

May 22 21:00:10 NAS kernel: md2: read error!
May 22 21:00:10 NAS kernel: handle_stripe read error: 917132248/2, count: 1

Looks like they're all in the same block too... 128 errors is what... a sector? Not sure on what I'm talking about... but it seems like some area of the disk may be bad.
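One way to check whether the errors really sit in one area is to tally the `handle_stripe read error` lines from the syslog. Below is a minimal Python sketch. It assumes (this thread does not confirm it) that the number before the slash is a sector address and the number after it is the disk number; the function names are hypothetical, and the only log line used is the one quoted above.

```python
import re
from collections import Counter

# Matches kernel lines such as
#   "... kernel: handle_stripe read error: 917132248/2, count: 1"
# ASSUMPTION: the field before "/" is a sector and the field after it a disk number.
PATTERN = re.compile(r"handle_stripe read error: (\d+)/(\d+), count: (\d+)")

def tally_read_errors(lines):
    """Sum the reported error counts per (disk, sector)."""
    tally = Counter()
    for line in lines:
        m = PATTERN.search(line)
        if m:
            sector, disk, count = map(int, m.groups())
            tally[(disk, sector)] += count
    return tally

def sector_span(tally, disk):
    """(lowest, highest) sector with errors on one disk, or None if none."""
    sectors = [s for (d, s) in tally if d == disk]
    return (min(sectors), max(sectors)) if sectors else None

log = ["May 22 21:00:10 NAS kernel: handle_stripe read error: 917132248/2, count: 1"]
t = tally_read_errors(log)
print(t, sector_span(t, 2))
```

To run it against a saved syslog, pass the open file object (`tally_read_errors(open(path))` iterates its lines). A narrow span across many entries would support the "all in the same block" reading.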

From what I've learned on this forum I'll run reiserfsck and see what that outputs. Is there something else I should check? Should I check parity once it's rebuilt? I'm guessing no... I should do more checks just on this drive.

Parity rebuild is still going on but the drive in question is done being read from.


May 23, 2008


The errors you are seeing in the syslog are errors in reading from the physical disk. They are not the result of a corrupt file-system, but usually of some section of a disk platter that is unreadable. When a read error occurs, unRAID will reconstruct what it was trying to read from the disk by reading from parity and the other data disks. It then will attempt to write the sector back to the disk that had failed.

When a read failure occurs on a disk with "smart" features it is marked as a potential sector for re-assignment. That re-assignment will occur on the next "write" to the defective sector. Fortunately, with unRAID, that occurs when it attempts to write back the data on the sector that it failed to read.

As long as a drive has sufficient spare sectors to use in this re-assignment, unRAID, in combination with the "smart" disk, actually repairs bad sectors by reassignment of them to good sectors.
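The reconstruction step relies on parity being the XOR of all the data disks, so XOR-ing parity with every readable disk leaves exactly the unreadable disk's contents. A toy Python sketch, assuming single XOR parity; the byte values are arbitrary and the helper names are hypothetical (the real driver works on disk sectors, not Python byte strings).

```python
from functools import reduce
from operator import xor

def xor_bytes(blocks):
    """Byte-wise XOR of equal-length byte strings."""
    return bytes(reduce(xor, column) for column in zip(*blocks))

def reconstruct(parity, surviving):
    """Rebuild the unreadable disk's bytes from parity plus the other data disks."""
    return xor_bytes([parity, *surviving])

# Toy example: three data "disks" of two bytes each (arbitrary values).
d0, d1, d2 = b"\x01\x02", b"\x10\x20", b"\xff\x00"
parity = xor_bytes([d0, d1, d2])

# Pretend d1 returned a read error: recover it from parity, d0 and d2.
recovered = reconstruct(parity, [d0, d2])
print(recovered == d1)  # True
```

Writing `recovered` back to the failing disk is the step that gives the drive its chance to remap the sector.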

The errors you see on the main unRAID management page are these "read" errors. If a drive fails when writing to it, it is immediately taken off-line and you will see a red indicator on the management page. You can still read and write it, but when reading you are getting data re-constructed from parity in combination with all the other data drives. When writing, you are modifying the parity drive as if the actual disk was being written to, keeping it correct and up to date.

Since you did not see the drive off-line, odds are your disk has already re-allocated the bad sectors. You probably will not find any file-system corruption. You can certainly use the reiserfsck program to be sure.

You should download the smartctl program and run it on your drives. It will give you data on what is happening in your drives. It is described here in the wiki: http://lime-technology.com/wiki/index.php?title=Troubleshooting#Hard_drive_failures By running it now, and then again in a week or so, you can track whether the drive is getting worse. When you start to see huge numbers of sectors re-assigned, it is time to replace the drive.
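The "run it now, then again in a week" comparison can be sketched in Python. This assumes the raw values have already been copied out of two smartctl reports into dictionaries; the attribute names are ones smartctl prints, the choice of which to watch is an assumption, and the numbers below are made up for illustration.

```python
# Attribute names as printed by `smartctl -A`; chosen here because they
# count sector problems (an assumption about what is worth watching).
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")

def worsened(before, after, watched=WATCHED):
    """Return {attribute: (old, new)} for each watched raw value that increased."""
    changes = {}
    for name in watched:
        old, new = before.get(name, 0), after.get(name, 0)
        if new > old:
            changes[name] = (old, new)
    return changes

# Hypothetical readings taken a week apart.
week1 = {"Reallocated_Sector_Ct": 0, "Current_Pending_Sector": 1}
week2 = {"Reallocated_Sector_Ct": 1, "Current_Pending_Sector": 0}
print(worsened(week1, week2))
```

Here a pending sector turning into one reallocated sector shows up as a single change; steady growth across several readings is the pattern the advice above warns about.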

Joe L.


May 23, 2008

Author

smartctl is attached.

So are you saying that unRAID (may have) moved the data on the bad sectors to good sectors (keeping my data files working)? If this is the case, doesn't it still mean I need to replace the drive as it has gone bad? I guess the benefit of writing from the bad to good sectors is that when I add a new drive and rebuild from parity all the data written to the new drive will be good?

Edit: Just saw you edited your post and added some stuff. If I have bad sectors but they don't get worse then the drive is fine and there's nothing I need to worry about?

Also... what am I supposed to be looking for in the smart.txt file? I'm guessing "Current_Pending_Sector?" Right now that has a value of one. As long as it stays at 1 I shouldn't have any further issues?

Edit 2: Just a thought... earlier today I was writing to one of my shares (which was writing to the drive I got errors on) and during the write I added a drive to my unRAID server (not added to the array... just plugged it into the server and it got power and spun up). During that spin up I think the transfer to my share paused for maybe 10 seconds. So... could this have caused the bad sector? That pause was maybe caused by a power fluctuation? It sure seemed like when I added that one drive all the drives spun down for a second (I can't remember for sure though... I had a lot of background noise).


May 23, 2008

Author

Ok... so from what I've read on here and what I've found on the net it's my understanding that a bad sector is nothing to worry about. What I've read is that some drives can come shipped new with a bad sector or two. The hard drive will locate these bad sectors and use extra sectors to replace the bad ones. Problems start to occur when one bad sector turns into many.

I've also found people saying that one bad sector is a sign the drive may be failing and to RMA it. Once a drive has a bad sector it's more likely to fail both in the short and long run so it's better to just get rid of the drive and start with a new one.

I think I'm going to RMA the drive and get a new one. There's no point in keeping it and "hoping" it works for years to come. I really don't want to have to check this disk every week to see if the bad sectors are increasing. I'd rather just do a parity check once a month as unRAID maintenance to make sure everything is running smoothly.

But I'm confused about why these errors happened now. Why didn't these errors show up during the countless parity rebuilds and checks I've run since I built my server about 3 weeks ago? Why did a sector go bad today?

If the disk came with a bad sector from the factory then I should have gotten an error on my first parity build, right? I would gather that this means the bad sector recently showed up, which is probably worse than having a bad sector come straight from the factory.

Can someone just confirm some of my thoughts above? Basically... I can keep using this disk but I should keep checking every week or so to make sure no bad sectors are showing up... or I can RMA it and try a new one. Is that about right?

Also... can I get some clarification on what I should be looking for in the smart logs? I see "Current_Pending_Sector" and it has a value of 1 for me. I'm guessing "1 bad sector" is what it's trying to tell me. My googling has led me to believe that "Current_Pending_Sector" basically means "Bad Sectors."

Below that there's something called "Offline_Uncorrectable." What does this one mean? Is it a sector that couldn't be corrected and had to be taken offline?

Sorry for all the questions.


May 23, 2008

From what I understand, the "current pending sector" is one that failed when an attempt was made to read it. The drive has remembered the sector number so that when the next write is performed to the same sector it will be re-mapped.
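The sequence described above (a failed read marks the sector pending, and the next write to it triggers the remap to a spare) can be modeled as a toy state machine. This is an illustrative Python sketch only: `ToyDrive` and its methods are hypothetical, and real drive firmware keeps its own vendor-specific bookkeeping.

```python
class ToyDrive:
    """Toy model of pending/reallocated sector bookkeeping (not real firmware)."""

    def __init__(self, spares):
        self.spares = spares          # spare sectors still available for remapping
        self.pending = set()          # sectors that failed a read, awaiting a write
        self.reallocated = set()      # sectors already remapped to spares

    def read_failed(self, lba):
        """A read error marks the sector as pending reallocation."""
        self.pending.add(lba)

    def write(self, lba):
        """A write to a pending sector remaps it to a spare."""
        if lba not in self.pending:
            return
        if self.spares == 0:
            raise RuntimeError("no spare sectors left to remap to")
        self.pending.remove(lba)
        self.spares -= 1
        self.reallocated.add(lba)

    def counters(self):
        """Counts named after the matching SMART attributes."""
        return {"Current_Pending_Sector": len(self.pending),
                "Reallocated_Sector_Ct": len(self.reallocated)}

drive = ToyDrive(spares=10)
drive.read_failed(1000)        # hypothetical sector number
print(drive.counters())        # one pending, none reallocated
drive.write(1000)              # the next write triggers the remap
print(drive.counters())        # none pending, one reallocated
```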

As far as why sectors become unreadable over time... it could be that mechanical tolerances are getting worse as the drive gets older, or that the magnetic material on the disk platter has a weak spot in the coating. Or the drive might have taken some abuse somewhere between the assembly line and your home. (Did the UPS delivery person say "Oops"?)


May 23, 2008

Author

Thanks for the reply Joe.

So the cause of a bad sector is unknown. It could come from manufacturing, delivery from the factory to my doorstep, old parts getting tired, the magnetic coating, and probably a few more things. It could be caused by something that won't affect the drive again in the future... but it could also be caused by something that will slowly (or quickly) degrade the quality of the drive. So the best course of action is to RMA the drive with WD (since I'm well within my warranty I might as well take advantage of it).

Looks like I'll be adding one of my new drives in place of drive2 with the bad sector. Parity should rebuild all the data from my old bad sector drive.


May 23, 2008

A couple quick thoughts (don't have much time):

1 - ALWAYS run a full parity check on your array before doing drive maintenance! When I order new drives, I run parity check while they are en route so that I am ready to go when they arrive. If you are going to get sync errors, you want to get them while parity is in place!

2 - Your smart.txt file indicates that you have 1 sector marked as ready to be reallocated. It did NOT get reallocated yet. That means that there was one (physical) read error that the drive noticed, but there has been no subsequent write to that sector to cause the remap to occur.

3 - Your smart.txt file indicates that NO sectors have ALREADY been reallocated.

4 - Yours is the second smart.txt file I've seen where the drive is reporting minimal (if any) remapping activity, while unRAID is getting a substantial number of errors. (The other person has over 1000 errors without a single remap occurring). Obviously drives are returning errors on read that have nothing to do with physical media issues. I wish I understood the interface better so that I could do more than speculate as to what is causing these.
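Point 1 works because a parity check reads every data drive, recomputes parity, and compares it with what the parity drive holds. A toy Python sketch of that comparison, assuming single XOR parity and data drives of equal size (real arrays mix drive sizes; that detail is ignored here). The function name and `block_size` parameter are hypothetical, and correcting mismatches is not modeled.

```python
from functools import reduce
from operator import xor

def parity_check(data_disks, parity, block_size=512):
    """Return byte offsets of blocks whose stored parity disagrees with the data.

    data_disks: equal-length byte strings, one per data drive (at least one)
    parity:     byte string of the same length
    """
    mismatches = []
    for off in range(0, len(parity), block_size):
        chunks = [d[off:off + block_size] for d in data_disks]
        expected = bytes(reduce(xor, col) for col in zip(*chunks))
        if expected != parity[off:off + block_size]:
            mismatches.append(off)
    return mismatches

# Toy data: two 4-byte "drives"; the last parity block is deliberately wrong.
d0, d1 = b"\x01" * 4, b"\x02" * 4
print(parity_check([d0, d1], b"\x03\x03\x00\x03", block_size=2))
```

Any mismatch found this way is a sync error, and it is far better to discover it while every drive is still present than after one has been pulled.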


May 23, 2008

Author


If a drive is returning read errors without bad sectors wouldn't that most likely be memory related?

With my disk I have 1 error on the drive itself (I thought it was a bad sector... but you're saying it only "may" be a bad sector?). From my smart.txt file all the errors happen within 1 block of data (not random). From my understanding, all my read errors I got from my last parity rebuild were due to that 1 bad sector on my hard drive.


May 23, 2008

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  6 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always   -           0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always   -           1

This indicates a spot on the drive went bad. It is waiting to be "reallocated" on the next write to that sector.
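The two attribute rows quoted here come from the table that `smartctl -A` prints (it is also part of `smartctl -a` output), with the raw value in the last column. A small Python sketch, assuming rows in that layout, pulls out the pending and already-reallocated counts; the column names follow smartctl's table header, while the function names are hypothetical.

```python
# Column names of the `smartctl -A` attribute table.
FIELDS = ("id", "name", "flag", "value", "worst", "thresh",
          "type", "updated", "when_failed", "raw")

def parse_attribute(line):
    """Split one row of `smartctl -A` output into a dict of its columns."""
    # The raw value is the last column and may itself contain spaces on
    # some drives, so stop splitting after nine fields.
    row = dict(zip(FIELDS, line.split(maxsplit=9)))
    first = row["raw"].split()[0]
    row["raw_int"] = int(first) if first.isdigit() else None
    return row

def sector_summary(lines):
    """Pending vs. already-reallocated sector counts from attribute rows."""
    rows = {r["name"]: r for r in map(parse_attribute, lines)}

    def raw(name):
        return (rows.get(name) or {}).get("raw_int") or 0

    return {"pending": raw("Current_Pending_Sector"),
            "reallocated": raw("Reallocated_Sector_Ct")}

rows = [
    "6 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0",
    "197 Current_Pending_Sector 0x0012 200 200 000 Old_age Always - 1",
]
print(sector_summary(rows))  # {'pending': 1, 'reallocated': 0}
```

The raw column is what changes sector by sector; the normalized VALUE/WORST/THRESH columns only flag an attribute as failing once VALUE drops to THRESH or below.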

Now there is a way to calculate what that sector is, but it's complicated.
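Part of that calculation is plain arithmetic: converting the failing sector's absolute LBA (for example, the one a failed SMART self-test reports) into a block number inside the filesystem on that partition. A Python sketch, assuming 512-byte sectors and a 4096-byte filesystem block size (the usual ReiserFS default); the partition start and LBA below are placeholders, and the function name is hypothetical. Going from the block number to the file that owns it needs a filesystem-specific tool and is not shown.

```python
SECTOR_SIZE = 512  # bytes per sector (assumed: 512-byte-sector drives)

def lba_to_fs_block(lba, partition_start, fs_block_size=4096):
    """Map an absolute disk LBA to a filesystem block number in its partition.

    lba:             absolute sector address of the bad sector
    partition_start: first LBA of the partition (read it from the partition table)
    fs_block_size:   filesystem block size in bytes (4096 assumed here)
    Returns None when the LBA lies before the start of the partition.
    """
    if lba < partition_start:
        return None
    return (lba - partition_start) * SECTOR_SIZE // fs_block_size

# Placeholder numbers for illustration; substitute your own drive's values.
print(lba_to_fs_block(lba=1_000_063, partition_start=63))  # 125000
```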

You do not need to RMA the drive yet. It's a minimal situation. Perhaps a few files may be involved in this area.

I would suggest doing a smartctl LONG test as this will go through all the sectors more thoroughly.

smartctl -d ata -t long /dev/sd?   (where ? is the drive letter)

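When the test finishes (it runs inside the drive in the background and can take several hours), capture the results with smartctl -d ata -a /dev/sd? > smartctl.txt and post that log for review.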

If you really want to get rid of that error and have a spare drive, you can swap it out.

After the drive is rebuilt on your spare you could clear it with dd and have it write 0's to all sectors, which would force a remap.

The current pending sector means this number of sectors had unrecoverable read errors and it will be remapped on the next write.

Had you done a parity "CHECK" before the swap, it would have hit the read errors and rewritten those sectors from the parity drive, thereby causing the drive to remap them.

You can also do a

smartctl -d ata -H /dev/sd? to check the overall health of the drive.


May 23, 2008

Author

I've already swapped out the drive and am rebuilding (data rebuild) the new drive right now.

You mentioned:

After the drive is rebuilt on your spare you could clear it with dd and have it write 0's to all sectors, which would force a remap.

Are you referring to my drive that had errors? Once my new drive is done being rebuilt I could add my old error showing drive back to the array and clear it with dd? Clearing the old drive would force the bad sector to re-map then. I'm pretty sure this is what you were talking about... at first read it sounds like you were talking about rebuilding the new drive and then clearing that new drive (which didn't make much sense).

So... I should add the old drive back to the array... clear it with dd... and then run a smartctl LONG test? I can then post the results back here for you guys to interpret? After it remaps the bad sector we may have a better idea of understanding how the drive is performing?

And you're right... I should have done a parity CHECK (after finishing my parity rebuild) before adding this new drive to replace it. Which brings up another question... Is a file on my new drive now corrupt? Since the last parity build couldn't read a sector of the "bad" drive when it built parity, does that mean my new drive that was built from parity will be missing those bits of data?

