Experiencing read errors but all tests normal


Your "read" errors are "media errors"

 

Jun 24 08:07:36 Tower kernel: ata4.00: irq_stat 0x40000001

Jun 24 08:07:36 Tower kernel: ata4.00: failed command: READ DMA EXT

Jun 24 08:07:36 Tower kernel: ata4.00: cmd 25/00:50:17:07:5c/00:02:e8:00:00/e0 tag 0 dma 303104 in

Jun 24 08:07:36 Tower kernel:          res 51/40:5f:f8:07:5c/00:01:e8:00:00/e0 Emask 0x9 (media error)

Jun 24 08:07:36 Tower kernel: ata4.00: status: { DRDY ERR }

Jun 24 08:07:36 Tower kernel: ata4.00: error: { UNC }

These are errors where the checksum at the end of a sector being read does not match the contents of the sector. In other words, the disk considers the sector unreadable and uncorrectable: it tries multiple times before deciding it cannot read the sector and have it match the checksum. (UNC = uncorrectable.)

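For anyone who wants to see which sector the kernel is complaining about, the "cmd" and "res" lines in the log above can be decoded by hand. The following is only a sketch in Python, assuming the usual libata field order for these dumps (first/second register, then nsect:lbal:lbam:lbah, then the "hob" high-order bytes, then device); the function and key names are made up for illustration.

```python
import re

# Assumed libata layout of a "cmd"/"res" taskfile dump:
#   a/b:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device
TASKFILE = re.compile(
    r"([0-9a-f]{2})/([0-9a-f]{2}):([0-9a-f]{2}):([0-9a-f]{2}):([0-9a-f]{2}):([0-9a-f]{2})"
    r"/([0-9a-f]{2}):([0-9a-f]{2}):([0-9a-f]{2}):([0-9a-f]{2}):([0-9a-f]{2})/([0-9a-f]{2})"
)

UNC_BIT = 0x40  # bit 6 of the ATA error register: uncorrectable data


def decode_taskfile(text):
    """Decode one taskfile dump into its 48-bit LBA and sector count."""
    m = TASKFILE.search(text)
    if m is None:
        raise ValueError("no taskfile dump found")
    (first, second, nsect, lbal, lbam, lbah,
     _hob_feature, hob_nsect, hob_lbal, hob_lbam, hob_lbah, _device) = (
        int(x, 16) for x in m.groups())
    # LBA48: the three "hob" bytes are the high half, the others the low half.
    lba = (hob_lbah << 40 | hob_lbam << 32 | hob_lbal << 24
           | lbah << 16 | lbam << 8 | lbal)
    return {
        "command_or_status": first,   # command for "cmd", status for "res"
        "feature_or_error": second,   # feature for "cmd", error for "res"
        "lba": lba,
        "sectors": hob_nsect << 8 | nsect,
    }


cmd = decode_taskfile("cmd 25/00:50:17:07:5c/00:02:e8:00:00/e0")
res = decode_taskfile("res 51/40:5f:f8:07:5c/00:01:e8:00:00/e0")

print(cmd["sectors"] * 512)                   # bytes requested by the read
print(res["lba"])                             # LBA reported with the error
print(bool(res["feature_or_error"] & UNC_BIT))
```

With the lines from this log, the request works out to 592 sectors (303104 bytes, matching "dma 303104 in") and the reported LBA is 3898345464, which falls inside the range that was being read.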
 

When "read" errors occur, unRAID reconstructs the correct contents of the unreadable sector by reading parity in combination with all the other data disks in your server. At the same time, it re-writes the same (previously unreadable) sector so that the SMART firmware on the disk can re-allocate it if needed (that is, assign a spare sector from its pool of spare sectors).
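
As a rough illustration of how parity lets one unreadable sector be rebuilt, here is a small Python sketch of single-parity XOR reconstruction. It is not unRAID's actual code; the three tiny "disks" and the helper name xor_blocks are invented for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three made-up data "sectors", one per data disk.
data_disks = [b"\x12\x34", b"\xab\xcd", b"\x0f\xf0"]

# Parity is the XOR of the same sector on every data disk.
parity = xor_blocks(data_disks)

# Pretend the middle disk returned UNC for this sector: XOR parity with the
# sectors that are still readable to get the missing contents back.
rebuilt = xor_blocks([parity, data_disks[0], data_disks[2]])
print(rebuilt == data_disks[1])
```

The rebuilt bytes are what would then be written back to the bad sector, giving the drive its chance to re-allocate it.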

 

Odds are high your disk has sectors that have been re-allocated, and it may have sectors pending re-allocation.  The only way to know its health is to get a SMART report of the disk.

 

To do this, on the command line type:

smartctl -a /dev/sde

and post the output in this thread.

 

We are looking at the numbers in the "RAW" column for re-allocated sectors and sectors pending re-allocation.

 

Joe L.

Link to comment

Is this what I needed?

 

SMART Attributes Data Structure revision number: 16

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE

  1 Raw_Read_Error_Rate    0x002f  200  200  051    Pre-fail  Always      -      0

  3 Spin_Up_Time            0x0027  164  148  021    Pre-fail  Always      -      8775

  4 Start_Stop_Count        0x0032  100  100  000    Old_age  Always      -      770

  5 Reallocated_Sector_Ct  0x0033  200  200  140    Pre-fail  Always      -      0

  7 Seek_Error_Rate        0x002e  100  253  000    Old_age  Always      -      0

  9 Power_On_Hours          0x0032  067  067  000    Old_age  Always      -      24120

10 Spin_Retry_Count        0x0032  100  100  000    Old_age  Always      -      0

11 Calibration_Retry_Count 0x0032  100  100  000    Old_age  Always      -      0

12 Power_Cycle_Count      0x0032  100  100  000    Old_age  Always      -      224

192 Power-Off_Retract_Count 0x0032  200  200  000    Old_age  Always      -      137

193 Load_Cycle_Count        0x0032  180  180  000    Old_age  Always      -      61169

194 Temperature_Celsius    0x0022  117  109  000    Old_age  Always      -      35

196 Reallocated_Event_Count 0x0032  200  200  000    Old_age  Always      -      0

197 Current_Pending_Sector  0x0032  200  200  000    Old_age  Always      -      13

198 Offline_Uncorrectable  0x0030  200  200  000    Old_age  Offline      -      0

199 UDMA_CRC_Error_Count    0x0032  200  200  000    Old_age  Always      -      0

200 Multi_Zone_Error_Rate  0x0008  200  177  000    Old_age  Offline      -      0

 

SMART Error Log Version: 1

No Errors Logged

 

SMART Self-test log structure revision number 1

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error

# 1  Short offline      Completed without error      00%    23887        -

# 2  Short offline      Completed without error      00%    23887        -

# 3  Short offline      Completed: read failure      70%    23886        3905729434

 

SMART Selective self-test log data structure revision number 1

SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS

    1        0        0  Not_testing

    2        0        0  Not_testing

    3        0        0  Not_testing

    4        0        0  Not_testing

    5        0        0  Not_testing

Selective self-test flags (0x0):

  After scanning selected spans, do NOT read-scan remainder of disk.

If Selective self-test is pending on power-up, resume after 0 minute delay.

 

It is also having the IRQ16 shutdown bug again, which I have not had an issue with on this board/CPU before, and nothing's changed, so I am at a loss with that one.

 

Whether it is or not, I ordered a new drive to replace it, or just to add to the array...lol...can always use more space  :)

 

 

Link to comment

Well, replacement is certainly an option, but according to these lines in the report there are 13 sectors pending re-allocation:

  5 Reallocated_Sector_Ct  0x0033  200  200  140    Pre-fail  Always      -      0

197 Current_Pending_Sector  0x0032  200  200  000    Old_age  Always      -      13
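
If you want to pull those two numbers out of a saved report automatically, something like the Python sketch below would do it. It assumes the ten-column attribute table layout shown in this thread, with the raw value in the last column; raw_values is a hypothetical helper name, not part of smartctl.

```python
def raw_values(report, wanted=(5, 197)):
    """Return {attribute_id: raw_value} for the requested SMART attribute IDs."""
    found = {}
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns and start with a numeric ID.
        if len(fields) >= 10 and fields[0].isdigit() and int(fields[0]) in wanted:
            found[int(fields[0])] = int(fields[9])
    return found


sample = """\
  5 Reallocated_Sector_Ct  0x0033  200  200  140    Pre-fail  Always      -      0
197 Current_Pending_Sector  0x0032  200  200  000    Old_age  Always      -      13
"""

counts = raw_values(sample)
print("re-allocated:", counts[5], "pending:", counts[197])
```

Running it again on a report taken after the rebuild shows whether the pending count has dropped and whether any sectors were actually re-allocated.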

 

The way most users of unRAID would handle this is to have those sectors re-written, and re-allocated.  This can be done by first making a copy of any critical data on that disk (just in case) and then:

 

1. stop the array

2.  Make a copy of the "config" directory on the flash drive while the array is stopped.  Save it someplace safe.  (We should not need it, but just in case we can revert to this configuration easily with it)

3. un-assign the disk with the read errors.

4. start the array with the disk un-assigned  (this will allow unRAID to forget its model/serial number so it can be used as its own replacement)

5. stop the array once more

6. re-assign the disk. It will be then written as its own replacement (upon which it will be re-constructed and all the sectors pending re-allocation should be re-allocated.)  Basically, everything on the disk will be re-written in place.    When it gets to the 13 sectors pending re-allocation the disk will first try to re-write the existing sector and checksum.  If that works, the sector will not be re-allocated since it will then be readable and its affiliated check-sum match.  If not successful, it will be re-allocated from the pool of spare sectors.

 

Note that the re-construction process will take about as long as the initial parity sync, and during that interval you'll not be protected by parity if another disk should fail.
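
Step 2 above (saving the "config" directory) can be done by hand, but for those who prefer to script it, here is a minimal Python sketch. The function name and the timestamped folder naming are my own choices; it assumes the flash drive's config directory is reachable at a path you pass in (on a stock unRAID install that is normally /boot/config) and that the destination is somewhere off the array, since the array is stopped at that point.

```python
import shutil
import time
from pathlib import Path


def backup_config(config_dir, dest_root):
    """Copy config_dir into a new timestamped folder under dest_root."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = Path(dest_root) / f"config-{stamp}"
    # copytree creates the target folder and refuses to overwrite an existing one.
    shutil.copytree(config_dir, target)
    return target
```

A call such as backup_config("/boot/config", "/path/to/safe/place") would return the folder that was created; the destination path here is just a placeholder.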

 

Link to comment

Since I have the new disc coming in and would have to use it as the backup, can I instead just preclear it, install it as a new disc, copy all the data to it, and then use the current disc as a new disc?

 

Questions:

Will I have to go through the pre-clear process again on the old disc?

Will that accomplish the same thing plus give me the added space?

 

Wouldn't this be okay as well? Since I use a second (free) unRAID install as an OS to pre-clear, the only downtime should be the parity sync, right?

 

Or am I better off doing the re-write?

Link to comment

Yes, you will have to preclear the old disk if you remove it from the configuration and want to add it to a new slot.

Yes, preclearing will force the drive to read and write all involved sectors, thus allowing reallocation to work as needed.

 

Since you will be preclearing the new disk for testing purposes anyway, it makes perfect sense to add it to the array and copy the files from the drive having issues to the new drive. Theoretically you will still be protected from a drive failure during the entire procedure so far, and would only lose protection when you remove the drive and recalc parity. You will still be unprotected for the same length of time, but you would have two copies of the data in question during the at risk period.

Link to comment

Here is the new log:

 

I had to remove about 3k lines of error notices like this one in order to get it uploaded:

Jun 24 14:06:45 Tower kernel: REISERFS error (device md3): vs-4080 _reiserfs_free_block: block 229612613: bit already cleared

That indicates the file system has probably been set to read-only to prevent further corruption.

 

You need to un-mount disk3 and then run

reiserfsck --check /dev/md3

to have it tell you what command needs to be run next to fix the corruption.
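
Since the check must only be run on an un-mounted disk, a small guard like the Python sketch below can save a mistake. It only builds the command from this post after confirming the device is not listed as a mount source; the helper names are hypothetical, and it assumes /proc/mounts-style text (device in the first column).

```python
def is_mounted(device, mounts_text):
    """True if device appears as a mount source in /proc/mounts-style text."""
    return any(line.split()[:1] == [device] for line in mounts_text.splitlines())


def check_command(device, mounts_text):
    """Return the reiserfsck check command, refusing if the device is mounted."""
    if is_mounted(device, mounts_text):
        raise RuntimeError(f"{device} is still mounted; un-mount it first")
    return ["reiserfsck", "--check", device]
```

On the server itself you would pass in the contents of /proc/mounts and hand the returned list to subprocess.run; reiserfsck will still ask for confirmation before it starts.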

 

Details are in the wiki under "check file systems"

Link to comment

Thanks Joe L.

 

Okay I did what it said and this is what it put out (seems like it is still working at this point?)

 

Replaying journal: Done.

Reiserfs journal '/dev/md3' in blocks [18..8211]: 0 transactions replayed

 

Should I use the --fix-fixable switch now? Or where do I go from here?

Normally, it would tell you to run fix-fixable if it was needed.  If it is not yet done, let it finish.

 

It probably would not hurt anything to run fix-fixable, but it will tell you once the current check is complete.

Just don't go running anything further unless it tells you to.

Link to comment
Any idea how long this will take to finish? It's 1.5TB of data, but it hasn't done anything for almost 3 hours. I haven't a clue (snicker), so I am asking.

 

I am assuming that it is done, frozen, or screwed up. I don't have the time to babysit it; I must use the server later tonight, so it will have to wait.

 

Okay, so I am back at copying the data to a 4TB external (but 1 MB/s is just brutal). Then I guess I will try the first suggestion mentioned here, as reiserfsck does not seem to be doing anything at all. From what I was reading, shouldn't it take less than an hour to complete the test? It has taken 12 hours and not budged. So I will probably go the other way and hope that will fix the issue.

Link to comment

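I believe so. Since nothing has worked so far, I am now asking whether I am just better off saving the data (in progress) and doing what Joe L. suggested first, the rebuild-in-place procedure (un-assign the disk, start and stop the array, then re-assign it).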

 

And would that fix this? I am almost to the point of removing (saving) all 8TB of data and restarting this thing from scratch...it would have been done by now...lol.  :(

 

 

Link to comment

The procedure outlined should correct the pending sectors. Once the disk surface is corrected then you can start fixing its contents, i.e., the file system.

Link to comment

Okay, it is doing the rebuild...but at 3.22 MB/s, is this right? 2TB in 10049 minutes???

 

It also seems to be staying as a read-only file system on 'disk3'

 

Sorry, just while I was typing it went down to 2.58...I think at this point I will just back up the data on all drives and reset the entire array, maybe even change to a different server, as this has been problematic at best, almost from minute one.

 

I do know I am not waiting 170 hours (7 days) to use this when I can have the data reloaded onto another system in 1 day.

 

Guess I am stuck with this.....

 

Who knew that the WD20EURS is somehow smaller than the WD20EVDS when precleared.....oh well, I guess I'll just have to throw the money at it, build a new large one, and try something different.

Link to comment

Okay, it is doing the rebuild...but at 3.22 MB/s, is this right? 2TB in 10049 minutes???

Attach a new syslog.

It also seems to be staying as a read-only file system on 'disk3'

Correct. The physical disk surface must be correct before the file system can be corrected. See my last post.

Sorry, just while I was typing it went down to 2.58...I think at this point I will just back up the data on all drives and reset the entire array, maybe even change to a different server, as this has been problematic at best, almost from minute one.

 

I do know I am not waiting 170 hours (7 days) to use this when I can have the data reloaded onto another system in 1 day.

 

Guess I am stuck with this.....

Cannot provide any insight without a new syslog. Attach a new syslog. Rebuilding from scratch will take exactly as long if the hardware problems are not corrected first.

 

Who knew that the WD20EURS is somehow smaller than the WD20EVDS when precleared.....oh well, I guess I'll just have to throw the money at it, build a new large one, and try something different.

Those drives are the same size. All modern drives have standardized sizes. Any system will have problems due to the hardware errors that you're experiencing.

Link to comment

 

Array Status: STARTED; 6 disks in array.

Rebuilding disk3

Total Size: 1,953,514,552 KB

Current: 526,064,632 KB (26.9%)

Speed: 3,691 KB/sec

Finish: 6426 minutes

Syslog attached
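
As a sanity check on those figures, the remaining time follows directly from the total, the amount done, and the current speed. The short Python sketch below just redoes that arithmetic with the numbers from the status page; the function names are made up, and it assumes the speed stays constant, which it clearly has not so far.

```python
def rebuild_eta_minutes(total_kb, done_kb, speed_kb_per_s):
    """Minutes left at the current speed."""
    return (total_kb - done_kb) / speed_kb_per_s / 60


def percent_done(total_kb, done_kb):
    return 100 * done_kb / total_kb


total_kb = 1_953_514_552
done_kb = 526_064_632
speed = 3_691  # KB/sec

eta = rebuild_eta_minutes(total_kb, done_kb, speed)
print(f"{percent_done(total_kb, done_kb):.1f}% done, about {eta:.0f} minutes left "
      f"({eta / 60 / 24:.1f} days)")
```

That comes out to roughly 6,450 minutes, or about four and a half days, close to the 6426 minutes unRAID reports; the small gap presumably comes from the speed changing between samples.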

 

 

I also wasn't aware of any hardware errors, as I assumed they were just corrupt sectors, not damaged ones. If that is indeed the case, then I am going to have to go through this crap again when I replace the drive, which I will be doing instantly if there is damaged hardware.

syslog-2013-06-30.txt

Link to comment

I wonder why this hasn't happened before, since the board is newer (DoM 2011, BIOS 2012) and has been running this particular unRAID install for over a year now.

 

Or are you seeing the drive that I precleared on the older Gigabyte MB (I used another system for this particular preclear) and then tried to install?

Who knew that the WD20EURS is somehow smaller than the WD20EVDS when precleared

That would explain why that drive was stated as smaller by unRAID. I didn't use that drive, so there shouldn't be an issue; hopefully that is the error you are reading.

 

All of the currently used drives are listing 'LBA48 user addressable sectors: 3907029168' and even the parity drive 'LBA48 user addressable sectors: 5860533168'
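
Those sector counts can be turned into capacities with a little arithmetic. The Python sketch below assumes 512-byte logical sectors (the thread does not state the sector size, so treat that as an assumption) and uses a hypothetical helper named capacity.

```python
SECTOR_BYTES = 512  # assumption: logical sector size reported to the OS


def capacity(sectors):
    """Return (bytes, KiB, decimal terabytes) for a sector count."""
    total = sectors * SECTOR_BYTES
    return total, total // 1024, total / 1000**4


data_drive = capacity(3_907_029_168)
parity_drive = capacity(5_860_533_168)

print(data_drive)
print(parity_drive)
```

Under that assumption the data drives come to 2,000,398,934,016 bytes (1,953,514,584 KiB) and the parity drive to 3,000,592,982,016 bytes. The rebuild status page shows disk3 as 1,953,514,552 KB, 32 KB less than the raw figure; the thread does not say where that difference comes from.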

 

Otherwise, wouldn't this have happened long ago?

 

And no, the rebuild is only at 31%.

 

Link to comment
