icedragonslair Posted June 24, 2013
I am still seeing a lot of read errors on one of my drives (disk 3), but all tests and parity checks come back normal/successful. Should I just replace the drive on general principle and try to submit it for warranty replacement? At this point it may still be easy, since I have access to it and can just copy the files to another drive.
Attachment: syslog-2013-06-24.txt

Joe L. Posted June 24, 2013
Your "read" errors are "media errors":
Jun 24 08:07:36 Tower kernel: ata4.00: irq_stat 0x40000001
Jun 24 08:07:36 Tower kernel: ata4.00: failed command: READ DMA EXT
Jun 24 08:07:36 Tower kernel: ata4.00: cmd 25/00:50:17:07:5c/00:02:e8:00:00/e0 tag 0 dma 303104 in
Jun 24 08:07:36 Tower kernel: res 51/40:5f:f8:07:5c/00:01:e8:00:00/e0 Emask 0x9 (media error)
Jun 24 08:07:36 Tower kernel: ata4.00: status: { DRDY ERR }
Jun 24 08:07:36 Tower kernel: ata4.00: error: { UNC }
These are errors where the checksum at the end of a sector being read does not match the contents of the sector. In other words, the disk considers the sector unreadable and uncorrectable (UNC = uncorrectable). It tries multiple times before deciding it cannot read the sector and have it match the checksum.
When "read" errors occur, unRAID re-constructs the correct contents of the unreadable sector by reading parity in combination with all the other data disks in your server. At the same time, it re-writes the same (previously unreadable) sector so that the SMART firmware on the disk may re-allocate it if needed (assign a spare sector from its pool of spare sectors).
Odds are high your disk has sectors that have been reallocated, and it may have sectors pending re-allocation. The only way to know its health is to get a SMART report of the disk. To do this, on the command line type:
smartctl -a /dev/sde
and post the output in this thread. We are looking at the numbers in the "RAW" column for re-allocated sectors and sectors pending re-allocation.
Joe L.

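For anyone who would rather scan a long syslog than read it by eye, here is a minimal Python sketch that counts the uncorrectable (UNC) read errors per ATA device. It assumes the kernel lines look exactly like the ones quoted above; the function name and the regular expression are illustrative, not part of unRAID or any standard tool.

```python
import re
from collections import Counter

# Matches lines such as:
#   "... kernel: ata4.00: error: { UNC }"
# The flag list between the braces may hold several names.
ERROR_LINE = re.compile(r"(ata\d+\.\d+): error: \{ (.*?) \}")

def count_unc_errors(lines):
    """Count lines per ATA device whose error flags include UNC."""
    counts = Counter()
    for line in lines:
        match = ERROR_LINE.search(line)
        if match and "UNC" in match.group(2).split():
            counts[match.group(1)] += 1
    return counts

# The excerpt from the post above.
SAMPLE = [
    "Jun 24 08:07:36 Tower kernel: ata4.00: failed command: READ DMA EXT",
    "Jun 24 08:07:36 Tower kernel: ata4.00: status: { DRDY ERR }",
    "Jun 24 08:07:36 Tower kernel: ata4.00: error: { UNC }",
]

unc_counts = count_unc_errors(SAMPLE)
for device, count in unc_counts.items():
    print(f"{device}: {count} uncorrectable read error(s)")
```

To run it against a saved syslog, pass an open file object instead of SAMPLE, since the function only needs an iterable of lines.
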
icedragonslair Posted June 24, 2013
Is this what I needed?
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   164   148   021    Pre-fail  Always       -       8775
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       770
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   067   067   000    Old_age   Always       -       24120
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       224
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       137
193 Load_Cycle_Count        0x0032   180   180   000    Old_age   Always       -       61169
194 Temperature_Celsius     0x0022   117   109   000    Old_age   Always       -       35
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       13
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   177   000    Old_age   Offline      -       0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error        00%      23887        -
# 2  Short offline       Completed without error        00%      23887        -
# 3  Short offline       Completed: read failure        70%      23886        3905729434
SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
It is also having the IRQ16 shutdown bug again, which I have not had an issue with on this board/CPU before, and nothing has changed, so I am at a loss with that one. Either way, I ordered a new drive to replace it or just add to the array...lol...I can always use more space.

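The two numbers Joe L. asked about (reallocated sectors and sectors pending re-allocation) can also be pulled out of a saved report with a few lines of Python. This is only a sketch: it assumes the ten-column attribute layout of the report above, and the function name and the choice of watched IDs are illustrative.

```python
# Attribute IDs discussed in this thread:
# 5 = Reallocated_Sector_Ct, 197 = Current_Pending_Sector.
WATCHED_IDS = (5, 197)

def parse_smart_raw_values(report):
    """Return {id: (attribute_name, raw_value)} for each attribute row."""
    rows = {}
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows have 10 columns: a numeric ID first, a hex FLAG third,
        # and RAW_VALUE last. Rows with a non-numeric raw value are skipped.
        if (len(fields) >= 10 and fields[0].isdigit()
                and fields[2].startswith("0x") and fields[9].isdigit()):
            rows[int(fields[0])] = (fields[1], int(fields[9]))
    return rows

# Three lines copied from the report above (header plus the two rows of interest).
SAMPLE_REPORT = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       13
"""

rows = parse_smart_raw_values(SAMPLE_REPORT)
for attr_id in WATCHED_IDS:
    if attr_id in rows:
        name, raw = rows[attr_id]
        print(f"{attr_id:>3} {name}: raw value {raw}")
```

For this report the interesting result is attribute 197 with a raw value of 13, while attribute 5 is still 0, meaning nothing has been re-allocated yet.
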
Joe L. Posted June 24, 2013
Well, replacement is certainly an option, but according to these lines in the report there are 13 sectors pending re-allocation:
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       13
The way most users of unRAID would handle this is to have those sectors re-written and re-allocated. This can be done by first making a copy of any critical data on that disk (just in case) and then:
1. Stop the array.
2. Make a copy of the "config" directory on the flash drive while the array is stopped. Save it someplace safe. (We should not need it, but just in case, we can revert to this configuration easily with it.)
3. Un-assign the disk with the read errors.
4. Start the array with the disk un-assigned. (This will allow unRAID to forget its model/serial number so it can be used as its own replacement.)
5. Stop the array once more.
6. Re-assign the disk. It will then be written as its own replacement: it will be re-constructed, and all the sectors pending re-allocation should be re-allocated.
Basically, everything on the disk will be re-written in place. When it gets to the 13 sectors pending re-allocation, the disk will first try to re-write the existing sector and checksum. If that works, the sector will not be re-allocated, since it will then be readable and its associated checksum will match. If not successful, it will be re-allocated from the pool of spare sectors.
Note that the re-construction process will take about as long as the initial parity sync, and during that interval you will not be protected by parity if another disk should fail.

icedragonslair Posted June 24, 2013
Since I have the disk coming in and would have to use that as the backup, can I instead just preclear it, install it as a new disk, copy all the data to it, and then use the current disk as a new disk?
Questions:
Will I have to go through the pre-clear process again on the old disk?
Will that accomplish the same thing, plus give me the added space?
Wouldn't this be okay as well? Since I use a second (free) unRAID install as an OS to pre-clear, the only downtime should be the parity sync, right?
Or am I better off doing the re-write?

JonathanM Posted June 24, 2013
Yes, you will have to preclear the old disk if you remove it from the configuration and want to add it to a new slot.
Yes, preclearing will force the drive to read and write all involved sectors, thus allowing reallocation to work as needed.
Since you will be preclearing the new disk for testing purposes anyway, it makes perfect sense to add it to the array and copy the files from the drive having issues to the new drive. Theoretically you will still be protected from a drive failure during the entire procedure so far, and would only lose protection when you remove the drive and recalculate parity. You will still be unprotected for the same length of time, but you would have two copies of the data in question during the at-risk period.

icedragonslair Posted June 24, 2013
Thanks loads for the help, you answered everything perfectly...now one more thing. I seem to have lost all of my permissions (I decided to run the smb script to see if that was the problem), but it seems it made all of my shares read-only. What can I do to fix this?
Thanks, Ice

icedragonslair Posted June 24, 2013
Here is the new log. I had to remove about 3k lines of:
Jun 24 14:06:45 Tower kernel: REISERFS error (device md3): vs-4080 _reiserfs_free_block: block 229612613: bit already cleared
error notices in order to get it uploaded.
Attachment: syslog-2013-06-24.txt

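Rather than deleting thousands of repeated lines by hand, a short script can collapse them before upload. The sketch below is an assumption-laden illustration: it treats lines as "the same message" once the leading syslog timestamp and the block number are ignored, and the second sample block number is made up for the example.

```python
import re

# Leading syslog timestamp such as "Jun 24 14:06:45 ".
TIMESTAMP = re.compile(r"^[A-Z][a-z]{2}\s+\d+\s+\d{2}:\d{2}:\d{2}\s+")
# The block number is the part that differs between repeats.
BLOCK_NUMBER = re.compile(r"block \d+")

def collapse_repeats(lines):
    """Keep the first line of each message kind and note how often it repeated."""
    first_seen = {}  # normalized message -> first original line
    counts = {}      # normalized message -> number of occurrences
    for line in lines:
        line = line.rstrip("\n")
        key = BLOCK_NUMBER.sub("block N", TIMESTAMP.sub("", line))
        if key not in first_seen:
            first_seen[key] = line
            counts[key] = 0
        counts[key] += 1
    collapsed = []
    for key, line in first_seen.items():
        if counts[key] > 1:
            line += f"  [repeated {counts[key]} times, block numbers vary]"
        collapsed.append(line)
    return collapsed

SAMPLE = [
    "Jun 24 14:06:45 Tower kernel: REISERFS error (device md3): vs-4080 _reiserfs_free_block: block 229612613: bit already cleared",
    # Hypothetical second occurrence with a different block number.
    "Jun 24 14:06:46 Tower kernel: REISERFS error (device md3): vs-4080 _reiserfs_free_block: block 229612614: bit already cleared",
    "Jun 24 08:07:36 Tower kernel: ata4.00: error: { UNC }",
]

collapsed = collapse_repeats(SAMPLE)
for line in collapsed:
    print(line)
```

Note that the output lists each message kind in first-seen order, so it is a summary for posting and not a replacement for the full log.
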
Joe L. Posted June 24, 2013
That indicates the file system has probably been set to read-only to prevent further corruption. You need to un-mount disk3 and then run
reiserfsck --check /dev/md3
to have it tell you what command needs to be run next to fix the corruption. Details are in the wiki under "check file systems".

icedragonslair Posted June 24, 2013
Lol...I guess I'm just daft. I couldn't get it to even find md3 once it was unmounted, but it ran okay while mounted; it just skipped the journal replay (probably what we need, right?). Also, I am not finding anything that states "check file systems" in the wiki link in your sig.

Joe L. Posted June 24, 2013
http://lime-technology.com/wiki/index.php/Check_Disk_Filesystems

icedragonslair Posted June 24, 2013
Thanks Joe L. Okay, I did what it said and this is what it put out (seems like it is still working at this point?):
Replaying journal: Done.
Reiserfs journal '/dev/md3' in blocks [18..8211]: 0 transactions replayed
Should I use the --fix-fixable switch now? Or where do I go from here?

Joe L. Posted June 24, 2013
Normally, it would tell you to run --fix-fixable if it was needed. If it is not yet done, let it finish. (It probably would not hurt anything to run --fix-fixable, but it will tell you once the current check is complete. Just don't go running anything further unless it tells you to.)

icedragonslair Posted June 24, 2013
Any idea how long this will take to finish? It's 1.5TB of data, but it hasn't done anything for almost 3 hours. I haven't a clue (snicker), so I am asking. I am assuming that it is done, frozen, or screwed up. I don't have the time to babysit it and must use the server later tonight, so it will have to wait.
Okay, so I am back at copying the data to a 4TB external (but 1 MB/s is just brutal). Then I guess I will try the first suggestion mentioned here, as reiserfsck does not seem to be doing anything at all. I was reading up on it, and shouldn't it take less than an hour to complete the test? It has taken 12 hours and not budged. So I will probably go the other way and hope that will fix the issue.

dgaschk Posted June 25, 2013
Are you running reiserfsck on the disk with 13 pending sectors? If so, the pending sectors must be cleared before reiserfsck will work.

icedragonslair Posted June 25, 2013
I believe so. Since nothing has worked so far, I am now asking if I am just better off saving the data (in process) and doing what Joe L. mentioned first: un-assigning the disk, starting and stopping the array, and re-assigning the disk so that it is rebuilt as its own replacement and the 13 pending sectors get re-written or re-allocated.
And would that fix this? I am almost to the point of removing (saving) all 8TB of data and restarting this thing from scratch...it would have been done by now...lol.

dgaschk Posted June 25, 2013
The procedure outlined should correct the pending sectors. Once the disk surface is corrected, you can start fixing its contents, i.e., the file system.

icedragonslair Posted June 26, 2013
Thank you, will try once again, once I have the data backed up.

icedragonslair Posted June 29, 2013
Okay, it is doing the rebuild...but at 3.22 MB/s, is this right? 2TB in 10049 minutes??? It also seems to be staying as a read-only file system on 'disk3'.
Sorry, just while I was typing it went down to 2.58...I think at this point I will just back up the data on all drives and reset the entire array, maybe even change to a different server, as this has been problematic at best, almost from minute one. I do know I am not waiting 170 hours (7 days) to use this when I can have the data reloaded onto another system in 1 day.
Guess I am stuck with this..... Who knew that the WD20EURS is somehow smaller than the WD20EVDS when precleared.....oh well, I guess I'll just have to throw the money at it, build a new large one, and try something different.

dgaschk Posted June 29, 2013
icedragonslair wrote: "Okay, it is doing the rebuild...but at 3.22 MB/s, is this right? 2TB in 10049 minutes???"
Attach a new syslog.
icedragonslair wrote: "It also seems to be staying as a read-only file system on 'disk3'"
Correct. The physical disk surface must be correct before the file system can be corrected. See my last post.
icedragonslair wrote: "Sorry, just while I was typing it went down to 2.58...I think at this point I will just back up the data on all drives and reset the entire array, maybe even change to a different server, as this has been problematic at best, almost from minute one. I do know I am not waiting 170 hours (7 days) to use this when I can have the data reloaded onto another system in 1 day. Guess I am stuck with this....."
Cannot provide any insight without a new syslog. Attach a new syslog. Rebuilding from scratch will take exactly as long if the hardware problems are not corrected first.
icedragonslair wrote: "Who knew that the WD20EURS is somehow smaller than the WD20EVDS when precleared.....oh well, I guess I'll just have to throw the money at it, build a new large one, and try something different."
Those drives are the same size. All modern drives have standardized sizes. Any system will have problems due to the hardware errors that you're experiencing.

icedragonslair Posted June 30, 2013
Array Status: STARTED; 6 disks in array. Rebuilding disk3
Total Size: 1,953,514,552 KB
Current: 526,064,632 (26.9%)
Speed: 3,691 KB/sec
Finish: 6426 minutes
Syslog attached.
dgaschk wrote: "Those drives are the same size. All modern drives have standardized sizes. Any system will have problems due to the hardware errors that you're experiencing."
I also wasn't aware of any hardware errors, as I assumed they were just corrupt sectors, not damaged ones. If that is indeed the case, then I am going to have to go through this crap again when I replace the drive, which I will be doing instantly if there is damaged hardware.
Attachment: syslog-2013-06-30.txt

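The "Finish" figure in that status can be sanity-checked with simple arithmetic: remaining kilobytes divided by the current speed. A small Python sketch using only the numbers posted above follows; it assumes the speed stays constant, which the thread shows it does not, so treat the result as a rough estimate.

```python
def rebuild_eta_minutes(total_kb, done_kb, speed_kb_per_s):
    """Estimate minutes left, assuming the current speed holds."""
    remaining_kb = total_kb - done_kb
    return remaining_kb / speed_kb_per_s / 60

# Values from the array status in the post above.
total_kb = 1_953_514_552
done_kb = 526_064_632
speed_kb_per_s = 3_691

percent_done = 100 * done_kb / total_kb
eta_minutes = rebuild_eta_minutes(total_kb, done_kb, speed_kb_per_s)

print(f"{percent_done:.1f}% done, roughly {eta_minutes:.0f} minutes "
      f"({eta_minutes / 60 / 24:.1f} days) left at this speed")
```

The estimate lands within about twenty minutes of the 6426 minutes unRAID reported, so the status page is consistent with itself; the real problem is the speed of a few MB/s, not the estimate.
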
dgaschk Posted July 1, 2013
Jun 29 14:49:49 Tower kernel: ata6.00: HPA detected: current 3907027055, native 3907029168
It looks like the MB is corrupting the disk. See here: http://lime-technology.com/forum/index.php?topic=10866.0
Did the rebuild complete?

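The size of the hidden area follows directly from the two sector counts in that kernel line. The sketch below computes it; the 512-byte logical sector size is an assumption (it is not stated in the thread), and the function name is illustrative.

```python
import re

HPA_LINE = re.compile(r"HPA detected: current (\d+), native (\d+)")
SECTOR_BYTES = 512  # assumption: 512-byte logical sectors

def hidden_by_hpa(log_line):
    """Return (hidden_sectors, hidden_bytes), or None if the line has no HPA report."""
    match = HPA_LINE.search(log_line)
    if match is None:
        return None
    current, native = int(match.group(1)), int(match.group(2))
    hidden_sectors = native - current
    return hidden_sectors, hidden_sectors * SECTOR_BYTES

LINE = "Jun 29 14:49:49 Tower kernel: ata6.00: HPA detected: current 3907027055, native 3907029168"

hidden = hidden_by_hpa(LINE)
if hidden is not None:
    sectors, size_bytes = hidden
    print(f"{sectors} sectors hidden ({size_bytes / 1024:.1f} KiB)")
```

With these numbers the drive on ata6 exposes 2113 fewer sectors than its native capacity, about 1 MiB under the 512-byte assumption, which is enough for unRAID to treat it as smaller than an otherwise identical 2TB drive.
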
icedragonslair Posted July 1, 2013
I wonder why this hasn't happened before, since the board is newer (DoM 2011, BIOS 2012) and has been running this particular unRAID install for over a year now. Or are you seeing the drive that I precleared on the older Gigabyte MB (I used another system for this particular preclear) and then tried to install?
Earlier I wrote: "Who knew that the WD20EURS is somehow smaller than the WD20EVDS when precleared"
That would explain why that drive was reported as smaller by unRAID. I didn't use that drive, so there shouldn't be an issue; hopefully that is the error you are reading. All of the currently used drives list 'LBA48 user addressable sectors: 3907029168', and even the parity drive lists 'LBA48 user addressable sectors: 5860533168'. Otherwise, wouldn't this have happened long ago?
And no, the rebuild is only at 31%.