July 12, 2013

The following popped up in the syslog while migrating data to unRAID locally. The documentation says the problem was repaired by the unRAID parity, but it doesn't say to replace the drive. This is the only group of errors, and no errors were found during the preclear script. The only smart events that could be pertinent are ATA_Error_Count = 2, Reallocated_Sector_Ct = 1, Offline_Uncorrectable = 0, and UDMA_CRC_Error_Count = 1. Any suggestions?

Jul 11 12:51:26 UnRAID kernel: sd 9:0:0:0: [sdd] (Drive related)
Jul 11 12:51:26 UnRAID kernel: ASC=0x11 ASCQ=0x4
Jul 11 12:51:26 UnRAID kernel: sd 9:0:0:0: [sdd] CDB: (Drive related)
Jul 11 12:51:26 UnRAID kernel: cdb[0]=0x28: 28 00 73 57 2f 28 00 04 00 00
Jul 11 12:51:26 UnRAID kernel: end_request: I/O error, dev sdd, sector 1935094104 (Errors)
Jul 11 12:51:26 UnRAID kernel: ata7: EH complete (Drive related)
Jul 11 12:51:26 UnRAID kernel: md: disk4 read error, sector=1935094040 (Errors)
Jul 11 12:51:26 UnRAID kernel: md: disk4 read error, sector=1935094048 (Errors)
Jul 11 12:51:26 UnRAID kernel: md: disk4 read error, sector=1935094056 (Errors)
Jul 11 12:51:26 UnRAID kernel: md: disk4 read error, sector=1935094064 (Errors)
Jul 11 12:51:26 UnRAID kernel: md: disk4 read error, sector=1935094072 (Errors)
Jul 11 12:51:26 UnRAID kernel: md: disk4 read error, sector=1935094080 (Errors)
Jul 11 12:51:26 UnRAID kernel: md: disk4 read error, sector=1935094088 (Errors)
Jul 11 12:51:26 UnRAID kernel: md: disk4 read error, sector=1935094096 (Errors)
Jul 11 12:51:26 UnRAID kernel: md: disk4 read error, sector=1935094104 (Errors)
Jul 11 12:51:26 UnRAID kernel: md: disk4 read error, sector=1935094112 (Errors)
Jul 11 12:51:26 UnRAID kernel: md: disk4 read error, sector=1935094120 (Errors)
Jul 11 12:51:26 UnRAID kernel: md: disk4 read error, sector=1935094128 (Errors)
Jul 11 12:51:26 UnRAID kernel: md: disk4 read error, sector=1935094136 (Errors)
Jul 11 12:51:26 UnRAID kernel: md: disk4 read error, sector=1935094144 (Errors)
Jul 11 12:51:26 UnRAID kernel: md: disk4 read error, sector=1935094152 (Errors)
Jul 11 12:51:26 UnRAID kernel: md: disk4 read error, sector=1935094160 (Errors)
Jul 11 12:51:26 UnRAID kernel: md: disk4 read error, sector=1935094168 (Errors)
Jul 11 12:51:26 UnRAID kernel: md: disk4 read error, sector=1935094176 (Errors)
Jul 11 12:51:26 UnRAID kernel: md: disk4 read error, sector=1935094184 (Errors)
Jul 11 12:51:26 UnRAID kernel: md: disk4 read error, sector=1935094192 (Errors)
Jul 11 12:51:26 UnRAID kernel: md: disk4 read error, sector=1935094200 (Errors)
Jul 11 12:51:26 UnRAID kernel: md: disk4 read error, sector=1935094208 (Errors)
Jul 11 12:51:26 UnRAID kernel: md: disk4 read error, sector=1935094216 (Errors)
Jul 11 12:51:26 UnRAID kernel: md: disk4 read error, sector=1935094224 (Errors)
Jul 11 12:51:26 UnRAID kernel: md: disk4 read error, sector=1935094232 (Errors)
Jul 11 12:51:26 UnRAID kernel: md: disk4 read error, sector=1935094240 (Errors)
Jul 11 12:51:26 UnRAID kernel: md: disk4 read error, sector=1935094248 (Errors)
Jul 11 12:51:26 UnRAID kernel: md: disk4 read error, sector=1935094256 (Errors)
Jul 11 12:51:26 UnRAID kernel: md: disk4 read error, sector=1935094264 (Errors)
Jul 11 12:51:26 UnRAID kernel: md: disk4 read error, sector=1935094272 (Errors)
Jul 11 12:51:26 UnRAID kernel: md: disk4 read error, sector=1935094280 (Errors)
Jul 11 12:51:26 UnRAID kernel: md: disk4 read error, sector=1935094288 (Errors)
Jul 11 12:51:26 UnRAID kernel: md: disk4 read error, sector=1935094296 (Errors)
Jul 11 12:51:26 UnRAID kernel: md: disk4 read error, sector=1935094304 (Errors)
Jul 11 12:51:26 UnRAID kernel: md: disk4 read error, sector=1935094312 (Errors)
Jul 11 12:51:26 UnRAID kernel: md: disk4 read error, sector=1935094320 (Errors)
Jul 11 12:51:26 UnRAID kernel: md: disk4 read error, sector=1935094328 (Errors)
Jul 11 12:51:26 UnRAID kernel: md: disk4 read error, sector=1935094336 (Errors)
Jul 11 12:51:26 UnRAID kernel: md: disk4 read error, sector=1935094344 (Errors)
Jul 11 12:51:26 UnRAID kernel: md: disk4 read error, sector=1935094352 (Errors)
Jul 11 12:51:26 UnRAID kernel: md: disk4 read error, sector=1935094360 (Errors)
Jul 11 12:51:26 UnRAID kernel: md: disk4 read error, sector=1935094368 (Errors)
Jul 11 12:51:26 UnRAID kernel: md: disk4 read error, sector=1935094376 (Errors)
Jul 11 12:51:26 UnRAID kernel: md: disk4 read error, sector=1935094384 (Errors)
Jul 11 12:51:26 UnRAID kernel: md: disk4 read error, sector=1935094392 (Errors)
Jul 11 12:51:26 UnRAID kernel: md: disk4 read error, sector=1935094400 (Errors)
Jul 11 12:51:26 UnRAID kernel: md: disk4 read error, sector=1935094408 (Errors)
Jul 11 12:51:26 UnRAID kernel: md: disk4 read error, sector=1935094416 (Errors)
Jul 11 12:51:26 UnRAID kernel: md: disk4 read error, sector=1935094424 (Errors)
Jul 11 12:51:26 UnRAID kernel: md: disk4 read error, sector=1935094432 (Errors)
Jul 11 12:51:26 UnRAID kernel: md: disk4 read error, sector=1935094440 (Errors)
Jul 11 12:51:26 UnRAID kernel: md: disk4 read error, sector=1935094448 (Errors)
Jul 11 12:51:26 UnRAID kernel: md: disk4 read error, sector=1935094456 (Errors)
Jul 11 12:51:26 UnRAID kernel: md: disk4 read error, sector=1935094464 (Errors)
Jul 11 12:51:26 UnRAID kernel: md: disk4 read error, sector=1935094472 (Errors)
Jul 11 12:51:26 UnRAID kernel: md: disk4 read error, sector=1935094480 (Errors)
Jul 11 12:51:26 UnRAID kernel: md: disk4 read error, sector=1935094488 (Errors)
Jul 11 12:51:26 UnRAID kernel: md: disk4 read error, sector=1935094496 (Errors)
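As a cross-check on the log above, here is a minimal Python sketch that decodes the logged CDB as a SCSI READ(10) command (opcode 0x28, big-endian LBA in bytes 2-5, transfer length in bytes 7-8) and compares it with the reported sectors. The function name `decode_read10` is made up for illustration. The reading of the 64-sector difference between the sdd sector and the md sector as a partition start offset is an assumption, not something the log states.

```python
from typing import Tuple


def decode_read10(cdb_hex: str) -> Tuple[int, int]:
    """Return (start_lba, block_count) for a SCSI READ(10) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a 10-byte READ(10) CDB")
    start_lba = int.from_bytes(cdb[2:6], "big")    # bytes 2-5: logical block address
    block_count = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length in blocks
    return start_lba, block_count


# Values copied from the syslog excerpt in the post above.
start, count = decode_read10("28 00 73 57 2f 28 00 04 00 00")
failed_dev_sector = 1935094104  # "end_request: I/O error, dev sdd, sector ..."
first_md_sector = 1935094040    # first "md: disk4 read error" line
last_md_sector = 1935094496     # last "md: disk4 read error" line

# Assumption: the constant difference is the partition's starting sector.
offset = failed_dev_sector - first_md_sector

print(start, count)                                      # 1935093544 1024
print(start <= failed_dev_sector < start + count)        # True
print(offset)                                            # 64
# Each md line covers 8 sectors, so the last one ends 8 sectors after it starts.
print(last_md_sector + 8 + offset == start + count)      # True
```

Read this way, the whole group of md errors looks like the tail of a single 1024-sector read, starting at the one sector the drive could not return, rather than 58 separate bad spots.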
July 12, 2013

It would be good to look at the whole of the SMART data on your disk4. Could you post that, along with a syslog? Also, if you still have the full log for the same disc from when it completed the pre_clear script, that would be good to post as well.

Did you ever see any HPA warnings? I'm not sure what size the discs are yet, so that's just a thought. It may also help to know which disc you were migrating data from, and which disc you were migrating it to.

Also, I am assuming you are referring to a pre_clear being run on disk4, correct? How many cycles were run? Was Reallocated_Sector_Ct = 1 both before and after the pre_clear?
July 12, 2013

This message alone is telling you the drive cannot read certain sectors reliably:

Jul 11 12:51:26 UnRAID kernel: md: disk4 read error, sector=1935094040 (Errors)

Stop the unRAID array and issue

smartctl -t short /dev/sd?

Look at the SMART log in a few minutes. Then brace yourself for some hours of idle time and issue

smartctl -t long /dev/sd?

to do a long test. This could take up to 5 hours. Leave the drive alone, come back after 5-6 hours, grab a SMART log, and review it. If there are bad LBAs or pending sectors, the drive is having difficulty reading or verifying certain sectors.

"The only smart events that could be pertinent..."

What about pending sectors? Post a smartctl log for the community to review.
July 13, 2013 Author

The pending errors were on July 11th...

Attached: smartctl -a -A /dev/sdd output
Attached: syslog
Attached: preclear start and report

No additional errors have occurred on any other drives after this (fluke?). No drives have pending sectors.

preclear_start__JP2921HQ0GVU9A_2013-07-03.txt
preclear_rpt__JP2921HQ0GVU9A_2013-07-03.txt
smart_SDD.txt
syslog_04x10-11-12.txt
July 13, 2013

ok, it looks like the pre_clear did show a concern, but there were NO re-allocated sectors still at that point... During pre_clear cycle, the Raw Read Error rate increased, reducing the FAITH in this drive.

ATTRIBUTE             NEW_VAL  OLD_VAL  FAILURE_THRESHOLD  STATUS  RAW_VALUE
Raw_Read_Error_Rate =      95       85                 16  ok      327685

9.8 power-on days after the pre_clear cycle there was an error, which looks like it may have been when the re-allocation occurred.

Error 2 occurred at disk power-on lifetime: 28340 hours (1180 days + 20 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 d0 58 31 57 00  Error: UNC 208 sectors at LBA = 0x00573158 = 5714264

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC  Powered_Up_Time   Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 28 2f 57 e3 08  49d+00:01:26.724  READ DMA EXT
  25 00 00 28 2b 57 e0 08  49d+00:01:25.918  READ DMA EXT
  25 00 00 28 27 57 e3 08  49d+00:01:25.905  READ DMA EXT
  35 00 08 a0 0c 57 e3 08  49d+00:01:25.875  WRITE DMA EXT
  35 00 b8 68 08 57 e3 08  49d+00:01:25.858  WRITE DMA EXT

I would get a drive ready for replacement, to make a swap when convenient. It may be OK for a while, but given how much the Raw_Read_Error_Rate parameter changed during a pre_clear cycle, I would not trust that drive much longer.
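To relate this SMART error entry to the syslog from the first post, here is a small Python sketch that rebuilds the logged LBA from the SN, CL and CH register bytes and compares it with the low 24 bits of the sector in the `end_request` line. The helper name `lba_from_registers` is hypothetical. The comparison assumes this register dump only carries the low-order address bytes of a 48-bit command, so a match suggests, but does not prove, that both logs describe the same event.

```python
def lba_from_registers(sn: int, cl: int, ch: int) -> int:
    """Combine the SN, CL and CH register bytes into the low 24 bits of an LBA."""
    return (ch << 16) | (cl << 8) | sn


# Register bytes from the line "40 51 d0 58 31 57 00" (ER ST SC SN CL CH DH).
logged_lba = lba_from_registers(sn=0x58, cl=0x31, ch=0x57)

# Sector from "end_request: I/O error, dev sdd, sector 1935094104" in the first post.
syslog_sector = 1935094104

print(hex(logged_lba), logged_lba)             # 0x573158 5714264
print(hex(syslog_sector))                      # 0x73573158
print(syslog_sector & 0xFFFFFF == logged_lba)  # True

# The power-on lifetime arithmetic in the log is also self-consistent.
print(1180 * 24 + 20 == 28340)                 # True
```

If that assumption holds, Error 2 in the SMART log and the July 11 read error in the syslog point at the same sector.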
July 13, 2013

"ok, it looks like the pre_clear did show a concern, but there were NO re-allocated sectors still at that point... During pre_clear cycle, the Raw Read Error rate increased, reducing the FAITH in this drive."

ATTRIBUTE             NEW_VAL  OLD_VAL  FAILURE_THRESHOLD  STATUS  RAW_VALUE
Raw_Read_Error_Rate =      95       85                 16  ok      327685

The Raw_Read_Error_Rate improved, changing from 85 to 95. It fails at 16. This should increase your faith in the drive, if you wish to ascribe any meaning...
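The reading above can be sketched in a few lines of Python. The NEW_VAL and OLD_VAL columns are normalized values, where higher is healthier and an attribute counts as failed once its value drops to the threshold or below. The function name `normalized_trend` is made up for illustration, and the sketch says nothing about what the vendor-specific RAW_VALUE means.

```python
def normalized_trend(new_val: int, old_val: int, threshold: int) -> str:
    """Classify the change in a normalized SMART attribute value."""
    if new_val <= threshold:
        return "failing"      # at or below the failure threshold
    if new_val > old_val:
        return "improved"     # moved away from the threshold
    if new_val < old_val:
        return "worsened"     # moved toward the threshold
    return "unchanged"


# Raw_Read_Error_Rate from the preclear report: new 95, old 85, threshold 16.
print(normalized_trend(95, 85, 16))  # improved
print(95 - 16)                       # margin above the threshold: 79
```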
July 14, 2013

SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]

Run the SMART long test. If it passes that without issue, chances are very good the drive still has some life left in it. You can't avoid running the long test by running the preclear. The long test will be the final test, as it is the drive's firmware doing all the testing. It will take many hours, but if it passes that after all the other tests, then I would have a higher level of confidence.

I've had drives that passed higher-level tests only to fail the long test and reveal an LBA that was problematic. I've also found that after running badblocks in 4-pass read/write mode, the bad LBAs were refreshed and the drive came back to life. This is what the preclear tends to do; however, it could mask a problematic sector.

Do a SMART long test and post that log; then you'll know whether you should continue using the drive.
July 15, 2013 Author

Does the smartctl -t require the array to be down the entire time? AND can all of the drives handle a smartctl -t simultaneously (probably in different terminal sessions)?
July 15, 2013

"Does the smartctl -t require the array to be down the entire time? AND can all of the drives handle a smartctl -t simultaneously (probably in different terminal sessions)?"

A short test is done within 2-3 minutes, so it usually does not require the array to be down. The long test takes 5-6 hours on larger drives.

If the array is up, it might spin down the drive and interrupt the test. If the array is up, any activity on the drive could possibly interrupt the test (firmware dependent). If the array is up and the long test is active, the array's performance will suffer.

If the array is stopped, you can run multiple long tests on different drives simultaneously. You don't even need to do it on different terminals. Once you submit the long test, it will give you an approximate time of when the long test will be done. Smartctl will finish and the prompt will come back.

The key is: no activity, no spin downs; leave the drives alone for as long as you can.
July 15, 2013 Author

Are there any recommended switches? I've read the man page for smartctl (http://smartmontools.sourceforge.net/man/smartctl.8.html) and it doesn't look like it should need any; the -t is "tolerant" of nonstandard SMART fields. Are there any other switches that would be helpful for the long SMART check?
July 15, 2013

You can do a quick -t short to run the short test; if that works, you can schedule and run the -t long test.

smartctl -t short /dev/sd?
(where ? = the drive letter in question)

Wait a few minutes, then dump the log with

smartctl -a /dev/sd?

and review it.

Schedule a period of time when the server will be unused and uninterrupted, then run

smartctl -t long /dev/sd?
(where ? = the drive letter in question)

Wait hours (approximately 2-6 hours at least), then dump the SMART log:

smartctl -a /dev/sd?
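For anyone who wants to script the steps above, here is a minimal Python sketch that only builds and launches the same smartctl commands. In smartctl, lowercase `-t` selects a self-test, while the tolerance option in the man page is uppercase `-T`. The function names, the `/dev/sdX` placeholder, and the injectable `run` parameter are illustrative assumptions. The sketch also assumes Python 3 and smartctl are available on the server, and it does not wait for or interpret the test result.

```python
import subprocess
from typing import Callable, List


def self_test_commands(device: str, kind: str) -> List[List[str]]:
    """Return the two smartctl invocations from the post: start a test, then dump the log."""
    if kind not in ("short", "long"):
        raise ValueError("kind must be 'short' or 'long'")
    return [
        ["smartctl", "-t", kind, device],  # ask the drive firmware to start the self-test
        ["smartctl", "-a", device],        # later: print all SMART info, incl. the self-test log
    ]


def start_self_test(device: str, kind: str, run: Callable = subprocess.run):
    """Launch only the first command; the test itself runs inside the drive."""
    start_cmd, _report_cmd = self_test_commands(device, kind)
    return run(start_cmd, check=False)


# Show what would be run for a long test on a placeholder device (nothing is executed here).
for cmd in self_test_commands("/dev/sdX", "long"):
    print(" ".join(cmd))
```

Calling `start_self_test("/dev/sdX", "short")` with the real device name would start the short test; the second command is left for a manual run once the wait is over.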
This topic is now archived and is closed to further replies.