February 21, 2010

I changed out a fan in the case and had to loosen all the drives to get to it. After booting up I started a parity check as a safety check; moments after starting, the check stopped and drive2 was labeled DISK_DSBL. I shut down, reseated the power and data cables, and booted back up. The drive is still labeled DISK_DSBL with a red ball. It's set to read-only and I can see the folders and files on it.

I'm running the SMART tests. The short test completed fine. I'm assuming the problem was one of the cables, which are now seated properly. I had just finished several days of pre-clearing all my data drives and did a parity sync followed by 3 parity checks, so I believe the drive is in good condition.

So, when it passes the long SMART test, what do I need to do to get it back into the array with parity all synced up?

Attachment: syslog-2010-02-21.txt
February 21, 2010 (Author)

Now that I look closer at the SMART test I don't think it's good after all. Here are the results. Please advise.

Statistics for /dev/sdb ST3500320AS_9QM26B35

smartctl version 5.38 [i486-slackware-linux-gnu] Copyright © 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.11
Device Model:     ST3500320AS
Serial Number:    9QM26B35
Firmware Version: SD15
User Capacity:    500,107,862,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Sun Feb 21 12:36:19 2010 GMT+8
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (  25) The self-test routine was aborted by
                                        the host.
Total time to complete Offline
data collection:                 ( 625) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 111) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x103b) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   119   099   006    Pre-fail  Always       -       219208706
  3 Spin_Up_Time            0x0003   095   091   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   098   098   020    Old_age   Always       -       3042
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   073   060   030    Pre-fail  Always       -       4319341606
  9 Power_On_Hours          0x0032   089   089   000    Old_age   Always       -       10449
 10 Spin_Retry_Count        0x0013   100   098   097    Pre-fail  Always       -       1694
 12 Power_Cycle_Count       0x0032   098   098   020    Old_age   Always       -       2982
184 Unknown_Attribute       0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   061   061   000    Old_age   Always       -       39
188 Unknown_Attribute       0x0032   100   060   000    Old_age   Always       -       9286
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   068   053   045    Old_age   Always       -       32 (Lifetime Min/Max 31/32)
194 Temperature_Celsius     0x0022   032   047   000    Old_age   Always       -       32 (0 20 0 0)
195 Hardware_ECC_Recovered  0x001a   041   024   000    Old_age   Always       -       219208706
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       1

SMART Error Log Version: 1
ATA Error Count: 42 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 42 occurred at disk power-on lifetime: 10252 hours (427 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 78 36 01 00  Error: UNC at LBA = 0x00013678 = 79480

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 d8 47 34 01 e0 00      12:35:19.495  READ DMA EXT
  27 00 00 00 00 00 e0 00      12:35:19.494  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 02      12:35:19.474  IDENTIFY DEVICE
  ef 03 45 00 00 00 a0 02      12:35:19.464  SET FEATURES [set transfer mode]
  27 00 00 00 00 00 e0 00      12:35:19.444  READ NATIVE MAX ADDRESS EXT

Error 41 occurred at disk power-on lifetime: 10252 hours (427 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 78 36 01 00  Error: UNC at LBA = 0x00013678 = 79480

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 d8 47 34 01 e0 00      12:35:16.505  READ DMA EXT
  27 00 00 00 00 00 e0 00      12:35:16.504  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 02      12:35:16.484  IDENTIFY DEVICE
  ef 03 45 00 00 00 a0 02      12:35:16.465  SET FEATURES [set transfer mode]
  27 00 00 00 00 00 e0 00      12:35:16.444  READ NATIVE MAX ADDRESS EXT

Error 40 occurred at disk power-on lifetime: 10252 hours (427 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 78 36 01 00  Error: UNC at LBA = 0x00013678 = 79480

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 d8 47 34 01 e0 00      12:35:13.465  READ DMA EXT
  27 00 00 00 00 00 e0 00      12:35:13.464  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 02      12:35:13.444  IDENTIFY DEVICE
  ef 03 45 00 00 00 a0 02      12:35:13.433  SET FEATURES [set transfer mode]
  27 00 00 00 00 00 e0 00      12:35:13.414  READ NATIVE MAX ADDRESS EXT

Error 39 occurred at disk power-on lifetime: 10252 hours (427 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 78 36 01 00  Error: UNC at LBA = 0x00013678 = 79480

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 d8 47 34 01 e0 00      12:35:10.495  READ DMA EXT
  27 00 00 00 00 00 e0 00      12:35:10.494  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 02      12:35:10.474  IDENTIFY DEVICE
  ef 03 45 00 00 00 a0 02      12:35:10.460  SET FEATURES [set transfer mode]
  27 00 00 00 00 00 e0 00      12:35:10.444  READ NATIVE MAX ADDRESS EXT

Error 38 occurred at disk power-on lifetime: 10252 hours (427 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 78 36 01 00  Error: UNC at LBA = 0x00013678 = 79480

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 d8 47 34 01 e0 00      12:35:07.485  READ DMA EXT
  27 00 00 00 00 00 e0 00      12:35:07.484  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 02      12:35:07.464  IDENTIFY DEVICE
  ef 03 45 00 00 00 a0 02      12:35:07.453  SET FEATURES [set transfer mode]
  27 00 00 00 00 00 e0 00      12:35:07.434  READ NATIVE MAX ADDRESS EXT

SMART Self-test log structure revision number 1
Num  Test_Description    Status                     Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%        10448         -
# 2  Extended offline    Interrupted (host reset)      90%        10448         -
# 3  Extended offline    Aborted by host               90%        10448         -
# 4  Short offline       Completed without error       00%        10448         -
# 5  Short offline       Completed without error       00%        10448         -
# 6  Extended offline    Completed without error       00%        10416         -
# 7  Short offline       Aborted by host               30%        10413         -
# 8  Extended offline    Aborted by host               70%        10411         -
# 9  Extended offline    Aborted by host               90%        10410         -
#10  Short offline       Completed without error       00%        10410         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
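
As a rough aid to reading the attribute table in the report above, here is a minimal Python sketch that pulls the rows apart and reports two things: any attribute whose normalized VALUE has fallen to or below a nonzero THRESH, and the raw counts of a few attributes people commonly keep an eye on. The function names, the `WATCH` list, and the "nonzero THRESH" rule are assumptions chosen for illustration, not part of smartctl or unRAID, and a nonzero raw count is a prompt to look closer rather than a verdict on the drive.

```python
# A handful of rows copied from the report; whitespace between columns is arbitrary.
SAMPLE = """\
  1 Raw_Read_Error_Rate     0x000f   119   099   006    Pre-fail  Always       -       219208706
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
 10 Spin_Retry_Count        0x0013   100   098   097    Pre-fail  Always       -       1694
187 Reported_Uncorrect      0x0032   061   061   000    Old_age   Always       -       39
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       1
"""

# Hypothetical watch list (an assumption): attribute IDs whose raw value is a plain count.
WATCH = (5, 187, 197, 198, 199)


def parse_attributes(text):
    """Return {id: row dict} for every line that looks like an attribute row."""
    rows = {}
    for line in text.splitlines():
        parts = line.split()
        # Attribute rows start with a numeric ID and have at least ten columns.
        if len(parts) < 10 or not parts[0].isdigit():
            continue
        rows[int(parts[0])] = {
            "name": parts[1],
            "value": int(parts[3]),
            "worst": int(parts[4]),
            "thresh": int(parts[5]),
            "raw": " ".join(parts[9:]),
        }
    return rows


def at_or_below_threshold(rows):
    """Names of attributes whose VALUE is at or below a nonzero THRESH."""
    return [r["name"] for r in rows.values()
            if r["thresh"] > 0 and r["value"] <= r["thresh"]]


def nonzero_watch(rows):
    """Raw counts above zero for the attributes in WATCH."""
    counts = {}
    for attr_id in WATCH:
        if attr_id in rows:
            count = int(rows[attr_id]["raw"].split()[0])
            if count > 0:
                counts[attr_id] = count
    return counts


attrs = parse_attributes(SAMPLE)
print(at_or_below_threshold(attrs))
print(nonzero_watch(attrs))
```

On the rows above, no attribute is at or below its threshold, while Reported_Uncorrect (39) and UDMA_CRC_Error_Count (1) show nonzero raw counts.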
February 21, 2010

What exactly in the SMART report makes you think the drive is bad? Not sure I'm seeing what you are concerned about.

Joe L.
February 21, 2010

[Quote]I changed out a fan in the case and I had to loosen all the drives to get to it. Upon booting up I was going to do a parity check as a safety check and moments after starting the check stopped and drive2 got labeled with DISK_DSBL.[/Quote]
That indicates a "write" to the drive failed. You might have dislodged either the power or SATA connector to it.

[Quote]I shut down, reseated the power and data cables and booted back up. The drive is still labeled DISK_DSBL and red ball.[/Quote]
Yes, this is normal, even if the cable is now seated correctly. Since the "write" to the drive failed, the ONLY correct copy of what was written is in parity in combination with all the other data drives.

[Quote]It's set to read-only and I can see the folders and files on it.[/Quote]
The files you are able to see are a result of the "simulated" drive data as reconstructed from parity and all the other data drives. The physical drive is indeed disabled until you take the required steps to re-enable it and to reconstruct the data onto it. (Remember, a write to it failed, so you cannot just put it online.)

You must stop the array, un-assign the drive, and start the array with it un-assigned. (You'll still be able to see your files on it, as an un-assigned drive is treated exactly the same as a failed drive or a disabled drive.)

Un-assigning it and starting the array will cause unRAID to forget the serial number of the drive, so when you next stop the array and re-assign the drive it will think it is a new drive when next started.

Then, when you re-assign the drive and start the array a final time, it will reconstruct the contents onto the physical drive, including the "write" that had originally failed, putting its contents back as they should be. This will take 6 or more hours if you have a large array. You will not be parity protected from a second concurrent drive failure until it completes, so keep your hands out of the server and do not move any of the cables in it.

Whatever you do, DO NOT PRESS THE BUTTON LABELED "Restore", as it will immediately invalidate parity and make it impossible to reconstruct the old data on the disabled drive. It will be as if you asked the array to completely forget the disabled drive ever had any parity protection. If the disk has actually failed, there would be no way to get back your data. So DO NOT PRESS RESTORE. Only press "Start".

[Quote]I'm running the SMART tests. The short completed fine. I'm assuming the problem was one of the cables which are now seated properly. I had just finished several days of pre-clearing all my data drives and did parity sync followed by 3 parity checks so I believe the drive is in good condition. So, when it passes the long SMART test, what do I need to do to get it back into the array with parity all synced up?[/Quote]
Described above. To pass a long test you will need to disable disk spin-down; otherwise the explicit spin-down will terminate the long test. Long tests typically take 4 hours or so.
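
For anyone wondering how a disabled drive's files can still be visible, here is a minimal Python sketch of the idea behind the "simulated" drive described above. It assumes plain XOR parity across the data drives, which is how single-parity schemes are usually explained; the thread itself does not show unRAID's actual implementation, and `xor_blocks` and the byte strings standing in for drives are made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three tiny stand-in "data drives" (hypothetical contents).
disk1 = b"\x01\x02\x03\x04"
disk2 = b"\x10\x20\x30\x40"
disk3 = b"\xff\x00\xaa\x55"

# The parity drive holds the XOR of every data drive.
parity = xor_blocks([disk1, disk2, disk3])

# With disk2 disabled, XOR-ing parity with the surviving drives yields
# its contents again; a rebuild writes that result back to the physical disk.
simulated_disk2 = xor_blocks([disk1, disk3, parity])
print(simulated_disk2 == disk2)
```

The same arithmetic shows why a second failure during the rebuild is fatal under this assumption: with two drives missing, one parity value per position is not enough to recover both.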
February 21, 2010 (Author)

[Quote]What exactly in the SMART report makes you think the drive is bad? Not sure I'm seeing what you are concerned about? Joe L.[/Quote]
It looked like the tests were offline. I guessed that meant the drive was offline. Actually, I have no idea how to read the SMART tests. I can't even tell when one is still running or has finished, or, if it did finish, what it all means.
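
On telling whether a self-test is still running: the number in parentheses on the "Self-test execution status" line of a smartctl report packs two fields into one byte. As far as I understand the ATA convention that smartctl decodes, the high four bits are a status code and the low four bits are the portion of the test remaining, in tenths. The sketch below is an illustration under that assumption; `decode_self_test_status` and the wording of the messages are my own, not smartctl output.

```python
# Status codes in the high nibble (assumed from the ATA SMART convention).
STATUS_TEXT = {
    0: "completed without error (or no test has been run)",
    1: "aborted by the host",
    2: "interrupted by a host reset",
    3: "fatal or unknown error; test did not complete",
    15: "in progress",
}


def decode_self_test_status(byte):
    """Split the status byte into (description, percent of test remaining)."""
    code = (byte >> 4) & 0x0F
    remaining = (byte & 0x0F) * 10
    if 4 <= code <= 8:
        text = "completed, but a test element failed"
    else:
        text = STATUS_TEXT.get(code, "reserved code %d" % code)
    return text, remaining


# The report in this thread shows "( 25)".
print(decode_self_test_status(25))
```

For the value 25 from the report this gives "aborted by the host" with 90% remaining, which lines up with the "Aborted by host 90%" entry in the self-test log. A value whose high nibble is 15 (for example 249) would mean a test is still running.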
February 21, 2010 (Author)

Ok, I'm following your directions. I won't touch the machine until the rebuild is finished.

1. Stopped the array.
2. Unassigned the disabled disk2 drive.
3. Started the array (disk2 now shows unassigned).
4. Stopped the array.
5. Re-assigned the drive (disk2 now has a dark blue ball).

Paused to take a breath.

6. Checked the "I really want to do this" box under the Start button.
7. Pressed Start (disk2 now has an orange ball and the rebuild is in progress).

Ok, so I'm seeing unRAID in action. If this finishes does it qualify as an unRAID level 1 test?
February 22, 2010 (Author)

The rebuild worked, so I shut down and finished moving stuff in the case for now. I'm running a parity check after booting back up to make sure I have good cable connections. I replaced the 120mm intake fan with a quieter one and added one hard drive fan to cool the uppermost drive a little more. It dropped the temps a couple of degrees, and now it's quiet enough that I can hear the hard drives spinning.