January 1, 2016

Last night my server did its monthly full array check. Three hours into the check, the error log started filling with errors related to one drive.

Jan 1 03:00:13 Tower kernel: sd 3:0:3:0: [sdi] Unhandled error code (Errors)
Jan 1 03:00:13 Tower kernel: sd 3:0:3:0: [sdi] (Drive related)
Jan 1 03:00:13 Tower kernel: Result: hostbyte=0x04 driverbyte=0x00 (System)
Jan 1 03:00:13 Tower kernel: sd 3:0:3:0: [sdi] CDB: (Drive related)
Jan 1 03:00:13 Tower kernel: cdb[0]=0x2a: 2a 00 56 be 44 a7 00 00 08 00
Jan 1 03:00:13 Tower kernel: sd 3:0:3:0: [sdi] Unhandled error code (Errors)
Jan 1 03:00:13 Tower kernel: sd 3:0:3:0: [sdi] (Drive related)
Jan 1 03:00:13 Tower kernel: Result: hostbyte=0x04 driverbyte=0x00 (System)
Jan 1 03:00:13 Tower kernel: sd 3:0:3:0: [sdi] CDB: (Drive related)
Jan 1 03:00:13 Tower kernel: cdb[0]=0x2a: 2a 00 56 be 44 9f 00 00 08 00
Jan 1 03:00:13 Tower kernel: md: disk4 write error, sector=1455310784 (Errors)
Jan 1 03:00:13 Tower kernel: md: disk4 write error, sector=1455310776 (Errors)
Jan 1 03:00:13 Tower kernel: md: disk4 write error, sector=1455310768 (Errors)
Jan 1 03:00:13 Tower kernel: md: disk4 write error, sector=1455310760 (Errors)
Jan 1 03:00:13 Tower kernel: md: disk4 write error, sector=1455310752 (Errors)
Jan 1 03:00:13 Tower kernel: md: disk4 write error, sector=1455310744 (Errors)
Jan 1 03:00:13 Tower kernel: md: disk4 write error, sector=1455310736 (Errors)
Jan 1 03:00:13 Tower kernel: md: disk4 write error, sector=1455310728 (Errors)

When I try to run HDParm info against the drive, I get no information. The Smart Status Report gives me the following:

Smartctl: Device Read Identity Failed: Input/output error
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.

So is disk4 dead or dying? What are my next steps in verifying the failure and fixing it? Is this the correct replacement procedure?

1. Buy new drive
2. Preclear / test new drive on another machine. Stop array and shut down.
3. Replace drive, bring up system
4. Add new drive into disk 4's slot
5. Rebuild array while crossing fingers

On a related note, right now my drives are all 2 TB or smaller. Can I replace the failing drive with a larger drive and then also replace the parity with a larger drive? Or am I stuck, since I need the parity drive to be the same size or larger to rebuild the array?

syslog-2016-01-01.zip
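For anyone reading the `cdb[0]=0x2a` lines in the log above: the ten hex bytes are the SCSI command that failed. The following is a minimal sketch, assuming the standard 10-byte READ(10)/WRITE(10) layout (opcode in byte 0, logical block address in bytes 2-5, transfer length in bytes 7-8). The function name `decode_cdb10` and the small opcode table are illustrative only and not part of any unRAID tool.

```python
# Sketch: decode a 10-byte SCSI CDB as printed in the kernel log.
# Assumes the standard READ(10)/WRITE(10) field layout; names are illustrative.

OPCODES = {0x28: "READ(10)", 0x2A: "WRITE(10)"}


def decode_cdb10(hex_bytes: str) -> dict:
    """Decode a space-separated 10-byte CDB string into its main fields."""
    raw = bytes.fromhex(hex_bytes.replace(" ", ""))
    if len(raw) != 10:
        raise ValueError("expected a 10-byte CDB")
    return {
        "opcode": OPCODES.get(raw[0], hex(raw[0])),
        "lba": int.from_bytes(raw[2:6], "big"),     # bytes 2-5: logical block address
        "blocks": int.from_bytes(raw[7:9], "big"),  # bytes 7-8: transfer length in blocks
    }


cdb = decode_cdb10("2a 00 56 be 44 a7 00 00 08 00")
print(cdb["opcode"], cdb["lba"], cdb["blocks"])
```

Decoded this way, both logged commands are 8-block writes, which is consistent with the `md: disk4 write error` lines that follow them.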
January 1, 2016

No, you cannot add a larger data disk directly, as it violates the requirement that the parity disk be as large as or larger than the biggest data disk. However, there is a special procedure known as swap/disable, which replaces the parity disk with a larger one and then rebuilds the failed data disk onto what was previously the old parity disk. Read the instructions carefully, as they must be followed correctly for the process to succeed. Once the rebuild of the failed data drive is complete, you can then replace the rebuilt data drive with a larger disk.
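The size rule in the reply above can be written as a one-line check. This is only a sketch of the constraint, not of anything unRAID runs internally. The helper name `oversized_data_disks`, the disk names, and the sizes are hypothetical.

```python
# Sketch of the rule "parity must be at least as large as the biggest data disk".
# Disk names and sizes below are hypothetical examples.

TB = 10**12  # decimal terabyte, as used on drive labels


def oversized_data_disks(parity_bytes: int, data_disks: dict) -> list:
    """Return the names of data disks that are larger than the parity disk."""
    return sorted(name for name, size in data_disks.items() if size > parity_bytes)


# Proposed layout: disk4 replaced by a 4 TB drive.
proposed = {"disk1": 2 * TB, "disk2": 1 * TB, "disk4": 4 * TB}

print(oversized_data_disks(2 * TB, proposed))  # with the existing 2 TB parity
print(oversized_data_disks(4 * TB, proposed))  # after parity is upgraded to 4 TB
```

An empty result means the layout satisfies the rule. With the existing 2 TB parity, the hypothetical 4 TB disk4 is flagged.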
January 1, 2016 Author

So it sounds like my best bet would be to order a new drive of the same size, as I don't want to add complications. Will my array work normally (but degraded) until I can replace the failed disk 4? I know normal RAID 5 would still function in this scenario, but in a degraded state, as it uses the parity to calculate what should be on disk 4 whenever I access information. Does unRAID follow the same principle?
January 1, 2016

However, you may wish to wait for advice from people who are experienced with handling a data disk failure during a parity check. I am not sure whether the failure might have marked the parity disk as invalid, in which case you may need to perform an additional step first.
January 1, 2016

Yes, unRAID follows the same principle: the failed disk will be emulated by all the other disks plus parity, and performance will be degraded.
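To illustrate what "emulated by all the other disks plus parity" means, here is a minimal sketch assuming single XOR parity, where the parity block is the byte-wise XOR of every data disk. The 4-byte "disks" and the helper `xor_blocks` are made up for the example. Real disks are far larger and the real driver works very differently.

```python
# Sketch: rebuilding a missing disk's contents from the others, assuming single XOR parity.
# The tiny "disks" and the helper name are hypothetical.
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


disk1 = bytes([0x12, 0x34, 0x56, 0x78])
disk2 = bytes([0xAB, 0xCD, 0xEF, 0x01])
disk4 = bytes([0xDE, 0xAD, 0xBE, 0xEF])

# Parity as computed while all disks were healthy.
parity = xor_blocks([disk1, disk2, disk4])

# disk4 is now disabled: its contents are recomputed from the survivors plus parity.
emulated_disk4 = xor_blocks([disk1, disk2, parity])
print(emulated_disk4 == disk4)
```

Every read of the emulated disk needs a read from all the other disks, which is why performance drops. It is also why a second unreadable disk leaves the missing data unrecoverable.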
January 1, 2016

I would strongly suggest rebooting the system to see if the problem drive comes back online so that a SMART report can be obtained. More often than not, when a disk starts throwing errors it is caused by an external factor (such as a loose cable) and the drive itself is fine. If that is the case and the disk that was reporting errors is actually fine, then there may be valid alternative approaches going forward. For example:

1. Stop the array and back up the contents of the USB drive somewhere (e.g. your PC).
2. Do a New Config, assigning all the current data disks and a new, larger parity disk. For the parity disk, choose the maximum disk size you intend to use going forward, as the parity drive can never be smaller than the data drives.
3. Start the array to build new parity from the current set of data disks.
4. Keep the old parity disk intact in case anything goes wrong and you need to revert to the current configuration. If everything goes well, it can later be added to the array as an additional data disk if so desired.
January 1, 2016 Author

I will try rebooting the system later today to see if the drive comes back online so I can get the SMART report from it. Would it be better to run diagnostics from the server using the disk management tools, or to pull the drive and run the diagnostics on another machine?

Edit: I have also ordered a new 2 TB drive as a direct replacement; it should be here on Sunday.
January 1, 2016

The SMART reports go with the drive, so it is up to you whether you obtain them on the server or plug it in elsewhere to get the report. Do whatever is the most convenient.
January 2, 2016 Author

So I rebooted the server, and afterwards the drive still shows red status. I was able to run HDParm info, the Smart status report, and a short SMART test.

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       85
  3 Spin_Up_Time            0x0027   164   161   021    Pre-fail  Always       -       6800
  4 Start_Stop_Count        0x0032   096   096   000    Old_age   Always       -       4352
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   042   042   000    Old_age   Always       -       42720
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       69
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       37
193 Load_Cycle_Count        0x0032   159   159   000    Old_age   Always       -       125327
194 Temperature_Celsius     0x0022   123   107   000    Old_age   Always       -       27
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   199   000    Old_age   Offline      -       80

Full report is attached.

So on one hand, I don't see anything major in the SMART report, but it is odd that the drive still shows red status and that it does not show the temperature. I have a drive that should be delivered tomorrow, but it will take 2-3 days to preclear and check the disk. What should I do in the meantime? I have not written any new data to the server since the error. Is it okay to start the array? And if so, should I change disk 4 to "no device" in the device status?

HDPArm_Info.txt Smart_Status.txt syslog-2016-01-02.zip
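A report like the one above is easy to misread by eye, so here is a small sketch that scans attribute rows for non-zero raw values on a few sector-health attributes. It assumes the ten-column row layout shown in the post. The choice of watched attributes and the name `nonzero_watched` are this example's own assumptions, not an official list.

```python
# Sketch: flag non-zero raw values for selected SMART attributes.
# The WATCHED set and function name are assumptions made for this example.

WATCHED = {"Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable"}


def nonzero_watched(report: str) -> dict:
    """Return {attribute_name: raw_value} for watched attributes whose raw value is not 0."""
    found = {}
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows: ID, name, flag, value, worst, thresh, type, updated, when_failed, raw
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WATCHED:
            raw = int(fields[9])
            if raw != 0:
                found[fields[1]] = raw
    return found


sample = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
"""

print(nonzero_watched(sample))
```

Run against the three rows copied from the post, this reports only `Current_Pending_Sector` with a raw value of 2.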
January 2, 2016

unRAID will not use a drive it has disabled until it is rebuilt. Also, that drive has 2 pending sectors. See the wiki section "What do I do if I get a red X next to a hard disk?"
January 2, 2016 Author

So I am guessing the 2 pending sectors mean the drive is failing and should probably be replaced, correct? Can I just start the array, let unRAID ignore the drive for now, and run the array in a degraded state for the next 2-3 days until I add the new precleared drive? The array is only used at home, so I will not write any new data and can keep reads from the "emulated" drive to a minimum. Or would it be crucial to turn off the server until I add the new precleared drive and can rebuild?

Edit: I found Current_Pending_Sector in http://lime-technology.com/wiki/index.php/Troubleshooting#What_do_I_do_if_I_get_a_red_X_next_to_a_hard_disk.3F and it does sound pretty bad.

"An equally important attribute is the "Current_Pending_Sector", the RAW_VALUE is a count of suspect sectors pending reallocation. It should ALWAYS be zero and must be zero if the drive is to be used to reconstruct another. If it's not zero, then you will probably (but not always) see the Reallocated Sector Count increase in the future, when this does return to zero. Before remapping a suspect sector, it tests it one last time, and *may* pass it and not remap it. (There are good reasons why it is designed to work this way.)"
January 2, 2016

I think all of this is answered at the link I gave if you just keep reading, including pending sectors. I wouldn't necessarily give up on that drive.
January 2, 2016 Author

Yeah, I saw this section in the wiki:

Resolving a Pending Sector

"Pending sectors occur as a result of a read failures. An unreadable sector will interfere with the reconstruction of a failed drive. Pending sectors need to be cleared as soon as possible because 2 drives with unreadable sectors will most likely be unrecoverable within unRAID. Data disks with a small number of pending sectors should be fairly easy to recover with utilities in Linux or Windows and If anyone knows of a Mac utility that can recover Reiserfs please update this entry. The safest procedure is to replace the drive with a pre-cleared spare. The original drive can then be pre-cleared and the pending sector count should go to zero. The original drive can then be used as a spare. Multiple pre-clear cycles should not be required and the disk should be RMAed if 1 cycle doesn't work. If the drive cannot be returned then multiple cycles may restore the drive to a usable state. If no spare is available then follow the next procedure to re-enable the drive. The pending sector count should be zero after rebuilding. If not then replace."

So I think I will replace the drive with a precleared drive and then run the preclear on the old drive to see if it is actually failing. Given the relatively cheap cost of drives, though, I don't know if I would ever really trust that drive again. 80 bucks for the new drive is probably worth the peace of mind.

This whole thing is really making me question how to do better backups of my unRAID system. Alas, I don't think there really is a very good backup for an unRAID system besides a second system.
January 2, 2016

Sounds reasonable, and probably what I would do too, since I can afford it.

As for backups, some do have a second system; others just choose what is important and back that up. unRAID parity is not a backup in any sense. The only thing that counts as a backup of a file is another copy of the file.

It makes sense to have priorities if you aren't able or willing to keep another copy of everything. First priority is anything that is truly irreplaceable, like personal documents, photos, and videos: anything you have created yourself and that therefore has no other source.