May 15, 2015 I recently had a drive crash hard with write errors. It was immediately red-balled by unRAID, and a few minutes later it dropped off the bus completely. I keep a warm spare, so I stopped the array, assigned the warm spare to the failed drive's slot, and restarted the array. However, I'm constantly getting warnings about an error with that disk having invalid data. I mean, it did just start rebuilding, so I suppose it's technically correct, but it seems unnecessarily dire. The disk should be simulated until the rebuild is complete. Is this expected behavior? I'm on 5.0.6, and a screenshot of the error is attached. If it's not expected, I'll work to pull a log (it's a bit unresponsive at the moment).
May 15, 2015 Author So, some bad news. Combing through the syslog, it seems my parity drive threw 17 errors when starting the rebuild process. That drive is fairly new, has never shown any adverse effects save for a loose cable awhile back, and just passed a full SMART test a few days ago. So, it would appear I've had a multi-drive failure. Any suggestions at this point? The rebuild is still going fine without complaint since those errors at start-up. Should I let it proceed? Start the rebuild over somehow and try again, maybe it was a fluke? Any suggestions? Kind of freaking out and pretty bummed even being vigilant and maintaining a warm spare might not have saved me from data loss. Wish a second parity drive was an option. Syslog attached. Here's the smart report for the parity drive (the drive in slot 2 that failed is completely offline, so no report available)

smartctl -a -d ata /dev/sdj (parity)
smartctl 6.2 2013-07-26 r3841 [i686-linux-3.9.11p-unRAID] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD40EFRX-68WT0N0
Serial Number:    WD-WCC4E7V47FA9
LU WWN Device Id: 5 0014ee 20a67ffe1
Firmware Version: 80.00A80
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Thu May 14 23:16:51 2015 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x00)
    Offline data collection activity was never started.
    Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0)
    The previous self-test routine completed without error or no self-test has ever been run.
Total time to complete Offline data collection: (52560) seconds.
Offline data collection capabilities: (0x7b)
    SMART execute Offline immediate.
    Auto Offline data collection on/off support.
    Suspend Offline collection upon new command.
    Offline surface scan supported.
    Self-test supported.
    Conveyance Self-test supported.
    Selective Self-test supported.
SMART capabilities: (0x0003)
    Saves SMART data before entering power-saving mode.
    Supports SMART auto save timer.
Error logging capability: (0x01)
    Error logging supported.
    General Purpose Logging supported.
Short self-test routine recommended polling time: ( 2) minutes.
Extended self-test routine recommended polling time: ( 526) minutes.
Conveyance self-test routine recommended polling time: ( 5) minutes.
SCT capabilities: (0x703d)
    SCT Status supported.
    SCT Error Recovery Control supported.
    SCT Feature Control supported.
    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       58
  3 Spin_Up_Time            0x0027   182   182   021    Pre-fail  Always       -       7858
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       26
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   090   090   000    Old_age   Always       -       7347
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       6
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       1
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       114
194 Temperature_Celsius     0x0022   122   117   000    Old_age   Always       -       30
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

SMART Error Log Version: 1
ATA Error Count: 1
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes, SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1 occurred at disk power-on lifetime: 4879 hours (203 days + 7 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 38 b8 75 73 e1  Error: UNC 56 sectors at LBA = 0x017375b8 = 24343992

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 38 a8 75 73 e1 08   4d+18:18:59.015  READ DMA
  c8 00 28 80 75 73 e1 08   4d+18:18:59.015  READ DMA
  c8 00 28 58 75 73 e1 08   4d+18:18:59.015  READ DMA
  c8 00 28 30 75 73 e1 08   4d+18:18:59.015  READ DMA
  c8 00 d8 58 74 73 e1 08   4d+18:18:59.014  READ DMA

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Conveyance offline  Completed without error       00%      7008         -
# 2  Short offline       Completed without error       00%      7008         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

syslog-2015-05-14.txt.zip
May 15, 2015 The parity drive is reporting 1 pending sector, which would have resulted in a read failure during the rebuild. With any luck that will have affected just one sector on the rebuilt drive, so that although a specific file might end up corrupt, the rest of the data is fine. You will have to wait and see what the final result of the rebuild is. You will want to get the pending sector on the parity drive 'fixed' for reliable operation going forward, but I would suggest that can wait until you have recovered the failed drive.
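For anyone who wants to pull the relevant counters out of a report like the one above without reading the whole table, here is a minimal sketch. It assumes the attribute table has the ten-column layout shown in the SMART report in this thread; the function names and the choice of watched attribute IDs (5, 197, 198) are illustrative, not something unRAID or smartctl provides.

```python
import re

# Attributes that usually matter for failing media; names as printed by smartctl.
WATCHED = {
    5: "Reallocated_Sector_Ct",
    197: "Current_Pending_Sector",
    198: "Offline_Uncorrectable",
}

# ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
ROW = re.compile(
    r"\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]{4}\s+\d+\s+\d+\s+\d+\s+\S+\s+\S+\s+\S+\s+(\d+)"
)


def parse_smart_attributes(report):
    """Return {attribute_id: raw_value} for every table row found in the report."""
    attrs = {}
    for line in report.splitlines():
        match = ROW.match(line)
        if match:
            attrs[int(match.group(1))] = int(match.group(3))
    return attrs


def media_warnings(attrs):
    """List the watched attributes whose raw value is non-zero."""
    return [f"{name} = {attrs[i]}" for i, name in WATCHED.items() if attrs.get(i, 0) > 0]


# Three rows copied from the parity drive's report in this thread.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
"""

if __name__ == "__main__":
    print(media_warnings(parse_smart_attributes(SAMPLE)))
```

Run against the rows above, the only attribute flagged is the single pending sector, which matches the reading given in the reply.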
May 15, 2015 Author Assuming the remainder of the rebuild goes off without issue, is there a way to determine which file was the unlucky recipient of the failure? Also, in the syslog the parity drive reported errors on a number of sectors (17 in total). True, only one pending sector is listed. Does that mean subsequent reads from those other sectors eventually succeeded and were able to provide valid data? Here's the relevant section where the parity drive crapped out:

May 14 20:57:56 Hyperion kernel: ata10.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
May 14 20:57:56 Hyperion kernel: ata10.00: irq_stat 0x40000001
May 14 20:57:56 Hyperion kernel: ata10.00: failed command: READ DMA EXT
May 14 20:57:56 Hyperion kernel: ata10.00: cmd 25/00:00:68:82:73/00:02:01:00:00/e0 tag 0 dma 262144 in
May 14 20:57:56 Hyperion kernel: res 51/40:7f:e0:83:73/00:00:01:00:00/e0 Emask 0x9 (media error)
May 14 20:57:56 Hyperion kernel: ata10.00: status: { DRDY ERR }
May 14 20:57:56 Hyperion kernel: ata10.00: error: { UNC }
May 14 20:57:56 Hyperion kernel: ata10.00: configured for UDMA/133
May 14 20:57:56 Hyperion kernel: sd 11:0:0:0: [sdj] Unhandled sense code
May 14 20:57:56 Hyperion kernel: sd 11:0:0:0: [sdj]
May 14 20:57:56 Hyperion kernel: Result: hostbyte=0x00 driverbyte=0x08
May 14 20:57:56 Hyperion kernel: sd 11:0:0:0: [sdj]
May 14 20:57:56 Hyperion kernel: Sense Key : 0x3 [current] [descriptor]
May 14 20:57:56 Hyperion kernel: Descriptor sense data with sense descriptors (in hex):
May 14 20:57:56 Hyperion kernel: 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
May 14 20:57:56 Hyperion kernel: 01 73 83 e0
May 14 20:57:56 Hyperion kernel: sd 11:0:0:0: [sdj]
May 14 20:57:56 Hyperion kernel: ASC=0x11 ASCQ=0x4
May 14 20:57:56 Hyperion kernel: sd 11:0:0:0: [sdj] CDB:
May 14 20:57:56 Hyperion kernel: cdb[0]=0x88: 88 00 00 00 00 00 01 73 82 68 00 00 02 00 00 00
May 14 20:57:56 Hyperion kernel: end_request: I/O error, dev sdj, sector 24347616
May 14 20:57:56 Hyperion kernel: ata10: EH complete
May 14 20:57:56 Hyperion kernel: md: disk0 read error, sector=24347552
May 14 20:57:56 Hyperion kernel: md: multiple disk errors, sector=24347552
May 14 20:57:56 Hyperion kernel: md: disk0 read error, sector=24347560
May 14 20:57:56 Hyperion kernel: md: multiple disk errors, sector=24347560
May 14 20:57:56 Hyperion kernel: md: disk0 read error, sector=24347568
May 14 20:57:56 Hyperion kernel: md: multiple disk errors, sector=24347568
May 14 20:57:56 Hyperion kernel: md: disk0 read error, sector=24347576
May 14 20:57:56 Hyperion kernel: md: multiple disk errors, sector=24347576
May 14 20:57:56 Hyperion kernel: md: disk0 read error, sector=24347584
May 14 20:57:56 Hyperion kernel: md: multiple disk errors, sector=24347584
May 14 20:57:56 Hyperion kernel: md: disk0 read error, sector=24347592
May 14 20:57:56 Hyperion kernel: md: multiple disk errors, sector=24347592
May 14 20:57:56 Hyperion kernel: md: disk0 read error, sector=24347600
May 14 20:57:56 Hyperion kernel: md: multiple disk errors, sector=24347600
May 14 20:57:56 Hyperion kernel: md: disk0 read error, sector=24347608
May 14 20:57:56 Hyperion kernel: md: multiple disk errors, sector=24347608
May 14 20:57:56 Hyperion kernel: md: disk0 read error, sector=24347616
May 14 20:57:56 Hyperion kernel: md: multiple disk errors, sector=24347616
May 14 20:57:56 Hyperion kernel: md: disk0 read error, sector=24347624
May 14 20:57:56 Hyperion kernel: md: multiple disk errors, sector=24347624
May 14 20:57:56 Hyperion kernel: md: disk0 read error, sector=24347632
May 14 20:57:56 Hyperion kernel: md: multiple disk errors, sector=24347632
May 14 20:57:56 Hyperion kernel: md: disk0 read error, sector=24347640
May 14 20:57:56 Hyperion kernel: md: multiple disk errors, sector=24347640
May 14 20:57:56 Hyperion kernel: md: disk0 read error, sector=24347648
May 14 20:57:56 Hyperion kernel: md: multiple disk errors, sector=24347648
May 14 20:57:56 Hyperion kernel: md: disk0 read error, sector=24347656
May 14 20:57:56 Hyperion kernel: md: multiple disk errors, sector=24347656
May 14 20:57:56 Hyperion kernel: md: disk0 read error, sector=24347664
May 14 20:57:56 Hyperion kernel: md: multiple disk errors, sector=24347664
May 14 20:57:56 Hyperion kernel: md: disk0 read error, sector=24347672
May 14 20:57:56 Hyperion kernel: md: multiple disk errors, sector=24347672
May 14 20:57:56 Hyperion kernel: md: disk0 read error, sector=24347680
May 14 20:57:56 Hyperion kernel: md: multiple disk errors, sector=24347680

I'm now about 500GB into a 4TB recovery (should speed up substantially once past 2TB) with no further errors.
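Counting those md lines by hand is error-prone, so here is a rough sketch of how the excerpt could be summarized. It only assumes the `md: diskN read error, sector=...` wording seen in this syslog; the function names are made up for illustration, and the step of 8 sectors is simply the spacing observed between consecutive entries in the excerpt.

```python
import re

MD_READ_ERROR = re.compile(r"md: disk(\d+) read error, sector=(\d+)")


def read_error_sectors(syslog_text, disk=0):
    """Return the sorted, de-duplicated sectors reported as read errors for one md disk."""
    return sorted(
        {int(sector) for d, sector in MD_READ_ERROR.findall(syslog_text) if int(d) == disk}
    )


def contiguous_runs(sectors, step=8):
    """Group sorted sectors into (first, last) runs whose neighbors differ by `step`."""
    runs = []
    for sector in sectors:
        if runs and sector - runs[-1][1] == step:
            runs[-1][1] = sector
        else:
            runs.append([sector, sector])
    return [tuple(run) for run in runs]


def hex_bytes_to_int(text):
    """Decode space-separated hex bytes (big-endian), e.g. the tail of the sense data."""
    return int(text.replace(" ", ""), 16)


# A few entries from the excerpt; the pattern works on flattened text too.
SAMPLE = (
    "May 14 20:57:56 Hyperion kernel: md: disk0 read error, sector=24347552 "
    "May 14 20:57:56 Hyperion kernel: md: multiple disk errors, sector=24347552 "
    "May 14 20:57:56 Hyperion kernel: md: disk0 read error, sector=24347560 "
)

if __name__ == "__main__":
    sectors = read_error_sectors(SAMPLE)
    print(len(sectors), contiguous_runs(sectors))
    print(hex_bytes_to_int("01 73 83 e0"))
```

Fed the full excerpt, this should report 17 sectors in a single run from 24347552 to 24347680, i.e. one contiguous stretch rather than 17 scattered failures. The trailing sense bytes `01 73 83 e0` decode to 24347616, the same number as the `end_request` sector, so the kernel is pointing at one failing LBA inside a single 512-sector read.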
May 15, 2015 As far as I know there is no easy way to map a particular sector on a disk to the files it might affect. This is why many of the more experienced users suggest that you keep checksums of all your files so that you can detect which files have corruption. You can then restore these files from your backups. Also, I have no idea whether a single pending sector can cause multiple sector read failures. I have a feeling that the mapping between physical media and logical sectors varies with disk model. As I understand it, modern drives tend to work with much larger physical blocks, so that multiple sectors can be mapped to one physical block.
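The checksum suggestion above can be sketched in a few lines. This is only an illustration under assumptions: SHA-256 is one reasonable hash choice, the function names are hypothetical, and the manifest is kept in memory as a dict where a real setup would store it somewhere off the array.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large media files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        str(path.relative_to(root)): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_files(root, manifest):
    """Return files from the manifest that are now missing or hash differently."""
    current = build_manifest(root)
    return sorted(name for name, digest in manifest.items() if current.get(name) != digest)
```

The idea is to call `build_manifest` while the array is known to be healthy, then run `changed_files` against the rebuilt disk. Any name it returns is a candidate for restoring from backup. Files that were legitimately edited since the manifest was taken will also show up, so the manifest needs refreshing after intentional changes.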
May 15, 2015 Author It turns out there is a pretty easy (but tedious) way to figure out which files are mapped to specific sectors, but it would have to have been a data disk ( http://unix.stackexchange.com/questions/171022/how-to-find-out-which-file-is-affected-by-a-bad-sector ). Since it's the parity, I'm not sure there's any way to figure out what block on the parity drive corresponds to a block on a data drive, though in this case they are the exact same manufacturer/model/size so maybe there is. So a pending sector means there was a read error and the disk hasn't decided what to do with it yet. It could be weak, or it could have failed; we won't know until a subsequent read. So after the rebuild passes, the next step would be to run a non-correcting parity check? Then if pending sectors goes back to zero and reallocated sectors stays at zero, it successfully read the data. In that case we can maybe run a parity check with corrections turned on (assuming unRAID knows disk2 is the dirty disk, which it appears to). If pending stays high, the sector is probably lost. What then? Pull the drive out of the array and pre_clear it so the sector is written and reallocated?
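Tools that map blocks to files generally want a filesystem block number rather than a raw disk sector, so the first step of that tedious process is a small conversion. The sketch below shows only that arithmetic. The partition start and the 4096-byte block size are assumptions to be replaced with the real values for the disk in question (for example from the partition table and the filesystem's own tools); nothing here is specific to unRAID.

```python
def sector_to_fs_block(disk_sector, partition_start, sector_size=512, block_size=4096):
    """Convert a whole-disk sector number to a block index inside the partition's filesystem.

    partition_start is the first sector of the partition; both it and
    block_size are assumptions that must match the actual disk.
    """
    if disk_sector < partition_start:
        raise ValueError("sector lies before the start of the partition")
    byte_offset = (disk_sector - partition_start) * sector_size
    return byte_offset // block_size


if __name__ == "__main__":
    # Hypothetical partition start of 64 sectors, used purely as an example value.
    print(sector_to_fs_block(24347616, partition_start=64))
```

With the example values, disk sector 24347616 lands in filesystem block 3043444. The result is only meaningful on a data disk, since the parity disk carries no filesystem of its own.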
May 15, 2015 http://lime-technology.com/wiki/index.php/Troubleshooting#Resolving_a_Pending_Sector
May 15, 2015 Author I read that, but wasn't sure if the advice is different if it's a parity volume, which it seems to me is a special case. Unless I misunderstand, there you can very easily force a write to the pending sector by merely running a correcting parity check, which has the advantage of not leaving your array vulnerable while the parity disk is being rebuilt from your spare.
May 15, 2015 It's not a special case at all. The advice applies to any array drive. But you are correct that a special fix may work for correcting a pending sector on the parity disk. A correcting parity check may fix the problem.
May 15, 2015 Author Wait, I'm sorry, I'm confused. You said it's not a special case but then went on to describe why it is or might be a special case?
May 15, 2015
1. The advice given in the wiki is correct for any array disk, and the solutions given there are applicable to a data or parity disk.
2. Additionally, a pending sector on the parity disk may be corrected during a correcting parity check. This fix is only applicable to the case when a parity disk has a pending sector.
Both of these statements are true. You can safely ignore statement 2. The advice given in the wiki is correct and applicable to your situation. Please read the instructions fully and follow them carefully.
May 15, 2015 Author Sorry, I'm only getting more confused. You say for (2) this fix is only applicable to the case when a parity disk has a pending sector, which as noted above is indeed the current situation. Then you say I can safely ignore statement 2 and to perform 1, which is the opposite. Clearly I'm missing something really obvious here. Maybe a recap of the current situation would help?
* 2 weeks ago a parity check completed with no errors and no SMART warnings
* Yesterday disk2 failed during a write and was red-balled
* Last night I assigned a warm pre-cleared spare to the failed drive and started the rebuild process
* Early in the rebuild process (<2%), the parity disk (disk0) had 17 read errors, ultimately resulting in 1 pending sector
* Rebuild is ongoing, now at 20%, with no further issues so far.
May 15, 2015 You have a disk with a pending sector. The wiki describes the correct options to resolve the problem. These may not be the only options that solve the problem. I strongly suggest that you follow the instructions in the wiki. If there is a particular section of the wiki that is not clear, please point it out.
May 15, 2015 Concerning the "disk has invalid data" notification at the top, I've asked around, and that notification happens when the ball color changes to yellow, which is normal for the rebuild. It does NOT mean invalid data was discovered, but that the drive is not considered valid until the rebuild is complete. Since this is v5, there's little hope that that notification will be corrected.
May 15, 2015 Author Nope, the wiki is clear; it's just some of your statements here that confused me. Anyway, it seemed to me like the parity disk was a special case, you confirmed it is indeed a special case that could potentially be resolved by a correcting parity check, and despite that you'd suggest I replace the drive with a spare via the usual process anyway. From what you've said, I guess you think it's worth the risk of leaving the array unprotected for 3 days over the parity-correcting option.
Regarding the "disk has invalid data" notification: thanks for the confirmation. That is helpful.
May 15, 2015 I don't understand this statement. There is no way to run a parity check with the array in its current state. It's only worth letting the rebuild run to completion if you desire to preserve the data in the array.
May 15, 2015 Author I think we've found the disconnect! Yes, as noted in Post #5, the question was intended to be about what to do once the rebuild of disk2 is complete (currently chugging along at 25%). Sorry if that wasn't clear. Once disk2 is back up and running, best case I know the array still has at least 1 sector on the parity drive that needs attention. From there it seems there are two options:
1) Replace the parity drive with a spare and preclear it in an attempt to force that pending sector to reallocate. If successful, keep it as a spare. This is the strongly preferred method, but the array is vulnerable during the rebuild.
2) Run a correcting parity check. With luck the parity mismatch on that sector will be noted, and new parity will be calculated based on the remainder of the array. The write will trigger the drive to reallocate that sector so it's no longer pending.
Either way, data in that sector on disk2 is toast.
It's probably worth noting I have two more of these drives in the protected array. They all came from different vendors, but have similar power on time. They were also scheduled to be retired and replaced with new drives next week, which was intended merely for capacity reasons. There might be some advantage to replacing those as soon as possible given the nature of disk2's failure.
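To make the reasoning behind option 2 concrete, here is a toy model of single parity. It assumes parity is a plain bytewise XOR across the data disks, which is how unRAID's single-parity scheme is generally described; the block contents and function names are invented for the example and this is not unRAID's actual code.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equally sized blocks (one block per disk)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(parity, surviving):
    """Reconstruct the one missing data block from parity plus the surviving data blocks."""
    return xor_blocks([parity, *surviving])


if __name__ == "__main__":
    disk1 = bytes([0x0F, 0xF0, 0xAA])
    disk2 = bytes([0x33, 0x55, 0x01])  # stands in for the failed disk
    disk3 = bytes([0xC3, 0x3C, 0xFF])

    parity = xor_blocks([disk1, disk2, disk3])

    # A rebuild needs a readable parity block to recover disk2.
    assert rebuild_missing(parity, [disk1, disk3]) == disk2

    # A correcting check recomputes parity purely from the data disks,
    # so it can overwrite a bad parity block without ever reading it.
    assert xor_blocks([disk1, disk2, disk3]) == parity
```

The two assertions mirror the two halves of the post: while disk2 is being rebuilt, an unreadable parity block leaves nothing to reconstruct that stretch from, but once every data disk is present the parity block can be regenerated and rewritten from the data alone.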
May 17, 2015 Author Quick update. Rebuild succeeded without further issue. Started a non-correcting parity check to see if a subsequent read to that pending sector would succeed, which it did. Pending Sectors is now 0 and so is Reallocated Sectors. The error occurred very early on in the check, at about 7GB, so I figured I'd cancel the parity check and restart it to see if it would fail again. This time the sectors that previously had read errors in the syslog succeeded for a second time, but I got errors on some nearby sectors (at ~8GB) in the syslog that must have eventually succeeded on retries (Pending Sectors/Reallocated Sectors still at 0). So, intermittent read errors on that parity disk aren't reassuring, especially since they're not showing up in the SMART reports after succeeding on retries. I'm going to let this parity check run to completion, but it doesn't appear the drive can be trusted, and it will need to be replaced or at least pre-cleared extensively. I have a new drive pre-clearing now, and another two drives arriving tomorrow. I'll probably replace the parity disk and one of the 2TB WD EARS that match the one that just failed, keeping the third as a spare. Hopefully that will work for awhile.