steve1977 Posted November 11, 2017
I am in the process of upgrading my parity disk to a larger one. The rebuild is ongoing. All disks show "green", but disk 3 shows 128 errors. The SMART test looks OK. Does it make sense to let the parity rebuild complete? Any advice on what to do?
testdasi Posted November 13, 2017
What kind of errors?
steve1977 Posted November 13, 2017
Disk read errors. The disk has not been disabled and the parity rebuild is still ongoing. Error log below:
https://pastebin.com/hKRjCEGN
JorgeB Posted November 13, 2017
Without the diagnostics it's hard to say more, but you'll at least need to run a correcting parity check, since the current parity is not 100% valid.
steve1977 Posted November 13, 2017
Thanks. How do I trigger a "correcting parity" check? Do I need to rebuild, or just run another parity check after this one has completed? It will still take another 12 hours for the parity build to complete. I will wait for your guidance before replacing another drive. Diagnostics attached. Thanks for your help!
Attachment: tower-diagnostics-20171113-1711.zip
JorgeB Posted November 13, 2017
It looks like a disk problem:
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 1
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 1
You can confirm by running an extended SMART test.
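A minimal sketch of how the two attributes quoted above could be read from the command line. The helper name `smart_raw` and the `/dev/sdX` placeholder are assumptions for illustration; the helper only parses `smartctl -A` output, where the raw value is the tenth column.

```shell
# Hypothetical helper: print the raw value of one SMART attribute.
# Reads `smartctl -A` output on stdin; the raw value is the 10th column.
smart_raw() {
    awk -v name="$1" '$2 == name { print $10 }'
}

# Usage sketch (replace /dev/sdX with the actual device):
#   smartctl -t long /dev/sdX        # start the extended self-test
#   smartctl -l selftest /dev/sdX    # read the result once it has finished
#   smartctl -A /dev/sdX | smart_raw Current_Pending_Sector
```

A non-zero value for either attribute is the signal discussed in this thread. It does not by itself say how many files are affected.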
steve1977 Posted November 13, 2017
Got it. So, do I need to switch to a new disk? Is my array still safe as long as the ongoing parity build completes? As mentioned, the drive still shows "green" and the parity rebuild is underway.
JorgeB Posted November 13, 2017
steve1977 said: "So, need to switch to a new disk?"
Yes, if it fails the extended SMART test.
steve1977 said: "Is my array still safe as long as the currently ongoing parity built will complete? As mentioned, the drive still shows "green" and the parity rebuild is underway."
No, because as I mentioned earlier, parity is not valid.
steve1977 Posted November 13, 2017
What does this mean for me? Right now (during the parity rebuild), I have access to all my data. What part do you expect I'll lose? Is there anything I can do to prevent data loss? I'll run the extended SMART test once the parity is fully built.
JorgeB Posted November 13, 2017
And don't run a correcting parity check now with a suspected bad disk. Post the SMART report after the test is complete.
pwm Posted November 13, 2017
Note that a disk reporting one offline uncorrectable sector isn't necessarily toast. But statistically, it means there is an increased probability that the drive will fail further, or totally, within a limited time span. Some disks simply get a bad sector because of a surface defect that wasn't noticed during the original factory scan. The danger is that the problem isn't just a tiny spot but a larger bad surface area, or an issue with the head or other parts of the drive, in which case the drive is dangerous to keep using.
It also means there is one sector that can't be read correctly because the error correction code (ECC) for that sector isn't enough to correct the bit errors. If you already know the contents of that sector and try to overwrite it, the disk can use a spare sector to store the correct data, giving your RAID a full set of disks with all correct data again.
As johnnie.black notes, you most definitely do not want to rebuild your parity at this stage, since the current parity is one way to recompute the contents that should have been stored in the offline uncorrectable sector (unless you happen to have a backup of the file that uses this specific disk sector).
Anyway, after an extended SMART scan, the disk will be able to tell which sector it finds the first error on. The scan might also increase the number of bad sectors reported.
steve1977 Posted November 14, 2017
Thanks. As you anticipated, the SMART test failed. Please find the report attached. Please advise what to do next. Replace the drive and rebuild?
Attachment: tower-smart-20171114-0210.zip
pwm Posted November 14, 2017
The result was as expected: the offline uncorrectable sector will stay uncorrectable, and only a direct write to that address has a chance of clearing the error counter. You did learn that the drive found no more errors over the first 60% of the surface, and you got the address of the uncorrectable sector: LBA 91525368.
I would recommend a selective test that starts from the next sector and scans the rest of the drive to see if more errors show up. If you connect using ssh, you can run smartctl and specify
smartctl -t select,91525369-max /dev/<drive>
The drive will then continue from the first sector after the error to the end of the drive.
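The selective test above can be sketched as a small helper that derives the start sector from the failing LBA. The function name `selective_test_cmd` and the `/dev/sdX` placeholder are assumptions for illustration; the helper only prints the command so it can be reviewed before running it.

```shell
# Hypothetical helper: build the selective-test command that starts at the
# sector after a known bad LBA and runs to the end of the drive.
selective_test_cmd() {
    bad_lba=$1
    device=$2
    printf 'smartctl -t select,%s-max %s\n' "$((bad_lba + 1))" "$device"
}

# Usage sketch (/dev/sdX is a placeholder for the real device):
#   selective_test_cmd 91525368 /dev/sdX   # print the command to review it
#   smartctl -l selective /dev/sdX         # later: show the selective test log
```

If the selective test stops at another bad sector, the same helper can be called again with the newly reported LBA.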
steve1977 Posted November 14, 2017
Thanks. Let me do this later and get back to you. I don't mind if I lose a few files from this disk, but it would be a pain to lose the full disk.
JorgeB Posted November 14, 2017
If you have the space, you can just copy everything from disk3 to other disk(s). The file(s) on the damaged sector(s) will give an I/O error; restore those from backup. With some luck it will be only 1 or 2.
steve1977 Posted November 14, 2017
I don't have the space per se, but could find a way. Would this give me a better result than just replacing the disk?
JorgeB Posted November 14, 2017
steve1977 said: "Would this give me better result than just replacing the disk?"
If parity has finished syncing, replacing the disk is also an option, but if you don't have checksums (or the disk is btrfs), the rebuilt disk will have some corrupt file(s) and you'll have no way of knowing which ones. By copying/moving the data manually you'll know which files need to be restored.
steve1977 Posted November 14, 2017
Got it. So copying may indeed be preferred. Could chkdsk or a variant thereof also be an option?
JorgeB Posted November 14, 2017
steve1977 said: "Could chkdsk or a variance thereof also be an option?"
Not sure what you mean. If you mean marking the bad sectors, there's no equivalent. Also, that would get the same end result as a rebuild: corrupt files.
steve1977 Posted November 14, 2017
I was thinking of fixing or deleting the files within the bad sectors. Copying will take a lot longer, and copying from corrupt disks typically turns out to be difficult. So I am trying to avoid it, but I don't like the idea of having some corrupt files and not knowing which ones they are.
JorgeB Posted November 14, 2017
That's why I think having checksums or using an auto checksum filesystem is highly recommended, for situations like this.
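As a rough illustration of what "having checksums" can look like, here is a minimal sketch using `sha256sum`. The helper names `make_manifest` and `verify_manifest` are hypothetical, this is a generic approach rather than a built-in unRAID feature, and the manifest path must be absolute because both helpers change directory.

```shell
# Hypothetical helpers: record and later verify SHA-256 checksums for a tree.
make_manifest() {
    # $1 = directory to scan, $2 = absolute path of the manifest to write
    # (keep the manifest outside $1 so it does not list itself)
    ( cd "$1" && find . -type f -exec sha256sum {} + ) > "$2"
}

verify_manifest() {
    # $1 = directory to check, $2 = absolute path of the manifest
    # Returns non-zero if any file is missing or its content changed.
    ( cd "$1" && sha256sum -c "$2" )
}

# Usage sketch (paths are assumptions):
#   make_manifest /mnt/disk3 /boot/disk3.sha256      # while the disk is healthy
#   verify_manifest /mnt/disk3 /boot/disk3.sha256    # after a rebuild
```

With a manifest taken before the problem appeared, a rebuilt disk can be checked file by file instead of guessing which files are corrupt.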
pwm Posted November 14, 2017
Copying from the corrupt disk shouldn't be problematic. If you use rsync, for example, you can have it continue with other files after a read error. And unless the last 40% of the drive has more errors, only a single file will fail to copy. Another advantage of rsync is that it is well suited to restarting the copy if it gets interrupted for some reason.
It is normally also possible to look up which file is using the specific LBA that the SMART test indicated. Exactly how to do that depends on the file system in use. If this is a file you have a backup of or do not care about, you can overwrite the file and have a high probability of zeroing the unrecoverable sector count.
Once you have read out all the data you can recover from the problematic disk, you could also have unRAID restore the damaged file by replacing the problematic disk with a new disk and letting unRAID recompute the content from the other parity and data disks.
The main thing is that you want to keep as much redundancy as possible for as long as possible. Rebuilding the parity now would make the parity computed based on the failed sector(s) of the problematic disk. And replacing the disk will leave you vulnerable to other issues for the full time until unRAID has recovered the full content using the parity data.
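A minimal sketch of the rsync approach described above. The destination mount point `/mnt/disks/spare`, the log path, and the helper name `rsync_exit_meaning` are assumptions for illustration. rsync copies rather than moves, so the source disk is left untouched, and its exit code tells you whether any file was skipped.

```shell
# Hypothetical helper: translate the rsync exit codes that matter here.
rsync_exit_meaning() {
    case "$1" in
        0)  echo "all files copied" ;;
        23) echo "partial transfer: some files could not be copied" ;;
        24) echo "partial transfer: some source files vanished" ;;
        *)  echo "rsync failed with exit code $1" ;;
    esac
}

# Usage sketch; /mnt/disk3 is the suspect disk and /mnt/disks/spare is an
# assumed mount point for the spare disk:
#   rsync -av /mnt/disk3/ /mnt/disks/spare/ 2> /tmp/rsync-errors.log
#   rsync_exit_meaning $?
#   cat /tmp/rsync-errors.log     # files that hit read errors are named here
```

Re-running the same rsync command after an interruption skips files that were already copied, which is the restart behaviour mentioned above.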
steve1977 Posted November 14, 2017
Let me clarify: my parity rebuild has completed (the SMART test was run after the rebuild finished). Actually, the error only occurred during the rebuild.
If I were to move the files to a UD, this would take quite some hours. If I were to replace the drive, I would need to rebuild the parity again, then delete the disk and then copy it back.
Also, how is rsync different from "mv -r"? I am not fully clear on the exact suggested next steps.
Also, why would chkdsk or scandisk not identify corrupt files? I could then just delete the files and rebuild the disk using parity.
JorgeB Posted November 14, 2017
steve1977 said: "Let me clarify - my parity rebuild has completed (the SMART was done after the rebuild created). Actually the error only occured during rebuild."
And like I already mentioned, this is why current parity is not 100% valid.
steve1977 said: "If I were to move the file to an UD, this would take quite some hours. If I were to replace the drive, I would need to rebuild the parity again, then delete the disk and then copy it back"
Not quite following here.
steve1977 said: "Also, why would chkdsk or scandisk not identify corrupt files? I could then just delete the files and rebuild the disk using parity."
AFAIK xfs_repair has no option to scan the complete filesystem, but even if it could identify the files that are on the bad sectors, you couldn't rebuild from parity as your current parity is not valid.
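A toy model of why parity built while a sector was unreadable cannot be trusted to rebuild that sector. Single parity is the XOR of the data disks; the byte values and the helper name `xor_parity` are made up for illustration, and the thread does not say what value was actually used for the unreadable sector during the sync, so the 0 below is purely an assumption.

```shell
# Toy model: one byte per data disk, single parity = XOR of all data bytes.
xor_parity() {
    p=0
    for byte in "$@"; do
        p=$(( p ^ byte ))
    done
    echo "$p"
}

# Rebuilding a missing byte is the XOR of parity with the surviving bytes.
# Example with arbitrary values, where the third disk should hold 15:
#   good=$(xor_parity 170 204 15)    # parity from three readable bytes
#   xor_parity "$good" 170 204       # recovers 15
#   stale=$(xor_parity 170 204 0)    # parity built from a bad read (assumed 0)
#   xor_parity "$stale" 170 204      # yields 0, not 15
```

Parity can only reproduce what was read when it was computed, which is the reason a rebuild from the current parity would carry the bad sector's content forward.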
steve1977 Posted November 14, 2017
I am still not 100% sure whether I fully understand, but let me lay out the next steps:
* I can add an additional 6TB disk as an unassigned disk (UD)
* I can then move all files from the "corrupt" disk to the UD ("mv -r")
* I can then pull the corrupt disk and put the UD into the old "corrupt" slot
* Reconfigure and rebuild parity
Does that make sense?