Benbou Posted April 13, 2023

Hello,

I'm a long-time Unraid user, but this is my first time posting on the forum.

I recently experienced an unclean shutdown on my Unraid server due to a power outage. During the subsequent parity check, I received a notification about "Current pending sector" errors on one of my drives (disk 3). This drive has also regularly reported "UDMA CRC errors" during previous parity checks. The parity check found 7 errors (this is the first time I have encountered parity sync errors), and I suspect that disk 3 is bad.

I have ordered a replacement drive, but I need help with the process of replacing it to prevent data loss and get my server up and running. Here's my current plan:

1. Keep VMs and Docker disabled for now and prevent writes to the server as much as possible.
2. Shut down the server, add the new disk, and preclear it when I receive it (I have a few available bays).
3. Shut down the server, remove disk 3, start the array (to flag disk 3 as missing), stop the array, and assign the new precleared drive as disk 3.
4. Wait for Unraid to reconstruct the data.
5. Run a parity check and hope for 0 errors found.

However, I'm not sure if this plan is the best course of action to prevent data loss and ensure a smooth transition to the new drive. I would appreciate any recommendations or ideas to help me complete this process with confidence.

Thank you in advance for your help.

PS: I have attached my diagnostics file to this post.
PS2: I also ran a SMART extended self-test on all my drives, and no further errors were discovered.

tower-diagnostics-20230412-2143_anon.zip
JorgeB Posted April 13, 2023

There are no disk errors for now, suggesting the pending sectors are "false positives". The parity sync errors are likely from the unclean shutdown, and they are only on parity2. Since the disk is not reporting errors for now and it passed a recent extended SMART test, I would do a correcting check first; you can then replace the disk later if there really are issues.
itimpi Posted April 13, 2023

A few comments:

- Your plan should work, but it may not be necessary if the disk is not really going bad. One thing you could do now is run an Extended SMART test on the drive and see if that passes; if it does, the drive is quite likely to be OK.
- A few sync errors after an unclean shutdown are expected.
- The pending sectors on disk3 are not a good sign, but sometimes these can go back to 0 if they are rewritten.
- Ideally CRC errors rarely in themselves indicate a problem with the actual drive, as they relate to the connection between the drive and the host.
- The preclear on the replacement disk is not strictly necessary unless you want it as a stress test. However, a rebuild onto that drive would also act as a good stress test and would be quicker (assuming it works without error).

Since you already have a drive on order, maybe the best thing to do is to proceed with your plan and keep disk3 with its contents intact until the rebuild onto the replacement disk3 completes successfully. After that you can run a preclear on the old drive to see if it passes that OK.
Benbou Posted April 13, 2023

5 hours ago, JorgeB said:
> [...] and they are only on parity2, [...].

Does that mean that parity1 is still valid? I didn't know that the two parities were not tied together. Also, how did you find this information? I would like to be able to diagnose it myself in the future.

5 hours ago, JorgeB said:
> There are no disk errors for now, suggesting the pending sectors are "false positives". The parity sync errors are likely from the unclean shutdown, [...]. Since the disk is not reporting errors for now and it passed a recent extended SMART test, I would do a correcting check first; you can then replace the disk later if there really are issues.

Please tell me if I'm wrong, but my understanding was that sectors are marked as pending when the drive is unable to read data from them (and will attempt to recover the data later). With that understanding, I was under the impression that the 8 pending sectors were the likely cause of the 7 parity sync errors.

However, now you are suggesting a correcting check. This check will use the data on the drives to rebuild the parity, right? So I'm worried that if some data is not recoverable on disk3, the correcting check will prevent me from getting it back from the parity. On the other hand, if the parity is really invalid and the data on disk3 is fine, attempting to rebuild the drive will cause some data corruption or loss.

Is there a way to be certain of the proper course of action?

Thanks for your time and your help!

PS: Writing this gives me a new idea that might allow me to be certain that the data will be saved:

1. Exclude disk3 from all my shares, to prevent any data change on the drive.
2. Find a way to run a checksum on all the files on the drive.
3. Replace disk3 with the new drive and reconstruct it (keeping the old disk3 in a safe spot for now).
4. Re-run the checksum on all the reconstructed files on the drive.
5. Diff the two lists of files and ensure that everything is identical.

However, writing those bullet points makes me wonder how to interpret the result if the two lists are not identical: is disk3 bad, or is the parity?
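The checksum-and-diff idea in the PS above could be sketched in Python roughly as follows. This is only a minimal illustration, not an Unraid feature or the File Integrity plugin: the function names `hash_file`, `build_manifest`, and `compare_manifests` are made up for this example, and SHA-256 is an arbitrary choice of hash.

```python
import hashlib
import os


def hash_file(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of one file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file path (relative to root) to its SHA-256 digest."""
    manifest = {}
    for folder, _subdirs, files in os.walk(root):
        for name in files:
            full_path = os.path.join(folder, name)
            relative = os.path.relpath(full_path, root)
            manifest[relative] = hash_file(full_path)
    return manifest


def compare_manifests(before, after):
    """Return (missing, added, changed) relative paths, each sorted."""
    missing = sorted(set(before) - set(after))
    added = sorted(set(after) - set(before))
    common = set(before) & set(after)
    changed = sorted(p for p in common if before[p] != after[p])
    return missing, added, changed
```

One would call `build_manifest` once on the old disk before the swap and once on the rebuilt disk, then pass both results to `compare_manifests`. The mount points are assumptions to adapt (for example `/mnt/disk3` for the array disk, and wherever the old drive ends up mounted). A differing hash only says that the two copies differ; on its own it does not say which copy is the good one.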
Benbou Posted April 13, 2023

5 hours ago, itimpi said:
> A few comments: [...] One thing you could do now is run an Extended SMART test on the drive and see if that passes; if it does, the drive is quite likely to be OK. [...]

I did run an Extended SMART test, and no new errors were reported. Only the same 8 pending sectors.

5 hours ago, itimpi said:
> A few comments: [...] The pending sectors on disk3 are not a good sign, but sometimes these can go back to 0 if they are rewritten. Ideally [...]

Would this not prevent any reads on those sectors (which would invalidate the parity)?

5 hours ago, itimpi said:
> [...] Since you already have a drive on order, maybe the best thing to do is to proceed with your plan and keep disk3 with its contents intact until the rebuild onto the replacement disk3 completes successfully. After that you can run a preclear on the old drive to see if it passes that OK.

But at this point I'm not sure whether the data on disk3 is corrupted or the parity is. So if the parity check (the last step of my plan) reports 0 errors, what would that mean? If the parity was corrupted, would the rebuild just corrupt the data on disk3 and then report 0 parity sync errors? But if disk3 was corrupted, it would indeed fix the corrupted data and report 0 parity errors, right?

How can I be sure that I don't lose any data? The data from the old disk3 will still be intact, but it will most likely be different from the data on the new disk3. How do I know which drive has the "good" data?

Thank you for your time and your help!
itimpi Posted April 13, 2023

13 minutes ago, Benbou said:
> The data from the old disk3 will still be intact, but it will most likely be different from the data on the new disk3. How do I know which drive has the "good" data?

You cannot be certain with XFS file systems unless you have checksums from when the data was known to be good. This is why some people use the File Integrity plugin in conjunction with XFS file systems. You also have the option of using BTRFS (and shortly ZFS) file systems on array drives. These have built-in checksum checking of files as they are read or written which, although it does not correct such errors, lets you know as soon as they are detected and in which files.
JorgeB Posted April 13, 2023

42 minutes ago, Benbou said:
> Does that mean that parity1 is still valid?

Correct, no errors found on P (parity1), only Q (parity2).
Benbou Posted April 13, 2023

3 minutes ago, itimpi said:
> [...] This is why some people use the File Integrity plugin in conjunction with XFS file systems. [...]

You can be assured that I will set up this plugin as soon as the aftermath of this power outage is behind me.

So maybe I should find a way to checksum all the files from both drives (the old and new disk3) and diff the results. If any files don't have the same hash, I can try to open them and figure out which one is corrupted.
Benbou Posted April 13, 2023

6 minutes ago, JorgeB said:
> Correct, no errors found on P (parity1), only Q (parity2).

This is great news! So that means that it's certain that the files on disk3 are fine? I mean, if they were corrupted, both parities would have reported errors, right?
JorgeB Posted April 13, 2023

11 minutes ago, Benbou said:
> So that means that it's certain that the files on disk3 are fine? I mean, if they were corrupted, both parities would have reported errors, right?

Yep.
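The reasoning behind that answer can be illustrated with a toy model of one stripe. This is a sketch under stated assumptions, not Unraid's actual code: it assumes P is a plain XOR of the data disks and Q is a RAID-6 style sum in GF(2^8) with generator 2 and polynomial 0x11d. All function names here are made up for the example.

```python
def gf_mul2(value):
    """Multiply one byte by 2 in GF(2^8), reducing with polynomial 0x11d."""
    value <<= 1
    if value & 0x100:
        value ^= 0x11D
    return value & 0xFF


def compute_p(data_bytes):
    """P parity: XOR of the byte from every data disk."""
    p = 0
    for byte in data_bytes:
        p ^= byte
    return p


def compute_q(data_bytes):
    """Q parity: XOR-sum of (2**i * D_i) in GF(2^8), via Horner's rule."""
    q = 0
    for byte in reversed(data_bytes):
        q = gf_mul2(q) ^ byte
    return q


def check_stripe(data_bytes, stored_p, stored_q):
    """Return the names of the parity values that disagree with the data."""
    mismatches = []
    if compute_p(data_bytes) != stored_p:
        mismatches.append("P")
    if compute_q(data_bytes) != stored_q:
        mismatches.append("Q")
    return mismatches


# One byte per data disk, plus the parity written while everything was good.
data = [0x12, 0x34, 0x56]
good_p, good_q = compute_p(data), compute_q(data)

# Case 1: only the stored Q is stale (e.g. an interrupted parity2 write).
print(check_stripe(data, good_p, good_q ^ 0x01))   # ['Q']

# Case 2: one data disk silently returns a different byte.
corrupted = [0x12, 0x34, 0x57]
print(check_stripe(corrupted, good_p, good_q))     # ['P', 'Q']
```

In this model, any change to a single data disk alters both P and Q, so a check that flags only Q points at parity2 rather than at a data disk.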
Benbou Posted April 13, 2023

4 hours ago, JorgeB said:
> Correct, no errors found on P (parity1), only Q (parity2).

But where in the diagnostics files did you find that the errors were only on Q? Despite my attempts, I was unable to locate it.
JorgeB Posted April 14, 2023

Apr 10 23:49:25 Tower kernel: mdcmd (40): check nocorrect
Apr 10 23:49:25 Tower kernel: md: recovery thread: check P Q ...
Apr 11 01:40:56 Tower kernel: md: recovery thread: Q incorrect, sector=2220657896
Apr 11 01:41:00 Tower kernel: md: recovery thread: Q incorrect, sector=2221874816
Apr 11 13:17:36 Tower kernel: md: recovery thread: Q incorrect, sector=12884901840
Apr 11 13:17:36 Tower kernel: md: recovery thread: Q incorrect, sector=12884901848
Apr 11 13:17:52 Tower kernel: md: recovery thread: Q incorrect, sector=12889075728
Apr 11 13:23:34 Tower kernel: md: recovery thread: Q incorrect, sector=12986064728
Apr 12 18:01:08 Tower kernel: md: sync done. time=31194sec
Apr 12 18:01:08 Tower kernel: md: recovery thread: exit status: 0
Benbou Posted April 14, 2023

10 hours ago, JorgeB said:
> Apr 10 23:49:25 Tower kernel: mdcmd (40): check nocorrect
> Apr 10 23:49:25 Tower kernel: md: recovery thread: check P Q ...
> Apr 11 01:40:56 Tower kernel: md: recovery thread: Q incorrect, sector=2220657896
> Apr 11 01:41:00 Tower kernel: md: recovery thread: Q incorrect, sector=2221874816
> Apr 11 13:17:36 Tower kernel: md: recovery thread: Q incorrect, sector=12884901840
> Apr 11 13:17:36 Tower kernel: md: recovery thread: Q incorrect, sector=12884901848
> Apr 11 13:17:52 Tower kernel: md: recovery thread: Q incorrect, sector=12889075728
> Apr 11 13:23:34 Tower kernel: md: recovery thread: Q incorrect, sector=12986064728
> Apr 12 18:01:08 Tower kernel: md: sync done. time=31194sec
> Apr 12 18:01:08 Tower kernel: md: recovery thread: exit status: 0

Thank you!

Note to myself: next time I can use "incorrect" as a keyword to search for in the files.
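The same "incorrect" keyword search can be scripted so that the errors are grouped by parity. The sketch below is a minimal example that assumes the syslog lines look like the ones quoted above. Only the "Q incorrect" form appears in this thread; that other labels follow the same pattern is an assumption, and the function names are made up for the example.

```python
import re
from collections import Counter

# Matches e.g. "md: recovery thread: Q incorrect, sector=2220657896".
INCORRECT_LINE = re.compile(r"md: recovery thread: (\S+) incorrect, sector=(\d+)")


def find_parity_errors(lines):
    """Return a (label, sector) tuple for every 'incorrect' syslog line."""
    errors = []
    for line in lines:
        match = INCORRECT_LINE.search(line)
        if match:
            errors.append((match.group(1), int(match.group(2))))
    return errors


def count_by_parity(errors):
    """Count how many errors were reported for each parity label."""
    return Counter(label for label, _sector in errors)


# Sample lines copied from the syslog excerpt in this thread.
sample = [
    "Apr 10 23:49:25 Tower kernel: md: recovery thread: check P Q ...",
    "Apr 11 01:40:56 Tower kernel: md: recovery thread: Q incorrect, sector=2220657896",
    "Apr 11 01:41:00 Tower kernel: md: recovery thread: Q incorrect, sector=2221874816",
    "Apr 12 18:01:08 Tower kernel: md: recovery thread: exit status: 0",
]

errors = find_parity_errors(sample)
print(errors)                   # [('Q', 2220657896), ('Q', 2221874816)]
print(count_by_parity(errors))  # Counter({'Q': 2})
```

To run it against real data, pass an open file object for the syslog extracted from the diagnostics zip instead of `sample`.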