Ross Posted February 3, 2022
I have been running Unraid for years without any issue, but the day has finally arrived. A scheduled parity check finished but reported:
Last check completed on Wed 02 Feb 2022 03:30:45 PM CST (yesterday)
Finding 511132 errors
Duration: 14 hours, 30 minutes, 44 seconds. Average speed: 153.2 MB/sec
I have three 8TB WD Red drives and a small SSD cache drive. There are no SMART errors reported and all disks report as "Healthy". Parity was reported as Valid.
When the errors were emailed to me I shut down the server, because I did not know when I would have time to get to it. I guess that was a mistake, since it emptied the log files...
So I think I need to run another parity check, without correction, and see what I get? I also think I may want to get a second parity drive after correcting this problem, because I realize how important it might be to identify which disk is having issues...
I thought having a parity check report errors would be a common problem and that there would be a guide set up for my scenario, but under the troubleshooting areas I have not yet found one. Is there one out there?
Thanks in advance for any help!
Ross
JorgeB Posted February 3, 2022
Just now, Ross said: "So I think I need to run another parity check, without correction, and see what I get?"
Yes, if the first one was correcting.
Ross (Author) Posted February 3, 2022
I manually started a parity check and immediately got errors:
Feb 3 11:31:59 WServer1 kernel: md: recovery thread: check P ...
Feb 3 11:32:00 WServer1 kernel: md: recovery thread: P incorrect, sector=2488
Feb 3 11:32:00 WServer1 kernel: md: recovery thread: P incorrect, sector=12720
Feb 3 11:32:00 WServer1 kernel: md: recovery thread: P incorrect, sector=14768
Feb 3 11:32:00 WServer1 kernel: md: recovery thread: P incorrect, sector=15792
Feb 3 11:32:00 WServer1 kernel: md: recovery thread: P incorrect, sector=16824
Feb 3 11:32:00 WServer1 kernel: md: recovery thread: P incorrect, sector=10680
Feb 3 11:32:00 WServer1 kernel: md: recovery thread: P incorrect, sector=24656
Feb 3 11:32:00 WServer1 kernel: md: recovery thread: P incorrect, sector=30192
Feb 3 11:32:00 WServer1 kernel: md: recovery thread: P incorrect, sector=39408
Feb 3 11:32:00 WServer1 kernel: md: recovery thread: P incorrect, sector=47264
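For longer runs, the "P incorrect" lines can be tallied from a saved copy of the syslog instead of being read by eye. The following is a minimal Python sketch, assuming the log lines have exactly the format quoted above; the function name `parity_error_sectors` and the `sample` list are illustrative and not part of Unraid.

```python
import re

# Matches the kernel message format quoted in the post above (assumed stable).
P_INCORRECT = re.compile(r"md: recovery thread: P incorrect, sector=(\d+)")


def parity_error_sectors(lines):
    """Return the sector numbers from every 'P incorrect' line, in log order."""
    sectors = []
    for line in lines:
        match = P_INCORRECT.search(line)
        if match:
            sectors.append(int(match.group(1)))
    return sectors


# A few lines copied from the log excerpt above.
sample = [
    "Feb 3 11:31:59 WServer1 kernel: md: recovery thread: check P ...",
    "Feb 3 11:32:00 WServer1 kernel: md: recovery thread: P incorrect, sector=2488",
    "Feb 3 11:32:00 WServer1 kernel: md: recovery thread: P incorrect, sector=12720",
]

sectors = parity_error_sectors(sample)
print(f"{len(sectors)} parity errors, sectors {min(sectors)} to {max(sectors)}")
```

To run it against a real log, pass an open file object instead of `sample`, since iterating a text file yields its lines one at a time.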
JorgeB Posted February 3, 2022
Run a correcting check, then a non-correcting one. If the second one finds errors, post new diags. Do all of this without rebooting.
Ross (Author) Posted February 3, 2022
I can do that, but I have figured out that two correcting parity checks were already run, and both reported a lot of errors (the first was a week ago). The server was not rebooted between those two runs. I ran the latest non-correcting one for a few minutes and then aborted it, and it logged over 3k errors. I realize now that my scheduled checks should not have had correcting on, but they did. So I have had a repeat of this error.
I can run two more back to back, which will take over a day, but I am wondering if the above information confirms the issue? I suppose now that I have run two correcting ones already, if I corrupted my data it is probably already too late? That's why my next addition will be a second parity drive...
Thanks again for any help!
Ross
JorgeB Posted February 3, 2022
1 minute ago, Ross said: "but I am wondering if the above information confirms the issue?"
Not without seeing if the previous ones were really correcting or not.
Ross (Author) Posted February 6, 2022
So after rebooting the array I ran a correcting parity check and it corrected thousands of errors. I have since run two non-correcting parity checks and they finished without errors. None of the drives have any SMART errors reported.
After years of no parity errors, I’m trying to make sense of this situation. The array is mostly a media file server and some backup storage, so there is not a lot of write activity. When parity is “corrected”, I’m not sure I understand how it determines truth. If there were drive errors, then I could see how I might figure out which drive was the source of error. I’m assuming that either a data drive or the parity drive could develop an issue. So even in my small array I was thinking that having a second parity drive might help figure out which drive has the “bad” data needing to be corrected. I also saw discussions about an app that calculates a checksum that allows detection of file degradation aside from disk failure.
So my server seems fine now, but I’m not sure I did not lose something… Is a second parity drive a good idea?
Thanks for any guidance!
Ross
itimpi Posted February 6, 2022
When ‘correcting’ parity, the assumption is that the data drives are good and parity needs to be updated to match. This is the same whether you have single or dual parity; in neither case is a problem drive identifiable as the cause of a parity error.
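To illustrate the point above: single parity is commonly described as a byte-wise XOR across the data drives, so a mismatch only says that the drives disagree at some position, not which drive is wrong. The sketch below is a toy model in Python under that assumption (two tiny in-memory "data disks" and one parity block; all names are hypothetical). It is not Unraid's actual implementation.

```python
from functools import reduce


def xor_parity(blocks):
    """Byte-wise XOR across equally sized blocks (toy model of single parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def mismatches(data_blocks, parity):
    """Offsets where stored parity differs from parity recomputed from data."""
    expected = xor_parity(data_blocks)
    return [i for i, (e, p) in enumerate(zip(expected, parity)) if e != p]


# Two hypothetical data disks and their parity.
data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40"]
parity = xor_parity(data)

# Case A: one bit flips on data disk 0 at offset 2.
bad_data = [bytes([0x01, 0x02, 0x03 ^ 0x08, 0x04]), data[1]]

# Case B: the same bit flips on the parity disk instead.
bad_parity = bytearray(parity)
bad_parity[2] ^= 0x08
bad_parity = bytes(bad_parity)

# Both cases produce the same report, so the faulty disk cannot be told apart.
print(mismatches(bad_data, parity))   # [2]
print(mismatches(data, bad_parity))   # [2]

# A "correcting" check trusts the data and rewrites parity to match it.
corrected_parity = xor_parity(bad_data)
print(mismatches(bad_data, corrected_parity))  # [] (the data corruption remains)
```

In case A the correcting pass makes the check come back clean while the flipped bit stays on the data disk, which is why a clean follow-up check does not by itself prove the files are intact.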
Ross (Author) Posted February 6, 2022
Thanks for the quick reply. But is that assumption, that the data drives are always good, valid? Why do we believe the data drives are always valid and only the parity drive can be wrong? Maybe the app that creates file checksums is the way to go…
It’s a little disconcerting to see over 500,000 errors need correction on a system that, aside from automatic software updates, is only being read. And the drives are old enough that failure is definitely becoming more likely. I’m not sure how I can validate that I do not have file corruption…
Thanks again for any information or suggestions.
Ross
itimpi Posted February 6, 2022
14 minutes ago, Ross said: "Why do we believe the data drives are always valid and only the parity drive can be wrong?"
Because if there is a problem on a drive, there is no way to identify which one it might be.
17 minutes ago, Ross said: "I’m not sure how I can validate that I do not have corruption."
With modern drives the assumption is that they will return an error if they do not read a sector successfully, but it is possible that this is not always the case. Checksums (either built into the file system or via an add-on) are the only way to be certain.
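The checksum approach mentioned above can be sketched in a few lines of Python. This is only an illustration using the standard library's `hashlib`, not the add-on discussed in this thread; the function names and the idea of keeping a manifest as a dictionary are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large media files are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    root = Path(root)
    return {
        str(path.relative_to(root)): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_files(root, manifest):
    """Return names from the manifest whose content now differs or is missing."""
    current = build_manifest(root)
    return sorted(
        name for name, digest in manifest.items() if current.get(name) != digest
    )
```

A typical use would be to call `build_manifest` on a share while the files are known to be good, store the result somewhere off the array, and later call `changed_files` with that stored manifest. Files that were edited on purpose will also show up, so this suits mostly static data such as a media library.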
Ross (Author) Posted February 6, 2022
Got it. I was thinking that a dual parity drive setup would allow the system to figure out which drive was the “odd man out”, and therefore the one drive needing to be corrected. But that would not be true if the parity process had an issue and wrote the wrong parity to both parity drives. I think I’m going to look at that file checksum program to see if that is the way to go.
What is the theoretical cause of a stable array (for years) developing over 500,000 parity errors in one week?
Ross
bombz Posted March 3, 2023
On 2/6/2022 at 3:01 PM, itimpi said: "When ‘correcting’ parity the assumption is that the data drives are good and parity needs to be updated to match. This is the same whether you have single or dual parity, in neither case is a problem drive identifiable as the cause of a parity error."
Hello,
I came across parity sync errors as well. There are no disk errors at the moment, and I am having trouble pinpointing what may be the cause.
I did run into a problem last weekend: while updating a plugin, the GUI download halted. I performed a clean shutdown and rebooted the server. The USB (/sda) was erroring or corrupted, and the server would not boot. I pulled the USB and restored my backup on a WinOS system, then booted the server and ran a parity check. The first run saw 2 errors; yesterday's scheduled run saw 592 errors.
I am wondering if I should be concerned about this, or where the errors are. Are they possibly sync errors? Looking forward to some clarification.
Thank you kindly,
Cheers
diagnostics-20230303-0705.zip
JorgeB Posted March 3, 2023
25 minutes ago, bombz said: "Wondering if I should be concerned with this"
Yes, run another check and post new diags if new errors are found.
bombz Posted March 4, 2023
On 3/3/2023 at 7:43 AM, JorgeB said: "Yes, run another check and post new diags if new errors are found."
Hello,
Appreciate the prompt follow-up. I will kick off another parity sync, or possibly wait for the next monthly scheduled sync to see how things pan out. I will report back regardless of the findings.
Thank you kindly!
bombz Posted April 4, 2023
On 3/3/2023 at 7:43 AM, JorgeB said: "Yes, run another check and post new diags if new errors are found."
Hello again,
I ran another pass on both my servers, and this monthly pass showed no errors reported. It is strange that they appeared in the first place.
As always, I appreciate the feedback, support and assistance with this inquiry.
Thank you again!