autumnwalker Posted October 9, 2019 Just ran my monthly parity check. It found exactly one error. Thoughts?
JonathanM Posted October 9, 2019 Is it a correcting check? If so, run a non-correcting check and see if it's still there. If it's a non-correcting check, verify the health of all the array drives then run a correcting check. Follow up with a non-correcting check to see if it goes away. Don't run a correcting check with a known bad drive involved.
autumnwalker Posted October 10, 2019 It was a non-correcting check. By validate health do you mean "green balls"? If so, each is showing healthy. None have failed SMART.
JonathanM Posted October 10, 2019 12 minutes ago, autumnwalker said: By validate health do you mean "green balls"? That's one sign of health. When you click on each drive in the main GUI it takes you to a page that lists SMART attributes and allows you to run SMART tests and view the results. In the ideal world you would want each drive to have recently passed a long SMART test. However, those take several hours to complete, and don't need to be run regularly since Unraid reads each sector during a parity check. Drive health is a complicated subject.
autumnwalker Posted October 10, 2019 Understood. Should I trust the SMART information that Unraid is currently displaying? Run short tests?
JorgeB Posted October 10, 2019 16 hours ago, autumnwalker said: Just ran my monthly parity check. It found exactly one error. Thoughts? Assuming it was non-correcting, run another one to see if it still finds the same error. A single error could be from a bit flip if you're not using ECC RAM. If it was a correcting check, still run again; if it was a bit flip it should also find 1 error. Also check or post the syslog from around the time the error happened; if there are no disk-related entries, the disks are unlikely to be the reason.
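To see why a single RAM bit flip shows up as exactly one parity error: single parity (P) is effectively a byte-wise XOR across the data disks, so one bit corrupted in one buffer while parity is being computed produces exactly one mismatching position on the next check. A minimal Python sketch of the idea, using small simulated in-memory buffers rather than Unraid's actual md driver:

```python
from functools import reduce

def xor_parity(blocks):
    """Byte-wise XOR across equal-sized data blocks, like single parity P."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

# Three simulated 16-byte "sectors", one per data disk
disks = [bytes([v] * 16) for v in (0x11, 0x22, 0x33)]
parity = xor_parity(disks)              # parity as originally written

# Simulate a RAM bit flip corrupting one byte of one disk's buffer
# during a later parity computation
corrupted = bytearray(disks[1])
corrupted[5] ^= 0x04                    # a single flipped bit
bad_parity = xor_parity([disks[0], bytes(corrupted), disks[2]])

# A parity check comparing the two finds exactly one mismatch
mismatches = [i for i in range(16) if parity[i] != bad_parity[i]]
print(mismatches)  # → [5]
```

One transient flip, one bad position; a real disk problem would more typically show many errors or syslog I/O entries alongside them.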
autumnwalker Posted October 10, 2019 (edited) Log from start to finish of the parity check (non-correcting) here:

Oct 8 22:58:37 nas01 kernel: mdcmd (64): check nocorrect
Oct 8 22:58:37 nas01 kernel: md: recovery thread: check P ...
Oct 8 22:58:37 nas01 kernel: md: using 1536k window, over a total of 3907018532 blocks.
Oct 9 02:26:14 nas01 afpd[5310]: Reading IPC header failed (-1 of 14 bytes read): Connection reset by peer
Oct 9 10:33:18 nas01 kernel: md: recovery thread: P incorrect, sector=4646561192
Oct 9 11:18:28 nas01 kernel: md: recovery thread: completion status: 0

System is using non-ECC RAM. Edited October 10, 2019 by autumnwalker
JorgeB Posted October 10, 2019 Very unlikely to be disk related.
autumnwalker Posted October 10, 2019 So run another non-correcting parity check and see if I get the same error as below?

Oct 9 10:33:18 nas01 kernel: md: recovery thread: P incorrect, sector=4646561192
JorgeB Posted October 10, 2019 6 hours ago, autumnwalker said: So run another non-correcting parity check and see if I get the same error as below? Correct. If the first one was also non-correcting and you don't get an error, it was likely a memory bit flip; if it was correcting, you should now get the same error.
autumnwalker Posted October 10, 2019 Running now. Both initial and current run are non-correcting. Will report back with results.
autumnwalker Posted October 11, 2019 The non-correcting check returned the same error:

Oct 10 22:12:58 nas01 kernel: md: recovery thread: P incorrect, sector=4646561192
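Incidentally, the sector number in that log line can be turned into a rough byte offset, assuming the 512-byte sector units the Linux kernel conventionally uses in md messages (an assumption on my part about Unraid's driver):

```python
sector = 4646561192            # from the syslog: "P incorrect, sector=4646561192"
offset_bytes = sector * 512    # assuming 512-byte sector units
offset_tib = offset_bytes / 2**40

print(offset_bytes)            # byte offset into the parity-protected region
print(round(offset_tib, 2))    # → 2.16, i.e. roughly 2.16 TiB into the array
```

The same offset mismatching on two consecutive checks is what makes a transient RAM flip during the check itself an unlikely explanation: something at that fixed location on one of the disks disagrees with parity.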
autumnwalker Posted October 11, 2019 What's my next step here?
JonathanM Posted October 11, 2019 Correcting check.
John_M Posted October 11, 2019 As I see it, if all your disks are healthy, your only real option is to run a correcting parity check. That will update the parity disk to match what's on the data disks, making the assumption that the error is in the parity, since there is no way of knowing which disk is actually in error. How valid that assumption is is debatable. Before you run a correcting parity check, though, you might want to do either or both of the following:

Post your diagnostics, or check the SMART data of your disks yourself
Run a MemTest to check for bad RAM

You might want to consider the use of a checksumming technique to detect data corruption going forward. It won't help with your current situation but it might help in the future. I use the Dynamix File Integrity plugin. Some people use btrfs as the format on their data disks.
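The checksumming idea John_M describes boils down to: record a digest for every file, rescan later, and flag any file whose digest changed even though you never modified it. A hypothetical Python sketch of that workflow (not how the Dynamix File Integrity plugin is actually implemented), demonstrated on a throwaway directory standing in for a data disk:

```python
import hashlib
import os
import tempfile

def hash_file(path, algo="sha256", chunk=1 << 20):
    """Stream a file through a hash so large media files never load into RAM."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def build_manifest(root):
    """Map relative path -> digest for every file under root."""
    return {
        os.path.relpath(os.path.join(dirpath, name), root):
            hash_file(os.path.join(dirpath, name))
        for dirpath, _, names in os.walk(root)
        for name in names
    }

with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "movie.mkv"), "wb") as f:
        f.write(b"original data")
    manifest = build_manifest(root)      # record digests after writing

    # Simulate silent single-byte corruption the filesystem won't report
    with open(os.path.join(root, "movie.mkv"), "r+b") as f:
        f.seek(3)
        f.write(b"\x00")

    changed = [p for p, d in build_manifest(root).items() if manifest[p] != d]
    print(changed)  # → ['movie.mkv']
```

With a manifest like this, a future parity mismatch becomes answerable: rescan, and either some file's digest changed (the data disk is wrong) or none did (correcting parity is the safe fix).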
autumnwalker Posted October 11, 2019 SMART is coming back clean on each disk. Is there any way to MemTest with the system live? (AFAIK there is not.) I will check out the File Integrity plugin, but I'm still on reiserfs so I cannot use that until I migrate. Yet another reason for me to migrate now.
John_M Posted October 11, 2019 Running MemTest86 involves a reboot, I'm afraid.
JorgeB Posted October 11, 2019 Like mentioned, the best option now is to run a correcting check. If it was bad RAM it wouldn't keep finding an error on the exact same sector, so it's not so clear what caused the error. Hopefully it was a single event and no more will be detected in the near future; this is one of those situations where having checksums is valuable.
autumnwalker Posted October 12, 2019 Correcting check ran, fixed one error. Just ordered a new drive to start migrating away from reiserfs and I'll look at checksums. I suspect this was related to my failing PSU which was causing my SATA card to crap out (remember that?).
autumnwalker Posted October 12, 2019 Thanks all!
John_M Posted October 12, 2019 2 hours ago, autumnwalker said: I suspect this was related to my failing PSU Failing PSUs can be responsible for the most obscure problems. I think that's a very plausible explanation.