(SOLVED) Parity Errors remain after two correcting runs


jb426


Hey, need some help as I feel I'm stuck in a bizarre parity check loop.

 

I moved my server into some new hardware that I thought was stable, but it turns out it freezes when a VM is running, so there have been a few unclean shutdowns recently. Until I figure that out, I've put my drives back into the old hardware, which had been stable with no issues before.

 

After booting into the old HW I: 

1. Ran an error-correcting parity check, which found and corrected 865 errors.

2. Ran a non-correcting check to see if the errors were fixed. They were not, so I stopped it.

3. Ran another error-correcting check, which found even more errors (1355).

4. Ran another non-correcting check, errors were still popping up immediately.

5. Stopped the array and ran xfs_repair on the drives, which didn't appear to find any errors.

6. Currently running File Integrity on my drives. No errors or corruption have shown up yet, so is it just the parity drive that's messed up at the moment?

 

Not sure what is going on. No drives have SMART errors that I can see, and the stability issues are not present in the machine I'm using now. I'm considering just rebuilding parity from scratch at this point to see if that fixes it. Any ideas? I've attached my diagnostics. Thanks in advance.

 

tower-diagnostics-20191005-1207.zip


After looking at your syslog, it appears the checks are correcting the same sectors each time. Bad memory will typically result in random sectors being corrected, since the specific memory accessed during disk reads is not deterministic.
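One way to check this yourself is to compare the sector numbers reported by two separate correcting runs. Here is a rough sketch in Python. It assumes each correction is logged on a line containing `sector=<number>`, which may not match your syslog exactly; the function names and the sample lines are made up for illustration.

```python
import re

# Assumption: every corrected sector appears in the log as "sector=<number>".
SECTOR_RE = re.compile(r"sector=(\d+)")


def corrected_sectors(log_text):
    """Return the set of sector numbers mentioned in one run's log lines."""
    return {int(m.group(1)) for m in SECTOR_RE.finditer(log_text)}


def overlap_ratio(run_a, run_b):
    """Fraction of the smaller run's sectors that also appear in the other run."""
    a, b = corrected_sectors(run_a), corrected_sectors(run_b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


# Made-up sample lines, not taken from the attached diagnostics.
first_run = "corrected, sector=1000\ncorrected, sector=2000\n"
second_run = "corrected, sector=1000\ncorrected, sector=2000\ncorrected, sector=3000\n"

print(overlap_ratio(first_run, second_run))  # 1.0
```

A ratio near 1 means the same sectors keep coming back. A ratio near 0 means the corrected sectors are different on each run.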

 

I suspect a controller issue or an actual issue with one or more disks causing the same sectors to return bad data.

 

I didn't notice any issues with the SMART reports on any of the array disks. You might try an Extended SMART test on each of them. Click on a disk to open its page, where you'll find the Self Test.


Ran the long SMART tests, and they came up clean as well. I think I'm going to run more memtests even though I did an overnight one the other day, and check some of the ports or switch them around to see if that does anything. Thanks for the suggestions. This is tricky to figure out, but I appreciate it.


Hey guys, just posting an update. Managed to figure out my issue: it was indeed bad RAM.

I had run a lengthy memtest before and it had shown no errors, but I came across another weird issue at the same time with torrents. They would download to 100%, but the files would have corrupt sections when played, and force rechecking the torrent would show it at 98%, with small pieces missing from the files. Classic bad memory sign. My docker image was getting btrfs corruption too.

So I took out the RAM and replaced it, and everything seems fine now. I ran a correcting parity check that found ~500 errors and am now running a non-correcting parity check, which is at 30% with no errors yet. So far so good, since errors would pop up immediately before. Torrents are working correctly again. I'll move some files onto the array afterwards and run another check just to be sure.

Big thanks to @trurl for the help in diagnosing that it was likely a memory issue. Will mark this post as solved.
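For anyone wondering what the force recheck is doing: it splits the data into fixed-size pieces, hashes each one, and compares against the hashes recorded when the torrent was created. Below is a minimal sketch of that idea in Python using SHA-1. The helper names, the 256-byte piece size, and the sample data are all made up for illustration, and this is not how any particular client implements it.

```python
import hashlib


def piece_hashes(data: bytes, piece_len: int) -> list[str]:
    """Hash each fixed-size piece of the data with SHA-1."""
    return [
        hashlib.sha1(data[i:i + piece_len]).hexdigest()
        for i in range(0, len(data), piece_len)
    ]


def bad_pieces(data: bytes, expected: list[str], piece_len: int) -> list[int]:
    """Return the indexes of pieces whose hash no longer matches."""
    actual = piece_hashes(data, piece_len)
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]


# Made-up data: 1024 bytes split into four 256-byte pieces.
good = bytes(range(256)) * 4
expected = piece_hashes(good, 256)

# Flip a single bit at byte 300, which falls inside piece 1.
corrupt = bytearray(good)
corrupt[300] ^= 0x01

print(bad_pieces(bytes(corrupt), expected, 256))  # [1]
```

A single flipped bit is enough to fail a whole piece, which is why a download can show as complete and then drop to 98% on a recheck.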

