Help: Parity check finding 88 errors. What to do?



My parity check found 88 errors. My unRAID shares stop working a few hours after the server comes up, and it's still like that after several reboots. I had to force a hardware reboot because the reboot command over SSH doesn't work: 'reboot' in SSH only gives the message 'The system is going for reboot NOW!' and nothing happens, even after a few hours.

 

This is the message on the unRAID dashboard:

 

Last check completed on Sat 29 Apr 2017 08:45:28 AM SGT (today), finding 88 errors. 

 

What is the problem? Why does it happen? What should I do? 

 

Log files attached.

 

Thanks.

syslog.zip


Post your diagnostics (Tools > Diagnostics). If the web UI isn't working because you've attempted to reboot, type diagnostics at the command prompt and then upload the resulting file. It contains far more than just the last syslog (and the one you posted has already rotated).

 

38 minutes ago, publicENEMY said:

'reboot' in ssh only give 'The system is going for reboot NOW!' message. No reboot even after few hours.

The most common cause of this is an SSH session left open with its current directory inside a mount point (i.e. /mnt/user), which keeps the array from unmounting.
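If you suspect this, you can list the offending processes before rebooting. Below is a minimal Python sketch, assuming a Linux /proc filesystem (as on the unRAID console); the helper names are hypothetical, and standard tools such as fuser or lsof do the same job more thoroughly. It only looks at each process's working directory (/proc/<pid>/cwd), so it will not catch processes that merely hold files open on the share.

```python
import os


def is_under(path: str, mount: str) -> bool:
    """True if path is mount itself or somewhere inside it.

    Pure string comparison: nothing is stat()ed, so a hung mount
    cannot block the check.
    """
    path = os.path.abspath(path)
    mount = os.path.abspath(mount)
    prefix = mount.rstrip(os.sep) + os.sep
    return path == mount or path.startswith(prefix)


def procs_with_cwd_under(mount: str) -> list[tuple[int, str]]:
    """Return (pid, cwd) for every process whose working directory is in mount.

    Reads the Linux /proc/<pid>/cwd symlinks; processes we may not inspect,
    or that exit during the scan, are skipped.
    """
    hits = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            cwd = os.readlink(f"/proc/{entry}/cwd")
        except OSError:  # permission denied, or the process already exited
            continue
        if is_under(cwd, mount):
            hits.append((int(entry), cwd))
    return hits


if __name__ == "__main__":
    for pid, cwd in procs_with_cwd_under("/mnt/user"):
        print(f"pid {pid} has its working directory in {cwd}")
```

Once you find the shell, cd out of /mnt (for example cd /) or close that SSH session, and the reboot should be able to proceed. Paths are compared as plain strings on purpose, so the scan itself never touches a share that may already be unresponsive.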

 

 

46 minutes ago, publicENEMY said:

My parity check found 88 errors.

 

These are not disk errors but sync errors.

 

The diagnostics file would allow more research to see if you are having disk issues, but the most common reason for sync errors is a dirty/hard shutdown, for example a power cut, or your wife tripping over the power cord while vacuuming. :)

 

Normally a correcting check cleans this up with no trouble. Most modern file systems on the data disks handle this type of calamity without much difficulty, whereas unRAID's parity disk has no file system and is the one most likely to be missing updates. Since a correcting check trusts the data disks and rewrites parity to match them, this is exactly what you want.


ROFL. I see all!

 

Normally after a hard reboot, unRAID will start a correcting check. Your log is truncated, so I cannot see the entry where the check was started, which would confirm whether it was a correcting check. But I expect your parity errors are behind you; you can run a second check to confirm.


Parity check has already been covered by others in the thread, but parity errors by themselves should not cause the other symptoms.

 

It looks like you have told FCP (the Fix Common Problems plugin) to ignore unclean shutdowns, presumably because your current issues force you to hard power down frequently.

 

After you get parity fixed, you might try putting FCP in Troubleshooting mode so we can get better diagnostics if it hangs again.


Sounds like an alarm, possibly high CPU temp, as it's rather higher than normal; check your cooling.

As for the unresponsive array, you're using reiserfs on all disks and that would be my #1 suspect. You can confirm by converting only your cache to XFS and temporarily disabling the mover so that all writes go to the cache only. Test for a few days, and if it works, convert the remaining disks.

PS: the mover ran during the parity check. This should be avoided, as it slows down both operations considerably.

3 hours ago, johnnie.black said:

Sounds like an alarm, possibly high CPU temp, as it's rather higher than normal; check your cooling.

As for the unresponsive array, you're using reiserfs on all disks and that would be my #1 suspect. You can confirm by converting only your cache to XFS and temporarily disabling the mover so that all writes go to the cache only. Test for a few days, and if it works, convert the remaining disks.

PS: the mover ran during the parity check. This should be avoided, as it slows down both operations considerably.

 

Yes, I am using reiserfs on all disks. Should I convert the filesystem?

 

I stopped the default parity check that started after booting from the unclean shutdown, manually checked "Write corrections to parity", and then started a parity check.

 

10 hours later, after the parity check completed, there are still 88 parity errors.

 

Attached are the latest logs from Fix Common Problems troubleshooting mode.

 

How do I fix the 88 errors?

 

tower-diagnostics-20170430-1859.zip

logs.7z

8 minutes ago, publicENEMY said:

Yes, I am using reiserfs on all disks. Should I convert the filesystem?

 

There are many reports of unresponsive v6 servers when using reiserfs, and converting all disks to XFS usually fixes the problem. IMO you should convert anyway, since reiserfs is no longer properly maintained and can have terrible performance in some situations.

 

10 minutes ago, publicENEMY said:

10 hours later, after the parity check completed, there are still 88 parity errors.

 

It's normal for the correcting check to find the same errors, but they are now corrected and the next check should find 0 errors.

6 minutes ago, johnnie.black said:

It's normal for the correcting check to find the same errors, but they are now corrected and the next check should find 0 errors.

 

Ok. I will run a parity check now and report back later. Btw, do the logs state that the errors were fixed? Just curious.

 

Thanks.


There are two types of parity checks: correcting and non-correcting. You can run a non-correcting check over and over and nothing is ever fixed. If you run a single correcting check, the corrections are written and you are done.
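To make the difference concrete, here is a toy model of single parity in Python. It assumes single parity is a byte-wise XOR across the data disks, and the function names are hypothetical; it is an illustration, not unRAID's md driver. It also shows why a correcting check reports the same count it fixes: each mismatch is counted as it is found and then rewritten. (The toy counts mismatched bytes, not whatever block size unRAID actually uses for its error counter.)

```python
from functools import reduce


def compute_parity(data_disks: list[bytes]) -> bytes:
    """Single parity: XOR of every data disk's byte at the same offset."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*data_disks))


def parity_check(data_disks: list[bytes], parity: bytearray, correct: bool) -> int:
    """Compare stored parity with parity recomputed from the data disks.

    Returns the number of mismatched bytes ("sync errors"). With correct=True
    the stored parity is rewritten to match the data disks; with correct=False
    mismatches are only counted and parity is left untouched.
    """
    expected = compute_parity(data_disks)
    errors = 0
    for i, want in enumerate(expected):
        if parity[i] != want:
            errors += 1
            if correct:
                parity[i] = want  # data disks are trusted; parity is updated
    return errors


if __name__ == "__main__":
    disks = [b"\x01\x02\x03", b"\x10\x20\x30"]  # two tiny "data disks"
    parity = bytearray(compute_parity(disks))  # parity starts in sync
    parity[1] ^= 0xFF  # simulate a parity write lost in a hard shutdown
    print(parity_check(disks, parity, correct=False))  # 1 (reported, not fixed)
    print(parity_check(disks, parity, correct=False))  # 1 (still there)
    print(parity_check(disks, parity, correct=True))   # 1 (found, then rewritten)
    print(parity_check(disks, parity, correct=False))  # 0
```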

 

One of the great mysteries of unRAID has been an odd flip-flopping behavior where a correcting check fixes a set of parity errors, and then a subsequent check flags exactly the same sectors and flips them back. This has been documented 4 or 5 times but is exceedingly rare. 

 

I think you just never ran a correcting check. In the logs it will say CORRECT or NOCORRECT when the check is started.
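A minimal Python sketch for pulling those lines out of a syslog, assuming the log line format quoted in this thread (mdcmd (51): check nocorrect). The exact wording logged for a correcting check is an assumption here ("correct"), so anything that is neither word is reported as unknown rather than guessed; the function name is hypothetical.

```python
import re

# Matches lines such as:
#   Apr 30 08:12:28 Tower kernel: mdcmd (51): check nocorrect
# and captures the word after "check" (empty if there is none).
CHECK_RE = re.compile(r"mdcmd \(\d+\): check\b\s*(\S*)", re.IGNORECASE)


def classify_checks(lines):
    """Yield (line, kind) for each parity-check start found in syslog lines."""
    for line in lines:
        m = CHECK_RE.search(line)
        if not m:
            continue
        mode = m.group(1).lower()
        if mode == "nocorrect":
            kind = "NON-correcting"
        elif mode == "correct":  # assumed spelling for a correcting check
            kind = "correcting"
        else:
            kind = "unknown"
        yield line.rstrip("\n"), kind


if __name__ == "__main__":
    sample = ["Apr 30 08:12:28 Tower kernel: mdcmd (51): check nocorrect"]
    for line, kind in classify_checks(sample):
        print(f"{kind:15} {line}")
```

To scan a real log, pass an open file instead of the sample list, for example the syslog from the diagnostics zip or /var/log/syslog on the server.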

 

On the Reiserfs front, two words: get off. It was a good filesystem in its day, but with larger drive sizes it slows down dramatically, causing timeouts and poor performance. Its limitations have justifiably led to disuse, and its disuse has led to less diligence in making required updates. One update that corrupted people's data even made it into a stable release. The fewer people use it, the higher the risk that bugs creep in. And the one place you want no bugs is your file system! 

 

The final reason, if those are not enough, is that the author brutally killed his wife and is rotting in jail. And if you think he's sorry, he was just involved in a lawsuit with his children. I would not want to honor him in any way by using an invention of his evil mind.

 

There is a sticky with a link to an excellent wiki by RobJ on how to convert, and a very lengthy discussion thread. I recommend XFS.

 

 

21 minutes ago, publicENEMY said:

 

Ok. I will run a parity check now and report back later. Btw, do the logs state that the errors were fixed? Just curious.

 

Thanks.

 

I hadn't checked the logs, but that last check was non-correcting:

 

Apr 30 08:12:28 Tower kernel: mdcmd (51): check nocorrect

 

 

43 minutes ago, johnnie.black said:
1 hour ago, publicENEMY said:

 

Ok. I will run a parity check now and report back later. Btw, do the logs state that the errors were fixed? Just curious.

 

Thanks.

 

I hadn't checked the logs, but that last check was non-correcting:

 


Apr 30 08:12:28 Tower kernel: mdcmd (51): check nocorrect

 

 

 

On the Main page, look under "Array Operations" and you will see a button named "Check". Double-check that the box labelled "Write Corrections to Parity" IS checked, then click the button and you will get a correcting check.

 

I personally would stop the current parity check (it won't harm anything but a bit of lost time) and restart it, making absolutely sure the box is checked so you get a correcting parity check this time!


Last check completed on Mon 01 May 2017 06:34:35 AM SGT (today), finding 0 errors. 

 

Thanks all. I didn't know that the automatic parity check after an unclean shutdown doesn't fix parity errors. You have to manually start a parity check with "Write corrections to parity" checked to fix them.

 

To the unRAID devs, I would recommend:

  1. an indication of whether "Write corrections to parity" is checked or unchecked while a parity check runs
  2. information on how many of the parity errors detected by the previous check have been corrected

To anyone having the same problem as me, I suggest running the parity check with "Write corrections to parity" checked, twice.

 

Anyway, everything is okay now.

 

Thanks.


It used to be the default, and in the version I run it still is. I wish we had better tools for researching a parity error and maybe figuring out what caused it, but lacking something like that, there is not much alternative to allowing parity to be updated. I DO strongly recommend that people create md5 (or similar) checksums of all their files, so that in the event of possible corruption you can positively determine whether corruption occurred and identify which file(s) were impacted. It is a bad feeling to suspect corruption with no way to know for sure or to identify likely suspects. But having parity out of sync with your data disks is a bad situation. Even if one of your data disks is possibly corrupted, you want parity consistent with that corruption so that a rebuild of another disk can be performed correctly. And I will tell you, I have such md5s and have never had a hard shutdown cause corruption on my disks. Updating parity was always the right thing to do.
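A minimal Python sketch of the checksum idea, assuming you just want a per-file MD5 manifest of a disk or share; the function names are hypothetical, and md5sum (with md5sum -c to verify) does the same job from the shell. MD5 is fine for spotting accidental corruption, though it is not meant for security.

```python
import hashlib
import os


def md5_of(path, chunk_size=1 << 20):
    """MD5 of one file, read in 1 MiB chunks so large media files are fine."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its MD5 hex digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            manifest[os.path.relpath(full, root)] = md5_of(full)
    return manifest


def verify_manifest(root, manifest):
    """Re-hash every file in the manifest; return (changed, missing) paths."""
    changed, missing = [], []
    for rel, digest in sorted(manifest.items()):
        full = os.path.join(root, rel)
        if not os.path.isfile(full):
            missing.append(rel)
        elif md5_of(full) != digest:
            changed.append(rel)
    return changed, missing
```

Typical use: run build_manifest on a disk (for example /mnt/disk1) while you trust its contents, save the resulting dict somewhere safe (json.dump works), and after a hard shutdown run verify_manifest against it. Files that were legitimately edited since the manifest was built will also show up as changed, so refresh it after intentional changes.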

 

I do suggest getting busy on the XFS conversion! Here is the link to the sticky to point you to the wiki and discussion thread.

 

7 hours ago, publicENEMY said:

To the unRAID devs, I would recommend:

  1. an indication of whether "Write corrections to parity" is checked or unchecked while a parity check runs
  2. information on how many of the parity errors detected by the previous check have been corrected

 

This may go unnoticed here. The best place for these kinds of recommendations is a feature request.

 

  • 8 months later...

I have had many hard reboots for reasons I'm still figuring out. Parity checks have always come back fine, but recently they started showing 464 errors. After the most recent post-reboot check, I ran it again manually with the write corrections box checked. When it finished, it still reported 464 errors?

4 minutes ago, btrcp2000 said:

I have had many hard reboots for reasons I'm still figuring out. Parity checks have always come back fine, but recently they started showing 464 errors. After the most recent post-reboot check, I ran it again manually with the write corrections box checked. When it finished, it still reported 464 errors?

 

You really need to start your own thread. Basically, your problem is the reboots and why they are happening. Whether or not there are parity errors depends on whether the array had finished all of its writes to disk when the lockup occurred. 


I've been starting threads on reboots for months, and no one seems to have a conclusive answer on what's happening, so I am in trial-and-error mode with that. The errors are a new thing for me. If you are telling me not to worry about the 464 errors, then I will keep focusing on the reboots. This thread made it seem like they should have gone away after the manual parity check.

1 hour ago, btrcp2000 said:

I've been starting threads on reboots for months, and no one seems to have a conclusive answer on what's happening, so I am in trial-and-error mode with that. The errors are a new thing for me. If you are telling me not to worry about the 464 errors, then I will keep focusing on the reboots. This thread made it seem like they should have gone away after the manual parity check.

The only acceptable answer for parity errors is zero. But this thread is for helping the OP, start your own. We don't want to get everyone confused with diagnostics, requests, replies, and instructions for multiple people in the same thread.
