Parity check error and docker updates not available



Hi Everyone-

 

Ran my monthly parity check (the first time with "write corrections to parity" unchecked) and came up with 9 errors.

 

Last 2 months, i had 1 error each and the month before that I had 9. 

 

Don't recall any unclean shutdowns in the last 30 days.  I occasionally get a message about once a week from unraid stating, "the connection to your UPS has been restored" for my APC UPS.  Haven't had power go out in months.  

 

When i look under the main tab, all of my disks have 0 errors.

 

Is this something to be worried about?  Any idea how i identify which files/disks are involved?  

 

Attached is my diagnostics.

 

I'm currently on 6.8.3 and haven't updated to 6.9.2

 

The other issue is my krusader and plex docker programs won't let me update them. It states "version not available".  I've restarted both dockers and clicked the check for updates box.  Any idea? Are the 2 problems somehow related?

 

Thanks in advance.  

 

nasgard-diagnostics-20210526-1506.zip

Link to comment
7 hours ago, JorgeB said:

You're overclocking the RAM, Ryzen with overclocked RAM is known to corrupt data resulting in sync errors, see here.

 

That's strange. I updated the bios a few months ago but don't recall manually doing it. 

 

I'll take a look and see. Hoping that's the issue. 

 

How would I identify what the 9 errors are?  Should I click write corrections to parity for my next check?

Edited by omartian
Link to comment
5 hours ago, ChatNoir said:

Under 6.8, the "not available" part is normal, Docker changed stuff on their side.

The best solution is to upgrade to 6.9.

 

If not possible, you can find a workaround there. (also look for the go-file if you want it to survive a reboot)

Awesome. 

 

I'll just upgrade. Thank you. 

Link to comment
10 hours ago, JorgeB said:

You're overclocking the RAM, Ryzen with overclocked RAM is known to corrupt data resulting in sync errors, see here.

Checked in the BIOS and XMP was enabled for my RAM.  I disabled the XMP profile, which took speeds from 3200 to 2100 MHz.  Hopefully that won't affect my plex server performance.  Will try doing a correcting check now, hoping for only 9 errors. 

Edited by omartian
error
Link to comment
On 5/27/2021 at 1:43 PM, JorgeB said:

First check after fixing the problem can still find errors, but after that it should always be 0 errors, which is the only acceptable number of sync errors.

 

So i just ran two parity checks.  After switching off XMP on my RAM, i ran a non-correcting check and only got 2 errors, which i thought was weird since i was expecting the 9 from before.  

 

I attached that diagnostic.

 

I then went to my server and re-seated all my sata cables and my sata-sas adapter (LSI SAS 9207-8i SATA/SAS 6Gb/s PCI-E 3.0 Host Bus Adapter IT Mode SAS9207-8i US).

 

Decided to run another non-correcting check, and received 6 errors.  attached below.

 

I noticed that when the process is running, i get about 65% of the way through with 0 errors.  It seems like when i get to disk 6 + 7 (or maybe just 7), the parity errors occur.  Do you think it might be that connector or that disk based on the diagnostics?

 

  

2 errors noncorrecting check.zip 6 errors noncorrecting check.zip

Edited by omartian
updated
Link to comment

The first set of diagnostics shows the parity check starting but only covers a few more minutes after that, so there is no indication of which sectors had the problem.   Are you sure they were taken AFTER the 2 errors occurred?

 

The syslog in the second diagnostics shows which sectors had the problem, as it has these entries:

May 31 13:51:10 Nasgard kernel: md: recovery thread: Q incorrect, sector=18795850688
May 31 14:32:14 Nasgard dhcpcd[1903]: br0: failed to renew DHCP, rebinding
May 31 15:52:33 Nasgard emhttpd: spinning down /dev/sdj
May 31 15:52:33 Nasgard emhttpd: spinning down /dev/sdh
May 31 15:52:33 Nasgard emhttpd: spinning down /dev/sdi
May 31 15:53:47 Nasgard kernel: md: recovery thread: Q incorrect, sector=20466240808
May 31 18:14:49 Nasgard kernel: md: recovery thread: Q incorrect, sector=22417914176
May 31 19:46:33 Nasgard kernel: md: recovery thread: Q incorrect, sector=23565254352
May 31 20:12:20 Nasgard kernel: md: recovery thread: Q incorrect, sector=23959071080
May 31 20:38:19 Nasgard emhttpd: spinning down /dev/sdb
May 31 20:38:19 Nasgard emhttpd: spinning down /dev/sdc
May 31 21:30:28 Nasgard kernel: md: recovery thread: Q incorrect, sector=25097677136

indicating the error sectors.  
 

However, since there were no corresponding entries in the syslog in the first diagnostics, it is not possible to see whether the same sectors reported errors in both checks.
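Pulling those entries out of a syslog by hand is tedious, so here is a minimal Python sketch that does it. It assumes the error lines always look like the "md: recovery thread: Q incorrect, sector=..." entries quoted above; the function name and the regular expression are illustrative and are not part of Unraid or any plugin.

```python
import re

# Assumed line format, taken from the entries quoted in this thread, e.g.
#   May 31 13:51:10 Nasgard kernel: md: recovery thread: Q incorrect, sector=18795850688
# The word before "incorrect" names which parity was flagged (Q in this thread).
SYNC_ERROR_RE = re.compile(r"md: recovery thread: (\w+) incorrect, sector=(\d+)")


def find_sync_errors(lines):
    """Return a list of (parity_name, sector) tuples found in syslog lines."""
    errors = []
    for line in lines:
        match = SYNC_ERROR_RE.search(line)
        if match:
            errors.append((match.group(1), int(match.group(2))))
    return errors


# Three lines copied from the syslog excerpt above; the DHCP line is ignored.
sample = [
    "May 31 13:51:10 Nasgard kernel: md: recovery thread: Q incorrect, sector=18795850688",
    "May 31 14:32:14 Nasgard dhcpcd[1903]: br0: failed to renew DHCP, rebinding",
    "May 31 15:53:47 Nasgard kernel: md: recovery thread: Q incorrect, sector=20466240808",
]

for parity_name, sector in find_sync_errors(sample):
    print(parity_name, sector)
```

Running the same function over the syslog from two different checks and comparing the sector lists would show whether the errors repeat at the same places or move around.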

Link to comment

It has just occurred to me that if you have the Parity Check Tuning plugin installed then you might be able to investigate this far more rapidly using its Tools -> Parity Problem Assistant feature?  That feature was developed for exactly your scenario but I have never had any feedback on how useful it turns out to be in practice so would be interested to get some (plus any suggestions for making it more useful).

Link to comment
18 minutes ago, itimpi said:

It has just occurred to me that if you have the Parity Check Tuning plugin installed then you might be able to investigate this far more rapidly using its Tools -> Parity Problem Assistant feature?  That feature was developed for exactly your scenario but I have never had any feedback on how useful it turns out to be in practice so would be interested to get some (plus any suggestions for making it more useful).

 

Will check out that plugin. Thank you.

 

Weird. Could have sworn I downloaded the 2 error after a full scan. 

 

Anything else you can make out of those bad sectors?

Link to comment
On 6/1/2021 at 4:18 AM, JorgeB said:

#1 reason for unexpected sync errors is RAM related, if just downclocking didn't fix it it could be a bad DIMM.

 

Memtest is currently on pass 9 and has been running for about 20 hrs w/0 errors.  

 

At this point, should i run a correcting check, or is there anything else i should be doing?

Link to comment

If there are still errors on consecutive checks you basically need to rule out the hardware involved. RAM is still a good candidate even without memtest finding errors, but it could also be the board/CPU or a disk. I would start by using just one DIMM at a time since it's the easiest thing to rule out.

Link to comment
38 minutes ago, JorgeB said:

If there are still errors on consecutive checks you basically need to rule out the hardware involved. RAM is still a good candidate even without memtest finding errors, but it could also be the board/CPU or a disk. I would start by using just one DIMM at a time since it's the easiest thing to rule out.

 

Ok. One dimm at a time on a non correcting check. 

 

Do these sync errors mean that the media files on the data disk are no longer valid, or just that there is a discrepancy with the parity?

 

I'm wondering if I fixed this issue but since I never ran a correcting check, the same random sync issues pop up. If I run a correcting now, I'm hoping the next non correcting check would be clean. 

 

I wish unraid made it easier to isolate the issue. Too many variables....

Edited by omartian
Link to comment
34 minutes ago, omartian said:

Do these sync errors mean that the media files on the data disk are no longer valid, or just that there is a discrepancy with the parity?

It means parity doesn't match the value calculated from the array's data devices, but with data corruption the problem can be anywhere: the data could have been written already corrupt, the parity could be wrong, or just the calculation at that time could be wrong.
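To illustrate why a mismatch alone cannot say which device is at fault, here is a small Python sketch. It is a simplification: it uses plain single XOR parity over made-up two-byte "disks", not Unraid's actual implementation, and not the second (Q) parity that the syslog entries in this thread refer to.

```python
from functools import reduce


def xor_parity(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three made-up data "disks" and the parity computed from them.
data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = xor_parity(data)

# Case A: one bit flipped on a data disk, parity untouched.
corrupt_data = [b"\x01\x02", b"\x10\x21", b"\xff\x00"]
mismatch_a = xor_parity(corrupt_data) != parity

# Case B: data untouched, one bit flipped in the stored parity.
corrupt_parity = bytes([parity[0], parity[1] ^ 0x01])
mismatch_b = xor_parity(data) != corrupt_parity

# Both cases report a mismatch, and the check result looks the same.
print(mismatch_a, mismatch_b)
```

In both cases the check only reports that the recomputed value and the stored parity disagree. A correcting check then rewrites parity to match the data, which is the right fix in case B but not in case A.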

Link to comment
14 minutes ago, JorgeB said:

It means parity doesn't match the value calculated from the array's data devices, but with data corruption the problem can be anywhere: the data could have been written already corrupt, the parity could be wrong, or just the calculation at that time could be wrong.

 

Thanks for all of your help Jorge.  I'll keep tinkering, run a correcting check.  

Link to comment
On 6/1/2021 at 2:50 AM, itimpi said:

It has just occurred to me that if you have the Parity Check Tuning plugin installed then you might be able to investigate this far more rapidly using its Tools -> Parity Problem Assistant feature?  That feature was developed for exactly your scenario but I have never had any feedback on how useful it turns out to be in practice so would be interested to get some (plus any suggestions for making it more useful).

 

Which sectors should i point it to.  I have a hard time


So tried using this plugin, but getting an error message.  

 

Based on the syslog that jorge highlighted above, it looks like the error happens between sectors 18795850000 and 25097678000

 

When i try to set it to go, i get the error message:  "end point too large:  The end has been set to more than the size of the disk."  i punched in the above #'s as start and endpoint bc it's asking for sector numbers.  Do i need to adjust it somehow?

Link to comment
3 hours ago, omartian said:

Which sectors should i point it to.  I have a hard time

 

At the moment you have to manually search the syslog for the affected sectors (it will typically be near the end if you have just run a check).

 

I was intending to add a button on the input page that would scan the syslog and pop up a dialog with any sectors found so that it is much easier to both know what sectors are involved and to make it easier to select them.  I had been waiting on some feedback on people trying to use this assistant and finding it useful before putting the work in to implement that option.  Sounds as if it is definitely going to be wanted :) 

 

3 hours ago, omartian said:

So tried using this plugin, but getting an error message.  

 

Based on the syslog that jorge highlighted above, it looks like the error happens between sectors 18795850000 and 25097678000

 

Looks like you added an extra 0 on the end.   I will improve the error message to include the acceptable range which might help with picking this up.
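As a rough illustration of picking a start and end point from the reported sectors and sanity-checking them before entering them, here is a Python sketch. The function names are hypothetical, this is not the plugin's actual validation code, and the disk size in sectors has to be supplied by you since it is not given in this thread.

```python
def sector_range(sectors, margin=0):
    """Return (start, end) covering all reported sectors, padded by margin."""
    if not sectors:
        raise ValueError("no sectors given")
    start = max(min(sectors) - margin, 0)  # never go below sector 0
    end = max(sectors) + margin
    return start, end


def validate_range(start, end, disk_sectors):
    """Raise ValueError if the range does not fit on a disk of disk_sectors."""
    if start < 0 or start > end:
        raise ValueError("start must be between 0 and end")
    if end >= disk_sectors:
        raise ValueError(f"end point too large: must be below {disk_sectors}")
    return True


# Sectors copied from the syslog entries quoted earlier in this thread.
reported = [
    18795850688,
    20466240808,
    22417914176,
    23565254352,
    23959071080,
    25097677136,
]

start, end = sector_range(reported)
print(start, end)
```

Calling validate_range(start, end, disk_sectors) with the real sector count of the parity disk would catch an end point that is past the end of the disk before the check is started.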

 

Link to comment
4 hours ago, Tigerherz said:

I had similar problems.

It was a problem with spin down disks.

I set spin down in disksettings to never

and clear stats on the mainpage.

I think my controller has a problem with spin down.

 

Do you disable spin down for parity checks or at all times?  If your disks are spinning all the time, isn't that bad for longevity?

 

Also, if you got an error this way, do you run a correcting check afterwards?

Edited by omartian
Link to comment
