Parity Check Error Question (SOLVED)


Hi all, quick question probably.  Had a kernel panic yesterday, which seems to have triggered an automatic parity check on reboot.  The parity option to write corrections is checked, and it found 52 errors.  Do I need to be concerned, or does it just write the corrections and I'm good to go again?

First time I've ever kernel panicked and went through this.

 

Thanks!

Edited by nickp85
Link to comment
3 hours ago, jonathanm said:

Yes, concerned, but writing the corrections should be enough. Perhaps a non-correcting check with zero errors just to be sure.

So I started a non-correcting check and it's about 15% in and has found 3 errors already.  This is a brand new machine built back in December.  The parity drive is a new 4 TB WD Red.  I have 3 other identical 4 TB drives and two 3 TB Reds I took from my old NAS (about 5 years old).  The drives all seem to be working fine and pass SMART tests.  I've been running parity checks on a 2 month schedule without issue until this kernel panic and reboot.  Someone said the kernel panic was because I updated from 6.5.3rc2 to 6.5.3, so I did what was suggested: backed up the config file and reloaded the stick with a fresh copy of 6.5.3.  No more panics, but now I'm getting these parity errors.

 

Anything I should do from here?

Link to comment
4 hours ago, nickp85 said:

Had a kernel panic yesterday, which seems to have triggered an automatic parity check on reboot.  The parity option to write corrections is checked, and it found 52 errors.

The automatic parity check would be non-correcting. Did you cancel it and change to correcting? If not, it's normal to find the same errors again. If you haven't rebooted yet, the diagnostics will also show this.
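To illustrate why a non-correcting check keeps reporting the same errors, here is a minimal toy sketch. It assumes single parity computed as a byte-wise XOR across the data disks; the names `xor_parity` and `parity_check` and the tiny in-memory "disks" are made up for illustration and are not Unraid's actual code.

```python
from functools import reduce

def xor_parity(blocks):
    """Single parity: byte-wise XOR across one block from each data disk."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def parity_check(data_disks, parity_disk, correct):
    """Count stripes whose stored parity differs from the computed parity.

    data_disks: list of disks, each a list of equal-sized byte blocks.
    parity_disk: list of parity blocks; rewritten in place only if correct=True.
    """
    errors = 0
    for i, stripe in enumerate(zip(*data_disks)):
        expected = xor_parity(stripe)
        if parity_disk[i] != expected:
            errors += 1
            if correct:
                parity_disk[i] = expected  # a correcting check updates parity
    return errors

# Hypothetical two-disk array with two stripes (toy data, not real disk contents).
disks = [[b"\x01\x02", b"\x0f\x00"], [b"\x10\x20", b"\xf0\x01"]]
parity = [xor_parity(stripe) for stripe in zip(*disks)]
parity[1] = b"\x00\x00"  # simulate one parity block left stale by a crash

first = parity_check(disks, parity, correct=False)   # reports the error, fixes nothing
second = parity_check(disks, parity, correct=False)  # same error is found again
third = parity_check(disks, parity, correct=True)    # reports it and rewrites parity
fourth = parity_check(disks, parity, correct=False)  # clean, if the hardware is healthy
print(first, second, third, fourth)
```

In this model, a count that is still non-zero after a correcting run means something is corrupting data after the correction, not that the earlier errors were left in place.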

Link to comment

 

8 hours ago, johnnie.black said:

The automatic parity check would be non-correcting. Did you cancel it and change to correcting? If not, it's normal to find the same errors again. If you haven't rebooted yet, the diagnostics will also show this.

 

I think I did cancel it because I didn’t know why it was running, since it wasn’t scheduled to start yet. But then I realized and triggered it manually. It corrected 52 errors and said parity was valid. Then I ran another one without correct enabled and it found 121. This is less than a day after correcting the first 52.

 

What's going on? Could it be that new pre-emptive setting in 6.5.3?  What should I do?

 

Date                  Duration               Speed        Status    Errors
2018-06-15, 11:12:14  12 hr, 26 min, 55 sec  89.3 MB/s    OK        121
2018-06-14, 16:30:04  19 hr, 33 min, 1 sec   56.8 MB/s    OK        52
2018-06-13, 20:26:55  29 min, 5 sec          Unavailable  Canceled  0
2018-04-04, 16:18:16  12 hr, 18 min, 15 sec  90.3 MB/s    OK        0
2018-01-03, 20:41:40  11 hr, 41 min, 39 sec  95.0 MB/s    OK        0
2018-01-03, 15:41:40  11 hr, 41 min, 39 sec  95.0 MB/s    OK        0
2017-12-31, 04:50:28  7 hr, 12 min, 55 sec   154.0 MB/s   OK        0
2017-12-09, 22:20:07  8 hr, 56 sec           138.6 MB/s   OK        0
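For anyone who wants to pick out the runs that reported errors from a history like the one above, the following is a rough sketch. It assumes the rows look exactly like the ones pasted here; the regular expression and the helper name `runs_with_errors` are my own and not part of Unraid.

```python
import re

# One history row: date, time, free-form duration, speed (or "Unavailable"),
# status word, and the error count at the end of the line.
ROW = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}),\s+(?P<time>\d{2}:\d{2}:\d{2})\s+"
    r"(?P<duration>.+?)\s+(?P<speed>[\d.]+ MB/s|Unavailable)\s+"
    r"(?P<status>\w+)\s+(?P<errors>\d+)\s*$"
)

def runs_with_errors(lines):
    """Return (date, errors) for every parsable row with a non-zero error count."""
    flagged = []
    for line in lines:
        match = ROW.match(line.strip())
        if match and int(match.group("errors")) > 0:
            flagged.append((match.group("date"), int(match.group("errors"))))
    return flagged

# Rows copied from the history in this post.
history = [
    "2018-06-15, 11:12:14 12 hr, 26 min, 55 sec 89.3 MB/s OK 121",
    "2018-06-14, 16:30:04 19 hr, 33 min, 1 sec 56.8 MB/s OK 52",
    "2018-06-13, 20:26:55 29 min, 5 sec Unavailable Canceled 0",
    "2018-04-04, 16:18:16 12 hr, 18 min, 15 sec 90.3 MB/s OK 0",
]
flagged = runs_with_errors(history)
print(flagged)
```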
 
Edited by nickp85
Link to comment
10 hours ago, johnnie.black said:

Don't see anything that would indicate a disk problem. Since you're not using ECC, running memtest for 24 hours is where I would start.

Tried to boot memtest from the unraid boot menu and the machine just reboots after showing /memtest on the screen after hitting enter on the memtest option.  How come memtest won't load off the stick?

Link to comment
15 minutes ago, nickp85 said:

Tried to boot memtest from the unraid boot menu and the machine just reboots after showing /memtest on the screen after hitting enter on the memtest option.  How come memtest won't load off the stick?

Instead of booting the stick via UEFI, try booting it in Legacy mode (should be able to select that via your BIOS boot order), and then try memtest

Link to comment
4 minutes ago, Squid said:

Instead of booting the stick via UEFI, try booting it in Legacy mode (should be able to select that via your BIOS boot order), and then try memtest

Thanks, figured it was UEFI. Made another USB stick using the UEFI compatible memtest and booted off that. It’s running. Let’s see what happens. 

 

Btw after parity check ended with 121 without corrections on, I ran it again to correct those and it ended up correcting 123 I think.

 

Two recent hardware changes 

1. I added another 16 GB of RAM, so now I have 4x8 GB DDR4-3000, but that was almost 3 months ago, and a parity check ran a week after the extra RAM was installed without any errors.

2. Motherboard died recently and I swapped it out for a brand new one same model. That was just under a month ago. Wouldn’t POST anymore.  New board fixed the booting issue so I figured all good.

 

This is all new hardware purchased on or after November/December 2017.

Link to comment
25 minutes ago, nickp85 said:

Memtest still running, no errors yet but there is a “note”

 

[Note] RAM may be vulnerable to high frequency row hammer bit flips

 

no idea what that means. Should I be concerned?

 

It's basically a design flaw in lots of memory modules - reads consume charge, and the charge is only refreshed every x milliseconds. So too many reads with specific patterns can result in read errors.

 

You got the message because Memtest managed to produce read errors while hammering the memory with these specific patterns - but could not produce any read error when using more normal read patterns.

 

https://www.memtest86.com/troubleshooting.htm

 

It's up to you to decide how concerned you should be. Note that not all versions of Memtest have produced this warning.

 

In the end, only ECC memory gives you some reasonable form of guarantee that your machine doesn't read out bad data from RAM.

Link to comment
51 minutes ago, nickp85 said:

But could this trigger the continuing parity errors I’m getting?  Still have about 16 hours left on the memtest to get to a full 24h test.

 

Not sure anyone can answer that one-million-dollar question. Most probably not.

Link to comment

Over 24 hours on the memtest and 0 errors. Only a note about the hammer test twice in all the passes that have completed. I’d say the memory is looking good.

 

What else could be triggering the parity errors that keep popping up?  Any other tests that can be run?

 

Thanks

Link to comment

Some possible reasons:

  1. Overclocking or overheating the CPU.
  2. Overclocking or overheating the chipset.
  3. Defective CPU.
  4. Unclean power.
  5. Bad SATA cable (would normally show up as UDMA CRC count in the SMART data)
  6. Hardware issue with a SATA controller.
  7. Hardware issue with a disk.
  8. Some program that makes direct disk accesses (/dev/sdX), bypassing the parity system (/dev/mdX)
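For item 5, the CRC counter can be read from the SMART attribute table. Below is a minimal sketch that pulls attribute 199 (UDMA_CRC_Error_Count) out of text in the usual `smartctl -A /dev/sdX` layout. The helper name `udma_crc_count` and the sample text are made up for illustration; the sample is not from the OP's disks.

```python
def udma_crc_count(smartctl_output):
    """Return the raw value of SMART attribute 199, or None if it is not listed."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows have 10 columns: ID, name, flag, value, worst,
        # thresh, type, updated, when_failed, raw value.
        if len(fields) >= 10 and fields[0] == "199":
            return int(fields[9])
    return None

# Illustrative sample only (assumed layout, invented values).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       3
"""

count = udma_crc_count(SAMPLE)
print(count)
```

A raw value that keeps increasing between checks would point towards a cable or connection problem; a value that stays constant is only history.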
Link to comment
Just now, pwm said:

Some possible reasons:

  1. Overclocking or overheating the CPU.
  2. Overclocking or overheating the chipset.
  3. Defective CPU.
  4. Unclean power.
  5. Bad SATA cable (would normally show up as UDMA CRC count in the SMART data)
  6. Hardware issue with a SATA controller.
  7. Hardware issue with a disk.
  8. Some program that makes direct disk accesses (/dev/sdX), bypassing the parity system (/dev/mdX)

 

1. not overclocking anything, CPU temp is good

2. not overclocking anything, chipset temp looks good

3. Wouldn't a bad CPU trigger a memtest error since it is somewhat involved in the testing?  if not, how to test the CPU?

4. I have a Cyberpower CP1500PFCLCD for battery backup and power conditioning

5. quick SMART test on all disks is clean

6. How to test for this?

7. Not getting any errors in log or otherwise.  How to test for this?

8. I'm running a pretty basic unraid setup with 1 VM and some docker containers for media streaming

 

Is it possible to turn off that new CPU pre-empt setting in unraid 6.5.3 and see if the parity errors go away?  I don't mind the VM boot up time if it fixes the parity sync errors.

Link to comment

The memtest program is intended to stress the memory - that doesn't mean it stresses the CPU itself.

 

A defective CPU need not fail consistently - it's common for overclocked chips to crash more and more often. But non-overclocked chips can also have defects that gradually worsen.

 

It's hard to test for issues with SATA controllers or disks unless you see specific issues that are solved by either switching to a different controller (or controller port) or switching to a different disk.

 

Unclean power need not be unclean mains power - it can also be the PSU itself that isn't supplying stable voltages over all load conditions.

Link to comment

This post provides some more information on the hardware that the OP is using:

 

    https://lime-technology.com/forums/topic/72172-unraid-os-version-653-available/?page=3&tab=comments#comment-664393

 

 

Note the Marvell based SATA card.  These cards have been involved in past problems.  See here for a rather lengthy discussion:

 

      https://lime-technology.com/forums/topic/39003-marvell-disk-controller-chipsets-and-virtualization/

 

I believe the current general recommendation is to avoid Marvell based controllers and use the ones based on the LSI chipsets.

 

Link to comment

These errors suggest a hardware problem. If the problem is not the RAM (and we can't be sure it isn't, since memtest can only prove that RAM is bad, not that it's good), the next likely candidates would be the board or the Marvell controller. Since you have two unused SATA ports on the onboard controller, I would start by connecting all drives there so you can rule out the Marvell controller, although AFAIK sync errors caused by Marvell controllers are limited to the SASLP, SAS2LP and related controllers.

Link to comment

I’ve had the Marvell card since January and passed a parity check in April with 0 errors. Can’t use SATA5/6 on the main controller because I have two M.2 drives in a cache pool, and using both slots on the board disables the last two SATA ports, hence the extra card I added.

Link to comment
8 minutes ago, nickp85 said:

I’ve had the Marvell card since January and passed a parity check in April with 0 errors.

That doesn't mean anything; any piece of hardware can fail at any time. You'll need to start trying things out to find where the problem is. One easy thing to try is to reduce the number of DIMMs, e.g., use only half. If there are still errors, use the other half in different slots, and you can rule out the RAM and slots.

Link to comment

Stupid Marvell SATA card... found a way to slow the speed of one M.2 slot to x2 from x4 and regain my onboard SATA 5/6 ports. Did a parity check to correct any errors and am now running a second one to verify it's clean. It’s 90% complete with 0 errors.

 

Ban the Marvell SATA cards! :)

 

Edit: Parity check complete with 0 errors.  It was the damn Marvell card.  Less than 5 months to go bad.

Edited by nickp85
Link to comment
