Parity Check Error Question (SOLVED)

nickp85 · June 15, 2018

Hi all, quick question probably. I had a kernel panic yesterday, which seems to have triggered an automatic parity check on reboot. The parity check option was set to write corrections, and it found 52 errors. Do I need to be concerned, or does it just write the corrections and I'm good to go again?

First time I've ever had a kernel panic and gone through this.

Thanks!

Edited June 19, 2018 by nickp85

JonathanM · June 15, 2018

26 minutes ago, nickp85 said:

Do I need to be concerned, or does it just write the corrections and I'm good to go again?

Yes, be concerned, but writing the corrections should be enough. Perhaps run a non-correcting check afterwards and confirm it finds zero errors, just to be sure.

nickp85 · June 15, 2018

3 hours ago, jonathanm said:

Yes, be concerned, but writing the corrections should be enough. Perhaps run a non-correcting check afterwards and confirm it finds zero errors, just to be sure.

So I started a non-correcting check and it's about 15% in and found 3 errors already. This is a brand new machine built back in December. Parity drive is a new WD red 4TB. Have 3 other identical 4 TB drives and 2 3 TB reds I took from my old NAS (about 5 years old). Drives all seem to be working fine and pass SMART tests. Been running parity check on a 2 month schedule without issue until this kernel panic and reboot. Someone said the kernel panic was because I updated from 6.5.3rc2 to 6.5.3 so I did what was suggested and backed up the config file and reloaded the stick with a fresh copy of 6.5.3. No more panics but now getting these parity errors.

Anything I should do from here?

JorgeB · June 15, 2018

4 hours ago, nickp85 said:

I had a kernel panic yesterday, which seems to have triggered an automatic parity check on reboot. The parity check option was set to write corrections, and it found 52 errors.

The automatic parity check would have been non-correcting. Did you cancel it and change it to correcting? If not, it's normal to find the same errors again. If you haven't rebooted yet, the diagnostics will also show it.
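To make the difference between the two check modes concrete, here is a toy sketch. It assumes single parity is simply the byte-wise XOR of the data disks, which is a simplification; the function names are illustrative and are not Unraid code.

```python
from functools import reduce
from typing import List


def compute_parity(data_disks: List[bytes]) -> bytes:
    """Byte-wise XOR of all data disks (toy model of single parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*data_disks))


def parity_check(data_disks: List[bytes], parity: bytearray, correct: bool) -> int:
    """Count positions where the stored parity disagrees with the data.

    With correct=True the parity is rewritten in place. With correct=False
    it is left untouched, so the same errors are reported on the next run.
    """
    expected = compute_parity(data_disks)
    errors = 0
    for i, (want, have) in enumerate(zip(expected, parity)):
        if want != have:
            errors += 1
            if correct:
                parity[i] = want
    return errors
```

In this model, a non-correcting check that follows a completed correcting check should report zero errors. If it reports new ones, something is changing the data or parity between runs, or the reads themselves are unreliable.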

nickp85 · June 15, 2018

8 hours ago, johnnie.black said:

The automatic parity check would have been non-correcting. Did you cancel it and change it to correcting? If not, it's normal to find the same errors again. If you haven't rebooted yet, the diagnostics will also show it.

I think I did cancel it because I didn’t know why it was running, since it wasn’t scheduled to start yet. But then I realized why and triggered it manually. It corrected 53 errors and said parity was valid. Then I ran another one without corrections enabled and it found 121. This is less than a day after correcting the first 53.

What's going on? Could it be that new pre-emptive setting in 6.5.3? What should I do?

Date	Duration	Speed	Status	Errors
2018-06-15, 11:12:14	12 hr, 26 min, 55 sec	89.3 MB/s	OK	121
2018-06-14, 16:30:04	19 hr, 33 min, 1 sec	56.8 MB/s	OK	52
2018-06-13, 20:26:55	29 min, 5 sec	Unavailable	Canceled	0
2018-04-04, 16:18:16	12 hr, 18 min, 15 sec	90.3 MB/s	OK	0
2018-01-03, 20:41:40	11 hr, 41 min, 39 sec	95.0 MB/s	OK	0
2018-01-03, 15:41:40	11 hr, 41 min, 39 sec	95.0 MB/s	OK	0
2017-12-31, 04:50:28	7 hr, 12 min, 55 sec	154.0 MB/s	OK	0
2017-12-09, 22:20:07	8 hr, 56 sec	138.6 MB/s	OK	0

Edited June 15, 2018 by nickp85

JorgeB · June 15, 2018

Post your diagnostics; they might show something. If you're not using ECC RAM, you should also run memtest.

nickp85 · June 15, 2018

2 hours ago, johnnie.black said:

Post your diagnostics; they might show something. If you're not using ECC RAM, you should also run memtest.

Here you go, thank you!

nicknas2-diagnostics-20180615-1347.zip

JorgeB · June 15, 2018

I don't see anything that would indicate a disk problem. Since you're not using ECC RAM, running memtest for 24 hours is where I would start.

nickp85 · June 16, 2018

10 hours ago, johnnie.black said:

I don't see anything that would indicate a disk problem. Since you're not using ECC RAM, running memtest for 24 hours is where I would start.

Tried to boot memtest from the Unraid boot menu. After hitting Enter on the memtest option, the screen shows /memtest and then the machine just reboots. How come memtest won't load off the stick?

Squid · June 16, 2018

15 minutes ago, nickp85 said:

Tried to boot memtest from the Unraid boot menu. After hitting Enter on the memtest option, the screen shows /memtest and then the machine just reboots. How come memtest won't load off the stick?

Instead of booting the stick via UEFI, try booting it in Legacy mode (you should be able to select that via your BIOS boot order), and then try memtest.

nickp85 · June 16, 2018

4 minutes ago, Squid said:

Instead of booting the stick via UEFI, try booting it in Legacy mode (should be able to select that via your BIOS boot order), and then try memtest

Thanks, figured it was UEFI. Made another USB stick using the UEFI compatible memtest and booted off that. It’s running. Let’s see what happens.

Btw after parity check ended with 121 without corrections on, I ran it again to correct those and it ended up correcting 123 I think.

Two recent hardware changes

1. I added another 16 GB of RAM, so now I have 4x8 GB DDR4-3000, but that was almost 3 months ago, and a parity check ran about a week after the extra RAM was installed without any errors.

2. Motherboard died recently and I swapped it out for a brand new one same model. That was just under a month ago. Wouldn’t POST anymore. New board fixed the booting issue so I figured all good.

This is all new hardware purchased on or after November/December 2017.

nickp85 · June 16, 2018

Memtest still running, no errors yet but there is a “note”

[Note] RAM may be vulnerable to high frequency row hammer bit flips

no idea what that means. Should I be concerned?

Edited June 16, 2018 by nickp85

pwm · June 16, 2018

25 minutes ago, nickp85 said:

Memtest still running, no errors yet but there is a “note”

[Note] RAM may be vulnerable to high frequency row hammer bit flips

no idea what that means. Should I be concerned?

It's basically a design flaw in lots of memory modules: reads consume charge, and the charge is only refreshed every x milliseconds. So too many reads with specific patterns can result in read errors.

You got the message because Memtest managed to produce read errors while hammering the memory with these specific patterns - but could not produce any read error when using more normal read patterns.

https://www.memtest86.com/troubleshooting.htm

It's up to you to decide how concerned you should be. Note that not all versions of Memtest have produced this warning.

In the end, only ECC memory gives you some reasonable form of guarantee that your machine doesn't read out bad data from RAM.

nickp85 · June 16, 2018

But could this trigger the continuing parity errors I’m getting? Still have about 16 hours left on the memtest to get to a full 24h test.

Edited June 16, 2018 by nickp85

pwm · June 16, 2018

51 minutes ago, nickp85 said:

But could this trigger the continuing parity errors I’m getting? Still have about 16 hours left on the memtest to get to a full 24h test.

Not sure anyone can answer that one-million-dollar question. Most probably not.

nickp85 · June 17, 2018

Over 24 hours on the memtest and 0 errors. Only a note about the hammer test twice in all the passes that have completed. I’d say the memory is looking good.

What else could be triggering the parity errors that keep popping up? Any other tests that can be run?

Thanks

pwm · June 17, 2018

Some possible reasons:

Overclocking or overheating the CPU.
Overclocking or overheating the chipset.
Defective CPU.
Unclean power.
Bad SATA cable (would normally show up as UDMA CRC count in the SMART data)
Hardware issue with a SATA controller.
Hardware issue with a disk.
Some program that makes direct disk accesses (/dev/sdX), bypassing the parity system (/dev/mdX)
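
For the SATA cable item, a minimal sketch of pulling the CRC counter out of `smartctl -A` text output. It assumes the usual ten-column attribute table and the attribute name `UDMA_CRC_Error_Count`; the helper name and the sample text are illustrative and are not taken from the diagnostics posted in this thread.

```python
from typing import Optional

ATTRIBUTE = "UDMA_CRC_Error_Count"  # assumed attribute name, usually ID 199


def crc_error_count(smart_output: str) -> Optional[int]:
    """Return the raw UDMA CRC error count from `smartctl -A` text, if present."""
    for line in smart_output.splitlines():
        fields = line.split()
        # Attribute rows have at least ten columns; the raw value is the tenth.
        if len(fields) >= 10 and fields[1] == ATTRIBUTE:
            try:
                return int(fields[9])
            except ValueError:
                return None
    return None


# Illustrative sample only, not real output from the OP's drives.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   094   094   000    Old_age   Always       -       4821
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""

print(crc_error_count(SAMPLE))
```

A count that keeps rising between checks would point at the cable or connector. A count that stays at zero does not clear the controller, since a faulty controller need not produce CRC errors on the link.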

nickp85 · June 17, 2018

Just now, pwm said:

Some possible reasons:

Overclocking or overheating the CPU.

Overclocking or overheating the chipset.

Defective CPU.

Unclean power.

Bad SATA cable (would normally show up as UDMA CRC count in the SMART data)

Hardware issue with a SATA controller.

Hardware issue with a disk.

Some program that makes direct disk accesses (/dev/sdX), bypassing the parity system (/dev/mdX)

1. not overclocking anything, CPU temp is good

2. not overclocking anything, chipset temp looks good

3. Wouldn't a bad CPU trigger a memtest error since it is somewhat involved in the testing? if not, how to test the CPU?

4. I have a Cyberpower CP1500PFCLCD for battery backup and power conditioning

5. quick SMART test on all disks is clean

6. How to test for this?

7. Not getting any errors in log or otherwise. How to test for this?

8. I'm running a pretty basic unraid setup with 1 VM and some docker containers for media streaming

Is it possible to turn off that new CPU pre-empt setting in unraid 6.5.3 and see if the parity errors go away? I don't mind the VM boot up time if it fixes the parity sync errors.

pwm · June 17, 2018

The memtest program is intended to stress the memory - that doesn't mean it stresses the CPU itself.

A defective CPU need not fail consistently. It's common for overclocked chips to crash more and more often, but non-overclocked chips can also have defects that gradually worsen.

It's hard to test for issues with SATA controllers or disks unless you see specific issues that are solved by either switching to a different controller (or controller port) or switching to a different disk.

Unclean power need not be unclean mains power - it can also be the PSU itself that isn't supplying stable voltages over all load conditions.

Frank1940 · June 17, 2018

This post provides some more information on the hardware that the OP is using:

https://lime-technology.com/forums/topic/72172-unraid-os-version-653-available/?page=3&tab=comments#comment-664393

Note the Marvell-based SATA card. These cards have been involved in past problems. See here for a rather lengthy discussion:

https://lime-technology.com/forums/topic/39003-marvell-disk-controller-chipsets-and-virtualization/

I believe the current general recommendation is to avoid Marvell-based controllers and use the ones based on the LSI chipsets.

JorgeB · June 17, 2018

These errors suggest a hardware problem. If the problem is not the RAM (and we can't be sure it isn't, since memtest can only prove that RAM is bad, not that it's good), the next most likely candidates are the board or the Marvell controller. Since you have two unused SATA ports on the onboard controller, I would start by connecting all drives there so you can rule out the Marvell controller, although AFAIK sync errors caused by Marvell controllers are limited to the SASLP, SAS2LP and related controllers.

nickp85 · June 17, 2018

I’ve had the Marvell card since January and passed a parity check in April with 0 errors. I can’t use SATA5/6 on the main controller because I have two M.2 drives in a cache pool, and using both slots on the board disables the last two SATA ports, hence the extra card I added.

JorgeB · June 17, 2018

8 minutes ago, nickp85 said:

I’ve had the Marvell card since January and passed a parity check in April with 0 errors.

That doesn't mean anything; any piece of hardware can fail at any time. You'll need to start trying things out to find where the problem is. One easy thing to try is to reduce the number of DIMMs, e.g., use only half. If there are still errors, use the other half in different slots, and you can rule out the RAM and slots.

nickp85 · June 19, 2018

Stupid Marvell SATA card... I found a way to slow one M.2 slot from x4 to x2 and regain my onboard SATA5/6 ports. Did a correcting parity check to fix any errors and am now running a second one to verify it's clean. It’s 90% complete with 0 errors.

Ban the Marvell SATA cards!

Edit: Parity check complete with 0 errors. It was the damn Marvell card. Less than 5 months to go bad.

Edited June 19, 2018 by nickp85

pwm · June 19, 2018

Which drives did you handle with the Marvell card?
