
Parity Errors Nearly Every Check



I set up unRAID back in October of last year and have had parity errors on every check since. A few sectors may persist from check to check, but overall the reported sectors differ from run to run, whether I run a correcting check or a standard one.


Back in November I was seeing about 100-150 errors per check. I decided to upgrade a lot of the system hardware (CPU, motherboard, RAM), going from an i5 8600k with 128GB of RAM to a 14900k with 96GB (2x48) of RAM, but I was still seeing a few parity errors. CPU temps never go over 90C under very heavy load, and HDD temps are usually around 28C. I am running the HX1000i PSU that I bought back in 2018; this would be my next item to replace. Everything is mounted in a Rosewill RSV-L4412U, and I'll break the hardware out below as well. I have plenty of airflow, and the server is located in my basement, which is perpetually cooler than the rest of the house. I do not currently have a UPS, but I have had no issues with unclean shutdowns recently, and between the last correcting scan and today there have been no power-off events.


Main Hardware:

  • CPU: Intel i9 14900K
  • CPU Cooler: Phantom Spirit 120SE
  • Motherboard: Asus Z790 Creator Wifi BIOS 1501 (Not updated to the latest BIOS as I saw complaints about stability)
  • RAM: 96GB (2x48GB) DDR5-6800 - CTCWD596G6800HC36DDC01
  • GPU: Nvidia Zotac 2080ti (Unused currently)
  • PSU: Corsair HX1000i (from 2018)
  • Additional SATA / SAS cards that were swapped as talked about below.

 

The problem XFS array consists of:

  • 4x 20TB Seagate EXOS (1 of which is parity)
  • 2x 20TB Western Digital Golds

 

Cache Pool:

  • 2x Samsung 970 EVOs NVMe (Mirrored)

 


The hardware upgrade greatly reduced the errors I was seeing but unfortunately did not eliminate them; I am now seeing between 8 and 15 errors per check. I was originally running an LSI 9211-8i with the latest patch, and I used the SAS2 software to verify which patch it was running. I even tried replacing the LSI 9211-8i with a second one, and I replaced and re-seated all the SAS breakout cables twice. My latest change was moving to an ASMedia 2116 6-port card that I manually patched to be compatible with the Z790 chipset. Unfortunately, the same errors are still occurring after my latest correcting pass.

 

Things I've done so far to try to rule out the current hardware:

  •  Memtest for 24+ hours, multiple passes, no errors shown.
  •  Ran the long SMART test on each of my HDDs; all came back as completed without error.
  •  Validated that my RAM is on the motherboard's QVL for my CPU
  •  Replaced all the cabling between HDDs and motherboard and SATA/SAS cards.
  •  Replaced a lot of the old main components that are the usual suspects.

 

 

Here are the system logs from the last corrective pass and the currently in-progress check, run 4 days apart:

 

    Jan 12 17:47:05 Plex kernel: mdcmd (40): check correct
    Jan 12 17:47:05 Plex kernel: md: recovery thread: check P ...
    Jan 12 18:04:51 Plex kernel: md: recovery thread: P corrected, sector=408944536
    Jan 12 18:04:51 Plex kernel: md: recovery thread: P corrected, sector=408944568
    Jan 12 23:01:48 Plex kernel: md: recovery thread: P corrected, sector=7729962992
    Jan 12 23:06:37 Plex kernel: md: recovery thread: P corrected, sector=7857350816
    Jan 13 01:39:16 Plex kernel: md: recovery thread: P corrected, sector=11449856280
    Jan 13 02:36:11 Plex kernel: md: recovery thread: P corrected, sector=12737611824
    Jan 13 02:56:38 Plex kernel: md: recovery thread: P corrected, sector=13215764648
    Jan 13 04:17:23 Plex kernel: md: recovery thread: P corrected, sector=15261778672
    Jan 13 05:57:56 Plex kernel: md: recovery thread: P corrected, sector=17566756896
    Jan 13 07:25:04 Plex kernel: md: recovery thread: P corrected, sector=19574770480
    Jan 13 10:38:27 Plex kernel: md: recovery thread: P corrected, sector=23683676352
    Jan 13 14:38:00 Plex kernel: md: recovery thread: P corrected, sector=28142901992
    Jan 13 15:53:40 Plex kernel: md: recovery thread: P corrected, sector=29476919208
    Jan 14 00:20:12 Plex kernel: md: recovery thread: P corrected, sector=36978404352
    Jan 14 03:49:34 Plex kernel: md: sync done. time=122549sec
    Jan 14 03:49:34 Plex kernel: md: recovery thread: exit status: 0
    Jan 17 09:47:49 Plex kernel: mdcmd (41): check nocorrect
    Jan 17 09:47:49 Plex kernel: md: recovery thread: check P ...
    Jan 17 10:49:18 Plex kernel: md: recovery thread: P incorrect, sector=1520616296
    Jan 17 12:13:35 Plex kernel: md: recovery thread: P incorrect, sector=3337558520
    Jan 17 15:27:49 Plex kernel: md: recovery thread: P incorrect, sector=7729962992
    Jan 17 15:52:03 Plex kernel: md: recovery thread: P incorrect, sector=8299543424
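
With two checks in one syslog, it is easy to miss a sector that shows up in both. Here is a minimal sketch that groups the reported sectors per check and lists the ones that repeat. It assumes the log lines look exactly like the ones above; the function names and the `LOG_EXCERPT` variable are made up for illustration, and the excerpt is only a subset of the lines posted here.

```python
import re
from collections import Counter

# A new check starts at lines like "mdcmd (40): check correct".
START_RE = re.compile(r"mdcmd \(\d+\): check (correct|nocorrect)")
# Parity mismatches look like "P corrected, sector=N" or "P incorrect, sector=N".
SECTOR_RE = re.compile(r"recovery thread: P (corrected|incorrect), sector=(\d+)")

# Excerpt of the syslog lines posted above (not the full log).
LOG_EXCERPT = """\
Jan 12 17:47:05 Plex kernel: mdcmd (40): check correct
Jan 12 18:04:51 Plex kernel: md: recovery thread: P corrected, sector=408944536
Jan 12 23:01:48 Plex kernel: md: recovery thread: P corrected, sector=7729962992
Jan 12 23:06:37 Plex kernel: md: recovery thread: P corrected, sector=7857350816
Jan 17 09:47:49 Plex kernel: mdcmd (41): check nocorrect
Jan 17 10:49:18 Plex kernel: md: recovery thread: P incorrect, sector=1520616296
Jan 17 15:27:49 Plex kernel: md: recovery thread: P incorrect, sector=7729962992
Jan 17 15:52:03 Plex kernel: md: recovery thread: P incorrect, sector=8299543424
"""


def sectors_per_check(lines):
    """Return one set of reported sector numbers per parity check, in log order."""
    checks = []
    for line in lines:
        if START_RE.search(line):
            checks.append(set())
            continue
        match = SECTOR_RE.search(line)
        if match and checks:
            checks[-1].add(int(match.group(2)))
    return checks


def repeated_sectors(checks):
    """Return the sectors that were reported in more than one check."""
    counts = Counter(sector for check in checks for sector in check)
    return {sector for sector, seen in counts.items() if seen > 1}


checks = sectors_per_check(LOG_EXCERPT.splitlines())
print(sorted(repeated_sectors(checks)))  # prints [7729962992]
```

In the logs above, sector 7729962992 was corrected on Jan 12 and flagged again on Jan 17, while the other sectors differ between the two runs.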


Here's the screenshot of my parity operation history for reference.

WrLlPqw.png

 

Here's a screenshot of my array and pool devices.

fiDEC1w.png

 

Should I bother replacing the PSU? What else am I overlooking?

Edited by SPICYBOYS

I started doing MD5 hash checking tests on individual sectors and saw the hash for some of them randomly flip here and there. I wasn't sure if something was writing to the sector, but then the hash would go back to the original value, which I thought was a little weird. I'm not sure if memory bits were randomly flipping, but out of 600-800 runs I saw a hash flip maybe 5 times, across 2 of the 6 sectors.
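
For anyone wanting to try a similar test, here is a minimal sketch of a repeated hash check. It is not the exact script used here, just an illustration: the function names are hypothetical, the 512-byte sector size is an assumption, and the device path in the usage comment is a placeholder. Note that repeated reads of the same range are likely to be served from the Linux page cache rather than the disk, so a digest that flips and then flips back tends to point at RAM rather than at the drive.

```python
import hashlib
from collections import Counter

SECTOR_SIZE = 512  # assumption: sector numbers are in 512-byte units


def hash_sector_range(path, first_sector, sector_count=8):
    """Return the MD5 hex digest of sector_count sectors starting at first_sector."""
    with open(path, "rb") as dev:
        dev.seek(first_sector * SECTOR_SIZE)
        data = dev.read(sector_count * SECTOR_SIZE)
    return hashlib.md5(data).hexdigest()


def repeated_hashes(path, first_sector, runs=100, sector_count=8):
    """Hash the same range `runs` times and count how often each digest appears."""
    return Counter(
        hash_sector_range(path, first_sector, sector_count) for _ in range(runs)
    )


# Hypothetical usage (placeholder device path, needs read access to the device):
#   counts = repeated_hashes("/dev/sdX", first_sector=7729962992, runs=800)
#   More than one key in `counts` means the same range hashed differently
#   on different runs, which should never happen while nothing is writing to it.
```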

 

I then paid for Memtest86 Pro and ran a new test to rule out RAM issues. On the 3rd and 4th runs I got 3 more errors on my first stick. Looking at my XMP settings, it turns out XMP I on this motherboard is an Asus-enhanced version that doesn't apply the full XMP profile and auto-tunes some things. I swapped to XMP II, which runs the stock XMP settings of 6800 MHz at 1.4v, and got through about 1 test before I started seeing errors again. I then swapped to the secondary XMP profile, which runs at a much lower speed of 6000 MHz with different timings and only 1.3v. This ran for 7 tests with no errors.

 

Test 1:

  • XMP I
  • 6800 MHz 1.4v
  • 3 Errors in 2 full tests.
  • Max RAM Temp 59C

 

Test 2:

  • XMP II
  • 6800 MHz 1.4v
  • 6 errors in 4 full tests.
  • Max RAM Temp 58C

 

Test 3:

  • XMP II
  • 6000 MHz 1.3v
  • 0 Errors after nearly 8 full tests
  • Max RAM Temp 52C

 

Not sure why I didn't catch these errors the first time around; maybe the ambient temps were cooler, or the older Memtest86+ was somehow different and didn't stress the RAM enough. Not sure if it's worth trying to swap the RAM for a different set or just living with the slower speeds.

18 minutes ago, itimpi said:

Normally ANY XMP profile is an overclock and bad for stability.

 

Yep, totally understand that. That's why I ran a long memtest after setting the new hardware up, hoping to rule out instability as an issue. I really just wanted to give a *fix* update for anyone who eventually stumbles across this post. So many unRAID posts lead to dead links or have no updates.

 

Somehow my initial memtest was flawed. The RAM kit I have is pretty premium and overkill, but it was on the QVL as a stable, larger (48GBx2) 2-stick option. I either lost the memory controller lotto or the airflow in my case is lacking on the RAM side; I do have a rather large CPU heatsink sitting over the 1 stick that was throwing errors.

 

I'm currently running a new correcting parity check and have seen 1 correction at 27%. My server configs bugged out a while back and random settings got changed, so I lost my previous syslogs and can't verify whether this sector was "corrected" before or not. I've gone past a few sectors that previously needed correction (I had saved them off in a Notepad++ doc), and there is nothing to report for them now, so those could have been caused by random memory address issues. Based on the timeline of my parity scans and how few errors I saw even when trying to force memory problems, I can see how I randomly got only a few corrections per check, and why they were not 100% consistent between scans.
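
To tell whether a check at 27% has already passed a sector saved from an earlier run, the sector number can be turned into a rough percentage of the parity disk. This is only a sketch under two assumptions: that the sector numbers in the syslog are 512-byte units, and that the parity disk is a nominal 20TB (the real capacity is slightly different, so treat the result as approximate). The function names are made up for illustration.

```python
SECTOR_SIZE = 512         # assumption: syslog sector numbers are 512-byte units
DISK_BYTES = 20 * 10**12  # assumption: nominal capacity of the 20TB parity disk


def sector_to_percent(sector, disk_bytes=DISK_BYTES, sector_size=SECTOR_SIZE):
    """Approximate position of a sector as a percentage of the parity disk."""
    return 100.0 * sector * sector_size / disk_bytes


def already_checked(sector, progress_percent):
    """True if a check at progress_percent should already have covered the sector."""
    return sector_to_percent(sector) <= progress_percent


# Sectors taken from the earlier syslog, compared against a check that is at 27%.
for sector in (408944536, 7729962992, 11449856280):
    position = round(sector_to_percent(sector), 1)
    print(sector, position, already_checked(sector, 27.0))
```

By this estimate, the repeat offender from the earlier logs (sector 7729962992) sits a little under 20% into the disk, so a check that has reached 27% without flagging it has already gone past it.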

