Parity check errors (1 or 2): best method to find source of issue


I ended up with the first parity errors on my system last week; my build is about four months old and had no issues until now. I re-ran the parity check (with auto-correction on, not knowing that I should leave it non-correcting). The errors persisted, so I did a clean reboot of the server, having read that an improper shutdown might have contributed.

 

I do have my system on a UPS, and though we have had power outages recently, the system did not have a power cut.

 

All that said, the manual parity check I ran after boot (still accidentally with auto-correction on, as I had not found that information in my research at that time) also showed a couple of parity errors. I am running another parity check, this time properly with no corrections, but my question is this:

 

How do I trace down the source of the issues? All my drives appear to be ok and I see zero read/write errors. If there is a drive failing, I would obviously like to track that down asap, but I need a helpful hand to point me to where I would find the clues.

 

My most recent diagnostic is attached.

 

Any assistance would be most appreciated.

denki-3k-diagnostics-20210916-1205.zip

Link to comment
Sep 14 09:06:42 DENKI-3K kernel: mdcmd (36): check 
Sep 14 09:06:42 DENKI-3K kernel: md: recovery thread: check P ...
Sep 14 21:31:02 DENKI-3K kernel: md: recovery thread: P corrected, sector=10314337816
Sep 14 23:50:55 DENKI-3K kernel: md: recovery thread: P corrected, sector=11576472504
Sep 15 00:07:56 DENKI-3K kernel: md: sync done. time=54074sec
Sep 15 00:07:56 DENKI-3K kernel: md: recovery thread: exit status: 0
Sep 15 08:46:26 DENKI-3K kernel: mdcmd (37): check 
Sep 15 08:46:26 DENKI-3K kernel: md: recovery thread: check P ...
Sep 15 10:57:03 DENKI-3K kernel: md: recovery thread: P corrected, sector=2339925056
Sep 15 23:36:49 DENKI-3K kernel: md: sync done. time=53423sec
Sep 15 23:36:49 DENKI-3K kernel: md: recovery thread: exit status: 0

Consecutive correcting checks, each finding and correcting different sectors. No obvious disk, connection, or controller problems logged.
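
For anyone who wants to pull the same lines out of their own syslog, here is a rough Python sketch. It assumes the log format shown in the excerpt above; the function name `parity_corrections` and the `SAMPLE` text are only illustrative and are not part of Unraid.

```python
import re

# Sample lines copied from the syslog excerpt above.
SAMPLE = """\
Sep 14 09:06:42 DENKI-3K kernel: mdcmd (36): check
Sep 14 09:06:42 DENKI-3K kernel: md: recovery thread: check P ...
Sep 14 21:31:02 DENKI-3K kernel: md: recovery thread: P corrected, sector=10314337816
Sep 14 23:50:55 DENKI-3K kernel: md: recovery thread: P corrected, sector=11576472504
Sep 15 00:07:56 DENKI-3K kernel: md: sync done. time=54074sec
Sep 15 08:46:26 DENKI-3K kernel: mdcmd (37): check
Sep 15 08:46:26 DENKI-3K kernel: md: recovery thread: check P ...
Sep 15 10:57:03 DENKI-3K kernel: md: recovery thread: P corrected, sector=2339925056
Sep 15 23:36:49 DENKI-3K kernel: md: sync done. time=53423sec
"""

# Assumed patterns, based only on the lines above.
CHECK_START = re.compile(r"kernel: mdcmd \((\d+)\): check")
CORRECTED = re.compile(r"recovery thread: P corrected, sector=(\d+)")


def parity_corrections(log_text):
    """Group corrected sectors by the mdcmd check that reported them."""
    runs = {}
    current = None
    for line in log_text.splitlines():
        m = CHECK_START.search(line)
        if m:
            current = int(m.group(1))
            runs[current] = []
            continue
        m = CORRECTED.search(line)
        if m and current is not None:
            runs[current].append(int(m.group(1)))
    return runs


runs = parity_corrections(SAMPLE)
for cmd, sectors in runs.items():
    print(f"check {cmd}: {len(sectors)} corrected sector(s): {sectors}")

# Sectors corrected in every run; an empty result means each check hit different sectors.
all_sets = [set(s) for s in runs.values()]
repeated = set.intersection(*all_sets) if all_sets else set()
print("repeated sectors:", sorted(repeated))
```

If the second correcting check had found the same sectors again, that would point more toward a disk; different sectors each time is what makes it look like something upstream of the disks.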

 

Also

Sep 14 20:43:22 DENKI-3K kernel: traps: rslsync[13445] general protection fault ip:14fc4cc194b4 sp:14fc4937fba0 error:0 in libc-2.27.so[14fc4cb81000+1e7000]
Sep 15 02:02:02 DENKI-3K kernel: Plex Media Scan[4807]: segfault at 10 ip 0000152a2ed7e98e sp 0000152a2bb979d0 error 4 in Plex Media Scanner[152a2eb6d000+265000]

GPF and segfault
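
A similar rough sketch for spotting these crash lines in a syslog, again in Python. The regular expression is an assumption built only from the two lines above, and `find_crashes` is a made-up helper name; other kernels may format these messages differently.

```python
import re

# The two crash lines quoted above.
SAMPLE = """\
Sep 14 20:43:22 DENKI-3K kernel: traps: rslsync[13445] general protection fault ip:14fc4cc194b4 sp:14fc4937fba0 error:0 in libc-2.27.so[14fc4cb81000+1e7000]
Sep 15 02:02:02 DENKI-3K kernel: Plex Media Scan[4807]: segfault at 10 ip 0000152a2ed7e98e sp 0000152a2bb979d0 error 4 in Plex Media Scanner[152a2eb6d000+265000]
"""

# Assumed pattern: process name, PID in brackets, then the fault type.
CRASH = re.compile(
    r"kernel: (?:traps: )?(?P<proc>.+?)\[(?P<pid>\d+)\]:? "
    r"(?P<kind>general protection fault|segfault)"
)


def find_crashes(log_text):
    """Return (process, pid, kind) for each GPF or segfault line found."""
    crashes = []
    for line in log_text.splitlines():
        m = CRASH.search(line)
        if m:
            crashes.append((m.group("proc"), int(m.group("pid")), m.group("kind")))
    return crashes


crashes = find_crashes(SAMPLE)
for proc, pid, kind in crashes:
    print(f"{proc} (pid {pid}): {kind}")

# Crashes spread across unrelated programs are more suspicious than one program crashing repeatedly.
distinct_processes = {proc for proc, _, _ in crashes}
print("distinct processes:", len(distinct_processes))
```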

 

This all suggests bad RAM, and you don't want to run any OS with bad RAM, since everything goes through RAM: your data, the executable code, everything.

 

You should shut down and run memtest.

Link to comment

Just to update: though I had to make a dedicated UEFI memtest key to get the test to run, it did start finding memory errors on the second test pass. Very disappointed in Kingston, as this was new, identical, matched RAM for this server. Now I need to figure out getting a replacement as, based on the report, the failures seem to be scattered through all four DIMMs. >:(

 

pm

 

Link to comment

My Unraid system is fairly modest, being the first one I built. It is based on an ASUS X99-A/USB 3.1 motherboard with an Intel Core™ i7-5930K CPU @ 3.50GHz, and at the moment it has some Kingston HyperX DDR4 RAM modules in it that will be replaced today. My power supply is a Corsair RM750x powering my seven 6 TB WD Red NAS drives and two SSDs for cache. I do have a nice CyberPower UPS connected (I have made that mistake before, and I am not willing to risk server damage on a power cut).

 

I have done no tuning or overclocking on this system at all, only making the necessary BIOS changes to support what I needed in Unraid.

 

Since the memory started showing errors in a memtest with basically default settings, I am fairly confident that there is a lingering RAM issue. However, I was also being very budget focused when I bought the RAM, and it is possible that it was never an ideal match for this motherboard. I am picking up some Corsair Vengeance DDR4 3000 MHz modules today that are specifically listed in the motherboard compatibility list and will install, test, and report back.

 

I have been so impressed with Unraid in general, but I am still learning the ropes.

 

pm

Link to comment

Hmmm... so a bit more research, and I may have found a possible reason for my RAM errors. My motherboard supports quad channel memory, but I bought two Kingston DDR4 dual channel kits thinking that would be fine. As I dig into it further, putting those two dual channel kits in a four channel arrangement could potentially lead to some oddness. I believe the motherboard is treating the RAM as quad channel compatible (exactly the same speed and timing), but I am wondering out loud whether, because I didn't use an actual quad channel DDR4 kit, minor differences between the two kits are causing issues, but only some of the time.

 

It will possibly require more server downtime to test than I am really keen on, but I think my short-term solution is this: since I have 8 RAM slots, I will install the two dual channel kits in the top two banks of each channel, giving me the same overall RAM but no quad channel confusion. The likelihood of finding a quad channel kit for my old motherboard is slim to none, so I guess I start planning for a new server...  ;)

 

pm

Link to comment

Final update, for those interested.

 

Running Memtest on my original RAM (Kingston HyperX Fury 2666 MHz, 2 sets of dual channel 16 GB kits for a total of 4 x 8 GB DDR4 sticks).

 

Original DIMM configuration as per the ASUS motherboard guide with all four sticks in one quad-channel bank, ran into Memtest errors on the second pass.

 

Current config: 2 x 8 GB dual pairs in the top two DIMM slots of each channel. In my motherboard this would occupy the closest four DIMM slots to the CPU. In both cases the motherboard had no problem identifying all 4 DIMM modules and that I had 32 GB installed. In the first config, however, the auto-detected frequency was 2133 MHz. In my second config the detected frequency is 2400 MHz. Keep in mind I have not done any custom setup of RAM or CPU. Memtest on this config has been clear for the last 3 hours of testing (no errors). If it makes it through 3 passes I will be satisfied for the time being and let Unraid boot and monitor for changes.

 

I would also like to mention that the Corsair Vengeance DDR4 3000 MHz sets that I purchased yesterday also showed errors when I installed them all in one quad channel bank. Again, in hindsight, maybe not a surprise, as I was using two identical dual channel kits, not one complete four channel kit.

 

I assume that, barring the fluke of finding an in-stock quad channel RAM kit that happens to be on the ASUS QVL for this motherboard, I am basically kissing the bottom two banks of each quad channel goodbye. Luckily the memory I have is sufficient for my current needs, and all this will be considered for any future builds.

 

pm

 

 

Edited by thepensivemonk
Link to comment
