Writes to caches, array, and USB regularly fail checksums



When updating to 6.11.5 I got a checksum error. I didn't write down the message, but it suggested my USB disk was failing. After 5 retries or so, it worked. This was about a week ago, I believe.

 

When pulling docker containers, some will fail with `docker compose pull failed to register layer: Error processing tar file(exit status 1): archive/tar: invalid tar header`, which a retry then fixes

 

Starting today, Sabnzbd started failing ~95% of downloads. Tested on both an NVMe cache and an SSD cache. Also installed and tested with NZBGet, same result. Installed and tested with NZBVortex on my Mac with the same nzb, and it downloaded successfully. I see lots of `Error importing NzbFile: filename=...` logs in sab.
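A quick way to check for silent corruption without involving any download client is a write/read-back checksum test. The following is only a minimal sketch: the function name `roundtrip_check` and the target directory are hypothetical, and the read-back may be served from the page cache, so a mismatch points at the RAM/CPU path at least as much as at the disk itself.

```python
import hashlib
import os
import tempfile


def roundtrip_check(directory, size_bytes=16 * 1024 * 1024, rounds=5):
    """Write random data, read it back, and count SHA-256 mismatches."""
    mismatches = 0
    for i in range(rounds):
        data = os.urandom(size_bytes)
        expected = hashlib.sha256(data).hexdigest()
        path = os.path.join(directory, f"roundtrip_{i}.bin")

        with open(path, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the data to the device

        # Re-read in chunks and hash what actually comes back
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1024 * 1024), b""):
                digest.update(chunk)
        os.remove(path)

        if digest.hexdigest() != expected:
            mismatches += 1
    return mismatches


if __name__ == "__main__":
    # Hypothetical target: replace with a directory on the pool under test
    target = tempfile.mkdtemp()
    print("mismatches:", roundtrip_check(target, rounds=3))
```

Any non-zero count on a run like this would be consistent with the failures above not being specific to Sabnzbd or NZBGet.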

 

Attached my diagnostics. Any help or ideas for tests to run would be greatly appreciated!

unraid-diagnostics-20221122-2347.zip

Link to comment

Running memtest now and it’s already found some errors, will post final results when it finishes.

 

Does this mean my RAM likely went bad? All parts are less than 6 months old. Already had to RMA my motherboard, which also fried my USB boot drive. Things seemed stable for a while after I swapped those two parts out.

22201C32-C2CA-4291-B210-848DBF2EFE9B.jpeg

Link to comment
2 minutes ago, baboolian said:

Running memtest now and it’s already found some errors, will post final results when it finishes.

 

Does this mean my RAM likely went bad?

No point in continuing the memtest, any errors are fatal. Check CMOS memory timings, make sure there are no XMP / overclock settings, if needed manually set all the timings. Rerun memtest, if errors continue, test single sticks. If single sticks all pass, then possibly motherboard / PSU issues.

 

DO NOT ALLOW any data to be read or written to the drives until memtest is all clear.
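The triage order described above can be written down as a small decision function. This is a sketch of one reading of that advice, not an official procedure: the function name `next_step`, its parameters, and the returned strings are all hypothetical.

```python
def next_step(xmp_enabled, full_config_errors, single_stick_errors=None):
    """Suggest the next memtest triage step.

    full_config_errors: error count with all sticks installed.
    single_stick_errors: per-stick error counts, or None if not tested yet.
    """
    # Any error at all counts as a failure, so only zero is a pass
    if full_config_errors == 0:
        return "all clear for this configuration"
    # Rule out XMP / overclocked timings before blaming hardware
    if xmp_enabled:
        return "disable XMP/overclock, set timings manually if needed, rerun memtest"
    if single_stick_errors is None:
        return "test each stick on its own"
    if any(count > 0 for count in single_stick_errors):
        return "suspect the failing stick(s)"
    return "sticks pass alone: suspect motherboard or PSU"


if __name__ == "__main__":
    # Errors with both sticks at stock settings, each stick clean on its own
    print(next_step(False, 12, [0, 0]))
```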

Link to comment
24 minutes ago, JonathanM said:

No point in continuing the memtest, any errors are fatal. Check CMOS memory timings, make sure there are no XMP / overclock settings, if needed manually set all the timings. Rerun memtest, if errors continue, test single sticks. If single sticks all pass, then possibly motherboard / PSU issues.

 

DO NOT ALLOW any data to be read or written to the drives until memtest is all clear.


Thanks for the tips! XMP is disabled, how do I check the CMOS memory timings? 
 

I’ll memtest one stick at a time now 

Link to comment

I ran memtest on each stick, but only let it get to 50% of the first pass (past where I initially saw errors) since I’m short on time: I’m heading out of town for a few days in a couple of hours. Both sticks passed individually.
 

What should I check next? Do I need to let memtest complete on both sticks?

Link to comment

Possibly reseating them solved it. You need at least a full pass with zero errors on whatever configuration you plan to run long term.

 

I suggested the single stick test to best prove or disprove that the RAM itself was the culprit, but sometimes motherboards can run single sticks with no issue and still have problems with the power load of multiple sticks.

Link to comment

Retested with both sticks in their original slots, as well as in the two previously empty slots. Both failed, the former right away, and the latter a bit later in the test with fewer errors.

 

I'm rerunning a single stick test, which is almost done with the first pass and no errors so far.

 

Does this indicate a mobo/psu issue? If so, how can I determine which is the culprit?

 

Is it safe for me to run Unraid with 1 stick of RAM (assuming it passes the full test) in this situation?

Link to comment

I upped the voltage from 1.2v -> 1.25v and got a full memtest pass w/ both sticks!

 

For fun, I ran memtest with XMP on, and it failed immediately (with 1.35v).

 

Does this indicate dying RAM or PSU? 1.2v worked fine for months, and I ran XMP for a few days to test it early on with no problem either. I did switch PSUs about a month ago, though the new one is a way overkill 1600w EVGA P2 and was brand new. This was around the same time I installed the RMA'd motherboard. Initially everything worked fine at 1.2v with no XMP for a couple of weeks, until about two weeks ago.

 

I have another 2 sticks of identical RAM coming Friday I'll test, my hope is to run 4 sticks in the end. 

Link to comment

Looks like the PSU was the culprit!

 

I took your advice and upped the voltage, set it to 1.25 and got a memtest pass! I thought that was strange as it ran at stock 1.2v for weeks no problem, so I did some experiments. I also got a second set of matching ram. For the record, prior to all this I was running stock, no XMP. 

 

I initially only had the 8pin motherboard power connected.

 

EVGA 1600w P2 PSU:

2x32 stock: FAIL

2x32 1.25v: PASS

2x32 XMP: FAIL

 

Corsair HX1000 PSU:

2x32 stock: PASS

2x32 XMP: FAIL

 

Connected the secondary 4pin motherboard power.

 

EVGA 1600w P2 PSU:

2x32 stock: PASS

2x32 XMP: PASS

 

Corsair HX1000 PSU:

2x32 stock: Didn't Test

2x32 XMP: PASS

 

Added in second set of ram, now 4x32gb:

 

EVGA 1600w P2 PSU:

4x32 stock: Didn't test

4x32 XMP: FAIL 

 

Corsair HX1000 PSU:

4x32 stock: Didn't Test

4x32 XMP: PASS
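For anyone who wants to slice the results listed above, here is a minimal sketch that tallies passes per PSU or per power-connector setup. The `RESULTS` layout and the `tally` helper are hypothetical, and it assumes the 4x32 runs were done with the secondary 4pin still connected.

```python
# (PSU, motherboard power, RAM, setting, result); None means not tested
RESULTS = [
    ("EVGA 1600w P2", "8pin", "2x32", "stock", "FAIL"),
    ("EVGA 1600w P2", "8pin", "2x32", "1.25v", "PASS"),
    ("EVGA 1600w P2", "8pin", "2x32", "XMP", "FAIL"),
    ("Corsair HX1000", "8pin", "2x32", "stock", "PASS"),
    ("Corsair HX1000", "8pin", "2x32", "XMP", "FAIL"),
    ("EVGA 1600w P2", "8pin+4pin", "2x32", "stock", "PASS"),
    ("EVGA 1600w P2", "8pin+4pin", "2x32", "XMP", "PASS"),
    ("Corsair HX1000", "8pin+4pin", "2x32", "stock", None),
    ("Corsair HX1000", "8pin+4pin", "2x32", "XMP", "PASS"),
    # Assumption: 4pin stayed connected for the 4x32 runs
    ("EVGA 1600w P2", "8pin+4pin", "4x32", "stock", None),
    ("EVGA 1600w P2", "8pin+4pin", "4x32", "XMP", "FAIL"),
    ("Corsair HX1000", "8pin+4pin", "4x32", "stock", None),
    ("Corsair HX1000", "8pin+4pin", "4x32", "XMP", "PASS"),
]


def tally(results, column):
    """Return {value: (passes, tested)} grouped by the given tuple column."""
    counts = {}
    for row in results:
        outcome = row[4]
        if outcome is None:  # skip runs that were never performed
            continue
        passes, tested = counts.get(row[column], (0, 0))
        counts[row[column]] = (passes + (outcome == "PASS"), tested + 1)
    return counts


if __name__ == "__main__":
    for label, column in (("PSU", 0), ("power", 1)):
        for key, (passes, tested) in tally(RESULTS, column).items():
            print(f"{label:5} {key:15} {passes}/{tested} passed")
```

With so few runs per cell, the tally is only a way to eyeball the pattern, not evidence of a cause.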

 

I don't trust this EVGA PSU anymore, but I reeeally don't want to have to swap it out again (I managed to squeeze that monster 1600w psu into a Fractal Node 804 with 8 platters right above it and the max amount of case fans), so I gave it another shot because why not. The second time, it passed, with XMP on, after failing miserably the first time.

 

I think I'll run it like this with XMP on for a week and see if it survives, and then hit it with another memtest to be sure. With 64gb it took ~8hrs and 128gb ~19hrs to run. I need a break.

 

Thanks a ton for your help! Without your advice I likely would have lost a month unnecessarily RMA'ing my motherboard.

Link to comment

Well that didn't last long. After about 10 minutes downloads began failing again. Ran another memtest and got dozens of failures within 30 seconds. Switched from XMP to stock, same result. 

 

Trying the Corsair HX1000 PSU again. If this works, the brand new EVGA 1600w P2 is a couple of weeks past the return window, so I'll prob take it out back and go Office Space on it.

Link to comment
