Instability, needing a restart about every 4 weeks


Solved by JorgeB

I seem to remember the server being more stable, running for months at a time. Recently, I've been noticing that one or two of my Docker containers become unresponsive. When I go to restart them, they fail and show a red square with a generic error message. Other containers that are green and running also fail and don't come back up when restarted. Restarting the whole server seems to sort things out.

Just hoping a guru out there could take a peek at my diagnostics so we can anticipate a failure before it all comes crashing down. Thanks!

rubble-diagnostics-20230314-2102.zip

Link to comment

Made the two changes to the BIOS:

Global C-state Control: Disabled

Power Supply Idle Control: Typical Current Idle

 

Did a btrfs scrub on the one cache drive I have (nvme0n1p1) and rebooted; the errors persist. I'll attach a few screen grabs and a log. Here's my latest system log:

https://hastebin.com/share/saleyadehi.yaml

 

Ah yeah, the cache drive has some uncorrectable errors. Perhaps I did that "scrub" incorrectly? I checked the Fix It box, errors persisted. I didn't see any documentation/wiki on this Scrub operation.

 

1.png

2.png

3.png

5.png

6.png

7.png

8.png

9.png

Link to comment

Got the RAM slowed down, problem persists. Still scratchin' my head. ;)

 

Main > cache > scrub btrfs, with "fix errors" checked.

 

Oh, and I pulled two sticks of RAM and left two sticks in slots 2 and 4, counting from the CPU.

 

UUID:             ddbfaa94-f0d2-4f5c-96cc-2289a26b703d
Scrub started:    Thu Mar 30 14:41:51 2023
Status:           finished
Duration:         0:02:48
Total to scrub:   224.02GiB
Rate:             1.28GiB/s
Error summary:    csum=4
  Corrected:      0
  Uncorrectable:  4
  Unverified:     0
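
The scrub summary above is plain `key: value` text, so it can be checked mechanically. The following is a minimal sketch, assuming the summary has been copied into a string; `parse_scrub_status` and `has_uncorrectable` are hypothetical helper names, not part of btrfs or Unraid.

```python
def parse_scrub_status(text):
    """Split a scrub summary into a dict, using the first ':' on each line."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that contain no ':' at all
            fields[key.strip()] = value.strip()
    return fields


def has_uncorrectable(fields):
    """True when the summary reports at least one uncorrectable error."""
    # A clean scrub prints "no errors found" and has no "Uncorrectable" line.
    return int(fields.get("Uncorrectable", "0")) > 0


if __name__ == "__main__":
    # Sample copied from the scrub output quoted in this post.
    sample = """Status:           finished
Error summary:    csum=4
  Corrected:      0
  Uncorrectable:  4
  Unverified:     0"""
    print(has_uncorrectable(parse_scrub_status(sample)))
```

A non-zero "Uncorrectable" count means the scrub could not repair those blocks itself, which matches what the "fix errors" run reported here.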

 

System log shows lots and lots of this:

Mar 31 01:15:43 rubble kernel: verify_parent_transid: 9 callbacks suppressed
Mar 31 01:15:43 rubble kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on 7715749888 wanted 198424 found 214808
Mar 31 01:15:43 rubble kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on 7715749888 wanted 198424 found 214808
Mar 31 01:15:43 rubble kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on 7715749888 wanted 198424 found 214808
Mar 31 01:15:43 rubble kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on 7715749888 wanted 198424 found 214808
Mar 31 01:15:43 rubble kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on 7715749888 wanted 198424 found 214808
Mar 31 01:15:43 rubble kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on 7715749888 wanted 198424 found 214808
Mar 31 01:15:43 rubble kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on 7715749888 wanted 198424 found 214808
Mar 31 01:15:43 rubble kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on 7715749888 wanted 198424 found 214808
Mar 31 01:15:43 rubble kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on 7715749888 wanted 198424 found 214808
Mar 31 01:15:43 rubble kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on 7715749888 wanted 198424 found 214808

 

rubble-diagnostics-20230331-0115.zip

Edited by rutherford
Link to comment

Those uncorrectable errors need to be fixed manually. After the scrub, check the syslog for the list of corrupt files, for example:

 

Mar 30 14:42:06 rubble kernel: BTRFS warning (device nvme0n1p1): checksum error at logical 27189551104 on dev /dev/nvme0n1p1, physical 27189551104, root 5, inode 19836829, offset 823296, length 4096, links 1 (path: appdata/binhex-crafty-4/crafty/import/world/region/r.2.1.mca)

 

These files need to be deleted, or restored from a backup.
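
As a rough illustration of pulling that list out of the syslog, here is a minimal Python sketch. It assumes the warning lines look like the example above; `corrupt_paths` is a hypothetical helper, and the regular expression is based only on that one sample line, so other kernel versions may word the message differently.

```python
import re

# Matches a BTRFS checksum-error warning and captures the device and the
# trailing "(path: ...)" field, following the sample syslog line above.
CSUM_RE = re.compile(
    r"BTRFS warning \(device (?P<dev>[^)]+)\): "
    r"checksum error at logical \d+ .*\(path: (?P<path>.+)\)\s*$"
)


def corrupt_paths(lines):
    """Return the unique file paths named in checksum-error lines, in first-seen order."""
    seen = {}
    for line in lines:
        match = CSUM_RE.search(line)
        if match:
            # dicts keep insertion order, so duplicates collapse cleanly
            seen.setdefault(match.group("path"), match.group("dev"))
    return list(seen)


if __name__ == "__main__":
    sample = (
        "Mar 30 14:42:06 rubble kernel: BTRFS warning (device nvme0n1p1): "
        "checksum error at logical 27189551104 on dev /dev/nvme0n1p1, "
        "physical 27189551104, root 5, inode 19836829, offset 823296, "
        "length 4096, links 1 "
        "(path: appdata/binhex-crafty-4/crafty/import/world/region/r.2.1.mca)"
    )
    for path in corrupt_paths([sample]):
        print(path)
```

To run it against a real log, pass an open file instead of the sample list, for example `corrupt_paths(open("/var/log/syslog"))`; that log location is an assumption about the system. The reported paths are relative to the filesystem root of the pool, not absolute paths.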
Link to comment

@JorgeB Got the server back up. Removed the four files that were throwing errors in the btrfs scrub. Now it reads:

UUID:             ddbfaa94-f0d2-4f5c-96cc-2289a26b703d
Scrub started:    Sat Apr  1 17:33:09 2023
Status:           finished
Duration:         0:01:19
Total to scrub:   125.89GiB
Rate:             1.59GiB/s
Error summary:    no errors found

 

In my system log it's still throwing loads of this:

 

Apr  1 17:35:39 rubble kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on 7715749888 wanted 198424 found 214808
Apr  1 17:35:39 rubble kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on 7715749888 wanted 198424 found 214808
Apr  1 17:35:39 rubble kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on 7715749888 wanted 198424 found 214808
Apr  1 17:35:39 rubble kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on 7715749888 wanted 198424 found 214808
Apr  1 17:35:39 rubble kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on 7715749888 wanted 198424 found 214808
Apr  1 17:35:39 rubble kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on 7715749888 wanted 198424 found 214808
Apr  1 17:35:39 rubble kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on 7715749888 wanted 198424 found 214808
Apr  1 17:35:39 rubble kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on 7715749888 wanted 198424 found 214808
Apr  1 17:35:39 rubble kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on 7715749888 wanted 198424 found 214808
Apr  1 17:35:39 rubble kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on 7715749888 wanted 198424 found 214808
Apr  1 17:35:54 rubble kernel: verify_parent_transid: 9 callbacks suppressed

 

I'm ready to start replacing stuff. Just need to know what to replace! Hopefully short of building a new machine ;)

Link to comment

A while back I installed a Pi-hole Docker container with a dedicated static IP, and it has been running ever since. Fix Common Problems flagged that today, so I made the change described here:
 

https://forums.unraid.net/topic/120220-fix-common-problems-more-information/page/2/#comment-1243020

 

Rebooted, rescrubbed. Here is the diag.

rubble-diagnostics-20230402-0831.zip

Edited by rutherford
Link to comment

After running for a bit, the errors continue. The syslog from the GUI:

 

Apr  2 14:44:22 rubble kernel: verify_parent_transid: 9 callbacks suppressed
Apr  2 14:44:22 rubble kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on 7715749888 wanted 198424 found 214808
Apr  2 14:44:22 rubble kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on 7715749888 wanted 198424 found 214808
Apr  2 14:44:22 rubble kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on 7715749888 wanted 198424 found 214808
Apr  2 14:44:22 rubble kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on 7715749888 wanted 198424 found 214808
Apr  2 14:44:22 rubble kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on 7715749888 wanted 198424 found 214808
Apr  2 14:44:22 rubble kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on 7715749888 wanted 198424 found 214808
Apr  2 14:44:22 rubble kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on 7715749888 wanted 198424 found 214808
Apr  2 14:44:22 rubble kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on 7715749888 wanted 198424 found 214808
Apr  2 14:44:22 rubble kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on 7715749888 wanted 198424 found 214808
Apr  2 14:44:22 rubble kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on 7715749888 wanted 198424 found 214808

 

No errors reported by the BTRFS scrub.

 

I've also heard from a hardware buddy of mine that the CPU I use has a reputation for memory-related problems (allocation errors and the like).

CPU AMD Ryzen 7 1700 Eight-Core @ 3000 MHz, YD1700BBAEBOX cpu-world details

Mobo ASUSTeK COMPUTER INC. ROG STRIX B350-F GAMING, bios updated March 2023.

 

For troubleshooting I've purchased a new CPU and new RAM that I'll try out next week, one at a time, to see whether either fixes the issues.

CPU AMD Ryzen 7 5800X amazon

RAM Teamgroup DDR4 2x16GB amazon

 

I hope I (we!) can get this licked.

Edited by rutherford
Link to comment

Got this wrapped up and finished, and we're all good. Seems to be back up and running A-OK. The backup, format, and restore was a bit stressful. My Plex library had to get rolled back a couple of days - I don't remember setting that backup up <shrug> thank goodness it worked out. Ed's video here walked me through those steps.

This project is done.

I will set up weekly CA Backups V2, and keep an occasional eye on the logs for more errors.
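
For the occasional log check, a small script can tally the BTRFS messages instead of scrolling through the syslog. This is a minimal sketch that assumes the line formats quoted earlier in this thread; `summarize` is a hypothetical helper. The "callbacks suppressed" lines come from the kernel's rate limiting, so the sketch adds them up as a hint that more errors occurred than were actually logged.

```python
import re
from collections import Counter

# Kernel BTRFS messages in the form seen in this thread's syslog excerpts.
BTRFS_RE = re.compile(
    r"kernel: BTRFS (?P<level>error|warning) \(device (?P<dev>[^)]+)\): (?P<msg>.*)$"
)
# Rate-limit notices such as "verify_parent_transid: 9 callbacks suppressed".
SUPPRESSED_RE = re.compile(r"kernel: \w+: (?P<n>\d+) callbacks suppressed")


def summarize(lines):
    """Count identical BTRFS messages per device and total suppressed callbacks."""
    counts = Counter()
    suppressed = 0
    for line in lines:
        match = BTRFS_RE.search(line)
        if match:
            key = (match.group("dev"), match.group("level"), match.group("msg").strip())
            counts[key] += 1
            continue
        notice = SUPPRESSED_RE.search(line)
        if notice:
            suppressed += int(notice.group("n"))
    return counts, suppressed


if __name__ == "__main__":
    sample = [
        "Apr  2 14:44:22 rubble kernel: verify_parent_transid: 9 callbacks suppressed",
        "Apr  2 14:44:22 rubble kernel: BTRFS error (device nvme0n1p1): "
        "parent transid verify failed on 7715749888 wanted 198424 found 214808",
    ]
    counts, suppressed = summarize(sample)
    for (dev, level, msg), n in counts.most_common():
        print(f"{n}x {level} on {dev}: {msg}")
    print(f"suppressed callbacks: {suppressed}")
```

An empty tally after a few days of uptime is the result to hope for; any new entry is worth a fresh scrub and a look at the diagnostics.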

Thank you very much @JorgeB !

 

 

 

Link to comment

Swapped out some memory: moved a couple of the sticks from the server into my desktop and ran memtest86 last night. Still going in the morning! Good grief. But apparently it got far enough to tell: FAIL. Haha, love the font and schtick of this USB-bootable test.

 

I can't tell if this is both sticks of RAM or just one of them, and if so, which one. Not that I wouldn't want a pair, but for selling them, I should know.

 

https://www.memtest86.com/troubleshooting.htm

Looks like I just chuck 'em both. Or I'll sell them as FAIL memory. <shrug>
 

I’m so happy I found a smoking gun. 

 

 

 

IMG_2775.JPG

IMG_2776.JPG

IMG_2777.JPG

MemTest86-Report-20230408-235658.html

Edited by rutherford
Link to comment
