rutherford Posted March 15
I seem to remember the server being more stable, running for months at a time. Recently I'll notice that one or two of my dockers aren't responsive. When I go to restart them, they fail and show a red square with a general error message. Other dockers that are running (green) will also fail when restarted and not come back up. Restarting the whole server seems to sort things out. Just hoping a guru out there could take a peek at my diagnostics so we can anticipate a failure before it all comes crashing down. Thanks!
rubble-diagnostics-20230314-2102.zip
JorgeB Posted March 15
Enable the syslog server and post that after a crash. There are issues with the cache filesystem, possibly from the crashing; you should back up and restore.
rutherford (Author) Posted March 22
Looks like I'm getting some BTRFS errors on my nvme0n1p1: https://hastebin.com/share/wifurixani.yaml
I saw on another post that bad memory was sometimes a culprit there. I'll do a memtest86 USB boot and see if that turns anything up. Getting an appdata backup done here...
SMART test from the SSD: https://hastebin.com/share/caleferado.yaml
JorgeB Posted March 22
2 hours ago, rutherford said: "I saw on another post that bad memory was sometimes a culprit there"
Not for that error, please post the diagnostics.
rutherford (Author) Posted March 22
rubble-diagnostics-20230322-0708.zip
JorgeB Posted March 22
Corruption errors were detected before those errors, so yes, it can be RAM related. Ryzen with overclocked RAM, like you have, is known to cause data corruption; see here, then run a correcting scrub on the pool.
rutherford (Author) Posted March 22
Made the two changes to the BIOS:
Global C-state Control: Disabled
Power Supply Idle Control: Typical Current Idle
Did a btrfs scrub on the one cache drive I have (nvme0n1p1) and rebooted; errors persist. I'll attach a few screen grabs and a log.
Ah, here's my latest system log: https://hastebin.com/share/saleyadehi.yaml
Ah yeah, the cache drive has some uncorrectable errors. Perhaps I did that "scrub" incorrectly? I checked the Fix It box, and the errors persisted. I didn't see any documentation/wiki on this scrub operation.
JorgeB Posted March 23
12 hours ago, rutherford said: "Made the two changes to bios:"
Did you correct the RAM speed?
rutherford (Author) Posted March 23
<face palm> I didn't! I'll do it, restart, see how we go, and post here. Thanks so much @JorgeB
rutherford (Author) Posted March 31
Got the RAM slowed down, problem persists. Still scratchin' my head. Ran Main > cache > scrub btrfs, with "fix errors" checked. Oh, and I pulled two sticks of RAM and left two sticks in slots 2 and 4 from the CPU.
UUID: ddbfaa94-f0d2-4f5c-96cc-2289a26b703d
Scrub started: Thu Mar 30 14:41:51 2023
Status: finished
Duration: 0:02:48
Total to scrub: 224.02GiB
Rate: 1.28GiB/s
Error summary: csum=4
Corrected: 0
Uncorrectable: 4
Unverified: 0
System log shows lots and lots of this:
Mar 31 01:15:43 rubble kernel: verify_parent_transid: 9 callbacks suppressed
Mar 31 01:15:43 rubble kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on 7715749888 wanted 198424 found 214808
Mar 31 01:15:43 rubble kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on 7715749888 wanted 198424 found 214808
Mar 31 01:15:43 rubble kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on 7715749888 wanted 198424 found 214808
Mar 31 01:15:43 rubble kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on 7715749888 wanted 198424 found 214808
Mar 31 01:15:43 rubble kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on 7715749888 wanted 198424 found 214808
Mar 31 01:15:43 rubble kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on 7715749888 wanted 198424 found 214808
Mar 31 01:15:43 rubble kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on 7715749888 wanted 198424 found 214808
Mar 31 01:15:43 rubble kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on 7715749888 wanted 198424 found 214808
Mar 31 01:15:43 rubble kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on 7715749888 wanted 198424 found 214808
Mar 31 01:15:43 rubble kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on 7715749888 wanted 198424 found 214808
rubble-diagnostics-20230331-0115.zip
Edited March 31 by rutherford
JorgeB Posted March 31
Those uncorrectable errors need to be fixed manually. After the scrub, check the syslog for the list of corrupt files, for example:
Mar 30 14:42:06 rubble kernel: BTRFS warning (device nvme0n1p1): checksum error at logical 27189551104 on dev /dev/nvme0n1p1, physical 27189551104, root 5, inode 19836829, offset 823296, length 4096, links 1 (path: appdata/binhex-crafty-4/crafty/import/world/region/r.2.1.mca)
These files need to be deleted or restored from a backup.
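For a long syslog, the list of affected files can be pulled out with a small script rather than by eye. The following is a minimal sketch, not an official Unraid or btrfs tool: the regular expression is modelled only on the single example line quoted above, so other kernel versions may word the message differently, and the function name `corrupt_paths` is made up for illustration.

```python
import re

# Assumption: checksum-error lines look like the one example quoted in the
# thread. Kernels that word this message differently will not match.
CSUM_RE = re.compile(
    r"BTRFS warning \(device [^)]+\): checksum error at logical \d+ "
    r".*?\(path: (?P<path>.+)\)\s*$"
)

def corrupt_paths(lines):
    """Return the unique file paths named in checksum-error lines, in first-seen order."""
    paths = []
    for line in lines:
        match = CSUM_RE.search(line)
        if match:
            paths.append(match.group("path"))
    # dict.fromkeys drops repeats while keeping the original order
    return list(dict.fromkeys(paths))

sample = (
    "Mar 30 14:42:06 rubble kernel: BTRFS warning (device nvme0n1p1): "
    "checksum error at logical 27189551104 on dev /dev/nvme0n1p1, "
    "physical 27189551104, root 5, inode 19836829, offset 823296, "
    "length 4096, links 1 "
    "(path: appdata/binhex-crafty-4/crafty/import/world/region/r.2.1.mca)"
)

for path in corrupt_paths([sample]):
    print(path)
```

The printed paths are relative to the pool root, as in the log line, so each one still has to be deleted or restored from backup by hand.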
rutherford (Author) Posted April 1
I'll do that when it comes back up. I had a hard freeze and crash today.
rutherford (Author) Posted April 2
@JorgeB Got the server back up. Removed the four files that were throwing errors in the btrfs scrub. Now it reads:
UUID: ddbfaa94-f0d2-4f5c-96cc-2289a26b703d
Scrub started: Sat Apr 1 17:33:09 2023
Status: finished
Duration: 0:01:19
Total to scrub: 125.89GiB
Rate: 1.59GiB/s
Error summary: no errors found
In my system log it's still throwing loads of this:
Apr 1 17:35:39 rubble kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on 7715749888 wanted 198424 found 214808
Apr 1 17:35:39 rubble kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on 7715749888 wanted 198424 found 214808
Apr 1 17:35:39 rubble kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on 7715749888 wanted 198424 found 214808
Apr 1 17:35:39 rubble kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on 7715749888 wanted 198424 found 214808
Apr 1 17:35:39 rubble kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on 7715749888 wanted 198424 found 214808
Apr 1 17:35:39 rubble kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on 7715749888 wanted 198424 found 214808
Apr 1 17:35:39 rubble kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on 7715749888 wanted 198424 found 214808
Apr 1 17:35:39 rubble kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on 7715749888 wanted 198424 found 214808
Apr 1 17:35:39 rubble kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on 7715749888 wanted 198424 found 214808
Apr 1 17:35:39 rubble kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on 7715749888 wanted 198424 found 214808
Apr 1 17:35:54 rubble kernel: verify_parent_transid: 9 callbacks suppressed
I'm ready to start replacing stuff. Just need to know what to replace! Hopefully something short of building a new machine.
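In the excerpts posted in this thread, every "parent transid verify failed" line names the same block (7715749888) with the same wanted/found values, which is easy to miss in a wall of repeats. A minimal sketch for collapsing such lines into counts follows; the pattern is modelled on the lines quoted above, and `summarize_transid_errors` is a hypothetical helper name, not part of any btrfs tooling.

```python
import re
from collections import Counter

# Assumption: error lines follow the format seen in the excerpts in this thread.
TRANSID_RE = re.compile(
    r"BTRFS error \(device (?P<device>[^)]+)\): parent transid verify failed "
    r"on (?P<block>\d+) wanted (?P<wanted>\d+) found (?P<found>\d+)"
)

def summarize_transid_errors(lines):
    """Count transid failures per (device, block, wanted, found) tuple."""
    counts = Counter()
    for line in lines:
        match = TRANSID_RE.search(line)
        if match:
            key = (
                match.group("device"),
                int(match.group("block")),
                int(match.group("wanted")),
                int(match.group("found")),
            )
            counts[key] += 1
    return counts

error_line = (
    "Apr 1 17:35:39 rubble kernel: BTRFS error (device nvme0n1p1): "
    "parent transid verify failed on 7715749888 wanted 198424 found 214808"
)
log = [error_line] * 10 + [
    "Apr 1 17:35:54 rubble kernel: verify_parent_transid: 9 callbacks suppressed"
]

for (device, block, wanted, found), n in summarize_transid_errors(log).items():
    print(f"{device}: block {block} wanted {wanted} found {found} ({n} lines)")
```

Note that the kernel rate-limits these messages (the "callbacks suppressed" lines), so the counts are a lower bound on how often the error actually occurred.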
JorgeB Posted April 2
Reboot, run a new scrub, and post new diags after that.
rutherford (Author) Posted April 2
A while back I installed a Pi-hole Docker with a dedicated static IP, and it has been running since. Fix Common Problems flagged that today, so I made the change described here: https://forums.unraid.net/topic/120220-fix-common-problems-more-information/page/2/#comment-1243020
Rebooted, rescrubbed. Here is the diag.
rubble-diagnostics-20230402-0831.zip
Edited April 2 by rutherford
rutherford (Author) Posted April 2
After running for a bit, errors continue. The syslog from the GUI:
Apr 2 14:44:22 rubble kernel: verify_parent_transid: 9 callbacks suppressed
Apr 2 14:44:22 rubble kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on 7715749888 wanted 198424 found 214808
Apr 2 14:44:22 rubble kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on 7715749888 wanted 198424 found 214808
Apr 2 14:44:22 rubble kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on 7715749888 wanted 198424 found 214808
Apr 2 14:44:22 rubble kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on 7715749888 wanted 198424 found 214808
Apr 2 14:44:22 rubble kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on 7715749888 wanted 198424 found 214808
Apr 2 14:44:22 rubble kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on 7715749888 wanted 198424 found 214808
Apr 2 14:44:22 rubble kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on 7715749888 wanted 198424 found 214808
Apr 2 14:44:22 rubble kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on 7715749888 wanted 198424 found 214808
Apr 2 14:44:22 rubble kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on 7715749888 wanted 198424 found 214808
Apr 2 14:44:22 rubble kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on 7715749888 wanted 198424 found 214808
No errors reported by the BTRFS scrub.
I've also heard, from a hardware buddy of mine, that the CPU I use has had memory-related issues (allocation errors, that sort of thing).
CPU: AMD Ryzen 7 1700 Eight-Core @ 3000 MHz, YD1700BBAEBOX (cpu-world details)
Mobo: ASUSTeK COMPUTER INC. ROG STRIX B350-F GAMING, BIOS updated March 2023
For troubleshooting I've purchased a new CPU and RAM that I'll try out next week, one at a time, to see if either fixes the issues.
CPU: AMD Ryzen 7 5800X (amazon)
RAM: Teamgroup DDR4 2x16GB (amazon)
I hope I (we!) can get this licked.
Edited April 2 by rutherford
JorgeB Posted April 3 (Solution)
Since the scrub cannot fix the errors, you should back up and re-format the pool.
rutherford (Author) Posted April 9
Got this wrapped up and we're all good. Seems to be back up and running AOK. The backup, format, and restore was a bit stressful. My Plex library had to get rolled back a couple of days - I don't remember setting that backup up <shrug> thank goodness it worked out. Ed's video here walked me through those steps.
This project is done. I will set up weekly CA Backups V2 and keep an occasional eye on the logs for more errors.
Thank you very much @JorgeB !
rutherford (Author) Posted April 9
Swapped out some memory: moved a couple of the sticks from the server into my desktop and ran memtest86 last night. Still going in the morning! Good grief. But apparently it had run long enough to tell: FAIL. Haha, love the font and schtick of this USB-bootable test.
I can't tell if this is both sticks of RAM or just one of them, and if so, which one? Not that I wouldn't want a pair, but for selling them, I should know.
https://www.memtest86.com/troubleshooting.htm
Looks like I just chuck 'em both. Or I’ll sell them as FAIL memory. <shrug>
I’m so happy I found a smoking gun.
MemTest86-Report-20230408-235658.html
Edited April 9 by rutherford