shadowbert Posted October 12, 2023: I don't even know where to start with this one, but it's a real pain. The server will simply stop responding to any traffic and, because it's headless (no GPU, no display ports on the motherboard), I have no choice but to cold restart. You may note that drive 5 is missing. This drive had quite a lot of SMART errors, so I took it out to see if that was somehow causing issues. I would have thought a misbehaving drive would at worst get disabled (and certainly shouldn't take out the server), but removing it doesn't seem to have helped. Attachment: lime-diagnostics-20231013-0844.zip
JorgeB Posted October 13, 2023: Enable the syslog server and post that after a crash.
shadowbert Posted October 13, 2023: Fair call. Is this all I need to do for that? Nothing came up in the share yet...
JorgeB Posted October 13, 2023: You also need to set the remote IP (use the server's own IP), or enable the mirror to flash drive option instead.
shadowbert Posted October 13, 2023: Never mind, I worked it out. I had to set the server to use itself as the remote server.
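A quick way to confirm that a syslog server is actually receiving messages is to send it a test line from another machine. The following is only a minimal sketch, not something taken from this thread: it assumes the server accepts syslog over UDP on port 514 (the conventional syslog port; check what your own syslog settings use), and the function name `send_test_message`, the logger name, and the IP in the usage note are all made up for illustration.

```python
import logging
import logging.handlers
import socket


def send_test_message(host: str, port: int = 514,
                      text: str = "syslog test message") -> None:
    """Send a single syslog message to host:port over UDP.

    Assumption: the receiving syslog server listens on UDP. The default
    port of 514 is the conventional one, not a value from this thread.
    """
    handler = logging.handlers.SysLogHandler(
        address=(host, port), socktype=socket.SOCK_DGRAM
    )
    logger = logging.getLogger("syslog-check")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    try:
        logger.info(text)
    finally:
        # Detach and close so repeated calls do not stack up handlers.
        logger.removeHandler(handler)
        handler.close()
```

Usage would look something like `send_test_message("192.168.1.10")`, where the address is a placeholder for your server's IP. UDP gives no delivery confirmation, so the only real check is whether the line shows up in the log on the server side.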
shadowbert Posted October 21, 2023: Alright, so I haven't had it completely lock up since posting, but I have had docker grind to a halt at least 3 times. Looks like I have BTRFS errors for days in the logs. So I guess there's something wrong with my cache. One of the drives does have a CRC error count value of 133... so I'm guessing I should try ripping that out and seeing if it helps.
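When a log has "BTRFS errors for days", it can help to count them per device before deciding which drive to pull. The sketch below is a rough illustration under stated assumptions: it assumes kernel lines of the general shape `BTRFS error (device sdX1): ...` or `BTRFS warning (device sdX1): ...`, and the function name `count_btrfs_problems` and the log path in the usage note are hypothetical. Real log lines may differ, so treat the patterns as a starting point.

```python
import re
from collections import Counter
from typing import Iterable

# Assumed line shape: "... BTRFS error (device sdb1): ..." (also warning/critical).
BTRFS_PROBLEM = re.compile(r"BTRFS (error|warning|critical)", re.IGNORECASE)
DEVICE = re.compile(r"\(device ([^)]+)\)")


def count_btrfs_problems(lines: Iterable[str]) -> Counter:
    """Count BTRFS error/warning/critical lines, grouped by device name."""
    counts: Counter = Counter()
    for line in lines:
        if not BTRFS_PROBLEM.search(line):
            continue  # not a BTRFS problem line (info lines are skipped too)
        match = DEVICE.search(line)
        counts[match.group(1) if match else "unknown"] += 1
    return counts
```

To run it against a saved log, something like `count_btrfs_problems(open("syslog.txt", errors="replace"))` would do, where `syslog.txt` stands in for wherever your syslog ended up. Note that the count only tells you where the filesystem noticed a problem, not what caused it.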
JorgeB Posted October 21, 2023: You can post new diags.
shadowbert Posted October 21, 2023: Sure, here you go. Attachment: lime-diagnostics-20231021-1923.zip
JorgeB Posted October 21, 2023: Btrfs is detecting data corruption; you should run memtest.
shadowbert Posted October 21, 2023: I highly suspect it's more likely to be due to the sudden shutdowns than RAM. Running memtest is going to be tricky given that I don't have any display outputs on that machine... What should I do to clean up the corruption, assuming RAM isn't the issue?
JorgeB Posted October 21, 2023: shadowbert said: "I highly suspect it's more likely to be due to the sudden shutdowns than ram." That won't cause data corruption.
shadowbert Posted October 21, 2023: Really? I had assumed that a power outage that happened to occur when something had been written to disk A but not yet to disk B would cause that sort of issue... or is btrfs somehow smarter than that?
JorgeB Posted October 21, 2023: Writes with btrfs are atomic: either they are written correctly or they are not written at all.
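The "all or nothing" idea can be illustrated with a much simpler, application-level analogy. To be clear, this is not how btrfs is implemented internally; it is only a sketch of the same principle: write the new data somewhere else first, then switch over in one step, so a crash leaves either the old version or the new one and never a half-written mix. The function name `atomic_write` is made up for illustration.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace the file at `path` with `data` in an all-or-nothing way.

    Analogy only: the new content goes to a temporary file in the same
    directory, is flushed to disk, and is then renamed over the target.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the new data is on disk first
        # Readers see either the old file or the new one, never a partial write.
        os.replace(tmp_path, path)
    except BaseException:
        # On any failure, remove the temp file and leave the original untouched.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

The point of the analogy: an interrupted write by itself leaves you with the old data, not with garbage. Data that reads back differently from what was written points to something else in the chain.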
shadowbert Posted October 21, 2023 (edited October 24, 2023 by shadowbert: typo): That makes sense... though it is certainly concerning. Would the (now removed) SSD that was throwing those CRC errors be a possible cause?
JorgeB Posted October 21, 2023 (marked as solution): shadowbert said: "Would the (now removed) SSD that was throwing those CRC be a possible cause?" Extremely unlikely. RAM is by far the #1 suspect; controller/device would be a remote #2.
shadowbert Posted October 21, 2023: Alright. I'll need to find a way to get a screen connected then.
shadowbert Posted October 24, 2023: I got a screen connected. For some reason, memtest (from the unraid install) refuses to run. It just immediately restarts the computer. Ominous, but not conclusive. Going back to running unraid, with the monitor connected I can see that this shutdown is caused by a kernel panic. That certainly explains why the server simply "disappeared". Unfortunately (though unsurprisingly) the server was not kind enough to write the full details of the panic to the syslog, but it certainly helps confirm the RAM theory. I've taken half of the RAM out, to see if I can bisect which stick is giving me grief. Fingers crossed. Thanks for pointing me in the right direction.
itimpi Posted October 24, 2023: shadowbert said: "I got a screen connected. For some reason, memtest (from the unraid install) refuses to run. It just immediately restarts the computer. Ominous, but not conclusive." The version of memtest provided with Unraid will only run if booting in legacy mode. If you boot in UEFI mode then you should download the latest version from memtest86.com, which will also boot in UEFI mode.
shadowbert Posted October 24, 2023: Huh. Weird design choice, but that certainly explains it. At any rate, things are looking better with this half of the RAM. And it's entirely possible my old set might actually still be under warranty, so that might be handy...
itimpi Posted October 24, 2023: shadowbert said: "Huh. Weird design choice, but that certainly explains it." It is a licensing issue, I believe.
shadowbert Posted October 24, 2023: That'd be right... kinda would have been nice if they hinted at it in the title. If it goes down again I'll certainly check it out though.