Pri Posted October 21, 2023 (edited)

The problem:
For about a month now, since I upgraded to the 6.12.x branch (.3 and then .4 only; I did not use .0, .1 or .2), my server has been randomly half-crashing every 1 to 16 days. Sometimes two times a day, sometimes only once a week; it seems random. I did not have any issues like this on 6.11, it was stable as a rock.

What I mean by half-crashing:
My disk shares stop working, one of my Windows Server 2022 virtual machines crashes (the others do not), the WebUI becomes very slow to respond but does function, and all my disks (NVMe SSDs and SATA hard drives) enter a sleep state and won't wake up. Only a reboot fixes it.

So far what I've tried:
1. Brand new USB stick to boot from. Some people previously said that, based on my log, it looked like my USB stick was failing.
2. Disabling C-States in the BIOS.
3. Uninstalling the Unassigned Devices plugin (some thought it could be related, as others on reddit are having the same crashes as I am, as per this thread: https://www.reddit.com/r/unRAID/comments/16yz0gd/possible_fix_for_people_crashing_on_612/ but it appears to be unrelated).
4. Ran memtest86. It passed after 2 hours, and I let it run for a total of 7 hours and 20 minutes without errors.

Specs of my server / other hardware related details:
3rd Gen EPYC Milan with 7 NVMe devices and 11 hard disk drives. The HBA I'm using is a 9500-8i. I do not have a dedicated GPU installed, and I'm not doing any kind of fancy graphics stuff with my VMs either. All the firmware and BIOS are up to date on all my stuff (motherboard, HBA, network cards, NVMe drives etc).

Diagnostics taken today during the half-crash: hyepyc-diagnostics-20231021-1828.zip

I did make a thread about this earlier. I marked it as resolved because everyone on Discord told me to get a new USB key, as my current one seemed broken. So that is what I did; however, that hasn't resolved the issue, so I'm making the topic again with fresh diagnostics. The only real differences between then and now are that I have the new USB key in and that I uninstalled the Unassigned Devices plugin before this most recent crash.

Any help is greatly appreciated.

Also, you may see a lot of SSH logins in my log. These are from external software I run to automate Docker, since unRAID doesn't have an official API. I have since disabled this in case there is some kind of SSH-related out-of-memory bug in play.

Edited October 22, 2023 by Pri
Pri (Author) Posted October 21, 2023 (edited)

Someone suggested I run memtest86, so I have done that. After two hours it passed, with no errors of any kind. I am allowing it to continue to run for the next 10 hours or so just to be sure, though. I will add this as an edit to my above post as well.

EDIT: It kept passing for 7 hours and 20 minutes before I turned it off, as I needed the server up to do some work.

Edited October 22, 2023 by Pri
JorgeB Posted October 22, 2023

There are some strange errors and still some apparent flash drive issues:

oct 21 17:08:40 HYEPYC emhttpd: Unregistered - flash device error (ENOFLASH7)

But I'm not seeing any USB errors, try booting in safe mode.
Pri (Author) Posted October 22, 2023 (edited)

5 minutes ago, JorgeB said:
"There are some strange errors and still some apparent flash drive issues: oct 21 17:08:40 HYEPYC emhttpd: Unregistered - flash device error (ENOFLASH7) But I'm not seeing any USB errors, try booting in safe mode."

Yeah, it looks like that, but I've completely changed the USB port, the USB stick etc. The USB error seems to be a symptom of the problem rather than the cause: there is some memory allocation issue, which sets off a cascade of other issues, and the USB error shows up as part of that.

I checked my old USB key and ran all kinds of diagnostics on it; it seems to be perfectly fine. I'm at a loss to explain what is going on, really.

Regarding safe mode, are there any downsides to using it, and is there anything you want me to provide once I'm running in safe mode?

Edited October 22, 2023 by Pri
JorgeB Posted October 22, 2023

1 hour ago, Pri said:
"is there any downsides to using that"

No plugins while in safe mode, other than that everything else should work normally.
Pri (Author) Posted June 23 (Solution)

I was just made aware that I didn't follow up on this thread with a solution.

I narrowed the issue down to Docker. Specifically, when dockers are restarted (instead of shut down and started back up), there appears to be a memory leak related to system resource allocation. There's some set memory pool that gets used up a little more each time a docker restarts, until there's none left. At that point, unRAID goes haywire: storage devices can't be accessed (including the unRAID USB device), the WebUI may fail to load, SSH won't connect reliably etc.

Stopping the docker that you've restarted multiple times releases these resources back to the system, and everything instantly begins working again.

Since learning of this, I changed how I interact with my dockers so that they aren't restarted but are completely stopped and then started, and I've experienced no more problems since.
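For anyone automating their containers, the workaround described above (a full stop followed by a start, rather than a restart) could be scripted roughly as follows. This is only a minimal sketch, not what Pri actually runs: the function names `stop_start_commands` and `cycle_container`, the `runner` parameter and the 10-second default timeout are all assumptions made for illustration. The only real commands used are the standard `docker stop -t <seconds> <container>` and `docker start <container>`.

```python
import subprocess
from typing import Callable, List, Sequence

# Type of the function that actually executes a command (hypothetical helper type).
Runner = Callable[[Sequence[str]], None]


def _run(cmd: Sequence[str]) -> None:
    """Execute a command and raise CalledProcessError if it exits non-zero."""
    subprocess.run(list(cmd), check=True)


def stop_start_commands(container: str, stop_timeout: int = 10) -> List[List[str]]:
    """Build the two commands that replace a single `docker restart`.

    stop_timeout is the number of seconds `docker stop` waits before
    killing the container (assumed default of 10 for this sketch).
    """
    return [
        ["docker", "stop", "-t", str(stop_timeout), container],
        ["docker", "start", container],
    ]


def cycle_container(
    container: str, stop_timeout: int = 10, runner: Runner = _run
) -> List[List[str]]:
    """Fully stop the container, then start it again.

    The stop must succeed before the start is attempted; with the default
    runner a failing command raises and aborts the sequence.
    Returns the list of commands that were run.
    """
    commands = stop_start_commands(container, stop_timeout)
    for cmd in commands:
        runner(cmd)
    return commands
```

Calling `cycle_container("my-container")` (a placeholder name) would run the stop and then the start against the local Docker daemon. Passing a different `runner` makes it possible to log or dry-run the commands without touching Docker.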
JorgeB Posted June 24

17 hours ago, Pri said:
"when dockers are restarted (instead of shutdown and started back up)"

Thanks for the update. Did you find out whether a specific container was doing that?
Pri (Author) Posted June 24

I've had it happen with a few containers that I made myself (which I was restarting as part of my normal use of them) and also with the Mysterium container (the official one from the app's repo). Since I rarely, if ever, need to restart my normal dockers (Plex, Grafana, qBittorrent etc), I've not witnessed the issue with those.