[7.0.0-beta3] Random partial lockups of system

November 26, 2024

For a couple of months now, my unRAID server has been suffering from occasional lockups. Sometimes the server recovers; most of the time it does not. During a lockup, Docker containers, new SSH connections, and the WebUI mostly fail to load. I recently managed to have an SSH session running htop and the dashboard page open while it happened. The CPU seemed to be pinned at 100%, though the culprit processes kept changing. Even after I killed the processes pegged at 100+% in htop, the system stayed unresponsive. Any Samba connections made before the system locked up remained usable, though they transferred at below 10 MB/s instead of the expected 2.5 Gb/s (to cache).

I've been hard resetting the system. One time, however, I checked in via HDMI and saw that the console was still up and blinking. I couldn't log in, but when I pressed the power button, it showed that it was trying to shut down. That one time, I left it to shut down on its own. It was stuck on `Generating diagnostics` and later on `Starting diagnostics collection...` for multiple hours before finally turning off sometime during the night.

Diagnostics and a screenshot from that time are attached (the image artifacts in the screenshot are due to my monitor, not the server).

Upgrading to v7 in mid-October seemed to solve the issue for a few weeks, but it came back soon thereafter. Rebuilding the thumb drive always seemed to buy me a week or two without issues, but considering I've changed USB ports and thumb drives multiple times now, buying brand new ones, it might be another issue still.

I've tried removing unneeded devices, such as a GPU (GTX 1650) and an M.2 riser to see if those could be the culprit, but saw no change in behavior. The 11400 in this system is set in UEFI to not boost and to power saving to reduce energy consumption. Under normal operation I see around 30-50% utilization, only spiking when some Dockers transcode or run a game server. I don't experience IO-wait from my cache or array.

stower24-diagnostics-20241117-1934.zip

November 26, 2024

Community Expert

There are PCIe errors logged for the NIC. If it's an add-on NIC, try it in a different slot; if not, try a different NIC. There are also ATA errors for disk3, so check or replace the cables.
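For anyone wanting to check their own diagnostics for the same kind of entries, here is a minimal Python sketch that counts PCIe and ATA error lines in a syslog file. The two regular expressions are assumptions about how such kernel messages are typically worded; the exact text varies by kernel version and driver, so adjust them to what your log actually contains.

```python
import re
from collections import Counter

# Assumed patterns: kernel wording differs between versions and drivers,
# so treat these as a starting point rather than a complete list.
PATTERNS = {
    "pcie": re.compile(r"PCIe Bus Error|AER:", re.IGNORECASE),
    "ata": re.compile(r"\bata\d+(?:\.\d+)?: .*(?:error|exception|failed)", re.IGNORECASE),
}


def count_errors(lines):
    """Count how many lines match each error pattern."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

To use it, open the syslog from the extracted diagnostics zip and pass the file object to `count_errors`, for example `count_errors(open("syslog.txt", errors="replace"))`, where the file name is whatever your diagnostics archive contains.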

November 28, 2024

Author

On 11/26/2024 at 12:23 PM, JorgeB said:

There are PCIe errors logged for the NIC. If it's an add-on NIC, try it in a different slot; if not, try a different NIC. There are also ATA errors for disk3, so check or replace the cables.

Thanks.

I don't have any PCIe cards connected. It must be the on-board 10 GbE NIC then. Darn; I really didn't want to replace the mainboard. Is this a plausible cause for the server getting hung up occasionally, only responding very slowly?

The ATA errors have been coming back every time I've changed ports or cables. I assume it's with the drive, but I can't replace it currently.

November 28, 2024

Community Expert

1 minute ago, DesertCookie said:

Is this a plausible cause for the server getting hung up occasionally, only responding very slowly?

It's possible; try using an add-on NIC if you can.

December 6, 2024

Author

On 11/28/2024 at 10:28 AM, JorgeB said:

[...] try using an add-on NIC [...]

I added a TP-Link 2.5 Gb NIC and reseated all SATA cables, yet I continue to experience freezes. I also changed some UEFI values, which now allow the CPU to clock down to 800 MHz at idle.

stower24-diagnostics-20241206-2014.zip

December 7, 2024

Community Expert

I don't see anything relevant logged, so this could also be a hardware issue. I recommend updating to rc1. One thing you can try is booting the server in safe mode with all Docker containers and VMs disabled and letting it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning the other services back on one by one, including the individual Docker containers.

December 13, 2024

Author

To give a little status update (I will investigate further and try to pinpoint the issue):

My issue is not technically solved, but I've made some progress and have had no critical lockups for three days now. My stop-gap solution: Isolate two out of 12 threads. Also, I upgraded to v7.0.0-rc.1, but seeing as this has been an issue even in v6.x I don't see that as a major contributor to the improvements.
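In case it helps anyone reproduce the isolation check: on Linux, the kernel exposes the currently isolated CPUs as a list such as `10-11` in `/sys/devices/system/cpu/isolated`. Below is a small Python sketch that expands that list format into individual thread numbers. The example values are hypothetical, not the threads isolated on this server.

```python
def parse_cpu_list(text):
    """Expand a kernel CPU list such as '5,10-11' into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # an empty file means no CPUs are isolated
        if "-" in part:
            start, end = part.split("-", 1)
            cpus.update(range(int(start), int(end) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def read_isolated(path="/sys/devices/system/cpu/isolated"):
    """Read the isolated CPU list from sysfs (Linux only)."""
    with open(path) as fh:
        return parse_cpu_list(fh.read())
```

Calling `read_isolated()` on the server should return the isolated thread numbers, or an empty list if isolation did not take effect after the reboot.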

This points at some Docker container using up too many resources or potentially causing I/O wait, which, depending on the severity, previously locked up the system for minutes or hours. Now, three days by no means indicates that I am in the clear, but it definitely is a milestone, especially considering that my monitoring tool (Uptime Kuma) reported fewer lockups than before, all of them shorter than 60 seconds, where before they would last 10-20 minutes (or longer; once the server itself locked up, monitoring would stop too, which has not happened again so far).

These are sub-60s timeouts on my Nextcloud server. These usually aren't noticeable, as they quickly resolve themselves without the user encountering a 504 error.

Looking at the period around 7:42 today, when one such timeout occurred, I can't see anything out of the ordinary in Grafana, which hints at Nextcloud itself being the issue; however, that wouldn't explain why other services are sometimes impacted too.

I/O wait is not especially high. Overall, the highest I've seen is 24% over the past three days.
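For reference, an I/O wait percentage like the one above can be derived from two samples of the aggregate `cpu` line in `/proc/stat`. The following is a rough Python sketch of that calculation, not necessarily how Grafana or its data source computes it; the sample numbers in the usage note are made up for illustration.

```python
def parse_cpu_line(line):
    """Return the first eight counters of the aggregate 'cpu' line.

    Order: user, nice, system, idle, iowait, irq, softirq, steal.
    Guest time is left out because it is already included in user time.
    """
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line from /proc/stat")
    return [int(value) for value in fields[1:9]]


def iowait_percent(before, after):
    """Percentage of CPU time spent in iowait between two samples."""
    deltas = [b - a for a, b in zip(parse_cpu_line(before), parse_cpu_line(after))]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total
```

Given a first sample of `cpu 1000 0 500 8000 200 0 0 0 0 0` and a second of `cpu 1100 0 560 8600 440 0 0 0 0 0`, `iowait_percent` returns 24.0, since 240 of the 1000 elapsed ticks were spent waiting on I/O.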

Uptime Kuma, now running on a secondary system, reports numerous timeouts but ultimately only rare 60-second downtimes. Again, my main instance, which only warns after downtimes of 300 seconds, did not alert a single time over the past few days, where before it would regularly report timeouts across almost all of my services.

---

A hardware issue definitely isn't out of the question. I'll leave the server running like this for a little longer to find out whether some container or other service is causing the CPU usage or I/O wait. The next step would be swapping out the mainboard. After that, potentially the parity drive, as it has read and write error rates in the millions. `sdc`, which previously threw a lot of UDMA CRC errors, seems to have calmed down now that I have reseated all cables again. SMART tests pass, and I'm monitoring it. In theory, it's the youngest drive in my server at 3 years of power-on time.
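To keep an eye on the CRC counter over time, one option is to parse the attribute table printed by `smartctl -A /dev/sdc` and log the raw value periodically. This is a minimal Python sketch that assumes the usual ten-column ATA attribute table; the sample row in the usage note is invented, and drives that report attributes differently (NVMe, SAS) will not match this layout.

```python
def parse_smart_attributes(output):
    """Map SMART attribute names to integer raw values.

    Assumes the ten-column ATA table from `smartctl -A`:
    ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    Rows whose raw value does not start with a plain integer are skipped.
    """
    attrs = {}
    for line in output.splitlines():
        fields = line.split()
        if len(fields) < 10 or not fields[0].isdigit():
            continue  # header, blank line, or a different table
        raw = fields[9]
        if raw.isdigit():
            attrs[fields[1]] = int(raw)
    return attrs
```

For a row such as `199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 1423`, the result contains `{"UDMA_CRC_Error_Count": 1423}`. A raw value that keeps climbing after the cables were reseated would point back at the cabling or port rather than the platters.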

December 25, 2024

Author

I had a lockup shortly after making this post. I've removed my parity disk since then and have had no issues for four days (let's see if that holds). For now, I'm keeping the most frequently served data on the former parity HDD, which I've added as a second cache pool, and backing that up using the Mover Tuning plugin, as well as Borgbackup for things such as Nextcloud.
