[7.0.0-beta3] Random partial lockups of system

For a couple of months now, my unRAID server has been suffering from occasional lockups. Sometimes the server would recover; most of the time it would not. Docker containers, new SSH connections, and the WebUI would not load most of the time. I recently managed to have an SSH session with htop running and the dashboard page open while it happened. The CPU seemed to be pinned at 100%, though the culprits would change. Even after I killed the processes pegged at 100+% in htop, the system stayed unresponsive. Any SAMBA connections made before the system locked up would still be usable, though transferring at below 10 MB/s instead of the expected 2.5 Gb/s (to cache).

 

I've been hard resetting the system. One time, however, I checked in via HDMI and saw the console was still up and blinking. While I couldn't log in, it correctly showed that it was trying to shut down when I pressed the power button. That one time I left it to shut down on its own. It was stuck on `Generating diagnostics` and later on `Starting diagnostics collection...` for multiple hours until it finally turned off sometime in the night.

Diagnostics and screenshot of that time are attached (the image errors on the screenshot are due to my monitor, not the server).

 

Upgrading to v7 in mid-October seemed to solve the issue for a few weeks, but it came back soon thereafter. Rebuilding the thumb drive always seemed to buy me a week or two without issues, but considering I've changed USB ports and thumb drives multiple times now, buying brand new ones, it might be another issue still.

 

I've tried removing unneeded devices, such as a GPU (GTX 1650) and an M.2 riser to see if those could be the culprit, but saw no change in behavior. The 11400 in this system is set in UEFI to not boost and to power saving to reduce energy consumption. Under normal operation I see around 30-50% utilization, only spiking when some Dockers transcode or run a game server. I don't experience IO-wait from my cache or array.

photo_2024-11-26_11-20-17.jpg

stower24-diagnostics-20241117-1934.zip

  • Community Expert

There are PCIe errors logged for the NIC. If it's an add-on NIC, try using it in a different slot; if not, try a different NIC. There are also ATA errors for disk3, so check/replace cables.
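For anyone wanting to check their own logs for the same two error classes, here is a minimal sketch that counts matching lines in a syslog file taken from a diagnostics archive. The patterns are assumptions about how these messages typically look (PCIe AER reports mentioning `AER:` or `PCIe Bus Error`, libata messages prefixed with `ataN:` or `ataN.MM:`), and `summarize_errors` is a hypothetical helper, not an Unraid tool.

```python
import re
from collections import Counter

# Assumed patterns, not an exhaustive list of kernel messages.
PCIE_RE = re.compile(r"AER:|PCIe Bus Error")
ATA_RE = re.compile(
    r"\b(ata\d+)(?:\.\d+)?: .*(?:error|exception|failed|reset)",
    re.IGNORECASE,
)

def summarize_errors(lines):
    """Count PCIe AER lines (key "pcie") and ATA error lines per port (key "ataN")."""
    counts = Counter()
    for line in lines:
        if PCIE_RE.search(line):
            counts["pcie"] += 1
        match = ATA_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Passing an open file object works, since it iterates line by line, for example `summarize_errors(open("syslog.txt", errors="replace"))`. The `ataN` port number in the result still has to be mapped to a disk by hand.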

  • Author
On 11/26/2024 at 12:23 PM, JorgeB said:

There are PCIe errors logged for the NIC. If it's an add-on NIC, try using it in a different slot; if not, try a different NIC. There are also ATA errors for disk3, so check/replace cables.

Thanks.

 

I don't have any PCIe cards connected. It must be the on-board 10 GbE NIC then. Darn; I really didn't want to replace the mainboard. Is this a plausible cause for the server getting hung up occasionally, only responding very slowly?

 

The ATA errors have been coming back every time I've changed ports or cables. I assume it's with the drive, but I can't replace it currently.

  • Community Expert
1 minute ago, DesertCookie said:

Is this a plausible cause for the server getting hung up occasionally, only responding very slowly?

It's possible; try using an add-on NIC if you can.

  • 2 weeks later...
  • Author
On 11/28/2024 at 10:28 AM, JorgeB said:

[...] try using an add-on NIC [...]

I added a TP-Link 2.5 Gb NIC and reseated all SATA cables, yet I continue to experience freezes. I did change some UEFI values that now allow the CPU to clock down to 800 MHz at idle.

stower24-diagnostics-20241206-2014.zip

  • Community Expert

I don't see anything relevant logged, so this can also be a hardware issue. I recommend updating to rc1. One thing you can try is to boot the server in safe mode with all Docker containers/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning the other services back on one by one, including the individual Docker containers.

  • Author

To give a little status update (I will investigate further and try to pinpoint the issue):

 

My issue is not technically solved, but I've made some progress and have had no critical lockups for three days now. My stop-gap solution: Isolate two out of 12 threads. Also, I upgraded to v7.0.0-rc.1, but seeing as this has been an issue even in v6.x I don't see that as a major contributor to the improvements.
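To double-check which threads are actually isolated after a reboot, the kernel's list can be read back. This is a minimal sketch, assuming the isolation is applied through the Linux `isolcpus` boot parameter, in which case the kernel exposes the list in `/sys/devices/system/cpu/isolated`. The function names are hypothetical.

```python
def parse_cpu_list(text):
    """Expand a kernel CPU list such as "0-3,8" into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # an empty file means no CPUs are isolated
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)

def read_isolated(path="/sys/devices/system/cpu/isolated"):
    """Read the kernel's list of isolated CPUs (assumes a Linux sysfs layout)."""
    with open(path) as fh:
        return parse_cpu_list(fh.read())
```

An empty result from `read_isolated()` would mean the setting did not take effect at boot.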

 

This points at some Docker container using up too many resources or potentially causing I/O wait, which, depending on the severity, would previously lock up the system for minutes or hours. Now, three days by no means indicate that I am in the clear, but it definitely is a milestone, especially considering that my monitoring tool (Uptime Kuma) reported fewer lockups than before, all of them shorter than 60 seconds, where before they would last 10-20 minutes or longer. (Once the server itself locked up, monitoring would stop too; that has not happened again so far.)

 

These are sub-60s timeouts on my Nextcloud server. These usually aren't noticeable, as they quickly resolve themselves without the user encountering a 504 error.

telegram.png.835ecc0246e83bbc798084b8f9eb1664.png

 

Looking at the time around 7:42 today, when such a lockup occurred, I am not able to see anything out of the ordinary in Grafana, hinting at Nextcloud being the issue; however, this wouldn't explain why other services sometimes are impacted too.

Grafana.thumb.png.6b00b3ffe132ff9bddbd0ec869b86ce7.png

I/O wait is not especially high. Overall, the highest I've seen is 24% over the past three days.
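For cross-checking the Grafana number directly on the server, here is a minimal sketch that derives an I/O wait percentage from two samples of the aggregate `cpu` line in `/proc/stat`. It assumes the standard Linux field order (user, nice, system, idle, iowait, irq, softirq, steal, ...); the function names are hypothetical, and the result may differ slightly from what the monitoring stack reports, depending on how it aggregates.

```python
import time

def read_cpu_times(path="/proc/stat"):
    """Return the aggregate CPU counters from the first line of /proc/stat."""
    with open(path) as fh:
        fields = fh.readline().split()
    return [int(value) for value in fields[1:]]

def iowait_percent(before, after):
    """Share of elapsed CPU time spent in iowait between two samples."""
    # Sum only user..steal (first 8 fields); guest time is already
    # counted inside user/nice by the kernel.
    deltas = [b - a for a, b in zip(before[:8], after[:8])]
    total = sum(deltas)
    return 100.0 * deltas[4] / total if total else 0.0

def sample_iowait(interval=5.0):
    """Take two samples `interval` seconds apart and return the iowait share."""
    before = read_cpu_times()
    time.sleep(interval)
    return iowait_percent(before, read_cpu_times())
```

Logging `sample_iowait()` to a file in a loop would keep a record on disk even if the dashboards stop updating during a lockup.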

Grafana2.thumb.png.456acf0a8848affefcb902a784d95708.png

 

Uptime Kuma, now running on a secondary system, reports numerous timeouts but only rare 60 s downtimes. Again, my main instance, which only warns after downtimes of 300 seconds, did not alert a single time over the past days, where before it would regularly report timeouts across almost all of my services.

uptime-kuma.thumb.png.021d112dff21861bceb5feaec91f265f.png

 

---

 

A hardware issue definitely isn't out of the question. I'll leave the server running like this for a little longer to potentially find out whether it is some container or other service causing CPU usage or I/O wait. The next step would be swapping out the mainboard. After that, potentially the parity drive, as it has a read and write error rate in the millions. `sdc`, which previously threw a lot of UDMA CRC errors, seems to have calmed down now that I have reseated all cables again. SMART tests pass. I'm monitoring it. In theory, it's the youngest drive in my server at 3 years of power-on time.
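Since the UDMA CRC error counter (SMART attribute 199) is cumulative, what matters is whether it keeps rising after the cables were reseated. Here is a minimal sketch for comparing two saved `smartctl -A` outputs; it assumes the usual ten-column attribute table with the raw value in the last column, and the helper names are hypothetical.

```python
def parse_smart_attributes(text):
    """Map attribute ID -> (name, raw value string) from `smartctl -A` text output."""
    attrs = {}
    for line in text.splitlines():
        parts = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(parts) >= 10 and parts[0].isdigit():
            attrs[int(parts[0])] = (parts[1], " ".join(parts[9:]))
    return attrs

def crc_errors_increased(old_text, new_text, attr_id=199):
    """Return True/False if the raw counter grew, or None if it is missing."""
    old = parse_smart_attributes(old_text).get(attr_id)
    new = parse_smart_attributes(new_text).get(attr_id)
    if old is None or new is None:
        return None
    # Some raw values carry trailing annotations; compare the leading integer.
    return int(new[1].split()[0]) > int(old[1].split()[0])
```

A result of `False` across several days would support the cabling having been the cause; `True` would suggest the link is still dropping data.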

drives.thumb.png.b0e0bd819d4e533bb3a5c5a32f575976.png

 

  • 2 weeks later...
  • Author

Had a lockup shortly after making this post. I've removed my parity disk since then and have had no issues for four days (let's see if that holds). For now, I'm keeping the most frequently served data on the former parity HDD, which I've added as a second cache pool, and backing that up using the Mover Tuning plugin, as well as Borgbackup for things such as Nextcloud.
