Server locking up at random intervals



Good evening. I have been plagued by an ongoing server freeze/crash issue for the last 2 weeks or so, and I really can't seem to find out what is causing it.

 

I was running 6.12.7 without issue for quite some time; then, all of a sudden, I would get a crash: the web GUI would act almost as if it were stuck logging in, and the console would act the same. I have updated to 6.12.8 and now 6.12.9, and the crashes seem to have gotten worse. My most recent crash was exactly 30 minutes after startup.

 

Any help would be greatly appreciated.

 

Attached are my diagnostics as well as the syslog from flash.

zeus-diagnostics-20240401-2229.zip syslog-previous (1)


Unfortunately, memtest is only definitive if it finds an error. You can run it for a few more hours, or alternatively run the server with just one stick of RAM; if you keep getting crashes, try the other one. That will basically rule out a RAM problem.


I have been running memtest now for a little over 3 hours without any errors. 

Is there anything else I should be looking at? 

 

A few things I noticed today.

 

I noticed that at around an hour into a boot, I was unable to access the web GUI (it would just load constantly).

However, I was able to log into the console; trying to shut down from there would just hang. No errors, it just wouldn't shut down.

 

Kind of a crazy issue. 

 


The server has now been running for around 4 hours; however, I have disabled Docker.

 

Clearly something within Docker is causing the problem. Whether or not it is a misconfiguration is unknown at this time.
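(For anyone following along: the Docker service can be turned off from Settings > Docker in the GUI, or from the console if the GUI is already unresponsive. A rough sketch; rc.docker is the Slackware-style init script unraid ships, so check your version:)

  # Stop the Docker service from the unraid console
  /etc/rc.d/rc.docker stop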

 

I am running a parity check right now; I don't think I've had a successful one in around a month.

 

Anyone having similar issues? I really can't get to the bottom of this. 


Quite a few discoveries:

 

1. I ran a full parity check, which took a little over 2 days. During that time the Docker service was disabled.

 

2. I started containers one at a time and found the culprit: my Plex docker. At around the 1-hour mark, an shfs process (?) would generate iowait and the system would just lock up (see the console sketch at the end of this post).

 

I am still trying to find out what is causing the issue, but at least in the meantime I can migrate my Plex workload to my Proxmox cluster.
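In case it helps anyone chasing the same thing, this is roughly how I'd confirm it from the console (just a sketch; iostat comes from the sysstat package, which isn't part of stock unraid):

  # Processes stuck in uninterruptible sleep (state "D") are what high
  # iowait looks like per-process; shfs was the one showing up for me
  ps -eo pid,stat,wchan:32,comm | awk 'NR==1 || $2 ~ /^D/'

  # Overall iowait is the %wa figure in top's summary lines
  top -b -n 1 | head -5

  # Per-disk utilisation/latency, if sysstat is installed
  iostat -x 5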


It has been 10 days and I have had 8 days of uptime. Here is what I have learned and what I am doing now. If you are having the same issue as I am, this may not be the solution you are looking for.

  1. I do not have any hardware errors/failures.
  2. The root issue on my system was the IOWAIT generated by running the Plex docker container. Based on usage patterns, this appears to be due to library scanning for intros and credits.
  3. I can run all of my other containers without crashing, even heavier ones (including a Firefox container and a delugevpn container).
  4. I still get quite a bit of IOWAIT, specifically when running large jobs in Deluge or any of the *arrs.
  5. A co-worker of mine running 6.12.10 has the same processor (AMD Ryzen 5700G); he has also noticed that his IOWAITs have gotten much higher.

I noticed that regardless of the IOWAIT, I am not seeing any real disk usage on the array; maybe 10-20 MB/s on one or two disks. The homepage GUI shows the CPU running high (80%+), but htop and glances contradict that, showing a single CPU core running at 100% while the remaining cores are much, much lower.
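One quick way to cross-check the dashboard numbers without htop (a sketch; if the GUI is folding iowait into its CPU figure, that would explain the mismatch):

  # Sample the per-core counters twice, 5 seconds apart; the 5th value
  # after each cpuN label in /proc/stat is cumulative iowait jiffies
  grep '^cpu' /proc/stat; sleep 5; grep '^cpu' /proc/stat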

 

As a temporary workaround I moved the Plex workload over to a VM running on unraid, and it worked without issue. I then moved the workload over to Proxmox and copied the existing cache and metadata. I have since been running Plex on Proxmox, pointed at unraid via NFS.
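For reference, the Proxmox-side mount is nothing exotic; something along these lines in /etc/fstab (a sketch; the hostname and share paths are just stand-ins for my setup):

  # /etc/fstab on the Proxmox guest - hostname/paths are placeholders
  zeus:/mnt/user/media  /mnt/media  nfs  defaults,vers=4  0  0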

 

Based on this, I think it is safe to assume that the file manipulation on the media files themselves isn't the root issue. It must be the appdata share. However, the disk usage there is also pretty minimal. My appdata is hosted on a ZFS pool with 10k SAS drives, with that pool set as the share's primary storage; HOWEVER, the mount is still /mnt/user/appdata.
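Worth spelling out for anyone in the same spot: anything under /mnt/user/* goes through the shfs FUSE layer, while mapping a container directly at the pool path bypasses it. A sketch of the difference (the pool name here is made up, and I have not re-tested Plex this way myself):

  # Mapping through the user share - every read/write crosses shfs/FUSE
  docker run -d --name plex -v /mnt/user/appdata/plex:/config plexinc/pms-docker

  # Mapping straight at the pool - bypasses shfs ("fastpool" is a placeholder)
  docker run -d --name plex -v /mnt/fastpool/appdata/plex:/config plexinc/pms-docker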

 

Based on this information, I have begun the process of migrating all of the containers off unraid and onto Proxmox. I will probably use unraid purely for storage from now on, which is really a shame because Community Apps is what pushed me to purchase a license. Oh well, maybe I have just hit the magic ratio of files to hardware that warrants a different system for compute.

 

TL;DR: The Plex container was causing high IOWAIT, which was causing my issues. Moving the workload off unraid solved* my issue.

 

 

