Server locking up at random intervals



Good evening. I have been plagued by an ongoing server freeze/crash issue for the last 2 weeks or so, and I really can't seem to find out what is causing it.

 

I was running 6.12.7 without issue for quite some time; then, all of a sudden, I would get a crash: the web GUI would act almost as if it were stuck logging in, and the console would act the same. I have updated to 6.12.8 and now 6.12.9, and the crashes seem to have gotten worse. My most recent crash was exactly 30 minutes after startup.

 

Any help would be greatly appreciated.

 

Attached are my diagnostics as well as the syslog from flash.

zeus-diagnostics-20240401-2229.zip syslog-previous (1)


Unfortunately, memtest is only definitive if it finds an error. You can run it for a few more hours, or alternatively run the server with just one stick of RAM; if you keep getting crashes, try the other one. That will basically rule out a RAM problem.


I have been running memtest now for a little over 3 hours without any errors. 

Is there anything else I should be looking at? 

 

A few things I noticed today.

 

I noticed that at around an hour into a boot, I was unable to access the web GUI (it would just load constantly).

However, I was able to log into the console; trying to shut down from there would just hang. No errors, it just wouldn't shut down.

 

Kind of a crazy issue. 

 


The server has now been running for around 4 hours; however, I have disabled Docker.

 

Clearly something within Docker is causing the problem. Whether or not it is a misconfiguration is unknown at this time.
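(For anyone following along: the Docker service can be turned off from Settings > Docker in the GUI, or from the console if the GUI is already unresponsive. A rough sketch; rc.docker is the Slackware-style init script unraid ships, so check your version:)

  # Stop the Docker service from the unraid console
  /etc/rc.d/rc.docker stop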

 

I am running a parity check right now; I don't think I've had a successful one in around a month.

 

Anyone having similar issues? I really can't get to the bottom of this. 


Quite a few discoveries:

 

1. I ran a full parity check, which took a little over 2 days. During that time the Docker service was disabled.

 

2. I started containers one at a time and found the culprit: my Plex docker. At around the 1-hour mark, an shfs process (?) would generate iowait and the system would just lock up (see the console sketch at the end of this post).

 

I am still trying to find out what is causing the issue, but at least in the meantime I can migrate my Plex workload to my Proxmox cluster.
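In case it helps anyone chasing the same thing, this is roughly how I'd confirm it from the console (just a sketch; iostat comes from the sysstat package, which isn't part of stock unraid):

  # Processes stuck in uninterruptible sleep (state "D") are what high
  # iowait looks like per-process; shfs was the one showing up for me
  ps -eo pid,stat,wchan:32,comm | awk 'NR==1 || $2 ~ /^D/'

  # Overall iowait is the %wa figure in top's summary lines
  top -b -n 1 | head -5

  # Per-disk utilisation/latency, if sysstat is installed
  iostat -x 5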


It has been 10 days and I have had 8 days of uptime. Here is what I have learned and what I am doing now. If you are having the same issue as I am, this may not be the solution you are looking for.

  1. I do not have any hardware errors/failures.
  2. The root issue on my system was the IOWAIT generated by running the Plex docker container. Based on usage patterns, this appears to be due to library scanning for intros and credits.
  3. I can run all of my other containers without crashing, even heavier ones (including a Firefox container and a delugevpn container).
  4. I still get quite a bit of IOWAIT, specifically when running large jobs in Deluge or any of the *arrs.
  5. A co-worker of mine running 6.12.10 has the same processor (AMD Ryzen 5700G); he has also noticed that his IOWAITs have gotten much higher.

I noticed that regardless of the IOWAIT, I am not seeing any real disk usage on the array; maybe 10-20 MB/s on one or two disks. The homepage GUI shows the CPU running high (80%+), but htop and glances contradict that, showing a single CPU core running at 100% while the remaining cores are much, much lower.
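One quick way to cross-check the dashboard numbers without htop (a sketch; if the GUI is folding iowait into its CPU figure, that would explain the mismatch):

  # Sample the per-core counters twice, 5 seconds apart; the 5th value
  # after each cpuN label in /proc/stat is cumulative iowait jiffies
  grep '^cpu' /proc/stat; sleep 5; grep '^cpu' /proc/stat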

 

As a temporary workaround I moved the Plex workload over to a VM running on unraid, and it worked without issue. I then moved the workload over to Proxmox and copied the existing cache and metadata. I have since been running Plex on Proxmox, pointed at unraid via NFS.
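For reference, the Proxmox-side mount is nothing exotic; something along these lines in /etc/fstab (a sketch; the hostname and share paths are just stand-ins for my setup):

  # /etc/fstab on the Proxmox guest - hostname/paths are placeholders
  zeus:/mnt/user/media  /mnt/media  nfs  defaults,vers=4  0  0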

 

Based on this, I think it is safe to assume that the file manipulation on the media files themselves isn't the root issue. It must be the appdata share. However, the disk usage there is also pretty minimal. My appdata is hosted on a ZFS pool with 10k SAS drives, with that pool set as the share's primary storage; HOWEVER, the mount is still /mnt/user/appdata.
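Worth spelling out for anyone in the same spot: anything under /mnt/user/* goes through the shfs FUSE layer, while mapping a container directly at the pool path bypasses it. A sketch of the difference (the pool name here is made up, and I have not re-tested Plex this way myself):

  # Mapping through the user share - every read/write crosses shfs/FUSE
  docker run -d --name plex -v /mnt/user/appdata/plex:/config plexinc/pms-docker

  # Mapping straight at the pool - bypasses shfs ("fastpool" is a placeholder)
  docker run -d --name plex -v /mnt/fastpool/appdata/plex:/config plexinc/pms-docker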

 

Based on this information, I have begun the process of migrating all of the containers off unraid and onto Proxmox. I will probably use unraid purely for storage from now on, which is really a shame because Community Apps is what pushed me to purchase a license. Oh well, maybe I have just hit the magic ratio of files to hardware that warrants a different system for compute.

 

TL;DR: The Plex container was causing high IOWAIT, which was causing my issues. Moving the workload off unraid solved* my issue.

 

 

