
System Lock Up and High IO Wait



Hello,

I've been experiencing some (seemingly) random system lockups and instability. I'm sure I have something configured poorly, but I can't seem to find it.

 

Symptoms: I can't access the GUI or running containers. Sometimes I can't even SSH into the machine.

 

What's new: I recently swapped cases for my machine, but I had these symptoms sporadically before that.

 

What I'm doing about it: I'm not entirely sure what the issue is. I have the syslog open on my Windows machine via TFTPD64 and I don't really see any errors. (Exception: when my web GUI was frozen on a blank screen, I tried changing the trailing "/main" in the URL to "/docker" to see if I could identify a problem container. That request logged an error.) I ran Memtest within the last month or so and it finished without any errors.

 

I'm hopeful I've got a dumb setting somewhere but I can't seem to identify it.

 

The image below shows the high IO wait and the subsequent lockup / unavailability.

image.thumb.png.8eba2ae3fd9adfdbd2d7b23fc295243f.png

 

Unraid version: 6.12.11

 

Diagnostics attached.

 

Thanks for your time! 

 

theblender-diagnostics-20240810-2012.zip

Edited by ChuckBuilds
added Unraid Version
Link to comment

Still working on this. It is seemingly random. I had a log message of a disk spinning down right before the server went unresponsive, so I disabled disk spindown. Not even 24 hours later it was unresponsive again, and the last syslog message was from 2 hours earlier, when I successfully logged into the web GUI.
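Since the syslog goes quiet well before the lockup, one way to capture what the system was doing in those last minutes is to sample IO wait directly and flush every sample to disk. The following is a minimal sketch, not something taken from this thread: the function names, the log path, and the 10-second interval are all hypothetical. It reads the aggregate `cpu` line from `/proc/stat`, where iowait is the fifth counter.

```python
import os
import time


def parse_cpu_line(stat_text):
    """Return the aggregate 'cpu' counters from /proc/stat text as ints."""
    for line in stat_text.splitlines():
        if line.startswith("cpu "):
            return [int(value) for value in line.split()[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_percent(before, after):
    """Percentage of elapsed CPU time spent in iowait between two samples."""
    # Use only the first eight fields (user nice system idle iowait irq
    # softirq steal); guest time is already included in user/nice.
    deltas = [b - a for a, b in zip(before[:8], after[:8])]
    total = sum(deltas)
    # iowait is the fifth field, index 4.
    return 100.0 * deltas[4] / total if total > 0 else 0.0


def log_iowait(log_path, interval=10.0, samples=None):
    """Append timestamped iowait readings to log_path (hypothetical helper).

    samples=None loops until interrupted; pass a number to stop after that
    many readings.
    """
    with open("/proc/stat") as f:
        previous = parse_cpu_line(f.read())
    count = 0
    while samples is None or count < samples:
        time.sleep(interval)
        with open("/proc/stat") as f:
            current = parse_cpu_line(f.read())
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        with open(log_path, "a") as log:
            log.write(f"{stamp} iowait={iowait_percent(previous, current):.1f}%\n")
            log.flush()
            # Force the line to disk so it survives a hard lockup.
            os.fsync(log.fileno())
        previous = current
        count += 1
```

Calling `log_iowait("iowait.log")` from a background session would leave a trail of readings up to the moment of the hang. Where to put the log is a judgment call: a path on the array may itself stall if the array is what is blocking, so a location that does not depend on the array is probably safer.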

 

I've run an extended memtest with no errors. CPU and system temps are cool.

 

I suppose my next step is safe mode, but I am not looking forward to 24+ hours without Home Assistant running.

 

Any direction is appreciated.

Link to comment
11 minutes ago, ChuckBuilds said:

suppose my next step is safe mode but I am not looking forward to 24+ hours without my homeassistant running.

Since there's nothing relevant logged that I can see, that's probably your best bet. Boot the server in safe mode with all Docker containers/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning on the other services one by one.

Link to comment
  • 2 weeks later...

I think I narrowed it down to a BIOS setting that was automatically overclocking the CPU and possibly crashing it during peak usage. It wasn't abundantly clear how to turn it off, but I eventually got it disabled and the system seems more solid. I just hit 10 days without an issue, which is more than I could say before. I hope it continues. I am a bit surprised that this was the issue, as I don't recall ever turning that feature on, but maybe I accidentally turned it on when I was swapping the case and doing fan tuning. That doesn't fully explain it, as I swear I had issues before the case swap, but either way it seems good now (for now!).

 

Thanks for checking, JorgeB. It helped convince me it was a hardware issue and not a bad Unraid setting.

 

Link to comment

Unfortunately it struck again today, 15 days later. That is a lot better uptime, and this time I was right next to the machine. I was able to use the local command line on the machine, and when logging in I got an error that the login timed out after 60 seconds. This led me to Google and a result elsewhere on these forums where a similar problem was related to Intel C-States. I just went into the BIOS and disabled Intel C-States.
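To double-check from the running system that a BIOS C-State change actually took effect, the idle states the kernel exposes can be read from sysfs. This is a rough sketch under assumptions: `list_cstates` is a hypothetical helper, and it relies on the standard Linux cpuidle layout (`/sys/devices/system/cpu/cpu0/cpuidle/stateN/name` and `.../disable`), which may be absent on some systems.

```python
from pathlib import Path


def list_cstates(cpu_dir):
    """Return (name, disabled) for each idle state under <cpu_dir>/cpuidle."""
    states = []
    cpuidle = Path(cpu_dir) / "cpuidle"
    if not cpuidle.is_dir():
        # No cpuidle directory: the kernel exposes no idle states for this CPU.
        return states
    # Sort numerically so state10 comes after state2.
    for state in sorted(cpuidle.glob("state*"), key=lambda p: int(p.name[5:])):
        name = (state / "name").read_text().strip()
        disable_file = state / "disable"
        # The 'disable' file holds 0 when the state is usable by the kernel.
        disabled = disable_file.exists() and disable_file.read_text().strip() != "0"
        states.append((name, disabled))
    return states
```

Running `list_cstates("/sys/devices/system/cpu/cpu0")` before and after the BIOS change gives something to compare. I would expect the deeper states to be missing or fewer afterwards, but that is an assumption; the kernel's idle driver can present a different picture than the BIOS menu suggests.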

 

I hope to be marking this as the solution in the near future.

Link to comment
  • 2 weeks later...

Thirteen days later I got another crash. I had just re-enabled disk spin-down after 12 days, so I can't help but feel that it is related.

 

At the beginning of those 13 days I reseated the CPU and RAM, updated the BIOS, and loaded BIOS defaults. I then disabled C-States, XMP, and the "Game Mode" CPU overclock.

 

The timer is reset, but I am running out of ideas about what is going wrong! I could access the command line locally on the machine with a connected monitor and keyboard. I tried to log in as root and got the error that the login timed out after 60 seconds.

 

Just documenting at this point. If someone stumbles across this, I hope to have a definitive resolution to share at some point.

Link to comment
