OS Hanging forcing reboot via IPMI


VelcroBP


I've been experiencing intermittent freezes where the UI and all container apps become unresponsive. I can still connect via SSH and IPMI, and rebooting with powerdown -r restores functionality for a while. I haven't been able to correlate the freezes with any particular action by a container or service. However, just before the most recent freeze this morning, I noticed high RAM usage on the dashboard (98%). top only accounted for ~55%, so I don't think it was actually using all of the allocated RAM, but I thought I'd mention it.
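One way to see where a gap between the dashboard figure and top might come from is to look at /proc/meminfo directly. Per-process numbers in top generally do not include page cache or files held in tmpfs, which the kernel reports as Cached and Shmem. The following is a minimal Python sketch of that breakdown. The function names and the sample values are hypothetical, and it makes no assumption about how the dashboard computes its percentage.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of values in kB."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def memory_breakdown(info):
    """Express a few /proc/meminfo fields as a percentage of MemTotal."""
    total = info["MemTotal"]

    def pct(kb):
        return round(100.0 * kb / total, 1)

    # Older kernels lack MemAvailable; fall back to MemFree in that case.
    available = info.get("MemAvailable", info["MemFree"])
    return {
        "not_available_pct": pct(total - available),
        "free_pct": pct(info["MemFree"]),
        "page_cache_pct": pct(info.get("Cached", 0)),  # Cached includes Shmem
        "shmem_tmpfs_pct": pct(info.get("Shmem", 0)),
    }


# Hypothetical sample values, not taken from the attached diagnostics.
SAMPLE = """\
MemTotal:       16000000 kB
MemFree:          320000 kB
MemAvailable:    4000000 kB
Cached:          6400000 kB
Shmem:           2400000 kB
"""

print(memory_breakdown(parse_meminfo(SAMPLE)))
```

To use it on real data, the output of `cat /proc/meminfo` captured over SSH during high usage can be passed to `parse_meminfo` in place of `SAMPLE`. A large Shmem share would point at files sitting in RAM-backed filesystems rather than at any one process.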

 

I've been running a Kiwi syslog server on another PC since the last hang, and today's syslog is attached.

MootowerSyslogCatchAll-2022-12-27.txt


So far it has been stable after a couple of days running in safe mode. I've renamed the extensions of the .plg files.

Clarification on the plugin testing: Is it enough to install one at a time during the current safe boot session? Or do I need to reboot into normal mode after restoring each .plg extension?

  • 2 weeks later...

So I disabled all plugins and have been running in normal mode since 12/30. Every couple of days I re-enabled a few plugins, with the final batch on 1/12. There were no issues or freezes at all, so I thought all must be well and that the issue had been a plugin that was fixed by reinstalling.

 

Today I was testing the Roku Jellyfin app (having recently set up a container), and upon playing a file and returning to the menu, unRaid locked up again. It was the same as before: all shares are accessible, and so is the console via SSH. Only the UI and apps are not responding. I can log into the Main and container WebUIs, but nothing loads.

 

I'm attaching a diagnostics zip from during the hang (generated via console command), as well as the auto-generated post-boot one. The freeze occurred at ~12:10. The syslog doesn't seem to report anything relevant, but I can attach it if needed.
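For anyone going through a syslog around a known freeze time such as the ~12:10 above, pulling out just the lines within a few minutes of it can make the search easier. This is a rough Python sketch, assuming the traditional "Mon DD HH:MM:SS" timestamp prefix on each line. Logs written in another format, for example by a remote syslog collector, would need a different parser. The function name and the sample lines are hypothetical placeholders, not real log output.

```python
from datetime import datetime, timedelta


def lines_near(syslog_text, when, minutes=5):
    """Return syslog lines stamped within `minutes` of `when`."""
    window = timedelta(minutes=minutes)
    hits = []
    for line in syslog_text.splitlines():
        try:
            # Traditional syslog stamps omit the year, so borrow it from `when`.
            stamp = datetime.strptime(
                f"{when.year} {line[:15]}", "%Y %b %d %H:%M:%S"
            )
        except ValueError:
            continue  # line does not start with the expected timestamp
        if abs(stamp - when) <= window:
            hits.append(line)
    return hits


# Placeholder lines for illustration only; not real log output.
SAMPLE = """\
Jan 14 11:40:01 tower example: placeholder line well before the freeze
Jan 14 12:08:30 tower example: placeholder line shortly before the freeze
Jan 14 12:11:05 tower example: placeholder line during the freeze
Jan 14 12:30:00 tower example: placeholder line after the reboot
"""

freeze = datetime(2023, 1, 14, 12, 10)
for line in lines_near(SAMPLE, freeze):
    print(line)
```

Widening `minutes` is cheap, so it may be worth starting with a generous window and narrowing it once something stands out.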

mootower-diagnostics- DURING - 20230114-1211.zip mootower-diagnostics- POST REBOOT - 20230114-1217.zip


It just hung again. This time, a user was starting a remote Plex stream that was transcoding from 1080 to SD (no idea why she would have her iOS Plex app set to SD).

 

Attached are the diagnostics from during the freeze (~15:25) and from right after rebooting and starting the array. Any help parsing these for a potential cause to troubleshoot next would be greatly appreciated. I will continue disabling one plugin at a time unless other suggestions come in. Though the fact that the two most recent incidents involved Jellyfin playback or Plex transcoding/playback makes me inclined to think it's something related to video? I'm at a loss really.

mootower-diagnostics-20230116-1527 -- DURING FREEZE.zip mootower-diagnostics-20230116-1537 -- AFTER REBOOT and ARRAY START.zip


OK, thanks for looking. I will keep going with the plugins, then I'll try disabling iGPU transcoding. Just grasping at straws really. If I can't find a root cause, I hope to have funds in the next few months to rebuild my server and replace/upgrade everything but the data drives. It just needs to hang in there until then.


I had a thought last night, though I don't know if it's relevant. I have been getting CRC errors on one of my data drives. I've kept an eye on it, and recently they started happening more frequently. I've swapped the cable and the mobo port, and the errors continued. Yesterday I moved the drive into a different bay in the Norco 3x5, and if the count still grows I'll assume it's a bad drive.
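Since the CRC counter is cumulative, what matters after each change (cable, port, bay) is whether the number keeps rising. Here is a small Python sketch for reading it out of saved `smartctl -A` output, assuming the drive reports it as SMART attribute 199 in the usual ten-column table. The function name and the sample values are hypothetical and not taken from the drive in question.

```python
def crc_error_count(smartctl_output):
    """Return the raw value of SMART attribute 199, or None if not found."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows have 10 columns; RAW_VALUE is the tenth.
        if len(fields) >= 10 and fields[0] == "199":
            try:
                return int(fields[9])
            except ValueError:
                return None
    return None


HEADER = (
    "ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH "
    "TYPE      UPDATED  WHEN_FAILED RAW_VALUE\n"
)

# Hypothetical readings taken before and after moving the drive.
BEFORE = HEADER + (
    "  9 Power_On_Hours          0x0032   090   090   000    "
    "Old_age   Always       -       8123\n"
    "199 UDMA_CRC_Error_Count    0x003e   200   200   000    "
    "Old_age   Always       -       37\n"
)
AFTER = HEADER + (
    "199 UDMA_CRC_Error_Count    0x003e   200   200   000    "
    "Old_age   Always       -       41\n"
)

before = crc_error_count(BEFORE)
after = crc_error_count(AFTER)
print(f"CRC errors: {before} -> {after} (grew by {after - before})")
```

Matching on the attribute ID rather than the name is deliberate, because the name text can differ between drive vendors.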

 

With that said about the drive, is it possible that an error communicating with a data drive, for example during playback or a transcode, could cause the OS/UI to freeze? On several of my reboots after hangs, the system fails to POST due to a SMART error with that drive. I press F1 to retry and it boots normally. Just grasping at straws really. I hope to have funds in the next few months to rebuild my server and replace/upgrade everything but the data drives. It just needs to hang in there until then lol.


New theory: does the issue occur when Plex is transcoding AND a Nextcloud sync operation is running? That was the system state at the time of today's hang, anyway. So far Plex has been running for all the hangs I've been present for, and the only non-Plex one was with Jellyfin.

 

Also new with today's freeze: the UI returned many errors of devices not being available for unmounting during the Powerdown, resulting in an unclean shutdown.

mootower-diagnostics-20230120-1458 - HANG TIME.zip mootower-diagnostics-20230120-1501 - POST BOOT.zip
