[SOLVED] Server becomes unstable, cannot find why


Starting last week, my main Unraid server has gradually become unresponsive, to the point where I must hard reset it. I have tried a few different things with varying success, only to find that the issue returns after a few days. There have been no recent hardware or software changes besides docker updates; more on that below.

 

Running 6.8.3 Nvidia

ASRock - X570M Pro4

AMD Ryzen 5 3600 6-Core Processor

32GB of RAM

 

  1. Initially the issue occurred while I was doing some large removals in the rutorrent docker.
    1. The web client for that docker became unavailable.
    2. I used the WebGUI to try to restart the docker.
    3. It just sat in a never-ending cycle of stopping/starting.
    4. I then located the container in htop, but did not feel comfortable trying a SIGKILL/SIGTERM at this stage.
    5. Attempted to stop the entire Docker service via the WebGUI.
    6. The WebGUI became unresponsive.
    7. Attempted powerdown and a few other commands to bring the array down safely, with no luck, so a hard reset was performed.
  2. Server came back up, parity check started...
    1. The one recent major change: I migrated from the binhex rutorrent image to the linuxserver one, so I rolled it back to a version from a few weeks ago to see if that made any difference (it did not).
    2. A day or two goes by and I do some smaller file moves and changes; rutorrent becomes unavailable again.
    3. When I try to pull diagnostics from the WebGUI, the WebGUI becomes unresponsive.
    4. I attempt to pull them via the CLI, and that just sits there for an hour with no success.
    5. During this entire time the other dockers are running fine; for example, Plex streams fine.
    6. I try sending SIGTERM and SIGKILL to the docker processes related to rutorrent, and then to the docker service, with no luck freeing the system from whatever is hanging things up.
    7. Syslog does not point to anything major.
    8. Hard reset the box, as powerdown and other commands ultimately take the rest of the dockers down into an unusable state, but the shutdown still stalls.
  3. Server came back up, parity check started
    1. Ran diagnostics; of course there is nothing of real interest in them, but I have included them anyway.
    2. Found a few posts saying that docker.img can get corrupted, so I deleted the docker.img file and recreated it.
    3. Recreated all my major dockers; I left off the aux/testing ones.
    4. A day or so goes by and I decide to check on things: the WebGUI works, but rutorrent does not. This time I was not making any file moves besides the Sonarr and Radarr ones that happen automatically in the background (nothing major, like a whole series or anything).
    5. I try to restart the docker and then things go south again.
    6. Can't get diagnostics from the WebGUI or the CLI.
    7. Try a few odds and ends with no luck; hard reset.
  4. Server booted in safe mode, no plugins, just basic docker images, parity check complete
    1. I attempt to reproduce the issue by cleaning up files from within rutorrent and outside it, removing about 200-300 GB of data across the various devices.
    2. Not able to reproduce it.
    3. 1-2 days pass and I find rutorrent unresponsive again (still in safe mode).
    4. I access the WebGUI, which loads fine until I try to restart the docker, and then that freezes up.
    5. Cannot get diagnostics from the CLI as that just sits there, screenshot: http://cln.sh/CZ8q
    6. I redirected the output of cat /var/log/syslog to the attached file. I also zipped and saved the entire log directory (not attached for security reasons, but I can provide it if needed).
    7. The WebGUI returns 500 Internal Server Error, but as before the other key dockers are working, so I am going to keep the server on until maybe I can get some advice on this thread.
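
For anyone hitting the same pattern: a container that ignores SIGTERM and SIGKILL while the diagnostics command never returns can be a sign of processes stuck in uninterruptible sleep (state D, usually waiting on I/O), although nothing posted here confirms that was the cause. A minimal sketch for checking, assuming a procps-style ps that accepts -eo; the function name list_d_state is made up for illustration:

```shell
# Hypothetical helper: print only processes whose state starts with "D"
# (uninterruptible sleep). It reads "ps -eo pid,stat,wchan:32,comm"-style
# rows on stdin, so it can also be run against saved output.
list_d_state() {
    awk 'NR == 1 { print; next }   # keep the header row
         $2 ~ /^D/ { print }       # STAT column begins with D
    '
}

# Usage on a live system (assumes procps ps):
#   ps -eo pid,stat,wchan:32,comm | list_d_state
```

If anything shows up, the WCHAN column gives a rough hint of what the process is waiting on, which may be worth including alongside the syslog.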

 

Any thoughts or suggestions are much appreciated!

 

 

 

nas-diagnostics-20200506-1837.zip syslog.txt

Edited by wbhst83

I can try adjusting the BIOS settings as outlined, but before I do, I would like to make sure no one has any other data-gathering requests or ideas before I bounce the box. Also, this system has been up and running for almost a year now (only intentional reboots, never a hung state). The most recent BIOS update was in January, and the server was stable before and after it until the last 2 weeks.

Edited by wbhst83

Not sure if this helps, but I ran the command below and got this back. Also, to be clear, the system is still responding to CLI commands and the dockers are mostly still working. No kernel panics.

 

root@nas:~# cat /sys/module/intel_idle/parameters/max_cstate
9
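
Since this is an AMD Ryzen box, the intel_idle value above likely does not describe the idle driver actually in use. A hedged sketch for seeing which cpuidle driver the kernel reports, assuming the usual sysfs file /sys/devices/system/cpu/cpuidle/current_driver exists on this kernel; the function name and its optional root argument are illustrative only:

```shell
# Hypothetical helper: report the active cpuidle driver from sysfs.
# The optional first argument is an alternate sysfs root (for testing only).
show_cpuidle_driver() {
    root="${1:-/sys}"
    f="$root/devices/system/cpu/cpuidle/current_driver"
    if [ -r "$f" ]; then
        cat "$f"
    else
        # File missing or unreadable: cpuidle may not be exposed here.
        echo "unknown"
    fi
}

# Usage on the server:
#   show_cpuidle_driver
```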

Edited by wbhst83

I rebooted; surprisingly, the powerdown option worked this time. I changed Advanced -> AMD CBS -> Power Supply Idle Control to "Typical" from "Auto". The server is back up with plugins. I will see how the next couple of days go. I also noticed that the BIOS is at 2.3 and the latest from ASRock is 2.6, so updating might be my next step.
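
To confirm the running BIOS version without rebooting into setup, something like the following may work. It is only a sketch, assuming the kernel exposes DMI data at /sys/class/dmi/id/bios_version; the function name and the optional root argument are hypothetical:

```shell
# Hypothetical helper: print the firmware version the kernel read from DMI.
# The optional first argument is an alternate sysfs root (for testing only).
bios_version() {
    root="${1:-/sys}"
    f="$root/class/dmi/id/bios_version"
    if [ -r "$f" ]; then
        cat "$f"
    else
        echo "unknown"
    fi
}

# Usage on the server:
#   bios_version
```

The exact string format is up to the board vendor, so it may not match the version number shown on the download page character for character.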

nas-diagnostics-20200508-2206.zip

Edited by wbhst83
  • JorgeB changed the title to [SOLVED] Server becomes unstable, cannot find why
