[SOLVED] Server Going Unresponsive Multiple Times This Week

MarkRMonaco · September 14, 2020

My Unraid (v6.9.0-beta25) server has gone completely unresponsive several times this past week: the server itself and all hosted Docker containers stop responding, with no network communication whatsoever. Each time, it had to be power-cycled to come back online. This has happened often enough that I'm now rebuilding one of my array disks (which previously reported UDMA CRC errors, but passed several SMART tests).

Can someone take a look at my diagnostics and see if there are any other underlying issues that I may be missing (apart from the one reported drive)? I have a feeling that it happens when the system is under unusually high load, but I'm not 100% certain. No hardware changes have been made for roughly a month and a half.

Thanks in advance.

wadewilson-diagnostics-20200913-2324.zip

Edited March 30, 2021 by MarkRMonaco

ChatNoir · September 14, 2020

Did you consider the FAQ entry about Ryzen CPUs?

MarkRMonaco · September 14, 2020

Thanks, I am aware of that, but I don't think it's the issue this time around.

C-states are idle power states and my system is definitely locking up during activity (not while being idle).

Edited September 14, 2020 by MarkRMonaco

trurl · September 14, 2020

9 hours ago, MarkRMonaco said:

CRC errors

are connection issues. You can acknowledge those by clicking on the warning on the Dashboard, and it won't warn again until they increase. The disk you are rebuilding looks fine and only had 11 CRC errors, so I wouldn't worry about it. There was no reason to rebuild unless it was disabled, but let it continue.
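For anyone who wants to watch that counter outside the Dashboard, here is a minimal sketch that reads the CRC count from the attribute table printed by `smartctl -A`. The function names and the sample row are illustrative assumptions, not taken from the posted diagnostics; only the count of 11 comes from this thread.

```python
from typing import Optional

# Illustrative attribute row in the column layout printed by `smartctl -A`.
# Only the raw value (11) comes from this thread; the rest is a placeholder.
SAMPLE_ROW = (
    "199 UDMA_CRC_Error_Count    0x003e   200   200   000    "
    "Old_age   Always       -       11"
)


def crc_error_count(smartctl_output: str) -> Optional[int]:
    """Return the raw value of SMART attribute 199, or None if it is absent."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows have ten columns; the raw value is the tenth.
        if len(fields) >= 10 and fields[0] == "199":
            try:
                return int(fields[9])
            except ValueError:
                return None
    return None


def crc_increased(previous: int, current: Optional[int]) -> bool:
    """True only when a readable count is higher than the acknowledged one."""
    return current is not None and current > previous
```

On the server, the text passed to `crc_error_count` would be the output of `smartctl -A` for the disk in question. A count that stays flat after reseating or replacing a cable suggests the connection problem is gone; the old errors never reset.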

Unlikely related to any problem, but your system share has files on the array instead of all on cache.

Syslog resets on reboot so those diagnostics don't tell us anything about what happened prior.

Set up the Syslog Server so the syslog is saved somewhere you can retrieve it and post it when you have to reboot again.

civic95man · September 14, 2020

Looks like you are overclocking your memory above what is supported for that combination of processor, memory, and board (XMP is considered an overclock). You are running at 3266 and the max supported is 2667. Try adjusting your memory speed to that and see if your stability improves.

MarkRMonaco · September 14, 2020

43 minutes ago, civic95man said:

Looks like you are overclocking your memory above what is supported for that combination of processor, memory, and board (XMP is considered an overclock). You are running at 3266 and the max supported is 2667. Try adjusting your memory speed to that and see if your stability improves.

That's not necessarily true. That particular generation of Ryzen runs fine on RAM speeds between 3000 and 3200 MHz (+/-). In fact, I've had Unraid running on that combination for over a year and it's been fine.

Edited September 14, 2020 by MarkRMonaco

MarkRMonaco · September 14, 2020

58 minutes ago, trurl said:

are connection issues. You can acknowledge those by clicking on the warning on the Dashboard, and it won't warn again until they increase. The disk you are rebuilding looks fine and only had 11 CRC errors, so I wouldn't worry about it. There was no reason to rebuild unless it was disabled, but let it continue.

Unlikely related to any problem, but your system share has files on the array instead of all on cache.

Syslog resets on reboot so those diagnostics don't tell us anything about what happened prior.

Set up the Syslog Server so the syslog is saved somewhere you can retrieve it and post it when you have to reboot again.

Thanks for confirming my suspicion about the drive being "ok". I definitely acknowledged the errors, but the drive did eventually go into disabled status twice over the course of yesterday. So, this is my second time trying to rebuild (it locked up again overnight and I decided to leave the server powered-off).

I'll definitely get the syslog server configured (something that I've been meaning to do) as well.

I'm also going to check the cabling (when I get home) and see if there is anything that stands out (like a deep crease or a crimp from a zip tie).

Edited September 14, 2020 by MarkRMonaco

civic95man · September 14, 2020

10 minutes ago, MarkRMonaco said:

That's not necessarily true. That particular generation of Ryzen runs fine on RAM speeds between 3000 and 3200 MHz (+/-). In fact, I've had Unraid running on that combination for over a year and it's been fine.

I'm just quoting the facts... and any system will run fine on an overclock until it doesn't. If you're trying to troubleshoot an unstable system, it's always best to remove any overclock and go from there.

10 minutes ago, MarkRMonaco said:

That particular generation of Ryzen runs fine on RAM speeds between 3000 and 3200 MHz (+/-)

Runs fine where? Windows? Or Unraid?

That linked FAQ for Unraid has this table to identify the supported memory speeds based on configuration.

image.png.ae627a45f89e61b4e8e3a65346647716.png

MarkRMonaco · September 14, 2020

37 minutes ago, civic95man said:

I'm just quoting the facts... and any system will run fine on an overclock until it doesn't. If you're trying to troubleshoot an unstable system, it's always best to remove any overclock and go from there.

Runs fine where? Windows? Or Unraid?

That linked FAQ for Unraid has this table to identify the supported memory speeds based on configuration.

Runs fine on either platform.

With all due respect, I get where you're going with the troubleshooting (and appreciate it), but I do know what I'm doing. I've been a desktop and server engineer for many years. That is why I was looking to do a deep-dive on the logs in the first place.

Edited September 14, 2020 by MarkRMonaco

trurl · September 14, 2020

1 hour ago, trurl said:

Unlikely related to any problem, but your system share has files on the array instead of all on cache.

Just thought I should explain the consequences of this since you didn't comment on it.

The system share is where docker.img and libvirt.img normally live. If Docker or VMs are enabled, these files will be open. If those files are on the array, the array disks will be kept spinning, and performance will suffer because every write also updates parity.
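A quick way to confirm where the share's files actually sit is to look for them on each array disk. The sketch below assumes the usual Unraid layout, with array disks mounted as /mnt/disk1, /mnt/disk2, and so on, and the cache pool mounted separately; the function name and the `mnt_root` parameter are hypothetical.

```python
from pathlib import Path
from typing import List


def disks_holding_share(share: str, mnt_root: str = "/mnt") -> List[str]:
    """List array disks (disk1, disk2, ...) that contain files for a share."""
    hits = []
    # Only diskN mounts are checked; the cache pool is deliberately ignored.
    for disk in sorted(Path(mnt_root).glob("disk[0-9]*")):
        share_dir = disk / share
        # An empty share folder on a disk is harmless, so look for real files.
        if share_dir.is_dir() and any(p.is_file() for p in share_dir.rglob("*")):
            hits.append(disk.name)
    return hits
```

Calling `disks_holding_share("system")` on the server should return an empty list once everything in the share lives on the cache pool. Any disk name it returns is a disk that will be kept spinning while Docker or VMs are running.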

ChatNoir · September 14, 2020

Maybe a slower memory speed solves the issue, maybe not.

In any case, finding the cause is a process of elimination. Testing this is super simple, requires no particular hardware, and would eliminate this possibility from the equation.

Particularly since you are running only one stick in single channel. This is a config that is not usually covered by most tests.

Edited September 14, 2020 by ChatNoir

MarkRMonaco · September 14, 2020

1 minute ago, trurl said:

Just thought I should explain the consequences of this since you didn't comment on it.

The system share is where docker.img and libvirt.img normally live. If Docker or VMs are enabled, these files will be open. If those files are on the array, the array disks will be kept spinning, and performance will suffer because every write also updates parity.

good point. thx

MarkRMonaco · September 14, 2020

2 hours ago, ChatNoir said:

Maybe a slower memory speed solves the issue, maybe not.

In any case, finding the cause is a process of elimination. Testing this is super simple, requires no particular hardware, and would eliminate this possibility from the equation.

Particularly since you are running only one stick in single channel. This is a config that is not usually covered by most tests.

I'm not running in single-channel mode. There are two DIMMs installed. 16GB (2x8GB) means that the installed kit is made up of two 8GB modules, which is pretty standard notation for RAM specs.

Edited September 14, 2020 by MarkRMonaco

MarkRMonaco · September 14, 2020

Just an update: I fixed the "system" share issue and made sure that it resides only within the cache pool. The other files that were on one of my drives were outdated, so I deleted them through Krusader.

I also did the following:

Enabled the local syslog server and have it mirroring between the cache pool and the "flash" share.
Reverted back to the stock image (and rebooted) from the linuxserver.io Nvidia (Unraid Nvidia plugin) image, since my card wasn't supported.
Turned off (disabled) any ErP or C-State settings in the BIOS (which were previously enabled).

Edited September 14, 2020 by MarkRMonaco

JorgeB · September 14, 2020

As already mentioned, you should respect the max RAM speeds officially supported by AMD for your config, at least while you're troubleshooting, to rule that out. There are several cases in the forum of instability and even data corruption with Ryzen and overclocked RAM.

MarkRMonaco · September 14, 2020

57 minutes ago, JorgeB said:

As already mentioned, you should respect the max RAM speeds officially supported by AMD for your config, at least while you're troubleshooting, to rule that out. There are several cases in the forum of instability and even data corruption with Ryzen and overclocked RAM.

Fair enough. That will be my next step (going back down to the base XMP setting) if the system goes unresponsive again.

Edited September 14, 2020 by MarkRMonaco

MarkRMonaco · September 14, 2020

Now, regarding the syslog configuration, is this something I should be concerned about (and do I need to take any action)?

Lines 66 & 67:

image.png.52804e414cce14c2441a5d45dbb37063.png

Line 123:

image.png.0845e5661e4e03793c9e812ca850fc3e.png

Edited September 14, 2020 by MarkRMonaco

trurl · September 14, 2020

Post a screenshot of your Syslog Server settings.

MarkRMonaco · September 14, 2020

1 hour ago, trurl said:

Post a screenshot of your Syslog Server settings.

I also put a screenshot of the specific lines that were called out from the /etc/rsyslog.conf file in my previous reply.

Edited September 14, 2020 by MarkRMonaco

trurl · September 14, 2020

I think you have to fill in the Remote syslog server field to tell Unraid to send the logs to itself.

This is mentioned as the 3rd option in that FAQ I linked.

MarkRMonaco · September 14, 2020

1 hour ago, trurl said:

I think you have to fill in the Remote syslog server field to tell Unraid to send the logs to itself.

This is mentioned as the 3rd option in that FAQ I linked.

Thanks. I missed that part.

With the "flash mirroring" option turned off and the syslog server set to "both" for protocols, I'm still getting one error message returned when the service started/restarted:

Starting rsyslogd daemon: 
/usr/sbin/rsyslogd -i /var/run/rsyslogd.pid
rsyslogd:  Could not find template 1 'remote' - action disabled [v8.2002.0 try https://www.rsyslog.com/e/3003 ]
rsyslogd: error during parsing file /etc/rsyslog.conf, on or before line 121: errors occured in file '/etc/rsyslog.conf' around line 121 [v8.2002.0 try https://www.rsyslog.com/e/2207 ]
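The first message says that an action refers to a template named 'remote' that rsyslog could not find. One way to narrow that down is to list every template the config references but never defines. The following is only a rough sketch under stated assumptions: it recognizes the legacy `$template name,"..."` and the newer `template(name="...")` definitions, treats `?name` in an action as a reference, and skips lines that start with `#`. The function name is hypothetical, and the actual contents of this server's /etc/rsyslog.conf are not visible in the thread.

```python
import re
from typing import Set

# Legacy definition, e.g.  $template remote,"/some/path/%FROMHOST-IP%.log"
DEFINE_LEGACY = re.compile(r'^\$template\s+(\w+)\s*,', re.IGNORECASE)
# Newer definition, e.g.   template(name="remote" type="string" ...)
DEFINE_MODERN = re.compile(r'template\(\s*name\s*=\s*"([^"]+)"', re.IGNORECASE)
# Dynamic-file reference in an action, e.g.  *.* ?remote
REFERENCE = re.compile(r'(?:^|\s)\?(\w+)')


def undefined_templates(conf_text: str) -> Set[str]:
    """Return template names that are referenced but never defined."""
    defined, referenced = set(), set()
    for raw in conf_text.splitlines():
        line = raw.strip()
        # A commented-out definition does not count as a definition.
        if not line or line.startswith("#"):
            continue
        match = DEFINE_LEGACY.match(line) or DEFINE_MODERN.search(line)
        if match:
            defined.add(match.group(1))
            continue
        referenced.update(REFERENCE.findall(line))
    return referenced - defined
```

Running it over the text of /etc/rsyslog.conf would return `{'remote'}` if, for example, the definition line is commented out or missing while the action that uses it is still active. That would match the message above, but it is only one possible explanation.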

Current Config:

image.png.3e1b98bff1d6cbf2eaef1bbd4bfb4f22.png

Edited September 14, 2020 by MarkRMonaco

MarkRMonaco · September 15, 2020

I also checked my "syslog" share, and it looks like it is not being populated with any files either...

MarkRMonaco · September 15, 2020

On 9/14/2020 at 11:41 AM, MarkRMonaco said:

Just an update: I fixed the "system" share issue and made sure that it resides only within the cache pool. The other files that were on one of my drives were outdated, so I deleted them through Krusader.

I also did the following:

Enabled the local syslog server and have it mirroring between the cache pool and the "flash" share.

Reverted back to the stock image (and rebooted) from the linuxserver.io Nvidia (Unraid Nvidia plugin) image, since my card wasn't supported.

Turned off (disabled) any ErP or C-State settings in the BIOS (which were previously enabled).

Just an update: my drive is almost 100% rebuilt and I have not run into any unresponsiveness (yet). So, it looks like one of the steps above solved the issue.

image.png.3fc48cb94acc591e88df8fd4f59c9e8a.png

I am, however, still experiencing issues actually getting syslog working at all. The Unraid share has yet to be populated with anything, and I am still running into that single error message whenever the service is started or restarted.

image.png.e40b0b3ba196a7a50040cdfce8629e75.png

kevschu · September 16, 2020

I am having a similar issue. I had a lockup issue when I first switched over to Ryzen, and it was related to the C-state config. It worked for several months after I made changes in the BIOS.

As of a month or so ago, it locks up a couple of times a week. I made sure the BIOS was up to date, but haven't double-checked memory speeds yet. I have my syslog server set up; I will make a separate post instead of hijacking this one.

Thought I would chime in that you aren't alone in this.

MarkRMonaco · September 16, 2020

Thanks @kevschu.

An update on my end...

About half a day later, the drive went back into "disabled" status due to errors. Therefore, I went into the BIOS and brought the RAM clock back down to the base/stock XMP setting (3000MHz) with no additional overclock. From there, I shut the system down and reversed the SATA cable ordering (the cables are physically tagged) across the four 3.5" drives (1 through 4, top to bottom, versus 4 through 1). All of the power connectors were checked as well to ensure that they were fully seated.

Now, I'm back to square one with the parity rebuild/sync since the drive had to be removed and re-added to the pool...

In the meantime, let me know if anyone is interested in a new set of logs pulled from the system.

Edited September 16, 2020 by MarkRMonaco
