[SOLVED] Server Going Unresponsive Multiple Times This Week



My Unraid (v6.9.0-beta25) server has gone completely unresponsive several times this past week (the server itself, all hosted Docker containers, and all network communication), and each time it had to be power-cycled to come back online. This has happened often enough that I'm now rebuilding one of my array disks (which previously reported UDMA CRC errors, but passed several SMART tests).

 

Can someone take a look at my diag logs and see if there are any other underlying issues that I may be missing (outside of the one reported drive)? I have a feeling that it happens when the system is under unusually high load, but I'm not 100% certain. As far as hardware changes are concerned, none have been made for about a month and a half now.

 

Thanks in advance.

wadewilson-diagnostics-20200913-2324.zip

Edited by MarkRMonaco
Link to comment
9 hours ago, MarkRMonaco said:

CRC errors

are connection issues. You can acknowledge those by clicking on the warning on the Dashboard and it won't warn again until they increase. The disk you are rebuilding looks fine and only had 11 CRC errors, so I wouldn't worry about it. There is no reason to rebuild unless it was disabled, but let it continue.
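
As a rough illustration of the "warn again only when the count increases" idea, here is a minimal Python sketch. It assumes SMART attributes in the tabular form printed by `smartctl -A`, where attribute 199 is `UDMA_CRC_Error_Count`. The sample text, the baseline values, and the function names are made up for illustration, and this is not how Unraid itself implements the Dashboard warning.

```python
# Hypothetical excerpt in the style of `smartctl -A` output (values invented).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       11
"""


def crc_error_count(smart_text):
    """Return the raw UDMA CRC error count, or None if attribute 199 is absent."""
    for line in smart_text.splitlines():
        fields = line.split()
        # Attribute rows have ten columns; the raw value is the last one.
        if len(fields) >= 10 and fields[0] == "199":
            return int(fields[9])
    return None


def should_warn(current, acknowledged):
    """Warn only when the count has grown past the acknowledged baseline."""
    return current is not None and current > acknowledged


if __name__ == "__main__":
    count = crc_error_count(SAMPLE)
    # Acknowledged at 11: no new warning. Acknowledged at 8: the count has grown.
    print(count, should_warn(count, 11), should_warn(count, 8))
```

A count that stays flat after reseating or replacing a cable would suggest the connection problem is gone; a count that keeps climbing would suggest it is not.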

 

Unlikely related to any problem, but your system share has files on the array instead of all on cache.

 

Syslog resets on reboot so those diagnostics don't tell us anything about what happened prior.

 

Set up Syslog Server so syslog will be saved somewhere you can get it and post it when you have to reboot again.

 

Link to comment
43 minutes ago, civic95man said:

Looks like you are overclocking your memory above what is supported for that combination of processor, memory, and board (XMP is considered an overclock). You are running at 3266 and the max supported is 2667. Try adjusting your memory speeds to that and see if your stability improves.

That's not necessarily a true statement. That particular generation of Ryzen runs fine on RAM speeds between 3000 and 3200 MHz (+/-). In fact, I've had Unraid running on that combination for over a year and it's been fine.

Edited by MarkRMonaco
Link to comment
58 minutes ago, trurl said:

are connection issues. You can acknowledge those by clicking on the warning on the Dashboard and it won't warn again until they increase. The disk you are rebuilding looks fine and only had 11 CRC errors, so I wouldn't worry about it. There is no reason to rebuild unless it was disabled, but let it continue.

 

Unlikely related to any problem, but your system share has files on the array instead of all on cache.

 

Syslog resets on reboot so those diagnostics don't tell us anything about what happened prior.

 

Set up Syslog Server so syslog will be saved somewhere you can get it and post it when you have to reboot again.

 

Thanks for confirming my suspicion about the drive being "ok". I definitely acknowledged the errors, but the drive did eventually go into disabled status twice over the course of yesterday. So, this is my second time trying to rebuild (it locked up again overnight and I decided to leave the server powered-off).

 

I'll definitely get the syslog server configured (something that I've been meaning to do) as well.

 

I'm also going to check the cabling (when I get home) and see if there is anything that stands out (like a deep crease or a crimp from a zip tie).

Edited by MarkRMonaco
Link to comment
10 minutes ago, MarkRMonaco said:

That's not necessarily a true statement. That particular generation of Ryzen runs fine on RAM speeds between 3000 and 3200 MHz (+/-). In fact, I've had Unraid running on that combination for over a year and it's been fine.

I'm just quoting what is in the facts... and any system will run fine on an overclock until it doesn't. If you're trying to troubleshoot an unstable system, it's always best to remove any overclock and go from there.

10 minutes ago, MarkRMonaco said:

That particular generation of Ryzen runs fine on RAM speeds between 3000 and 3200 MHz (+/-)

Runs fine where? Windows? Or Unraid?

 

That linked FAQ for Unraid has this table to identify the supported memory speeds based on configuration:

[Image: table of supported memory speeds by configuration]

Link to comment
37 minutes ago, civic95man said:

I'm just quoting what is in the facts... and any system will run fine on an overclock until it doesn't. If you're trying to troubleshoot an unstable system, it's always best to remove any overclock and go from there.

Runs fine where? Windows? Or Unraid?

 

That linked FAQ for Unraid has this table to identify the supported memory speeds based on configuration:

[Image: table of supported memory speeds by configuration]

Runs fine on either platform.

 

With all due respect, I get where you're going with the troubleshooting (and appreciate it), but I do know what I'm doing. I've been a desktop and server engineer for many years. That is why I was looking to do a deep-dive on the logs in the first place.

Edited by MarkRMonaco
Link to comment
1 hour ago, trurl said:

Unlikely related to any problem, but your system share has files on the array instead of all on cache.

Just thought I should explain the consequences of this since you didn't comment on it.

 

The system share is where docker.img and libvirt.img normally live. If Docker or VMs are enabled, these files will be open, and if those files are on the array, the array disks will be kept spinning and writes will be slower because of the parity updates.
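
To see which devices actually hold files for a share, one option is to look for the share's folder under each mount point. The sketch below assumes Unraid's usual layout, where array disks appear as `/mnt/disk1`, `/mnt/disk2`, and so on, and the cache pool appears as `/mnt/cache`. The function name and its defaults are illustrative only.

```python
import re
from pathlib import Path


def share_locations(share="system", mnt_root="/mnt"):
    """Map each array disk or cache mount to the number of files it holds for the share."""
    found = {}
    root = Path(mnt_root)
    if not root.is_dir():
        return found
    for mount in sorted(root.iterdir()):
        # Look only at the physical devices. /mnt/user is the merged view
        # of all of them and would count every file a second time.
        if not (mount.name == "cache" or re.fullmatch(r"disk\d+", mount.name)):
            continue
        share_dir = mount / share
        if share_dir.is_dir():
            files = [p for p in share_dir.rglob("*") if p.is_file()]
            if files:
                found[mount.name] = len(files)
    return found


if __name__ == "__main__":
    for device, count in share_locations().items():
        note = "" if device == "cache" else "  <- on the array"
        print(f"{device}: {count} file(s){note}")
```

If anything other than `cache` shows up for the system share, those are the files keeping array disks spinning.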

Link to comment

Maybe a slower memory speed solves the issue, maybe not.

In any case, finding the cause is a process of elimination. Testing this is super simple, requires no particular hardware and would eliminate this possibility from the equation.

 

Particularly since you are running only one stick in single channel. This is a config that is not usually covered by most tests.

Edited by ChatNoir
Link to comment
1 minute ago, trurl said:

Just thought I should explain the consequences of this since you didn't comment on it.

 

The system share is where docker.img and libvirt.img normally live. If Docker or VMs are enabled, these files will be open, and if those files are on the array, the array disks will be kept spinning and writes will be slower because of the parity updates.

good point. thx

Link to comment
2 hours ago, ChatNoir said:

Maybe a slower memory speed solves the issue, maybe not.

In any case, finding the cause is a process of elimination. Testing this is super simple, requires no particular hardware and would eliminate this possibility from the equation.

 

Particularly since you are running only one stick in single channel. This is a config that is not usually covered by most tests.

I'm not running in single-channel mode. There are two DIMMs installed: 16GB (2x8GB) means that the installed kit consists of two 8GB modules, which is pretty standard notation for RAM specs.

Edited by MarkRMonaco
Link to comment

Just an update: I fixed the "system" share issue and made sure that it resides only within the cache pool. The other files that were on one of my drives were outdated, so I deleted them through Krusader.

 

I also did the following:

  • Enabled the local syslog server and have it mirroring between the cache pool and the "flash" share.
  • Reverted to the stock image (and rebooted) from the linuxserver.io Nvidia (Unraid Nvidia plugin) image, since my card wasn't supported.
  • Turned off (disabled) any ErP or C-State settings in the BIOS (which were previously enabled).
Edited by MarkRMonaco
Link to comment
57 minutes ago, JorgeB said:

As already mentioned, you should respect the max RAM speeds officially supported by AMD for your config, at least while you're troubleshooting, to rule that out. There are several cases in the forum of instability and even data corruption with Ryzen and overclocked RAM.

Fair enough. That will be my next step (going back down to the base XMP setting) if the system goes unresponsive again.

Edited by MarkRMonaco
Link to comment
1 hour ago, trurl said:

I think you have to fill in the Remote syslog server to tell Unraid to send them to itself.

 

This is mentioned as the 3rd option in that FAQ I linked.

Thanks. I missed that part.

 

With the "flash mirroring" option turned off and the syslog server set to "both" for protocols, I'm still getting one error message returned when the service started/restarted:

Starting rsyslogd daemon: 
/usr/sbin/rsyslogd -i /var/run/rsyslogd.pid
rsyslogd:  Could not find template 1 'remote' - action disabled [v8.2002.0 try https://www.rsyslog.com/e/3003 ]
rsyslogd: error during parsing file /etc/rsyslog.conf, on or before line 121: errors occured in file '/etc/rsyslog.conf' around line 121 [v8.2002.0 try https://www.rsyslog.com/e/2207 ]
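
Reading the first message: an action in `/etc/rsyslog.conf` refers to a template named `remote` that rsyslog could not find when it parsed the file, so that action was disabled. One way to spot that kind of mismatch is to compare the templates a config defines with the ones it references. The Python sketch below is a rough check under stated assumptions: it only understands the legacy `$template name,...` definition, the newer `template(name="...")` definition, and the legacy `?name` dynamic-file reference, and it ignores the order in which they appear. The sample config is invented for illustration and is not the file Unraid generates.

```python
import re

# Invented sample: the last action uses ?remote, but no template
# named "remote" is defined anywhere in the file.
SAMPLE_CONF = '''\
$template local,"/var/log/%HOSTNAME%.log"
*.* ?local
*.* ?remote
'''

# Legacy `$template name,"..."` or newer `template(name="name" ...)` definitions.
DEFINED = re.compile(r'^[ \t]*(?:\$template[ \t]+(\w+)[ \t]*,|template\([ \t]*name="(\w+)")', re.M)
# Legacy dynamic-file actions such as `*.* ?name` or `*.* -?name` on uncommented lines.
REFERENCED = re.compile(r'^[^#\n]*[ \t]-?\?(\w+)', re.M)


def missing_templates(conf_text):
    """Return template names that are referenced but never defined."""
    defined = {legacy or modern for legacy, modern in DEFINED.findall(conf_text)}
    referenced = set(REFERENCED.findall(conf_text))
    return sorted(referenced - defined)


if __name__ == "__main__":
    print(missing_templates(SAMPLE_CONF))
```

Running something like this against the real `/etc/rsyslog.conf`, and looking at what sits around line 121, should show whether the `remote` template is missing outright or just not being written when the share setting is left empty.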

 

Current Config:

 

[Image: screenshot of the current syslog server settings]

Edited by MarkRMonaco
Link to comment
On 9/14/2020 at 11:41 AM, MarkRMonaco said:

Just an update: I fixed the "system" share issue and made sure that it resides only within the cache pool. The other files that were on one of my drives were outdated, so I deleted them through Krusader.

 

I also did the following:

  • Enabled the local syslog server and have it mirroring between the cache pool and the "flash" share.
  • Reverted to the stock image (and rebooted) from the linuxserver.io Nvidia (Unraid Nvidia plugin) image, since my card wasn't supported.
  • Turned off (disabled) any ErP or C-State settings in the BIOS (which were previously enabled).

Just an update: my drive is almost 100% rebuilt and I have not run into issues with the server being unresponsive (yet). So, it looks like one of these steps (above) solved the issue.

 

image.png.3fc48cb94acc591e88df8fd4f59c9e8a.png

 

I am, however, still experiencing issues actually getting syslog working at all. The Unraid share has yet to be populated with anything, and I am still running into that single error message whenever the service is started or restarted.

 

image.png.e40b0b3ba196a7a50040cdfce8629e75.png

Link to comment

I am having a similar issue. I had a lockup issue when I first switched over to Ryzen; it was related to the C-state config, and the server worked for several months after I made changes in the BIOS.
 

Since a month or so ago, it has been locking up a couple of times a week. I made sure the BIOS was up to date, but I haven't double-checked memory speeds yet. I have my syslog server set up, so I will make a separate post instead of hijacking this one.
 

Thought I would chime in that you aren't alone in this.

Link to comment

Thanks @kevschu.

 

An update on my end...

About half a day later, the drive went back into "disabled" status due to errors. I therefore went into the BIOS and brought the RAM clock back down to the base/stock XMP setting (3000MHz) with no additional overclock. From there, I shut the system back down and swapped the SATA cable ordering (they're physically tagged) across the four 3.5" drives (1 through 4, top to bottom; versus 4 through 1). All of the power connectors were checked as well to ensure that they were fully seated.

 

Now, I'm back to square one with the parity rebuild/sync since the drive had to be removed and re-added to the pool...

 

In the meantime, let me know if anyone is interested in a new set of logs pulled from the system.

Edited by MarkRMonaco
Link to comment
  • MarkRMonaco changed the title to [SOLVED] Server Going Unresponsive Multiple Times This Week
