hmnd

[SOLVED] System Crashing Every Few Hours - Gigabyte AB350M-D3H, Ryzen 1600

I've scheduled parity checks to run monthly, but a parity check keeps starting every few hours, which stops all my Docker containers from working. I've attached diagnostics in case that helps (bigmac-diagnostics-20190518-0156.zip).

 

Thanks in advance! 

Your diagnostics are from just after a reboot, and you are getting a parity check due to unclean shutdown. Are you sure you aren't losing power?

Oh! You're right. I think I may need to replace my UPS. Can you tell what caused the shutdown or is it just up to speculation?

23 minutes ago, hmnd said:

I think I may need to replace my UPS.

It might be just the battery. You can check this with a couple of lights fitted with incandescent lamps. (You want about 150W of load.) Shut the server down and plug the lamps into the UPS. Now pull the UPS's power plug from the mains. Make sure it runs for about ten minutes on the battery. If it lasts less than five minutes, you need to replace the battery. Batteries for home-market UPSes are fairly standardized in size and capacity. There seem to be only two or three different ones, and you can measure yours with a ruler to get the size you need.

21 minutes ago, Frank1940 said:

It might be just the battery. You can check this with a couple of lights fitted with incandescent lamps. (You want about 150W of load.) Shut the server down and plug the lamps into the UPS. Now pull the UPS's power plug from the mains. Make sure it runs for about ten minutes on the battery. If it lasts less than five minutes, you need to replace the battery. Batteries for home-market UPSes are fairly standardized in size and capacity. There seem to be only two or three different ones, and you can measure yours with a ruler to get the size you need.

I actually have a bunch of UPSes I can try, and a second one I tried just now had the same issue. I lowered the sensitivity on the UPS, as it was set to high. Hopefully that fixes it.

There are a couple of things you should try.

 

First, run a 24-hour memtest from the boot menu. (The built-in memtest will not check ECC memory.)

 

Second, the Power Supply.  This is best done by replacement.  Many folks have one in the junk box, can borrow one from a friend, or, even,  'borrow' one from a vendor with a liberal return policy. 

 

8 hours ago, hmnd said:

Lowered the sensitivity on the UPS, as it was on high. Hopefully that fixes it. 

This should really not fix it. This is what a UPS is supposed to protect against. The only thing that lowering the sensitivity should do is reduce the number of transfers from mains to battery. You could take a screenshot of your UPS shutdown settings and post it. (Many folks haven't really thought things through when setting up a UPS.)

11 hours ago, Frank1940 said:

First, run a 24-hour memtest from the boot menu. (The built-in memtest will not check ECC memory.)

Did a bunch of memtests a week ago and all passed. 

11 hours ago, Frank1940 said:

Second, the Power Supply.  This is best done by replacement.  Many folks have one in the junk box, can borrow one from a friend, or, even,  'borrow' one from a vendor with a liberal return policy. 

I think this is it. I checked, and it turns out the only component I neglected to replace in my rebuild last year was the PSU. I've ordered a new one to arrive tomorrow, so hopefully that'll fix it. I'll update here with what happens.

 

Thanks!

Have you checked the Ryzen threads?  In particular something about C-States?

OK, you are running 6.7.0. That version contains a new troubleshooting tool. This is a (slightly) edited tip from the Tips and Tweaks plugin...

 

"There is a very cool feature in Unraid 6.7 that can be used to capture your logs and help in troubleshooting a situation where your server locks up or becomes unresponsive. Go to Settings  ->   Syslog Server and enable the Syslog Server. Put the IP address of your server in the 'Remote syslog server' field. This will log all your server log entries to the Syslog Server and save the log to an array share. After a reboot from a lockup or hung server, you can view the saved log using the Syslog viewer."    ('Help' will provide you with some tips on how to get this setup if you still have questions.) 

 

Hopefully, this new tool will capture things that the old 'Troubleshooting mode' in the Fix Common Problems plugin would miss.  

For troubleshooting purposes the setting "Mirror syslog to flash" is introduced.

This will copy everything logged to the flash device (even when the system is rebooted)

 

With this setting enabled, it is not necessary to set up a syslog server explicitly.

Edited by bonienl

2 minutes ago, bonienl said:

This will copy everything logged to the flash device (even when the system is rebooted)

 

The reason I am hesitant to recommend using the flash drive is the wear and tear on it. (I would not worry about my own flash drive, but having other folks do it is a bit of a concern...) But I have a question about how this logging function works. Is there a problem with looping back to the same server? Another way to ask the same question: can I save the syslog to a share on the server that the syslog is for?

 

Second question:   Is it 'sticky' through a reboot?  That is, if it is running when the server is rebooted (deliberate or otherwise), will the syslog server be running and logging as the server is starting up?

You can save the syslog to a share on the Unraid server.   I have it set to save mine to my cache drive.    Doing it that way survives a reboot.

31 minutes ago, Frank1940 said:

The reason I am hesitant to recommend using the flash drive is the wear and tear on it.

Yes, that's why it is recommended to use it only while troubleshooting (as explained in the Help).

This function ensures everything is captured from the start of the system, unlike the local syslog server which only starts capturing after the network and shares are available.

17 minutes ago, bonienl said:

...unlike the local syslog server which only starts capturing after the network and shares are available.

That makes sense.  I never thought about the fact that those services don't really start until fairly late in the boot cycle.  

 

So I further gather from this comment that it is sticky across reboots. That is great, as it will allow checking times and other details without having to look at two separate files. Hopefully, this will turn out to be a really useful tool for troubleshooting the type of problem that the OP is experiencing.

3 hours ago, bonienl said:

For troubleshooting purposes the setting "Mirror syslog to flash" is introduced.

This will copy everything logged to the flash device (even when the system is rebooted)

 

With this setting enabled, it is not necessary to set up a syslog server explicitly.

Is this a good config for troubleshooting? 

 

(Attached screenshot: NOcg9e0F.png)

4 hours ago, Squid said:

Have you checked the Ryzen threads?  In particular something about C-States?

I hadn't, but I've disabled Global C-state Control (I think that's what it was called in the BIOS) and I've got 3 hours of uptime so far, so that may be it!

 

Hopefully I haven't jinxed myself...

1 hour ago, hmnd said:

@Frank1940 @itimpi @bonienl Just crashed again. Syslog did save through the crash but there's nothing logged prior to it booting up again.

With "Mirror syslog to flash" enabled, everything is copied to the same file on the flash device, including the log after booting.
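One way to look for the last entries written before a crash is to print the lines just ahead of each boot marker in the mirrored log. This is only a sketch: it assumes the mirrored file is /boot/logs/syslog and that every boot begins with a kernel "Linux version" line. The function name lines_before_boot is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: show the N lines logged just before each boot marker.
# Assumes each boot starts with a kernel "Linux version" line.
lines_before_boot() {
  local log="$1"
  local count="${2:-20}"
  grep -B "$count" 'Linux version' "$log"
}

# Assumed location of the mirrored syslog on the flash device.
LOG=/boot/logs/syslog
if [ -f "$LOG" ]; then
  lines_before_boot "$LOG" 20
fi
```

If the lines just before a marker are hours older than the boot itself, nothing was logged at the time of the crash.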

7 minutes ago, bonienl said:

With "Mirror syslog to flash" enabled, everything is copied to the same file on the flash device, including the log after booting.

I mean that nothing was logged pertaining to the crash. Just stuff a few hours before and the boot up from after the crash. 

This sounds like you are having some hardware related issue.

Just now, bonienl said:

This sounds like you are having some hardware related issue.

Yeah, and I'm unsure how to proceed at this point... So far I've replaced the PSU, memtested the RAM, and disabled Global C-state Control and Cool'n'Quiet on the CPU. I also installed the latest motherboard firmware, which was released a couple of weeks ago. Any other ideas?

Do a bit of research on the BIOS updates for your motherboard. I had a look at your diagnostics file, and it looks like you might have one of the early Ryzen systems. As I recall, there were a lot of issues with them. I seem to recall that AMD actually replaced some CPUs that were having issues with Linux. Google is your friend in doing these types of searches.

On 5/20/2019 at 7:22 AM, Frank1940 said:

I seem to recall that AMD actually replaced some CPUs that were having issues with Linux.

I did quite a bit of searching and, at the risk of jinxing myself again, I think I've finally found the solution. The first thing I did was add the following kernel boot parameter: rcu_nocbs=0-11, thanks to the post below. To add it, click Flash on the Main tab and insert it directly after append in every Syslinux configuration box whose label starts with Unraid OS.
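As a rough sketch of where the 0-11 comes from: the Ryzen 1600 has 12 hardware threads, which Linux numbers 0 through 11. The helper below (rcu_nocbs_param is a made-up name) builds the range from a logical CPU count, and the final check reads /proc/cmdline to see whether the running kernel was booted with the parameter. The append line shown in the comment assumes a stock Unraid Syslinux entry.

```shell
#!/bin/bash
# Hypothetical helper: build the rcu_nocbs range for a logical CPU count.
rcu_nocbs_param() {
  local cpus="$1"
  echo "rcu_nocbs=0-$(( cpus - 1 ))"
}

# nproc reports the number of available logical CPUs (12 on a Ryzen 1600).
rcu_nocbs_param "$(nproc)"

# Assumed resulting line in the Syslinux configuration (stock entry assumed):
#   append rcu_nocbs=0-11 initrd=/bzroot

# After a reboot, check whether the kernel actually received the parameter.
if grep -q 'rcu_nocbs=' /proc/cmdline 2>/dev/null; then
  echo "rcu_nocbs is active"
else
  echo "rcu_nocbs is not on the kernel command line"
fi
```

On a CPU with a different thread count, the range would change accordingly.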

Next, I added /usr/local/sbin/zenstates --c6-disable to my /boot/config/go file, per the post below. That disables C6, and I had already disabled Global C-state Control in BIOS before.
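A minimal sketch of how that line might sit in the go file, assuming zenstates is installed at the path given above. The wrapper function disable_c6 and the existence check are additions for illustration, so that a missing script only prints a warning and does not abort the rest of the file.

```shell
#!/bin/bash
# Hypothetical wrapper around the zenstates call from the post above.
# The default path is the one quoted in the post.
disable_c6() {
  local zenstates="${1:-/usr/local/sbin/zenstates}"
  if [ -x "$zenstates" ]; then
    # Disable the C6 power state.
    "$zenstates" --c6-disable
  else
    echo "zenstates not found at $zenstates; C6 left unchanged" >&2
    return 1
  fi
}

# Keep going even if zenstates is missing, so the rest of the file still runs.
disable_c6 || true
```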

My server has now been running for 18 hours straight without issue, and hopefully it'll survive through the night. I figured I'd document what I did here, in case anyone encounters the same issue in the future.

Edited by hmnd
