Unraid Crashing



Hey guys, my server is usually pretty good when it comes to uptime, but last week it started crashing randomly. I was losing everything: no webGUI, no SSH, no keyboard input, nothing. I had to hard reset the power once. Everything came back fine and worked after that. Since last week I have replaced my motherboard and CPU, and I am still having these issues. I had them on 6.6.6, I upgraded to the newest RC, and they are still happening. I'm trying to figure out what's going on.

 

My uptime is a maximum of 6 hours at the moment. The server will just flat out crash, either randomly or when I am starting up VMs or Docker containers. I have attached the logs I was able to grab after the last crash. I have also run the "Fix Common Problems" plugin and it finds nothing. I have also updated my BIOS to the latest version and turned off C-states on my motherboard.

 

Here are my specs.

Ryzen 7 1700 (overclocked to 3.5 GHz)

ASUS PRIME B350-PLUS

 

My first thought was that it might be my CPU overclock. However, this happened to me before the new CPU and motherboard, on a stock AMD chip. So for now I am going to turn the overclock off after I reboot my server, just to rule it out.

 

Thanks!

 

tower-diagnostics-20190310-2224.zip

Link to comment

I do have a PSU tester, won't that be able to tell me if my PSU is bad?

I can do the memory test as well, brand new memory though. Would really suck if I got the bad batch.

 

EDIT:

By the way, the server literally just crashed about 5 minutes ago as I was provisioning a Windows Server VM (2 cores, 2 GB RAM).

Just wanted to update the post. It crashed right when I clicked the field to search for my install media. It started to search the ISOs folder and that's when I lost connectivity. I almost feel like this is a bad SATA card or something to do with a drive, since it seems that whenever the data is accessed it throws Unraid for a loop.

Edited by Micaiah12
Added an update
Link to comment

The problem with PSU testers is that most of them only catch supplies that are completely bad, not the flaky ones, and the flaky ones are the real troublemakers. By the way, have you consulted the Ryzen thread to see whether your particular CPU needs any special BIOS settings?

 

Bad RAM is not common, but it is one of the cheapest possible causes to test for. It only costs time...

Link to comment
11 hours ago, Micaiah12 said:

Ryzen 7 1700 ( overclocked to 3.5Ghz )

Don't overclock it. Overclocking is for gaming rigs, not servers where, above anything, you want stability. It will precision boost all cores to 3.2 GHz for most of the time under load, and the Wraith Spire is good enough for that. Don't overclock the memory controller either, so run the RAM at 2666 MT/s or lower, depending on your RAM configuration. Install the latest BIOS and look for the Power Supply Idle Control setting. The default is Low Current Idle; set it to Typical Current Idle.

 

20 minutes ago, Micaiah12 said:

brand new memory though. Would really suck if I got the bad batch.

Brand new is a good time to test it. Bad DIMMs happen but being new it will be easy to replace. In any case, reputable brands have lifetime warranties.

Link to comment

Two RAM modules. I only bought RAM that was listed as compatible with my motherboard. I have the memory test going now and I will let it run overnight.

 

As for the overclocking, I understand that it can make things unstable, but my last CPU/MOBO combo was overclocked and I wasn't having any issues really until the last crash before I replaced the motherboard and CPU. If a stable clock can be obtained, is there no real reason not to overclock it? Just wondering is all.

 

I will consult the Ryzen thread again; that is where I got the idea to turn off C-states. I will continue to browse through it as well as do the memory test.

 

 

Link to comment
1 hour ago, Micaiah12 said:

is there no real reason not to overclock it?

Ryzens don't overclock by much. The 1000-series is already close to the limits of the silicon, so there's not a lot to be gained. The 2000-series chips in particular, using Precision Boost 2 and XFR2, do pretty well by themselves if you keep them cool enough. I really wouldn't do it.

 

Don't turn off C-states if you don't need to. Try the Power Supply Idle Control setting in the BIOS first.

Link to comment

Yes, I saw the power supply idle control down at the bottom of one thread. I will do that when I get home.

I've disabled the overclock. I suppose I can get into the overclocking project afterwards if I decide to. It just seems that when I was overclocking this CPU and my previous one, I was getting much better performance in the Windows VMs than at stock clocks.

Link to comment

The AMD-Vi and IOMMU messages have to do with hardware passthrough to a VM. The ATA error messages are about a SATA link failing and the disk dropping offline: the controller repeatedly tries to re-establish the link but doesn't get a response. Grab diagnostics before you shut down, then check both the power and data cables.
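
To see which port the link failures cluster on, something like the following could tally them from a saved syslog. This is only a rough sketch: the phrase list is my assumption about what the kernel's libata driver typically prints for link trouble and is not exhaustive, and `count_ata_link_errors` is a made-up helper name, not part of Unraid.

```python
import re
from collections import Counter

# Phrases commonly seen in kernel ATA messages when a SATA link misbehaves.
# Assumption for illustration only; adjust to match what your syslog shows.
LINK_TROUBLE = (
    "hard resetting link",
    "SATA link down",
    "failed command",
    "link is slow to respond",
)

# Matches "ata5:" as well as "ata5.00:" and captures just the port ("ata5").
ATA_PORT = re.compile(r"\b(ata\d+)(?:\.\d+)?:")


def count_ata_link_errors(lines):
    """Return a Counter mapping ATA port name to number of link-error lines."""
    counts = Counter()
    for line in lines:
        if any(phrase in line for phrase in LINK_TROUBLE):
            match = ATA_PORT.search(line)
            if match:
                counts[match.group(1)] += 1
    return counts


# Example use on a syslog extracted from a diagnostics zip (path is yours):
#   with open("syslog.txt", errors="replace") as fh:
#       for port, n in count_ata_link_errors(fh).most_common():
#           print(port, n)
```

If one port accounts for nearly all of the hits, the cable, the drive, or the controller port behind it is the first thing to check.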

Link to comment

Memory test finished. Looks like no errors on the RAM side.

 

I will try to reproduce the crash. However, when it crashes I can't do anything to access the server, so how am I going to grab the logs? I remember that there was a way to continuously write the logs to the USB. How do I do that again?

Link to comment

Hey guys. So apparently, troubleshooting mode doesn't exist, since I am on 6.7.0.

I have attached an image of my monitor when the server crashed as well as the system log from logging.html.

It seems that when I start the array and the rebuild takes place, there is a huge number of errors on disk1 and then the server crashes. This could be in line with what happens when I am building a VM: when I browse for the ISO file, Unraid starts to read from disk1 and the server crashes. I don't know exactly which disk is disk1, but I am going to reboot and see if I can find out. I am really starting to think it's a bad SATA controller. Let me know what you think.

 

EDIT: Found that disk1 is the HGST. It's one of the drives connected to the SATA controller.

 

IMG_1308.HEIC

System Log.htm

Edited by Micaiah12
Link to comment

If you are running the latest RC version of 6.7.0 (not sure when in the RC series this feature was added), go to Settings >>> Network Services >>> Syslog Server. There you will find an improved version of the Troubleshooting tool that used to be in Fix Common Problems. Use the Help feature for instructions. (If I remember correctly from when I played around with it, I set the Remote syslog server as the same server being monitored and turned on the Mirror syslog to flash.)
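
Once the syslog is being mirrored, the interesting part after a crash is usually the last few dozen lines written before the reset. A minimal sketch for pulling those out follows. The path in `MIRRORED_SYSLOG` is my assumption about where the mirrored copy lands on the flash drive, so confirm it against the Help text, and `last_lines` is just an illustrative helper.

```python
from collections import deque
from pathlib import Path

# Assumption: the flash drive is mounted at /boot and the mirrored copy is
# written under its logs folder. Verify on your own system before relying on it.
MIRRORED_SYSLOG = Path("/boot/logs/syslog")


def last_lines(path, n=50):
    """Return the final n lines of a text file as a list of strings."""
    with open(path, errors="replace") as fh:
        # deque with maxlen keeps only the newest n lines while streaming.
        return [line.rstrip("\n") for line in deque(fh, maxlen=n)]


# Example use after rebooting from a crash:
#   for line in last_lines(MIRRORED_SYSLOG, 100):
#       print(line)
```

Copy the file off the flash drive before the next boot cycle if you want to keep it, in case a later session overwrites or appends to it.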

Link to comment
2 hours ago, Frank1940 said:

If you are running the latest RC version of 6.7.0 (not sure when in the RC series this feature was added), go to Settings >>> Network Services >>> Syslog Server. There you will find an improved version of the Troubleshooting tool that used to be in Fix Common Problems. Use the Help feature for instructions. (If I remember correctly from when I played around with it, I set the Remote syslog server as the same server being monitored and turned on the Mirror syslog to flash.)

I went to do that. I powered Unraid on, but now it looks like I lost another drive, and this time it's disabled entirely. I'm afraid that if I start the array I am going to have two emulated disks and I won't have my data in parity. I'm wondering now what the best option is. I don't know if the constant resets caused this or if the SATA controller really is messed up and has been causing issues all along. Sorry guys, I'm really trying to get you the logs. How should I proceed?

Screenshot from 2019-03-12 21-56-37.png

Edited by Micaiah12
Link to comment
3 hours ago, Micaiah12 said:

I'm afraid that if I start the array I am going to have two emulated disks and I won't have my data in parity.

Unraid can't emulate two disks with single parity. A new config should recover disk1, but disk4's data might be damaged, since it looks like it was in the middle of a rebuild.
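
The reason is that single parity is a plain XOR across the data disks, so it can stand in for exactly one missing disk. A toy illustration, with made-up two-byte "disks" and hypothetical helper names (`xor_blocks`, `rebuild`), not anything from Unraid itself:

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Made-up contents for three data disks, plus their single parity block.
disks = {"disk1": b"\x0f\xa0", "disk2": b"\x33\x55", "disk4": b"\xf0\x0c"}
parity = xor_blocks(list(disks.values()))


def rebuild(missing, disks, parity):
    """Reconstruct one missing disk from the surviving disks plus parity."""
    survivors = [data for name, data in disks.items() if name != missing]
    return xor_blocks(survivors + [parity])


# One disk missing: XOR of everything else gives its contents back.
assert rebuild("disk1", disks, parity) == disks["disk1"]

# Two disks missing: the same XOR only yields disk1 XOR disk4 combined,
# which is not enough to separate the two.
both_missing = xor_blocks([disks["disk2"], parity])
assert both_missing == xor_blocks([disks["disk1"], disks["disk4"]])
```

With disk1 and disk4 both gone, parity only tells you what the two XOR to, which is why the data on one of them has to come from the physical disk itself.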

Link to comment
