Server crashes to black screen after <=30 minutes


jsmj


I upgraded my mobo and CPU and am having trouble keeping the server booted. It starts up normally, and I can access the GUI over the network. I haven't tried starting the array yet, as I'm worried about the unclean shutdowns. Basically, the server operates normally (as normal as it can without the array mounted), then falls off the network, and the monitor attached to the machine goes black. Keyboard input does nothing. All I can do is power it off.

I've tried booting from an Ubuntu drive: I can boot into Ubuntu and noodle around, and it stays booted for a few hours (as long as I've tested) with no crash to black screen. I can also run memtest86 from the Unraid flash drive, and it runs for hours as well. No crash, but I might run it again overnight for good measure. Unraid is the only thing that makes it crash so far, and 30 minutes is as long as it'll stay booted. Sometimes it crashes to black sooner, or doesn't even get through the boot process before it crashes.

 

Specs:

ASUS ROG Strix B450-F

Ryzen 5 2600

32GB G.Skill 3200 DDR4

Corsair CX600M

MSI GTX 970

Two 4-port PCIe SATA adapters (one is full, the other has one drive attached)

11 Disks (6 on the mobo, 5 on the PCIe SATA cards)

 

Another thing I've tried is booting UEFI from the flash drive, and this doesn't crash to black. Instead, it just reboots after a few minutes. The only thing I have yet to try is a new flash drive, and I will as soon as I can keep it booted long enough to download a backup. Can I just copy the contents of the current flash drive onto a new one, or is there something special about the backup? I also watched temps in the BIOS for a good while and never saw anything above 40C. The BIOS is updated to the current version.

 

Logs are attached. When I looked through them I ctrl+f'd 'fail' and found a lot, but don't understand them.
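
For anyone who wants something a bit more structured than ctrl+f, here is a minimal Python sketch that pulls out the flagged lines with their line numbers and counts them per keyword. The function names and the keyword list are my own choices for illustration (the post above only searched for 'fail'), and it assumes the log is a plain text file such as the attached syslog.txt.

```python
import sys
from collections import Counter

# Hypothetical keyword list: the original search only used "fail";
# "error" is added here as an assumption about what else is worth seeing.
KEYWORDS = ("fail", "error")


def flagged_lines(lines, keywords=KEYWORDS):
    """Return (line_number, text) for every line containing a keyword, ignoring case."""
    hits = []
    for number, line in enumerate(lines, start=1):
        lowered = line.lower()
        if any(word in lowered for word in keywords):
            hits.append((number, line.rstrip("\n")))
    return hits


def count_by_keyword(lines, keywords=KEYWORDS):
    """Count how many lines mention each keyword, ignoring case."""
    counts = Counter()
    for line in lines:
        lowered = line.lower()
        for word in keywords:
            if word in lowered:
                counts[word] += 1
    return counts


# Only runs when a log path is given, e.g.: python scan_log.py syslog.txt
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], errors="replace") as handle:
        log_lines = handle.readlines()
    for number, text in flagged_lines(log_lines):
        print(f"{number}: {text}")
    print(dict(count_by_keyword(log_lines)))
```

A match on 'fail' is not automatically a problem, so the line numbers are mainly useful for quoting the surrounding context when asking about a specific message.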

 

Edit: Also tried booting with one stick of RAM at a time, and both sticks (16 GB each) show the same issue

 

syslog.txt

Edited by jsmj
Link to comment
10 hours ago, Benson said:

Please try disabling the C6 state in the mainboard BIOS; this is a common problem on 1st gen Ryzen (although you have 2nd gen).

The only setting I found in my BIOS is global C-state, and disabling that didn't solve it.

 

2 hours ago, jonathanm said:

Also be sure memory timings are correct. 3200 is way overclocked IIRC.

The RAM settings are unchanged from board default, which is 2133. I tried changing it to 3200 for kicks, and the system won't boot at all that way. I haven't messed with timings or voltage or anything.

 

This morning I tried a different PSU, which didn't solve anything either. I also ditched the 2 PCI-E 4 port SATA cards for an IBM M1015 HBA in IT mode, which seems to be working properly when the server is running, but I still get the blackouts. I've also noticed that after the machine crashes, subsequent boots only last a couple minutes if it boots at all. If I give it some down time, it'll stay booted for longer, but eventually crashes. I thought this might mean a thermal problem, but I can't find any offending high temps, 40C at the most.

Link to comment
5 hours ago, jonathanm said:

Have you pulled the CPU heatsink off and checked the mounting and redone the paste?

Checked the paste today, confirmed the pins aren't bent or anything, and redid the paste. No dice. It stayed booted for about 45 minutes, just enough to crush my hopes.

Link to comment

Is the case clean of dust and dirt?  All cooling fins unclogged?  Same for case air intakes and exhausts?  

 

If all of these are OK and if you are running one of the latest versions of Unraid, set up the Syslog Server (Settings >>> Syslog Server) to mirror the syslog to your flash drive (turn on Unraid 'Help' for instructions).

 

EDIT: since this is a brand new system, I would suggest running memtest for 24 hours (unless the memory is ECC).

 

Next try booting into the 'Safe Mode'.

Edited by Frank1940
Link to comment
31 minutes ago, Frank1940 said:

Is the case clean of dust and dirt?  All cooling fins unclogged?  Same for case air intakes and exhausts?  

 

If all of these are OK and if you are running one of the latest versions of Unraid, set up the Syslog Server (Settings >>> Syslog Server) to mirror the syslog to your flash drive (turn on Unraid 'Help' for instructions).

 

EDIT: since this is a brand new system, I would suggest running memtest for 24 hours (unless the memory is ECC).

 

Next try booting into the 'Safe Mode'.

It's pretty dust free in there. I wouldn't eat out of it, but I've seen far worse. Ventilation shouldn't be a problem. I'll run memtest for 24 hours and see how it goes. Attached are the syslog from the USB after the latest crash and the full diagnostics zip.
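
Since the mirrored syslog on the flash drive survives across crashes, the interesting part is usually the last few lines written before each reboot. Below is a rough Python sketch of one way to pull those out. It assumes each boot starts with the kernel's "Linux version" banner line in the log; that is an assumption about this particular file, and the function names are made up for illustration.

```python
# Assumption: the kernel banner containing this text marks the start of each boot.
BOOT_MARKER = "Linux version "


def split_into_boots(lines, marker=BOOT_MARKER):
    """Group log lines into one list per boot, starting a new group at each marker line."""
    sessions = []
    for line in lines:
        # Start a new session at every boot banner, or for leading
        # lines that appear before the first banner.
        if marker in line or not sessions:
            sessions.append([])
        sessions[-1].append(line.rstrip("\n"))
    return sessions


def last_lines_per_boot(lines, count=20, marker=BOOT_MARKER):
    """Return the final `count` lines of every boot session, oldest boot first."""
    return [session[-count:] for session in split_into_boots(lines, marker)]


def print_crash_tails(path, count=20):
    """Print the tail of each boot session found in the log file at `path`."""
    with open(path, errors="replace") as handle:
        tails = last_lines_per_boot(handle.readlines(), count)
    for index, tail in enumerate(tails, start=1):
        print(f"--- boot {index}: last {len(tail)} lines ---")
        print("\n".join(tail))
```

Calling `print_crash_tails("syslog")` on the attached file would show what was logged right before each crash. If those tails show nothing unusual, that in itself hints at a hard hardware lockup rather than something the OS had time to report.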

 

Edit to add: There are 13 files on my flash drive with the name FSCK000*.REC that I don't remember having before. Is this of any significance? I can add my flash backup .zip file as well if that'll help

syslog

tower-diagnostics-20190910-0012.zip

Edited by jsmj
added diagnostics
Link to comment
24 minutes ago, jsmj said:

There are 13 files on my flash drive with the name FSCK000*.REC that I don't remember having before. Is this of any significance? I can add my flash backup .zip file as well if that'll help

Rather than write it out, I grabbed a quote from a Google search.

Quote

These files are generated by the Linux fsck utility, which is the equivalent of the Windows/DOS CHKDSK command.
They hold the recovery data that it finds, such as file data with no reference in the FAT table, which it converts into a recovery file for you to analyze and recover.

If you don't need to recover lost files (which is most of the time), then delete them.

In other words, one of the crashes occurred while an open file was being written, resulting in the file table not being updated properly.
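
If you want to see what is there before deleting anything, a small Python sketch like this lists the recovery files with their sizes. It only reads; it does not delete. The function names are hypothetical, and the flash drive location you pass in is an assumption on your side (on a running Unraid server the flash is normally mounted at /boot, but it could just as well be a drive letter or mount point on another PC).

```python
from pathlib import Path


def find_fsck_recovery_files(flash_root):
    """Return the FSCK*.REC files in the top level of the flash drive, sorted by name."""
    root = Path(flash_root)
    return sorted(p for p in root.glob("FSCK*.REC") if p.is_file())


def describe(paths):
    """Return one 'name<TAB>size' line per file so they can be reviewed before cleanup."""
    return [f"{p.name}\t{p.stat().st_size} bytes" for p in paths]


def print_report(flash_root):
    """Print a short listing of the recovery files found under `flash_root`."""
    found = find_fsck_recovery_files(flash_root)
    print(f"{len(found)} recovery file(s) under {flash_root}")
    for line in describe(found):
        print(line)
```

For example, `print_report("/boot")` on the server itself. A growing count after each crash would match the explanation above: every unclean shutdown leaves the FAT file table a little more out of sync.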

Link to comment
  • 2 weeks later...

Hey guys, sorry I went dark. I was out of town. I started memtest to run overnight and forgot to stop it, so it ran for 161 hours before I manually shut it down. It accumulated 1 error well after the 24 hour mark.

 

I tried using a different PCIE GPU and it ran for 3 hours or so before I went to bed with high hopes, but it crashed sometime in the night. When I went to reboot it, it wouldn’t even make it to the boot selector screen before lockup. 

 

This behavior of needing a “rest” after being booted for longer (hour+) periods is super strange to me. The length of time the machine will stay booted seems to be related to the amount of time it’s been powered down. I went out of town and got 3 whole hours of uptime! But if I boot it directly after a crash, it can’t make it long enough to fully boot.

 

That points to either a temperature issue that I can’t chase down, or Unraid needing to “forget” something that it did or accumulated that causes the crash.

 

I’d like to make an Unraid trial USB and boot from that, but I wanted to check first that it’s safe for my data. I wouldn’t start the array.

B3FD4ACC-C553-4C51-9B22-7804A7EC2D8C.jpeg

Link to comment
50 minutes ago, johnnie.black said:

Still not acceptable. Make sure the RAM is not overclocked; it's known to cause issues with some Ryzen systems.

 

As far as I can tell, I haven't changed the memory speed and it's running at 2133. The sticks are rated for 3200. Here's the product page for the memory I'm using.

 

I've attached the screens for my bios settings

 

 

bios_screens.zip

Edited by jsmj
Link to comment
