jsmj Posted September 10, 2019: I upgraded my mobo and CPU and am having trouble keeping the server booted. It starts up normally, and I can access the GUI over the network. I haven't tried starting the array yet, as I'm worried about the unclean shutdowns. Basically, the server operates normally (as normal as it can without the array mounted), then falls off the network, and the monitor attached to the machine goes black. Keyboard input does nothing. All I can do is shut it down. I've tried booting from an Ubuntu drive, and I can boot into Ubuntu and noodle around; it stays booted for a few hours (as long as I've tested) with no crash to a black screen. I can also run memtest86 from the Unraid flash drive, and it runs for hours as well. No crash, but I might run it again overnight for good measure. Unraid is the only thing that makes it crash so far, and 30 minutes is as long as it'll stay booted. Sometimes it crashes to black sooner, or doesn't even get through the boot process before it crashes. Specs: ASUS ROG Strix B450-F; Ryzen 5 2600; 32GB G.Skill 3200 DDR4; Corsair CX600M; MSI GTX 970; two PCIe 4-port SATA adapters (one is full, the other has one drive attached); 11 disks (6 on the mobo, 5 on the PCIe SATA cards). Another thing I've tried is booting UEFI from the flash drive, and this doesn't crash to black. Instead it just reboots after a few minutes. The only thing I have yet to try is a new flash drive, and I will as soon as I can keep it booted long enough to download a backup. Can I just copy the contents of the current flash drive onto a new one, or is there something special about the backup? I also watched temps in the BIOS for a good while and never saw anything above 40C. BIOS is updated to the current version. Logs are attached. When I looked through them I ctrl+f'd 'fail' and found a lot of hits, but I don't understand them.
Edit: Also tried booting with one stick of RAM at a time, and both sticks (16 GB each) show the same issue. Attachment: syslog.txt. Edited September 10, 2019 by jsmj
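For anyone who wants something a little more structured than ctrl+f on 'fail', here is a minimal sketch that counts how many syslog lines mention each of a few keywords. The keyword list and the function names are assumptions made up for illustration; they are not part of Unraid or taken from the attached log, and a high count by itself does not say which entries matter.

```python
from collections import Counter

# Hypothetical keyword list for a first pass; adjust to whatever you are hunting for.
KEYWORDS = ("fail", "error", "warn")


def summarize_lines(lines, keywords=KEYWORDS):
    """Count how many lines contain each keyword (case-insensitive substring match)."""
    counts = Counter()
    for line in lines:
        lowered = line.lower()
        for kw in keywords:
            if kw in lowered:
                counts[kw] += 1
    return counts


def summarize_file(path, keywords=KEYWORDS):
    """Run summarize_lines over a text file such as a saved syslog."""
    # errors="replace" keeps odd bytes in a log from aborting the read.
    with open(path, encoding="utf-8", errors="replace") as fh:
        return summarize_lines(fh, keywords)
```

Calling `summarize_file("syslog.txt")` on a saved copy of the log returns a `Counter`, which at least shows which kind of message dominates before reading individual entries.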
Vr2Io Posted September 10, 2019: Please try disabling the C6 state in the motherboard BIOS; this is a common problem on 1st-gen Ryzen (although you have 2nd gen).
JonathanM Posted September 10, 2019: Also be sure memory timings are correct. 3200 is way overclocked, IIRC.
jsmj Posted September 10, 2019: 10 hours ago, Benson said: "Please try disabling the C6 state in the motherboard BIOS; this is a common problem on 1st-gen Ryzen (although you have 2nd gen)." The only setting I found in my BIOS is global C-state, and disabling that didn't solve it. 2 hours ago, jonathanm said: "Also be sure memory timings are correct. 3200 is way overclocked, IIRC." The RAM settings are unchanged from the board default, which is 2133. I tried changing it to 3200 for kicks, and the system won't boot at all that way. I haven't messed with timings or voltage or anything. This morning I tried a different PSU, which didn't solve anything either. I also ditched the two PCIe 4-port SATA cards for an IBM M1015 HBA in IT mode, which seems to be working properly when the server is running, but I still get the blackouts. I've also noticed that after the machine crashes, subsequent boots only last a couple of minutes, if it boots at all. If I give it some downtime, it'll stay booted for longer, but it eventually crashes. I thought this might mean a thermal problem, but I can't find any offending high temps, 40C at the most.
JonathanM Posted September 10, 2019: 5 minutes ago, jsmj said: "I thought this might mean a thermal problem" Have you pulled the CPU heatsink off and checked the mounting and redone the paste?
jsmj Posted September 10, 2019: 5 hours ago, jonathanm said: "Have you pulled the CPU heatsink off and checked the mounting and redone the paste?" Checked the paste today, confirmed the pins aren't bent or anything, and redid the paste. No dice. It stayed booted for about 45 minutes, just enough to crush my hopes.
Frank1940 Posted September 10, 2019: Is the case clean of dust and dirt? All cooling fins unclogged? Same for case air intakes and exhausts? If all of these are OK and you are running one of the latest versions of Unraid, set up the Syslog Server (Settings >>> Syslog Server) to mirror the syslog to your flash drive (turn on Unraid 'Help' for instructions). EDIT: since this is a brand new system, I would suggest running memtest for 24 hours (unless the memory is ECC). Next, try booting into 'Safe Mode'. Edited September 10, 2019 by Frank1940
jsmj Posted September 10, 2019: 31 minutes ago, Frank1940 said: "Is the case clean of dust and dirt? All cooling fins unclogged? Same for case air intakes and exhausts? If all of these are OK and you are running one of the latest versions of Unraid, set up the Syslog Server (Settings >>> Syslog Server) to mirror the syslog to your flash drive (turn on Unraid 'Help' for instructions). EDIT: since this is a brand new system, I would suggest running memtest for 24 hours (unless the memory is ECC). Next, try booting into 'Safe Mode'." It's pretty dust-free in there. I wouldn't eat out of it, but I've seen far worse. Ventilation shouldn't be a problem. I'll run memtest for 24 hours and see how it goes. Attached are the syslog from the USB after the latest crash and the full diagnostics zip. Edit to add: There are 13 files on my flash drive with the name FSCK000*.REC that I don't remember having before. Is this of any significance? I can add my flash backup .zip file as well if that'll help. Attachments: syslog, tower-diagnostics-20190910-0012.zip. Edited September 10, 2019 by jsmj (added diagnostics)
Frank1940 Posted September 10, 2019: 24 minutes ago, jsmj said: "There are 13 files on my flash drive with the name FSCK000*.REC that I don't remember having before. Is this of any significance? I can add my flash backup .zip file as well if that'll help" Rather than write it out, I grabbed a quote from a Google search: "These files are generated by the fsck linux utility, which is the equivalent of Windows/DOS CHKDSK command. They are the recovery data that it finds, like file data with no reference in FAT table, etc, so it converts it in a recovery file for you to analyze and recover... If you don't need to recover lost files (Which is most of the time), then delete them." In other words, one of the crashes occurred while an open file was being written, resulting in the file table not being updated properly.
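To see what fsck actually left behind before deciding whether to delete anything, a small sketch like the following lists the recovery files and their sizes. The function name is hypothetical, and the directory is whatever path the flash drive is mounted at on the machine running the script (it is assumed here that the files sit in the top-level folder of the flash drive, as described above).

```python
from pathlib import Path


def find_fsck_recovery_files(flash_root):
    """Return a sorted list of (name, size_in_bytes) for FSCK*.REC files directly under flash_root."""
    root = Path(flash_root)
    return sorted(
        (p.name, p.stat().st_size)
        for p in root.glob("FSCK*.REC")  # top level only, no recursion
        if p.is_file()
    )
```

The function only reads directory entries and sizes; it does not delete or modify anything, so it is safe to run against the flash drive or against an extracted copy of the flash backup.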
jsmj Posted September 23, 2019: Hey guys, sorry I went dark. I was out of town. I started memtest to run overnight and forgot to stop it, and it ran for 161 hours before I manually shut it down. It accumulated 1 error well after the 24 hour mark. I tried using a different PCIe GPU and it ran for 3 hours or so before I went to bed with high hopes, but it crashed sometime in the night. When I went to reboot it, it wouldn’t even make it to the boot selector screen before locking up. This behavior of needing a “rest” after being booted for longer (hour+) periods is super strange to me. The length of time the machine will stay booted seems to be related to the amount of time it’s been powered down. I went out of town and got 3 whole hours of uptime! But if I boot it directly after a crash, it can’t stay up long enough to fully boot. That points to either a temp issue, which I can’t chase down, or Unraid needing to “forget” something that it did or accumulated that causes the crash. I’d like to make an Unraid trial USB and boot from that, but I wanted to check and make sure that’s safe for my data. I wouldn’t start the array.
JorgeB Posted September 23, 2019: 8 minutes ago, jsmj said: "It accumulated 1 error well after the 24 hour mark." Still not acceptable. Make sure the RAM is not overclocked; overclocked RAM is known to cause issues with some Ryzen systems.
jsmj Posted September 23, 2019: 50 minutes ago, johnnie.black said: "Still not acceptable. Make sure the RAM is not overclocked; overclocked RAM is known to cause issues with some Ryzen systems." As far as I can tell, memory speeds are unchanged by me and running at 2133. The sticks are rated for 3200. Here's the product page for the memory I'm using. I've attached the screens for my BIOS settings. Attachment: bios_screens.zip. Edited September 23, 2019 by jsmj
JorgeB Posted September 23, 2019: 26 minutes ago, jsmj said: "memory speeds are unchanged by me and running at 2133." It appears so, so you likely have a bad DIMM.
jsmj Posted September 23, 2019: 2 minutes ago, johnnie.black said: "It appears so, so you likely have a bad DIMM." I've reproduced the issue when booting with each DIMM installed by itself, but I haven't run memtest with a single DIMM installed. Is that worth doing?
JorgeB Posted September 23, 2019: Yes. As mentioned, the only acceptable number of errors during memtest is 0.