DontWorryScro Posted October 24, 2019 (edited)

Edit: It was an underpowered PSU. I should have taken into account the fact that I had expanded exponentially with HDDs and had a top-tier, power-hungry CPU. Swapped out the 550W unit for a 1000W one and it's been flawless ever since.

I have a fairly new Unraid server (version 6.7.2) that was running flawlessly until about 3 days ago. It all started when I was moving ISOs around via mapped drive locations on my Windows 10 PC. That was the first time the server went straight to shutdown, as if someone had pulled the plug. I believe the reboot happens because the BIOS is set to return to the last state after a power failure.

Since that initial reboot things have gone downhill. Of course, after the hard reset the server goes into a parity check, which in and of itself seems to tax the system into crashing and rebooting again. I have kept an eye on temps, and some of these reboots do seem attributable to heating issues. But it also seems that even when I am specifically watching for heat issues, transferring large amounts of data between shares, or from mapped drives on my Windows PC, will trigger a reboot. I still suspect this is a heat issue, but I am not sure how else to tackle it. I notice the CPU fan is often at 100% when I'm up around 60C. I'm not overclocked, and I'm hardly doing anything I'd consider CPU-intensive. If nothing is immediately apparent in the diagnostics, I fear I'm in for a full tear-down and rebuild with special emphasis on thermal paste application. Do I really need to go the CPU tower cooler route for a server build? Is something else going on here?

I tried to set up a syslog server on my Windows PC to grab info about why my server was crashing, but I can't really figure out how to get it up and running.
My hardware:
- Supermicro X9SCL-F
- Intel Xeon E3-1270 v2
- 32 GiB DDR3 ECC Kingston
- Noctua NH-L9i
- EVGA SuperNOVA G2 550W 80+ Gold PSU
- A combo of shucked 10TB and 8TB WD white-label Reds, with a smattering of pre-used 5TB/6TB HGST drives
- ADATA and Samsung SSDs as unassigned devices (and appdata)
- HP EX920 1TB M.2 cache drive
- GTX 1050 GPU
- Mellanox MNOA19-XTR ConnectX-2 10-gigabit card
- LSI SAS9211 card
- Some PCIe NVMe card

My dockers:
- binhex-delugevpn
- binhex-jackett
- binhex-krusader
- binhex-plexpass
- binhex-radarr
- binhex-sonarr
- duckdns
- Firefox
- letsencrypt
- mariadb
- nextcloud
- openvpn-as

Plugins:
- CA Auto Turbo Write Mode
- CA Auto Update Applications
- CA Backup/Restore Appdata
- CA Cleanup Appdata
- Community Applications
- ControlR
- Disk Location
- Dynamix Active Streams
- Dynamix Auto Fan Control
- Dynamix Cache Directories
- Dynamix Date Time
- Dynamix S3 Sleep
- Dynamix SCSI Devices
- Dynamix SSD Trim
- Dynamix System Buttons
- Dynamix System Information
- Dynamix System Statistics
- Dynamix System Temperature
- Fix Common Problems
- IPMI support
- Nerd Tools
- Preclear Disks
- Speedtest Command Line Tool
- Statistics
- Tips and Tweaks
- Unassigned Devices
- unBALANCE
- Unraid Nvidia
- User Scripts

DontWorryScro-diagnostics-20191024-0500.zip

Edited December 10, 2019 by DontWorryScro (solved)
DontWorryScro Posted October 24, 2019 (Author)

More than happy to provide additional info if there's not enough to go on.
testdasi Posted October 24, 2019

Could it possibly be a problem with the PSU? Your config is pretty damn close to the 550W of the PSU, so if it has not aged gracefully, it could very well be under-powering under heavy load. That is particularly relevant with HDDs, which can cause power surges when spinning up/down, etc.
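For a rough sense of how close this build might sit to the PSU's 550W rating, a back-of-envelope power budget can be sketched. All wattages below are assumptions (typical published figures, not measurements), and the drive count is a guess based on the hardware list:

```python
# Rough power-budget sketch. Every wattage here is an assumption taken from
# typical published figures, not a measurement of this particular server.
component_watts = {
    "Xeon E3-1270 v2 (TDP)": 69,
    "Motherboard + RAM": 40,
    "GTX 1050": 75,
    "SSDs / NVMe / HBA / NIC": 40,
}

hdd_count = 8          # assumed number of spinning drives in the array
hdd_running_watts = 8  # typical 3.5" read/write draw (assumption)
hdd_spinup_watts = 25  # typical 3.5" spin-up surge per drive (assumption)

base = sum(component_watts.values())
steady = base + hdd_count * hdd_running_watts          # normal operation
worst_case = base + hdd_count * hdd_spinup_watts       # all drives spin up at once

print(f"steady-state estimate: {steady} W")
print(f"spin-up worst case:    {worst_case} W")
```

Even with conservative guesses, the spin-up worst case lands uncomfortably close to a 550W rating once PSU derating and aging are factored in, which is consistent with the eventual fix described in the opening edit.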
DontWorryScro Posted October 24, 2019 (Author)

I had considered this as well. I bought the PSU brand new from EVGA when I built this server about 6 weeks ago, but that should not exclude it as a potential cause. Unfortunately I'd have to drop some coin on a new, more powerful one to test it. But as a last resort I'll go that way. Or try my hand at working with EVGA on some sort of upgrade discount.
testdasi Posted October 24, 2019

3 hours ago, DontWorryScro said: I had considered this as well. I bought the PSU brand new from EVGA when I built this server about 6 weeks ago…

Hmm, if it's only 6 weeks old then it's not likely to be the problem. While you are close to 550W, aging shouldn't be a factor on a 6-week-old PSU.

I don't think CPU overheating is the issue either. 60 degrees is nothing for a Xeon. Your CPU is also rather low-power and should be within the limits of the Noctua L9i. I have done some light overclocking with the L9i, so while it's not D15-level, it should not be overheating your Xeon.

Perhaps you can start by setting up the Syslog Server (Settings -> Syslog Server) to save the log locally and/or mirror it to flash, so you can see what happens at the point of reboot. Your current diagnostics look to be from after the reboot, which means they don't include the syslog from before it (which would be more useful).
DontWorryScro Posted October 24, 2019 (Author)

Just crashed again, right around here:

Oct 24 14:44:57 Snuts rc.docker: openvpn-as: started succesfully!
Oct 24 14:46:32 Snuts autofan: Highest disk temp is 44C, adjusting fan speed from: 56 (21% @ 484rpm) to: 233 (91% @ 1695rpm)
Oct 24 14:47:01 Snuts ool www[28109]: /usr/local/emhttp/plugins/dynamix/scripts/rsyslog_config
Oct 24 14:47:03 Snuts rsyslogd: [origin software="rsyslogd" swVersion="8.1903.0" x-pid="30855" x-info="https://www.rsyslog.com"] start
Oct 24 14:49:50 Snuts ntpd[1780]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
Oct 24 14:50:38 Snuts autofan: Highest disk temp is 45C, adjusting fan speed from: 233 (91% @ 1737rpm) to: FULL (100% @ 1890rpm)
Oct 24 14:54:00 Snuts root: Fix Common Problems Version 2019.10.13a
Oct 24 14:54:07 Snuts root: Fix Common Problems Version 2019.10.13a
Oct 24 15:09:39 Snuts kernel: microcode: microcode updated early to revision 0x21, date = 2019-02-13
Oct 24 15:09:39 Snuts kernel: Linux version 4.19.56-Unraid (root@matrix) (gcc version 8.3.0 (GCC)) #1 SMP Thu Jun 27 13:13:13 BST 2019
Oct 24 15:09:39 Snuts kernel: Command line: BOOT_IMAGE=/bzimage initrd=/bzroot
Oct 24 15:09:39 Snuts kernel: x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
Oct 24 15:09:39 Snuts kernel: x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
Oct 24 15:09:39 Snuts kernel: x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
Oct 24 15:09:39 Snuts kernel: x86/fpu: xstate_offset[2]: 576, xstate_sizes[2]: 256
Oct 24 15:09:39 Snuts kernel: x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'standard' format.
Oct 24 15:09:39 Snuts kernel: BIOS-provided physical RAM map:
Oct 24 15:09:39 Snuts kernel: BIOS-e820: [mem 0x0000000000000000-0x00000000000987ff] usable
Oct 24 15:09:39 Snuts kernel: BIOS-e820: [mem 0x0000000000098800-0x000000000009ffff] reserved
Oct 24 15:09:39 Snuts kernel: BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
Oct 24 15:09:39 Snuts kernel: BIOS-e820: [mem 0x0000000000100000-0x00000000ae9ebfff] usable
DontWorryScro Posted October 24, 2019 (Author, edited)

The repeating theme I see, though it may just be coincidence, is this line shortly before a reboot:

Oct 24 14:44:57 Snuts rc.docker: openvpn-as: started succesfully!

Edit: I've restored appdata from a backup taken before all these shenanigans started, to see if the OpenVPN docker somehow got corrupted and was causing my entire server to reboot. (But is that even possible?)

Edit 2: Nah, just rebooted again.

Edited October 24, 2019 by DontWorryScro (more info)
DontWorryScro Posted October 25, 2019 (Author)

It looks like this was actually a USB issue. Whatever made the server crash initially seems to have corrupted the USB stick. I pulled the stick, repaired it in Windows, reformatted it with the Unraid installer, restored the data from a previous flash drive backup, and I've been up and running crash-free for the past 10 hours. At least for now, I'll call this solved.
testdasi Posted October 25, 2019

3 hours ago, DontWorryScro said: It looks like this was actually a USB issue. …

Your USB stick could be shorting the port. Corrupted data on the USB stick typically doesn't cause a reboot (more likely it just won't boot at all), because the OS is loaded into RAM. I would recommend using a different port.
DontWorryScro Posted October 25, 2019 (Author)

If I start getting random reboots again I'll try a different port. It's currently mounted on the internal USB port on the motherboard itself.
DontWorryScro Posted October 26, 2019 (Author)

After 18 hours it indeed rebooted again, and then went into such a severe reboot loop that it could not even turn on; it just kept clicking as it tried to start. In the process it nuked the USB stick and seems to have reset to some default state, using Tower as the server name (different from what I had set). This whole process is really testing my patience.
DontWorryScro Posted October 26, 2019 (Author)

OK, I'll keep replying to my own post in the hopes that something resonates with someone. But also, thank you for your insight and help with troubleshooting, testdasi. I have swapped USB ports and will see where we wind up.

I will say I've noticed that, without fail, the reboots happen after the last docker gets initialized. I initially thought OpenVPN-as might be the culprit, because the reboot would come right after the "openvpn-as: started succesfully!" message in the syslog. However, I came to find that it wasn't OpenVPN-as triggering the reboot; OpenVPN-as was merely the last docker alphabetically in my docker list. To test this I deleted OpenVPN, and sure enough, the next time around nextcloud was the last docker to load before the system rebooted. So something about initializing the dockers seems to trigger a reboot? Which, to my ignorant mind, sounds like software, not hardware.

Here's an example of a crash so bad that on restart the server reverted to some sort of default state and renamed itself back to Tower. Sure would be nice to have an Unraid rep chime in here with some help.

Oct 25 15:54:47 Snuts rc.docker: mariadb: started succesfully!
Oct 25 15:54:47 Snuts kernel: br-3918c056127b: port 2(veth2b247b1) entered blocking state
Oct 25 15:54:47 Snuts kernel: br-3918c056127b: port 2(veth2b247b1) entered disabled state
Oct 25 15:54:47 Snuts kernel: device veth2b247b1 entered promiscuous mode
Oct 25 15:54:47 Snuts kernel: IPv6: ADDRCONF(NETDEV_UP): veth2b247b1: link is not ready
Oct 25 15:54:47 Snuts kernel: br-3918c056127b: port 2(veth2b247b1) entered blocking state
Oct 25 15:54:47 Snuts kernel: br-3918c056127b: port 2(veth2b247b1) entered forwarding state
Oct 25 15:54:47 Snuts kernel: br-3918c056127b: port 2(veth2b247b1) entered disabled state
Oct 25 15:54:48 Snuts kernel: eth0: renamed from veth6d65f8b
Oct 25 15:54:48 Snuts kernel: IPv6: ADDRCONF(NETDEV_CHANGE): veth2b247b1: link becomes ready
Oct 25 15:54:48 Snuts kernel: br-3918c056127b: port 2(veth2b247b1) entered blocking state
Oct 25 15:54:48 Snuts kernel: br-3918c056127b: port 2(veth2b247b1) entered forwarding state
Oct 25 15:54:48 Snuts rc.docker: nextcloud: started succesfully!
Oct 25 15:54:48 Snuts kernel: docker0: port 6(vethee229b0) entered blocking state
Oct 25 15:54:48 Snuts kernel: docker0: port 6(vethee229b0) entered disabled state
Oct 25 15:54:48 Snuts kernel: device vethee229b0 entered promiscuous mode
Oct 25 15:54:48 Snuts kernel: IPv6: ADDRCONF(NETDEV_UP): vethee229b0: link is not ready
Oct 25 15:54:48 Snuts kernel: docker0: port 6(vethee229b0) entered blocking state
Oct 25 15:54:48 Snuts kernel: docker0: port 6(vethee229b0) entered forwarding state
Oct 25 15:54:48 Snuts kernel: eth0: renamed from vethcd7dd32
Oct 25 15:54:48 Snuts kernel: IPv6: ADDRCONF(NETDEV_CHANGE): vethee229b0: link becomes ready
Oct 25 15:54:48 Snuts rc.docker: openvpn-as: started succesfully!
Oct 25 15:55:02 Snuts sSMTP[19070]: Creating SSL connection to host
Oct 25 15:55:02 Snuts sSMTP[19070]: SSL connection using TLS_AES_256_GCM_SHA384
Oct 25 15:55:05 Snuts sSMTP[19070]: Sent mail for [redacted]@gmail.com (221 2.0.0 closing connection j11sm93555pfa.127 - gsmtp) uid=0 username=root outbytes=668
Oct 25 17:10:43 Tower kernel: mdcmd (44): set md_write_method 1
Oct 25 17:10:43 Tower kernel:
Oct 25 17:10:43 Tower cache_dirs: Arguments=-l off
Oct 25 17:10:43 Tower cache_dirs: Max Scan Secs=10, Min Scan Secs=1
Oct 25 17:10:43 Tower cache_dirs: Scan Type=adaptive
Oct 25 17:10:43 Tower cache_dirs: Min Scan Depth=4
Oct 25 17:10:43 Tower cache_dirs: Max Scan Depth=none
Oct 25 17:10:43 Tower cache_dirs: Use Command='find -noleaf'
Oct 25 17:10:43 Tower cache_dirs: ---------- Caching Directories ---------------
Oct 25 17:10:43 Tower cache_dirs: .Trash-99
Oct 25 17:10:43 Tower cache_dirs: Archives
Oct 25 17:10:43 Tower cache_dirs: Media
Oct 25 17:10:43 Tower cache_dirs: Test
Oct 25 17:10:43 Tower cache_dirs: appdata
Oct 25 17:10:43 Tower cache_dirs: data
Oct 25 17:10:43 Tower cache_dirs: domains
Oct 25 17:10:43 Tower cache_dirs: downloads
Oct 25 17:10:43 Tower cache_dirs: isos
Oct 25 17:10:43 Tower cache_dirs: nextcloud
Oct 25 17:10:43 Tower cache_dirs: system
Oct 25 17:10:43 Tower cache_dirs: windowsdisk
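The "last docker before the reboot" pattern described above is easy to check mechanically across a whole syslog rather than by eyeballing it. A minimal sketch, using the log format from the excerpts above (the sample data is abbreviated and partly hypothetical; note that Unraid's rc.docker genuinely logs the misspelled "succesfully"):

```python
import re

# Find which docker container was the last one to start before each reboot.
# A new boot is detected via the "kernel: Linux version" banner, as seen in
# the syslog excerpts above. The sample log below is abbreviated/hypothetical.
sample_syslog = """\
Oct 25 15:54:47 Snuts rc.docker: mariadb: started succesfully!
Oct 25 15:54:48 Snuts rc.docker: nextcloud: started succesfully!
Oct 25 15:54:48 Snuts rc.docker: openvpn-as: started succesfully!
Oct 25 17:10:43 Snuts kernel: Linux version 4.19.56-Unraid
Oct 26 09:01:02 Snuts rc.docker: binhex-plexpass: started succesfully!
Oct 26 09:01:03 Snuts rc.docker: nextcloud: started succesfully!
Oct 26 10:12:00 Snuts kernel: Linux version 4.19.56-Unraid
"""

last_before_boot = []
last_docker = None
for line in sample_syslog.splitlines():
    m = re.search(r"rc\.docker: (\S+): started succesfully!", line)  # sic
    if m:
        last_docker = m.group(1)
    elif "kernel: Linux version" in line and last_docker:
        # A boot banner means the previous run ended here.
        last_before_boot.append(last_docker)
        last_docker = None

print(last_before_boot)
```

Run against the real mirrored syslog, a result where the "culprit" changes whenever the alphabetically-last container changes would support the conclusion that no single docker is at fault.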
DontWorryScro Posted October 26, 2019 (Author)

I changed the USB mount point to a different port. The problem persists. Still not sure if this is hardware or software.
Spies Posted October 26, 2019 (edited)

Had this initially with the server in my sig; switched to a different (identical) motherboard and the problem vanished.

Edited October 26, 2019 by Spies
DontWorryScro Posted October 26, 2019 (Author)

2 minutes ago, Spies said: Had this initially with the server in my sig…

I am fearing, but also hoping, that this is the case. I just don't look forward to putting money into yet another (third) motherboard in the hope that it fixes things, without knowing exactly why I'm having the issue. Is it a bad USB stick? A bad CPU? I'm running memtest86+ again right now and will let it roll for a while. It concerns me that I can run memtests for hours on end without a reboot, but something in Unraid is sinking me. Certain things point to software; certain things point to hardware. I just don't know enough about servers and computer software/hardware to deduce which.

I may scrap it all and start from scratch, though. And kick myself, because at the end of the day, with the money I've put into fixing or replacing shoddy eBay items and into troubleshooting, I could have just gone for a more modern architecture with more cores.
Spies Posted October 26, 2019 (edited)

9 minutes ago, DontWorryScro said: I am fearing but also hoping this is the case. …

I too was able to run memtest for hours with no errors; something just caused the kernel to panic. I even went as far as to record the screen so I could see the console output before it was lost.

Edit: in fact there was no kernel panic; looking at the video I had recorded, it was just like someone flipped a switch. The memory test errors I referred to were a bug in the way the SMT mode in memtest86+ worked with my hardware. Here was my thread:

Edited October 26, 2019 by Spies
DontWorryScro Posted October 26, 2019 (Author)

Sounds similar. Maybe I'll bite the bullet and just buy a new motherboard. Thanks.
DontWorryScro Posted October 29, 2019 (Author)

Is it really possible that my motherboard kept rebooting because its date and time were insanely incorrect? It was not syncing with a time server, and it was set to sometime in the early 1900s, lol. Perhaps coincidentally, since I corrected the date and time in IPMI I've not had a crash.
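A clock that far off can break things like TLS validation and scheduled tasks even when it isn't the cause of reboots. As a hedged sketch (not an Unraid feature), a user script could sanity-check the system clock at startup against a known-good lower bound; BUILD_DATE below is a hypothetical value standing in for the date the server was built:

```python
from datetime import datetime, timezone

# Hypothetical "known good" lower bound: the clock can never legitimately
# read earlier than the date the server was assembled.
BUILD_DATE = datetime(2019, 9, 1, tzinfo=timezone.utc)

def clock_is_plausible(now: datetime) -> bool:
    """Return False if the clock predates the build date (e.g. stuck in the 1900s)."""
    return now >= BUILD_DATE

# An RTC reset to the early 1900s, as described above, fails the check:
print(clock_is_plausible(datetime(1904, 1, 1, tzinfo=timezone.utc)))
# The current time passes it:
print(clock_is_plausible(datetime.now(timezone.utc)))
```

If the check fails, the script could log a warning or force an NTP resync before services start.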
Kevin Posted November 7, 2019

My server was having random reboots on 6.7.2 (TBS Open Source) over the course of a few days before I noticed. At the time I resolved the problem by reverting to 6.6.7 (TBS Open Source), and had no unwanted reboots for months. However, I've recently revisited 6.7.2 and the issue returned. I'm now running 6.8.0-rc5 (TBS Open Source); we'll see how that goes.