[Solved] Random rebooting



Edit: It was an underpowered PSU.  I should have taken into account the fact that I had kept adding HDDs and had a top-tier, power-hungry CPU.  Swapped out the 550W unit for a 1000W one and it's been flawless ever since.
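For anyone sizing a replacement, a back-of-envelope peak-draw estimate is easy to script.  Every wattage below is an assumed example (TDPs and spin-up figures vary by model), so plug in the numbers from your own parts' spec sheets:

```shell
# Rough PSU budget -- every figure here is an assumed example, not a spec.
cpu=95        # Xeon E3-1270 v2 is a 69 W TDP part; allow some transient headroom
gpu=75        # GTX 1050 is slot-powered, ~75 W worst case
base=60       # board, RAM, fans, HBA, NIC, SSDs as a rough lump sum
drives=10     # number of spinning disks in the array
spinup=25     # a 3.5" HDD can briefly pull ~25 W while spinning up

peak=$(( cpu + gpu + base + drives * spinup ))
echo "estimated peak draw: ${peak} W"
echo "headroom on a 550 W unit:  $(( 550 - peak )) W"
echo "headroom on a 1000 W unit: $(( 1000 - peak )) W"
```

With these example numbers, ten drives spinning up together leave a 550 W unit with very little margin, which lines up with the eventual fix.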

 

I have a fairly new Unraid server (version 6.7.2) that was running flawlessly until about 3 days ago.  It all started when I was moving ISOs around via mapped drive locations on my Windows 10 PC.  That was the first time the server shut down cold, as if someone had pulled the plug.  I believe it then reboots because the BIOS is set to return to the last power state after a power failure.  Since that initial reboot, things have gone downhill.  After the hard reset the server goes into a parity check, which in and of itself seems to tax the system into crashing and rebooting again.

I have kept an eye on temps, and some of these reboots do seem attributable to heat.  But even when I am specifically watching for heat issues, transferring large amounts of data between shares, or from mapped drives on my Windows PC, will trigger a reboot.  I still suspect heat, but I am not sure how else to tackle this.  I notice the CPU fan is often at 100% when the CPU is around 60C, yet I'm not overclocked and I'm hardly doing anything I'd consider CPU-intensive.  If nothing is immediately apparent in the diagnostics, I fear I'm in for a full tear-down and rebuild with special emphasis on thermal paste application.  Do I really need to go the CPU tower cooler route for a server build?  Or is something else going on here?

I tried to set up a syslog server on my Windows PC to capture information about why the server was crashing, but I can't figure out how to get it up and running.
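Since the reboots track with large transfers, one low-tech way to reproduce the load on demand (without Windows in the loop) is a sustained write from the Unraid console.  This is only a sketch: TARGET defaults to /tmp here so it's harmless to try, but for a real soak you'd point it at an array share and raise the count:

```shell
# Sustained sequential write to provoke a load-correlated crash.
# TARGET and the 64 MiB size are placeholders -- for a real test, use an
# array path such as /mnt/user/<share>/burnin.bin and a much larger count.
TARGET="${TARGET:-/tmp/burnin.bin}"
dd if=/dev/zero of="$TARGET" bs=1M count=64 conv=fsync
rm -f "$TARGET"
```

If the server drops during a run like this while temps stay sane, that points away from cooling and toward power or hardware.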

 

My hardware:

Supermicro X9SCL-F

Intel Xeon E3-1270 v2

32 GiB DDR3 ECC Kingston

Noctua NH-L9i

EVGA SuperNOVA G2 550W 80+ Gold PSU

A combo of shucked 10TB and 8TB WD white-label Reds with a smattering of pre-owned 5TB/6TB HGST drives

ADATA and Samsung SSDs as unassigned devices (and appdata)

HP EX920 1TB M.2 cache drive

GTX 1050 GPU

Mellanox MNPA19-XTR ConnectX-2 10-gigabit card

LSI SAS9211 card

Some PCIe NVMe adapter card

 

My dockers:

binhex-delugevpn

binhex-jackett

binhex-krusader

binhex-plexpass

binhex-radarr

binhex-sonarr

duckdns

Firefox

letsencrypt

mariadb

nextcloud

openvpn-as

 

Plugins:

CA Auto Turbo Write Mode

CA Auto Update Applications

Ca Backup/Restore Appdata

CA Cleanup Appdata

Community Applications

controlR

Disk Location

Dynamix Active Streams

Dynamix Auto Fan Control

Dynamix Cache Directories

Dynamix Date Time

Dynamix S3 Sleep

Dynamix SCSI Devices

Dynamix SSD Trim

Dynamix System Buttons

Dynamix System Information

Dynamix System Statistics

Dynamix System Temperature

Fix Common Problems

IPMI support

Nerd Tools

Preclear Disks

Speedtest Command Line Tool

Statistics

Tips and Tweaks

Unassigned Devices

unBALANCE

Unraid Nvidia

User Scripts

DontWorryScro-diagnostics-20191024-0500.zip

Edited by DontWorryScro
solved
Link to comment

I had considered this as well.  I bought the PSU brand new from EVGA when I built this server about 6 weeks ago, but that shouldn't exclude it as a potential cause.  Unfortunately I'd have to drop some coin on a new, more powerful one to test that theory.  As a last resort I'll go that way, or try my hand at working with EVGA on some sort of upgrade discount.

Link to comment
3 hours ago, DontWorryScro said:

I had considered this as well.  I bought the PSU brand new from EVGA when I built this server about 6 weeks ago, but that shouldn't exclude it as a potential cause.  Unfortunately I'd have to drop some coin on a new, more powerful one to test that theory.  As a last resort I'll go that way, or try my hand at working with EVGA on some sort of upgrade discount.

Hmm, if it's only six weeks old then it's not likely to be the problem.  While you are close to the 550W limit, aging shouldn't be a factor in a six-week-old PSU.

 

I don't think CPU overheating is the issue.  60C is nothing for a Xeon.  Your CPU is also rather low-power and should be well within the limits of the Noctua L9i.  I have done some light overclocking with the L9i, so while it's not D15-level, it should not let your Xeon overheat.

 

Perhaps you can start by setting up the Syslog Server (Settings -> Syslog Server) to save the log locally and/or mirror it to flash, so you can see what happens at the point of reboot.  Your current diagnostics look to be from after the reboot, so they don't include the syslog from before it (which would be more useful).
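If you go the remote-syslog route, it's worth confirming the listener actually receives something before trusting it during a crash.  A quick test from the Unraid console works; the IP and port below are placeholders for the Windows machine running the syslog listener:

```shell
# Fire a test message at the remote syslog server over UDP (-d = datagram).
# 192.168.1.50:514 is a placeholder -- substitute your Windows PC's address.
logger -n 192.168.1.50 -P 514 -d "unraid syslog test $(date)"

# If mirror-to-flash is enabled, a local copy should also appear under
# /boot/logs on the flash drive (exact path may vary by Unraid version):
ls /boot/logs/ 2>/dev/null || true
```

If the test line never shows up on the Windows side, check its firewall rules for inbound UDP 514 before the next crash wipes the evidence.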

Link to comment

Just crashed again right around here:

 

Oct 24 14:44:57 Snuts rc.docker: openvpn-as: started succesfully!
Oct 24 14:46:32 Snuts autofan: Highest disk temp is 44C, adjusting fan speed from: 56 (21% @ 484rpm) to: 233 (91% @ 1695rpm)
Oct 24 14:47:01 Snuts ool www[28109]: /usr/local/emhttp/plugins/dynamix/scripts/rsyslog_config
Oct 24 14:47:03 Snuts rsyslogd:  [origin software="rsyslogd" swVersion="8.1903.0" x-pid="30855" x-info="https://www.rsyslog.com"] start
Oct 24 14:49:50 Snuts ntpd[1780]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
Oct 24 14:50:38 Snuts autofan: Highest disk temp is 45C, adjusting fan speed from: 233 (91% @ 1737rpm) to: FULL (100% @ 1890rpm)
Oct 24 14:54:00 Snuts root: Fix Common Problems Version 2019.10.13a
Oct 24 14:54:07 Snuts root: Fix Common Problems Version 2019.10.13a
Oct 24 15:09:39 Snuts kernel: microcode: microcode updated early to revision 0x21, date = 2019-02-13
Oct 24 15:09:39 Snuts kernel: Linux version 4.19.56-Unraid (root@matrix) (gcc version 8.3.0 (GCC)) #1 SMP Thu Jun 27 13:13:13 BST 2019
Oct 24 15:09:39 Snuts kernel: Command line: BOOT_IMAGE=/bzimage initrd=/bzroot
Oct 24 15:09:39 Snuts kernel: x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
Oct 24 15:09:39 Snuts kernel: x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
Oct 24 15:09:39 Snuts kernel: x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
Oct 24 15:09:39 Snuts kernel: x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
Oct 24 15:09:39 Snuts kernel: x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'standard' format.
Oct 24 15:09:39 Snuts kernel: BIOS-provided physical RAM map:
Oct 24 15:09:39 Snuts kernel: BIOS-e820: [mem 0x0000000000000000-0x00000000000987ff] usable
Oct 24 15:09:39 Snuts kernel: BIOS-e820: [mem 0x0000000000098800-0x000000000009ffff] reserved
Oct 24 15:09:39 Snuts kernel: BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
Oct 24 15:09:39 Snuts kernel: BIOS-e820: [mem 0x0000000000100000-0x00000000ae9ebfff] usable

Link to comment

Seems like the repeating theme, though maybe just coincidence, is this line shortly before a reboot:

 

Oct 24 14:44:57 Snuts rc.docker: openvpn-as: started succesfully!

 

Edit:  I've restored appdata from a backup taken before all these shenanigans started, to see if the OpenVPN docker somehow got corrupted and was causing my entire server to reboot.  (But is that even possible?)

 

Edit 2: Nah, it just rebooted again.

Edited by DontWorryScro
more info
Link to comment

It looks like this was actually a USB issue.  Whatever made the server crash initially seems to have corrupted the USB stick.  I pulled the stick, repaired it in Windows, reformatted it with the Unraid installer, restored the data from a previous flash drive backup, and I've been up and running crash-free for the past 10 hours.

 

At least for now, I'll call this solved.

Link to comment
3 hours ago, DontWorryScro said:

It looks like this was actually a USB issue.  Whatever made the server crash initially seems to have corrupted the USB stick.  I pulled the stick, repaired it in Windows, reformatted it with the Unraid installer, restored the data from a previous flash drive backup, and I've been up and running crash-free for the past 10 hours.

 

At least for now, I'll call this solved.

Your USB stick could be shorting the port.

Corrupted data on the USB stick typically doesn't cause a reboot (more likely it just won't boot at all), because the OS is loaded into RAM.

 

I would recommend using a different port.

Link to comment

After 18 hours it indeed rebooted again, and then went into such a severe reboot loop that it could not even turn on; it just kept clicking as it tried to start.  In the process it nuked the USB stick and seems to have reset to some default state, using Tower as the server name (different from what I had set).

 

This whole process is really testing my patience. 

Link to comment

OK, I'll keep replying to my own post in the hopes that something resonates with someone.  But also, thank you for your insight and help with troubleshooting, Testdasi.

 

I have swapped USB ports and will see where we wind up. 

 

I will say I've noticed that, without fail, the reboots happen after the last docker gets initialized.  I initially thought OpenVPN-AS might be the culprit because the reboot would come right after the "openvpn-as: started succesfully!" message in the syslog.  However, I came to find that it wasn't OpenVPN-AS triggering the reboot; openvpn-as was merely the last docker alphabetically in my docker list.  To test this I deleted OpenVPN, and sure enough, the next time around nextcloud was the docker that loaded last before the system rebooted.

 

So something about initializing the dockers seems to be triggering a reboot?  Which to my ignorant mind sounds like software, not hardware.
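One way to make that test systematic: with Docker autostart turned off in the Unraid GUI, generate the start commands and run them one at a time while watching the syslog.  The loop below only prints the commands (so it's safe to paste them selectively); the container names are the ones from the list above:

```shell
# Emit one "docker start" per container, in the same alphabetical order
# Unraid uses, with a pause between each so the syslog can be checked.
containers="binhex-delugevpn binhex-jackett binhex-krusader binhex-plexpass \
binhex-radarr binhex-sonarr duckdns Firefox letsencrypt mariadb nextcloud openvpn-as"

for c in $containers; do
    printf 'docker start %s && sleep 60\n' "$c"
done
```

If the crash always follows whichever container happens to be last, regardless of which one it is, that points at the cumulative load of starting them all rather than any single container.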

 

Here's an example of a crash so bad that on restart the server had reverted to some sort of default state and renamed itself back to Tower.

 

Sure would be nice to have an Unraid rep chime in here with some help.

 

Oct 25 15:54:47 Snuts rc.docker: mariadb: started succesfully!
Oct 25 15:54:47 Snuts kernel: br-3918c056127b: port 2(veth2b247b1) entered blocking state
Oct 25 15:54:47 Snuts kernel: br-3918c056127b: port 2(veth2b247b1) entered disabled state
Oct 25 15:54:47 Snuts kernel: device veth2b247b1 entered promiscuous mode
Oct 25 15:54:47 Snuts kernel: IPv6: ADDRCONF(NETDEV_UP): veth2b247b1: link is not ready
Oct 25 15:54:47 Snuts kernel: br-3918c056127b: port 2(veth2b247b1) entered blocking state
Oct 25 15:54:47 Snuts kernel: br-3918c056127b: port 2(veth2b247b1) entered forwarding state
Oct 25 15:54:47 Snuts kernel: br-3918c056127b: port 2(veth2b247b1) entered disabled state
Oct 25 15:54:48 Snuts kernel: eth0: renamed from veth6d65f8b
Oct 25 15:54:48 Snuts kernel: IPv6: ADDRCONF(NETDEV_CHANGE): veth2b247b1: link becomes ready
Oct 25 15:54:48 Snuts kernel: br-3918c056127b: port 2(veth2b247b1) entered blocking state
Oct 25 15:54:48 Snuts kernel: br-3918c056127b: port 2(veth2b247b1) entered forwarding state
Oct 25 15:54:48 Snuts rc.docker: nextcloud: started succesfully!
Oct 25 15:54:48 Snuts kernel: docker0: port 6(vethee229b0) entered blocking state
Oct 25 15:54:48 Snuts kernel: docker0: port 6(vethee229b0) entered disabled state
Oct 25 15:54:48 Snuts kernel: device vethee229b0 entered promiscuous mode
Oct 25 15:54:48 Snuts kernel: IPv6: ADDRCONF(NETDEV_UP): vethee229b0: link is not ready
Oct 25 15:54:48 Snuts kernel: docker0: port 6(vethee229b0) entered blocking state
Oct 25 15:54:48 Snuts kernel: docker0: port 6(vethee229b0) entered forwarding state
Oct 25 15:54:48 Snuts kernel: eth0: renamed from vethcd7dd32
Oct 25 15:54:48 Snuts kernel: IPv6: ADDRCONF(NETDEV_CHANGE): vethee229b0: link becomes ready
Oct 25 15:54:48 Snuts rc.docker: openvpn-as: started succesfully!
Oct 25 15:55:02 Snuts sSMTP[19070]: Creating SSL connection to host
Oct 25 15:55:02 Snuts sSMTP[19070]: SSL connection using TLS_AES_256_GCM_SHA384
Oct 25 15:55:05 Snuts sSMTP[19070]: Sent mail for [redacted]@gmail.com (221 2.0.0 closing connection j11sm93555pfa.127 - gsmtp) uid=0 username=root outbytes=668
Oct 25 17:10:43 Tower kernel: mdcmd (44): set md_write_method 1
Oct 25 17:10:43 Tower kernel:
Oct 25 17:10:43 Tower cache_dirs: Arguments=-l off
Oct 25 17:10:43 Tower cache_dirs: Max Scan Secs=10, Min Scan Secs=1
Oct 25 17:10:43 Tower cache_dirs: Scan Type=adaptive
Oct 25 17:10:43 Tower cache_dirs: Min Scan Depth=4
Oct 25 17:10:43 Tower cache_dirs: Max Scan Depth=none
Oct 25 17:10:43 Tower cache_dirs: Use Command='find -noleaf'
Oct 25 17:10:43 Tower cache_dirs: ---------- Caching Directories ---------------
Oct 25 17:10:43 Tower cache_dirs: .Trash-99
Oct 25 17:10:43 Tower cache_dirs: Archives
Oct 25 17:10:43 Tower cache_dirs: Media
Oct 25 17:10:43 Tower cache_dirs: Test
Oct 25 17:10:43 Tower cache_dirs: appdata
Oct 25 17:10:43 Tower cache_dirs: data
Oct 25 17:10:43 Tower cache_dirs: domains
Oct 25 17:10:43 Tower cache_dirs: downloads
Oct 25 17:10:43 Tower cache_dirs: isos
Oct 25 17:10:43 Tower cache_dirs: nextcloud
Oct 25 17:10:43 Tower cache_dirs: system
Oct 25 17:10:43 Tower cache_dirs: windowsdisk

Link to comment
2 minutes ago, Spies said:

Had this initially with the server in my sig, switched to a different (identical) motherboard and problem vanished.

I am fearing, but also hoping, that this is the case.  I just don't look forward to putting money into yet another (third) motherboard in the hopes that it fixes things, without knowing exactly why I'm having the issue.  Is it a bad USB stick?  A bad CPU?

 

I'm running memtest86+ again right now and will let it roll for a while.  It's just concerning that I can run memtests for hours on end without a reboot, but something in Unraid is sinking me.  Certain things point to software; certain things point to hardware.  I just don't know enough about servers and computer software/hardware to deduce which.

 

I may scrap it all and start from scratch, and kick myself, because at the end of the day, with the money I've put into fixing or replacing shoddy eBay items and troubleshooting, I could have just gone for a more modern architecture with more cores.

Link to comment
9 minutes ago, DontWorryScro said:

I am fearing, but also hoping, that this is the case.  I just don't look forward to putting money into yet another (third) motherboard in the hopes that it fixes things, without knowing exactly why I'm having the issue.  Is it a bad USB stick?  A bad CPU?

 

I'm running memtest86+ again right now and will let it roll for a while.  It's just concerning that I can run memtests for hours on end without a reboot, but something in Unraid is sinking me.  Certain things point to software; certain things point to hardware.  I just don't know enough about servers and computer software/hardware to deduce which.

 

I may scrap it all and start from scratch, and kick myself, because at the end of the day, with the money I've put into fixing or replacing shoddy eBay items and troubleshooting, I could have just gone for a more modern architecture with more cores.

I too was able to run memtest for hours with no errors; something just caused the kernel to panic.  I even went as far as recording the screen so I could see the console output before it was lost.

 

Edit: in fact there was no kernel panic; looking at the video I recorded, it was just like someone flipped a switch.  The memory test errors I referred to were a bug in the way memtest86+'s SMT mode worked with my hardware.

 

Here was my thread: 

 

Edited by Spies
Link to comment
  • 2 weeks later...

My server was having random reboots on 6.7.2 (TBS Open Source) for a few days before I noticed.  At the time I resolved the problem by reverting to 6.6.7 (TBS Open Source): no unwanted reboots for months.

However, I've recently revisited 6.7.2 and the issue has returned.  I'm now running 6.8.0-rc5 (TBS Open Source); we'll see how that goes.

Link to comment
